Skip to main content
Cornell University
We gratefully acknowledge support from the Simons Foundation, member institutions, and all contributors. Donate
arxiv logo > cs

Help | Advanced Search

arXiv logo
Cornell University Logo

quick links

  • Login
  • Help Pages
  • About

Computer Science

  • New submissions
  • Cross-lists
  • Replacements

See recent articles

Showing new listings for Wednesday, 2 July 2025

Total of 812 entries
Showing up to 2000 entries per page: fewer | more | all

New submissions (showing 452 of 452 entries)

[1] arXiv:2507.00002 [pdf, html, other]
Title: Hypertokens: Holographic Associative Memory in Tokenized LLMs
Christopher James Augeri
Comments: preprint as accepted to this https URL - Quantum AI and NLP Conference 2025
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Large Language Models (LLMs) exhibit remarkable capabilities but suffer from apparent precision loss, reframed here as information spreading. This reframing shifts the problem from computational precision to an information-theoretic communication issue. We address the K:V and V:K memory problem in LLMs by introducing HDRAM (Holographically Defined Random Access Memory), a symbolic memory framework treating transformer latent space as a spread-spectrum channel. Built upon hypertokens, structured symbolic codes integrating classical error-correcting codes (ECC), holographic computing, and quantum-inspired search, HDRAM recovers distributed information through principled despreading. These phase-coherent memory addresses enable efficient key-value operations and Grover-style search in latent space. By combining ECC grammar with compressed sensing and Krylov subspace alignment, HDRAM significantly improves associative retrieval without architectural changes, demonstrating how Classical-Holographic-Quantum-inspired (CHQ) principles can fortify transformer architectures.

[2] arXiv:2507.00003 [pdf, other]
Title: Deciding When Not to Decide: Indeterminacy-Aware Intrusion Detection with NeutroSENSE
Eyhab Al-Masri
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)

This paper presents NeutroSENSE, a neutrosophic-enhanced ensemble framework for interpretable intrusion detection in IoT environments. By integrating Random Forest, XGBoost, and Logistic Regression with neutrosophic logic, the system decomposes prediction confidence into truth (T), falsity (F), and indeterminacy (I) components, enabling uncertainty quantification and abstention. Predictions with high indeterminacy are flagged for review using both global and adaptive, class-specific thresholds. Evaluated on the IoT-CAD dataset, NeutroSENSE achieved 97% accuracy, while demonstrating that misclassified samples exhibit significantly higher indeterminacy (I = 0.62) than correct ones (I = 0.24). The use of indeterminacy as a proxy for uncertainty enables informed abstention and targeted review-particularly valuable in edge deployments. Figures and tables validate the correlation between I-scores and error likelihood, supporting more trustworthy, human-in-the-loop AI decisions. This work shows that neutrosophic logic enhances both accuracy and explainability, providing a practical foundation for trust-aware AI in edge and fog-based IoT security systems.

[3] arXiv:2507.00004 [pdf, html, other]
Title: A Theory of Inference Compute Scaling: Reasoning through Directed Stochastic Skill Search
Austin R. Ellis-Mohr, Anuj K. Nayak, Lav R. Varshney
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Performance (cs.PF)

Large language models (LLMs) demand considerable computational, energy, and financial resources during both training and deployment. While scaling laws for training have guided much of the field's recent progress, inference costs now represent a significant and growing component of the overall resource burden, particularly for reasoning-focused models. Existing characterizations of compute-optimality that consider model size, dataset size, and inference tokens in isolation or in fixed combinations risk overlooking more efficient operating points. We introduce directed stochastic skill search (DS3), a general framework that represents inference as stochastic traversal over a learned skill graph. From a simplified yet expressive instantiation, we derive closed-form expressions for task success and compute cost across a wide range of inference strategies -- including chain-of-thought (CoT) and tree-of-thought (ToT) -- enabling comparative analysis as a function of task difficulty and model capability. To that end, we extend a prior first-principles tripartite graph framework of LLM training to incorporate inference, and separately bridge DS3 with empirical methods that characterize LLM scaling behavior. We theoretically recover empirically observed patterns, including: linear accuracy scaling with logarithmic compute; variation in preferred inference strategies as a function of task difficulty and model capability; emergent behavior elicited by reasoning even when performance plateaus under parameter scaling; and both best-of-N (BoN) and majority voting behavior captured within a unified analytical framework. By explicitly characterizing training-inference interdependencies, our framework deepens theoretical understanding and supports principled algorithmic design and resource allocation.

[4] arXiv:2507.00005 [pdf, other]
Title: SwarmFusion: Revolutionizing Disaster Response with Swarm Intelligence and Deep Learning
Vasavi Lankipalle
Comments: 6
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)

Disaster response requires rapid, adaptive decision-making in chaotic environments. SwarmFusion, a novel hybrid framework, integrates particle swarm optimization with convolutional neural networks to optimize real-time resource allocation and path planning. By processing live satellite, drone, and sensor data, SwarmFusion enhances situational awareness and operational efficiency in flood and wildfire scenarios. Simulations using the DisasterSim2025 dataset demonstrate up to 40 percentage faster response times and 90 percentage survivor coverage compared to baseline methods. This scalable, data-driven approach offers a transformative solution for time-critical disaster management, with potential applications across diverse crisis scenarios.

[5] arXiv:2507.00006 [pdf, html, other]
Title: MVGBench: Comprehensive Benchmark for Multi-view Generation Models
Xianghui Xie, Chuhang Zou, Meher Gitika Karumuri, Jan Eric Lenssen, Gerard Pons-Moll
Comments: 17 pages, 11 figures, 9 tables, project page: this https URL
Subjects: Graphics (cs.GR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

We propose MVGBench, a comprehensive benchmark for multi-view image generation models (MVGs) that evaluates 3D consistency in geometry and texture, image quality, and semantics (using vision language models). Recently, MVGs have been the main driving force in 3D object creation. However, existing metrics compare generated images against ground truth target views, which is not suitable for generative tasks where multiple solutions exist while differing from ground truth. Furthermore, different MVGs are trained on different view angles, synthetic data and specific lightings -- robustness to these factors and generalization to real data are rarely evaluated thoroughly. Without a rigorous evaluation protocol, it is also unclear what design choices contribute to the progress of MVGs. MVGBench evaluates three different aspects: best setup performance, generalization to real data and robustness. Instead of comparing against ground truth, we introduce a novel 3D self-consistency metric which compares 3D reconstructions from disjoint generated multi-views. We systematically compare 12 existing MVGs on 4 different curated real and synthetic datasets. With our analysis, we identify important limitations of existing methods specially in terms of robustness and generalization, and we find the most critical design choices. Using the discovered best practices, we propose ViFiGen, a method that outperforms all evaluated MVGs on 3D consistency. Our code, model, and benchmark suite will be publicly released.

[6] arXiv:2507.00007 [pdf, other]
Title: Integrating Universal Generative AI Platforms in Educational Labs to Foster Critical Thinking and Digital Literacy
Vasiliy Znamenskiy, Rafael Niyazov, Joel Hernandez
Comments: this http URL
Journal-ref: International Journal on Cybernetics & Informatics (IJCI) Vol.14, No.3, June 2025
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

This paper presents a new educational framework for integrating generative artificial intelligence (GenAI) platforms such as ChatGPT, Claude, and Gemini into laboratory activities aimed at developing critical thinking and digital literacy among undergraduate students. Recognizing the limitations and risks of uncritical reliance on large language models (LLMs), the proposed pedagogical model reframes GenAI as a research subject and cognitive tool. Students formulate discipline-specific prompts and evaluate GenAI-generated responses in text, image, and video modalities. A pilot implementation in a general astronomy course for non-science majors demonstrated high levels of engagement and critical reflection, with many students continuing the activity after class and presenting results at a research symposium. The results highlight the importance of structured AI interactions in education and suggest that GenAI can improve learning outcomes when combined with reflective assessment methods. The study proposes a replicable model for interdisciplinary AI-integrated lab work, adaptable to scientific disciplines. See the guide to learning activities based on Generative-Ai platforms: this https URL

[7] arXiv:2507.00008 [pdf, html, other]
Title: DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning
Hang Wu, Hongkai Chen, Yujun Cai, Chang Liu, Qingwen Ye, Ming-Hsuan Yang, Yiwei Wang
Comments: 8 pages, 6 figures
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Grounding natural language queries in graphical user interfaces (GUIs) poses unique challenges due to the diversity of visual elements, spatial clutter, and the ambiguity of language. In this paper, we introduce DiMo-GUI, a training-free framework for GUI grounding that leverages two core strategies: dynamic visual grounding and modality-aware optimization. Instead of treating the GUI as a monolithic image, our method splits the input into textual elements and iconic elements, allowing the model to reason over each modality independently using general-purpose vision-language models. When predictions are ambiguous or incorrect, DiMo-GUI dynamically focuses attention by generating candidate focal regions centered on the model's initial predictions and incrementally zooms into subregions to refine the grounding result. This hierarchical refinement process helps disambiguate visually crowded layouts without the need for additional training or annotations. We evaluate our approach on standard GUI grounding benchmarks and demonstrate consistent improvements over baseline inference pipelines, highlighting the effectiveness of combining modality separation with region-focused reasoning.

[8] arXiv:2507.00011 [pdf, html, other]
Title: Novel RL approach for efficient Elevator Group Control Systems
Nathan Vaartjes, Vincent Francois-Lavet
Comments: 15 pages, 12 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Efficient elevator traffic management in large buildings is critical for minimizing passenger travel times and energy consumption. Because heuristic- or pattern-detection-based controllers struggle with the stochastic and combinatorial nature of dispatching, we model the six-elevator, fifteen-floor system at Vrije Universiteit Amsterdam as a Markov Decision Process and train an end-to-end Reinforcement Learning (RL) Elevator Group Control System (EGCS). Key innovations include a novel action space encoding to handle the combinatorial complexity of elevator dispatching, the introduction of infra-steps to model continuous passenger arrivals, and a tailored reward signal to improve learning efficiency. In addition, we explore various ways to adapt the discounting factor to the infra-step formulation. We investigate RL architectures based on Dueling Double Deep Q-learning, showing that the proposed RL-based EGCS adapts to fluctuating traffic patterns, learns from a highly stochastic environment, and thereby outperforms a traditional rule-based algorithm.

[9] arXiv:2507.00012 [pdf, html, other]
Title: Towards Undistillable Models by Minimizing Conditional Mutual Information
Linfeng Ye, Shayan Mohajer Hamidi, En-hui Yang
Comments: 27 pages, 6 figures, Transactions on Machine Learning Research
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

A deep neural network (DNN) is said to be undistillable if, when used as a black-box input-output teacher, it cannot be distilled through knowledge distillation (KD). In this case, the distilled student (referred to as the knockoff student) does not outperform a student trained independently with label smoothing (LS student) in terms of prediction accuracy. To protect intellectual property of DNNs, it is desirable to build undistillable DNNs. To this end, it is first observed that an undistillable DNN may have the trait that each cluster of its output probability distributions in response to all sample instances with the same label should be highly concentrated to the extent that each cluster corresponding to each label should ideally collapse into one probability distribution. Based on this observation and by measuring the concentration of each cluster in terms of conditional mutual information (CMI), a new training method called CMI minimized (CMIM) method is proposed, which trains a DNN by jointly minimizing the conventional cross entropy (CE) loss and the CMI values of all temperature scaled clusters across the entire temperature spectrum. The resulting CMIM model is shown, by extensive experiments, to be undistillable by all tested KD methods existing in the literature. That is, the knockoff students distilled by these KD methods from the CMIM model underperform the respective LS students. In addition, the CMIM model is also shown to performs better than the model trained with the CE loss alone in terms of their own prediction accuracy.

[10] arXiv:2507.00013 [pdf, html, other]
Title: ST-MTM: Masked Time Series Modeling with Seasonal-Trend Decomposition for Time Series Forecasting
Hyunwoo Seo, Chiehyeon Lim
Comments: Accepted by KDD 2025 research track
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Forecasting complex time series is an important yet challenging problem that involves various industrial applications. Recently, masked time-series modeling has been proposed to effectively model temporal dependencies for forecasting by reconstructing masked segments from unmasked ones. However, since the semantic information in time series is involved in intricate temporal variations generated by multiple time series components, simply masking a raw time series ignores the inherent semantic structure, which may cause MTM to learn spurious temporal patterns present in the raw data. To capture distinct temporal semantics, we show that masked modeling techniques should address entangled patterns through a decomposition approach. Specifically, we propose ST-MTM, a masked time-series modeling framework with seasonal-trend decomposition, which includes a novel masking method for the seasonal-trend components that incorporates different temporal variations from each component. ST-MTM uses a period masking strategy for seasonal components to produce multiple masked seasonal series based on inherent multi-periodicity and a sub-series masking strategy for trend components to mask temporal regions that share similar variations. The proposed masking method presents an effective pre-training task for learning intricate temporal variations and dependencies. Additionally, ST-MTM introduces a contrastive learning task to support masked modeling by enhancing contextual consistency among multiple masked seasonal representations. Experimental results show that our proposed ST-MTM achieves consistently superior forecasting performance compared to existing masked modeling, contrastive learning, and supervised forecasting methods.

[11] arXiv:2507.00014 [pdf, html, other]
Title: SWE-Bench-CL: Continual Learning for Coding Agents
Thomas Joshi, Shayan Chowdhury, Fatih Uysal
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Large Language Models (LLMs) have achieved impressive results on static code-generation benchmarks, but real-world software development unfolds as a continuous stream of evolving issues, fixes, and feature requests. We introduce SWE-Bench-CL, a novel continual learning benchmark built on the human-verified SWE-Bench Verified dataset introduced by OpenAI and Princeton-NLP in 2024. By organizing GitHub issues into chronologically ordered sequences that reflect natural repository evolution, SWE-Bench-CL enables direct evaluation of an agent's ability to accumulate experience, transfer knowledge across tasks, and resist catastrophic forgetting. We complement the dataset with (i) a preliminary analysis of inter-task structural similarity and contextual sensitivity, (ii) an interactive LangGraph-based evaluation framework augmented with a FAISS-backed semantic memory module, and (iii) a suite of specialized continual learning metrics -- including average accuracy, forgetting, forward/backward transfer, tool-use efficiency, and a generalized Composite Continual Learning Score and CL-F-beta score -- to capture the stability-plasticity trade-off. We outline a rigorous experimental protocol comparing memory-enabled and memory-disabled agents across diverse Python repositories. All code and data are publicly available at this https URL, providing the community with a reproducible platform for developing more adaptive and robust AI agents in software engineering.

[12] arXiv:2507.00015 [pdf, html, other]
Title: Vision Transformer with Adversarial Indicator Token against Adversarial Attacks in Radio Signal Classifications
Lu Zhang, Sangarapillai Lambotharan, Gan Zheng, Guisheng Liao, Xuekang Liu, Fabio Roli, Carsten Maple
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

The remarkable success of transformers across various fields such as natural language processing and computer vision has paved the way for their applications in automatic modulation classification, a critical component in the communication systems of Internet of Things (IoT) devices. However, it has been observed that transformer-based classification of radio signals is susceptible to subtle yet sophisticated adversarial attacks. To address this issue, we have developed a defensive strategy for transformer-based modulation classification systems to counter such adversarial attacks. In this paper, we propose a novel vision transformer (ViT) architecture by introducing a new concept known as adversarial indicator (AdvI) token to detect adversarial attacks. To the best of our knowledge, this is the first work to propose an AdvI token in ViT to defend against adversarial attacks. Integrating an adversarial training method with a detection mechanism using AdvI token, we combine a training time defense and running time defense in a unified neural network model, which reduces architectural complexity of the system compared to detecting adversarial perturbations using separate models. We investigate into the operational principles of our method by examining the attention mechanism. We show the proposed AdvI token acts as a crucial element within the ViT, influencing attention weights and thereby highlighting regions or features in the input data that are potentially suspicious or anomalous. Through experimental results, we demonstrate that our approach surpasses several competitive methods in handling white-box attack scenarios, including those utilizing the fast gradient method, projected gradient descent attacks and basic iterative method.

[13] arXiv:2507.00016 [pdf, html, other]
Title: Gradient-based Fine-Tuning through Pre-trained Model Regularization
Xuanbo Liu, Liu Liu, Fuxiang Wu, Fusheng Hao, Xianglong Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Large pre-trained models have demonstrated extensive applications across various fields. However, fine-tuning these models for specific downstream tasks demands significant computational resources and storage. One fine-tuning method, gradient-based parameter selection (GPS), focuses on fine-tuning only the parameters with high gradients in each neuron, thereby reducing the number of training parameters. Nevertheless, this approach increases computational resource requirements and storage demands. In this paper, we propose an efficient gradient-based and regularized fine-tuning method (GRFT) that updates the rows or columns of the weight matrix. We theoretically demonstrate that the rows or columns with the highest sum of squared gradients are optimal for updating. This strategy effectively reduces storage overhead and improves the efficiency of parameter selection. Additionally, we incorporate regularization to enhance knowledge transfer from the pre-trained model. GRFT achieves state-of-the-art performance, surpassing existing methods such as GPS, Adapter Tuning, and LoRA. Notably, GRFT requires updating only 1.22% and 0.30% of the total parameters on FGVC and VTAB datasets, respectively, demonstrating its high efficiency and effectiveness. The source code will be released soon.

[14] arXiv:2507.00018 [pdf, html, other]
Title: Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections
Bo Wang, Qinyuan Cheng, Runyu Peng, Rong Bao, Peiji Li, Qipeng Guo, Linyang Li, Zhiyuan Zeng, Yunhua Zhou, Xipeng Qiu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Post-training processes are essential phases in grounding pre-trained language models to real-world tasks, with learning from demonstrations or preference signals playing a crucial role in this adaptation. We present a unified theoretical framework bridging Supervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training. Through rigorous mathematical derivation, we demonstrate that both SFT and preference learning methods like Direct Preference Optimization (DPO) operate within the same optimal policy-reward subspace, with SFT representing a special case of implicit reward learning. Our analysis reveals a critical limitation in conventional SFT: the KL divergence term in distribution matching becomes constant with respect to the policy during optimization, failing to constrain model updates. To address this, we propose a simple yet effective learning rate reduction approach that yields significant performance improvements (up to \textbf{25\%} relative gain and \textbf{6\%} absolute win rate increase in instruction following tasks. Additionally, we derive alternative SFT objectives from various f-divergence functions that preserve the KL term during optimization, further enhancing post-DPO model performance. Finally, we extend the theoretical relationship between LLM logits and Q-functions from preference learning to the SFT context, providing mathematical derivations and experimental validation.

[15] arXiv:2507.00019 [pdf, html, other]
Title: Quantum Inspired Encoding Strategies for Machine Learning Models: Proposing and Evaluating Instance Level, Global Discrete, and Class Conditional Representations
Minati Rath, Hema Date
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)

In this study, we propose, evaluate and compare three quantum inspired data encoding strategies, Instance Level Strategy (ILS), Global Discrete Strategy (GDS) and Class Conditional Value Strategy (CCVS), for transforming classical data into quantum data for use in pure classical machine learning models. The primary objective is to reduce high encoding time while ensuring correct encoding values and analyzing their impact on classification performance. The Instance Level Strategy treats each row of dataset independently; mimics local quantum states. Global Discrete Value Based encoding strategy maps all unique feature values across the full dataset to quantum states uniformly. In contrast, the Class conditional Value based encoding strategy encodes unique values separately for each class, preserving class dependent information.
We apply these encoding strategies to a classification task and assess their impact on en-coding efficiency, correctness, model accuracy, and computational cost. By analyzing the trade offs between encoding time, precision, and predictive performance, this study provides insights into optimizing quantum inspired data transformations for classical machine learning workflows.

[16] arXiv:2507.00020 [pdf, html, other]
Title: Variational Autoencoder for Generating Broader-Spectrum prior Proposals in Markov chain Monte Carlo Methods
Marcio Borges, Felipe Pereira, Michel Tosin
Comments: The main contribution of this work is to show the advantages of using deep generative models like VAE to provide more flexible and versatile prior distributions
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

This study uses a Variational Autoencoder method to enhance the efficiency and applicability of Markov Chain Monte Carlo (McMC) methods by generating broader-spectrum prior proposals. Traditional approaches, such as the Karhunen-Loève Expansion (KLE), require previous knowledge of the covariance function, often unavailable in practical applications. The VAE framework enables a data-driven approach to flexibly capture a broader range of correlation structures in Bayesian inverse problems, particularly subsurface flow modeling. The methodology is tested on a synthetic groundwater flow inversion problem, where pressure data is used to estimate permeability fields. Numerical experiments demonstrate that the VAE-based parameterization achieves comparable accuracy to KLE when the correlation length is known and outperforms KLE when the assumed correlation length deviates from the true value. Moreover, the VAE approach significantly reduces stochastic dimensionality, improving computational efficiency. The results suggest that leveraging deep generative models in McMC methods can lead to more adaptable and efficient Bayesian inference in high-dimensional problems.

[17] arXiv:2507.00022 [pdf, html, other]
Title: GLU Attention Improve Transformer
Zehao Wang
Comments: 4 pages 4 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)

Gated Linear Units (GLU) have shown great potential in enhancing neural network performance. In this paper, I introduce a novel attention mechanism called GLU Attention, which introduces nonlinearity into the values of Attention. My experiments demonstrate that GLU Attention improves both model performance and convergence speed across text and vision modalities with zero additional parameters and negligible computational costs. GLU Attention is lightweight and can seamlessly integrate with other technologies, such as Flash Attention, Rotary Position Embedding (RoPE), and various Multi-Head Attention (MHA) variants such as Grouped-Query Attention (GQA). This project is open-sourced at github.

[18] arXiv:2507.00024 [pdf, html, other]
Title: AIMatDesign: Knowledge-Augmented Reinforcement Learning for Inverse Materials Design under Data Scarcity
Yeyong Yu, Xilei Bian, Jie Xiong, Xing Wu, Quan Qian
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)

With the growing demand for novel materials, machine learning-driven inverse design methods face significant challenges in reconciling the high-dimensional materials composition space with limited experimental data. Existing approaches suffer from two major limitations: (I) machine learning models often lack reliability in high-dimensional spaces, leading to prediction biases during the design process; (II) these models fail to effectively incorporate domain expert knowledge, limiting their capacity to support knowledge-guided inverse design. To address these challenges, we introduce AIMatDesign, a reinforcement learning framework that addresses these limitations by augmenting experimental data using difference-based algorithms to build a trusted experience pool, accelerating model convergence. To enhance model reliability, an automated refinement strategy guided by large language models (LLMs) dynamically corrects prediction inconsistencies, reinforcing alignment between reward signals and state value functions. Additionally, a knowledge-based reward function leverages expert domain rules to improve stability and efficiency during training. Our experiments demonstrate that AIMatDesign significantly surpasses traditional machine learning and reinforcement learning methods in discovery efficiency, convergence speed, and success rates. Among the numerous candidates proposed by AIMatDesign, experimental synthesis of representative Zr-based alloys yielded a top-performing BMG with 1.7GPa yield strength and 10.2\% elongation, closely matching predictions. Moreover, the framework accurately captured the trend of yield strength variation with composition, demonstrating its reliability and potential for closed-loop materials discovery.

[19] arXiv:2507.00025 [pdf, html, other]
Title: Generalizing to New Dynamical Systems via Frequency Domain Adaptation
Tiexin Qin, Hong Yan, Haoliang Li
Comments: Accepted by TPAMI 2025
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Learning the underlying dynamics from data with deep neural networks has shown remarkable potential in modeling various complex physical dynamics. However, current approaches are constrained in their ability to make reliable predictions in a specific domain and struggle with generalizing to unseen systems that are governed by the same general dynamics but differ in environmental characteristics. In this work, we formulate a parameter-efficient method, Fourier Neural Simulator for Dynamical Adaptation (FNSDA), that can readily generalize to new dynamics via adaptation in the Fourier space. Specifically, FNSDA identifies the shareable dynamics based on the known environments using an automatic partition in Fourier modes and learns to adjust the modes specific for each new environment by conditioning on low-dimensional latent systematic parameters for efficient generalization. We evaluate our approach on four representative families of dynamic systems, and the results show that FNSDA can achieve superior or competitive generalization performance compared to existing methods with a significantly reduced parameter cost. Our code is available at this https URL.

[20] arXiv:2507.00026 [pdf, html, other]
Title: ROSE: Toward Reality-Oriented Safety Evaluation of Large Language Models
Jiale Ding, Xiang Zheng, Cong Wang, Wei-Bin Lee, Xingjun Ma, Yu-Gang Jiang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

As Large Language Models (LLMs) are increasingly deployed as black-box components in real-world applications, evaluating their safety-especially under adversarial prompting-has become critical. Arguably, effective safety evaluations should be adaptive, evolving with LLM capabilities, and also cover a broad spectrum of harmful topics and real-world scenarios to fully expose potential vulnerabilities. Existing manual safety benchmarks, built on handcrafted adversarial prompts, are limited by their static nature and the intensive labor required to update them, making it difficult to keep pace with rapidly advancing LLMs. In contrast, automated adversarial prompt generation offers a promising path toward adaptive evaluation. However, current methods often suffer from insufficient adversarial topic coverage (topic-level diversity) and weak alignment with real-world contexts. These shortcomings stem from the exploration-exploitation dilemma in black-box optimization and a lack of real-world contextualization, resulting in adversarial prompts that are both topically narrow and scenario-repetitive. To address these issues, we propose Reality-Oriented Safety Evaluation (ROSE), a novel framework that uses multi-objective reinforcement learning to fine-tune an adversarial LLM for generating topically diverse and contextually rich adversarial prompts. Experiments show that ROSE outperforms existing methods in uncovering safety vulnerabilities in state-of-the-art LLMs, with notable improvements in integrated evaluation metrics. We hope ROSE represents a step toward more practical and reality-oriented safety evaluation of LLMs. WARNING: This paper contains examples of potentially harmful text.

[21] arXiv:2507.00028 [pdf, html, other]
Title: HiT-JEPA: A Hierarchical Self-supervised Trajectory Embedding Framework for Similarity Computation
Lihuan Li, Hao Xue, Shuang Ao, Yang Song, Flora Salim
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

The representation of urban trajectory data plays a critical role in effectively analyzing spatial movement patterns. Despite considerable progress, the challenge of designing trajectory representations that can capture diverse and complementary information remains an open research problem. Existing methods struggle in incorporating trajectory fine-grained details and high-level summary in a single model, limiting their ability to attend to both long-term dependencies while preserving local nuances. To address this, we propose HiT-JEPA (Hierarchical Interactions of Trajectory Semantics via a Joint Embedding Predictive Architecture), a unified framework for learning multi-scale urban trajectory representations across semantic abstraction levels. HiT-JEPA adopts a three-layer hierarchy that progressively captures point-level fine-grained details, intermediate patterns, and high-level trajectory abstractions, enabling the model to integrate both local dynamics and global semantics in one coherent structure. Extensive experiments on multiple real-world datasets for trajectory similarity computation show that HiT-JEPA's hierarchical design yields richer, multi-scale representations. Code is available at: this https URL.

[22] arXiv:2507.00029 [pdf, html, other]
Title: LoRA-Mixer: Coordinate Modular LoRA Experts Through Serial Attention Routing
Wenbing Li, Zikai Song, Hang Zhou, Yunyao Zhang, Junqing Yu, Wei Yang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Recent efforts to combine low-rank adaptation (LoRA) with mixture-of-experts (MoE) for adapting large language models (LLMs) to multiple tasks still exhibit prevailing limitations: they either swap entire attention/feed-forward layers for switch experts or bolt on parallel expert branches, diluting parameter efficiency and task fidelity. We propose the LoRA-Mixer, a modular and lightweight MoE framework that integrates LoRA experts. Our core innovation lies in replacing the projection matrices of the attention module's input/output linear layers with dynamically routed, task-specific LoRA experts. This design ensures seamless compatibility with diverse foundation models, including transformers and state space models (SSMs), by leveraging their inherent linear projection structures. The framework supports two operational paradigms: (1) joint optimization of LoRA experts and routing mechanisms via a novel hard-soft routing strategy, or (2) direct deployment of pre-trained, frozen LoRA modules sourced from external repositories. To enable robust router training with limited data while ensuring stable routing decisions and maximizing expert reuse, we introduce an adaptive Specialization Balance Loss (SBL) that jointly optimizes expert balance and task-specific alignment. Extensive experiments on seven benchmark datasets, including MedQA, CoLA, SST-2, GSM8K, ARC-E, ARC-C, and HumanEval, demonstrate the effectiveness of LoRA-Mixer. On datasets such as GSM8K, HumanEval, and MedQA, LoRA-Mixer achieves significant improvements of 7.61%, 4.88%, and 3.08% over the base models, respectively. Compared with state-of-the-art methods, LoRA-Mixer achieves additional improvements of 1.09%, 1.45%, and 1.68%, respectively, using only 48% of the parameters, demonstrating its efficiency and strong performance.

[23] arXiv:2507.00030 [pdf, html, other]
Title: Adaptive Action Duration with Contextual Bandits for Deep Reinforcement Learning in Dynamic Environments
Abhishek Verma, Nallarasan V, Balaraman Ravindran
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Deep Reinforcement Learning (DRL) has achieved remarkable success in complex sequential decision-making tasks, such as playing Atari 2600 games and mastering board games. A critical yet underexplored aspect of DRL is the temporal scale of action execution. We propose a novel paradigm that integrates contextual bandits with DRL to adaptively select action durations, enhancing policy flexibility and computational efficiency. Our approach augments a Deep Q-Network (DQN) with a contextual bandit module that learns to choose optimal action repetition rates based on state contexts. Experiments on Atari 2600 games demonstrate significant performance improvements over static duration baselines, highlighting the efficacy of adaptive temporal abstractions in DRL. This paradigm offers a scalable solution for real-time applications like gaming and robotics, where dynamic action durations are critical.

[24] arXiv:2507.00031 [pdf, html, other]
Title: Enhancing Spatio-Temporal Forecasting with Spatial Neighbourhood Fusion:A Case Study on COVID-19 Mobility in Peru
Chuan Li, Jiang You, Hassine Moungla, Vincent Gauthier, Miguel Nunez-del-Prado, Hugo Alatrista-Salas
Subjects: Machine Learning (cs.LG)

Accurate modeling of human mobility is critical for understanding epidemic spread and deploying timely interventions. In this work, we leverage a large-scale spatio-temporal dataset collected from Peru's national Digital Contact Tracing (DCT) application during the COVID-19 pandemic to forecast mobility flows across urban regions. A key challenge lies in the spatial sparsity of hourly mobility counts across hexagonal grid cells, which limits the predictive power of conventional time series models. To address this, we propose a lightweight and model-agnostic Spatial Neighbourhood Fusion (SPN) technique that augments each cell's features with aggregated signals from its immediate H3 neighbors. We evaluate this strategy on three forecasting backbones: NLinear, PatchTST, and K-U-Net, under various historical input lengths. Experimental results show that SPN consistently improves forecasting performance, achieving up to 9.85 percent reduction in test MSE. Our findings demonstrate that spatial smoothing of sparse mobility signals provides a simple yet effective path toward robust spatio-temporal forecasting during public health crises.

[25] arXiv:2507.00032 [pdf, html, other]
Title: Ken Utilization Layer: Hebbian Replay Within a Student's Ken for Adaptive Knowledge Tracing
Grey Kuling, Marinka Zitnik
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

We introduce KUL-KT, a biologically inspired architecture for knowledge tracing (KT), combining Hebbian memory encoding with gradient-based consolidation in a scalable, input-agnostic framework. KUL-KT adapts the principle of memory consolidation in neural systems, to student modeling by introducing two key innovations: (i) a time-decaying Hebbian memory update that enables graceful forgetting, and (ii) a novel Loss-aligned Internal Target (LIT) method to compute an ideal internal state, allowing continual learning without backpropagation through time. The architecture consists of a fast Hebbian memory that captures each learner interaction via a single associative update, and a slower linear network that consolidates recalled samples through gradient descent. This design enables few-shot personalization and natural forgetting without storing raw data or relying on large cohort training. Operating entirely in embedding space, KUL-KT supports both structured (tabular) and unstructured (short-answer) inputs. Empirically, KUL-KT outperforms strong baselines on ten public KT benchmarks in rank-sensitive metrics such as nDCG and Recall@10. In a classroom deployment, KUL-KT personalized quizzes from short-answer data, leading to improved learner-perceived helpfulness and reduced difficulty (p < 0.05). Ablation studies confirm that Hebbian decay and LIT are critical for continual adaptation. Compared to a strong graph-based KT model, KUL-KT trains 1.75x faster and uses 99.01\% less memory. These results position KUL-KT as a biologically grounded, memory-efficient, and input-flexible framework for personalized learning at scale.

[26] arXiv:2507.00033 [pdf, html, other]
Title: Moment Sampling in Video LLMs for Long-Form Video QA
Mustafa Chasmai, Gauri Jagatap, Gouthaman KV, Grant Van Horn, Subhransu Maji, Andrea Fanelli
Comments: Workshop on Video Large Language Models (VidLLMs) at CVPR 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Recent advancements in video large language models (Video LLMs) have significantly advanced the field of video question answering (VideoQA). While existing methods perform well on short videos, they often struggle with long-range reasoning in longer videos. To scale Video LLMs for longer video content, frame sub-sampling (selecting frames at regular intervals) is commonly used. However, this approach is suboptimal, often leading to the loss of crucial frames or the inclusion of redundant information from multiple similar frames. Missing key frames impairs the model's ability to answer questions accurately, while redundant frames lead the model to focus on irrelevant video segments and increase computational resource consumption. In this paper, we investigate the use of a general-purpose text-to-video moment retrieval model to guide the frame sampling process. We propose "moment sampling", a novel, model-agnostic approach that enables the model to select the most relevant frames according to the context of the question. Specifically, we employ a lightweight moment retrieval model to prioritize frame selection. By focusing on the frames most pertinent to the given question, our method enhances long-form VideoQA performance in Video LLMs. Through extensive experiments on four long-form VideoQA datasets, using four state-of-the-art Video LLMs, we demonstrate the effectiveness of the proposed approach.

[27] arXiv:2507.00034 [pdf, html, other]
Title: Data Collection with Non-Uniform Axial Power for Phase II of the OECD/NEA AI/ML Critical Heat Flux Benchmark
Reece Bourisaw, Reid McCants, Jean-Marie Le Corre, Anna Iskhakova, Arsen S. Iskhakov
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)

Critical heat flux (CHF) marks the onset of boiling crisis in light-water reactors, defining safe thermal-hydraulic operating limits. To support Phase II of the OECD/NEA AI/ML CHF benchmark, which introduces spatially varying power profiles, this work compiles and digitizes a broad CHF dataset covering both uniform and non-uniform axial heating conditions. Heating profiles were extracted from technical reports, interpolated onto a consistent axial mesh, validated via energy-balance checks, and encoded in machine-readable formats for benchmark compatibility.
Classical CHF correlations exhibit substantial errors under uniform heating and degrade markedly when applied to non-uniform profiles, while modern tabular methods offer improved but still imperfect predictions. A neural network trained solely on uniform data performs well in that regime but fails to generalize to spatially varying scenarios, underscoring the need for models that explicitly incorporate axial power distributions. By providing these curated datasets and baseline modeling results, this study lays the groundwork for advanced transfer-learning strategies, rigorous uncertainty quantification, and design-optimization efforts in the next phase of the CHF benchmark.

[28] arXiv:2507.00036 [pdf, html, other]
Title: IDRIFTNET: Physics-Driven Spatiotemporal Deep Learning for Iceberg Drift Forecasting
Rohan Putatunda, Sanjay Purushotham, Ratnaksha Lele, Vandana P. Janeja
Comments: 16 pages, 4 figures
Subjects: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)

Drifting icebergs in the polar oceans play a key role in the Earth's climate system, impacting freshwater fluxes into the ocean and regional ecosystems while also posing a challenge to polar navigation. However, accurately forecasting iceberg trajectories remains a formidable challenge, primarily due to the scarcity of spatiotemporal data and the complex, nonlinear nature of iceberg motion, which is also impacted by environmental variables. The iceberg motion is influenced by multiple dynamic environmental factors, creating a highly variable system that makes trajectory identification complex. These limitations hinder the ability of deep learning models to effectively capture the underlying dynamics and provide reliable predictive outcomes. To address these challenges, we propose a hybrid IDRIFTNET model, a physics-driven deep learning model that combines an analytical formulation of iceberg drift physics, with an augmented residual learning model. The model learns the pattern of mismatch between the analytical solution and ground-truth observations, which is combined with a rotate-augmented spectral neural network that captures both global and local patterns from the data to forecast future iceberg drift positions. We compare IDRIFTNET model performance with state-of-the-art models on two Antarctic icebergs: A23A and B22A. Our findings demonstrate that IDRIFTNET outperforms other models by achieving a lower Final Displacement Error (FDE) and Average Displacement Error (ADE) across a variety of time points. These results highlight IDRIFTNET's effectiveness in capturing the complex, nonlinear drift of icebergs for forecasting iceberg trajectories under limited data and dynamic environmental conditions.

[29] arXiv:2507.00037 [pdf, html, other]
Title: Model Fusion via Neuron Interpolation
Phoomraphee Luenam, Andreas Spanopoulos, Amit Sant, Thomas Hofmann, Sotiris Anagnostidis, Sidak Pal Singh
Comments: 5 figures, 15 tables, 23 pages
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Model fusion aims to combine the knowledge of multiple models by creating one representative model that captures the strengths of all of its parents. However, this process is non-trivial due to differences in internal representations, which can stem from permutation invariance, random initialization, or differently distributed training data. We present a novel, neuron-centric family of model fusion algorithms designed to integrate multiple trained neural networks into a single network effectively regardless of training data distribution. Our algorithms group intermediate neurons of parent models to create target representations that the fused model approximates with its corresponding sub-network. Unlike prior approaches, our approach incorporates neuron attribution scores into the fusion process. Furthermore, our algorithms can generalize to arbitrary layer types. Experimental results on various benchmark datasets demonstrate that our algorithms consistently outperform previous fusion techniques, particularly in zero-shot and non-IID fusion scenarios. The code is available at this https URL.

[30] arXiv:2507.00038 [pdf, other]
Title: Quality over Quantity: An Effective Large-Scale Data Reduction Strategy Based on Pointwise V-Information
Fei Chen, Wenchi Zhou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Data reduction plays a vital role in data-centric AI by identifying the most informative instance within large-scale datasets to enhance model training efficiency. The core challenge lies in how to select the optimal instances-rather than the entire datasets-to improve data quality and training efficiency. In this paper, we propose an effective data reduction strategy based on Pointwise V-information(PVI). First, we quantify instance difficulty using PVI and filter out low-difficulty instances enabling a static approach. Experiments demonstrate that removing 10%-30% of the data preserves the classifier performance with only a 0.0001% to 0.76% loss in this http URL, we use a progressive learning approach to training the classifiers on instances sorted by ascending PVI, accelerating convergence and achieving a 0.8% accuracy gain over conventional training. Our results suggest that with the effective data reduction strategy, training a classifier on the selected optimal subset could enhance the model performance and boost training efficiency. Moreover, we have transferred the PVI framework, which previously applied only to English datasets, to diverse Chinese NLP tasks and base models, leading to valuable insights for cross-lingual data reduction and faster training. The codes are released at this https URL.

[31] arXiv:2507.00039 [pdf, other]
Title: Pattern-Based Graph Classification: Comparison of Quality Measures and Importance of Preprocessing
Lucas Potin, Rosa Figueiredo, Vincent Labatut, Christine Largeron
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Graph classification aims to categorize graphs based on their structural and attribute features, with applications in diverse fields such as social network analysis and bioinformatics. Among the methods proposed to solve this task, those relying on patterns (i.e. subgraphs) provide good explainability, as the patterns used for classification can be directly interpreted. To identify meaningful patterns, a standard approach is to use a quality measure, i.e. a function that evaluates the discriminative power of each pattern. However, the literature provides tens of such measures, making it difficult to select the most appropriate for a given application. Only a handful of surveys try to provide some insight by comparing these measures, and none of them specifically focuses on graphs. This typically results in the systematic use of the most widespread measures, without thorough evaluation. To address this issue, we present a comparative analysis of 38 quality measures from the literature. We characterize them theoretically, based on four mathematical properties. We leverage publicly available datasets to constitute a benchmark, and propose a method to elaborate a gold standard ranking of the patterns. We exploit these resources to perform an empirical comparison of the measures, both in terms of pattern ranking and classification performance. Moreover, we propose a clustering-based preprocessing step, which groups patterns appearing in the same graphs to enhance classification performance. Our experimental results demonstrate the effectiveness of this step, reducing the number of patterns to be processed while achieving comparable performance. Additionally, we show that some popular measures widely used in the literature are not associated with the best results.

[32] arXiv:2507.00041 [pdf, html, other]
Title: TalentMine: LLM-Based Extraction and Question-Answering from Multimodal Talent Tables
Varun Mannam, Fang Wang, Chaochun Liu, Xin Chen
Comments: Submitted to KDD conference, workshop: Talent and Management Computing (TMC 2025), this https URL
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

In talent management systems, critical information often resides in complex tabular formats, presenting significant retrieval challenges for conventional language models. These challenges are pronounced when processing Talent documentation that requires precise interpretation of tabular relationships for accurate information retrieval and downstream decision-making. Current table extraction methods struggle with semantic understanding, resulting in poor performance when integrated into retrieval-augmented chat applications. This paper identifies a key bottleneck - while structural table information can be extracted, the semantic relationships between tabular elements are lost, causing downstream query failures. To address this, we introduce TalentMine, a novel LLM-enhanced framework that transforms extracted tables into semantically enriched representations. Unlike conventional approaches relying on CSV or text linearization, our method employs specialized multimodal reasoning to preserve both structural and semantic dimensions of tabular data. Experimental evaluation across employee benefits document collections demonstrates TalentMine's superior performance, achieving 100% accuracy in query answering tasks compared to 0% for standard AWS Textract extraction and 40% for AWS Textract Visual Q&A capabilities. Our comparative analysis also reveals that the Claude v3 Haiku model achieves optimal performance for talent management applications. The key contributions of this work include (1) a systematic analysis of semantic information loss in current table extraction pipelines, (2) a novel LLM-based method for semantically enriched table representation, (3) an efficient integration framework for retrieval-augmented systems as end-to-end systems, and (4) comprehensive benchmarks on talent analytics tasks showing substantial improvements across multiple categories.

[33] arXiv:2507.00042 [pdf, html, other]
Title: Catastrophic Forgetting Mitigation via Discrepancy-Weighted Experience Replay
Xinrun Xu, Jianwen Yang, Qiuhong Zhang, Zhanbiao Lian, Zhiming Ding, Shan Jiang
Comments: ICANN 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Continually adapting edge models in cloud-edge collaborative object detection for traffic monitoring suffers from catastrophic forgetting, where models lose previously learned knowledge when adapting to new data distributions. This is especially problematic in dynamic traffic environments characterised by periodic variations (e.g., day/night, peak hours), where past knowledge remains valuable. Existing approaches like experience replay and visual prompts offer some mitigation, but struggle to effectively prioritize and leverage historical data for optimal knowledge retention and adaptation. Specifically, simply storing and replaying all historical data can be inefficient, while treating all historical experiences as equally important overlooks their varying relevance to the current domain. This paper proposes ER-EMU, an edge model update algorithm based on adaptive experience replay, to address these limitations. ER-EMU utilizes a limited-size experience buffer managed using a First-In-First-Out (FIFO) principle, and a novel Domain Distance Metric-based Experience Selection (DDM-ES) algorithm. DDM-ES employs the multi-kernel maximum mean discrepancy (MK-MMD) to quantify the dissimilarity between target domains, prioritizing the selection of historical data that is most dissimilar to the current target domain. This ensures training diversity and facilitates the retention of knowledge from a wider range of past experiences, while also preventing overfitting to the new domain. The experience buffer is also updated using a simple random sampling strategy to maintain a balanced representation of previous domains. Experiments on the Bellevue traffic video dataset, involving repeated day/night cycles, demonstrate that ER-EMU consistently improves the performance of several state-of-the-art cloud-edge collaborative object detection frameworks.

[34] arXiv:2507.00043 [pdf, html, other]
Title: MR-CLIP: Efficient Metadata-Guided Learning of MRI Contrast Representations
Mehmet Yigit Avci, Pedro Borges, Paul Wright, Mehmet Yigitsoy, Sebastien Ourselin, Jorge Cardoso
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Accurate interpretation of Magnetic Resonance Imaging scans in clinical systems is based on a precise understanding of image contrast. This contrast is primarily governed by acquisition parameters, such as echo time and repetition time, which are stored in the DICOM metadata. To simplify contrast identification, broad labels such as T1-weighted or T2-weighted are commonly used, but these offer only a coarse approximation of the underlying acquisition settings. In many real-world datasets, such labels are entirely missing, leaving raw acquisition parameters as the only indicators of contrast. Adding to this challenge, the available metadata is often incomplete, noisy, or inconsistent. The lack of reliable and standardized metadata complicates tasks such as image interpretation, retrieval, and integration into clinical workflows. Furthermore, robust contrast-aware representations are essential to enable more advanced clinical applications, such as achieving modality-invariant representations and data harmonization. To address these challenges, we propose MR-CLIP, a multimodal contrastive learning framework that aligns MR images with their DICOM metadata to learn contrast-aware representations, without relying on manual labels. Trained on a diverse clinical dataset that spans various scanners and protocols, MR-CLIP captures contrast variations across acquisitions and within scans, enabling anatomy-invariant representations. We demonstrate its effectiveness in cross-modal retrieval and contrast classification, highlighting its scalability and potential for further clinical applications. The code and weights are publicly available at this https URL.

[35] arXiv:2507.00044 [pdf, other]
Title: HistoART: Histopathology Artifact Detection and Reporting Tool
Seyed Kahaki, Alexander R. Webber, Ghada Zamzmi, Adarsh Subbaswamy, Rucha Deshpande, Aldo Badano
Comments: 14 pages, 5 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

In modern cancer diagnostics, Whole Slide Imaging (WSI) is widely used to digitize tissue specimens for detailed, high-resolution examination; however, other diagnostic approaches, such as liquid biopsy and molecular testing, are also utilized based on the cancer type and clinical context. While WSI has revolutionized digital histopathology by enabling automated, precise analysis, it remains vulnerable to artifacts introduced during slide preparation and scanning. These artifacts can compromise downstream image analysis. To address this challenge, we propose and compare three robust artifact detection approaches for WSIs: (1) a foundation model-based approach (FMA) using a fine-tuned Unified Neural Image (UNI) architecture, (2) a deep learning approach (DLA) built on a ResNet50 backbone, and (3) a knowledge-based approach (KBA) leveraging handcrafted features from texture, color, and frequency-based metrics. The methods target six common artifact types: tissue folds, out-of-focus regions, air bubbles, tissue damage, marker traces, and blood contamination. Evaluations were conducted on 50,000+ image patches from diverse scanners (Hamamatsu, Philips, Leica Aperio AT2) across multiple sites. The FMA achieved the highest patch-wise AUROC of 0.995 (95% CI [0.994, 0.995]), outperforming the ResNet50-based method (AUROC: 0.977, 95% CI [0.977, 0.978]) and the KBA (AUROC: 0.940, 95% CI [0.933, 0.946]). To translate detection into actionable insights, we developed a quality report scorecard that quantifies high-quality patches and visualizes artifact distributions.

[36] arXiv:2507.00045 [pdf, html, other]
Title: CaughtCheating: Is Your MLLM a Good Cheating Detective? Exploring the Boundary of Visual Perception and Reasoning
Ming Li, Chenguang Wang, Yijun Liang, Xiyao Wang, Yuhang Zhou, Xiyang Wu, Yuqing Zhang, Ruiyi Zhang, Tianyi Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Recent agentic Multi-Modal Large Language Models (MLLMs) such as GPT-o3 have achieved near-ceiling scores on various existing benchmarks, motivating a demand for more challenging test tasks. These MLLMs have been reported to excel in a few expert-level tasks for humans, e.g., GeoGuesser, reflecting their potential as a detective who can notice minuscule cues in an image and weave them into coherent, situational explanations, leading to a reliable answer. But can they match the performance of excellent human detectives? To answer this question, we investigate some hard scenarios where GPT-o3 can still handle, and find a common scenario where o3's performance drops to nearly zero, which we name CaughtCheating. It is inspired by the social media requests that ask others to detect suspicious clues from photos shared by the poster's partner. We conduct extensive experiments and analysis to understand why existing MLLMs lack sufficient capability to solve this kind of task. CaughtCheating provides a class of challenging visual perception and reasoning tasks with great value and practical usage. Success in these tasks paves the way for MLLMs to acquire human-level detective perception and reasoning capabilities.

[37] arXiv:2507.00046 [pdf, other]
Title: Evolutionary computing-based image segmentation method to detect defects and features in Additive Friction Stir Deposition Process
Akshansh Mishra, Eyob Mesele Sefene, Shivraman Thapliyal
Comments: 7 pages, 4 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE)

This work proposes an evolutionary computing-based image segmentation approach for analyzing soundness in Additive Friction Stir Deposition (AFSD) processes. Particle Swarm Optimization (PSO) was employed to determine optimal segmentation thresholds for detecting defects and features in multilayer AFSD builds. The methodology integrates gradient magnitude analysis with distance transforms to create novel attention-weighted visualizations that highlight critical interface regions. Five AFSD samples processed under different conditions were analyzed using multiple visualization techniques i.e. self-attention maps, and multi-channel visualization. These complementary approaches reveal subtle material transition zones and potential defect regions which were not readily observable through conventional imaging. The PSO algorithm automatically identified optimal threshold values (ranging from 156-173) for each sample, enabling precise segmentation of material interfaces. The multi-channel visualization technique effectively combines boundary information (red channel), spatial relationships (green channel), and material density data (blue channel) into cohesive representations that quantify interface quality. The results demonstrate that attention-based analysis successfully identifies regions of incomplete bonding and inhomogeneities in AFSD joints, providing quantitative metrics for process optimization and quality assessment of additively manufactured components.

[38] arXiv:2507.00047 [pdf, html, other]
Title: Reducing Profile-Based Matching to the Maximum Weight Matching Problem
Seongbeom Park
Comments: 8 pages, in English; 9 pages, in Korean;
Subjects: Discrete Mathematics (cs.DM); Theoretical Economics (econ.TH); Combinatorics (math.CO)

The profile-based matching problem is the problem of finding a matching that optimizes profile from an instance $(G, r, \langle u_1, \dots, u_r \rangle)$, where $G$ is a bipartite graph $(A \cup B, E)$, $r$ is the number of utility functions, and $u_i: E \to \{ 0, 1, \dots, U_i \}$ is utility functions for $1 \le i \le r$. A matching is optimal if the matching maximizes the sum of the 1st utility, subject to this, maximizes the sum of the 2nd utility, and so on. The profile-based matching can express rank-maximal matching \cite{irving2006rank}, fair matching \cite{huang2016fair}, and weight-maximal matching \cite{huang2012weight}. These problems can be reduced to maximum weight matching problems, but the reduction is known to be inefficient due to the huge weights.
This paper presents the condition for a weight function to find an optimal matching by reducing profile-based matching to the maximum weight matching problem. It is shown that a weight function which represents utilities as a mixed-radix numeric system with base-$(2U_i+1)$ can be used, so the complexity of the problem is $O(m\sqrt{n}(\log{n} + \sum_{i=1}^{r}\log{U_i}))$ for $n = |V|$, $m = |E|$. In addition, it is demonstrated that the weight lower bound for rank-maximal/fair/weight-maximal matching, better computational complexity for fair/weight-maximal matching, and an algorithm to verify a maximum weight matching can be reduced to rank-maximal matching. Finally, the effectiveness of the profile-based algorithm is evaluated with real data for school choice lottery.

[39] arXiv:2507.00048 [pdf, html, other]
Title: A collaborative digital twin built on FAIR data and compute infrastructure
Thomas M. Deucher, Juan C. Verduzco, Michael Titus, Alejandro Strachan
Comments: 10 pages, 5 figures
Subjects: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)

The integration of machine learning with automated experimentation in self-driving laboratories (SDL) offers a powerful approach to accelerate discovery and optimization tasks in science and engineering applications. When supported by findable, accessible, interoperable, and reusable (FAIR) data infrastructure, SDLs with overlapping interests can collaborate more effectively. This work presents a distributed SDL implementation built on nanoHUB services for online simulation and FAIR data management. In this framework, geographically dispersed collaborators conducting independent optimization tasks contribute raw experimental data to a shared central database. These researchers can then benefit from analysis tools and machine learning models that automatically update as additional data become available. New data points are submitted through a simple web interface and automatically processed using a nanoHUB Sim2L, which extracts derived quantities and indexes all inputs and outputs in a FAIR data repository called ResultsDB. A separate nanoHUB workflow enables sequential optimization using active learning, where researchers define the optimization objective, and machine learning models are trained on-the-fly with all existing data, guiding the selection of future experiments. Inspired by the concept of ``frugal twin", the optimization task seeks to find the optimal recipe to combine food dyes to achieve the desired target color. With easily accessible and inexpensive materials, researchers and students can set up their own experiments, share data with collaborators, and explore the combination of FAIR data, predictive ML models, and sequential optimization. The tools introduced are generally applicable and can easily be extended to other optimization problems.

[40] arXiv:2507.00049 [pdf, html, other]
Title: AdaDeDup: Adaptive Hybrid Data Pruning for Efficient Large-Scale Object Detection Training
Feiyang Kang, Nadine Chang, Maying Shen, Marc T. Law, Rafid Mahmood, Ruoxi Jia, Jose M. Alvarez
Comments: Preprint
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

The computational burden and inherent redundancy of large-scale datasets challenge the training of contemporary machine learning models. Data pruning offers a solution by selecting smaller, informative subsets, yet existing methods struggle: density-based approaches can be task-agnostic, while model-based techniques may introduce redundancy or prove computationally prohibitive. We introduce Adaptive De-Duplication (AdaDeDup), a novel hybrid framework that synergistically integrates density-based pruning with model-informed feedback in a cluster-adaptive manner. AdaDeDup first partitions data and applies an initial density-based pruning. It then employs a proxy model to evaluate the impact of this initial pruning within each cluster by comparing losses on kept versus pruned samples. This task-aware signal adaptively adjusts cluster-specific pruning thresholds, enabling more aggressive pruning in redundant clusters while preserving critical data in informative ones. Extensive experiments on large-scale object detection benchmarks (Waymo, COCO, nuScenes) using standard models (BEVFormer, Faster R-CNN) demonstrate AdaDeDup's advantages. It significantly outperforms prominent baselines, substantially reduces performance degradation (e.g., over 54% versus random sampling on Waymo), and achieves near-original model performance while pruning 20% of data, highlighting its efficacy in enhancing data efficiency for large-scale model training. Code is open-sourced.

[41] arXiv:2507.00050 [pdf, html, other]
Title: SEZ-HARN: Self-Explainable Zero-shot Human Activity Recognition Network
Devin Y. De Silva, Sandareka Wickramanayake, Dulani Meedeniya, Sanka Rasnayaka
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Human Activity Recognition (HAR), which uses data from Inertial Measurement Unit (IMU) sensors, has many practical applications in healthcare and assisted living environments. However, its use in real-world scenarios has been limited by the lack of comprehensive IMU-based HAR datasets that cover a wide range of activities and the lack of transparency in existing HAR models. Zero-shot HAR (ZS-HAR) overcomes the data limitations, but current models struggle to explain their decisions, making them less transparent. This paper introduces a novel IMU-based ZS-HAR model called the Self-Explainable Zero-shot Human Activity Recognition Network (SEZ-HARN). It can recognize activities not encountered during training and provide skeleton videos to explain its decision-making process. We evaluate the effectiveness of the proposed SEZ-HARN on four benchmark datasets PAMAP2, DaLiAc, HTD-MHAD and MHealth and compare its performance against three state-of-the-art black-box ZS-HAR models. The experiment results demonstrate that SEZ-HARN produces realistic and understandable explanations while achieving competitive Zero-shot recognition accuracy. SEZ-HARN achieves a Zero-shot prediction accuracy within 3\% of the best-performing black-box model on PAMAP2 while maintaining comparable performance on the other three datasets.

[42] arXiv:2507.00052 [pdf, html, other]
Title: VSF-Med:A Vulnerability Scoring Framework for Medical Vision-Language Models
Binesh Sadanandan, Vahid Behzadan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Vision Language Models (VLMs) hold great promise for streamlining labour-intensive medical imaging workflows, yet systematic security evaluations in clinical settings remain scarce. We introduce VSF--Med, an end-to-end vulnerability-scoring framework for medical VLMs that unites three novel components: (i) a rich library of sophisticated text-prompt attack templates targeting emerging threat vectors; (ii) imperceptible visual perturbations calibrated by structural similarity (SSIM) thresholds to preserve clinical realism; and (iii) an eight-dimensional rubric evaluated by two independent judge LLMs, whose raw scores are consolidated via z-score normalization to yield a 0--32 composite risk metric. Built entirely on publicly available datasets and accompanied by open-source code, VSF--Med synthesizes over 30,000 adversarial variants from 5,000 radiology images and enables reproducible benchmarking of any medical VLM with a single command. Our consolidated analysis reports mean z-score shifts of $0.90\sigma$ for persistence-of-attack-effects, $0.74\sigma$ for prompt-injection effectiveness, and $0.63\sigma$ for safety-bypass success across state-of-the-art VLMs. Notably, Llama-3.2-11B-Vision-Instruct exhibits a peak vulnerability increase of $1.29\sigma$ for persistence-of-attack-effects, while GPT-4o shows increases of $0.69\sigma$ for that same vector and $0.28\sigma$ for prompt-injection attacks.

[43] arXiv:2507.00054 [pdf, html, other]
Title: Enhancing Reasoning Capabilities in SLMs with Reward Guided Dataset Distillation
Shreyansh Padarha
Comments: 17 Pages, 7 figures
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

The push to compress and impart the proficiency of Large Language Models (LLMs) into more deployable and efficient Small Language Models (SLMs) has benefited from improvements in knowledge distillation (KD) techniques. These techniques allow a smaller student model to learn from a more capable and larger teacher model's responses. However, distillation often revolves around the student model merely copying the teacher's in-distribution responses, limiting its generalisability. This limitation is amplified on reasoning tasks and can be computationally expensive. In this study, we propose AdvDistill, a reward-guided dataset distillation framework. We utilise multiple generations (responses) from a teacher for each prompt and assign rewards based on rule-based verifiers. These varying and normally distributed rewards serve as weights when training student models. Our methods and their subsequent behavioural analysis demonstrate a significant improvement in student model performance for mathematical and complex reasoning tasks, showcasing the efficacy and benefits of incorporating a rewarding mechanism in dataset distillation processes.

[44] arXiv:2507.00055 [pdf, html, other]
Title: Leveraging Unlabeled Audio-Visual Data in Speech Emotion Recognition using Knowledge Distillation
Varsha Pendyala, Pedro Morgado, William Sethares
Comments: Accepted at INTERSPEECH 2025
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Multimedia (cs.MM); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV); Signal Processing (eess.SP)

Voice interfaces integral to the human-computer interaction systems can benefit from speech emotion recognition (SER) to customize responses based on user emotions. Since humans convey emotions through multi-modal audio-visual cues, developing SER systems using both the modalities is beneficial. However, collecting a vast amount of labeled data for their development is expensive. This paper proposes a knowledge distillation framework called LightweightSER (LiSER) that leverages unlabeled audio-visual data for SER, using large teacher models built on advanced speech and face representation models. LiSER transfers knowledge regarding speech emotions and facial expressions from the teacher models to lightweight student models. Experiments conducted on two benchmark datasets, RAVDESS and CREMA-D, demonstrate that LiSER can reduce the dependence on extensive labeled datasets for SER tasks.

[45] arXiv:2507.00057 [pdf, html, other]
Title: Estimating Correctness Without Oracles in LLM-Based Code Generation
Thomas Valentin, Ardi Madadi, Gaetano Sapia, Marcel Böhme
Comments: 8 pages + refs and appendix
Subjects: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)

Generating code from natural language specifications is one of the most successful applications of Large Language Models (LLMs). Yet, they hallucinate: LLMs produce outputs that may be grammatically correct but are factually incorrect. Without an existing, correct implementation (i.e., an oracle), can we quantify how likely the generated program is correct?
In this paper, we propose a measure of incorrectness, called incoherence, that can be estimated efficiently in the absence of an oracle and provides a lower bound on the error, i.e., the probability that the LLM-generated program for that specification is incorrect. Our experiments demonstrate an extraordinary effectiveness. For the average code generation task, our incoherence-based methodology can automatically identify about two-thirds of incorrect programs without reports of false positives. In fact, an oracle-based evaluation of LLMs can be reliably replaced by an incoherence-based evaluation. In particular, we find a very strong agreement between the ranking of LLMs by the number of programs deemed correct via an oracle (pass@1) and the ranking of LLMs by the number of programs deemed correct via our incoherence.

[46] arXiv:2507.00059 [pdf, html, other]
Title: Verification of Hamiltonian Path Conjecture (BHR Conjecture) for Integers up to p=31
Ranjan N Naik
Comments: This is a result an improvement over by Mariusz Meszka for all primes up to 23 (included) with the aid of a computer
Subjects: Discrete Mathematics (cs.DM); Data Structures and Algorithms (cs.DS); Combinatorics (math.CO)

The BHR (Buratti-Horak-Rosa) Conjecture (2006) proposes that for every p and a multiset L of (p-1) positive integers modulo p, there exists a Hamiltonian path in the Complete Graph Kp with consecutive edge lengths given by the elements of L. In this article, we outline an approach to the conjecture based on frequency partitions and local/global adjustment operations and backtracking. We describe the mathematical strategy, experimental evidence, and implementation in a Python Program to explore valid Hamiltonian paths p < 37. This is a result an improvement over by Mariusz Meszka for all primes up to 23 (included) with the aid of a computer.

[47] arXiv:2507.00061 [pdf, html, other]
Title: Smooth-Distill: A Self-distillation Framework for Multitask Learning with Wearable Sensor Data
Hoang-Dieu Vu, Duc-Nghia Tran, Quang-Tu Pham, Hieu H. Pham, Nicolas Vuillerme, Duc-Tan Tran
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)

This paper introduces Smooth-Distill, a novel self-distillation framework designed to simultaneously perform human activity recognition (HAR) and sensor placement detection using wearable sensor data. The proposed approach utilizes a unified CNN-based architecture, MTL-net, which processes accelerometer data and branches into two outputs for each respective task. Unlike conventional distillation methods that require separate teacher and student models, the proposed framework utilizes a smoothed, historical version of the model itself as the teacher, significantly reducing training computational overhead while maintaining performance benefits. To support this research, we developed a comprehensive accelerometer-based dataset capturing 12 distinct sleep postures across three different wearing positions, complementing two existing public datasets (MHealth and WISDM). Experimental results show that Smooth-Distill consistently outperforms alternative approaches across different evaluation scenarios, achieving notable improvements in both human activity recognition and device placement detection tasks. This method demonstrates enhanced stability in convergence patterns during training and exhibits reduced overfitting compared to traditional multitask learning baselines. This framework contributes to the practical implementation of knowledge distillation in human activity recognition systems, offering an effective solution for multitask learning with accelerometer data that balances accuracy and training efficiency. More broadly, it reduces the computational cost of model training, which is critical for scenarios requiring frequent model updates or training on resource-constrained platforms. The code and model are available at this https URL\_distill.

[48] arXiv:2507.00066 [pdf, html, other]
Title: InSight-R: A Framework for Risk-informed Human Failure Event Identification and Interface-Induced Risk Assessment Driven by AutoGraph
Xingyu Xiao, Jiejuan Tong, Peng Chen, Jun Sun, Zhe Sui, Jingang Liang, Hongru Zhao, Jun Zhao, Haitao Wang
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Human reliability remains a critical concern in safety-critical domains such as nuclear power, where operational failures are often linked to human error. While conventional human reliability analysis (HRA) methods have been widely adopted, they rely heavily on expert judgment for identifying human failure events (HFEs) and assigning performance influencing factors (PIFs). This reliance introduces challenges related to reproducibility, subjectivity, and limited integration of interface-level data. In particular, current approaches lack the capacity to rigorously assess how human-machine interface design contributes to operator performance variability and error susceptibility. To address these limitations, this study proposes a framework for risk-informed human failure event identification and interface-induced risk assessment driven by AutoGraph (InSight-R). By linking empirical behavioral data to the interface-embedded knowledge graph (IE-KG) constructed by the automated graph-based execution framework (AutoGraph), the InSight-R framework enables automated HFE identification based on both error-prone and time-deviated operational paths. Furthermore, we discuss the relationship between designer-user conflicts and human error. The results demonstrate that InSight-R not only enhances the objectivity and interpretability of HFE identification but also provides a scalable pathway toward dynamic, real-time human reliability assessment in digitalized control environments. This framework offers actionable insights for interface design optimization and contributes to the advancement of mechanism-driven HRA methodologies.

[49] arXiv:2507.00068 [pdf, html, other]
Title: MANTA: Cross-Modal Semantic Alignment and Information-Theoretic Optimization for Long-form Multimodal Understanding
Ziqi Zhong, Daniel Tang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

While multi-modal learning has advanced significantly, current approaches often treat modalities separately, creating inconsistencies in representation and reasoning. We introduce MANTA (Multi-modal Abstraction and Normalization via Textual Alignment), a theoretically-grounded framework that unifies visual and auditory inputs into a structured textual space for seamless processing with large language models. MANTA addresses four key challenges: (1) semantic alignment across modalities with information-theoretic optimization, (2) adaptive temporal synchronization for varying information densities, (3) hierarchical content representation for multi-scale understanding, and (4) context-aware retrieval of sparse information from long sequences. We formalize our approach within a rigorous mathematical framework, proving its optimality for context selection under token constraints. Extensive experiments on the challenging task of Long Video Question Answering show that MANTA improves state-of-the-art models by up to 22.6% in overall accuracy, with particularly significant gains (27.3%) on videos exceeding 30 minutes. Additionally, we demonstrate MANTA's superiority on temporal reasoning tasks (23.8% improvement) and cross-modal understanding (25.1% improvement). Our framework introduces novel density estimation techniques for redundancy minimization while preserving rare signals, establishing new foundations for unifying multimodal representations through structured text.

[50] arXiv:2507.00070 [pdf, other]
Title: An efficient plant disease detection using transfer learning approach
Bosubabu Sambana, Hillary Sunday Nnadi, Mohd Anas Wajid, Nwosu Ogochukwu Fidelia, Claudia Camacho-Zuñiga, Henry Dozie Ajuzie, Edeh Michael Onyema
Comments: 15 pages , 4 figures. Scientific Reports 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Plant diseases pose significant challenges to farmers and the agricultural sector at large. However, early detection of plant diseases is crucial to mitigating their effects and preventing widespread damage, as outbreaks can severely impact the productivity and quality of crops. With advancements in technology, there are increasing opportunities for automating the monitoring and detection of disease outbreaks in plants. This study proposed a system designed to identify and monitor plant diseases using a transfer learning approach. Specifically, the study utilizes YOLOv7 and YOLOv8, two state-ofthe-art models in the field of object detection. By fine-tuning these models on a dataset of plant leaf images, the system is able to accurately detect the presence of Bacteria, Fungi and Viral diseases such as Powdery Mildew, Angular Leaf Spot, Early blight and Tomato mosaic virus. The model's performance was evaluated using several metrics, including mean Average Precision (mAP), F1-score, Precision, and Recall, yielding values of 91.05, 89.40, 91.22, and 87.66, respectively. The result demonstrates the superior effectiveness and efficiency of YOLOv8 compared to other object detection methods, highlighting its potential for use in modern agricultural practices. The approach provides a scalable, automated solution for early any plant disease detection, contributing to enhanced crop yield, reduced reliance on manual monitoring, and supporting sustainable agricultural practices.

[51] arXiv:2507.00073 [pdf, html, other]
Title: Fractional Policy Gradients: Reinforcement Learning with Long-Term Memory
Urvi Pawar, Kunal Telangi
Comments: Submitted to Journal of Machine Learning Research (JMLR), June 2025. 24 pages, 3 figures. Under review
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We propose Fractional Policy Gradients (FPG), a reinforcement learning framework incorporating fractional calculus for long-term temporal modeling in policy optimization. Standard policy gradient approaches face limitations from Markovian assumptions, exhibiting high variance and inefficient sampling. By reformulating gradients using Caputo fractional derivatives, FPG establishes power-law temporal correlations between state transitions. We develop an efficient recursive computation technique for fractional temporal-difference errors with constant time and memory requirements. Theoretical analysis shows FPG achieves asymptotic variance reduction of order O(t^(-alpha)) versus standard policy gradients while preserving convergence. Empirical validation demonstrates 35-68% sample efficiency gains and 24-52% variance reduction versus state-of-the-art baselines. This framework provides a mathematically grounded approach for leveraging long-range dependencies without computational overhead.

[52] arXiv:2507.00075 [pdf, html, other]
Title: Theoretical Modeling of LLM Self-Improvement Training Dynamics Through Solver-Verifier Gap
Yifan Sun, Yushan Liang, Zhen Zhang, Jiaye Teng
Comments: 24 pages
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Self-improvement is among the most prominent techniques within the realm of large language models (LLM), aiming to enhance the LLM performance without relying on external data. Despite its significance, generally how LLM performances evolve during the self-improvement process remains underexplored. In this paper, we theoretically model the training dynamics of self-improvement via the concept of solver-verifier gap. This is inspired by the conjecture that the performance enhancement of self-improvement stems from the gap between LLM's solver capability and verifier capability. Based on the theoretical framework, we further introduce how to predict the ultimate power of self-improvement using only information from the first few training epochs. We empirically validate the effectiveness of the theoretical model on various LLMs and datasets. Beyond self-improvement, we extend our analysis to investigate how external data influences these dynamics within the framework. Notably, we find that under limited external data regimes, such external data can be utilized at any stage without significantly affecting final performances, which accords with the empirical observations.

[53] arXiv:2507.00078 [pdf, html, other]
Title: The language of time: a language model perspective on time-series foundation models
Yi Xie, Yun Xiong, Zejian Shi, Hao Niu, Zhengfu Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

With the rise of large language models, the paradigm of training foundation models with massive parameter counts on vast datasets has been adopted in multiple domains to achieve remarkable success. Time series foundation models represent a significant extension of this paradigm, demonstrating exceptional expressive power, generalization, and cross-domain transferability. However, this gives rise to a fundamental paradox: time series data reflect distinct dynamical systems, making cross-domain transfer intuitively implausible, yet this is contradicted by the models' empirical success. To resolve this paradox, this paper investigates, from both theoretical and experimental perspectives, the representation learning mechanisms and generalization capabilities of patch-based time series foundation models. We argue that such models are not merely applying a new architecture but are fundamentally generalizing the representation paradigm of language models by extending deterministic vector-based representations to latent probabilistic distributional forms. Our theoretical analysis supports this framework by demonstrating that continuous time-series patches can be faithfully quantized into a discrete vocabulary whose key statistical properties are highly consistent with those of natural language. This generalization allows time series models to inherit the robust representation and transfer abilities of large language models, thereby explaining their superior performance in temporal tasks. Ultimately, our work provides a rigorous theoretical cornerstone for understanding, evaluating, and improving the safety and reliability of large-scale time series foundation models.

[54] arXiv:2507.00079 [pdf, other]
Title: VoyagerVision: Investigating the Role of Multi-modal Information for Open-ended Learning Systems
Ethan Smyth, Alessandro Suglia
Comments: website: this https URL
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Open-endedness is an active field of research in the pursuit of capable Artificial General Intelligence (AGI), allowing models to pursue tasks of their own choosing. Simultaneously, recent advancements in Large Language Models (LLMs) such as GPT-4o [9] have allowed such models to be capable of interpreting image inputs. Implementations such as OMNI-EPIC [4] have made use of such features, providing an LLM with pixel data of an agent's POV to parse the environment and allow it to solve tasks. This paper proposes that providing these visual inputs to a model gives it greater ability to interpret spatial environments, and as such, can increase the number of tasks it can successfully perform, extending its open-ended potential. To this aim, this paper proposes VoyagerVision -- a multi-modal model capable of creating structures within Minecraft using screenshots as a form of visual feedback, building on the foundation of Voyager. VoyagerVision was capable of creating an average of 2.75 unique structures within fifty iterations of the system, as Voyager was incapable of this, it is an extension in an entirely new direction. Additionally, in a set of building unit tests VoyagerVision was successful in half of all attempts in flat worlds, with most failures arising in more complex structures. Project website is available at this https URL

[55] arXiv:2507.00080 [pdf, html, other]
Title: Online Meal Detection Based on CGM Data Dynamics
Ali Tavasoli, Heman Shakeri
Journal-ref: IFAC Diabetes Technology Conference 2025
Subjects: Machine Learning (cs.LG); Adaptation and Self-Organizing Systems (nlin.AO); Applications (stat.AP)

We utilize dynamical modes as features derived from Continuous Glucose Monitoring (CGM) data to detect meal events. By leveraging the inherent properties of underlying dynamics, these modes capture key aspects of glucose variability, enabling the identification of patterns and anomalies associated with meal consumption. This approach not only improves the accuracy of meal detection but also enhances the interpretability of the underlying glucose dynamics. By focusing on dynamical features, our method provides a robust framework for feature extraction, facilitating generalization across diverse datasets and ensuring reliable performance in real-world applications. The proposed technique offers significant advantages over traditional approaches, improving detection accuracy,

[56] arXiv:2507.00081 [pdf, other]
Title: State and Memory is All You Need for Robust and Reliable AI Agents
Matthew Muhoberac, Atharva Parikh, Nirvi Vakharia, Saniya Virani, Aco Radujevic, Savannah Wood, Meghav Verma, Dimitri Metaxotos, Jeyaraman Soundararajan, Thierry Masquelin, Alexander G. Godfrey, Sean Gardner, Dobrila Rudnicki, Sam Michael, Gaurav Chopra
Comments: 5 Main Figures, 10 Extended Data Figures (37 Pages) for Manuscript ; 9 Supplementary Tables, 40 Supplementary Figures (180 Pages) for Supporting Information
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Chemical Physics (physics.chem-ph)

Large language models (LLMs) have enabled powerful advances in natural language understanding and generation. Yet their application to complex, real-world scientific workflows remain limited by challenges in memory, planning, and tool integration. Here, we introduce SciBORG (Scientific Bespoke Artificial Intelligence Agents Optimized for Research Goals), a modular agentic framework that allows LLM-based agents to autonomously plan, reason, and achieve robust and reliable domain-specific task execution. Agents are constructed dynamically from source code documentation and augmented with finite-state automata (FSA) memory, enabling persistent state tracking and context-aware decision-making. This approach eliminates the need for manual prompt engineering and allows for robust, scalable deployment across diverse applications via maintaining context across extended workflows and to recover from tool or execution failures. We validate SciBORG through integration with both physical and virtual hardware, such as microwave synthesizers for executing user-specified reactions, with context-aware decision making and demonstrate its use in autonomous multi-step bioassay retrieval from the PubChem database utilizing multi-step planning, reasoning, agent-to-agent communication and coordination for execution of exploratory tasks. Systematic benchmarking shows that SciBORG agents achieve reliable execution, adaptive planning, and interpretable state transitions. Our results show that memory and state awareness are critical enablers of agentic planning and reliability, offering a generalizable foundation for deploying AI agents in complex environments.

[57] arXiv:2507.00082 [pdf, html, other]
Title: Federated Learning-Enabled Hybrid Language Models for Communication-Efficient Token Transmission
Faranaksadat Solat, Joohyung Lee, Mohamed Seif, Dusit Niyato, H. Vincent Poor
Comments: 17 pages, 16 figures, IEEE Internet of Things
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Hybrid Language Models (HLMs) combine the low-latency efficiency of Small Language Models (SLMs) on edge devices with the high accuracy of Large Language Models (LLMs) on centralized servers. Unlike traditional end-to-end LLM inference, HLMs reduce latency and communication by invoking LLMs only when local SLM predictions are uncertain, i.e., when token-level confidence is low or entropy is high. However, ambiguous or low-confidence predictions still require frequent offloading to the LLM, leading to significant communication overhead in bandwidth-constrained settings. To address this, we propose FedHLM, a communication-efficient HLM framework that integrates uncertainty-aware inference with Federated Learning (FL). FedHLM's key innovation lies in collaboratively learning token-level uncertainty thresholds that govern when LLM assistance is needed. Rather than using static or manually tuned thresholds, FedHLM employs FL to optimize these thresholds in a privacy-preserving, distributed manner. Additionally, it leverages embedding-based token representations for Peer-to-Peer (P2P) resolution, enabling clients to reuse tokens inferred by semantically similar peers without engaging the LLM. We further introduce hierarchical model aggregation: edge servers refine local routing policies through client updates, while cross-cluster coordination aligns global decision boundaries. This layered design captures recurring uncertainty patterns, reducing redundant LLM queries. Experiments on large-scale news classification tasks show that FedHLM reduces LLM transmissions by over 95 percent with negligible accuracy loss, making it well-suited for scalable and efficient edge-AI applications.

[58] arXiv:2507.00083 [pdf, other]
Title: Strategic Counterfactual Modeling of Deep-Target Airstrike Systems via Intervention-Aware Spatio-Causal Graph Networks
Wei Meng
Comments: This paper proposes the first closed-loop causal modeling framework (IA-STGNN) that links tactical strike variables to strategic delay outcomes via graph neural networks with counterfactual reasoning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

This study addresses the lack of structured causal modeling between tactical strike behavior and strategic delay in current strategic-level simulations, particularly the structural bottlenecks in capturing intermediate variables within the "resilience - nodal suppression - negotiation window" chain. We propose the Intervention-Aware Spatio-Temporal Graph Neural Network (IA-STGNN), a novel framework that closes the causal loop from tactical input to strategic delay output. The model integrates graph attention mechanisms, counterfactual simulation units, and spatial intervention node reconstruction to enable dynamic simulations of strike configurations and synchronization strategies. Training data are generated from a multi-physics simulation platform (GEANT4 + COMSOL) under NIST SP 800-160 standards, ensuring structural traceability and policy-level validation. Experimental results demonstrate that IA-STGNN significantly outperforms baseline models (ST-GNN, GCN-LSTM, XGBoost), achieving a 12.8 percent reduction in MAE and 18.4 percent increase in Top-5 percent accuracy, while improving causal path consistency and intervention stability. IA-STGNN enables interpretable prediction of strategic delay and supports applications such as nuclear deterrence simulation, diplomatic window assessment, and multi-strategy optimization, providing a structured and transparent AI decision-support mechanism for high-level policy modeling.

[59] arXiv:2507.00085 [pdf, html, other]
Title: A Joint Topology-Data Fusion Graph Network for Robust Traffic Speed Prediction with Data Anomalism
Ruiyuan Jiang, Dongyao Jia, Eng Gee Lim, Pengfei Fan, Yuli Zhang, Shangbo Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Accurate traffic prediction is essential for Intelligent Transportation Systems (ITS), yet current methods struggle with the inherent complexity and non-linearity of traffic dynamics, making it difficult to integrate spatial and temporal characteristics. Furthermore, existing approaches use static techniques to address non-stationary and anomalous historical data, which limits adaptability and undermines data smoothing. To overcome these challenges, we propose the Graph Fusion Enhanced Network (GFEN), an innovative framework for network-level traffic speed prediction. GFEN introduces a novel topological spatiotemporal graph fusion technique that meticulously extracts and merges spatial and temporal correlations from both data distribution and network topology using trainable methods, enabling the modeling of multi-scale spatiotemporal features. Additionally, GFEN employs a hybrid methodology combining a k-th order difference-based mathematical framework with an attention-based deep learning structure to adaptively smooth historical observations and dynamically mitigate data anomalies and non-stationarity. Extensive experiments demonstrate that GFEN surpasses state-of-the-art methods by approximately 6.3% in prediction accuracy and exhibits convergence rates nearly twice as fast as recent hybrid models, confirming its superior performance and potential to significantly enhance traffic prediction system efficiency.

[60] arXiv:2507.00087 [pdf, html, other]
Title: pUniFind: a unified large pre-trained deep learning model pushing the limit of mass spectra interpretation
Jiale Zhao, Pengzhi Mao, Kaifei Wang, Yiming Li, Yaping Peng, Ranfei Chen, Shuqi Lu, Xiaohong Ji, Jiaxiang Ding, Xin Zhang, Yucheng Liao, Weinan E, Weijie Zhang, Han Wen, Hao Chi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Deep learning has advanced mass spectrometry data interpretation, yet most models remain feature extractors rather than unified scoring frameworks. We present pUniFind, the first large-scale multimodal pre-trained model in proteomics that integrates end-to-end peptide-spectrum scoring with open, zero-shot de novo sequencing. Trained on over 100 million open search-derived spectra, pUniFind aligns spectral and peptide modalities via cross modality prediction and outperforms traditional engines across diverse datasets, particularly achieving a 42.6 percent increase in the number of identified peptides in immunopeptidomics. Supporting over 1,300 modifications, pUniFind identifies 60 percent more PSMs than existing de novo methods despite a 300-fold larger search space. A deep learning based quality control module further recovers 38.5 percent additional peptides including 1,891 mapped to the genome but absent from reference proteomes while preserving full fragment ion coverage. These results establish a unified, scalable deep learning framework for proteomic analysis, offering improved sensitivity, modification coverage, and interpretability.

[61] arXiv:2507.00089 [pdf, html, other]
Title: A new machine learning framework for occupational accidents forecasting with safety inspections integration
Aho Yapi, Pierre Latouche, Arnaud Guillin, Yan Bailly
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)

We propose a generic framework for short-term occupational accident forecasting that leverages safety inspections and models accident occurrences as binary time series. The approach generates daily predictions, which are then aggregated into weekly safety assessments to better inform decision making. To ensure the reliability and operational applicability of the forecasts, we apply a sliding-window cross-validation procedure specifically designed for time series data, combined with an evaluation based on aggregated period-level metrics. Several machine learning algorithms, including logistic regression, tree-based models, and neural networks, are trained and systematically compared within this framework. Unlike the other approaches, the long short-term memory (LSTM) network outperforms the other approaches and detects the upcoming high-risk periods with a balanced accuracy of 0.86, confirming the robustness of our methodology and demonstrating that a binary time series model can anticipate these critical periods based on safety inspections. The proposed methodology converts routine safety inspection data into clear weekly risk scores, detecting the periods when accidents are most likely. Decision-makers can integrate these scores into their planning tools to classify inspection priorities, schedule targeted interventions, and funnel resources to the sites or shifts classified as highest risk, stepping in before incidents occur and getting the greatest return on safety investments.

[62] arXiv:2507.00090 [pdf, html, other]
Title: Generating Heterogeneous Multi-dimensional Data : A Comparative Study
Corbeau Michael, Claeys Emmanuelle, Serrurier Mathieu, Zaraté Pascale
Comments: accepted at IEEE SMC 2025 Vienna
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Allocation of personnel and material resources is highly sensible in the case of firefighter interventions. This allocation relies on simulations to experiment with various scenarios. The main objective of this allocation is the global optimization of the firefighters response. Data generation is then mandatory to study various scenarios In this study, we propose to compare different data generation methods. Methods such as Random Sampling, Tabular Variational Autoencoders, standard Generative Adversarial Networks, Conditional Tabular Generative Adversarial Networks and Diffusion Probabilistic Models are examined to ascertain their efficacy in capturing the intricacies of firefighter interventions. Traditional evaluation metrics often fall short in capturing the nuanced requirements of synthetic datasets for real-world scenarios. To address this gap, an evaluation of synthetic data quality is conducted using a combination of domain-specific metrics tailored to the firefighting domain and standard measures such as the Wasserstein distance. Domain-specific metrics include response time distribution, spatial-temporal distribution of interventions, and accidents representation. These metrics are designed to assess data variability, the preservation of fine and complex correlations and anomalies such as event with a very low occurrence, the conformity with the initial statistical distribution and the operational relevance of the synthetic data. The distribution has the particularity of being highly unbalanced, none of the variables following a Gaussian distribution, adding complexity to the data generation process.

[63] arXiv:2507.00091 [pdf, html, other]
Title: On the Optimality of Coded Distributed Computing for Ring Networks
Zhenhao Huang, Minquan Cheng, Kai Wan, Qifu Tyler Sun, Youlong Wu
Comments: Part of the work has been presented at ISIT 2025
Subjects: Information Theory (cs.IT)

We consider a coded distributed computing problem in a ring-based communication network, where $N$ computing nodes are arranged in a ring topology and each node can only communicate with its neighbors within a constant distance $d$. To mitigate the communication bottleneck in exchanging intermediate values, we propose new coded distributed computing schemes for the ring-based network that exploit both ring topology and redundant computation (i.e., each map function is computed by $r$ nodes). Two typical cases are considered: all-gather where each node requires all intermediate values mapped from all input files, and all-to-all where each node requires a distinct set of intermediate values from other nodes. For the all-gather case, we propose a new coded scheme based on successive reverse carpooling where nodes transmit every encoded packet containing two messages traveling in opposite directions along the same path. Theoretical converse proof shows that our scheme achieves the optimal tradeoff between communication load, computation load $r$, and broadcast distance $d$ when $N\gg d$. For the all-to-all case, instead of simply repeating our all-gather scheme, we delicately deliver intermediate values based on their proximity to intended nodes to reduce unnecessary transmissions. We derive an information-theoretic lower bound on the optimal communication load and show that our scheme is asymptotically optimal under the cyclic placement when $N\gg r$. The optimality results indicate that in ring-based networks, the redundant computation $r$ only leads to an additive gain in reducing communication load while the broadcast distance $d$ contributes to a multiplicative gain.

[64] arXiv:2507.00092 [pdf, html, other]
Title: Thinking About Thinking: SAGE-nano's Inverse Reasoning for Self-Aware Language Models
Basab Jha, Firoj Paudel, Ujjwal Puri, Zhang Yuting, Choi Donghyuk, Wang Junhao
Comments: 19 pages, 2 figures, 9 tables
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Large Language Models (LLMs) have demonstrated remarkable capabilities at solving complex reasoning tasks with Chain-of-Thought (CoT) prompting, but their decision-making processes remain somewhat blackbox. We introduce textbfinverse reasoning, a novel paradigm enabling LLMs to decompose and explain their own reasoning chains post-hoc. Our approach, used in SAGE-nano, a 4-billion-parameter reasoning model, employs a metacognitive structure that reflects back via attention processes to identify major decision points and generate explanations of reasoning choices. While typical CoT approaches are directed towards forward reasoning generation, inverse reasoning provides insight into why specific reasoning chains were selected over others. Through thorough testing of logical reasoning puzzles, math problems and ethical dilemmas from AQUA-RAT, CommonsenseQA, and customized benchmarks, we demonstrate that SAGE-nano is at the cutting edge both on reasoning accuracy (74.6% on AQUA-RAT) and explanation quality (92.1% human preference score) for its task, and offers performance almost on par with models like Claude-3.5 Sonnet or GPT-4o. Our contributions are: (i) the first rigorous framework for LLM self-reflection via inverse reasoning, (ii) a novel metalearning framework to reverse the attention flow, (iii) comprehensive evaluation frameworks for reasoning transparency, and (iv) evidence that increasing reasoning using inverse reasoning improves interpretability along with reasoning performance. Our work creates new avenues for transparent AI systems and closes significant gaps in AI safety, education, and scientific discovery.

[65] arXiv:2507.00093 [pdf, html, other]
Title: $σ$-Maximal Ancestral Graphs
Binghua Yao, Joris M. Mooij
Comments: It has beee accepted by the 41st Conference on Uncertainty in Artificial Intelligence (UAI)
Subjects: Discrete Mathematics (cs.DM); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Statistics Theory (math.ST)

Maximal Ancestral Graphs (MAGs) provide an abstract representation of Directed Acyclic Graphs (DAGs) with latent (selection) variables. These graphical objects encode information about ancestral relations and d-separations of the DAGs they represent. This abstract representation has been used amongst others to prove the soundness and completeness of the FCI algorithm for causal discovery, and to derive a do-calculus for its output. One significant inherent limitation of MAGs is that they rule out the possibility of cyclic causal relationships. In this work, we address that limitation. We introduce and study a class of graphical objects that we coin ''$\sigma$-Maximal Ancestral Graphs'' (''$\sigma$-MAGs''). We show how these graphs provide an abstract representation of (possibly cyclic) Directed Graphs (DGs) with latent (selection) variables, analogously to how MAGs represent DAGs. We study the properties of these objects and provide a characterization of their Markov equivalence classes.

[66] arXiv:2507.00094 [pdf, html, other]
Title: Efficient Conformance Checking of Rich Data-Aware Declare Specifications (Extended)
Jacobo Casas-Ramos, Sarah Winkler, Alessandro Gianola, Marco Montali, Manuel Mucientes, Manuel Lama
Comments: Extended version of the paper of the same title accepted at the 23rd International Conference on Business Process Management (BPM 2025)
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)

Despite growing interest in process analysis and mining for data-aware specifications, alignment-based conformance checking for declarative process models has focused on pure control-flow specifications, or mild data-aware extensions limited to numerical data and variable-to-constant comparisons. This is not surprising: finding alignments is computationally hard, even more so in the presence of data dependencies. In this paper, we challenge this problem in the case where the reference model is captured using data-aware Declare with general data types and data conditions. We show that, unexpectedly, it is possible to compute data-aware optimal alignments in this rich setting, enjoying at once efficiency and expressiveness. This is achieved by carefully combining the two best-known approaches to deal with control flow and data dependencies when computing alignments, namely A* search and SMT solving. Specifically, we introduce a novel algorithmic technique that efficiently explores the search space, generating descendant states through the application of repair actions aiming at incrementally resolving constraint violations. We prove the correctness of our algorithm and experimentally show its efficiency. The evaluation witnesses that our approach matches or surpasses the performance of the state of the art while also supporting significantly more expressive data dependencies, showcasing its potential to support real-world applications.

[67] arXiv:2507.00096 [pdf, html, other]
Title: AI-Governed Agent Architecture for Web-Trustworthy Tokenization of Alternative Assets
Ailiya Borjigin, Wei Zhou, Cong He
Comments: 8 Pages, 1 figure
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Alternative Assets tokenization is transforming non-traditional financial instruments are represented and traded on the web. However, ensuring trustworthiness in web-based tokenized ecosystems poses significant challenges, from verifying off-chain asset data to enforcing regulatory compliance. This paper proposes an AI-governed agent architecture that integrates intelligent agents with blockchain to achieve web-trustworthy tokenization of alternative assets. In the proposed architecture, autonomous agents orchestrate the tokenization process (asset verification, valuation, compliance checking, and lifecycle management), while an AI-driven governance layer monitors agent behavior and enforces trust through adaptive policies and cryptoeconomic incentives. We demonstrate that this approach enhances transparency, security, and compliance in asset tokenization, addressing key concerns around data authenticity and fraud. A case study on tokenizing real estate assets illustrates how the architecture mitigates risks (e.g., fraudulent listings and money laundering) through real-time AI anomaly detection and on-chain enforcement. Our evaluation and analysis suggest that combining AI governance with multi-agent systems and blockchain can significantly bolster trust in tokenized asset ecosystems. This work offers a novel framework for trustworthy asset tokenization on the web and provides insights for practitioners aiming to deploy secure, compliant tokenization platforms.

[68] arXiv:2507.00101 [pdf, html, other]
Title: DFReg: A Physics-Inspired Framework for Global Weight Distribution Regularization in Neural Networks
Giovanni Ruggieri
Subjects: Machine Learning (cs.LG)

We introduce DFReg, a physics-inspired regularization method for deep neural networks that operates on the global distribution of weights. Drawing from Density Functional Theory (DFT), DFReg applies a functional penalty to encourage smooth, diverse, and well-distributed weight configurations. Unlike traditional techniques such as Dropout or L2 decay, DFReg imposes global structural regularity without architectural changes or stochastic perturbations.

[69] arXiv:2507.00102 [pdf, other]
Title: Towards transparent and data-driven fault detection in manufacturing: A case study on univariate, discrete time series
Bernd Hofmann, Patrick Bruendl, Huong Giang Nguyen, Joerg Franke
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)

Ensuring consistent product quality in modern manufacturing is crucial, particularly in safety-critical applications. Conventional quality control approaches, reliant on manually defined thresholds and features, lack adaptability to the complexity and variability inherent in production data and necessitate extensive domain expertise. Conversely, data-driven methods, such as machine learning, demonstrate high detection performance but typically function as black-box models, thereby limiting their acceptance in industrial environments where interpretability is paramount. This paper introduces a methodology for industrial fault detection, which is both data-driven and transparent. The approach integrates a supervised machine learning model for multi-class fault classification, Shapley Additive Explanations for post-hoc interpretability, and a do-main-specific visualisation technique that maps model explanations to operator-interpretable features. Furthermore, the study proposes an evaluation methodology that assesses model explanations through quantitative perturbation analysis and evaluates visualisations by qualitative expert assessment. The approach was applied to the crimping process, a safety-critical joining technique, using a dataset of univariate, discrete time series. The system achieves a fault detection accuracy of 95.9 %, and both quantitative selectivity analysis and qualitative expert evaluations confirmed the relevance and inter-pretability of the generated explanations. This human-centric approach is designed to enhance trust and interpretability in data-driven fault detection, thereby contributing to applied system design in industrial quality control.

[70] arXiv:2507.00105 [pdf, html, other]
Title: Graph Neural Networks in Wind Power Forecasting
Javier Castellano, Ignacio Villanueva
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)

We study the applicability of GNNs to the problem of wind energy forecasting. We find that certain architectures achieve performance comparable to our best CNN-based benchmark. The study is conducted on three wind power facilities using five years of historical data. Numerical Weather Prediction (NWP) variables were used as predictors, and models were evaluated on a 24 to 36 hour ahead test horizon.

[71] arXiv:2507.00108 [pdf, html, other]
Title: Teaching Programming in the Age of Generative AI: Insights from Literature, Pedagogical Proposals, and Student Perspectives
Clemente Rubio-Manzano, Jazna Meza, Rodolfo Fernandez-Santibanez, Christian Vidal-Castro
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Programming Languages (cs.PL)

Computer programming is undergoing a true transformation driven by powerful new tools for automatic source code generation based on large language models. This transformation is also manifesting in introductory programming courses at universities around the world, generating an in-depth debate about how programming content should be taught, learned, and assessed in the context of generative artificial intelligence.
This article aims, on the one hand, to review the most relevant studies on this issue, highlighting the advantages and disadvantages identified in the specialized literature. On the other hand, it proposes enriching teaching and learning methodologies by focusing on code comprehension and execution rather than on mere coding or program functionality. In particular, it advocates for the use of visual representations of code and visual simulations of its execution as effective tools for teaching, learning, and assessing programming, thus fostering a deeper understanding among students.
Finally, the opinions of students who took the object-oriented programming course are presented to provide preliminary context supporting the incorporation of visual simulations in Java (or other languages) as part of the training process.

[72] arXiv:2507.00145 [pdf, html, other]
Title: AI-Hybrid TRNG: Kernel-Based Deep Learning for Near-Uniform Entropy Harvesting from Physical Noise
Hasan Yiğit
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Information Theory (cs.IT); Signal Processing (eess.SP)

AI-Hybrid TRNG is a deep-learning framework that extracts near-uniform entropy directly from physical noise, eliminating the need for bulky quantum devices or expensive laboratory-grade RF receivers. Instead, it relies on a low-cost, thumb-sized RF front end, plus CPU-timing jitter, for training, and then emits 32-bit high-entropy streams without any quantization step.
Unlike deterministic or trained artificial intelligence random number generators (RNGs), our dynamic inner-outer network couples adaptive natural sources and reseeding, yielding truly unpredictable and autonomous sequences. Generated numbers pass the NIST SP 800-22 battery better than a CPU-based method. It also passes nineteen bespoke statistical tests for both bit- and integer-level analysis. All results satisfy cryptographic standards, while forward and backward prediction experiments reveal no exploitable biases. The model's footprint is below 0.5 MB, making it deployable on MCUs and FPGA soft cores, as well as suitable for other resource-constrained platforms.
By detaching randomness quality from dedicated hardware, AI-Hybrid TRNG broadens the reach of high-integrity random number generators across secure systems, cryptographic protocols, embedded and edge devices, stochastic simulations, and server applications that need randomness.

[73] arXiv:2507.00148 [pdf, html, other]
Title: Sensitivity and Query Complexity under Uncertainty
Deepu Benson, Balagopal Komarath, Nikhil Mande, Sai Soumya Nalli, Jayalal Sarma, Karteek Sreenivasaiah
Comments: 43 pages, merged and improved version of technical reports - arXiv:2501.00831 and arXiv:2412.06395
Subjects: Computational Complexity (cs.CC)

In this paper, we study the query complexity of Boolean functions in the presence of uncertainty, motivated by parallel computation with an unlimited number of processors where inputs are allowed to be unknown. We allow each query to produce three results: zero, one, or unknown. The output could also be: zero, one, or unknown, with the constraint that we should output ''unknown'' only when we cannot determine the answer from the revealed input bits. Such an extension of a Boolean function is called its hazard-free extension.
- We prove an analogue of Huang's celebrated sensitivity theorem [Annals of Mathematics, 2019] in our model of query complexity with uncertainty.
- We show that the deterministic query complexity of the hazard-free extension of a Boolean function is at most quadratic in its randomized query complexity and quartic in its quantum query complexity, improving upon the best-known bounds in the Boolean world.
- We exhibit an exponential gap between the smallest depth (size) of decision trees computing a Boolean function, and those computing its hazard-free extension.
- We present general methods to convert decision trees for Boolean functions to those for their hazard-free counterparts, and show optimality of this construction. We also parameterize this result by the maximum number of unknown values in the input.
- We show lower bounds on size complexity of decision trees for hazard-free extensions of Boolean functions in terms of the number of prime implicants and prime implicates of the underlying Boolean function.

[74] arXiv:2507.00152 [pdf, html, other]
Title: Table Understanding and (Multimodal) LLMs: A Cross-Domain Case Study on Scientific vs. Non-Scientific Data
Ekaterina Borisova, Fabio Barth, Nils Feldhus, Raia Abu Ahmad, Malte Ostendorff, Pedro Ortiz Suarez, Georg Rehm, Sebastian Möller
Comments: TRL@ACL 2025, camera-ready version
Subjects: Computation and Language (cs.CL)

Tables are among the most widely used tools for representing structured data in research, business, medicine, and education. Although LLMs demonstrate strong performance in downstream tasks, their efficiency in processing tabular data remains underexplored. In this paper, we investigate the effectiveness of both text-based and multimodal LLMs on table understanding tasks through a cross-domain and cross-modality evaluation. Specifically, we compare their performance on tables from scientific vs. non-scientific contexts and examine their robustness on tables represented as images vs. text. Additionally, we conduct an interpretability analysis to measure context usage and input relevance. We also introduce the TableEval benchmark, comprising 3017 tables from scholarly publications, Wikipedia, and financial reports, where each table is provided in five different formats: Image, Dictionary, HTML, XML, and LaTeX. Our findings indicate that while LLMs maintain robustness across table modalities, they face significant challenges when processing scientific tables.

[75] arXiv:2507.00153 [pdf, html, other]
Title: Diffusion-Based Image Augmentation for Semantic Segmentation in Outdoor Robotics
Peter Mortimer, Mirko Maehlisch
Comments: Presented at the 2025 IEEE ICRA Workshop on Field Robotics
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The performance of leaning-based perception algorithms suffer when deployed in out-of-distribution and underrepresented environments. Outdoor robots are particularly susceptible to rapid changes in visual scene appearance due to dynamic lighting, seasonality and weather effects that lead to scenes underrepresented in the training data of the learning-based perception system. In this conceptual paper, we focus on preparing our autonomous vehicle for deployment in snow-filled environments. We propose a novel method for diffusion-based image augmentation to more closely represent the deployment environment in our training data. Diffusion-based image augmentations rely on the public availability of vision foundation models learned on internet-scale datasets. The diffusion-based image augmentations allow us to take control over the semantic distribution of the ground surfaces in the training data and to fine-tune our model for its deployment environment. We employ open vocabulary semantic segmentation models to filter out augmentation candidates that contain hallucinations. We believe that diffusion-based image augmentations can be extended to many other environments apart from snow surfaces, like sandy environments and volcanic terrains.

[76] arXiv:2507.00156 [pdf, html, other]
Title: Localized evaluation and fast summation in the extrapolated regularization method for integrals in Stokes flow
Joseph Siebor, Svetlana Tlupova
Subjects: Numerical Analysis (math.NA)

Boundary integral equation methods are widely used in the solution of many partial differential equations. The kernels that appear in these surface integrals are nearly singular when evaluated near the boundary, and straightforward numerical integration produces inaccurate results. In Beale and Tlupova (Adv. Comput. Math, 2024), an extrapolated regularization method was proposed to accurately evaluate the nearly singular single and double-layer surface integrals for harmonic potentials or Stokes flow. The kernels are regularized using a smoothing parameter, and then a standard quadrature is applied. The integrals are computed for three choices of the smoothing parameter to find the extrapolated value to fifth order accuracy. In this work, we apply several techniques to reduce the computational cost of the extrapolated regularization method applied to the Stokes single and double layer integrals. First, we use a straightforward OpenMP parallelization over the target points. Second, we note that the effect of the regularization is local and evaluate only the local component of the sum for three values of the smoothing parameter. The non-local component of the sum is only evaluated once and reused in the other sums. This component is still the computational bottleneck as it is $O(N^2)$, where $N$ is the system size. We apply the kernel-independent treecode to these far-field interactions to reduce the CPU time. We carry out experiments to determine optimal parameters both in terms of accuracy and efficiency of the computations. We then use these techniques to compute Stokes flow around two spheres that are nearly touching.

[77] arXiv:2507.00161 [pdf, other]
Title: Designing an Adaptive Storytelling Platform to Promote Civic Education in Politically Polarized Learning Environments
Christopher M. Wegemer, Edward Halim, Jeff Burke
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Political polarization undermines democratic civic education by exacerbating identity-based resistance to opposing viewpoints. Emerging AI technologies offer new opportunities to advance interventions that reduce polarization and promote political open-mindedness. We examined novel design strategies that leverage adaptive and emotionally-responsive civic narratives that may sustain students' emotional engagement in stories, and in turn, promote perspective-taking toward members of political out-groups. Drawing on theories from political psychology and narratology, we investigate how affective computing techniques can support three storytelling mechanisms: transportation into a story world, identification with characters, and interaction with the storyteller. Using a design-based research (DBR) approach, we iteratively developed and refined an AI-mediated Digital Civic Storytelling (AI-DCS) platform. Our prototype integrates facial emotion recognition and attention tracking to assess users' affective and attentional states in real time. Narrative content is organized around pre-structured story outlines, with beat-by-beat language adaptation implemented via GPT-4, personalizing linguistic tone to sustain students' emotional engagement in stories that center political perspectives different from their own. Our work offers a foundation for AI-supported, emotionally-sensitive strategies that address affective polarization while preserving learner autonomy. We conclude with implications for civic education interventions, algorithmic literacy, and HCI challenges associated with AI dialogue management and affect-adaptive learning environments.

[78] arXiv:2507.00162 [pdf, html, other]
Title: FreeLong++: Training-Free Long Video Generation via Multi-band SpectralFusion
Yu Lu, Yi Yang
Comments: under review
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recent advances in video generation models have enabled high-quality short video generation from text prompts. However, extending these models to longer videos remains a significant challenge, primarily due to degraded temporal consistency and visual fidelity. Our preliminary observations show that naively applying short-video generation models to longer sequences leads to noticeable quality degradation. Further analysis identifies a systematic trend where high-frequency components become increasingly distorted as video length grows, an issue we term high-frequency distortion. To address this, we propose FreeLong, a training-free framework designed to balance the frequency distribution of long video features during the denoising process. FreeLong achieves this by blending global low-frequency features, which capture holistic semantics across the full video, with local high-frequency features extracted from short temporal windows to preserve fine details. Building on this, FreeLong++ extends FreeLong dual-branch design into a multi-branch architecture with multiple attention branches, each operating at a distinct temporal scale. By arranging multiple window sizes from global to local, FreeLong++ enables multi-band frequency fusion from low to high frequencies, ensuring both semantic continuity and fine-grained motion dynamics across longer video sequences. Without any additional training, FreeLong++ can be plugged into existing video generation models (e.g. Wan2.1 and LTX-Video) to produce longer videos with substantially improved temporal consistency and visual fidelity. We demonstrate that our approach outperforms previous methods on longer video generation tasks (e.g. 4x and 8x of native length). It also supports coherent multi-prompt video generation with smooth scene transitions and enables controllable video generation using long depth or pose sequences.

[79] arXiv:2507.00163 [pdf, html, other]
Title: Prompting as Scientific Inquiry
Ari Holtzman, Chenhao Tan
Subjects: Computation and Language (cs.CL)

Prompting is the primary method by which we study and control large language models. It is also one of the most powerful: nearly every major capability attributed to LLMs-few-shot learning, chain-of-thought, constitutional AI-was first unlocked through prompting. Yet prompting is rarely treated as science and is frequently frowned upon as alchemy. We argue that this is a category error. If we treat LLMs as a new kind of complex and opaque organism that is trained rather than programmed, then prompting is not a workaround: it is behavioral science. Mechanistic interpretability peers into the neural substrate, prompting probes the model in its native interface: language. We contend that prompting is not inferior, but rather a key component in the science of LLMs.

[80] arXiv:2507.00166 [pdf, html, other]
Title: Novel Design of 3D Printed Tumbling Microrobots for in vivo Targeted Drug Delivery
Aaron C. Davis, Siting Zhang, Adalyn Meeks, Diya Sakhrani, Luis Carlos Sanjuan Acosta, D. Ethan Kelley, Emma Caldwell, Luis Solorio, Craig J. Goergen, David J. Cappelleri
Subjects: Robotics (cs.RO)

This paper presents innovative designs for 3D-printed tumbling microrobots, specifically engineered for targeted in vivo drug delivery applications. The microrobot designs, created using stereolithography 3D printing technologies, incorporate permanent micro-magnets to enable actuation via a rotating magnetic field actuator system. The experimental framework encompasses a series of locomotion characterization tests to evaluate microrobot performance under various conditions. Testing variables include variations in microrobot geometries, actuation frequencies, and environmental conditions, such as dry and wet environments, and temperature changes. The paper outlines designs for three drug loading methods, along with comprehensive assessments thermal drug release using a focused ultrasound system, as well as biocompatibility tests. Animal model testing involves tissue phantoms and in vivo rat models, ensuring a thorough evaluation of the microrobots' performance and compatibility. The results highlight the robustness and adaptability of the proposed microrobot designs, showcasing the potential for efficient and targeted in vivo drug delivery. This novel approach addresses current limitations in existing tumbling microrobot designs and paves the way for advancements in targeted drug delivery within the large intestine.

[81] arXiv:2507.00170 [pdf, html, other]
Title: SelvaBox: A high-resolution dataset for tropical tree crown detection
Hugo Baudchon, Arthur Ouaknine, Martin Weiss, Mélisande Teng, Thomas R. Walla, Antoine Caron-Guay, Christopher Pal, Etienne Laliberté
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Detecting individual tree crowns in tropical forests is essential to study these complex and crucial ecosystems impacted by human interventions and climate change. However, tropical crowns vary widely in size, structure, and pattern and are largely overlapping and intertwined, requiring advanced remote sensing methods applied to high-resolution imagery. Despite growing interest in tropical tree crown detection, annotated datasets remain scarce, hindering robust model development. We introduce SelvaBox, the largest open-access dataset for tropical tree crown detection in high-resolution drone imagery. It spans three countries and contains more than 83,000 manually labeled crowns - an order of magnitude larger than all previous tropical forest datasets combined. Extensive benchmarks on SelvaBox reveal two key findings: (1) higher-resolution inputs consistently boost detection accuracy; and (2) models trained exclusively on SelvaBox achieve competitive zero-shot detection performance on unseen tropical tree crown datasets, matching or exceeding competing methods. Furthermore, jointly training on SelvaBox and three other datasets at resolutions from 3 to 10 cm per pixel within a unified multi-resolution pipeline yields a detector ranking first or second across all evaluated datasets. Our dataset, code, and pre-trained weights are made public.

[82] arXiv:2507.00172 [pdf, html, other]
Title: Intellectual Property Rights and Entrepreneurship in the NFT Ecosystem: Legal Frameworks, Business Models, and Innovation Opportunities
Pranav Darshan, Rohan J S, Raghuveer Rajesh, Ruchitha M, Sanika Kamath, Manas M N
Comments: 11 pages
Subjects: Computers and Society (cs.CY); Emerging Technologies (cs.ET)

Non Fungible Tokens have changed digital ownership and how creators earn money. Between 2021 and 2024, the market value exceeded 40 billion. However, the fast growth of the NFT ecosystem has revealed serious issues in managing intellectual property rights. There is a lot of confusion about the difference between owning an NFT and owning the copyright for the underlying content. This research looks at the gap between traditional copyright laws and blockchain-based transactions. We use a mixed methods approach to analyze this disconnect. We create a new IP rights matrix that clearly shows how copyright law relates to NFT ownership structures. Additionally, we include a business model taxonomy that sorts new commercial applications by their IP risk and sustainability factors. By examining important legal cases, smart contracts, and interviews with stakeholders, we find key problems in enforcing laws across different regions, standardizing licenses, and assessing business opportunities.

[83] arXiv:2507.00180 [pdf, html, other]
Title: BlackBoxToBlueprint: Extracting Interpretable Logic from Legacy Systems using Reinforcement Learning and Counterfactual Analysis
Vidhi Rathore
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Modernizing legacy software systems is a critical but challenging task, often hampered by a lack of documentation and understanding of the original system's intricate decision logic. Traditional approaches like behavioral cloning merely replicate input-output behavior without capturing the underlying intent. This paper proposes a novel pipeline to automatically extract interpretable decision logic from legacy systems treated as black boxes. The approach uses a Reinforcement Learning (RL) agent to explore the input space and identify critical decision boundaries by rewarding actions that cause meaningful changes in the system's output. These counterfactual state transitions, where the output changes, are collected and clustered using K-Means. Decision trees are then trained on these clusters to extract human-readable rules that approximate the system's decision logic near the identified boundaries. I demonstrated the pipeline's effectiveness on three dummy legacy systems with varying complexity, including threshold-based, combined-conditional, and non-linear range logic. Results show that the RL agent successfully focuses exploration on relevant boundary regions, and the extracted rules accurately reflect the core logic of the underlying dummy systems, providing a promising foundation for generating specifications and test cases during legacy migration.

[84] arXiv:2507.00181 [pdf, other]
Title: ChatGPT produces more "lazy" thinkers: Evidence of cognitive engagement decline
Georgios P. Georgiou
Subjects: Artificial Intelligence (cs.AI)

Despite the increasing use of large language models (LLMs) in education, concerns have emerged about their potential to reduce deep thinking and active learning. This study investigates the impact of generative artificial intelligence (AI) tools, specifically ChatGPT, on the cognitive engagement of students during academic writing tasks. The study employed an experimental design with participants randomly assigned to either an AI-assisted (ChatGPT) or a non-assisted (control) condition. Participants completed a structured argumentative writing task followed by a cognitive engagement scale (CES), the CES-AI, developed to assess mental effort, attention, deep processing, and strategic thinking. The results revealed significantly lower cognitive engagement scores in the ChatGPT group compared to the control group. These findings suggest that AI assistance may lead to cognitive offloading. The study contributes to the growing body of literature on the psychological implications of AI in education and raises important questions about the integration of such tools into academic practice. It calls for pedagogical strategies that promote active, reflective engagement with AI-generated content to avoid compromising self-regulated learning and deep cognitive involvement of students.

[85] arXiv:2507.00182 [pdf, html, other]
Title: Graph-Based Deep Learning for Component Segmentation of Maize Plants
J. I. Ruíz, A. Méndez, E. Rodríguez
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In precision agriculture, one of the most important tasks when exploring crop production is identifying individual plant components. There are several attempts to accomplish this task by the use of traditional 2D imaging, 3D reconstructions, and Convolutional Neural Networks (CNN). However, they have several drawbacks when processing 3D data and identifying individual plant components. Therefore, in this work, we propose a novel Deep Learning architecture to detect components of individual plants on Light Detection and Ranging (LiDAR) 3D Point Cloud (PC) data sets. This architecture is based on the concept of Graph Neural Networks (GNN), and feature enhancing with Principal Component Analysis (PCA). For this, each point is taken as a vertex and by the use of a K-Nearest Neighbors (KNN) layer, the edges are established, thus representing the 3D PC data set. Subsequently, Edge-Conv layers are used to further increase the features of each point. Finally, Graph Attention Networks (GAT) are applied to classify visible phenotypic components of the plant, such as the leaf, stem, and soil. This study demonstrates that our graph-based deep learning approach enhances segmentation accuracy for identifying individual plant components, achieving percentages above 80% in the IoU average, thus outperforming other existing models based on point clouds.

[86] arXiv:2507.00184 [pdf, html, other]
Title: Text-to-Level Diffusion Models With Various Text Encoders for Super Mario Bros
Jacob Schrum, Olivia Kilday, Emilio Salas, Bess Hagan, Reid Williams
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Recent research shows how diffusion models can unconditionally generate tile-based game levels, but use of diffusion models for text-to-level generation is underexplored. There are practical considerations for creating a usable model: caption/level pairs are needed, as is a text embedding model, and a way of generating entire playable levels, rather than individual scenes. We present strategies to automatically assign descriptive captions to an existing level dataset, and train diffusion models using both pretrained text encoders and simple transformer models trained from scratch. Captions are automatically assigned to generated levels so that the degree of overlap between input and output captions can be compared. We also assess the diversity and playability of the resulting levels. Results are compared with an unconditional diffusion model and a generative adversarial network, as well as the text-to-level approaches Five-Dollar Model and MarioGPT. Notably, the best diffusion model uses a simple transformer model for text embedding, and takes less time to train than diffusion models employing more complex text encoders, indicating that reliance on larger language models is not necessary. We also present a GUI allowing designers to construct long levels from model-generated scenes.

[87] arXiv:2507.00188 [pdf, html, other]
Title: LIMAO: A Framework for Lifelong Modular Learned Query Optimization
Qihan Zhang, Shaolin Xie, Ibrahim Sabek
Comments: To appear at VLDB 2025 (this https URL)
Subjects: Databases (cs.DB)

Query optimizers are crucial for the performance of database systems. Recently, many learned query optimizers (LQOs) have demonstrated significant performance improvements over traditional optimizers. However, most of them operate under a limited assumption: a static query environment. This limitation prevents them from effectively handling complex, dynamic query environments in real-world scenarios. Extensive retraining can lead to the well-known catastrophic forgetting problem, which reduces the LQO generalizability over time. In this paper, we address this limitation and introduce LIMAO (Lifelong Modular Learned Query Optimizer), a framework for lifelong learning of plan cost prediction that can be seamlessly integrated into existing LQOs. LIMAO leverages a modular lifelong learning technique, an attention-based neural network composition architecture, and an efficient training paradigm designed to retain prior knowledge while continuously adapting to new environments. We implement LIMAO in two LQOs, showing that our approach is agnostic to underlying engines. Experimental results show that LIMAO significantly enhances the performance of LQOs, achieving up to a 40% improvement in query execution time and reducing the variance of execution time by up to 60% under dynamic workloads. By leveraging a precise and self-consistent design, LIMAO effectively mitigates catastrophic forgetting, ensuring stable and reliable plan quality over time. Compared to Postgres, LIMAO achieves up to a 4x speedup on selected benchmarks, highlighting its practical advantages in real-world query optimization.

[88] arXiv:2507.00189 [pdf, html, other]
Title: Plug. Play. Persist. Inside a Ready-to-Go Havoc C2 Infrastructure
Alessio Di Santo
Subjects: Cryptography and Security (cs.CR); Operating Systems (cs.OS)

This analysis focuses on a single Azure-hosted Virtual Machine at this http URL that the adversary converted into an all-in-one delivery, staging and Command-and-Control node. The host advertises an out-of-date Apache 2.4.52 instance whose open directory exposes phishing lures, PowerShell loaders, Reflective Shell-Code, compiled Havoc Demon implants and a toolbox of lateral-movement binaries; the same server also answers on 8443/80 for encrypted beacon traffic. The web tier is riddled with publicly documented critical vulnerabilities, that would have allowed initial code-execution had the attackers not already owned the device.
Initial access is delivered through an HTML file that, once de-obfuscated, perfectly mimics Google Unusual sign-in attempt notification and funnels victims toward credential collection. A PowerShell command follows: it disables AMSI in-memory, downloads a Base64-encoded stub, allocates RWX pages and starts the shell-code without ever touching disk. That stub reconstructs a DLL in memory using the Reflective-Loader technique and hands control to Havoc Demon implant. Every Demon variant-32- and 64-bit alike-talks to the same backend, resolves Windows APIs with hashed look-ups, and hides its activity behind indirect syscalls.
Runtime telemetry shows interests in registry under Image File Execution Options, deliberate queries to Software Restriction Policy keys, and heavy use of Crypto DLLs to protect payloads and C2 traffic. The attacker toolkit further contains Chisel, PsExec, Doppelganger and Whisker, some of them re-compiled under user directories that leak the developer personas tonzking123 and thobt. Collectively the findings paint a picture of a technically adept actor who values rapid re-tooling over deep operational security, leaning on Havoc modularity and on legitimate cloud services to blend malicious flows into ordinary enterprise traffic.

[89] arXiv:2507.00190 [pdf, html, other]
Title: Rethink 3D Object Detection from Physical World
Satoshi Tanaka, Koji Minoda, Fumiya Watanabe, Takamasa Horibe
Comments: 15 pages, 10 figures
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

High-accuracy and low-latency 3D object detection is essential for autonomous driving systems. While previous studies on 3D object detection often evaluate performance based on mean average precision (mAP) and latency, they typically fail to address the trade-off between speed and accuracy, such as 60.0 mAP at 100 ms vs 61.0 mAP at 500 ms. A quantitative assessment of the trade-offs between different hardware devices and accelerators remains unexplored, despite being critical for real-time applications. Furthermore, they overlook the impact on collision avoidance in motion planning, for example, 60.0 mAP leading to safer motion planning or 61.0 mAP leading to high-risk motion planning. In this paper, we introduce latency-aware AP (L-AP) and planning-aware AP (P-AP) as new metrics, which consider the physical world such as the concept of time and physical constraints, offering a more comprehensive evaluation for real-time 3D object detection. We demonstrate the effectiveness of our metrics for the entire autonomous driving system using nuPlan dataset, and evaluate 3D object detection models accounting for hardware differences and accelerators. We also develop a state-of-the-art performance model for real-time 3D object detection through latency-aware hyperparameter optimization (L-HPO) using our metrics. Additionally, we quantitatively demonstrate that the assumption "the more point clouds, the better the recognition performance" is incorrect for real-time applications and optimize both hardware and model selection using our metrics.

[90] arXiv:2507.00191 [pdf, html, other]
Title: Beyond Sensor Data: Foundation Models of Behavioral Data from Wearables Improve Health Predictions
Eray Erturk, Fahad Kamran, Salar Abbaspourazad, Sean Jewell, Harsh Sharma, Yujie Li, Sinead Williamson, Nicholas J Foti, Joseph Futoma
Comments: Accepted to ICML 2025
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Wearable devices record physiological and behavioral signals that can improve health predictions. While foundation models are increasingly used for such predictions, they have been primarily applied to low-level sensor data, despite behavioral data often being more informative due to their alignment with physiologically relevant timescales and quantities. We develop foundation models of such behavioral signals using over 2.5B hours of wearable data from 162K individuals, systematically optimizing architectures and tokenization strategies for this unique dataset. Evaluated on 57 health-related tasks, our model shows strong performance across diverse real-world applications including individual-level classification and time-varying health state prediction. The model excels in behavior-driven tasks like sleep prediction, and improves further when combined with representations of raw sensor data. These results underscore the importance of tailoring foundation model design to wearables and demonstrate the potential to enable new health applications.

[91] arXiv:2507.00193 [pdf, html, other]
Title: An energy-stable parametric finite element method for Willmore flow with normal-tangential velocity splitting
Harald Garcke, Robert Nürnberg, Quan Zhao
Subjects: Numerical Analysis (math.NA)

We propose and analyze an energy-stable fully discrete parametric approximation for Willmore flow of hypersurfaces in two and three space dimensions. We allow for the presence of spontaneous curvature effects and for open surfaces with boundary. The presented scheme is based on a new geometric partial differential equation (PDE) that combines an evolution equation for the mean curvature with a separate equation that prescribes the tangential velocity. The mean curvature is used to determine the normal velocity within the gradient flow structure, thus guaranteeing an unconditional energy stability for the discrete solution upon suitable discretization. We introduce a novel weak formulation for this geometric PDE, in which different types of boundary conditions can be naturally enforced. We further discretize the weak formulation to obtain a fully discrete parametric finite element method, for which well-posedness can be rigorously shown. Moreover, the constructed scheme admits an unconditional stability estimate in terms of the discrete energy. Extensive numerical experiments are reported to showcase the accuracy and robustness of the proposed method for computing Willmore flow of both curves in $\mathbb{R}^2$ and surfaces in $\mathbb{R}^3$.

[92] arXiv:2507.00195 [pdf, other]
Title: What Makes Local Updates Effective: The Role of Data Heterogeneity and Smoothness
Kumar Kshitij Patel
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Optimization and Control (math.OC); Machine Learning (stat.ML)

This thesis contributes to the theoretical understanding of local update algorithms, especially Local SGD, in distributed and federated optimization under realistic models of data heterogeneity. A central focus is on the bounded second-order heterogeneity assumption, which is shown to be both necessary and sufficient for local updates to outperform centralized or mini-batch methods in convex and non-convex settings. The thesis establishes tight upper and lower bounds in several regimes for various local update algorithms and characterizes the min-max complexity of multiple problem classes. At its core is a fine-grained consensus-error-based analysis framework that yields sharper finite-time convergence bounds under third-order smoothness and relaxed heterogeneity assumptions. The thesis also extends to online federated learning, providing fundamental regret bounds under both first-order and bandit feedback. Together, these results clarify when and why local updates offer provable advantages, and the thesis serves as a self-contained guide for analyzing Local SGD in heterogeneous environments.

[93] arXiv:2507.00196 [pdf, html, other]
Title: A Simple Algorithm for Trimmed Multipoint Evaluation
Nick Fischer, Melvin Kallmayer, Leo Wennmann
Comments: To appear at ESA 25
Subjects: Data Structures and Algorithms (cs.DS)

Evaluating a polynomial on a set of points is a fundamental task in computer algebra. In this work, we revisit a particular variant called trimmed multipoint evaluation: given an $n$-variate polynomial with bounded individual degree $d$ and total degree $D$, the goal is to evaluate it on a natural class of input points. This problem arises as a key subroutine in recent algorithmic results [Dinur; SODA '21], [Dell, Haak, Kallmayer, Wennmann; SODA '25]. It is known that trimmed multipoint evaluation can be solved in near-linear time [van der Hoeven, Schost; AAECC '13] by a clever yet somewhat involved algorithm. We give a simple recursive algorithm that avoids heavy computer-algebraic machinery, and can be readily understood by researchers without specialized background.

[94] arXiv:2507.00198 [pdf, html, other]
Title: Exploring AR Label Placements in Visually Cluttered Scenarios
Ji Hwan Park, Braden Roper, Amirhossein Arezoumand, Tien Tran
Subjects: Human-Computer Interaction (cs.HC)

We investigate methods for placing labels in AR environments that have visually cluttered scenes. As the number of items increases in a scene within the user' FOV, it is challenging to effectively place labels based on existing label placement guidelines. To address this issue, we implemented three label placement techniques for in-view objects for AR applications. We specifically target a scenario, where various items of different types are scattered within the user's field of view, and multiple items of the same type are situated close together. We evaluate three placement techniques for three target tasks. Our study shows that using a label to spatially group the same types of items is beneficial for identifying, comparing, and summarizing data.

[95] arXiv:2507.00202 [pdf, html, other]
Title: Examining the Social Communication and Community Engagement of Autistic Adults through an Asynchronous Focus Group
Blade Frisch, Betts Peters, Keith Vertanen
Subjects: Human-Computer Interaction (cs.HC)

Purpose: Little research has explored the communication needs of autistic adults and how their needs differ from those of other disabled populations. Augmentative and Alternative Communication (AAC) can support these communication needs, but more guidance is needed on how to design AAC to support this population.
Materials and Methods: We conducted an online, asynchronous, text-based focus group with five autistic adults to explore their social communication and community engagement and how AAC can help support them.
Results and Conclusion: Our analysis of the participant responses found that 1) participants' emotional experiences impacted the communication methods they used, 2) speaking autistic adults can benefit from AAC use, and 3) autistic shutdown creates dynamic communication needs. We present implications for future AAC design: supporting communication in times of shutdown, indicating communication ability to communication partners, and a need to better understand the fear of using AAC. These implications can inform the design for future AAC systems. We also provide themes for future autism research: exploring the impact of a late diagnosis, gaining a better understanding of the communication needs during autistic shutdown, and expanding research to include the social and environmental factors that impact communication. Finally, we provide guidance on how future online focus groups can be run in an accessible manner.

[96] arXiv:2507.00205 [pdf, html, other]
Title: Holistic Artificial Intelligence in Medicine; improved performance and explainability
Periklis Petridis, Georgios Margaritis, Vasiliki Stoumpou, Dimitris Bertsimas
Comments: Submitted to npj Digital Medicine
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

With the increasing interest in deploying Artificial Intelligence in medicine, we previously introduced HAIM (Holistic AI in Medicine), a framework that fuses multimodal data to solve downstream clinical tasks. However, HAIM uses data in a task-agnostic manner and lacks explainability. To address these limitations, we introduce xHAIM (Explainable HAIM), a novel framework leveraging Generative AI to enhance both prediction and explainability through four structured steps: (1) automatically identifying task-relevant patient data across modalities, (2) generating comprehensive patient summaries, (3) using these summaries for improved predictive modeling, and (4) providing clinical explanations by linking predictions to patient-specific medical knowledge. Evaluated on the HAIM-MIMIC-MM dataset, xHAIM improves average AUC from 79.9% to 90.3% across chest pathology and operative tasks. Importantly, xHAIM transforms AI from a black-box predictor into an explainable decision support system, enabling clinicians to interactively trace predictions back to relevant patient data, bridging AI advancements with clinical utility.

[97] arXiv:2507.00210 [pdf, html, other]
Title: LineRetriever: Planning-Aware Observation Reduction for Web Agents
Imene Kerboua, Sahar Omidi Shayegan, Megh Thakkar, Xing Han Lù, Massimo Caccia, Véronique Eglin, Alexandre Aussem, Jérémy Espinas, Alexandre Lacoste
Subjects: Computation and Language (cs.CL)

While large language models have demonstrated impressive capabilities in web navigation tasks, the extensive context of web pages, often represented as DOM or Accessibility Tree (AxTree) structures, frequently exceeds model context limits. Current approaches like bottom-up truncation or embedding-based retrieval lose critical information about page state and action history. This is particularly problematic for adaptive planning in web agents, where understanding the current state is essential for determining future actions. We hypothesize that embedding models lack sufficient capacity to capture plan-relevant information, especially when retrieving content that supports future action prediction. This raises a fundamental question: how can retrieval methods be optimized for adaptive planning in web navigation tasks? In response, we introduce \textit{LineRetriever}, a novel approach that leverages a language model to identify and retrieve observation lines most relevant to future navigation steps. Unlike traditional retrieval methods that focus solely on semantic similarity, \textit{LineRetriever} explicitly considers the planning horizon, prioritizing elements that contribute to action prediction. Our experiments demonstrate that \textit{LineRetriever} can reduce the size of the observation at each step for the web agent while maintaining consistent performance within the context limitations.

[98] arXiv:2507.00214 [pdf, html, other]
Title: Two-Stage Reasoning-Infused Learning: Improving Classification with LLM-Generated Reasoning
Mads Henrichsen, Rasmus Krebs
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Standard classification models often map inputs directly to labels without explicit reasoning, potentially limiting their performance, robustness, and interpretability. This paper introduces a novel two-stage approach to enhance text classification by leveraging Large Language Model (LLM)-generated reasonings. In the first stage, we fine-tune a Llama-3.2-1B-Instruct model (henceforth Llama-R-Gen) on a general-purpose reasoning dataset (syvai/reasoning-gen) to generate textual reasoning (R) given a question and its answer. In the second stage, this generally trained Llama-R-Gen is used offline to create an augmented training dataset for a downstream generative model. This downstream model, based on Llama-3.2-1B-Instruct, takes only the input text (Q) and is trained to output the generated reasoning (R) immediately followed by the predicted emotion (A). We demonstrate this methodology on the dair-ai/emotion dataset for emotion classification. Our experiments show that the generative model trained to output reasoning and the emotion (Classifier Q->RA) achieves a significant improvement of 8.7 percentage points in accuracy (for emotion prediction) compared to a baseline generative model trained solely to output the emotion (Classifier Q->A), highlighting the strong generalization capabilities of the reasoning generation and the benefit of explicit reasoning training. This work underscores the potential of LLM-generated reasonings for creating richer training datasets, thereby improving the performance of diverse downstream NLP tasks and providing explicit explanations.

[99] arXiv:2507.00216 [pdf, html, other]
Title: Towards Style Alignment in Cross-Cultural Translation
Shreya Havaldar, Adam Stein, Eric Wong, Lyle Ungar
Comments: Accepted to ACL 2025
Subjects: Computation and Language (cs.CL)

Successful communication depends on the speaker's intended style (i.e., what the speaker is trying to convey) aligning with the listener's interpreted style (i.e., what the listener perceives). However, cultural differences often lead to misalignment between the two; for example, politeness is often lost in translation. We characterize the ways that LLMs fail to translate style - biasing translations towards neutrality and performing worse in non-Western languages. We mitigate these failures with RASTA (Retrieval-Augmented STylistic Alignment), a method that leverages learned stylistic concepts to encourage LLM translation to appropriately convey cultural communication norms and align style.

[100] arXiv:2507.00217 [pdf, html, other]
Title: CrossPipe: Towards Optimal Pipeline Schedules for Cross-Datacenter Training
Tiancheng Chen, Ales Kubicek, Langwen Huang, Torsten Hoefler
Comments: USENIX ATC '25
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Training large language models (LLMs) now requires resources that exceed a single datacenter, making cross-datacenter strategies increasingly crucial. We present CrossPipe, a framework designed to optimize model training across geographically distributed datacenters by explicitly modeling and mitigating the impact of network latency and limited bandwidth. It enables unified analysis and optimization incorporating both pipeline parallelism (PP) and opportunities for overlapping data parallelism (DP) communication. CrossPipe generates optimized pipeline schedules using either solver-based optimal or fast near-optimal greedy algorithms, built upon a flexible execution engine that separates scheduling logic from communication details. Our evaluation shows that CrossPipe reduces training time by up to 33.6\% compared to traditional pipeline schedules under identical memory constraints. When memory constraints are relaxed, CrossPipe maintains strong performance despite communication delays, approaching the efficiency of idealized schedules without delays. CrossPipe offers improved scalability and resource utilization, particularly in environments with high network latency or limited bandwidth.

[101] arXiv:2507.00218 [pdf, other]
Title: Learning for routing: A guided review of recent developments and future directions
Fangting Zhou, Attila Lischka, Balazs Kulcsar, Jiaming Wu, Morteza Haghir Chehreghani, Gilbert Laporte
Comments: Accepted for publication in Transportation Research Part E: Logistics and Transportation Review
Subjects: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)

This paper reviews the current progress in applying machine learning (ML) tools to solve NP-hard combinatorial optimization problems, with a focus on routing problems such as the traveling salesman problem (TSP) and the vehicle routing problem (VRP). Due to the inherent complexity of these problems, exact algorithms often require excessive computational time to find optimal solutions, while heuristics can only provide approximate solutions without guaranteeing optimality. With the recent success of machine learning models, there is a growing trend in proposing and implementing diverse ML techniques to enhance the resolution of these challenging routing problems. We propose a taxonomy categorizing ML-based routing methods into construction-based and improvement-based approaches, highlighting their applicability to various problem characteristics. This review aims to integrate traditional OR methods with state-of-the-art ML techniques, providing a structured framework to guide future research and address emerging VRP variants.

[102] arXiv:2507.00219 [pdf, html, other]
Title: Error Etimates for Non Conforming Discretisation of Time-dependent Convection-Diffusion-Reaction Model
Hasan Alzubaidi, Yahya Alnashri
Subjects: Numerical Analysis (math.NA)

We use a generic framework, namely the gradient discretisation method (GDM), to propose a unified numerical analysis for general time-dependent convection-diffusion-reaction models. We establish novel results for convergence rates of numerical approximations of such models under reasonable assumptions on exact solutions, and prove the existence and uniqueness of the approximate solution for suitably small time steps. The main interest of our results lies in covering several approximation methods and various applications of the considered model such as the generalised Burgers-Fisher (GBF) and the generalised Burgers-Huxley (GBH) models. Numerical tests based on the hybrid mimetic mixed (HMM) method for the GBF model are performed on various types of general meshes to examine the accuracy of the proposed gradient scheme. The results confirm our theoretical rates of convergence, even on mesh with extreme distortions.

[103] arXiv:2507.00224 [pdf, html, other]
Title: Computer Vision for Objects used in Group Work: Challenges and Opportunities
Changsoo Jung, Sheikh Mannan, Jack Fitzgerald, Nathaniel Blanchard
Comments: Accepted to AIED 2025 Late Breaking Results Track
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Interactive and spatially aware technologies are transforming educational frameworks, particularly in K-12 settings where hands-on exploration fosters deeper conceptual understanding. However, during collaborative tasks, existing systems often lack the ability to accurately capture real-world interactions between students and physical objects. This issue could be addressed with automatic 6D pose estimation, i.e., estimation of an object's position and orientation in 3D space from RGB images or videos. For collaborative groups that interact with physical objects, 6D pose estimates allow AI systems to relate objects and entities. As part of this work, we introduce FiboSB, a novel and challenging 6D pose video dataset featuring groups of three participants solving an interactive task featuring small hand-held cubes and a weight scale. This setup poses unique challenges for 6D pose because groups are holistically recorded from a distance in order to capture all participants -- this, coupled with the small size of the cubes, makes 6D pose estimation inherently non-trivial. We evaluated four state-of-the-art 6D pose estimation methods on FiboSB, exposing the limitations of current algorithms on collaborative group work. An error analysis of these methods reveals that the 6D pose methods' object detection modules fail. We address this by fine-tuning YOLO11-x for FiboSB, achieving an overall mAP_50 of 0.898. The dataset, benchmark results, and analysis of YOLO11-x errors presented here lay the groundwork for leveraging the estimation of 6D poses in difficult collaborative contexts.

[104] arXiv:2507.00229 [pdf, html, other]
Title: A High-Fidelity Speech Super Resolution Network using a Complex Global Attention Module with Spectro-Temporal Loss
Tarikul Islam Tamiti, Biraj Joshi, Rida Hasan, Rashedul Hasan, Taieba Athay, Nursad Mamun, Anomadarshi Barua
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Speech super-resolution (SSR) enhances low-resolution speech by increasing the sampling rate. While most SSR methods focus on magnitude reconstruction, recent research highlights the importance of phase reconstruction for improved perceptual quality. Therefore, we introduce CTFT-Net, a Complex Time-Frequency Transformation Network that reconstructs both magnitude and phase in complex domains for improved SSR tasks. It incorporates a complex global attention block to model inter-phoneme and inter-frequency dependencies and a complex conformer to capture long-range and local features, improving frequency reconstruction and noise robustness. CTFT-Net employs time-domain and multi-resolution frequency-domain loss functions for better generalization. Experiments show CTFT-Net outperforms state-of-the-art models (NU-Wave, WSRGlow, NVSR, AERO) on the VCTK dataset, particularly for extreme upsampling (2 kHz to 48 kHz), reconstructing high frequencies effectively without noisy artifacts.

[105] arXiv:2507.00230 [pdf, html, other]
Title: PPFL-RDSN: Privacy-Preserving Federated Learning-based Residual Dense Spatial Networks for Encrypted Lossy Image Reconstruction
Peilin He, James Joshi
Comments: This paper is under review; do not distribute
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Reconstructing high-quality images from low-resolution inputs using Residual Dense Spatial Networks (RDSNs) is crucial yet challenging, particularly in collaborative scenarios where centralized training poses significant privacy risks, including data leakage and inference attacks, as well as high computational costs. We propose a novel Privacy-Preserving Federated Learning-based RDSN (PPFL-RDSN) framework specifically tailored for lossy image reconstruction. PPFL-RDSN integrates Federated Learning (FL), local differential privacy, and robust model watermarking techniques, ensuring data remains secure on local devices, safeguarding sensitive information, and maintaining model authenticity without revealing underlying data. Empirical evaluations show that PPFL-RDSN achieves comparable performance to the state-of-the-art centralized methods while reducing computational burdens, and effectively mitigates security and privacy vulnerabilities, making it a practical solution for secure and privacy-preserving collaborative computer vision applications.

[106] arXiv:2507.00234 [pdf, html, other]
Title: Interpretable AI for Time-Series: Multi-Model Heatmap Fusion with Global Attention and NLP-Generated Explanations
Jiztom Kavalakkatt Francis, Matthew J Darr
Comments: 13 pages
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

In this paper, we present a novel framework for enhancing model interpretability by integrating heatmaps produced separately by ResNet and a restructured 2D Transformer with globally weighted input saliency. We address the critical problem of spatial-temporal misalignment in existing interpretability methods, where convolutional networks fail to capture global context and Transformers lack localized precision - a limitation that impedes actionable insights in safety-critical domains like healthcare and industrial monitoring. Our method merges gradient-weighted activation maps (ResNet) and Transformer attention rollout into a unified visualization, achieving full spatial-temporal alignment while preserving real-time performance. Empirical evaluations on clinical (ECG arrhythmia detection) and industrial (energy consumption prediction) datasets demonstrate significant improvements: the hybrid framework achieves 94.1% accuracy (F1 0.93) on the PhysioNet dataset and reduces regression error to RMSE = 0.28 kWh (R2 = 0.95) on the UCI Energy Appliance dataset-outperforming standalone ResNet, Transformer, and InceptionTime baselines by 3.8-12.4%. An NLP module translates fused heatmaps into domain-specific narratives (e.g., "Elevated ST-segment between 2-4 seconds suggests myocardial ischemia"), validated via BLEU-4 (0.586) and ROUGE-L (0.650) scores. By formalizing interpretability as causal fidelity and spatial-temporal alignment, our approach bridges the gap between technical outputs and stakeholder understanding, offering a scalable solution for transparent, time-aware decision-making.

[107] arXiv:2507.00235 [pdf, html, other]
Title: Minimum Selective Subset on Some Graph Classes
Bubai Manna
Comments: This work has been accepted in CCCG 2025
Subjects: Computational Geometry (cs.CG); Computational Complexity (cs.CC)

In a connected simple graph G = (V(G),E(G)), each vertex is assigned a color from the set of colors C={1, 2,..., c}. The set of vertices V(G) is partitioned as V_1, V_2, ... ,V_c, where all vertices in V_j share the same color j. A subset S of V(G) is called Selective Subset if, for every vertex v in V(G), and if v is in V_j, at least one of its nearest neighbors in (S union (V(G)\ V_j)) has the same color as v. The Minimum Selective Subset (MSS) problem seeks to find a selective subset of minimum size. The problem was first introduced by Wilfong in 1991 for a set of points in the Euclidean plane, where two major problems, MCS (Minimum Consistent Subset) and MSS, were proposed.
In graph algorithms, the only known result is that the MSS problem is NP-complete, as shown in 2018. Beyond this, no further progress has been made to date. In contrast, the MCS problem has been widely studied in various graph classes over the years. Therefore, in this work, we also extend the algorithmic study of MSS on various graph classes. We first present a log(n)-approximation algorithm for general graphs with n vertices and regardless of the number of colors. We also show that the problem remains NP-complete in planar graphs when restricted to just two colors.. Finally, we provide polynomial-time algorithms for computing optimal solutions in trees and unit interval graphs for any number of colors.

[108] arXiv:2507.00236 [pdf, html, other]
Title: Sim2Real Diffusion: Learning Cross-Domain Adaptive Representations for Transferable Autonomous Driving
Chinmay Vilas Samak, Tanmay Vilas Samak, Bing Li, Venkat Krovi
Subjects: Robotics (cs.RO)

Simulation-based design, optimization, and validation of autonomous driving algorithms have proven to be crucial for their iterative improvement over the years. Nevertheless, the ultimate measure of effectiveness is their successful transition from simulation to reality (sim2real). However, existing sim2real transfer methods struggle to comprehensively address the autonomy-oriented requirements of balancing: (i) conditioned domain adaptation, (ii) robust performance with limited examples, (iii) modularity in handling multiple domain representations, and (iv) real-time performance. To alleviate these pain points, we present a unified framework for learning cross-domain adaptive representations for sim2real transferable autonomous driving algorithms using conditional latent diffusion models. Our framework offers options to leverage: (i) alternate foundation models, (ii) a few-shot fine-tuning pipeline, and (iii) textual as well as image prompts for mapping across given source and target domains. It is also capable of generating diverse high-quality samples when diffusing across parameter spaces such as times of day, weather conditions, seasons, and operational design domains. We systematically analyze the presented framework and report our findings in the form of critical quantitative metrics and ablation studies, as well as insightful qualitative examples and remarks. Additionally, we demonstrate the serviceability of the proposed approach in bridging the sim2real gap for end-to-end autonomous driving using a behavioral cloning case study. Our experiments indicate that the proposed framework is capable of bridging the perceptual sim2real gap by over 40%. We hope that our approach underscores the potential of generative diffusion models in sim2real transfer, offering a pathway toward more robust and adaptive autonomous driving.

[109] arXiv:2507.00237 [pdf, html, other]
Title: Plan-Based Scalable Online Virtual Network Embedding
Oleg Kolosov, David Breitgand, Dean H. Lorenz, Gala Yadgar
Comments: Accepted to IEEE ICDCS 2025
Subjects: Networking and Internet Architecture (cs.NI)

Network virtualization allows hosting applications with diverse computation and communication requirements on shared edge infrastructure. Given a set of requests for deploying virtualized applications, the edge provider has to deploy a maximum number of them to the underlying physical network, subject to capacity constraints. This challenge is known as the virtual network embedding (VNE) problem: it models applications as virtual networks, where virtual nodes represent functions and virtual links represent communication between the virtual nodes.
All variants of VNE are known to be strongly NP-hard. Because of its centrality to network virtualization, VNE has been extensively studied. We focus on the online variant of VNE, in which deployment requests are not known in advance. This reflects the highly skewed and unpredictable demand intrinsic to the edge. Unfortunately, existing solutions to online VNE do not scale well with the number of requests per second and the physical topology size.
We propose a novel approach in which our new online algorithm, OLIVE, leverages a nearly optimal embedding for an aggregated expected demand. This embedding is computed offline. It serves as a plan that OLIVE uses as a guide for handling actual individual requests while dynamically compensating for deviations from the plan. We demonstrate that our solution can handle a number of requests per second greater by two orders of magnitude than the best results reported in the literature. Thus, it is particularly suitable for realistic edge environments.

[110] arXiv:2507.00239 [pdf, html, other]
Title: Linearly Decoding Refused Knowledge in Aligned Language Models
Aryan Shrivastava, Ari Holtzman
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Most commonly used language models (LMs) are instruction-tuned and aligned using a combination of fine-tuning and reinforcement learning, causing them to refuse users requests deemed harmful by the model. However, jailbreak prompts can often bypass these refusal mechanisms and elicit harmful responses. In this work, we study the extent to which information accessed via jailbreak prompts is decodable using linear probes trained on LM hidden states. We show that a great deal of initially refused information is linearly decodable. For example, across models, the response of a jailbroken LM for the average IQ of a country can be predicted by a linear probe with Pearson correlations exceeding $0.8$. Surprisingly, we find that probes trained on base models (which do not refuse) sometimes transfer to their instruction-tuned versions and are capable of revealing information that jailbreaks decode generatively, suggesting that the internal representations of many refused properties persist from base LMs through instruction-tuning. Importantly, we show that this information is not merely "leftover" in instruction-tuned models, but is actively used by them: we find that probe-predicted values correlate with LM generated pairwise comparisons, indicating that the information decoded by our probes align with suppressed generative behavior that may be expressed more subtly in other downstream tasks. Overall, our results suggest that instruction-tuning does not wholly eliminate or even relocate harmful information in representation space-they merely suppress its direct expression, leaving it both linearly accessible and indirectly influential in downstream behavior.

[111] arXiv:2507.00243 [pdf, html, other]
Title: VOCAL: Visual Odometry via ContrAstive Learning
Chi-Yao Huang, Zeel Bhatt, Yezhou Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Breakthroughs in visual odometry (VO) have fundamentally reshaped the landscape of robotics, enabling ultra-precise camera state estimation that is crucial for modern autonomous systems. Despite these advances, many learning-based VO techniques rely on rigid geometric assumptions, which often fall short in interpretability and lack a solid theoretical basis within fully data-driven frameworks. To overcome these limitations, we introduce VOCAL (Visual Odometry via ContrAstive Learning), a novel framework that reimagines VO as a label ranking challenge. By integrating Bayesian inference with a representation learning framework, VOCAL organizes visual features to mirror camera states. The ranking mechanism compels similar camera states to converge into consistent and spatially coherent representations within the latent space. This strategic alignment not only bolsters the interpretability of the learned features but also ensures compatibility with multimodal data sources. Extensive evaluations on the KITTI dataset highlight VOCAL's enhanced interpretability and flexibility, pushing VO toward more general and explainable spatial intelligence.

[112] arXiv:2507.00244 [pdf, other]
Title: The Algebraic Structure of Morphosyntax
Isabella Senturia, Matilde Marcolli
Comments: 45 pages, LaTeX, 2 png figures
Subjects: Computation and Language (cs.CL); Quantum Algebra (math.QA)

Within the context of the mathematical formulation of Merge and the Strong Minimalist Thesis, we present a mathematical model of the morphology-syntax interface. In this setting, morphology has compositional properties responsible for word formation, organized into a magma of morphological trees. However, unlike syntax, we do not have movement within morphology. A coproduct decomposition exists, but it requires extending the set of morphological trees beyond those which are generated solely by the magma, to a larger set of possible morphological inputs to syntactic trees. These participate in the formation of morphosyntactic trees as an algebra over an operad, and a correspondence between algebras over an operad. The process of structure formation for morphosyntactic trees can then be described in terms of this operadic correspondence that pairs syntactic and morphological data and the morphology coproduct. We reinterpret in this setting certain operations of Distributed Morphology as transformation that allow for flexibility in moving the boundary between syntax and morphology within the morphosyntactic objects.

[113] arXiv:2507.00246 [pdf, other]
Title: EfficientXLang: Towards Improving Token Efficiency Through Cross-Lingual Reasoning
Sanchit Ahuja, Praneetha Vaddamanu, Barun Patra
Comments: 15 pages, 5 figures, 9 tables
Subjects: Computation and Language (cs.CL)

Despite recent advances in Language Reasoning Models (LRMs), most research focuses solely on English, even though many models are pretrained on multilingual data. In this work, we investigate: Is English the most token-efficient language for reasoning? We evaluate three open-source RLMs: DeepSeek R1, Qwen 2.5 and Qwen 3, across four math datasets and seven typologically diverse languages. We find that reasoning in non-English languages not only reduces token usage, but also preserves accuracy. These gains persist even after translating the reasoning traces into English, suggesting genuine shifts in reasoning behavior rather than surface-level linguistic effects. The extent of improvement, however, depends on the models multilingual strength. Our findings motivate a broader view of reasoning in language models, highlighting the potential of multilingual reasoning and the importance of strong multilingual foundations. The code for our work can be found: this https URL.

[114] arXiv:2507.00248 [pdf, other]
Title: Developing Lightweight DNN Models With Limited Data For Real-Time Sign Language Recognition
Nikita Nikitin, Eugene Fomin
Comments: 7 pages, 2 figures, 2 tables, for associated mpeg file, see this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

We present a novel framework for real-time sign language recognition using lightweight DNNs trained on limited data. Our system addresses key challenges in sign language recognition, including data scarcity, high computational costs, and discrepancies in frame rates between training and inference environments. By encoding sign language specific parameters, such as handshape, palm orientation, movement, and location into vectorized inputs, and leveraging MediaPipe for landmark extraction, we achieve highly separable input data representations. Our DNN architecture, optimized for sub 10MB deployment, enables accurate classification of 343 signs with less than 10ms latency on edge devices. The data annotation platform 'slait data' facilitates structured labeling and vector extraction. Our model achieved 92% accuracy in isolated sign recognition and has been integrated into the 'slait ai' web application, where it demonstrates stable inference.

[115] arXiv:2507.00253 [pdf, html, other]
Title: GazeTarget360: Towards Gaze Target Estimation in 360-Degree for Robot Perception
Zhuangzhuang Dai, Vincent Gbouna Zakka, Luis J. Manso, Chen Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Enabling robots to understand human gaze target is a crucial step to allow capabilities in downstream tasks, for example, attention estimation and movement anticipation in real-world human-robot interactions. Prior works have addressed the in-frame target localization problem with data-driven approaches by carefully removing out-of-frame samples. Vision-based gaze estimation methods, such as OpenFace, do not effectively absorb background information in images and cannot predict gaze target in situations where subjects look away from the camera. In this work, we propose a system to address the problem of 360-degree gaze target estimation from an image in generalized visual scenes. The system, named GazeTarget360, integrates conditional inference engines of an eye-contact detector, a pre-trained vision encoder, and a multi-scale-fusion decoder. Cross validation results show that GazeTarget360 can produce accurate and reliable gaze target predictions in unseen scenarios. This makes a first-of-its-kind system to predict gaze targets from realistic camera footage which is highly efficient and deployable. Our source code is made publicly available at: this https URL.

[116] arXiv:2507.00257 [pdf, html, other]
Title: Gym4ReaL: A Suite for Benchmarking Real-World Reinforcement Learning
Davide Salaorni, Vincenzo De Paola, Samuele Delpero, Giovanni Dispoto, Paolo Bonetti, Alessio Russo, Giuseppe Calcagno, Francesco Trovò, Matteo Papini, Alberto Maria Metelli, Marco Mussi, Marcello Restelli
Comments: 9 pages
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

In recent years, \emph{Reinforcement Learning} (RL) has made remarkable progress, achieving superhuman performance in a wide range of simulated environments. As research moves toward deploying RL in real-world applications, the field faces a new set of challenges inherent to real-world settings, such as large state-action spaces, non-stationarity, and partial observability. Despite their importance, these challenges are often underexplored in current benchmarks, which tend to focus on idealized, fully observable, and stationary environments, often neglecting to incorporate real-world complexities explicitly. In this paper, we introduce \texttt{Gym4ReaL}, a comprehensive suite of realistic environments designed to support the development and evaluation of RL algorithms that can operate in real-world scenarios. The suite includes a diverse set of tasks that expose algorithms to a variety of practical challenges. Our experimental results show that, in these settings, standard RL algorithms confirm their competitiveness against rule-based benchmarks, motivating the development of new methods to fully exploit the potential of RL to tackle the complexities of real-world tasks.

[117] arXiv:2507.00258 [pdf, html, other]
Title: Impact of Fine-Tuning Methods on Memorization in Large Language Models
Jie Hou, Chuxiong Wu, Lannan Luo, Qiang Zeng
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

As the capabilities of pre-trained large language models (LLMs) continue to advance, the "pre-train and fine-tune" paradigm has become increasingly mainstream, leading to the development of various fine-tuning methods. However, the privacy risks arising from memorization during fine-tuning have received relatively little attention. To address this gap, we categorize popular fine-tuning approaches and assess their impact on memorization through the lens of membership inference attacks (MIAs). Our results show that, compared to parameter-based fine-tuning, prompt-based fine-tuning achieves competitive performance while exhibiting lower vulnerability to MIAs. Furthermore, prompt-based methods maintain low memorization regardless of model scale. These findings suggest that parameter-based fine-tuning is more prone to leaking private information, whereas prompt-based fine-tuning serves as a more privacy-preserving option.

[118] arXiv:2507.00259 [pdf, html, other]
Title: Who Should I Listen To? Adaptive Collaboration in Personalized Federated Learning
Amr Abourayya, Jens Kleesiek, Bharat Rao, Michael Kamp
Subjects: Machine Learning (cs.LG)

Data heterogeneity is a central challenge in federated learning, and personalized federated learning (PFL) aims to address it by tailoring models to each client's distribution. Yet many PFL methods fail to outperform local or centralized baselines, suggesting a mismatch between the collaboration they enforce and the structure of the data. We propose an approach based on adaptive collaboration, where clients decide adaptively not only how much to rely on others, but also whom to trust at the level of individual examples. We instantiate this principle in FEDMOSAIC, a federated co-training method in which clients exchange predictions over a shared unlabeled dataset. This enables fine-grained trust decisions that are difficult to achieve with parameter sharing alone. Each client adjusts its loss weighting based on the agreement between private and public data, and contributes to global pseudo-labels in proportion to its estimated per-example confidence. Empirically, FEDMOSAIC improves upon state-of-the-art PFL methods across diverse non-IID settings, and we provide convergence guarantees under standard assumptions. Our results demonstrate the potential of data-aware collaboration for robust and effective personalization.

[119] arXiv:2507.00261 [pdf, html, other]
Title: VirtualFencer: Generating Fencing Bouts based on Strategies Extracted from In-the-Wild Videos
Zhiyin Lin, Purvi Goel, Joy Yun, C. Karen Liu, Joao Pedro Araujo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Fencing is a sport where athletes engage in diverse yet strategically logical motions. While most motions fall into a few high-level actions (e.g. step, lunge, parry), the execution can vary widely-fast vs. slow, large vs. small, offensive vs. defensive. Moreover, a fencer's actions are informed by a strategy that often comes in response to the opponent's behavior. This combination of motion diversity with underlying two-player strategy motivates the application of data-driven modeling to fencing. We present VirtualFencer, a system capable of extracting 3D fencing motion and strategy from in-the-wild video without supervision, and then using that extracted knowledge to generate realistic fencing behavior. We demonstrate the versatile capabilities of our system by having it (i) fence against itself (self-play), (ii) fence against a real fencer's motion from online video, and (iii) fence interactively against a professional fencer.

[120] arXiv:2507.00263 [pdf, html, other]
Title: Room Scene Discovery and Grouping in Unstructured Vacation Rental Image Collections
Vignesh Ram Nithin Kappagantula, Shayan Hassantabar
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

The rapid growth of vacation rental (VR) platforms has led to an increasing volume of property images, often uploaded without structured categorization. This lack of organization poses significant challenges for travelers attempting to understand the spatial layout of a property, particularly when multiple rooms of the same type are present. To address this issue, we introduce an effective approach for solving the room scene discovery and grouping problem, as well as identifying bed types within each bedroom group. This grouping is valuable for travelers to comprehend the spatial organization, layout, and the sleeping configuration of the property. We propose a computationally efficient machine learning pipeline characterized by low latency and the ability to perform effectively with sample-efficient learning, making it well-suited for real-time and data-scarce environments. The pipeline integrates a supervised room-type detection model, a supervised overlap detection model to identify the overlap similarity between two images, and a clustering algorithm to group the images of the same space together using the similarity scores. Additionally, the pipeline maps each bedroom group to the corresponding bed types specified in the property's metadata, based on the visual content present in the group's images using a Multi-modal Large Language Model (MLLM) model. We evaluate the aforementioned models individually and also assess the pipeline in its entirety, observing strong performance that significantly outperforms established approaches such as contrastive learning and clustering with pretrained embeddings.

[121] arXiv:2507.00264 [pdf, other]
Title: Rust vs. C for Python Libraries: Evaluating Rust-Compatible Bindings Toolchains
Isabella Basso do Amaral (1), Renato Cordeiro Ferreira (1,2,3,4), Alfredo Goldman (1) ((1) University of São Paulo, (2) Jheronimus Academy of Data Science, (3) Technical University of Eindhoven, (4) Tilburg University)
Comments: 10 pages, 27 figures (1 diagram, 4 graphs, 9 tables, 13 code listings), submitted to SBAC-PAD 2025
Subjects: Programming Languages (cs.PL); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF); Software Engineering (cs.SE)

The Python programming language is best known for its syntax and scientific libraries, but it is also notorious for its slow interpreter. Optimizing critical sections in Python entails special knowledge of the binary interactions between programming languages, and can be cumbersome to interface manually, with implementers often resorting to convoluted third-party libraries. This comparative study evaluates the performance and ease of use of the PyO3 Python bindings toolchain for Rust against ctypes and cffi. By using Rust tooling developed for Python, we can achieve state-of-the-art performance with no concern for API compatibility.

[122] arXiv:2507.00265 [pdf, html, other]
Title: Examining Reject Relations in Stimulus Equivalence Simulations
Alexis Carrillo, Asieh Abolpour Mofrad, Anis Yazidi, Moises Betancort
Comments: 18 pages, 6 figures
Subjects: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)

Simulations offer a valuable tool for exploring stimulus equivalence (SE), yet the potential of reject relations to disrupt the assessment of equivalence class formation is contentious. This study investigates the role of reject relations in the acquisition of stimulus equivalence using computational models. We examined feedforward neural networks (FFNs), bidirectional encoder representations from transformers (BERT), and generative pre-trained transformers (GPT) across 18 conditions in matching-to-sample (MTS) simulations. Conditions varied in training structure (linear series, one-to-many, and many-to-one), relation type (select-only, reject-only, and select-reject), and negative comparison selection (standard and biased). A probabilistic agent served as a benchmark, embodying purely associative learning. The primary goal was to determine whether artificial neural networks could demonstrate equivalence class formation or whether their performance reflected associative learning. Results showed that reject relations influenced agent performance. While some agents achieved high accuracy on equivalence tests, particularly with reject relations and biased negative comparisons, this performance was comparable to the probabilistic agent. These findings suggest that artificial neural networks, including transformer models, may rely on associative strategies rather than SE. This underscores the need for careful consideration of reject relations and more stringent criteria in computational models of equivalence.

[123] arXiv:2507.00267 [pdf, html, other]
Title: Minimal residual rational Krylov subspace method for sequences of shifted linear systems
Hussam Al Daas, Davide Palitta
Subjects: Numerical Analysis (math.NA)

The solution of sequences of shifted linear systems is a classic problem in numerical linear algebra, and a variety of efficient methods have been proposed over the years. Nevertheless, there still exist challenging scenarios witnessing a lack of performing solvers. For instance, state-of-the-art procedures struggle to handle nonsymmetric problems where the shifts are complex numbers that do not come as conjugate pairs. We design a novel projection strategy based on the rational Krylov subspace equipped with a minimal residual condition. We also devise a novel pole selection procedure, tailored to our problem, providing poles for the rational Krylov basis construction that yield faster convergence than those computed by available general-purpose schemes. A panel of diverse numerical experiments shows that our novel approach performs better than state-of-the-art techniques, especially on the very challenging problems mentioned above.

[124] arXiv:2507.00268 [pdf, other]
Title: Control-Optimized Deep Reinforcement Learning for Artificially Intelligent Autonomous Systems
Oren Fivel, Matan Rudman, Kobi Cohen
Comments: 27 pages, 10 figures
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)

Deep reinforcement learning (DRL) has become a powerful tool for complex decision-making in machine learning and AI. However, traditional methods often assume perfect action execution, overlooking the uncertainties and deviations between an agent's selected actions and the actual system response. In real-world applications, such as robotics, mechatronics, and communication networks, execution mismatches arising from system dynamics, hardware constraints, and latency can significantly degrade performance. This work advances AI by developing a novel control-optimized DRL framework that explicitly models and compensates for action execution mismatches, a challenge largely overlooked in existing methods. Our approach establishes a structured two-stage process: determining the desired action and selecting the appropriate control signal to ensure proper execution. It trains the agent while accounting for action mismatches and controller corrections. By incorporating these factors into the training process, the AI agent optimizes the desired action with respect to both the actual control signal and the intended outcome, explicitly considering execution errors. This approach enhances robustness, ensuring that decision-making remains effective under real-world uncertainties. Our approach offers a substantial advancement for engineering practice by bridging the gap between idealized learning and real-world implementation. It equips intelligent agents operating in engineering environments with the ability to anticipate and adjust for actuation errors and system disturbances during training. We evaluate the framework in five widely used open-source mechanical simulation environments we restructured and developed to reflect real-world operating conditions, showcasing its robustness against uncertainties and offering a highly practical and efficient solution for control-oriented applications.

[125] arXiv:2507.00270 [pdf, html, other]
Title: EMSpice 2.1: A Coupled EM and IR Drop Analysis Tool with Joule Heating and Thermal Map Integration for VLSI Reliability
Subed Lamichhane, Haotian Lu, Sheldon X.-D. Tan
Comments: 4 Pages, accepted to SMACD 2025
Subjects: Systems and Control (eess.SY)

Electromigration (EM) remains a critical reliability concern in current and future copper-based VLSI circuits. As technology scales down, EM-induced IR drop becomes increasingly severe. While several EM-aware IR drop analysis tools have been proposed, few incorporate the real impact of temperature distribution on both EM and IR drop effects. In this work, we introduce EMSpice 2.1, an enhanced tool built upon the existing coupled IR-EM analysis framework, EMSpice 2.0, for EM-aware IR drop analysis. For the first time, EMSpice 2.1 uniquely integrates Joule heating effects and practical thermal maps derived from actual chip conditions. Additionally, it features improved interoperability with commercial EDA tools, facilitating more comprehensive EM and IR drop sign-off analysis. Our findings demonstrate that specific hotspot patterns significantly impact the lifetime of interconnects and overall chip reliability due to EM failures. Furthermore, our tool exhibits strong agreement with industry-standard tools such as COMSOL, achieving a speedup of over 200 times while maintaining high accuracy.

[126] arXiv:2507.00271 [pdf, html, other]
Title: User Concerns Regarding Social Robots for Mood Regulation: A Case Study on the "Sunday Blues"
Zhuochao Peng, Jiaxin Xu, Jun Hu, Haian Xue, Laurens A. G. Kolks, Pieter M. A. Desmet
Comments: Accepted to International Conference on Social Robotics + AI (ICSR 2025)
Subjects: Human-Computer Interaction (cs.HC); Robotics (cs.RO)

While recent research highlights the potential of social robots to support mood regulation, little is known about how prospective users view their integration into everyday life. To explore this, we conducted an exploratory case study that used a speculative robot concept "Mora" to provoke reflection and facilitate meaningful discussion about using social robots to manage subtle, day-to-day emotional experiences. We focused on the "Sunday Blues," a common dip in mood that occurs at the end of the weekend, as a relatable context in which to explore individuals' insights. Using a video prototype and a co-constructing stories method, we engaged 15 participants in imagining interactions with Mora and discussing their expectations, doubts, and concerns. The study surfaced a range of nuanced reflections around the attributes of social robots like empathy, intervention effectiveness, and ethical boundaries, which we translated into design considerations for future research and development in human-robot interaction.

[127] arXiv:2507.00272 [pdf, html, other]
Title: Iteratively Saturated Kalman Filtering
Alan Yang, Stephen Boyd
Subjects: Systems and Control (eess.SY)

The Kalman filter (KF) provides optimal recursive state estimates for linear-Gaussian systems and underpins applications in control, signal processing, and others. However, it is vulnerable to outliers in the measurements and process noise. We introduce the iteratively saturated Kalman filter (ISKF), which is derived as a scaled gradient method for solving a convex robust estimation problem. It achieves outlier robustness while preserving the KF's low per-step cost and implementation simplicity, since in practice it typically requires only one or two iterations to achieve good performance. The ISKF also admits a steady-state variant that, like the standard steady-state KF, does not require linear system solves in each time step, making it well-suited for real-time systems.

[128] arXiv:2507.00273 [pdf, html, other]
Title: Mechanical Intelligence-Aware Curriculum Reinforcement Learning for Humanoids with Parallel Actuation
Yusuke Tanaka, Alvin Zhu, Quanyou Wang, Dennis Hong
Subjects: Robotics (cs.RO)

Reinforcement learning (RL) has enabled significant advances in humanoid robot locomotion, yet most learning frameworks do not account for mechanical intelligence embedded in parallel actuation mechanisms due to limitations in simulator support for closed kinematic chains. This omission can lead to inaccurate motion modeling and suboptimal policies, particularly for robots with high actuation complexity. This paper presents an end-to-end curriculum RL framework for BRUCE, a kid-sized humanoid robot featuring three distinct parallel mechanisms in its legs: a differential pulley, a 5-bar linkage, and a 4-bar linkage. Unlike prior approaches that rely on simplified serial approximations, we simulate all closed-chain constraints natively using GPU-accelerated MJX (MuJoCo), preserving the hardware's physical properties during training. We benchmark our RL approach against a Model Predictive Controller (MPC), demonstrating better surface generalization and performance in real-world zero-shot deployment. This work highlights the computational approaches and performance benefits of fully simulating parallel mechanisms in end-to-end learning pipelines for legged humanoids.

[129] arXiv:2507.00275 [pdf, html, other]
Title: Double Q-learning for Value-based Deep Reinforcement Learning, Revisited
Prabhat Nagarajan, Martha White, Marlos C. Machado
Comments: 44 pages
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Overestimation is pervasive in reinforcement learning (RL), including in Q-learning, which forms the algorithmic basis for many value-based deep RL algorithms. Double Q-learning is an algorithm introduced to address Q-learning's overestimation by training two Q-functions and using both to de-correlate action-selection and action-evaluation in bootstrap targets. Shortly after Q-learning was adapted to deep RL in the form of deep Q-networks (DQN), Double Q-learning was adapted to deep RL in the form of Double DQN. However, Double DQN only loosely adapts Double Q-learning, forgoing the training of two different Q-functions that bootstrap off one another. In this paper, we study algorithms that adapt this core idea of Double Q-learning for value-based deep RL. We term such algorithms Deep Double Q-learning (DDQL). Our aim is to understand whether DDQL exhibits less overestimation than Double DQN and whether performant instantiations of DDQL exist. We answer both questions affirmatively, demonstrating that DDQL reduces overestimation and outperforms Double DQN in aggregate across 57 Atari 2600 games, without requiring additional hyperparameters. We also study several aspects of DDQL, including its network architecture, replay ratio, and minibatch sampling strategy.

[130] arXiv:2507.00277 [pdf, html, other]
Title: Lazy B-Trees
Casper Moldrup Rysgaard, Sebastian Wild
Comments: MFCS 2025
Subjects: Data Structures and Algorithms (cs.DS); Databases (cs.DB)

Lazy search trees (Sandlund & Wild FOCS 2020, Sandlund & Zhang SODA 2022) are sorted dictionaries whose update and query performance smoothly interpolates between that of efficient priority queues and binary search trees - automatically, depending on actual use; no adjustments are necessary to the data structure to realize the cost savings. In this paper, we design lazy B-trees, a variant of lazy search trees suitable for external memory that generalizes the speedup of B-trees over binary search trees wrt. input/output operations to the same smooth interpolation regime.
A key technical difficulty to overcome is the lack of a (fully satisfactory) external variant of biased search trees, on which lazy search trees crucially rely. We give a construction for a subset of performance guarantees sufficient to realize external-memory lazy search trees, which we deem of independent interest.
As one special case, lazy B-trees can be used as an external-memory priority queue, in which case they are competitive with some tailor-made heaps; indeed, they offer faster decrease-key and insert operations than known data structures.

[131] arXiv:2507.00278 [pdf, other]
Title: Automatic discovery of optimal meta-solvers for time-dependent nonlinear PDEs
Youngkyu Lee, Shanqing Liu, Jerome Darbon, George Em Karniadakis
Subjects: Numerical Analysis (math.NA)

We present a general and scalable framework for the automated discovery of optimal meta-solvers for the solution of time-dependent nonlinear partial differential equations after appropriate discretization. By integrating classical numerical methods (e.g., Krylov-based methods) with modern deep learning components, such as neural operators, our approach enables flexible, on-demand solver design tailored to specific problem classes and objectives. The fast solvers tackle the large linear system resulting from the Newton--Raphson iteration or by using an implicit-explicit (IMEX) time integration scheme. Specifically, we formulate solver discovery as a multi-objective optimization problem, balancing various performance criteria such as accuracy, speed, and memory usage. The resulting Pareto optimal set provides a principled foundation for solver selection based on user-defined preference functions. When applied to problems in reaction--diffusion, fluid dynamics, and solid mechanics, the discovered meta-solvers consistently outperform conventional iterative methods, demonstrating both practical efficiency and broad applicability.

[132] arXiv:2507.00286 [pdf, html, other]
Title: Visual Privacy Management with Generative AI for Blind and Low-Vision People
Tanusree Sharma, Yu-Yun Tseng, Lotus Zhang, Ayae Ide, Kelly Avery Mack, Leah Findlater, Danna Gurari, Yang Wang
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)

Blind and low vision (BLV) individuals use Generative AI (GenAI) tools to interpret and manage visual content in their daily lives. While such tools can enhance the accessibility of visual content and so enable greater user independence, they also introduce complex challenges around visual privacy. In this paper, we investigate the current practices and future design preferences of blind and low vision individuals through an interview study with 21 participants. Our findings reveal a range of current practices with GenAI that balance privacy, efficiency, and emotional agency, with users accounting for privacy risks across six key scenarios, such as self-presentation, indoor/outdoor spatial privacy, social sharing, and handling professional content. Our findings reveal design preferences, including on-device processing, zero-retention guarantees, sensitive content redaction, privacy-aware appearance indicators, and multimodal tactile mirrored interaction methods. We conclude with actionable design recommendations to support user-centered visual privacy through GenAI, expanding the notion of privacy and responsible handling of others data.

[133] arXiv:2507.00287 [pdf, html, other]
Title: Self-Supervised Multiview Xray Matching
Mohamad Dabboussi, Malo Huard, Yann Gousseau, Pietro Gori
Comments: MICCAI 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Accurate interpretation of multi-view radiographs is crucial for diagnosing fractures, muscular injuries, and other anomalies. While significant advances have been made in AI-based analysis of single images, current methods often struggle to establish robust correspondences between different X-ray views, an essential capability for precise clinical evaluations. In this work, we present a novel self-supervised pipeline that eliminates the need for manual annotation by automatically generating a many-to-many correspondence matrix between synthetic X-ray views. This is achieved using digitally reconstructed radiographs (DRR), which are automatically derived from unannotated CT volumes. Our approach incorporates a transformer-based training phase to accurately predict correspondences across two or more X-ray views. Furthermore, we demonstrate that learning correspondences among synthetic X-ray views can be leveraged as a pretraining strategy to enhance automatic multi-view fracture detection on real data. Extensive evaluations on both synthetic and real X-ray datasets show that incorporating correspondences improves performance in multi-view fracture classification.

[134] arXiv:2507.00292 [pdf, html, other]
Title: Reducing Variability of Multiple Instance Learning Methods for Digital Pathology
Ali Mammadov, Loïc Le Folgoc, Guillaume Hocquet, Pietro Gori
Comments: MICCAI 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Digital pathology has revolutionized the field by enabling the digitization of tissue samples into whole slide images (WSIs). However, the high resolution and large size of WSIs present significant challenges when it comes to applying Deep Learning models. As a solution, WSIs are often divided into smaller patches with a global label (\textit{i.e., diagnostic}) per slide, instead of a (too) costly pixel-wise annotation. By treating each slide as a bag of patches, Multiple Instance Learning (MIL) methods have emerged as a suitable solution for WSI classification. A major drawback of MIL methods is their high variability in performance across different runs, which can reach up to 10-15 AUC points on the test set, making it difficult to compare different MIL methods reliably. This variability mainly comes from three factors: i) weight initialization, ii) batch (shuffling) ordering, iii) and learning rate. To address that, we introduce a Multi-Fidelity, Model Fusion strategy for MIL methods. We first train multiple models for a few epochs and average the most stable and promising ones based on validation scores. This approach can be applied to any existing MIL model to reduce performance variability. It also simplifies hyperparameter tuning and improves reproducibility while maintaining computational efficiency. We extensively validate our approach on WSI classification tasks using 2 different datasets, 3 initialization strategies and 5 MIL methods, for a total of more than 2000 experiments.

[135] arXiv:2507.00297 [pdf, other]
Title: Natural language processing for African languages
David Ifeoluwa Adelani
Comments: PhD thesis
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Recent advances in word embeddings and language models use large-scale, unlabelled data and self-supervised learning to boost NLP performance. Multilingual models, often trained on web-sourced data like Wikipedia, face challenges: few low-resource languages are included, their data is often noisy, and lack of labeled datasets makes it hard to evaluate performance outside high-resource languages like English. In this dissertation, we focus on languages spoken in Sub-Saharan Africa where all the indigenous languages in this region can be regarded as low-resourced in terms of the availability of labelled data for NLP tasks and unlabelled data found on the web. We analyse the noise in the publicly available corpora, and curate a high-quality corpus, demonstrating that the quality of semantic representations learned in word embeddings does not only depend on the amount of data but on the quality of pre-training data. We demonstrate empirically the limitations of word embeddings, and the opportunities the multilingual pre-trained language model (PLM) offers especially for languages unseen during pre-training and low-resource scenarios. We further study how to adapt and specialize multilingual PLMs to unseen African languages using a small amount of monolingual texts. To address the under-representation of the African languages in NLP research, we developed large scale human-annotated labelled datasets for 21 African languages in two impactful NLP tasks: named entity recognition and machine translation. We conduct an extensive empirical evaluation using state-of-the-art methods across supervised, weakly-supervised, and transfer learning settings.

[136] arXiv:2507.00299 [pdf, html, other]
Title: When Kids Mode Isn't For Kids: Investigating TikTok's "Under 13 Experience"
Olivia Figueira, Pranathi Chamarthi, Tu Le, Athina Markopoulou
Subjects: Human-Computer Interaction (cs.HC); Cryptography and Security (cs.CR)

TikTok, the social media platform that is popular among children and adolescents, offers a more restrictive "Under 13 Experience" exclusively for young users in the US, also known as TikTok's "Kids Mode". While prior research has studied various aspects of TikTok's regular mode, including privacy and personalization, TikTok's Kids Mode remains understudied, and there is a lack of transparency regarding its content curation and its safety and privacy protections for children. In this paper, (i) we propose an auditing methodology to comprehensively investigate TikTok's Kids Mode and (ii) we apply it to characterize the platform's content curation and determine the prevalence of child-directed content, based on regulations in the Children's Online Privacy Protection Act (COPPA). We find that 83% of videos observed on the "For You" page in Kids Mode are actually not child-directed, and even inappropriate content was found. The platform also lacks critical features, namely parental controls and accessibility settings. Our findings have important design and regulatory implications, as children may be incentivized to use TikTok's regular mode instead of Kids Mode, where they are known to be exposed to further safety and privacy risks.

[137] arXiv:2507.00301 [pdf, other]
Title: Structure-preserving Lift & Learn: Scientific machine learning for nonlinear conservative partial differential equations
Harsh Sharma, Juan Diego Draxl Giannoni, Boris Kramer
Comments: arXiv admin note: substantial text overlap with arXiv:2503.02273
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)

This work presents structure-preserving Lift & Learn, a scientific machine learning method that employs lifting variable transformations to learn structure-preserving reduced-order models for nonlinear partial differential equations (PDEs) with conservation laws. We propose a hybrid learning approach based on a recently developed energy-quadratization strategy that uses knowledge of the nonlinearity at the PDE level to derive an equivalent quadratic lifted system with quadratic system energy. The lifted dynamics obtained via energy quadratization are linear in the old variables, making model learning very effective in the lifted setting. Based on the lifted quadratic PDE model form, the proposed method derives quadratic reduced terms analytically and then uses those derived terms to formulate a constrained optimization problem to learn the remaining linear reduced operators in a structure-preserving way. The proposed hybrid learning approach yields computationally efficient quadratic reduced-order models that respect the underlying physics of the high-dimensional problem. We demonstrate the generalizability of quadratic models learned via the proposed structure-preserving Lift & Learn method through three numerical examples: the one-dimensional wave equation with exponential nonlinearity, the two-dimensional sine-Gordon equation, and the two-dimensional Klein-Gordon-Zakharov equations. The numerical results show that the proposed learning approach is competitive with the state-of-the-art structure-preserving data-driven model reduction method in terms of both accuracy and computational efficiency.

[138] arXiv:2507.00304 [pdf, other]
Title: MamNet: A Novel Hybrid Model for Time-Series Forecasting and Frequency Pattern Analysis in Network Traffic
Yujun Zhang, Runlong Li, Xiaoxiang Liang, Xinhao Yang, Tian Su, Bo Liu, Yan Zhou
Comments: 16 pages
Subjects: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)

The abnormal fluctuations in network traffic may indicate potential security threats or system failures. Therefore, efficient network traffic prediction and anomaly detection methods are crucial for network security and traffic management. This paper proposes a novel network traffic prediction and anomaly detection model, MamNet, which integrates time-domain modeling and frequency-domain feature extraction. The model first captures the long-term dependencies of network traffic through the Mamba module (time-domain modeling), and then identifies periodic fluctuations in the traffic using Fourier Transform (frequency-domain feature extraction). In the feature fusion layer, multi-scale information is integrated to enhance the model's ability to detect network traffic anomalies. Experiments conducted on the UNSW-NB15 and CAIDA datasets demonstrate that MamNet outperforms several recent mainstream models in terms of accuracy, recall, and F1-Score. Specifically, it achieves an improvement of approximately 2% to 4% in detection performance for complex traffic patterns and long-term trend detection. The results indicate that MamNet effectively captures anomalies in network traffic across different time scales and is suitable for anomaly detection tasks in network security and traffic management. Future work could further optimize the model structure by incorporating external network event information, thereby improving the model's adaptability and stability in complex network environments.

[139] arXiv:2507.00305 [pdf, html, other]
Title: EEG-Based Auditory BCI for Communication in a Completely Locked-In Patient Using Volitional Frequency Band Modulation
Deland Liu, Frigyes Samuel Racz, Zoe Lalji, Jose del R. Millan
Subjects: Human-Computer Interaction (cs.HC); Neurons and Cognition (q-bio.NC)

Patients with amyotrophic lateral sclerosis (ALS) in the completely locked-in state (CLIS) can lose all reliable motor control and are left without any means of communication. It remains unknown whether non-invasive electroencephalogram (EEG) based brain-computer interfaces (BCIs) can support volitional communication in CLIS. Here, we show that a CLIS patient was able to operate an EEG-based BCI across multiple online sessions to respond to both general knowledge and personally relevant assistive questions. The patient delivered "Yes"/"No" responses by volitionally modulating alpha and beta band power at different channels, guided by real-time auditory feedback from the BCI. The patient communicated assistive needs above chance in all sessions, achieving a perfect score in the final session. Performance on general knowledge questions varied across sessions, with two sessions showing accurate and above-chance responses, while the first and last sessions remained at chance level. The patient also showed consistent modulation patterns over time. These findings suggest that non-invasive BCIs may offer a potential pathway for restoring basic communication in CLIS.

[140] arXiv:2507.00306 [pdf, html, other]
Title: Origin-Destination Travel Demand Estimation: An Approach That Scales Worldwide, and Its Application to Five Metropolitan Highway Networks
Chao Zhang, Neha Arora, Christopher Bian, Yechen Li, Willa Ng, Andrew Tomkins, Bin Yan, Janny Zhang, Carolina Osorio
Subjects: Emerging Technologies (cs.ET)

Estimating Origin-Destination (OD) travel demand is vital for effective urban planning and traffic management. Developing universally applicable OD estimation methodologies is significantly challenged by the pervasive scarcity of high-fidelity traffic data and the difficulty in obtaining city-specific prior OD estimates (or seed ODs), which are often prerequisite for traditional approaches. Our proposed method directly estimates OD travel demand by systematically leveraging aggregated, anonymized statistics from Google Maps Traffic Trends, obviating the need for conventional census or city-provided OD data. The OD demand is estimated by formulating a single-level, one-dimensional, continuous nonlinear optimization problem with nonlinear equality and bound constraints to replicate highway path travel times. The method achieves efficiency and scalability by employing a differentiable analytical macroscopic network model. This model by design is computationally lightweight, distinguished by its parsimonious parameterization that requires minimal calibration effort and its capacity for instantaneous evaluation. These attributes ensure the method's broad applicability and practical utility across diverse cities globally. Using segment sensor counts from Los Angeles and San Diego highway networks, we validate our proposed approach, demonstrating a two-thirds to three-quarters improvement in the fit to segment count data over a baseline. Beyond validation, we establish the method's scalability and robust performance in replicating path travel times across diverse highway networks, including Seattle, Orlando, Denver, Philadelphia, and Boston. In these expanded evaluations, our method not only aligns with simulation-based benchmarks but also achieves an average 13% improvement in it's ability to fit travel time data compared to the baseline during afternoon peak hours.

[141] arXiv:2507.00310 [pdf, other]
Title: Open-ended Scientific Discovery via Bayesian Surprise
Dhruv Agarwal, Bodhisattwa Prasad Majumder, Reece Adamson, Megha Chakravorty, Satvika Reddy Gavireddy, Aditya Parashar, Harshit Surana, Bhavana Dalvi Mishra, Andrew McCallum, Ashish Sabharwal, Peter Clark
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

The promise of autonomous scientific discovery (ASD) hinges not only on answering questions, but also on knowing which questions to ask. Most recent works in ASD explore the use of large language models (LLMs) in goal-driven settings, relying on human-specified research questions to guide hypothesis generation. However, scientific discovery may be accelerated further by allowing the AI system to drive exploration by its own criteria. The few existing approaches in open-ended ASD select hypotheses based on diversity heuristics or subjective proxies for human interestingness, but the former struggles to meaningfully navigate the typically vast hypothesis space, and the latter suffers from imprecise definitions. This paper presents AutoDS -- a method for open-ended ASD that instead drives scientific exploration using Bayesian surprise. Here, we quantify the epistemic shift from the LLM's prior beliefs about a hypothesis to its posterior beliefs after gathering experimental results. To efficiently explore the space of nested hypotheses, our method employs a Monte Carlo tree search (MCTS) strategy with progressive widening using surprisal as the reward function. We evaluate AutoDS in the setting of data-driven discovery across 21 real-world datasets spanning domains such as biology, economics, finance, and behavioral science. Our results demonstrate that under a fixed budget, AutoDS substantially outperforms competitors by producing 5--29\% more discoveries deemed surprising by the LLM. Our human evaluation further finds that two-thirds of AutoDS discoveries are surprising to the domain experts, suggesting this is an important step forward towards building open-ended ASD systems.

[142] arXiv:2507.00316 [pdf, html, other]
Title: $μ^2$Tokenizer: Differentiable Multi-Scale Multi-Modal Tokenizer for Radiology Report Generation
Siyou Li, Pengyao Qin, Huanan Wu, Dong Nie, Arun J. Thirunavukarasu, Juntao Yu, Le Zhang
Comments: Accepted by MICCAI 2025
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Image and Video Processing (eess.IV)

Automated radiology report generation (RRG) aims to produce detailed textual reports from clinical imaging, such as computed tomography (CT) scans, to improve the accuracy and efficiency of diagnosis and provision of management advice. RRG is complicated by two key challenges: (1) inherent complexity in extracting relevant information from imaging data under resource constraints, and (2) difficulty in objectively evaluating discrepancies between model-generated and expert-written reports. To address these challenges, we propose $\mu^2$LLM, a $\underline{\textbf{mu}}$ltiscale $\underline{\textbf{mu}}$ltimodal large language models for RRG tasks. The novel ${\mu}^2$Tokenizer, as an intermediate layer, integrates multi-modal features from the multiscale visual tokenizer and the text tokenizer, then enhances report generation quality through direct preference optimization (DPO), guided by GREEN-RedLlama. Experimental results on four large CT image-report medical datasetdemonstrate that our method outperforms existing approaches, highlighting the potential of our fine-tuned $\mu^2$LLMs on limited data for RRG tasks.

[143] arXiv:2507.00319 [pdf, html, other]
Title: When Digital Twins Meet Large Language Models: Realistic, Interactive, and Editable Simulation for Autonomous Driving
Tanmay Vilas Samak, Chinmay Vilas Samak, Bing Li, Venkat Krovi
Subjects: Robotics (cs.RO)

Simulation frameworks have been key enablers for the development and validation of autonomous driving systems. However, existing methods struggle to comprehensively address the autonomy-oriented requirements of balancing: (i) dynamical fidelity, (ii) photorealistic rendering, (iii) context-relevant scenario orchestration, and (iv) real-time performance. To address these limitations, we present a unified framework for creating and curating high-fidelity digital twins to accelerate advancements in autonomous driving research. Our framework leverages a mix of physics-based and data-driven techniques for developing and simulating digital twins of autonomous vehicles and their operating environments. It is capable of reconstructing real-world scenes and assets (real2sim) with geometric and photorealistic accuracy and infusing them with various physical properties to enable real-time dynamical simulation of the ensuing driving scenarios. Additionally, it also incorporates a large language model (LLM) interface to flexibly edit the driving scenarios online via natural language prompts. We analyze the presented framework in terms of its fidelity, performance, and serviceability. Results indicate that our framework can reconstruct 3D scenes and assets with up to 97% structural similarity, while maintaining frame rates above 60 Hz. We also demonstrate that it can handle natural language prompts to generate diverse driving scenarios with up to 95% repeatability and 85% generalizability.

[144] arXiv:2507.00320 [pdf, other]
Title: Exploring Theory-Laden Observations in the Brain Basis of Emotional Experience
Christiana Westlin, Ashutosh Singh, Deniz Erdogmus, Georgios Stratis, Lisa Feldman Barrett
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)

In the science of emotion, it is widely assumed that folk emotion categories form a biological and psychological typology, and studies are routinely designed and analyzed to identify emotion-specific patterns. This approach shapes the observations that studies report, ultimately reinforcing the assumption that guided the investigation. Here, we reanalyzed data from one such typologically-guided study that reported mappings between individual brain patterns and group-averaged ratings of 34 emotion categories. Our reanalysis was guided by an alternative view of emotion categories as populations of variable, situated instances, and which predicts a priori that there will be significant variation in brain patterns within a category across instances. Correspondingly, our analysis made minimal assumptions about the structure of the variance present in the data. As predicted, we did not observe the original mappings and instead observed significant variation across individuals. These findings demonstrate how starting assumptions can ultimately impact scientific conclusions and suggest that a hypothesis must be supported using multiple analytic methods before it is taken seriously.

[145] arXiv:2507.00322 [pdf, html, other]
Title: Failure by Interference: Language Models Make Balanced Parentheses Errors When Faulty Mechanisms Overshadow Sound Ones
Daking Rai, Samuel Miller, Kevin Moran, Ziyu Yao
Comments: 23 pages, 10 figures, Preprint
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Despite remarkable advances in coding capabilities, language models (LMs) still struggle with simple syntactic tasks such as generating balanced parentheses. In this study, we investigate the underlying mechanisms behind the persistence of these errors across LMs of varying sizes (124M-7B) to both understand and mitigate the errors. Our study reveals that LMs rely on a number of components (attention heads and FF neurons) that independently make their own predictions. While some components reliably promote correct answers across a generalized range of inputs (i.e., implementing "sound mechanisms''), others are less reliable and introduce noise by promoting incorrect tokens (i.e., implementing "faulty mechanisms''). Errors occur when the faulty mechanisms overshadow the sound ones and dominantly affect the predictions. Motivated by this insight, we introduce RASteer, a steering method to systematically identify and increase the contribution of reliable components for improving model performance. RASteer substantially improves performance on balanced parentheses tasks, boosting accuracy of some models from $0$% to around $100$% without impairing the models' general coding ability. We further demonstrate its broader applicability in arithmetic reasoning tasks, achieving performance gains of up to around $20$%.

[146] arXiv:2507.00327 [pdf, html, other]
Title: Beyond Low-Rank Tuning: Model Prior-Guided Rank Allocation for Effective Transfer in Low-Data and Large-Gap Regimes
Chuyan Zhang, Kefan Wang, Yun Gu
Comments: Accepted by ICCV 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Low-Rank Adaptation (LoRA) has proven effective in reducing computational costs while maintaining performance comparable to fully fine-tuned foundation models across various tasks. However, its fixed low-rank structure restricts its adaptability in scenarios with substantial domain gaps, where higher ranks are often required to capture domain-specific complexities. Current adaptive LoRA methods attempt to overcome this limitation by dynamically expanding or selectively allocating ranks, but these approaches frequently depend on computationally intensive techniques such as iterative pruning, rank searches, or additional regularization. To address these challenges, we introduce Stable Rank-Guided Low-Rank Adaptation (SR-LoRA), a novel framework that utilizes the stable rank of pre-trained weight matrices as a natural prior for layer-wise rank allocation. By leveraging the stable rank, which reflects the intrinsic dimensionality of the weights, SR-LoRA enables a principled and efficient redistribution of ranks across layers, enhancing adaptability without incurring additional search costs. Empirical evaluations on few-shot tasks with significant domain gaps show that SR-LoRA consistently outperforms recent adaptive LoRA variants, achieving a superior trade-off between performance and efficiency. Our code is available at this https URL.

[147] arXiv:2507.00328 [pdf, html, other]
Title: MammoTracker: Mask-Guided Lesion Tracking in Temporal Mammograms
Xuan Liu, Yinhao Ren, Marc D. Ryser, Lars J. Grimm, Joseph Y. Lo
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Accurate lesion tracking in temporal mammograms is essential for monitoring breast cancer progression and facilitating early diagnosis. However, automated lesion correspondence across exams remains a challenges in computer-aided diagnosis (CAD) systems, limiting their effectiveness. We propose MammoTracker, a mask-guided lesion tracking framework that automates lesion localization across consecutively exams. Our approach follows a coarse-to-fine strategy incorporating three key modules: global search, local search, and score refinement. To support large-scale training and evaluation, we introduce a new dataset with curated prior-exam annotations for 730 mass and calcification cases from the public EMBED mammogram dataset, yielding over 20000 lesion pairs, making it the largest known resource for temporal lesion tracking in mammograms. Experimental results demonstrate that MammoTracker achieves 0.455 average overlap and 0.509 accuracy, surpassing baseline models by 8%, highlighting its potential to enhance CAD-based lesion progression analysis. Our dataset will be available at this https URL.

[148] arXiv:2507.00330 [pdf, html, other]
Title: Modeling Data Diversity for Joint Instance and Verbalizer Selection in Cold-Start Scenarios
Mohna Chakraborty, Adithya Kulkarni, Qi Li
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Prompt-based methods leverage the knowledge of pre-trained language models (PLMs) trained with a masked language modeling (MLM) objective; however, these methods are sensitive to template, verbalizer, and few-shot instance selection, particularly in cold-start settings with no labeled data. Existing studies overlook the dependency between instances and verbalizers, where instance-label probabilities depend on verbalizer token proximity in the embedding space. To address this, we propose COLDSELECT, a joint verbalizer and instance selection approach that models data diversity. COLDSELECT maps PLM vocabulary and $h_{[MASK]}$ embeddings into a shared space, applying dimensionality reduction and clustering to ensure efficient and diverse selection. By optimizing for minimal uncertainty and maximal diversity, COLDSELECT captures data relationships effectively. Experiments on eight benchmarks demonstrate COLDSELECT's superiority in reducing uncertainty and enhancing generalization, outperforming baselines in verbalizer and few-shot instance selection for cold-start scenarios.

[149] arXiv:2507.00333 [pdf, html, other]
Title: Scope Meets Screen: Lessons Learned in Designing Composite Visualizations for Marksmanship Training Across Skill Levels
Emin Zerman, Jonas Carlsson, Mårten Sjöström
Comments: 5 pages, accepted at IEEE VIS 2025
Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Image and Video Processing (eess.IV)

Marksmanship practices are required in various professions, including police, military personnel, hunters, as well as sports shooters, such as Olympic shooting, biathlon, and modern pentathlon. The current form of training and coaching is mostly based on repetition, where the coach does not see through the eyes of the shooter, and analysis is limited to stance and accuracy post-session. In this study, we present a shooting visualization system and evaluate its perceived effectiveness for both novice and expert shooters. To achieve this, five composite visualizations were developed using first-person shooting video recordings enriched with overlaid metrics and graphical summaries. These views were evaluated with 10 participants (5 expert marksmen, 5 novices) through a mixed-methods study including shot-count and aiming interpretation tasks, pairwise preference comparisons, and semi-structured interviews. The results show that a dashboard-style composite view, combining raw video with a polar plot and selected graphs, was preferred in 9 of 10 cases and supported understanding across skill levels. The insights gained from this design study point to the broader value of integrating first-person video with visual analytics for coaching, and we suggest directions for applying this approach to other precision-based sports.

[150] arXiv:2507.00334 [pdf, html, other]
Title: Populate-A-Scene: Affordance-Aware Human Video Generation
Mengyi Shan, Zecheng He, Haoyu Ma, Felix Juefei-Xu, Peizhao Zhang, Tingbo Hou, Ching-Yao Chuang
Comments: Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Can a video generation model be repurposed as an interactive world simulator? We explore the affordance perception potential of text-to-video models by teaching them to predict human-environment interaction. Given a scene image and a prompt describing human actions, we fine-tune the model to insert a person into the scene, while ensuring coherent behavior, appearance, harmonization, and scene affordance. Unlike prior work, we infer human affordance for video generation (i.e., where to insert a person and how they should behave) from a single scene image, without explicit conditions like bounding boxes or body poses. An in-depth study of cross-attention heatmaps demonstrates that we can uncover the inherent affordance perception of a pre-trained video model without labeled affordance datasets.

[151] arXiv:2507.00337 [pdf, html, other]
Title: Seeing Through the Fog: Empowering Mobile Devices to Expose and Mitigate RAN Buffer Effects on Delay-Sensitive Protocols
Yuxin Liu, Tianyang Zhang, Qiang Wu, Ju Ren, Kyle Jamieson, Yaxiong Xie
Subjects: Networking and Internet Architecture (cs.NI)

Delay-based protocols rely on end-to-end delay measurements to detect network congestion. However, in cellular networks, Radio Access Network (RAN) buffers introduce significant delays unrelated to congestion, fundamentally challenging these protocols' assumptions. We identify two major types of RAN buffers - retransmission buffers and uplink scheduling buffers - that can introduce delays comparable to congestion-induced delays, severely degrading protocol performance. We present CellNinjia, a software-based system providing real-time visibility into RAN operations, and Gandalf, which leverages this visibility to systematically handle RAN-induced delays. Unlike existing approaches that treat these delays as random noise, Gandalf identifies specific RAN operations and compensates for their effects. Our evaluation in commercial 4G LTE and 5G networks shows that Gandalf enables substantial performance improvements - up to 7.49x for Copa and 9.53x for PCC Vivace - without modifying the protocols' core algorithms, demonstrating that delay-based protocols can realize their full potential in cellular networks.

[152] arXiv:2507.00339 [pdf, html, other]
Title: Training for X-Ray Vision: Amodal Segmentation, Amodal Content Completion, and View-Invariant Object Representation from Multi-Camera Video
Alexander Moore, Amar Saini, Kylie Cancilla, Doug Poland, Carmen Carrano
Comments: 9 pages, 2 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Amodal segmentation and amodal content completion require using object priors to estimate occluded masks and features of objects in complex scenes. Until now, no data has provided an additional dimension for object context: the possibility of multiple cameras sharing a view of a scene. We introduce MOVi-MC-AC: Multiple Object Video with Multi-Cameras and Amodal Content, the largest amodal segmentation and first amodal content dataset to date. Cluttered scenes of generic household objects are simulated in multi-camera video. MOVi-MC-AC contributes to the growing literature of object detection, tracking, and segmentation by including two new contributions to the deep learning for computer vision world. Multiple Camera (MC) settings where objects can be identified and tracked between various unique camera perspectives are rare in both synthetic and real-world video. We introduce a new complexity to synthetic video by providing consistent object ids for detections and segmentations between both frames and multiple cameras each with unique features and motion patterns on a single scene. Amodal Content (AC) is a reconstructive task in which models predict the appearance of target objects through occlusions. In the amodal segmentation literature, some datasets have been released with amodal detection, tracking, and segmentation labels. While other methods rely on slow cut-and-paste schemes to generate amodal content pseudo-labels, they do not account for natural occlusions present in the modal masks. MOVi-MC-AC provides labels for ~5.8 million object instances, setting a new maximum in the amodal dataset literature, along with being the first to provide ground-truth amodal content. The full dataset is available at this https URL ,

[153] arXiv:2507.00343 [pdf, html, other]
Title: Meaningful Data Erasure in the Presence of Dependencies
Vishal Chakraborty, Youri Kaminsky, Sharad Mehrotra, Felix Naumann, Faisal Nawab, Primal Pappachan, Mohammad Sadoghi, Nalini Venkatasubramanian
Comments: VLDB 2025 Preprint
Subjects: Databases (cs.DB)

Data regulations like GDPR require systems to support data erasure but leave the definition of "erasure" open to interpretation. This ambiguity makes compliance challenging, especially in databases where data dependencies can lead to erased data being inferred from remaining data. We formally define a precise notion of data erasure that ensures any inference about deleted data, through dependencies, remains bounded to what could have been inferred before its insertion. We design erasure mechanisms that enforce this guarantee at minimal cost. Additionally, we explore strategies to balance cost and throughput, batch multiple erasures, and proactively compute data retention times when possible. We demonstrate the practicality and scalability of our algorithms using both real and synthetic datasets.

[154] arXiv:2507.00347 [pdf, other]
Title: VTS-Guided AI Interaction Workflow for Business Insights
Sun Ding, Ude Enebeli, Atilhan (Ati)Manay, Ryan Pua, Kamal Kotak
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Modern firms face a flood of dense, unstructured reports. Turning these documents into usable insights takes heavy effort and is far from agile when quick answers are needed. VTS-AI tackles this gap. It integrates Visual Thinking Strategies, which emphasize evidence-based observation, linking, and thinking, into AI agents, so the agents can extract business insights from unstructured text, tables, and images at scale. The system works in three tiers (micro, meso, macro). It tags issues, links them to source pages, and rolls them into clear action levers stored in a searchable YAML file. In tests on an 18-page business report, VTS-AI matched the speed of a one-shot ChatGPT prompt yet produced richer findings: page locations, verbatim excerpts, severity scores, and causal links. Analysts can accept or adjust these outputs in the same IDE, keeping human judgment in the loop. Early results show VTS-AI spots the direction of key metrics and flags where deeper number-crunching is needed. Next steps include mapping narrative tags to financial ratios, adding finance-tuned language models through a Model-Context Protocol, and building a Risk & Safety Layer to stress-test models and secure data. These upgrades aim to make VTS-AI a production-ready, audit-friendly tool for rapid business analysis.

[155] arXiv:2507.00348 [pdf, html, other]
Title: Addressing malware family concept drift with triplet autoencoder
Numan Halit Guldemir, Oluwafemi Olukoya, Jesús Martínez-del-Rincón
Journal-ref: Proceedings of the Eighteenth International Conference on Emerging Security Information, Systems and Technologies (SECURWARE 2024)
Subjects: Cryptography and Security (cs.CR)

Machine learning is increasingly vital in cybersecurity, especially in malware detection. However, concept drift, where the characteristics of malware change over time, poses a challenge for maintaining the efficacy of these detection systems. Concept drift can occur in two forms: the emergence of entirely new malware families and the evolution of existing ones. This paper proposes an innovative method to address the former, focusing on effectively identifying new malware families. Our approach leverages a supervised autoencoder combined with triplet loss to differentiate between known and new malware families. We create clear and robust clusters that enhance the accuracy and resilience of malware family classification by utilizing this metric learning technique and the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. The effectiveness of our method is validated using an Android malware dataset and a Windows portable executable (PE) malware dataset, showcasing its capability to sustain model performance within the dynamic landscape of emerging malware threats. Our results demonstrate a significant improvement in detecting new malware families, offering a reliable solution for ongoing cybersecurity challenges.

[156] arXiv:2507.00352 [pdf, html, other]
Title: An AST-guided LLM Approach for SVRF Code Synthesis
Abanoub E. Abdelmalak, Mohamed A. Elsayed, David Abercrombie, Ilhami Torunoglu
Comments: 9 Pages, 5 Figures, 2 Tables
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)

Standard Verification Rule Format (SVRF) is essential for semiconductor applications like Design Rule Check (DRC), Layout Versus Schematic (LVS), and Optical Proximity Correction (OPC) and it faces challenges as advancing nodes create complex design rules that renders traditional SVRF development ineffective and highlight an expertise gap. This paper introduces a novel methodology integrating Abstract Syntax Tree (AST) embedding and Retrieval-Augmented Generation (RAG) for enhanced SVRF code synthesis, ensuring semantic accuracy and error minimization through structural validation with domain-specific insights for precise code generation.
We evaluate different T5-based models and propose an innovative SVRF-specific scoring framework that complements standard metrics like BLEU and ROUGE-L. In our approach, AST provides rigorous structural validation, while RAG infuses relevant domain knowledge, effectively enhancing the code generation workflow.
Testing on a comprehensive benchmark of 740 DRC rule implementations, our methodology demonstrates up to a 40\% improvement in code generation accuracy compared to basic text-based fine-tuning process. This fusion of industry expertise with advanced coding strategies not only optimizes SVRF development under limited dataset constraints but also creates a more intuitive and efficient coding environment. Consequently, users can rapidly iterate through design cycles, reduce manual error correction, and significantly improve overall productivity.

[157] arXiv:2507.00353 [pdf, html, other]
Title: Augmented Physics-Based Li-ion Battery Model via Adaptive Ensemble Sparse Learning and Conformal Prediction
Samuel Filgueira da Silva, Mehmet Fatih Ozkan, Faissal El Idrissi, Marcello Canova
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)

Accurate electrochemical models are essential for the safe and efficient operation of lithium-ion batteries in real-world applications such as electrified vehicles and grid storage. Reduced-order models (ROM) offer a balance between fidelity and computational efficiency but often struggle to capture complex and nonlinear behaviors, such as the dynamics in the cell voltage response under high C-rate conditions. To address these limitations, this study proposes an Adaptive Ensemble Sparse Identification (AESI) framework that enhances the accuracy of reduced-order li-ion battery models by compensating for unpredictable dynamics. The approach integrates an Extended Single Particle Model (ESPM) with an evolutionary ensemble sparse learning strategy to construct a robust hybrid model. In addition, the AESI framework incorporates a conformal prediction method to provide theoretically guaranteed uncertainty quantification for voltage error dynamics, thereby improving the reliability of the model's predictions. Evaluation across diverse operating conditions shows that the hybrid model (ESPM + AESI) improves the voltage prediction accuracy, achieving mean squared error reductions of up to 46% on unseen data. Prediction reliability is further supported by conformal prediction, yielding statistically valid prediction intervals with coverage ratios of 96.85% and 97.41% for the ensemble models based on bagging and stability selection, respectively.

[158] arXiv:2507.00355 [pdf, html, other]
Title: Question Decomposition for Retrieval-Augmented Generation
Paul J. L. Ammann, Jonas Golde, Alan Akbik
Comments: Accepted to ACL SRW 2025. 9 Pages, 2 Figures, 4 Tables
Subjects: Computation and Language (cs.CL)

Grounding large language models (LLMs) in verifiable external sources is a well-established strategy for generating reliable answers. Retrieval-augmented generation (RAG) is one such approach, particularly effective for tasks like question answering: it retrieves passages that are semantically related to the question and then conditions the model on this evidence. However, multi-hop questions, such as "Which company among NVIDIA, Apple, and Google made the biggest profit in 2023?," challenge RAG because relevant facts are often distributed across multiple documents rather than co-occurring in one source, making it difficult for standard RAG to retrieve sufficient information. To address this, we propose a RAG pipeline that incorporates question decomposition: (i) an LLM decomposes the original query into sub-questions, (ii) passages are retrieved for each sub-question, and (iii) the merged candidate pool is reranked to improve the coverage and precision of the retrieved evidence. We show that question decomposition effectively assembles complementary documents, while reranking reduces noise and promotes the most relevant passages before answer generation. Although reranking itself is standard, we show that pairing an off-the-shelf cross-encoder reranker with LLM-driven question decomposition bridges the retrieval gap on multi-hop questions and provides a practical, drop-in enhancement, without any extra training or specialized indexing. We evaluate our approach on the MultiHop-RAG and HotpotQA, showing gains in retrieval (MRR@10: +36.7%) and answer accuracy (F1: +11.6%) over standard RAG baselines.

[159] arXiv:2507.00356 [pdf, other]
Title: CGEarthEye:A High-Resolution Remote Sensing Vision Foundation Model Based on the Jilin-1 Satellite Constellation
Zhiwei Yi, Xin Cheng, Jingyu Ma, Ruifei Zhu, Junwei Tian, Yuanxiu Zhou, Xinge Zhao, Hongzhe Li
Comments: A Remote Sensing Fundation Model for Very High Resolution Images
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Deep learning methods have significantly advanced the development of intelligent rinterpretation in remote sensing (RS), with foundational model research based on large-scale pre-training paradigms rapidly reshaping various domains of Earth Observation (EO). However, compared to the open accessibility and high spatiotemporal coverage of medium-resolution data, the limited acquisition channels for ultra-high-resolution optical RS imagery have constrained the progress of high-resolution remote sensing vision foundation models (RSVFM). As the world's largest sub-meter-level commercial RS satellite constellation, the Jilin-1 constellation possesses abundant sub-meter-level image resources. This study proposes CGEarthEye, a RSVFM framework specifically designed for Jilin-1 satellite characteristics, comprising five backbones with different parameter scales with totaling 2.1 billion parameters. To enhance the representational capacity of the foundation model, we developed JLSSD, the first 15-million-scale multi-temporal self-supervised learning (SSL) dataset featuring global coverage with quarterly temporal sampling within a single year, constructed through multi-level representation clustering and sampling strategies. The framework integrates seasonal contrast, augmentation-based contrast, and masked patch token contrastive strategies for pre-training. Comprehensive evaluations across 10 benchmark datasets covering four typical RS tasks demonstrate that the CGEarthEye consistently achieves state-of-the-art (SOTA) performance. Further analysis reveals CGEarthEye's superior characteristics in feature visualization, model convergence, parameter efficiency, and practical mapping applications. This study anticipates that the exceptional representation capabilities of CGEarthEye will facilitate broader and more efficient applications of Jilin-1 data in traditional EO application.

[160] arXiv:2507.00358 [pdf, html, other]
Title: Data-Driven Exploration for a Class of Continuous-Time Linear--Quadratic Reinforcement Learning Problems
Yilie Huang, Xun Yu Zhou
Comments: 36 pages, 10 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)

We study reinforcement learning (RL) for the same class of continuous-time stochastic linear--quadratic (LQ) control problems as in \cite{huang2024sublinear}, where volatilities depend on both states and controls while states are scalar-valued and running control rewards are absent. We propose a model-free, data-driven exploration mechanism that adaptively adjusts entropy regularization by the critic and policy variance by the actor. Unlike the constant or deterministic exploration schedules employed in \cite{huang2024sublinear}, which require extensive tuning for implementations and ignore learning progresses during iterations, our adaptive exploratory approach boosts learning efficiency with minimal tuning. Despite its flexibility, our method achieves a sublinear regret bound that matches the best-known model-free results for this class of LQ problems, which were previously derived only with fixed exploration schedules. Numerical experiments demonstrate that adaptive explorations accelerate convergence and improve regret performance compared to the non-adaptive model-free and model-based counterparts.

[161] arXiv:2507.00363 [pdf, html, other]
Title: GDGS: 3D Gaussian Splatting Via Geometry-Guided Initialization And Dynamic Density Control
Xingjun Wang, Lianlei Shan
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We propose a method to enhance 3D Gaussian Splatting (3DGS)~\cite{Kerbl2023}, addressing challenges in initialization, optimization, and density control. Gaussian Splatting is an alternative for rendering realistic images while supporting real-time performance, and it has gained popularity due to its explicit 3D Gaussian representation. However, 3DGS heavily depends on accurate initialization and faces difficulties in optimizing unstructured Gaussian distributions into ordered surfaces, with limited adaptive density control mechanism proposed so far. Our first key contribution is a geometry-guided initialization to predict Gaussian parameters, ensuring precise placement and faster convergence. We then introduce a surface-aligned optimization strategy to refine Gaussian placement, improving geometric accuracy and aligning with the surface normals of the scene. Finally, we present a dynamic adaptive density control mechanism that adjusts Gaussian density based on regional complexity, for visual fidelity. These innovations enable our method to achieve high-fidelity real-time rendering and significant improvements in visual quality, even in complex scenes. Our method demonstrates comparable or superior results to state-of-the-art methods, rendering high-fidelity images in real time.

[162] arXiv:2507.00365 [pdf, other]
Title: An Improved U-Net Model for Offline handwriting signature denoising
Wanghui Xiao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Handwriting signatures, as an important means of identity recognition, are widely used in multiple fields such as financial transactions, commercial contracts and personal affairs due to their legal effect and uniqueness. In forensic science appraisals, the analysis of offline handwriting signatures requires the appraiser to provide a certain number of signature samples, which are usually derived from various historical contracts or archival materials. However, the provided handwriting samples are often mixed with a large amount of interfering information, which brings severe challenges to handwriting identification work. This study proposes a signature handwriting denoising model based on the improved U-net structure, aiming to enhance the robustness of the signature recognition system. By introducing discrete wavelet transform and PCA transform, the model's ability to suppress noise has been enhanced. The experimental results show that this modelis significantly superior to the traditional methods in denoising effect, can effectively improve the clarity and readability of the signed images, and provide more reliable technical support for signature analysis and recognition.

[163] arXiv:2507.00366 [pdf, html, other]
Title: Wireless AI Evolution: From Statistical Learners to Electromagnetic-Guided Foundation Models
Jian Xiao, Ji Wang, Kunrui Cao, Xingwang Li, Zhao Chen, Chau Yuen
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)

While initial applications of artificial intelligence (AI) in wireless communications over the past decade have demonstrated considerable potential using specialized models for targeted communication tasks, the revolutionary demands of sixth-generation (6G) networks for holographic communications, ubiquitous sensing, and native intelligence are propelling a necessary evolution towards AI-native wireless networks. The arrival of large AI models paves the way for the next phase of Wireless AI, driven by wireless foundation models (WFMs). In particular, pre-training on universal electromagnetic (EM) principles equips WFMs with the essential adaptability for a multitude of demanding 6G applications. However, existing large AI models face critical limitations, including pre-training strategies disconnected from EM-compliant constraints leading to physically inconsistent predictions, a lack of embedded understanding of wave propagation physics, and the inaccessibility of massive labeled datasets for comprehensive EM-aware training. To address these challenges, this article presents an electromagnetic information theory-guided self-supervised pre-training (EIT-SPT) framework designed to systematically inject EM physics into WFMs. The EIT-SPT framework aims to infuse WFMs with intrinsic EM knowledge, thereby enhancing their physical consistency, generalization capabilities across varied EM landscapes, and overall data efficiency. Building upon the proposed EIT-SPT framework, this article first elaborates on diverse potential applications in 6G scenarios of WFMs, then validates the efficacy of the proposed framework through illustrative case studies, and finally summarizes critical open research challenges and future directions for WFMs.

[164] arXiv:2507.00367 [pdf, html, other]
Title: Presto: Hardware Acceleration of Ciphers for Hybrid Homomorphic Encryption
Yeonsoo Jeon, Mattan Erez, Michael Orshansky
Subjects: Hardware Architecture (cs.AR); Cryptography and Security (cs.CR)

Hybrid Homomorphic Encryption (HHE) combines symmetric key and homomorphic encryption to reduce ciphertext expansion crucial in client-server deployments of HE. Special symmetric ciphers, amenable to efficient HE evaluation, have been developed. Their client-side deployment calls for performant and energy-efficient implementation, and in this paper we develop and evaluate hardware accelerators for the two known CKKS-targeting HHE ciphers, HERA and Rubato.
We design vectorized and overlapped functional modules. The design exploits transposition-invariance property of the MixColumns and MixRows function and alternates the order of intermediate state to eliminate bubbles in stream key generation, improving latency and throughput. We decouple the RNG and key computation phases to hide the latency of RNG and to reduce the critical path in FIFOs, achieving higher operating frequency.
We implement the accelerator on an AMD Virtex UltraScale+ FPGA. Both Rubato and HERA achieve a 6x improvement in throughput compared to the software implementation. In terms of latency, Rubato achieves a 5x reduction, while HERA achieves a 3x reduction. Additionally, our hardware implementations reduce energy consumption by 75x for Rubato and 47x for HERA compared to their software implementation.

[165] arXiv:2507.00368 [pdf, html, other]
Title: Out-of-Distribution Detection with Adaptive Top-K Logits Integration
Hikaru Shijo, Yutaka Yoshihama, Kenichi Yadani, Norifumi Murata
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Neural networks often make overconfident predictions from out-of-distribution (OOD) samples. Detection of OOD data is therefore crucial to improve the safety of machine learning. The simplest and most powerful method for OOD detection is MaxLogit, which uses the model's maximum logit to provide an OOD score. We have discovered that, in addition to the maximum logit, some other logits are also useful for OOD detection. Based on this finding, we propose a new method called ATLI (Adaptive Top-k Logits Integration), which adaptively determines effective top-k logits that are specific to each model and combines the maximum logit with the other top-k logits. In this study we evaluate our proposed method using ImageNet-1K benchmark. Extensive experiments showed our proposed method to reduce the false positive rate (FPR95) by 6.73% compared to the MaxLogit approach, and decreased FPR95 by an additional 2.67% compared to other state-of-the-art methods.

[166] arXiv:2507.00371 [pdf, other]
Title: PlantSegNeRF: A few-shot, cross-dataset method for plant 3D instance point cloud reconstruction via joint-channel NeRF with multi-view image instance matching
Xin Yang (1 and 2), Ruiming Du (3), Hanyang Huang (1 and 2), Jiayang Xie (1 and 2), Pengyao Xie (1 and 2), Leisen Fang (1 and 2), Ziyue Guo (1 and 2), Nanjun Jiang (4), Yu Jiang (5), Haiyan Cen (1 and 2) ((1) College of Biosystems Engineering and Food Science, Zhejiang University, (2) Key Laboratory of Spectroscopy Sensing, Ministry of Agriculture and Rural Affairs, (3) Department of Biological and Environmental Engineering, Cornell University, (4) Amway (China) Botanical R and D Center, (5) Horticulture Section, School of Integrative Plant Science, Cornell AgriTech)
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Organ segmentation of plant point clouds is a prerequisite for the high-resolution and accurate extraction of organ-level phenotypic traits. Although the fast development of deep learning has boosted much research on segmentation of plant point clouds, the existing techniques for organ segmentation still face limitations in resolution, segmentation accuracy, and generalizability across various plant species. In this study, we proposed a novel approach called plant segmentation neural radiance fields (PlantSegNeRF), aiming to directly generate high-precision instance point clouds from multi-view RGB image sequences for a wide range of plant species. PlantSegNeRF performed 2D instance segmentation on the multi-view images to generate instance masks for each organ with a corresponding ID. The multi-view instance IDs corresponding to the same plant organ were then matched and refined using a specially designed instance matching module. The instance NeRF was developed to render an implicit scene, containing color, density, semantic and instance information. The implicit scene was ultimately converted into high-precision plant instance point clouds based on the volume density. The results proved that in semantic segmentation of point clouds, PlantSegNeRF outperformed the commonly used methods, demonstrating an average improvement of 16.1%, 18.3%, 17.8%, and 24.2% in precision, recall, F1-score, and IoU compared to the second-best results on structurally complex datasets. More importantly, PlantSegNeRF exhibited significant advantages in plant point cloud instance segmentation tasks. Across all plant datasets, it achieved average improvements of 11.7%, 38.2%, 32.2% and 25.3% in mPrec, mRec, mCov, mWCov, respectively. This study extends the organ-level plant phenotyping and provides a high-throughput way to supply high-quality 3D data for the development of large-scale models in plant science.

[167] arXiv:2507.00372 [pdf, html, other]
Title: Efficient Depth- and Spatially-Varying Image Simulation for Defocus Deblur
Xinge Yang, Chuong Nguyen, Wenbin Wang, Kaizhang Kang, Wolfgang Heidrich, Xiaoxing Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Modern cameras with large apertures often suffer from a shallow depth of field, resulting in blurry images of objects outside the focal plane. This limitation is particularly problematic for fixed-focus cameras, such as those used in smart glasses, where adding autofocus mechanisms is challenging due to form factor and power constraints. Due to unmatched optical aberrations and defocus properties unique to each camera system, deep learning models trained on existing open-source datasets often face domain gaps and do not perform well in real-world settings. In this paper, we propose an efficient and scalable dataset synthesis approach that does not rely on fine-tuning with real-world data. Our method simultaneously models depth-dependent defocus and spatially varying optical aberrations, addressing both computational complexity and the scarcity of high-quality RGB-D datasets. Experimental results demonstrate that a network trained on our low resolution synthetic images generalizes effectively to high resolution (12MP) real-world images across diverse scenes.

[168] arXiv:2507.00373 [pdf, html, other]
Title: Customizable ROI-Based Deep Image Compression
Ian Jin, Fanxin Xia, Feng Ding, Xinfeng Zhang, Meiqin Liu, Yao Zhao, Weisi Lin, Lili Meng
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Region of Interest (ROI)-based image compression optimizes bit allocation by prioritizing ROI for higher-quality reconstruction. However, as the users (including human clients and downstream machine tasks) become more diverse, ROI-based image compression needs to be customizable to support various preferences. For example, different users may define distinct ROI or require different quality trade-offs between ROI and non-ROI. Existing ROI-based image compression schemes predefine the ROI, making it unchangeable, and lack effective mechanisms to balance reconstruction quality between ROI and non-ROI. This work proposes a paradigm for customizable ROI-based deep image compression. First, we develop a Text-controlled Mask Acquisition (TMA) module, which allows users to easily customize their ROI for compression by just inputting the corresponding semantic \emph{text}. It makes the encoder controlled by text. Second, we design a Customizable Value Assign (CVA) mechanism, which masks the non-ROI with a changeable extent decided by users instead of a constant one to manage the reconstruction quality trade-off between ROI and non-ROI. Finally, we present a Latent Mask Attention (LMA) module, where the latent spatial prior of the mask and the latent Rate-Distortion Optimization (RDO) prior of the image are extracted and fused in the latent space, and further used to optimize the latent representation of the source image. Experimental results demonstrate that our proposed customizable ROI-based deep image compression paradigm effectively addresses the needs of customization for ROI definition and mask acquisition as well as the reconstruction quality trade-off management between the ROI and non-ROI.

[169] arXiv:2507.00376 [pdf, html, other]
Title: Adaptive finite element convergence analysis of AT1 phase-field model for quasi-static fracture in strain-limiting solids
Ram Manohar, S. M. Mallikarjunaiah
Comments: arXiv admin note: text overlap with arXiv:2505.19801
Subjects: Numerical Analysis (math.NA)

This research rigorously investigates the convergence of adaptive finite element methods for regularized variational models of quasi-static brittle fracture in elastic solids. We specifically examine a novel Ambrosio-Tortorelli (AT1) phase-field model within the framework of elasticity theories, particularly for material models characterized by an algebraically nonlinear stress-strain relationship. Two distinct and novel adaptive mesh refinement algorithms, underpinned by robust local error indicators, were introduced to efficiently solve the underlying nonlinear energy minimization problem. A detailed convergence analysis was conducted on the sequences of minimizers produced by these strategies. Our findings rigorously demonstrate that the minimizer sequences from the first adaptive algorithm achieve convergence to a predefined tolerance. Crucially, the second algorithm is proven to generate inherently convergent sequences, thereby eliminating the need for an explicit stopping criterion. The practical effectiveness of this proposed adaptive framework is thoroughly validated through extensive numerical simulations. A case study involving an edge crack in an elastic body, governed by an algebraically nonlinear strain-limiting relationship and subjected to anti-plane shear-type loading, is presented. Critical comparisons of the energy components-bulk, surface, and total-showcase the superior performance of both adaptive algorithms.

[170] arXiv:2507.00377 [pdf, html, other]
Title: MedDiff-FT: Data-Efficient Diffusion Model Fine-tuning with Structural Guidance for Controllable Medical Image Synthesis
Jianhao Xie, Ziang Zhang, Zhenyu Weng, Yuesheng Zhu, Guibo Luo
Comments: 11 pages,3 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recent advancements in deep learning for medical image segmentation are often limited by the scarcity of high-quality training this http URL diffusion models provide a potential solution by generating synthetic images, their effectiveness in medical imaging remains constrained due to their reliance on large-scale medical datasets and the need for higher image quality. To address these challenges, we present MedDiff-FT, a controllable medical image generation method that fine-tunes a diffusion foundation model to produce medical images with structural dependency and domain specificity in a data-efficient manner. During inference, a dynamic adaptive guiding mask enforces spatial constraints to ensure anatomically coherent synthesis, while a lightweight stochastic mask generator enhances diversity through hierarchical randomness injection. Additionally, an automated quality assessment protocol filters suboptimal outputs using feature-space metrics, followed by mask corrosion to refine fidelity. Evaluated on five medical segmentation datasets,MedDiff-FT's synthetic image-mask pairs improve SOTA method's segmentation performance by an average of 1% in Dice score. The framework effectively balances generation quality, diversity, and computational efficiency, offering a practical solution for medical data augmentation. The code is available at this https URL.

[171] arXiv:2507.00378 [pdf, html, other]
Title: iPanda: An Intelligent Protocol Testing and Debugging Agent for Conformance Testing
Xikai Sun, Fan Dang, Kebin Liu, Xin Miao, Zihao Yang, Haimo Lu, Yawen Zheng, Yunhao Liu
Comments: 14 pages, 6 figures
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Conformance testing is essential for ensuring that protocol implementations comply with their specifications. However, traditional testing approaches involve manually creating numerous test cases and scripts, making the process labor-intensive and inefficient. Recently, Large Language Models (LLMs) have demonstrated impressive text comprehension and code generation abilities, providing promising opportunities for automation. In this paper, we propose iPanda, the first end-to-end framework that leverages LLMs to automate protocol conformance testing. Given a protocol specification document and its implementation, iPanda first employs a keyword-based method to automatically generate comprehensive test cases. Then, it utilizes a code-based retrieval-augmented generation approach to effectively interpret the implementation and produce executable test code. To further enhance code quality, iPanda incorporates an iterative self-correction mechanism to refine generated test scripts interactively. Finally, by executing and analyzing the generated tests, iPanda systematically verifies compliance between implementations and protocol specifications. Comprehensive experiments on various protocols show that iPanda significantly outperforms pure LLM-based approaches, improving the success rate (Pass@1) of test-code generation by factors ranging from 4.675 times to 10.751 times.

[172] arXiv:2507.00379 [pdf, html, other]
Title: Towards Robustness: A Critique of Current Vector Database Assessments
Zikai Wang, Qianxi Zhang, Baotong Lu, Qi Chen, Cheng Tan
Subjects: Databases (cs.DB)

Vector databases are critical infrastructure in AI systems, and average recall is the dominant metric for their evaluation. Both users and researchers rely on it to choose and optimize their systems. We show that relying on average recall is problematic. It hides variability across queries, allowing systems with strong mean performance to underperform significantly on hard queries. These tail cases confuse users and can lead to failure in downstream applications such as RAG. We argue that robustness consistently achieving acceptable recall across queries is crucial to vector database evaluation. We propose Robustness-$\delta$@K, a new metric that captures the fraction of queries with recall above a threshold $\delta$. This metric offers a deeper view of recall distribution, helps vector index selection regarding application needs, and guides the optimization of tail performance. We integrate Robustness-$\delta$@K into existing benchmarks and evaluate mainstream vector indexes, revealing significant robustness differences. More robust vector indexes yield better application performance, even with the same average recall. We also identify design factors that influence robustness, providing guidance for improving real-world performance.

[173] arXiv:2507.00380 [pdf, html, other]
Title: Gregorian melody, modality, and memory: Segmenting chant with Bayesian nonparametrics
Vojtěch Lanz, Jan Hajič jr
Subjects: Computation and Language (cs.CL)

The idea that Gregorian melodies are constructed from some vocabulary of segments has long been a part of chant scholarship. This so-called "centonisation" theory has received much musicological criticism, but frequent re-use of certain melodic segments has been observed in chant melodies, and the intractable number of possible segmentations allowed the option that some undiscovered segmentation exists that will yet prove the value of centonisation, and recent empirical results have shown that segmentations can outperform music-theoretical features in mode classification. Inspired by the fact that Gregorian chant was memorised, we search for an optimal unsupervised segmentation of chant melody using nested hierarchical Pitman-Yor language models. The segmentation we find achieves state-of-the-art performance in mode classification. Modeling a monk memorising the melodies from one liturgical manuscript, we then find empirical evidence for the link between mode classification and memory efficiency, and observe more formulaic areas at the beginnings and ends of melodies corresponding to the practical role of modality in performance. However, the resulting segmentations themselves indicate that even such a memory-optimal segmentation is not what is understood as centonisation.

[174] arXiv:2507.00387 [pdf, html, other]
Title: A Review on Zeroing Neural Networks
Chengze Jiang, Jie Gui, Long Jin, Shuai Li
Comments: This is what we submitted to IJCAI 2023. Maybe we will update this paper in the future
Subjects: Neural and Evolutionary Computing (cs.NE)

Zeroing neural networks (ZNNs) have demonstrated outstanding performance on time-varying optimization and control problems. Nonetheless, few studies are committed to illustrating the relationship among different ZNNs and the derivation of them. Therefore, reviewing the advances for a systematical understanding of this field is desirable. This paper provides a survey of ZNNs' progress regarding implementing methods, analysis theory, and practical applications.

[175] arXiv:2507.00388 [pdf, html, other]
Title: Accuracy and Security-Guaranteed Participant Selection and Beamforming Design for RIS-Assisted Federated Learning
Mengru Wu, Yu Gao, Weidang Lu, Huimei Han, Lei Sun, Wanli Ni
Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)

Federated learning (FL) has emerged as an effective approach for training neural network models without requiring the sharing of participants' raw data, thereby addressing data privacy concerns. In this paper, we propose a reconfigurable intelligent surface (RIS)-assisted FL framework in the presence of eavesdropping, where partial edge devices are selected to participate in the FL training process. In contrast, the remaining devices serve as cooperative jammers by transmitting jamming signals to disrupt eavesdropping. We aim to minimize the training latency in each FL round by jointly optimizing participant selection, bandwidth allocation, and RIS beamforming design, subject to the convergence accuracy of FL and the secure uploading requirements. To solve the resulting mixed-integer nonlinear programming problem, we propose a twin delayed deep deterministic policy gradient (TD3) algorithm. Simulation results demonstrate that the proposed scheme reduces the FL training latency by approximately 27$\%$ compared to baselines.

[176] arXiv:2507.00389 [pdf, html, other]
Title: Causal Prompting for Implicit Sentiment Analysis with Large Language Models
Jing Ren, Wenhao Zhou, Bowen Li, Mujie Liu, Nguyen Linh Dan Le, Jiade Cen, Liping Chen, Ziqi Xu, Xiwei Xu, Xiaodong Li
Subjects: Computation and Language (cs.CL)

Implicit Sentiment Analysis (ISA) aims to infer sentiment that is implied rather than explicitly stated, requiring models to perform deeper reasoning over subtle contextual cues. While recent prompting-based methods using Large Language Models (LLMs) have shown promise in ISA, they often rely on majority voting over chain-of-thought (CoT) reasoning paths without evaluating their causal validity, making them susceptible to internal biases and spurious correlations. To address this challenge, we propose CAPITAL, a causal prompting framework that incorporates front-door adjustment into CoT reasoning. CAPITAL decomposes the overall causal effect into two components: the influence of the input prompt on the reasoning chains, and the impact of those chains on the final output. These components are estimated using encoder-based clustering and the NWGM approximation, with a contrastive learning objective used to better align the encoder's representation with the LLM's reasoning space. Experiments on benchmark ISA datasets with three LLMs demonstrate that CAPITAL consistently outperforms strong prompting baselines in both accuracy and robustness, particularly under adversarial conditions. This work offers a principled approach to integrating causal inference into LLM prompting and highlights its benefits for bias-aware sentiment reasoning. The source code and case study are available at: this https URL.

[177] arXiv:2507.00390 [pdf, html, other]
Title: MoNE: Replacing Redundant Experts with Lightweight Novices for Structured Pruning of MoE
Geng Zhang, Yuxuan Han, Yuxuan Lou, Wangbo Zhao, Yiqi Zhang, Yang You
Subjects: Machine Learning (cs.LG)

Mixture-of-Experts (MoE) enables efficient scaling of large language models by activating only a subset of experts per input token. However, deploying MoE-based models incurs significant memory overhead due to the need to retain all experts in memory. While structured pruning is promising to reduce memory costs, existing methods often show suboptimal performance and unstable degradation in three dimensions: model architectures, calibration data sources, and calibration sample sizes. This paper proposes Mixture-of-Novices-and-Experts (MoNE), a novel expert pruning method that replaces redundant experts with lightweight novices to achieve effective and robust model compression. MoNE evaluates expert redundancy based on two metrics: access frequency and output variance. Experts exhibiting low usage and stable outputs are pruned and replaced with lightweight novices-unbiased estimations of their original outputs-minimizing performance degradation. Extensive experiments demonstrate that MoNE consistently outperforms baseline methods with minimal accuracy degradation across the three dimensions, confirming its effectiveness and robustness. Notably, it improves the average zero shot accuracy across nine downstream tasks by up to 2.71 under 25\% pruning ratio and 3.61 under 50\% pruning. The code is available at this https URL.

[178] arXiv:2507.00392 [pdf, html, other]
Title: Learning Dense Feature Matching via Lifting Single 2D Image to 3D Space
Yingping Liang, Yutao Hu, Wenqi Shao, Ying Fu
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Feature matching plays a fundamental role in many computer vision tasks, yet existing methods heavily rely on scarce and clean multi-view image collections, which constrains their generalization to diverse and challenging scenarios. Moreover, conventional feature encoders are typically trained on single-view 2D images, limiting their capacity to capture 3D-aware correspondences. In this paper, we propose a novel two-stage framework that lifts 2D images to 3D space, named as \textbf{Lift to Match (L2M)}, taking full advantage of large-scale and diverse single-view images. To be specific, in the first stage, we learn a 3D-aware feature encoder using a combination of multi-view image synthesis and 3D feature Gaussian representation, which injects 3D geometry knowledge into the encoder. In the second stage, a novel-view rendering strategy, combined with large-scale synthetic data generation from single-view images, is employed to learn a feature decoder for robust feature matching, thus achieving generalization across diverse domains. Extensive experiments demonstrate that our method achieves superior generalization across zero-shot evaluation benchmarks, highlighting the effectiveness of the proposed framework for robust feature matching.

[179] arXiv:2507.00394 [pdf, html, other]
Title: HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism
Geng Zhang, Shenggan Cheng, Xuanlei Zhao, Ziming Liu, Yang You
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

As transformer sequence lengths grow, existing pipeline parallelisms incur suboptimal performance due to the quadratic attention computation and the substantial memory overhead. To relieve these challenges, we propose HelixPipe, a novel pipeline parallelism for long sequence transformer training. First, HelixPipe introduces attention parallel partition, which schedules attention computations of different micro batches across different pipeline stages in parallel, reducing pipeline bubbles. Second, it employs a two-fold first-in-last-out micro batch schedule to balance memory usage and overlap communication with computation. Additionally, HelixPipe utilizes recomputation without attention and chunked MLP to mitigate fragmentation and enable longer sequences. Experiments demonstrate that HelixPipe gains increasing advantages with longer sequence lengths, and outperforms existing methods in throughput and scalability across varying pipeline sizes, model sizes, and cluster configurations. Notably, it achieves a 26\% speedup over baseline methods when training a 7B model with 128k sequence length on 64 H20 GPUs. Code is available at this https URL.

[180] arXiv:2507.00401 [pdf, other]
Title: Few-shot Classification as Multi-instance Verification: Effective Backbone-agnostic Transfer across Domains
Xin Xu, Eibe Frank, Geoffrey Holmes
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

We investigate cross-domain few-shot learning under the constraint that fine-tuning of backbones (i.e., feature extractors) is impossible or infeasible -- a scenario that is increasingly common in practical use cases. Handling the low-quality and static embeddings produced by frozen, "black-box" backbones leads to a problem representation of few-shot classification as a series of multiple instance verification (MIV) tasks. Inspired by this representation, we introduce a novel approach to few-shot domain adaptation, named the "MIV-head", akin to a classification head that is agnostic to any pretrained backbone and computationally efficient. The core components designed for the MIV-head, when trained on few-shot data from a target domain, collectively yield strong performance on test data from that domain. Importantly, it does so without fine-tuning the backbone, and within the "meta-testing" phase. Experimenting under various settings and on an extension of the Meta-dataset benchmark for cross-domain few-shot image classification, using representative off-the-shelf convolutional neural network and vision transformer backbones pretrained on ImageNet1K, we show that the MIV-head achieves highly competitive accuracy when compared to state-of-the-art "adapter" (or partially fine-tuning) methods applied to the same backbones, while incurring substantially lower adaptation cost. We also find well-known "classification head" approaches lag far behind in terms of accuracy. Ablation study empirically justifies the core components of our approach. We share our code at this https URL.

[181] arXiv:2507.00406 [pdf, html, other]
Title: Partnering with AI: A Pedagogical Feedback System for LLM Integration into Programming Education
Niklas Scholz, Manh Hung Nguyen, Adish Singla, Tomohiro Nagashima
Comments: This is an extended version of a poster paper accepted and published at ECTEL-2025
Subjects: Computers and Society (cs.CY)

Feedback is one of the most crucial components to facilitate effective learning. With the rise of large language models (LLMs) in recent years, research in programming education has increasingly focused on automated feedback generation to help teachers provide timely support to every student. However, prior studies often overlook key pedagogical principles, such as mastery and progress adaptation, that shape effective feedback strategies. This paper introduces a novel pedagogical framework for LLM-driven feedback generation derived from established feedback models and local insights from secondary school teachers. To evaluate this framework, we implemented a web-based application for Python programming with LLM-based feedback that follows the framework and conducted a mixed-method evaluation with eight secondary-school computer science teachers. Our findings suggest that teachers consider that, when aligned with the framework, LLMs can effectively support students and even outperform human teachers in certain scenarios through instant and precise feedback. However, we also found several limitations, such as its inability to adapt feedback to dynamic classroom contexts. Such a limitation highlights the need to complement LLM-generated feedback with human expertise to ensure effective student learning. This work demonstrates an effective way to use LLMs for feedback while adhering to pedagogical standards and highlights important considerations for future systems.

[182] arXiv:2507.00409 [pdf, html, other]
Title: Eilenberg correspondence for Stone recognition
Jorge Almeida, Ondřej Klíma
Subjects: Formal Languages and Automata Theory (cs.FL)

We develop and explore the idea of recognition of languages (in the general sense of subsets of topological algebras) as preimages of clopen sets under continuous homomorphisms into Stone topological algebras. We obtain an Eilenberg correspondence between varieties of languages and varieties of ordered Stone topological algebras and a Birkhoff/Reiterman-type theorem showing that the latter may me defined by certain pseudo-inequalities. In the case of classical formal languages, of words over a finite alphabet, we also show how this extended framework goes beyond the class of regular languages by working with Stone completions of minimal automata, viewed as unary algebras. This leads to a general method for showing that a language does not belong to a variety of languages, expressed in terms of sequences of pairs of words, which is illustrated when the class consists of all finite intersections of context-free languages.

[183] arXiv:2507.00411 [pdf, html, other]
Title: Diffusion Disambiguation Models for Partial Label Learning
Jinfu Fan, Xiaohui Zhong, Kangrui Ren, Jiangnan Li, Linqing Huang
Subjects: Machine Learning (cs.LG)

Learning from ambiguous labels is a long-standing problem in practical machine learning applications. The purpose of \emph{partial label learning} (PLL) is to identify the ground-truth label from a set of candidate labels associated with a given instance. Inspired by the remarkable performance of diffusion models in various generation tasks, this paper explores their potential to denoise ambiguous labels through the reverse denoising process. Therefore, this paper reformulates the label disambiguation problem from the perspective of generative models, where labels are generated by iteratively refining initial random guesses. This perspective enables the diffusion model to learn how label information is generated stochastically. By modeling the generation uncertainty, we can use the maximum likelihood estimate of the label for classification inference. However, such ambiguous labels lead to a mismatch between instance and label, which reduces the quality of generated data. To address this issue, this paper proposes a \emph{diffusion disambiguation model for PLL} (DDMP), which first uses the potential complementary information between instances and labels to construct pseudo-clean labels for initial diffusion training. Furthermore, a transition-aware matrix is introduced to estimate the potential ground-truth labels, which are dynamically updated during the diffusion generation. During training, the ground-truth label is progressively refined, improving the classifier. Experiments show the advantage of the DDMP and its suitability for PLL.

[184] arXiv:2507.00412 [pdf, html, other]
Title: ViscoReg: Neural Signed Distance Functions via Viscosity Solutions
Meenakshi Krishnan, Ramani Duraiswami
Comments: 14 pages, 6 figures
Subjects: Graphics (cs.GR)

Implicit Neural Representations (INRs) that learn a Signed Distance Function (SDF) are a powerful tool for continuous 3D scene reconstruction. These models are trained by enforcing the Eikonal equation. We demonstrate theoretically that despite the ill-posedness of the Eikonal equation, generalization error estimates may be obtained for Neural SDFs in terms of the training error. However, training with the Eikonal loss can lead to unstable gradient flows, necessitating alternate stabilization techniques. Traditional numerical solvers for the equation have relied on viscosity approaches for regularization. We enhance Neural SDF training using this well-developed theory, and introduce a new loss formulation we call ViscoReg. We theoretically demonstrate the stability of the gradient flow equation of our proposed loss term. Empirically, ViscoReg outperforms state-of-the-art approaches such as SIREN, DiGS, and StEik without adding significant computational cost.

[185] arXiv:2507.00413 [pdf, html, other]
Title: Recommending Variable Names for Extract Local Variable Refactorings
Taiming Wang, Hui Liu, Yuxia Zhang, Yanjie Jiang
Comments: Accepted by TOSEM
Subjects: Software Engineering (cs.SE)

Extract local variable is one of the most popular refactorings, and most IDEs and refactoring tools provide automated support for this refactoring. However, we find approximately 70% of the names recommended by these IDEs are different from what developers manually constructed, adding additional renaming burdens to developers and providing limited assistance. In this paper, we introduce VarNamer, an automated approach designed to recommend variable names for extract local variable refactorings. Through a large-scale empirical study, we identify key contexts that are useful for composing variable names. Leveraging these insights, we developed a set of heuristic rules through program static analysis techniques and employ data mining techniques to recommend variable names effectively. Notably, some of our heuristic rules have been successfully integrated into Eclipse, where they are now distributed with the latest releases of the IDE. Evaluation demonstrates its superiority over state-of-the-art IDEs. Specifically, VarNamer significantly increases the chance of exact match by 52.6% compared to Eclipse and 40.7% compared to IntelliJ IDEA. We also evaluated the proposed approach with real-world extract local variable refactorings conducted in C++ projects, and the results suggest that the approach can achieve comparable performance on programming languages besides Java. It may suggest the generalizability of VarNamer. Finally, we designed and conducted a user study and the results of the user study suggest that our approach can speed up the refactoring by 27.8% and reduce 49.3% edits on the recommended variable names.

[186] arXiv:2507.00415 [pdf, html, other]
Title: Minimal Construction of Graphs with Maximum Robustness
Haejoon Lee, Dimitra Panagou
Subjects: Systems and Control (eess.SY); Social and Information Networks (cs.SI)

The notions of network $r$-robustness and $(r,s)$-robustness have been earlier introduced in the literature to achieve resilient control in the presence of misbehaving agents. However, while higher robustness levels provide networks with higher tolerances against the misbehaving agents, they also require dense communication structures, which are not always desirable for systems with limited capabilities and energy capacities. Therefore, this paper studies the fundamental structures behind $r$-robustness and $(r,s)$- robustness properties in two different ways. (a) We first explore and establish the tight necessary conditions on the number of edges for undirected graphs with any nodes must satisfy to achieve maximum $r$- and $(r,s)$-robustness. (b) We then use these conditions to construct two classes of undirected graphs, referred as to $\gamma$- and $(\gamma,\gamma)$-Minimal Edge Robust Graphs (MERGs), that provably achieve maximum robustness with minimal numbers of edges. We finally validate our work through some sets of simulations.

[187] arXiv:2507.00416 [pdf, html, other]
Title: Evo-0: Vision-Language-Action Model with Implicit Spatial Understanding
Tao Lin, Gen Li, Yilei Zhong, Yanwen Zou, Bo Zhao
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Vision-Language-Action (VLA) models have emerged as a promising framework for enabling generalist robots capable of perceiving, reasoning, and acting in the real world. These models usually build upon pretrained Vision-Language Models (VLMs), which excel at semantic understanding due to large-scale text pretraining. However, VLMs typically lack precise spatial understanding capabilities, as they are primarily tuned on 2D image-text pairs without 3D supervision. To address this limitation, recent approaches have incorporated explicit 3D inputs such as point clouds or depth maps, but this necessitates additional depth sensors or defective estimation. In contrast, our work introduces a plug-and-play module that implicitly injects 3D geometry features into VLA models by leveraging an off-the-shelf visual geometry foundation models. We design five spatially challenging tasks that require precise spatial understanding ability to validate effectiveness of our method. Extensive evaluations show that our method significantly improves the performance of state-of-the-art VLA models across diverse scenarios.

[188] arXiv:2507.00417 [pdf, html, other]
Title: ASTRO: Teaching Language Models to Reason by Reflecting and Backtracking In-Context
Joongwon Kim, Anirudh Goyal, Liang Tan, Hannaneh Hajishirzi, Srinivasan Iyer, Tianlu Wang
Comments: 36 pages, 23 figures
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

We introduce ASTRO, the "Autoregressive Search-Taught Reasoner", a framework for training language models to reason like search algorithms, explicitly leveraging self-reflection, backtracking, and exploration in their outputs. Recently, training large language models (LLMs) via reinforcement learning (RL) has led to the advent of reasoning models with greatly enhanced reasoning capabilities. Open-source replications of reasoning models, while successful, build upon models that already exhibit strong reasoning capabilities along with search behavior observed even before RL. As a result, it is yet unclear how to boost the reasoning capabilities of other non-reasoner models including Llama 3. ASTRO teaches such models to internalize structured search behavior through a synthetic dataset derived from Monte Carlo Tree Search (MCTS) over mathematical problem-solving trajectories. By converting search traces into natural language chain-of-thoughts that capture both successes and recoveries from failure, ASTRO bootstraps models with a rich prior for exploration during RL. We finetune our models on these search-derived traces and further improve performance via RL with verifiable rewards. We apply ASTRO to the Llama 3 family of models and achieve absolute performance gains of 16.0% on MATH-500, 26.9% on AMC 2023, and 20.0% on AIME 2024, especially improving upon challenging problems that require iterative correction. Our results demonstrate that search-inspired training offers a principled way to instill robust reasoning capabilities into open LLMs.

[189] arXiv:2507.00418 [pdf, html, other]
Title: Serving LLMs in HPC Clusters: A Comparative Study of Qualcomm Cloud AI 100 Ultra and High-Performance GPUs
Mohammad Firas Sada, John J. Graham, Elham E Khoda, Mahidhar Tatineni, Dmitry Mishin, Rajesh K. Gupta, Rick Wagner, Larry Smarr, Thomas A. DeFanti, Frank Würthwein
Comments: To appear in Proceedings of the Practice and Experience in Advanced Research Computing (PEARC '25)
Journal-ref: Proceedings of the Practice and Experience in Advanced Research Computing PEARC25 2025
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)

This study presents a benchmarking analysis of the Qualcomm Cloud AI 100 Ultra (QAic) accelerator for large language model (LLM) inference, evaluating its energy efficiency (throughput per watt) and performance against leading NVIDIA (A100, H200) and AMD (MI300A) GPUs within the National Research Platform (NRP) ecosystem. A total of 15 open-source LLMs, ranging from 117 million to 90 billion parameters, are served using the vLLM framework. The QAic inference cards appears to be energy efficient and performs well in the energy efficiency metric in most cases. The findings offer insights into the potential of the Qualcomm Cloud AI 100 Ultra for high-performance computing (HPC) applications within the National Research Platform (NRP).

[190] arXiv:2507.00421 [pdf, html, other]
Title: Embedded DevOps: A Survey on the Application of DevOps Practices in Embedded Software and Firmware Development
Parthiv Katapara, Anand Sharma
Comments: This paper present survey on DevOps practices which exists in Embedded Software development
Subjects: Software Engineering (cs.SE)

The adoption of DevOps practices in embedded systems and firmware development is emerging as a response to the growing complexity of modern hardware--software co-designed products. Unlike cloud-native applications, embedded systems introduce challenges such as hardware dependency, real-time constraints, and safety-critical requirements. This literature review synthesizes findings from 20 academic and industrial sources to examine how DevOps principles--particularly continuous integration, continuous delivery, and automated testing--are adapted to embedded contexts. We categorize efforts across tooling, testing strategies, pipeline automation, and security practices. The review highlights current limitations in deployment workflows and observability, proposing a roadmap for future research. This work offers researchers and practitioners a consolidated understanding of Embedded DevOps, bridging fragmented literature with a structured perspective.

[191] arXiv:2507.00422 [pdf, html, other]
Title: Evolutionary Dynamics with Self-Interaction Learning in Networked Systems
Ziyan Zeng, Minyu Feng, Attila Szolnoki
Comments: 15 pages, 10 figures
Subjects: Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)

The evolution of cooperation in networked systems helps to understand the dynamics in social networks, multi-agent systems, and biological species. The self-persistence of individual strategies is common in real-world decision making. The self-replacement of strategies in evolutionary dynamics forms a selection amplifier, allows an agent to insist on its autologous strategy, and helps the networked system to avoid full defection. In this paper, we study the self-interaction learning in the networked evolutionary dynamics. We propose a self-interaction landscape to capture the strength of an agent's self-loop to reproduce the strategy based on local topology. We find that proper self-interaction can reduce the condition for cooperation and help cooperators to prevail in the system. For a system that favors the evolution of spite, the self-interaction can save cooperative agents from being harmed. Our results on random networks further suggest that an appropriate self-interaction landscape can significantly reduce the critical condition for advantageous mutants, especially for large-degree networks.

[192] arXiv:2507.00423 [pdf, html, other]
Title: Find a Scapegoat: Poisoning Membership Inference Attack and Defense to Federated Learning
Wenjin Mo, Zhiyuan Li, Minghong Fang, Mingwei Fang
Comments: To appear in ICCV 2025
Subjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Federated learning (FL) allows multiple clients to collaboratively train a global machine learning model with coordination from a central server, without needing to share their raw data. This approach is particularly appealing in the era of privacy regulations like the GDPR, leading many prominent companies to adopt it. However, FL's distributed nature makes it susceptible to poisoning attacks, where malicious clients, controlled by an attacker, send harmful data to compromise the model. Most existing poisoning attacks in FL aim to degrade the model's integrity, such as reducing its accuracy, with limited attention to privacy concerns from these attacks. In this study, we introduce FedPoisonMIA, a novel poisoning membership inference attack targeting FL. FedPoisonMIA involves malicious clients crafting local model updates to infer membership information. Additionally, we propose a robust defense mechanism to mitigate the impact of FedPoisonMIA attacks. Extensive experiments across various datasets demonstrate the attack's effectiveness, while our defense approach reduces its impact to a degree.

[193] arXiv:2507.00424 [pdf, html, other]
Title: Multi-Agent Coordination under Poisson Observations: A Global Game Approach
Marcos M. Vasconcelos, Behrouz Touri
Subjects: Systems and Control (eess.SY)

We study a model of strategic coordination based on a class of games with incomplete information known as Global Games. Under the assumption of Poisson-distributed signals and a Gamma prior distribution on state of the system, we demonstrate the existence of a Bayesian Nash equilibrium within the class of threshold policies for utility functions that are linear in the agents' actions. Although computing the exact threshold that constitutes an equilibrium in a system with finitely many agents is a highly non-trivial task, the problem becomes tractable by analyzing the game's potential function with countably infinitely many agents. Through numerical examples, we provide evidence that the resulting potential function is unimodal, exhibiting a well-defined maximum. Our results are applicable to the modeling of bacterial Quorum Sensing systems, whose noisy observation signals are often well-approximated using Poisson processes.

[194] arXiv:2507.00425 [pdf, other]
Title: Flexible Language Modeling in Continuous Space with Transformer-based Autoregressive Flows
Ruixiang Zhang, Shuangfei Zhai, Jiatao Gu, Yizhe Zhang, Huangjie Zheng, Tianrong Chen, Miguel Angel Bautista, Josh Susskind, Navdeep Jaitly
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Autoregressive models have driven remarkable progress in language modeling. Their foundational reliance on discrete tokens, unidirectional context, and single-pass decoding, while central to their success, also inspires the exploration of a design space that could offer new axes of modeling flexibility. In this work, we explore an alternative paradigm, shifting language modeling from a discrete token space to a continuous latent space. We propose a novel framework TarFlowLM, that employs transformer-based autoregressive normalizing flows to model these continuous representations. This approach unlocks substantial flexibility, enabling the construction of models that can capture global bi-directional context through stacked, alternating-direction autoregressive transformations, support block-wise generation with flexible token patch sizes, and facilitate a hierarchical multi-pass generation process. We further propose new mixture-based coupling transformations designed to capture complex dependencies within the latent space shaped by discrete data, and demonstrate theoretical connections to conventional discrete autoregressive models. Extensive experiments on language modeling benchmarks demonstrate strong likelihood performance and highlight the flexible modeling capabilities inherent in our framework.

[195] arXiv:2507.00427 [pdf, html, other]
Title: Zero-Knowledge Verifiable Graph Query Evaluation via Expansion-Centric Operator Decomposition
Hao Wu, Changzheng Wei, Yanhao Wang, Li Lin, Yilong Leng, Shiyu He, Minghao Zhao, Hanghang Wu, Ying Yan, Aoying Zhou
Subjects: Databases (cs.DB)

This paper investigates the feasibility of achieving zero-knowledge verifiability for graph databases, enabling database owners to cryptographically prove the query execution correctness without disclosing the underlying data. Although similar capabilities have been explored for relational databases, their implementation for graph databases presents unique challenges. This is mainly attributed to the relatively large complexity of queries in graph databases. When translating graph queries into arithmetic circuits, the circuit scale can be too large to be practically evaluated. To address this issue, we propose to break down graph queries into more fine-grained, primitive operators, enabling a step-by-step evaluation through smaller-scale circuits. Accordingly, the verification with ZKP circuits of complex graph queries can be decomposed into a series of composable cryptographic primitives, each designed to verify a fundamental structural property such as path ordering or edge directionality. Especially, having noticed that the graph expansion (i.e., traversing from nodes to their neighbors along edges) operation serves as the backbone of graph query evaluation, we design the expansion centric operator decomposition. In addition to constructing circuits for the expansion primitives, we also design specialized ZKP circuits for the various attributes that augment this traversal. The circuits are meticulously designed to take advantage of PLONKish arithmetization. By integrating these optimized circuits, we implement ZKGraph, a system that provides verifiable query processing while preserving data privacy. Performance evaluation indicates that ZKGraph significantly outperforms naive in circuit implementations of graph operators, achieving substantial improvements in both runtime and memory consumption.

[196] arXiv:2507.00428 [pdf, html, other]
Title: Real-Time In-Network Machine Learning on P4-Programmable FPGA SmartNICs with Fixed-Point Arithmetic and Taylor
Mohammad Firas Sada, John J. Graham, Mahidhar Tatineni, Dmitry Mishin, Thomas A. DeFanti, Frank Würthwein
Comments: To appear in Proceedings of the Practice and Experience in Advanced Research Computing (PEARC25)
Journal-ref: Proceedings of the Practice and Experience in Advanced Research Computing PEARC '25, July 20-24, 2025, Columbus, OH, USA
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)

As machine learning (ML) applications become integral to modern network operations, there is an increasing demand for network programmability that enables low-latency ML inference for tasks such as Quality of Service (QoS) prediction and anomaly detection in cybersecurity. ML models provide adaptability through dynamic weight adjustments, making Programming Protocol-independent Packet Processors (P4)-programmable FPGA SmartNICs an ideal platform for investigating In-Network Machine Learning (INML). These devices offer high-throughput, low-latency packet processing and can be dynamically reconfigured via the control plane, allowing for flexible integration of ML models directly at the network edge. This paper explores the application of the P4 programming paradigm to neural networks and regression models, where weights and biases are stored in control plane table lookups. This approach enables flexible programmability and efficient deployment of retrainable ML models at the network edge, independent of core infrastructure at the switch level.

[197] arXiv:2507.00429 [pdf, html, other]
Title: DiGA3D: Coarse-to-Fine Diffusional Propagation of Geometry and Appearance for Versatile 3D Inpainting
Jingyi Pan, Dan Xu, Qiong Luo
Comments: ICCV 2025, Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Developing a unified pipeline that enables users to remove, re-texture, or replace objects in a versatile manner is crucial for text-guided 3D inpainting. However, there are still challenges in performing multiple 3D inpainting tasks within a unified framework: 1) Single reference inpainting methods lack robustness when dealing with views that are far from the reference view. 2) Appearance inconsistency arises when independently inpainting multi-view images with 2D diffusion priors; 3) Geometry inconsistency limits performance when there are significant geometric changes in the inpainting regions. To tackle these challenges, we introduce DiGA3D, a novel and versatile 3D inpainting pipeline that leverages diffusion models to propagate consistent appearance and geometry in a coarse-to-fine manner. First, DiGA3D develops a robust strategy for selecting multiple reference views to reduce errors during propagation. Next, DiGA3D designs an Attention Feature Propagation (AFP) mechanism that propagates attention features from the selected reference views to other views via diffusion models to maintain appearance consistency. Furthermore, DiGA3D introduces a Texture-Geometry Score Distillation Sampling (TG-SDS) loss to further improve the geometric consistency of inpainted 3D scenes. Extensive experiments on multiple 3D inpainting tasks demonstrate the effectiveness of our method. The project page is available at this https URL.

[198] arXiv:2507.00430 [pdf, html, other]
Title: MFH: Marrying Frequency Domain with Handwritten Mathematical Expression Recognition
Huanxin Yang, Qiwen Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Handwritten mathematical expression recognition (HMER) suffers from complex formula structures and character layouts in sequence prediction. In this paper, we incorporate frequency domain analysis into HMER and propose a method that marries frequency domain with HMER (MFH), leveraging the discrete cosine transform (DCT). We emphasize the structural analysis assistance of frequency information for recognizing mathematical formulas. When implemented on various baseline models, our network exhibits a consistent performance enhancement, demonstrating the efficacy of frequency domain information. Experiments show that our MFH-CoMER achieves noteworthy accuracyrates of 61.66%/62.07%/63.72% on the CROHME 2014/2016/2019 test sets. The source code is available at this https URL.

[199] arXiv:2507.00432 [pdf, html, other]
Title: Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, Xiang Yue
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. We surprisingly find that most models that succeed in math fail to transfer their gains to other domains. To rigorously study this phenomenon, we conduct controlled experiments on Qwen3-14B models using math-only data but different tuning methods. We find that reinforcement learning (RL)-tuned models generalize well across domains, while supervised fine-tuning (SFT)-tuned models often forget general capabilities. Latent-space representation and token-space distribution shift analyses reveal that SFT induces substantial representation and output drift, while RL preserves general-domain structure. Our results suggest a need to rethink standard post-training recipes, particularly the reliance on SFT-distilled data for advancing reasoning models.

[200] arXiv:2507.00435 [pdf, html, other]
Title: RoboEval: Where Robotic Manipulation Meets Structured and Scalable Evaluation
Yi Ru Wang, Carter Ung, Grant Tannert, Jiafei Duan, Josephine Li, Amy Le, Rishabh Oswal, Markus Grotz, Wilbert Pumacay, Yuquan Deng, Ranjay Krishna, Dieter Fox, Siddhartha Srinivasa
Comments: Project page: this https URL
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

We present RoboEval, a simulation benchmark and structured evaluation framework designed to reveal the limitations of current bimanual manipulation policies. While prior benchmarks report only binary task success, we show that such metrics often conceal critical weaknesses in policy behavior -- such as poor coordination, slipping during grasping, or asymmetric arm usage. RoboEval introduces a suite of tiered, semantically grounded tasks decomposed into skill-specific stages, with variations that systematically challenge spatial, physical, and coordination capabilities. Tasks are paired with fine-grained diagnostic metrics and 3000+ human demonstrations to support imitation learning. Our experiments reveal that policies with similar success rates diverge in how tasks are executed -- some struggle with alignment, others with temporally consistent bimanual control. We find that behavioral metrics correlate with success in over half of task-metric pairs, and remain informative even when binary success saturates. By pinpointing when and how policies fail, RoboEval enables a deeper, more actionable understanding of robotic manipulation -- and highlights the need for evaluation tools that go beyond success alone.

[201] arXiv:2507.00439 [pdf, html, other]
Title: Beyond Sociodemographic Prompting: Using Supervision to Align LLMs with Human Response Distributions
Gauri Kambhatla, Sanjana Gautam, Angela Zhang, Alex Liu, Ravi Srinivasan, Junyi Jessy Li, Matthew Lease
Subjects: Computation and Language (cs.CL)

The ability to accurately predict how different population groups would answer subjective questions would have great value. In this work, we show that use of relatively simple supervision can greatly improve language model alignment with diverse population groups, as measured over three datasets spanning various topics. Beyond evaluating average performance, we also report how alignment varies across specific groups. The simplicity and generality of our approach promotes easy adoption, while our broad findings provide useful guidance for when to use or not use our approach in practice. By conducting evaluation over many LLMs and prompting strategies, along with open-sourcing our work, we provide a useful benchmark to stimulate future research.

[202] arXiv:2507.00440 [pdf, html, other]
Title: A Recipe for Causal Graph Regression: Confounding Effects Revisited
Yujia Yin, Tianyi Qu, Zihao Wang, Yifan Chen
Comments: ICML 2025 accepted
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)

Through recognizing causal subgraphs, causal graph learning (CGL) has risen to be a promising approach for improving the generalizability of graph neural networks under out-of-distribution (OOD) scenarios. However, the empirical successes of CGL techniques are mostly exemplified in classification settings, while regression tasks, a more challenging setting in graph learning, are overlooked. We thus devote this work to tackling causal graph regression (CGR); to this end we reshape the processing of confounding effects in existing CGL studies, which mainly deal with classification. Specifically, we reflect on the predictive power of confounders in graph-level regression, and generalize classification-specific causal intervention techniques to regression through a lens of contrastive learning. Extensive experiments on graph OOD benchmarks validate the efficacy of our proposals for CGR. The model implementation and the code are provided on this https URL.

[203] arXiv:2507.00443 [pdf, other]
Title: Novel Pigeon-inspired 3D Obstacle Detection and Avoidance Maneuver for Multi-UAV Systems
Reza Ahmadvand, Sarah Safura Sharif, Yaser Mike Banad
Comments: 11 Pages, 11 Pictures, 1 Table, 3 Algorithms
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Recent advances in multi-agent systems manipulation have demonstrated a rising demand for the implementation of multi-UAV systems in urban areas, which are always subjected to the presence of static and dynamic obstacles. Inspired by the collective behavior of tilapia fish and pigeons, the focus of the presented research is on the introduction of a nature-inspired collision-free formation control for a multi-UAV system, considering the obstacle avoidance maneuvers. The developed framework in this study utilizes a semi-distributed control approach, in which, based on a probabilistic Lloyd's algorithm, a centralized guidance algorithm works for optimal positioning of the UAVs, while a distributed control approach has been used for the intervehicle collision and obstacle avoidance. Further, the presented framework has been extended to the 3D space with a novel definition of 3D maneuvers. Finally, the presented framework has been applied to multi-UAV systems in 2D and 3D scenarios, and the obtained results demonstrated the validity of the presented method in dynamic environments with stationary and moving obstacles.

[204] arXiv:2507.00444 [pdf, html, other]
Title: DiffCkt: A Diffusion Model-Based Hybrid Neural Network Framework for Automatic Transistor-Level Generation of Analog Circuits
Chengjie Liu, Jiajia Li, Yabing Feng, Wenhao Huang, Weiyu Chen, Yuan Du, Jun Yang, Li Du
Comments: Accepted by ICCAD2025
Subjects: Emerging Technologies (cs.ET)

Analog circuit design consists of the pre-layout and layout phases. Among them, the pre-layout phase directly decides the final circuit performance, but heavily depends on experienced engineers to do manual design according to specific application scenarios. To overcome these challenges and automate the analog circuit pre-layout design phase, we introduce DiffCkt: a diffusion model-based hybrid neural network framework for the automatic transistor-level generation of analog circuits, which can directly generate corresponding circuit structures and device parameters tailored to specific performance requirements. To more accurately quantify the efficiency of circuits generated by DiffCkt, we introduce the Circuit Generation Efficiency Index (CGEI), which is determined by both the figure of merit (FOM) of a single generated circuit and the time consumed. Compared with relative research, DiffCkt has improved CGEI by a factor of $2.21 \sim 8365\times$, reaching a state-of-the-art (SOTA) level. In conclusion, this work shows that the diffusion model has the remarkable ability to learn and generate analog circuit structures and device parameters, providing a revolutionary method for automating the pre-layout design of analog circuits. The circuit dataset will be open source, its preview version is available at this https URL.

[205] arXiv:2507.00445 [pdf, html, other]
Title: Iterative Distillation for Reward-Guided Fine-Tuning of Diffusion Models in Biomolecular Design
Xingyu Su, Xiner Li, Masatoshi Uehara, Sunwoo Kim, Yulai Zhao, Gabriele Scalia, Ehsan Hajiramezanali, Tommaso Biancalani, Degui Zhi, Shuiwang Ji
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)

We address the problem of fine-tuning diffusion models for reward-guided generation in biomolecular design. While diffusion models have proven highly effective in modeling complex, high-dimensional data distributions, real-world applications often demand more than high-fidelity generation, requiring optimization with respect to potentially non-differentiable reward functions such as physics-based simulation or rewards based on scientific knowledge. Although RL methods have been explored to fine-tune diffusion models for such objectives, they often suffer from instability, low sample efficiency, and mode collapse due to their on-policy nature. In this work, we propose an iterative distillation-based fine-tuning framework that enables diffusion models to optimize for arbitrary reward functions. Our method casts the problem as policy distillation: it collects off-policy data during the roll-in phase, simulates reward-based soft-optimal policies during roll-out, and updates the model by minimizing the KL divergence between the simulated soft-optimal policy and the current model policy. Our off-policy formulation, combined with KL divergence minimization, enhances training stability and sample efficiency compared to existing RL-based methods. Empirical results demonstrate the effectiveness and superior reward optimization of our approach across diverse tasks in protein, small molecule, and regulatory DNA design.

[206] arXiv:2507.00446 [pdf, html, other]
Title: DIJE: Dense Image Jacobian Estimation for Robust Robotic Self-Recognition and Visual Servoing
Yasunori Toshimitsu, Kento Kawaharazuka, Akihiro Miki, Kei Okada, Masayuki Inaba
Comments: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Journal-ref: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 2022, pp. 2219-2226
Subjects: Robotics (cs.RO)

For robots to move in the real world, they must first correctly understand the state of its own body and the tools that it holds. In this research, we propose DIJE, an algorithm to estimate the image Jacobian for every pixel. It is based on an optical flow calculation and a simplified Kalman Filter that can be efficiently run on the whole image in real time. It does not rely on markers nor knowledge of the robotic structure. We use the DIJE in a self-recognition process which can robustly distinguish between movement by the robot and by external entities, even when the motion overlaps. We also propose a visual servoing controller based on DIJE, which can learn to control the robot's body to conduct reaching movements or bimanual tool-tip control. The proposed algorithms were implemented on a physical musculoskeletal robot and its performance was verified. We believe that such global estimation of the visuomotor policy has the potential to be extended into a more general framework for manipulation.

[207] arXiv:2507.00447 [pdf, html, other]
Title: Latent Posterior-Mean Rectified Flow for Higher-Fidelity Perceptual Face Restoration
Xin Luo, Menglin Zhang, Yunwei Lan, Tianyu Zhang, Rui Li, Chang Liu, Dong Liu
Comments: Code and Models will be publicly available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

The Perception-Distortion tradeoff (PD-tradeoff) theory suggests that face restoration algorithms must balance perceptual quality and fidelity. To achieve minimal distortion while maintaining perfect perceptual quality, Posterior-Mean Rectified Flow (PMRF) proposes a flow based approach where source distribution is minimum distortion estimations. Although PMRF is shown to be effective, its pixel-space modeling approach limits its ability to align with human perception, where human perception is defined as how humans distinguish between two image distributions. In this work, we propose Latent-PMRF, which reformulates PMRF in the latent space of a variational autoencoder (VAE), facilitating better alignment with human perception during optimization. By defining the source distribution on latent representations of minimum distortion estimation, we bound the minimum distortion by the VAE's reconstruction error. Moreover, we reveal the design of VAE is crucial, and our proposed VAE significantly outperforms existing VAEs in both reconstruction and restoration. Extensive experiments on blind face restoration demonstrate the superiority of Latent-PMRF, offering an improved PD-tradeoff compared to existing methods, along with remarkable convergence efficiency, achieving a 5.79X speedup over PMRF in terms of FID. Our code will be available as open-source.

[208] arXiv:2507.00449 [pdf, html, other]
Title: Overcoming Long-Context Limitations of State-Space Models via Context-Dependent Sparse Attention
Zhihao Zhan, Jianan Zhao, Zhaocheng Zhu, Jian Tang
Comments: Proceedings of the 42nd International Conference on Machine Learning, ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models, 18 pages, 9 figures
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Efficient long-context modeling remains a critical challenge for natural language processing (NLP), as the time complexity of the predominant Transformer architecture scales quadratically with the sequence length. While state-space models (SSMs) offer alternative sub-quadratic solutions, they struggle to capture long-range dependencies effectively. In this work, we focus on analyzing and improving the long-context modeling capabilities of SSMs. We show that the widely used synthetic task, associative recall, which requires a model to recall a value associated with a single key without context, insufficiently represents the complexities of real-world long-context modeling. To address this limitation, we extend the associative recall to a novel synthetic task, \emph{joint recall}, which requires a model to recall the value associated with a key given in a specified context. Theoretically, we prove that SSMs do not have the expressiveness to solve multi-query joint recall in sub-quadratic time complexity. To resolve this issue, we propose a solution based on integrating SSMs with Context-Dependent Sparse Attention (CDSA), which has the expressiveness to solve multi-query joint recall with sub-quadratic computation. To bridge the gap between theoretical analysis and real-world applications, we propose locality-sensitive Hashing Attention with sparse Key Selection (HAX), which instantiates the theoretical solution and is further tailored to natural language domains. Extensive experiments on both synthetic and real-world long-context benchmarks show that HAX consistently outperforms SSM baselines and SSMs integrated with context-independent sparse attention (CISA).

[209] arXiv:2507.00451 [pdf, html, other]
Title: Best Agent Identification for General Game Playing
Matthew Stephenson, Alex Newcombe, Eric Piette, Dennis Soemers
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Information Theory (cs.IT); Machine Learning (stat.ML)

We present an efficient and generalised procedure to accurately identify the best performing algorithm for each sub-task in a multi-problem domain. Our approach treats this as a set of best arm identification problems for multi-armed bandits, where each bandit corresponds to a specific task and each arm corresponds to a specific algorithm or agent. We propose an optimistic selection process based on the Wilson score interval (Optimistic-WS) that ranks each arm across all bandits in terms of their potential regret reduction. We evaluate the performance of Optimistic-WS on two of the most popular general game domains, the General Video Game AI (GVGAI) framework and the Ludii general game playing system, with the goal of identifying the highest performing agent for each game within a limited number of trials. Compared to previous best arm identification algorithms for multi-armed bandits, our results demonstrate a substantial performance improvement in terms of average simple regret. This novel approach can be used to significantly improve the quality and accuracy of agent evaluation procedures for general game frameworks, as well as other multi-task domains with high algorithm runtimes.

[210] arXiv:2507.00452 [pdf, other]
Title: The impact of the following vehicles behaviors on the car following behaviors of the ego-vehicle
Yang Liu, Jiahao Zhang, Yuxuan Ouyang, Huan Yu, Dengbo He
Subjects: Systems and Control (eess.SY)

Among all types of crashes, rear-end crashes dominate, which are closely related to the car-following (CF) behaviors. Traditional CF behavior models focused on the influence of the vehicle in front, but usually ignored the peer pressure from the surrounding road users, including the following vehicle (FV). Based on an open dataset, the highD dataset, we investigated whether the FV's states can affect the CF behavior of the ego-vehicle in CF events. Two types of CF events were extracted from highD database, including the tailgated events, where the time headway between the FV and the ego-vehicle (i.e., time gap) was smaller than 1 second, and the gapped events, where the time gap was larger than 3 seconds. The dynamic time warping was used to extract CF pairs with similar speed profiles of the leading vehicle (LV). Statistical analyses were conducted to compare the CF-performance metrics in tailgated and gapped events. Then, the inverse reinforcement learning was used to recover the reward function of the ego-vehicle drivers in different CF events. The results showed that the ego-driver would adjust their CF behavior in response to the pressure from a tailgating FV, by maintaining a closer distance to the LV, but at the same time, driving more cautiously. Further, drivers were still able to adjust their CF strategies based on the speed of traffic flow and the distance to the LV, even when being tailgated. These findings provide insights regarding more accurate modelling of traffic flow by considering the peer pressure from surrounding road users.

[211] arXiv:2507.00453 [pdf, html, other]
Title: Recurrent Memory-Augmented Transformers with Chunked Attention for Long-Context Language Modeling
Ankit Kashyap
Comments: 19 pages, 9 figures, 1 table; implemented entirely from scratch in PyTorch
Subjects: Machine Learning (cs.LG)

We present a Transformer architecture for long-context language modeling that combines global attention with two biologically inspired components: chunked local attention and a gated FIFO memory mechanism. This unified attention block allows the model to efficiently handle both short-range and long-range dependencies without increasing attention cost quadratically. The memory module persistently stores past token representations using a gated update mechanism inspired by recurrent networks. Rotary positional encoding is applied per attention head to enable directionally disentangled, scale-invariant positional signals. The architecture is implemented entirely from scratch in PyTorch, with no reliance on high-level libraries, enabling transparent and modular experimentation. Our model offers a lightweight and extensible design for tasks such as dialogue modeling, code completion, and document understanding.

[212] arXiv:2507.00454 [pdf, html, other]
Title: ATSTrack: Enhancing Visual-Language Tracking by Aligning Temporal and Spatial Scales
Yihao Zhen, Qiang Wang, Yu Qiao, Liangqiong Qu, Huijie Fan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

A main challenge of Visual-Language Tracking (VLT) is the misalignment between visual inputs and language descriptions caused by target movement. Previous trackers have explored many effective feature modification methods to preserve more aligned features. However, an important yet unexplored factor ultimately hinders their capability, which is the inherent differences in the temporal and spatial scale of information between visual and language inputs. To address this issue, we propose a novel visual-language tracker that enhances the effect of feature modification by \textbf{A}ligning \textbf{T}emporal and \textbf{S}patial scale of different input components, named as \textbf{ATSTrack}. Specifically, we decompose each language description into phrases with different attributes based on their temporal and spatial correspondence with visual inputs, and modify their features in a fine-grained manner. Moreover, we introduce a Visual-Language token that comprises modified linguistic information from the previous frame to guide the model to extract visual features that are more relevant to language description, thereby reducing the impact caused by the differences in spatial scale. Experimental results show that our proposed ATSTrack achieves performance comparable to existing methods. Our code will be released.

[213] arXiv:2507.00456 [pdf, html, other]
Title: Teacher-AI Collaboration for Curating and Customizing Lesson Plans in Low-Resource Schools
Deepak Varuvel Dennison, Bakhtawar Ahtisham, Kavyansh Chourasia, Nirmit Arora, Rahul Singh, Rene F. Kizilcec, Akshay Nambi, Tanuja Ganu, Aditya Vashistha
Subjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

This study investigates Shiksha copilot, an AI-assisted lesson planning tool deployed in government schools across Karnataka, India. The system combined LLMs and human expertise through a structured process in which English and Kannada lesson plans were co-created by curators and AI; teachers then further customized these curated plans for their classrooms using their own expertise alongside AI support. Drawing on a large-scale mixed-methods study involving 1,043 teachers and 23 curators, we examine how educators collaborate with AI to generate context-sensitive lesson plans, assess the quality of AI-generated content, and analyze shifts in teaching practices within multilingual, low-resource environments. Our findings show that teachers used Shiksha copilot both to meet administrative documentation needs and to support their teaching. The tool eased bureaucratic workload, reduced lesson planning time, and lowered teaching-related stress, while promoting a shift toward activity-based pedagogy. However, systemic challenges such as staffing shortages and administrative demands constrained broader pedagogical change. We frame these findings through the lenses of teacher-AI collaboration and communities of practice to examine the effective integration of AI tools in teaching. Finally, we propose design directions for future teacher-centered EdTech, particularly in multilingual and Global South contexts.

[214] arXiv:2507.00460 [pdf, html, other]
Title: Pitfalls of Evaluating Language Models with Open Benchmarks
Md. Najib Hasan (1), Mohammad Fakhruddin Babar (2), Souvika Sarkar (1), Monowar Hasan (2), Santu Karmaker (3) ((1) Wichita State University, (2) Washington State University, (3) University of Central Florida)
Subjects: Computation and Language (cs.CL)

Open Large Language Model (LLM) benchmarks, such as HELM and BIG-bench, offer standardized, transparent protocols that facilitate the fair comparison, reproducibility, and iterative advancement of Language Models (LMs). However, their openness also introduces critical and underexplored pitfalls. This study exposes these weaknesses by systematically constructing ``cheating'' models -- smaller variants of BART, T5, and GPT-2 fine-tuned directly on public test sets -- which achieve top rankings on a prominent open, holistic benchmark (HELM) despite poor generalization and limited practical utility. Our findings underscore three key insights: \ca high leaderboard performance on open benchmarks may not always reflect real-world effectiveness; \cb private or dynamic benchmarks must complement open evaluations to safeguard integrity; and \cc a fundamental reevaluation of current benchmarking practices is essential to ensure robust and trustworthy LM assessments.

[215] arXiv:2507.00461 [pdf, html, other]
Title: Novel Complex-Valued Hopfield Neural Networks with Phase and Magnitude Quantization
Garimella Ramamurthy, Marcos Eduardo Valle, Tata Jagannadha Swamy
Comments: Paper submitted to the Fifth International Conference on Emerging Techniques in Computational Intelligence (ICETCI 2025)
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)

This research paper introduces two novel complex-valued Hopfield neural networks (CvHNNs) that incorporate phase and magnitude quantization. The first CvHNN employs a ceiling-type activation function that operates on the rectangular coordinate representation of the complex net contribution. The second CvHNN similarly incorporates phase and magnitude quantization but utilizes a ceiling-type activation function based on the polar coordinate representation of the complex net contribution. The proposed CvHNNs, with their phase and magnitude quantization, significantly increase the number of states compared to existing models in the literature, thereby expanding the range of potential applications for CvHNNs.

[216] arXiv:2507.00462 [pdf, html, other]
Title: Unleashing the Potential of All Test Samples: Mean-Shift Guided Test-Time Adaptation
Jizhou Han, Chenhao Ding, SongLin Dong, Yuhang He, Xinyuan Gao, Yihong Gong
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Visual-language models (VLMs) like CLIP exhibit strong generalization but struggle with distribution shifts at test time. Existing training-free test-time adaptation (TTA) methods operate strictly within CLIP's original feature space, relying on high-confidence samples while overlooking the potential of low-confidence ones. We propose MS-TTA, a training-free approach that enhances feature representations beyond CLIP's space using a single-step k-nearest neighbors (kNN) Mean-Shift. By refining all test samples, MS-TTA improves feature compactness and class separability, leading to more stable adaptation. Additionally, a cache of refined embeddings further enhances inference by providing Mean Shift enhanced logits. Extensive evaluations on OOD and cross-dataset benchmarks demonstrate that MS-TTA consistently outperforms state-of-the-art training-free TTA methods, achieving robust adaptation without requiring additional training.

[217] arXiv:2507.00464 [pdf, html, other]
Title: A Miniature High-Resolution Tension Sensor Based on a Photo-Reflector for Robotic Hands and Grippers
Hyun-Bin Kim, Kyung-Soo Kim
Subjects: Robotics (cs.RO)

This paper presents a miniature tension sensor using a photo-reflector, designed for compact tendon-driven grippers and robotic hands. The proposed sensor has a small form factor of 13~mm x 7~mm x 6.5~mm and is capable of measuring tensile forces up to 200~N. A symmetric elastomer structure incorporating fillets and flexure hinges is designed based on Timoshenko beam theory and verified via FEM analysis, enabling improved sensitivity and mechanical durability while minimizing torsional deformation. The sensor utilizes a compact photo-reflector (VCNT2020) to measure displacement in the near-field region, eliminating the need for light-absorbing materials or geometric modifications required in photo-interrupter-based designs. A 16-bit analog-to-digital converter (ADC) and CAN-FD (Flexible Data-rate) communication enable efficient signal acquisition with up to 5~kHz sampling rate. Calibration experiments demonstrate a resolution of 9.9~mN (corresponding to over 14-bit accuracy) and a root mean square error (RMSE) of 0.455~N. Force control experiments using a twisted string actuator and PI control yield RMSEs as low as 0.073~N. Compared to previous research using photo-interrupter, the proposed method achieves more than tenfold improvement in resolution while also reducing nonlinearity and hysteresis. The design is mechanically simple, lightweight, easy to assemble, and suitable for integration into robotic and prosthetic systems requiring high-resolution force feedback.

[218] arXiv:2507.00465 [pdf, html, other]
Title: Encoding Peano Arithmetic in a Minimal Fragment of Separation Logic
Sohei Ito, Makoto Tatsuta
Subjects: Logic in Computer Science (cs.LO)

This paper investigates the expressive power of a minimal fragment of separation logic extended with natural numbers. Specifically, it demonstrates that the fragment consisting solely of the intuitionistic points-to predicate, the constant 0, and the successor function is sufficient to encode all $\Pi^0_1$ formulas of Peano Arithmetic (PA). The authors construct a translation from PA into this fragment, showing that a $\Pi^0_1$ formula is valid in the standard model of arithmetic if and only if its translation is valid in the standard interpretation of the separation logic fragment. This result implies the undecidability of validity in the fragment, despite its syntactic simplicity. The translation leverages a heap-based encoding of arithmetic operations - addition, multiplication, and inequality - using structured memory cells. The paper also explores the boundaries of this encoding, showing that the translation does not preserve validity for $\Sigma^0_1$ formulas. Additionally, an alternative undecidability proof is presented via a reduction from finite model theory. Finally, the paper establishes that the validity problem for this fragment is $\Pi^0_1$-complete, highlighting its theoretical significance in the landscape of logic and program verification.

[219] arXiv:2507.00466 [pdf, html, other]
Title: Beat and Downbeat Tracking in Performance MIDI Using an End-to-End Transformer Architecture
Sebastian Murgul, Michael Heizmann
Comments: Accepted to the 22nd Sound and Music Computing Conference (SMC), 2025
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)

Beat tracking in musical performance MIDI is a challenging and important task for notation-level music transcription and rhythmical analysis, yet existing methods primarily focus on audio-based approaches. This paper proposes an end-to-end transformer-based model for beat and downbeat tracking in performance MIDI, leveraging an encoder-decoder architecture for sequence-to-sequence translation of MIDI input to beat annotations. Our approach introduces novel data preprocessing techniques, including dynamic augmentation and optimized tokenization strategies, to improve accuracy and generalizability across different datasets. We conduct extensive experiments using the A-MAPS, ASAP, GuitarSet, and Leduc datasets, comparing our model against state-of-the-art hidden Markov models (HMMs) and deep learning-based beat tracking methods. The results demonstrate that our model outperforms existing symbolic music beat tracking approaches, achieving competitive F1-scores across various musical styles and instruments. Our findings highlight the potential of transformer architectures for symbolic beat tracking and suggest future integration with automatic music transcription systems for enhanced music analysis and score generation.

[220] arXiv:2507.00467 [pdf, html, other]
Title: Diversity Conscious Refined Random Forest
Sijan Bhattarai, Saurav Bhandari, Girija Bhusal, Saroj Shakya, Tapendra Pandey
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Random Forest (RF) is a widely used ensemble learning technique known for its robust classification performance across diverse domains. However, it often relies on hundreds of trees and all input features, leading to high inference cost and model redundancy. In this work, our goal is to grow trees dynamically only on informative features and then enforce maximal diversity by clustering and retaining uncorrelated trees. Therefore, we propose a Refined Random Forest Classifier that iteratively refines itself by first removing the least informative features and then analytically determines how many new trees should be grown, followed by correlation-based clustering to remove redundant trees. The classification accuracy of our model was compared against the standard RF on the same number of trees. Experiments on 8 multiple benchmark datasets, including binary and multiclass datasets, demonstrate that the proposed model achieves improved accuracy compared to standard RF.

[221] arXiv:2507.00469 [pdf, html, other]
Title: Bisecle: Binding and Separation in Continual Learning for Video Language Understanding
Yue Tan, Xiaoqian Hu, Hao Xue, Celso De Melo, Flora D. Salim
Comments: 23 pages, 12 figures, 10 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Frontier vision-language models (VLMs) have made remarkable improvements in video understanding tasks. However, real-world videos typically exist as continuously evolving data streams (e.g., dynamic scenes captured by wearable glasses), necessitating models to continually adapt to shifting data distributions and novel scenarios. Considering the prohibitive computational costs of fine-tuning models on new tasks, usually, a small subset of parameters is updated while the bulk of the model remains frozen. This poses new challenges to existing continual learning frameworks in the context of large multimodal foundation models, i.e., catastrophic forgetting and update conflict. While the foundation models struggle with parameter-efficient continual learning, the hippocampus in the human brain has evolved highly efficient mechanisms for memory formation and consolidation. Inspired by the rapid Binding and pattern separation mechanisms in the hippocampus, in this work, we propose Bisecle for video-language continual learning, where a multi-directional supervision module is used to capture more cross-modal relationships and a contrastive prompt learning scheme is designed to isolate task-specific knowledge to facilitate efficient memory storage. Binding and separation processes further strengthen the ability of VLMs to retain complex experiences, enabling robust and efficient continual learning in video understanding tasks. We perform a thorough evaluation of the proposed Bisecle, demonstrating its ability to mitigate forgetting and enhance cross-task generalization on several VideoQA benchmarks.

[222] arXiv:2507.00472 [pdf, html, other]
Title: ARIG: Autoregressive Interactive Head Generation for Real-time Conversations
Ying Guo, Xi Liu, Cheng Zhen, Pengfei Yan, Xiaoming Wei
Comments: ICCV 2025. Homepage: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Face-to-face communication, as a common human activity, motivates the research on interactive head generation. A virtual agent can generate motion responses with both listening and speaking capabilities based on the audio or motion signals of the other user and itself. However, previous clip-wise generation paradigm or explicit listener/speaker generator-switching methods have limitations in future signal acquisition, contextual behavioral understanding, and switching smoothness, making it challenging to be real-time and realistic. In this paper, we propose an autoregressive (AR) based frame-wise framework called ARIG to realize the real-time generation with better interaction realism. To achieve real-time generation, we model motion prediction as a non-vector-quantized AR process. Unlike discrete codebook-index prediction, we represent motion distribution using diffusion procedure, achieving more accurate predictions in continuous space. To improve interaction realism, we emphasize interactive behavior understanding (IBU) and detailed conversational state understanding (CSU). In IBU, based on dual-track dual-modal signals, we summarize short-range behaviors through bidirectional-integrated learning and perform contextual understanding over long ranges. In CSU, we use voice activity signals and context features of IBU to understand the various states (interruption, feedback, pause, etc.) that exist in actual conversations. These serve as conditions for the final progressive motion prediction. Extensive experiments have verified the effectiveness of our model.

[223] arXiv:2507.00474 [pdf, html, other]
Title: ADAptation: Reconstruction-based Unsupervised Active Learning for Breast Ultrasound Diagnosis
Yaofei Duan, Yuhao Huang, Xin Yang, Luyi Han, Xinyu Xie, Zhiyuan Zhu, Ping He, Ka-Hou Chan, Ligang Cui, Sio-Kei Im, Dong Ni, Tao Tan
Comments: 11 pages, 4 figures, 4 tables. Accepted by conference MICCAI2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Deep learning-based diagnostic models often suffer performance drops due to distribution shifts between training (source) and test (target) domains. Collecting and labeling sufficient target domain data for model retraining represents an optimal solution, yet is limited by time and scarce resources. Active learning (AL) offers an efficient approach to reduce annotation costs while maintaining performance, but struggles to handle the challenge posed by distribution variations across different datasets. In this study, we propose a novel unsupervised Active learning framework for Domain Adaptation, named ADAptation, which efficiently selects informative samples from multi-domain data pools under limited annotation budget. As a fundamental step, our method first utilizes the distribution homogenization capabilities of diffusion models to bridge cross-dataset gaps by translating target images into source-domain style. We then introduce two key innovations: (a) a hypersphere-constrained contrastive learning network for compact feature clustering, and (b) a dual-scoring mechanism that quantifies and balances sample uncertainty and representativeness. Extensive experiments on four breast ultrasound datasets (three public and one in-house/multi-center) across five common deep classifiers demonstrate that our method surpasses existing strong AL-based competitors, validating its effectiveness and generalization for clinical domain adaptation. The code is available at the anonymized link: this https URL.

[224] arXiv:2507.00475 [pdf, html, other]
Title: AudioBERTScore: Objective Evaluation of Environmental Sound Synthesis Based on Similarity of Audio embedding Sequences
Minoru Kishi, Ryosuke Sakai, Shinnosuke Takamichi, Yusuke Kanamori, Yuki Okamoto
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

We propose a novel objective evaluation metric for synthesized audio in text-to-audio (TTA), aiming to improve the performance of TTA models. In TTA, subjective evaluation of the synthesized sound is an important, but its implementation requires monetary costs. Therefore, objective evaluation such as mel-cepstral distortion are used, but the correlation between these objective metrics and subjective evaluation values is weak. Our proposed objective evaluation metric, AudioBERTScore, calculates the similarity between embedding of the synthesized and reference sounds. The method is based not only on the max-norm used in conventional BERTScore but also on the $p$-norm to reflect the non-local nature of environmental sounds. Experimental results show that scores obtained by the proposed method have a higher correlation with subjective evaluation values than conventional metrics.

[225] arXiv:2507.00476 [pdf, html, other]
Title: FreNBRDF: A Frequency-Rectified Neural Material Representation
Chenliang Zhou, Zheyuan Hu, Cengiz Oztireli
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Accurate material modeling is crucial for achieving photorealistic rendering, bridging the gap between computer-generated imagery and real-world photographs. While traditional approaches rely on tabulated BRDF data, recent work has shifted towards implicit neural representations, which offer compact and flexible frameworks for a range of tasks. However, their behavior in the frequency domain remains poorly understood. To address this, we introduce FreNBRDF, a frequency-rectified neural material representation. By leveraging spherical harmonics, we integrate frequency-domain considerations into neural BRDF modeling. We propose a novel frequency-rectified loss, derived from a frequency analysis of neural materials, and incorporate it into a generalizable and adaptive reconstruction and editing pipeline. This framework enhances fidelity, adaptability, and efficiency. Extensive experiments demonstrate that \ours improves the accuracy and robustness of material appearance reconstruction and editing compared to state-of-the-art baselines, enabling more structured and interpretable downstream tasks and applications.

[226] arXiv:2507.00477 [pdf, html, other]
Title: Read the Docs Before Rewriting: Equip Rewriter with Domain Knowledge via Continual Pre-training
Qi Wang, Yixuan Cao, Yifan Liu, Jiangtao Zhao, Ping Luo
Subjects: Information Retrieval (cs.IR)

A Retrieval-Augmented Generation (RAG)-based question-answering (QA) system enhances a large language model's knowledge by retrieving relevant documents based on user queries. Discrepancies between user queries and document phrasings often necessitate query rewriting. However, in specialized domains, the rewriter model may struggle due to limited domain-specific knowledge. To resolve this, we propose the R\&R (Read the doc before Rewriting) rewriter, which involves continual pre-training on professional documents, akin to how students prepare for open-book exams by reviewing textbooks. Additionally, it can be combined with supervised fine-tuning for improved results. Experiments on multiple datasets demonstrate that R\&R excels in professional QA across multiple domains, effectively bridging the query-document gap, while maintaining good performance in general scenarios, thus advancing the application of RAG-based QA systems in specialized fields.

[227] arXiv:2507.00479 [pdf, html, other]
Title: On Mitigating Data Sparsity in Conversational Recommender Systems
Sixiao Zhang, Mingrui Liu, Cheng Long, Wei Yuan, Hongxu Chen, Xiangyu Zhao, Hongzhi Yin
Subjects: Information Retrieval (cs.IR)

Conversational recommender systems (CRSs) capture user preference through textual information in dialogues. However, they suffer from data sparsity on two fronts: the dialogue space is vast and linguistically diverse, while the item space exhibits long-tail and sparse distributions. Existing methods struggle with (1) generalizing to varied dialogue expressions due to underutilization of rich textual cues, and (2) learning informative item representations under severe sparsity. To address these problems, we propose a CRS model named DACRS. It consists of three modules, namely Dialogue Augmentation, Knowledge-Guided Entity Modeling, and Dialogue-Entity Matching. In the Dialogue Augmentation module, we apply a two-stage augmentation pipeline to augment the dialogue context to enrich the data and improve generalizability. In the Knowledge-Guided Entity Modeling, we propose a knowledge graph (KG) based entity substitution and an entity similarity constraint to enhance the expressiveness of entity embeddings. In the Dialogue-Entity Matching module, we fuse the dialogue embedding with the mentioned entity embeddings through a dialogue-guided attention aggregation to acquire user embeddings that contain both the explicit and implicit user preferences. Extensive experiments on two public datasets demonstrate the state-of-the-art performance of DACRS.

[228] arXiv:2507.00480 [pdf, html, other]
Title: Posterior Inference in Latent Space for Scalable Constrained Black-box Optimization
Kiyoung Om, Kyuil Sim, Taeyoung Yun, Hyeongyu Kang, Jinkyoo Park
Comments: 25 pages, 11 figures, 5 tables. Equal contribution by Kiyoung Om, Kyuil Sim, and Taeyoung Yun
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Optimizing high-dimensional black-box functions under black-box constraints is a pervasive task in a wide range of scientific and engineering problems. These problems are typically harder than unconstrained problems due to hard-to-find feasible regions. While Bayesian optimization (BO) methods have been developed to solve such problems, they often struggle with the curse of dimensionality. Recently, generative model-based approaches have emerged as a promising alternative for constrained optimization. However, they suffer from poor scalability and are vulnerable to mode collapse, particularly when the target distribution is highly multi-modal. In this paper, we propose a new framework to overcome these challenges. Our method iterates through two stages. First, we train flow-based models to capture the data distribution and surrogate models that predict both function values and constraint violations with uncertainty quantification. Second, we cast the candidate selection problem as a posterior inference problem to effectively search for promising candidates that have high objective values while not violating the constraints. During posterior inference, we find that the posterior distribution is highly multi-modal and has a large plateau due to constraints, especially when constraint feedback is given as binary indicators of feasibility. To mitigate this issue, we amortize the sampling from the posterior distribution in the latent space of flow-based models, which is much smoother than that in the data space. We empirically demonstrate that our method achieves superior performance on various synthetic and real-world constrained black-box optimization tasks. Our code is publicly available \href{this https URL}{here}.

[229] arXiv:2507.00481 [pdf, html, other]
Title: The Influence of HEXACO Personality Traits on the Teamwork Quality in Software Teams -- A Preliminary Research Approach
Philipp M. Zähl, Sabine Theis, Martin R. Wolf
Subjects: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC)

Although software engineering research has focused on optimizing processes and technology, there is a growing recognition that human factors, particularly teamwork, also significantly impact optimization. Recent research suggests that developer personality has a strong influence on teamwork. In fact, personality considerations may have a greater impact on software development than processes and tools. This paper aims to design a study that measures the impact of HEXACO personality traits on the Teamwork Quality (TWQ) of software teams. A preliminary data collection (n=54) was conducted for this purpose. The analysis showed that several personality traits, as well as their composition, had a significant impact on TWQ. Additionally, other variables, such as the proportion of women and age distribution, also affected TWQ. The study's initial results demonstrate the usefulness and validity of the study design. The results also suggest several opportunities to improve teamwork in IT organizations and avenues for further research.

[230] arXiv:2507.00485 [pdf, html, other]
Title: PNAct: Crafting Backdoor Attacks in Safe Reinforcement Learning
Weiran Guo, Guanjun Liu, Ziyuan Zhou, Ling Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Reinforcement Learning (RL) is widely used in tasks where agents interact with an environment to maximize rewards. Building on this foundation, Safe Reinforcement Learning (Safe RL) incorporates a cost metric alongside the reward metric, ensuring that agents adhere to safety constraints during decision-making. In this paper, we identify that Safe RL is vulnerable to backdoor attacks, which can manipulate agents into performing unsafe actions. First, we introduce the relevant concepts and evaluation metrics for backdoor attacks in Safe RL. It is the first attack framework in the Safe RL field that involves both Positive and Negative Action sample (PNAct) is to implant backdoors, where positive action samples provide reference actions and negative action samples indicate actions to be avoided. We theoretically point out the properties of PNAct and design an attack algorithm. Finally, we conduct experiments to evaluate the effectiveness of our proposed backdoor attack framework, evaluating it with the established metrics. This paper highlights the potential risks associated with Safe RL and underscores the feasibility of such attacks. Our code and supplementary material are available at this https URL.

[231] arXiv:2507.00487 [pdf, html, other]
Title: MassTool: A Multi-Task Search-Based Tool Retrieval Framework for Large Language Models
Jianghao Lin, Xinyuan Wang, Xinyi Dai, Menghui Zhu, Bo Chen, Ruiming Tang, Yong Yu, Weinan Zhang
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Tool retrieval is a critical component in enabling large language models (LLMs) to interact effectively with external tools. It aims to precisely filter the massive tools into a small set of candidates for the downstream tool-augmented LLMs. However, most existing approaches primarily focus on optimizing tool representations, often neglecting the importance of precise query comprehension. To address this gap, we introduce MassTool, a multi-task search-based framework designed to enhance both query representation and tool retrieval accuracy. MassTool employs a two-tower architecture: a tool usage detection tower that predicts the need for function calls, and a tool retrieval tower that leverages a query-centric graph convolution network (QC-GCN) for effective query-tool matching. It also incorporates search-based user intent modeling (SUIM) to handle diverse and out-of-distribution queries, alongside an adaptive knowledge transfer (AdaKT) module for efficient multi-task learning. By jointly optimizing tool usage detection loss, list-wise retrieval loss, and contrastive regularization loss, MassTool establishes a robust dual-step sequential decision-making pipeline for precise query understanding. Extensive experiments demonstrate its effectiveness in improving retrieval accuracy. Our code is available at this https URL.

[232] arXiv:2507.00488 [pdf, html, other]
Title: Have Object-Oriented Languages Missed a Trick with Class Function and its Subclasses?
Lloyd Allison
Subjects: Programming Languages (cs.PL)

Compared to functions in mathematics, functions in programming languages seem to be under classified. Functional programming languages based on the lambda calculus famously treat functions as first-class values. Object-oriented languages have adopted ``lambdas'', notably for call-back routines in event-based programming. Typically a programming language has functions, a function has a type, and some functions act on other functions and/or return functions but there is generally a lack of (i) ``class Function'' in the OO sense of the word class and particularly (ii) subclasses of Function for functions having specific properties. Some such classes are presented here and programmed in some popular programming languages as an experimental investigation into OO languages missing this opportunity.

[233] arXiv:2507.00489 [pdf, html, other]
Title: Towards Efficient Random-Order Enumeration for Join Queries
Pengyu Chen, Zizheng Guo, Jianwei Yang, Dongjing Miao
Subjects: Databases (cs.DB)

In many data analysis pipelines, a basic and time-consuming process is to produce join results and feed them into downstream tasks. Numerous enumeration algorithms have been developed for this purpose. To be a statistically meaningful representation of the whole join result, the result tuples are required to be enumerated in uniformly random order. However, existing studies lack an efficient random-order enumeration algorithm with a worst-case runtime guarantee for (cyclic) join queries. In this paper, we study the problem of enumerating the results of a join query in random order. We develop an efficient random-order enumeration algorithm for join queries with no large hidden constants in its complexity, achieving expected $O(\frac{\mathrm{AGM}(Q)}{|Res(Q)|}\log^2|Q|)$ delay, $O(\mathrm{AGM}(Q)\log|Q|)$ total running time after $O(|Q|\log|Q|)$-time index construction, where $|Q|$ is the size of input, $\mathrm{AGM}(Q)$ is the AGM bound, and $|Res(Q)|$ is the size of the join result. We prove that our algorithm is near-optimal in the worst case, under the combinatorial $k$-clique hypothesis. Our algorithm requires no query-specific preprocessing and can be flexibly adapted to many common database indexes with only minor modifications. We also devise two non-trivial techniques to speed up the enumeration, and provide an experimental study on our enumeration algorithm along with the speed-up techniques. The experimental results show that our algorithm, enhanced with the proposed techniques, significantly outperforms existing state-of-the-art methods.

[234] arXiv:2507.00490 [pdf, html, other]
Title: Just Noticeable Difference for Large Multimodal Models
Zijian Chen, Yuan Tian, Yuze Sun, Wei Sun, Zicheng Zhang, Weisi Lin, Guangtao Zhai, Wenjun Zhang
Comments: 19 pages, 19 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Just noticeable difference (JND), the minimum change that the human visual system (HVS) can perceive, has been studied for decades. Although recent work has extended this line of research into machine vision, there has been a scarcity of studies systematically exploring its perceptual boundaries across multiple tasks and stimulus types, particularly in the current era of rapidly advancing large multimodal models (LMMs), where studying the multifaceted capabilities of models has become a mainstream focus. Moreover, the perceptual defects of LMMs are not investigated thoroughly, resulting in potential security issues and suboptimal response efficiency. In this paper, we take an initial attempt and demonstrate that there exist significant visual blind spots in current LMMs. To systemically quantify this characteristic, we propose a new concept, {\bf LMM-JND}, together with its determination pipeline. Targeting uncovering the behavior commonalities in HVS-aligned visual perception tasks, we delve into several LMM families and construct a large-scale dataset, named VPA-JND, which contains 21.5k reference images with over 489k stimuli across 12 distortion types, to facilitate LMM-JND studies. VPA-JND exposes areas where state-of-the-art LMMs, including GPT-4o and the InternVL2.5 series, struggle with basic comparison queries and fall significantly short of human-level visual performance. We further explore the effects of vision and language backbones and find a notable correlation between their design philosophy that may instruct the future refinement of LMMs for their visual acuity. Together, our research underscores the significance of LMM-JND as a unique perspective for studying LMMs, and predictable LMM-JND is crucial for security concerns. This work will be available at this https URL.

[235] arXiv:2507.00491 [pdf, html, other]
Title: Twill: Scheduling Compound AI Systems on Heterogeneous Mobile Edge Platforms
Zain Taufique, Aman Vyas, Antonio Miele, Pasi Liljeberg, Anil Kanduri
Comments: 9 Pages, 9 Figures, Accepted in International Conference on Computer-Aided Design (ICCAD) 2025
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Performance (cs.PF)

Compound AI (cAI) systems chain multiple AI models to solve complex problems. cAI systems are typically composed of deep neural networks (DNNs), transformers, and large language models (LLMs), exhibiting a high degree of computational diversity and dynamic workload variation. Deploying cAI services on mobile edge platforms poses a significant challenge in scheduling concurrent DNN-transformer inference tasks, which arrive dynamically in an unknown sequence. Existing mobile edge AI inference strategies manage multi-DNN or transformer-only workloads, relying on design-time profiling, and cannot handle concurrent inference of DNNs and transformers required by cAI systems. In this work, we address the challenge of scheduling cAI systems on heterogeneous mobile edge platforms. We present Twill, a run-time framework to handle concurrent inference requests of cAI workloads through task affinity-aware cluster mapping and migration, priority-aware task freezing/unfreezing, and DVFS, while minimizing inference latency within power budgets. We implement and deploy our Twill framework on the Nvidia Jetson Orin NX platform. We evaluate Twill against state-of-the-art edge AI inference techniques over contemporary DNNs and LLMs, reducing inference latency by 54% on average, while honoring power budgets.

[236] arXiv:2507.00493 [pdf, html, other]
Title: Visual Anagrams Reveal Hidden Differences in Holistic Shape Processing Across Vision Models
Fenil R. Doshi, Thomas Fel, Talia Konkle, George Alvarez
Comments: Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Humans are able to recognize objects based on both local texture cues and the configuration of object parts, yet contemporary vision models primarily harvest local texture cues, yielding brittle, non-compositional features. Work on shape-vs-texture bias has pitted shape and texture representations in opposition, measuring shape relative to texture, ignoring the possibility that models (and humans) can simultaneously rely on both types of cues, and obscuring the absolute quality of both types of representation. We therefore recast shape evaluation as a matter of absolute configural competence, operationalized by the Configural Shape Score (CSS), which (i) measures the ability to recognize both images in Object-Anagram pairs that preserve local texture while permuting global part arrangement to depict different object categories. Across 86 convolutional, transformer, and hybrid models, CSS (ii) uncovers a broad spectrum of configural sensitivity with fully self-supervised and language-aligned transformers -- exemplified by DINOv2, SigLIP2 and EVA-CLIP -- occupying the top end of the CSS spectrum. Mechanistic probes reveal that (iii) high-CSS networks depend on long-range interactions: radius-controlled attention masks abolish performance showing a distinctive U-shaped integration profile, and representational-similarity analyses expose a mid-depth transition from local to global coding. A BagNet control remains at chance (iv), ruling out "border-hacking" strategies. Finally, (v) we show that configural shape score also predicts other shape-dependent evals. Overall, we propose that the path toward truly robust, generalizable, and human-like vision systems may not lie in forcing an artificial choice between shape and texture, but rather in architectural and learning frameworks that seamlessly integrate both local-texture and global configural shape.

[237] arXiv:2507.00496 [pdf, html, other]
Title: Coverage-Guided Testing for Deep Learning Models: A Comprehensive Survey
Hongjing Guo, Chuanqi Tao, Zhiqiu Huang, Weiqin Zou
Subjects: Software Engineering (cs.SE)

As Deep Learning (DL) models are increasingly applied in safety-critical domains, ensuring their quality has emerged as a pressing challenge in modern software engineering. Among emerging validation paradigms, coverage-guided testing (CGT) has gained prominence as a systematic framework for identifying erroneous or unexpected model behaviors. Despite growing research attention, existing CGT studies remain methodologically fragmented, limiting the understanding of current advances and emerging trends. This work addresses that gap through a comprehensive review of state-of-the-art CGT methods for DL models, including test coverage analysis, coverage-guided test input generation, and coverage-guided test input optimization. This work provides detailed taxonomies to organize these methods based on methodological characteristics and application scenarios. We also investigate evaluation practices adopted in existing studies, including the use of benchmark datasets, model architectures, and evaluation aspects. Finally, open challenges and future directions are highlighted in terms of the correlation between structural coverage and testing objectives, method generalizability across tasks and models, practical deployment concerns, and the need for standardized evaluation and tool support. This work aims to provide a roadmap for future academic research and engineering practice in DL model quality assurance.

[238] arXiv:2507.00498 [pdf, html, other]
Title: MuteSwap: Silent Face-based Voice Conversion
Yifan Liu, Yu Fang, Zhouhan Lin
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)

Conventional voice conversion modifies voice characteristics from a source speaker to a target speaker, relying on audio input from both sides. However, this process becomes infeasible when clean audio is unavailable, such as in silent videos or noisy environments. In this work, we focus on the task of Silent Face-based Voice Conversion (SFVC), which does voice conversion entirely from visual inputs. i.e., given images of a target speaker and a silent video of a source speaker containing lip motion, SFVC generates speech aligning the identity of the target speaker while preserving the speech content in the source silent video. As this task requires generating intelligible speech and converting identity using only visual cues, it is particularly challenging. To address this, we introduce MuteSwap, a novel framework that employs contrastive learning to align cross-modality identities and minimize mutual information to separate shared visual features. Experimental results show that MuteSwap achieves impressive performance in both speech synthesis and identity conversion, especially under noisy conditions where methods dependent on audio input fail to produce intelligible results, demonstrating both the effectiveness of our training approach and the feasibility of SFVC.

[239] arXiv:2507.00501 [pdf, html, other]
Title: Laplace-Mamba: Laplace Frequency Prior-Guided Mamba-CNN Fusion Network for Image Dehazing
Yongzhen Wang, Liangliang Chen, Bingwen Hu, Heng Liu, Xiao-Ping Zhang, Mingqiang Wei
Comments: 12 pages, 11 figures, 6 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recent progress in image restoration has underscored Spatial State Models (SSMs) as powerful tools for modeling long-range dependencies, owing to their appealing linear complexity and computational efficiency. However, SSM-based approaches exhibit limitations in reconstructing localized structures and tend to be less effective when handling high-dimensional data, frequently resulting in suboptimal recovery of fine image features. To tackle these challenges, we introduce Laplace-Mamba, a novel framework that integrates Laplace frequency prior with a hybrid Mamba-CNN architecture for efficient image dehazing. Leveraging the Laplace decomposition, the image is disentangled into low-frequency components capturing global texture and high-frequency components representing edges and fine details. This decomposition enables specialized processing via dual parallel pathways: the low-frequency branch employs SSMs for global context modeling, while the high-frequency branch utilizes CNNs to refine local structural details, effectively addressing diverse haze scenarios. Notably, the Laplace transformation facilitates information-preserving downsampling of low-frequency components in accordance with the Nyquist theory, thereby significantly improving computational efficiency. Extensive evaluations across multiple benchmarks demonstrate that our method outperforms state-of-the-art approaches in both restoration quality and efficiency. The source code and pretrained models are available at this https URL.

[240] arXiv:2507.00502 [pdf, html, other]
Title: ExPaMoE: An Expandable Parallel Mixture of Experts for Continual Test-Time Adaptation
JianChao Zhao, Songlin Dong
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Continual Test-Time Adaptation (CTTA) aims to enable models to adapt on-the-fly to a stream of unlabeled data under evolving distribution shifts. However, existing CTTA methods typically rely on shared model parameters across all domains, making them vulnerable to feature entanglement and catastrophic forgetting in the presence of large or non-stationary domain shifts. To address this limitation, we propose \textbf{ExPaMoE}, a novel framework based on an \emph{Expandable Parallel Mixture-of-Experts} architecture. ExPaMoE decouples domain-general and domain-specific knowledge via a dual-branch expert design with token-guided feature separation, and dynamically expands its expert pool based on a \emph{Spectral-Aware Online Domain Discriminator} (SODD) that detects distribution changes in real-time using frequency-domain cues. Extensive experiments demonstrate the superiority of ExPaMoE across diverse CTTA scenarios. We evaluate our method on standard benchmarks including CIFAR-10C, CIFAR-100C, ImageNet-C, and Cityscapes-to-ACDC for semantic segmentation. Additionally, we introduce \textbf{ImageNet++}, a large-scale and realistic CTTA benchmark built from multiple ImageNet-derived datasets, to better reflect long-term adaptation under complex domain evolution. ExPaMoE consistently outperforms prior arts, showing strong robustness, scalability, and resistance to forgetting.

[241] arXiv:2507.00505 [pdf, html, other]
Title: LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs
Haoran Lou, Chunxiao Fan, Ziyan Liu, Yuexin Wu, Xinxiang Wang
Comments: ICCV
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The architecture of multimodal large language models (MLLMs) commonly connects a vision encoder, often based on CLIP-ViT, to a large language model. While CLIP-ViT works well for capturing global image features, it struggles to model local relationships between adjacent patches, leading to weaker visual representation, which in turn affects the detailed understanding ability of MLLMs. To solve this, we propose LLaVA-SP, which \textbf{ only adds six spatial visual tokens} to the original visual tokens to enhance the visual representation. Our approach offers three key advantages: 1)We propose a novel Projector, which uses convolutional kernels to derive visual spatial tokens from ViT patch features, simulating two visual spatial ordering approaches: ``from central region to global" and ``from abstract to specific". Then, a cross-attention mechanism is applied to fuse fine-grained visual information, enriching the overall visual representation. 2) We present two model variants: LLaVA-SP-Cropping, which focuses on detail features through progressive cropping, and LLaVA-SP-Pooling, which captures global semantics through adaptive pooling, enabling the model to handle diverse visual understanding tasks. 3) Extensive experiments show that LLaVA-SP, fine-tuned with LoRA, achieves significant performance improvements across various multimodal benchmarks, outperforming the state-of-the-art LLaVA-1.5 model in multiple tasks with nearly identical inference latency. The code and models are available at \href{this https URL}{\texttt{this https URL}}.

[242] arXiv:2507.00506 [pdf, html, other]
Title: SCING:Towards More Efficient and Robust Person Re-Identification through Selective Cross-modal Prompt Tuning
Yunfei Xie, Yuxuan Cheng, Juncheng Wu, Haoyu Zhang, Yuyin Zhou, Shoudong Han
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recent advancements in adapting vision-language pre-training models like CLIP for person re-identification (ReID) tasks often rely on complex adapter design or modality-specific tuning while neglecting cross-modal interaction, leading to high computational costs or suboptimal alignment. To address these limitations, we propose a simple yet effective framework named Selective Cross-modal Prompt Tuning (SCING) that enhances cross-modal alignment and robustness against real-world perturbations. Our method introduces two key innovations: Firstly, we proposed Selective Visual Prompt Fusion (SVIP), a lightweight module that dynamically injects discriminative visual features into text prompts via a cross-modal gating mechanism. Moreover, the proposed Perturbation-Driven Consistency Alignment (PDCA) is a dual-path training strategy that enforces invariant feature alignment under random image perturbations by regularizing consistency between original and augmented cross-modal embeddings. Extensive experiments are conducted on several popular benchmarks covering Market1501, DukeMTMC-ReID, Occluded-Duke, Occluded-REID, and P-DukeMTMC, which demonstrate the impressive performance of the proposed method. Notably, our framework eliminates heavy adapters while maintaining efficient inference, achieving an optimal trade-off between performance and computational overhead. The code will be released upon acceptance.

[243] arXiv:2507.00507 [pdf, html, other]
Title: LLM-Mesh: Enabling Elastic Sharing for Serverless LLM Inference
Chuhao Xu, Zijun Li, Quan Chen, Han Zhao, Minyi Guo
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

The rise of LLMs has driven demand for private serverless deployments, characterized by moderate-scale models and infrequent requests. While existing solutions follow exclusive GPU deployment, we take a step back to explore modern platforms and find that: Emerging CPU architectures with built-in accelerators are capable of serving LLMs but remain underutilized, and both CPUs and GPUs can accommodate multiple LLMs simultaneously.
We propose LLM-Mesh, a serverless inference scheme for small-to-mid-sized LLMs that enables elastic sharing across heterogeneous hardware. LLM-Mesh tackles three fundamental challenges: (1) precise, fine-grained compute resource allocation at token-level to handle fluctuating computational demands; (2) a coordinated and forward-looking memory scaling mechanism to detect out-of-memory hazards and reduce operational overhead; and (3) a dual approach that reduces resource fragmentation through proactive preemption and reactive bin-packing. Experimental results on 4 32-core CPUs and 4 A100 GPUs show that LLM-Meshimproves service capacity by 44% - 63% through sharing, while further leveraging CPUs boosts this to 91% - 159%.

[244] arXiv:2507.00509 [pdf, html, other]
Title: TeamCMU at Touché: Adversarial Co-Evolution for Advertisement Integration and Detection in Conversational Search
To Eun Kim, João Coelho, Gbemileke Onilude, Jai Singh
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

As conversational search engines increasingly adopt generation-based paradigms powered by Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG), the integration of advertisements into generated responses presents both commercial opportunities and challenges for user experience. Unlike traditional search, where advertisements are clearly delineated, generative systems blur the boundary between informational content and promotional material, raising concerns around transparency and trust. In this work, we propose a modular pipeline for advertisement management in RAG-based conversational systems, consisting of an ad-rewriter for seamless ad integration and a robust ad-classifier for detection. We leverage synthetic data to train high-performing classifiers, which are then used to guide two complementary ad-integration strategies: supervised fine-tuning of the ad-rewriter and a best-of-N sampling approach that selects the least detectable ad-integrated response among multiple candidates. Our evaluation focuses on two core questions: the effectiveness of ad classifiers in detecting diverse ad integration strategies, and the training methods that best support coherent, minimally intrusive ad insertion. Experimental results show that our ad-classifier, trained on synthetic advertisement data inspired by marketing strategies and enhanced through curriculum learning, achieves robust detection performance. Additionally, we demonstrate that classifier-guided optimization, through both fine-tuning and best-of-N sampling, significantly improves ad stealth, enabling more seamless integration. These findings contribute an adversarial co-evolution framework for developing more sophisticated ad-aware generative search systems and robust ad classifiers.

[245] arXiv:2507.00513 [pdf, html, other]
Title: Customer Service Representative's Perception of the AI Assistant in an Organization's Call Center
Kai Qin, Kexin Du, Yimeng Chen, Yueyan Liu, Jie Cai, Zhiqiang Nie, Nan Gao, Guohui Wei, Shengzhu Wang, Chun Yu
Comments: ACM CSCW Poster 2025
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

The integration of various AI tools creates a complex socio-technical environment where employee-customer interactions form the core of work practices. This study investigates how customer service representatives (CSRs) at the power grid service customer service call center perceive AI assistance in their interactions with customers. Through a field visit and semi-structured interviews with 13 CSRs, we found that AI can alleviate some traditional burdens during the call (e.g., typing and memorizing) but also introduces new burdens (e.g., earning, compliance, psychological burdens). This research contributes to a more nuanced understanding of AI integration in organizational settings and highlights the efforts and burdens undertaken by CSRs to adapt to the updated system.

[246] arXiv:2507.00516 [pdf, html, other]
Title: The Fourier spectral approach to the spatial discretization of quasilinear hyperbolic systems
Vincent Duchêne, Johanna Ulvedal Marstrander
Comments: 30 pages, 7 figures. Figures are reproducible using the Julia package WaterWaves1D at this https URL
Subjects: Numerical Analysis (math.NA); Analysis of PDEs (math.AP)

We discuss the rigorous justification of the spatial discretization by means of Fourier spectral methods of quasilinear first-order hyperbolic systems. We provide uniform stability estimates that grant spectral convergence of the (spatially) semi-discretized solutions towards the corresponding continuous solution provided that the underlying system satisfies some suitable structural assumptions. We consider a setting with sharp low-pass filters and a setting with smooth low-pass filters and argue that - at least theoretically - smooth low-pass filters are operable on a larger class of systems. While our theoretical results are supported with numerical evidence, we also pinpoint some behavior of the numerical method that currently has no theoretical explanation.

[247] arXiv:2507.00518 [pdf, html, other]
Title: Exploring Large Action Sets with Hyperspherical Embeddings using von Mises-Fisher Sampling
Walid Bendada, Guillaume Salha-Galvan, Romain Hennequin, Théo Bontempelli, Thomas Bouabça, Tristan Cazenave
Comments: 42nd International Conference on Machine Learning (ICML 2025)
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR)

This paper introduces von Mises-Fisher exploration (vMF-exp), a scalable method for exploring large action sets in reinforcement learning problems where hyperspherical embedding vectors represent these actions. vMF-exp involves initially sampling a state embedding representation using a von Mises-Fisher distribution, then exploring this representation's nearest neighbors, which scales to virtually unlimited numbers of candidate actions. We show that, under theoretical assumptions, vMF-exp asymptotically maintains the same probability of exploring each action as Boltzmann Exploration (B-exp), a popular alternative that, nonetheless, suffers from scalability issues as it requires computing softmax values for each action. Consequently, vMF-exp serves as a scalable alternative to B-exp for exploring large action sets with hyperspherical embeddings. Experiments on simulated data, real-world public data, and the successful large-scale deployment of vMF-exp on the recommender system of a global music streaming service empirically validate the key properties of the proposed method.

[248] arXiv:2507.00519 [pdf, html, other]
Title: Topology-Constrained Learning for Efficient Laparoscopic Liver Landmark Detection
Ruize Cui, Jiaan Zhang, Jialun Pei, Kai Wang, Pheng-Ann Heng, Jing Qin
Comments: This paper has been accepted by MICCAI 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Liver landmarks provide crucial anatomical guidance to the surgeon during laparoscopic liver surgery to minimize surgical risk. However, the tubular structural properties of landmarks and dynamic intraoperative deformations pose significant challenges for automatic landmark detection. In this study, we introduce TopoNet, a novel topology-constrained learning framework for laparoscopic liver landmark detection. Our framework adopts a snake-CNN dual-path encoder to simultaneously capture detailed RGB texture information and depth-informed topological structures. Meanwhile, we propose a boundary-aware topology fusion (BTF) module, which adaptively merges RGB-D features to enhance edge perception while preserving global topology. Additionally, a topological constraint loss function is embedded, which contains a center-line constraint loss and a topological persistence loss to ensure homotopy equivalence between predictions and labels. Extensive experiments on L3D and P2ILF datasets demonstrate that TopoNet achieves outstanding accuracy and computational complexity, highlighting the potential for clinical applications in laparoscopic liver surgery. Our code will be available at this https URL.

[249] arXiv:2507.00521 [pdf, html, other]
Title: \texttt{WebANNS}: Fast and Efficient Approximate Nearest Neighbor Search in Web Browsers
Mugeng Liu, Siqi Zhong, Qi Yang, Yudong Han, Xuanzhe Liu, Yun Ma
Comments: SIGIR 2025
Subjects: Information Retrieval (cs.IR)

Approximate nearest neighbor search (ANNS) has become vital to modern AI infrastructure, particularly in retrieval-augmented generation (RAG) applications. Numerous in-browser ANNS engines have emerged to seamlessly integrate with popular LLM-based web applications, while addressing privacy protection and challenges of heterogeneous device deployments. However, web browsers present unique challenges for ANNS, including computational limitations, external storage access issues, and memory utilization constraints, which state-of-the-art (SOTA) solutions fail to address comprehensively.
We propose \texttt{WebANNS}, a novel ANNS engine specifically designed for web browsers. \texttt{WebANNS} leverages WebAssembly to overcome computational bottlenecks, designs a lazy loading strategy to optimize data retrieval from external storage, and applies a heuristic approach to reduce memory usage. Experiments show that \texttt{WebANNS} is fast and memory efficient, achieving up to $743.8\times$ improvement in 99th percentile query latency over the SOTA engine, while reducing memory usage by up to 39\%. Note that \texttt{WebANNS} decreases query time from 10 seconds to the 10-millisecond range in browsers, making in-browser ANNS practical with user-acceptable latency.

[250] arXiv:2507.00522 [pdf, html, other]
Title: Cyber Attacks Detection, Prevention, and Source Localization in Digital Substation Communication using Hybrid Statistical-Deep Learning
Nicola Cibin, Bas Mulder, Herman Carstens, Peter Palensky, Alexandru Ştefanov
Comments: 10 pages, 6 figures. This work has been submitted to the IEEE for possible publication
Subjects: Cryptography and Security (cs.CR); Systems and Control (eess.SY)

The digital transformation of power systems is accelerating the adoption of IEC 61850 standard. However, its communication protocols, including Sampled Values (SV), lack built-in security features such as authentication and encryption, making them vulnerable to malicious packet injection. Such cyber attacks can delay fault clearance or trigger unintended circuit breaker operations. While most existing research focuses on detecting cyber attacks in digital substations, intrusion prevention systems have been disregarded because of the risk of potential communication network disruptions. This paper proposes a novel method using hybrid statistical-deep learning for the detection, prevention, and source localization of IEC 61850 SV injection attacks. The method uses exponentially modified Gaussian distributions to model communication network latency and long short-term memory and Elman recurrent neural network to detect anomalous variations in the estimated probability distributions. It effectively discards malicious SV frames with minimal processing overhead and latency, maintains robustness against communication network latency variation and time-synchronization issues, and guarantees a near-zero false positive rate in non-attack scenarios. Comprehensive validation is conducted on three testbeds involving industrial-grade devices, hardware-in-the-loop simulations, virtualized intelligent electronic devices and merging units, and high-fidelity emulated communication networks. Results demonstrate the method's suitability for practical deployment in IEC 61850-compliant digital substations.

[251] arXiv:2507.00523 [pdf, html, other]
Title: Edge Computing and its Application in Robotics: A Survey
Nazish Tahir, Ramviyas Parasuraman
Subjects: Robotics (cs.RO); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)

The Edge computing paradigm has gained prominence in both academic and industry circles in recent years. By implementing edge computing facilities and services in robotics, it becomes a key enabler in the deployment of artificial intelligence applications to robots. Time-sensitive robotics applications benefit from the reduced latency, mobility, and location awareness provided by the edge computing paradigm, which enables real-time data processing and intelligence at the network's edge. While the advantages of integrating edge computing into robotics are numerous, there has been no recent survey that comprehensively examines these benefits. This paper aims to bridge that gap by highlighting important work in the domain of edge robotics, examining recent advancements, and offering deeper insight into the challenges and motivations behind both current and emerging solutions. In particular, this article provides a comprehensive evaluation of recent developments in edge robotics, with an emphasis on fundamental applications, providing in-depth analysis of the key motivations, challenges, and future directions in this rapidly evolving domain. It also explores the importance of edge computing in real-world robotics scenarios where rapid response times are critical. Finally, the paper outlines various open research challenges in the field of edge robotics.

[252] arXiv:2507.00525 [pdf, html, other]
Title: Box-QAymo: Box-Referring VQA Dataset for Autonomous Driving
Djamahl Etchegaray, Yuxia Fu, Zi Huang, Yadan Luo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Interpretable communication is essential for safe and trustworthy autonomous driving, yet current vision-language models (VLMs) often operate under idealized assumptions and struggle to capture user intent in real-world scenarios. Existing driving-oriented VQA datasets are limited to full-scene descriptions or waypoint prediction, preventing the assessment of whether VLMs can respond to localized user-driven queries. We introduce Box-QAymo, a box-referring dataset and benchmark designed to both evaluate and finetune VLMs on spatial and temporal reasoning over user-specified objects. Users express intent by drawing bounding boxes, offering a fast and intuitive interface for focused queries in complex scenes. Specifically, we propose a hierarchical evaluation protocol that begins with binary sanity-check questions to assess basic model capacities, and progresses to (1) attribute prediction for box-referred objects, (2) motion understanding of target instances, and (3) spatiotemporal motion reasoning over inter-object dynamics across frames. To support this, we crowd-sourced fine-grained object classes and visual attributes that reflect the complexity drivers encounter, and extract object trajectories to construct temporally grounded QA pairs. Rigorous quality control through negative sampling, temporal consistency checks, and difficulty-aware balancing guarantee dataset robustness and diversity. Our comprehensive evaluation reveals significant limitations in current VLMs when queried about perception questions, highlighting the gap in achieving real-world performance. This work provides a foundation for developing more robust and interpretable autonomous driving systems that can communicate effectively with users under real-world conditions. Project page and dataset are available at this https URL.

[253] arXiv:2507.00531 [pdf, html, other]
Title: An inverse-free fixed-time stable dynamical system and its forward-Euler discretization for solving generalized absolute value equations
Xuehua Li, Linjie Chen, Dongmei Yu, Cairong Chen, Deren Han
Comments: 14 pages
Subjects: Numerical Analysis (math.NA); Optimization and Control (math.OC)

An inverse-free dynamical system is proposed to solve the generalized absolute value equation (GAVE) within a fixed time, where the time of convergence is finite and is uniformly bounded for all initial points. Moreover, an iterative method obtained by using the forward-Euler discretization of the proposed dynamic model are developed and sufficient conditions which guarantee that the discrete iteration globally converge to an arbitrarily small neighborhood of the unique solution of GAVE within a finite number of iterative steps are given.

[254] arXiv:2507.00534 [pdf, html, other]
Title: NIRANTAR: Continual Learning with New Languages and Domains on Real-world Speech Data
Tahir Javed, Kaushal Bhogale, Mitesh M. Khapra
Comments: Accepted in Interspecch 2025
Subjects: Computation and Language (cs.CL)

We introduce Nirantar, a comprehensive framework for evaluating continual learning (CL) in multilingual and multi-domain ASR. Designed to reflect real-world CL challenges, Nirantar leverages data collected incrementally across 22 languages and 208 districts in India through natural episodes. This enables evaluation across Language-Incremental (LIL), Domain-Incremental (DIL), and the novel Language-Incremental Domain-Incremental Learning (LIDIL) scenarios. Unlike prior work that relies on simulated episodes, Nirantar presents dynamic, non-uniform language and domain shifts, making it an ideal testbed for CL research. With 3250 hours of human-transcribed speech, including 1720 hours newly introduced in this work, our framework enables systematic benchmarking of CL methods. We evaluate existing approaches and demonstrate that no single method performs consistently well, underscoring the need for more robust CL strategies.

[255] arXiv:2507.00535 [pdf, html, other]
Title: Rethinking Group Recommender Systems in the Era of Generative AI: From One-Shot Recommendations to Agentic Group Decision Support
Dietmar Jannach, Amra Delić, Francesco Ricci, Markus Zanker
Comments: Submitted for publication
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

More than twenty-five years ago, first ideas were developed on how to design a system that can provide recommendations to groups of users instead of individual users. Since then, a rich variety of algorithmic proposals were published, e.g., on how to acquire individual preferences, how to aggregate them, and how to generate recommendations for groups of users. However, despite the rich literature on the topic, barely any examples of real-world group recommender systems can be found. This lets us question common assumptions in academic research, in particular regarding communication processes in a group and how recommendation-supported decisions are made. In this essay, we argue that these common assumptions and corresponding system designs often may not match the needs or expectations of users. We thus call for a reorientation in this research area, leveraging the capabilities of modern Generative AI assistants like ChatGPT. Specifically, as one promising future direction, we envision group recommender systems to be systems where human group members interact in a chat and an AI-based group recommendation agent assists the decision-making process in an agentic way. Ultimately, this shall lead to a more natural group decision-making environment and finally to wider adoption of group recommendation systems in practice.

[256] arXiv:2507.00537 [pdf, html, other]
Title: Not All Attention Heads Are What You Need: Refining CLIP's Image Representation with Attention Ablation
Feng Lin, Marco Chen, Haokui Zhang, Xiaotian Yu, Guangming Lu, Rong Xiao
Comments: 21 pages, 7 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

This paper studies the role of attention heads in CLIP's image encoder. While CLIP has exhibited robust performance across diverse applications, we hypothesize that certain attention heads negatively affect final representations and that ablating them can improve performance in downstream tasks. To capitalize on this insight, we propose a simple yet effective method, called Attention Ablation Technique (AAT), to suppress the contribution of specific heads by manipulating attention weights. By integrating two alternative strategies tailored for different application scenarios, AAT systematically identifies and ablates detrimental attention heads to enhance representation quality. Experiments demonstrate that AAT consistently improves downstream task performance across various domains, boosting recall rate by up to 11.1% on CLIP-family models for cross-modal retrieval. The results highlight the potential of AAT to effectively refine large-scale vision-language models with virtually no increase in inference cost.

[257] arXiv:2507.00539 [pdf, html, other]
Title: Ensemble Kalman Filter for Data Assimilation coupled with low-resolution computations techniques applied in Fluid Dynamics
Paul Jeanney, Ashton Hetherington, Shady E. Ahmed, David Lanceta, Susana Saiz, José Miguel Perez, Soledad Le CLainche
Comments: article, 49 pages, 29 figures, 4 tables
Subjects: Computational Engineering, Finance, and Science (cs.CE); Fluid Dynamics (physics.flu-dyn)

This paper presents an innovative Reduced-Order Model (ROM) for merging experimental and simulation data using Data Assimilation (DA) to estimate the "True" state of a fluid dynamics system, leading to more accurate predictions. Our methodology introduces a novel approach implementing the Ensemble Kalman Filter (EnKF) within a reduced-dimensional framework, grounded in a robust theoretical foundation and applied to fluid dynamics. To address the substantial computational demands of DA, the proposed ROM employs low-resolution (LR) techniques to drastically reduce computational costs. This approach involves downsampling datasets for DA computations, followed by an advanced reconstruction technique based on low-cost Singular Value Decomposition (lcSVD). The lcSVD method, a key innovation in this paper, has never been applied to DA before and offers a highly efficient way to enhance resolution with minimal computational resources. Our results demonstrate significant reductions in both computation time and RAM usage through the LR techniques without compromising the accuracy of the estimations. For instance, in a turbulent test case, the LR approach with a compression rate of 15.9 can achieve a speed-up of 13.7 and a RAM compression of 90.9% while maintaining a low Relative Root Mean Square Error (RRMSE) of 2.6%, compared to 0.8% in the high-resolution (HR) reference. Furthermore, we highlight the effectiveness of the EnKF in estimating and predicting the state of fluid flow systems based on limited observations and low-fidelity numerical data. This paper highlights the potential of the proposed DA method in fluid dynamics applications, particularly for improving computational efficiency in CFD and related fields. Its ability to balance accuracy with low computational and memory costs makes it suitable for large-scale and real-time applications, such as environmental monitoring or aerospace.

[258] arXiv:2507.00540 [pdf, other]
Title: Capsule Network-Based Semantic Intent Modeling for Human-Computer Interaction
Shixiao Wang, Yifan Zhuang, Runsheng Zhang, Zhijun Song
Subjects: Computation and Language (cs.CL)

This paper proposes a user semantic intent modeling algorithm based on Capsule Networks to address the problem of insufficient accuracy in intent recognition for human-computer interaction. The method represents semantic features in input text through a vectorized capsule structure. It uses a dynamic routing mechanism to transfer information across multiple capsule layers. This helps capture hierarchical relationships and part-whole structures between semantic entities more effectively. The model uses a convolutional feature extraction module as the low-level encoder. After generating initial semantic capsules, it forms high-level abstract intent representations through an iterative routing process. To further enhance performance, a margin-based mechanism is introduced into the loss function. This improves the model's ability to distinguish between intent classes. Experiments are conducted using a public natural language understanding dataset. Multiple mainstream models are used for comparison. Results show that the proposed model outperforms traditional methods and other deep learning structures in terms of accuracy, F1-score, and intent detection rate. The study also analyzes the effect of the number of dynamic routing iterations on model performance. A convergence curve of the loss function during training is provided. These results verify the stability and effectiveness of the proposed method in semantic modeling. Overall, this study presents a new structured modeling approach to improve intent recognition under complex semantic conditions.

[259] arXiv:2507.00543 [pdf, other]
Title: Reliable Annotations with Less Effort: Evaluating LLM-Human Collaboration in Search Clarifications
Leila Tavakoli, Hamed Zamani
Comments: 9 pages,5 figures
Subjects: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)

Despite growing interest in using large language models (LLMs) to automate annotation, their effectiveness in complex, nuanced, and multi-dimensional labelling tasks remains relatively underexplored. This study focuses on annotation for the search clarification task, leveraging a high-quality, multi-dimensional dataset that includes five distinct fine-grained annotation subtasks. Although LLMs have shown impressive capabilities in general settings, our study reveals that even state-of-the-art models struggle to replicate human-level performance in subjective or fine-grained evaluation tasks. Through a systematic assessment, we demonstrate that LLM predictions are often inconsistent, poorly calibrated, and highly sensitive to prompt variations. To address these limitations, we propose a simple yet effective human-in-the-loop (HITL) workflow that uses confidence thresholds and inter-model disagreement to selectively involve human review. Our findings show that this lightweight intervention significantly improves annotation reliability while reducing human effort by up to 45%, offering a relatively scalable and cost-effective yet accurate path forward for deploying LLMs in real-world evaluation settings.

[260] arXiv:2507.00547 [pdf, other]
Title: Methodological Rigour in Algorithm Application: An Illustration of Topic Modelling Algorithm
Malmi Amadoru
Journal-ref: ACIS 2024 Proceedings. 10 (2024)
Subjects: Computation and Language (cs.CL)

The rise of advanced computational algorithms has opened new avenues for computationally intensive research approaches to theory development. However, the opacity of these algorithms and lack of transparency and rigour in their application pose methodological challenges, potentially undermining trust in research. The discourse on methodological rigour in this new genre of research is still emerging. Against this backdrop, I attempt to offer guidance on methodological rigour, particularly in the context of topic modelling algorithms. By illustrating the application of the structural topic modelling algorithm and presenting a set of guidelines, I discuss how to ensure rigour in topic modelling studies. Although the guidelines are for the application of topic modelling algorithms, they can be applied to other algorithms with context-specific adjustments. The guidelines are helpful, especially for novice researchers applying topic modelling, and editors and reviewers handling topic modelling manuscripts. I contribute to the literature on topic modelling and join the emerging dialogue on methodological rigour in computationally intensive theory construction research.

[261] arXiv:2507.00550 [pdf, other]
Title: Collaborative Multi-Agent Reinforcement Learning Approach for Elastic Cloud Resource Scaling
Bruce Fang, Danyi Gao
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

This paper addresses the challenges of rapid resource variation and highly uncertain task loads in cloud computing environments. It proposes an optimization method for elastic cloud resource scaling based on a multi-agent system. The method deploys multiple autonomous agents to perceive resource states in parallel and make local decisions. While maintaining the distributed nature of the system, it introduces a collaborative value function to achieve global coordination. This improves the responsiveness of resource scheduling and enhances overall system performance. To strengthen system foresight, a lightweight state prediction model is designed. It assists agents in identifying future workload trends and optimizes the selection of scaling actions. For policy training, the method adopts a centralized training and decentralized execution reinforcement learning framework. This enables agents to learn effectively and coordinate strategies under conditions of incomplete information. The paper also constructs typical cloud scenarios, including multi-tenancy and burst traffic, to evaluate the proposed method. The evaluation focuses on resource isolation, service quality assurance, and robustness. Experimental results show that the proposed multi-agent scaling strategy outperforms existing methods in resource utilization, SLA violation control, and scheduling latency. The results demonstrate strong adaptability and intelligent regulation. This provides an efficient and reliable new approach to solving the problem of elastic resource scaling in complex cloud platforms.

[262] arXiv:2507.00552 [pdf, html, other]
Title: Generation of Indoor Open Street Maps for Robot Navigation from CAD Files
Jiajie Zhang, Shenrui Wu, Xu Ma, Sören Schwertfeger
Comments: 8 pages, 8 figures
Subjects: Robotics (cs.RO)

The deployment of autonomous mobile robots is predicated on the availability of environmental maps, yet conventional generation via SLAM (Simultaneous Localization and Mapping) suffers from significant limitations in time, labor, and robustness, particularly in dynamic, large-scale indoor environments where map obsolescence can lead to critical localization failures. To address these challenges, this paper presents a complete and automated system for converting architectural Computer-Aided Design (CAD) files into a hierarchical topometric OpenStreetMap (OSM) representation, tailored for robust life-long robot navigation. Our core methodology involves a multi-stage pipeline that first isolates key structural layers from the raw CAD data and then employs an AreaGraph-based topological segmentation to partition the building layout into a hierarchical graph of navigable spaces. This process yields a comprehensive and semantically rich map, further enhanced by automatically associating textual labels from the CAD source and cohesively merging multiple building floors into a unified, topologically-correct model. By leveraging the permanent structural information inherent in CAD files, our system circumvents the inefficiencies and fragility of SLAM, offering a practical and scalable solution for deploying robots in complex indoor spaces. The software is encapsulated within an intuitive Graphical User Interface (GUI) to facilitate practical use. The code and dataset are available at this https URL.

[263] arXiv:2507.00554 [pdf, html, other]
Title: LOD-GS: Level-of-Detail-Sensitive 3D Gaussian Splatting for Detail Conserved Anti-Aliasing
Zhenya Yang, Bingchen Gong, Kai Chen, Qi Dou
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Despite the advancements in quality and efficiency achieved by 3D Gaussian Splatting (3DGS) in 3D scene rendering, aliasing artifacts remain a persistent challenge. Existing approaches primarily rely on low-pass filtering to mitigate aliasing. However, these methods are not sensitive to the sampling rate, often resulting in under-filtering and over-smoothing renderings. To address this limitation, we propose LOD-GS, a Level-of-Detail-sensitive filtering framework for Gaussian Splatting, which dynamically predicts the optimal filtering strength for each 3D Gaussian primitive. Specifically, we introduce a set of basis functions to each Gaussian, which take the sampling rate as input to model appearance variations, enabling sampling-rate-sensitive filtering. These basis function parameters are jointly optimized with the 3D Gaussian in an end-to-end manner. The sampling rate is influenced by both focal length and camera distance. However, existing methods and datasets rely solely on down-sampling to simulate focal length changes for anti-aliasing evaluation, overlooking the impact of camera distance. To enable a more comprehensive assessment, we introduce a new synthetic dataset featuring objects rendered at varying camera distances. Extensive experiments on both public datasets and our newly collected dataset demonstrate that our method achieves SOTA rendering quality while effectively eliminating aliasing. The code and dataset have been open-sourced.

[264] arXiv:2507.00557 [pdf, html, other]
Title: Advancing Local Search in SMT-NRA with MCSAT Integration
Tianyi Ding, Haokun Li, Xinpeng Ni, Bican Xia, Tianqi Zhao
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Symbolic Computation (cs.SC)

In this paper, we advance local search for Satisfiability Modulo the Theory of Nonlinear Real Arithmetic (SMT-NRA for short). First, we introduce a two-dimensional cell-jump move, called \emph{$2d$-cell-jump}, generalizing the key operation, cell-jump, of the local search method for SMT-NRA. Then, we propose an extended local search framework, named \emph{$2d$-LS} (following the local search framework, LS, for SMT-NRA), integrating the model constructing satisfiability calculus (MCSAT) framework to improve search efficiency. To further improve the efficiency of MCSAT, we implement a recently proposed technique called \emph{sample-cell projection operator} for MCSAT, which is well suited for CDCL-style search in the real domain and helps guide the search away from conflicting states. Finally, we design a hybrid framework for SMT-NRA combining MCSAT, $2d$-LS and OpenCAD, to improve search efficiency through information exchange. The experimental results demonstrate improvements in local search performance, highlighting the effectiveness of the proposed methods.

[265] arXiv:2507.00563 [pdf, html, other]
Title: Isogeometric contact analysis in subsea umbilical and power cables
Tianjiao Dai (1 and 2), Shuo Yang (3), Xing Jin (4), Svein Sævik (5), Jiaxuan Zhang (1), Jun Wu (1), Naiquan Ye (6) ((1) School of Naval Architecture and Ocean Engineering, Huazhong University of Science and Technology (HUST), Wuhan, China, (2) Hubei Key Laboratory of Naval Architecture and Ocean Engineering Hydrodynamics, HUST, Wuhan, China, (3) Zoomlion Heavy Industry Science and Technology Co., Ltd., China, (4) China Offshore Engineering and Technology Co., Ltd., China, (5) Department of Marine Technology, Norwegian University of Science and Technology, Trondheim, Norway, (6) Energy and Transport, Sintef Ocean, Trondheim, Norway)
Subjects: Numerical Analysis (math.NA)

Subsea umbilical and power cables contain a large number of contact interfaces between different geometries and materials. These complex interactions rise significant challenges for accurately considering contact surface properties by using traditional analytical solutions or finite element methods. These properties have been identified as the most sensitive parameters when performing the numerical simulation for stress analysis. Therefore, it is essential to apply a novel approach for contact analysis which improves the accuracy and efficiency for predicting contact properties. This paper presents an isogeometric analysis (IGA) approach addressing contact problems in dynamic umbilicals and power cables. Firstly, this isogeometric contact algorithm is formulated in MATLAB as a tool including the geometry description, contact detection and penalty function. Secondly, the contact interface between a steel tube and an outer sheath in an dynamic umbilical is established by this IGA contact algorithm and validated against that in ABAQUS for proving the accuracy and efficiency of IGA. Finally, the effects of element refinement, geometrical description, penalty factor on the accuracy, efficiency and stability of IGA are discussed.

[266] arXiv:2507.00564 [pdf, html, other]
Title: Computational complexity of covering regular trees
Jan Bok, Jiří Fiala, Nikola Jedličková, Jan Kratochvíl
Comments: 19 pages, accepted to MFCS 2025
Subjects: Discrete Mathematics (cs.DM); Combinatorics (math.CO)

A graph covering projection, also referred to as a locally bijective homomorphism, is a mapping between the vertices and edges of two graphs that preserves incidences and is a local bijection. This concept originates in topological graph theory but has also found applications in combinatorics and theoretical computer science. In this paper we consider undirected graphs in the most general setting -- graphs may contain multiple edges, loops, and semi-edges. This is in line with recent trends in topological graph theory and mathematical physics.
We advance the study of the computational complexity of the {\sc $H$-Cover} problem, which asks whether an input graph allows a covering projection onto a parameter graph $H$. The quest for a complete characterization started in 1990's. Several results for simple graphs or graphs without semi-edges have been known, the role of semi-edges in the complexity setting has started to be investigated only recently. One of the most general known NP-hardness results states that {\sc $H$}-Cover is NP-complete for every simple connected regular graph of valency greater than two. We complement this result by considering regular graphs $H$ arising from connected acyclic graphs by adding semi-edges. Namely, we prove that any graph obtained by adding semi-edges to the vertices of a tree making it a $d$-regular graph with $d \geq 3$, defines an NP-complete graph covering problem. In line with the so called Strong Dichotomy Conjecture, we prove that the NP-hardness holds even for simple graphs on input.

[267] arXiv:2507.00566 [pdf, html, other]
Title: Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment
Kai Zhou, Shuhai Zhang, Zeng You, Jinwu Hu, Mingkui Tan, Fei Liu
Comments: This paper is accepted by IEEE TIP 2025. Code is publicly available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Zero-shot skeleton-based action recognition aims to classify unseen skeleton-based human actions without prior exposure to such categories during training. This task is extremely challenging due to the difficulty in generalizing from known to unknown actions. Previous studies typically use two-stage training: pre-training skeleton encoders on seen action categories using cross-entropy loss and then aligning pre-extracted skeleton and text features, enabling knowledge transfer to unseen classes through skeleton-text alignment and language models' generalization. However, their efficacy is hindered by 1) insufficient discrimination for skeleton features, as the fixed skeleton encoder fails to capture necessary alignment information for effective skeleton-text alignment; 2) the neglect of alignment bias between skeleton and unseen text features during testing. To this end, we propose a prototype-guided feature alignment paradigm for zero-shot skeleton-based action recognition, termed PGFA. Specifically, we develop an end-to-end cross-modal contrastive training framework to improve skeleton-text alignment, ensuring sufficient discrimination for skeleton features. Additionally, we introduce a prototype-guided text feature alignment strategy to mitigate the adverse impact of the distribution discrepancy during testing. We provide a theoretical analysis to support our prototype-guided text feature alignment strategy and empirically evaluate our overall PGFA on three well-known datasets. Compared with the top competitor SMIE method, our PGFA achieves absolute accuracy improvements of 22.96%, 12.53%, and 18.54% on the NTU-60, NTU-120, and PKU-MMD datasets, respectively.

[268] arXiv:2507.00570 [pdf, html, other]
Title: Out-of-distribution detection in 3D applications: a review
Zizhao Li, Xueyang Kang, Joseph West, Kourosh Khoshelham
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The ability to detect objects that are not prevalent in the training set is a critical capability in many 3D applications, including autonomous driving. Machine learning methods for object recognition often assume that all object categories encountered during inference belong to a closed set of classes present in the training data. This assumption limits generalization to the real world, as objects not seen during training may be misclassified or entirely ignored. As part of reliable AI, OOD detection identifies inputs that deviate significantly from the training distribution. This paper provides a comprehensive overview of OOD detection within the broader scope of trustworthy and uncertain AI. We begin with key use cases across diverse domains, introduce benchmark datasets spanning multiple modalities, and discuss evaluation metrics. Next, we present a comparative analysis of OOD detection methodologies, exploring model structures, uncertainty indicators, and distributional distance taxonomies, alongside uncertainty calibration techniques. Finally, we highlight promising research directions, including adversarially robust OOD detection and failure identification, particularly relevant to 3D applications. The paper offers both theoretical and practical insights into OOD detection, showcasing emerging research opportunities such as 3D vision integration. These insights help new researchers navigate the field more effectively, contributing to the development of reliable, safe, and robust AI systems.

[269] arXiv:2507.00573 [pdf, html, other]
Title: High order global flux schemes for general steady state preservation of shallow water moment equations with non-conservative products
Mirco Ciallella, Julian Koellermeier
Subjects: Numerical Analysis (math.NA)

Shallow water moment equations are reduced-order models for free-surface flows that allow to represent vertical variations of the velocity profile at the expense of additional evolution equations for a number of additional variables, so called moments. This introduces non-linear non-conservative products in the system, which make the analytical characterization of steady states much harder if not impossible. The lack of analytical steady states poses a challenge for the design of well-balanced schemes, which aim at preserving such steady states as crucial in many applications. In this work, we present a family of fully well-balanced, high-order WENO finite volume methods for general hyperbolic balance laws with non-conservative products like the shallow water moment equations, for which no analytical steady states are available. The schemes are based on the flux globalization approach, in which both source terms and non-conservative products are integrated with a tailored high order quadrature in the divergence term. The resulting global flux is then reconstructed instead of the conservative variables to preserve all steady states. Numerical tests show the optimal convergence of the method and a significant error reduction for steady state solutions. Furthermore, we provide a numerical comparison of perturbed steady states for different families of shallow water moment equations, which illustrates the flexibility of our method that is valid for general equations without prior knowledge of steady states.

[270] arXiv:2507.00574 [pdf, html, other]
Title: Foundation Models for Clinical Records at Health System Scale
Haresh Rengaraj Rajamohan, Xiang Gao, Weicheng Zhu, Shih-Lun Huang, Long Chen, Kyunghyun Cho, Cem M. Deniz, Narges Razavian
Comments: Accepted to ICML 2025 Workshop on Foundation Models for Structured Data
Subjects: Machine Learning (cs.LG)

Large-scale pretraining has transformed modeling of language and other data types, but its potential remains underexplored in healthcare with structured electronic health records (EHRs). We present a novel generative pretraining strategy for sequential EHR data using next-visit event prediction. Our model learns to autoregressively generate various tokenized clinical events for the next visit based on patient history and inherently handles the joint prediction of heterogeneous data types. Additionally, we introduce regularization on predicting repeated events and highlight a key pitfall in EHR-based foundation model evaluations: repeated event tokens can inflate performance metrics when new onsets are not distinguished from subsequent occurrences. Our model is evaluated via zero-shot prediction for forecasting dementia and knee osteoarthritis incidence within 2 and 5 years, and the model performance rivals a fully fine-tuned masked pretrained Transformer baseline, demonstrating that our approach captures complex clinical dependencies without requiring costly task-specific fine-tuning.

[271] arXiv:2507.00576 [pdf, html, other]
Title: DynoStore: A wide-area distribution system for the management of data over heterogeneous storage
Dante D. Sanchez-Gallegos, J. L. Gonzalez-Compean, Maxime Gonthier, Valerie Hayot-Sasson, J. Gregory Pauloski, Haochen Pan, Kyle Chard, Jesus Carretero, Ian Foster
Comments: 10 pages. Conference: The 25th IEEE International Symposium on Cluster, Cloud, and Internet Computing
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Data distribution across different facilities offers benefits such as enhanced resource utilization, increased resilience through replication, and improved performance by processing data near its source. However, managing such data is challenging due to heterogeneous access protocols, disparate authentication models, and the lack of a unified coordination framework. This paper presents DynoStore, a system that manages data across heterogeneous storage systems. At the core of DynoStore are data containers, an abstraction that provides standardized interfaces for seamless data management, irrespective of the underlying storage systems. Multiple data container connections create a cohesive wide-area storage network, ensuring resilience using erasure coding policies. Furthermore, a load-balancing algorithm ensures equitable and efficient utilization of storage resources. We evaluate DynoStore using benchmarks and real-world case studies, including the management of medical and satellite data across geographically distributed environments. Our results demonstrate a 10\% performance improvement compared to centralized cloud-hosted systems while maintaining competitive performance with state-of-the-art solutions such as Redis and IPFS. DynoStore also exhibits superior fault tolerance, withstanding more failures than traditional systems.

[272] arXiv:2507.00577 [pdf, html, other]
Title: BadViM: Backdoor Attack against Vision Mamba
Yinghao Wu, Liyan Zhang
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Vision State Space Models (SSMs), particularly architectures like Vision Mamba (ViM), have emerged as promising alternatives to Vision Transformers (ViTs). However, the security implications of this novel architecture, especially their vulnerability to backdoor attacks, remain critically underexplored. Backdoor attacks aim to embed hidden triggers into victim models, causing the model to misclassify inputs containing these triggers while maintaining normal behavior on clean inputs. This paper investigates the susceptibility of ViM to backdoor attacks by introducing BadViM, a novel backdoor attack framework specifically designed for Vision Mamba. The proposed BadViM leverages a Resonant Frequency Trigger (RFT) that exploits the frequency sensitivity patterns of the victim model to create stealthy, distributed triggers. To maximize attack efficacy, we propose a Hidden State Alignment loss that strategically manipulates the internal representations of model by aligning the hidden states of backdoor images with those of target classes. Extensive experimental results demonstrate that BadViM achieves superior attack success rates while maintaining clean data accuracy. Meanwhile, BadViM exhibits remarkable resilience against common defensive measures, including PatchDrop, PatchShuffle and JPEG compression, which typically neutralize normal backdoor attacks.

[273] arXiv:2507.00579 [pdf, html, other]
Title: TUM-MiKaNi at SemEval-2025 Task 3: Towards Multilingual and Knowledge-Aware Non-factual Hallucination Identification
Miriam Anschütz, Ekaterina Gikalo, Niklas Herbster, Georg Groh
Comments: 6 pages, 3 figures, SemEval-2025 Task 3, ACL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Hallucinations are one of the major problems of LLMs, hindering their trustworthiness and deployment to wider use cases. However, most of the research on hallucinations focuses on English data, neglecting the multilingual nature of LLMs. This paper describes our submission to the SemEval-2025 Task-3 - Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. We propose a two-part pipeline that combines retrieval-based fact verification against Wikipedia with a BERT-based system fine-tuned to identify common hallucination patterns. Our system achieves competitive results across all languages, reaching top-10 results in eight languages, including English. Moreover, it supports multiple languages beyond the fourteen covered by the shared task. This multilingual hallucination identifier can help to improve LLM outputs and their usefulness in the future.

[274] arXiv:2507.00583 [pdf, html, other]
Title: AI-Generated Video Detection via Perceptual Straightening
Christian Internò, Robert Geirhos, Markus Olhofer, Sunny Liu, Barbara Hammer, David Klindt
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

The rapid advancement of generative AI enables highly realistic synthetic videos, posing significant challenges for content authentication and raising urgent concerns about misuse. Existing detection methods often struggle with generalization and capturing subtle temporal inconsistencies. We propose ReStraV(Representation Straightening Video), a novel approach to distinguish natural from AI-generated videos. Inspired by the "perceptual straightening" hypothesis -- which suggests real-world video trajectories become more straight in neural representation domain -- we analyze deviations from this expected geometric property. Using a pre-trained self-supervised vision transformer (DINOv2), we quantify the temporal curvature and stepwise distance in the model's representation domain. We aggregate statistics of these measures for each video and train a classifier. Our analysis shows that AI-generated videos exhibit significantly different curvature and distance patterns compared to real videos. A lightweight classifier achieves state-of-the-art detection performance (e.g., 97.17% accuracy and 98.63% AUROC on the VidProM benchmark), substantially outperforming existing image- and video-based methods. ReStraV is computationally efficient, it is offering a low-cost and effective detection solution. This work provides new insights into using neural representation geometry for AI-generated video detection.

[275] arXiv:2507.00585 [pdf, html, other]
Title: Similarity Memory Prior is All You Need for Medical Image Segmentation
Tang Hao, Guo ZhiQing, Wang LieJun, Liu Chao
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In recent years, it has been found that "grandmother cells" in the primary visual cortex (V1) of macaques can directly recognize visual input with complex shapes. This inspires us to examine the value of these cells in promoting the research of medical image segmentation. In this paper, we design a Similarity Memory Prior Network (Sim-MPNet) for medical image segmentation. Specifically, we propose a Dynamic Memory Weights-Loss Attention (DMW-LA), which matches and remembers the category features of specific lesions or organs in medical images through the similarity memory prior in the prototype memory bank, thus helping the network to learn subtle texture changes between categories. DMW-LA also dynamically updates the similarity memory prior in reverse through Weight-Loss Dynamic (W-LD) update strategy, effectively assisting the network directly extract category features. In addition, we propose the Double-Similarity Global Internal Enhancement Module (DS-GIM) to deeply explore the internal differences in the feature distribution of input data through cosine similarity and euclidean distance. Extensive experiments on four public datasets show that Sim-MPNet has better segmentation performance than other state-of-the-art methods. Our code is available on this https URL.

[276] arXiv:2507.00586 [pdf, html, other]
Title: Context-Aware Academic Emotion Dataset and Benchmark
Luming Zhao, Jingwen Xuan, Jiamin Lou, Yonghui Yu, Wenwu Yang
Comments: Accepted to ICCV 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Academic emotion analysis plays a crucial role in evaluating students' engagement and cognitive states during the learning process. This paper addresses the challenge of automatically recognizing academic emotions through facial expressions in real-world learning environments. While significant progress has been made in facial expression recognition for basic emotions, academic emotion recognition remains underexplored, largely due to the scarcity of publicly available datasets. To bridge this gap, we introduce RAER, a novel dataset comprising approximately 2,700 video clips collected from around 140 students in diverse, natural learning contexts such as classrooms, libraries, laboratories, and dormitories, covering both classroom sessions and individual study. Each clip was annotated independently by approximately ten annotators using two distinct sets of academic emotion labels with varying granularity, enhancing annotation consistency and reliability. To our knowledge, RAER is the first dataset capturing diverse natural learning scenarios. Observing that annotators naturally consider context cues-such as whether a student is looking at a phone or reading a book-alongside facial expressions, we propose CLIP-CAER (CLIP-based Context-aware Academic Emotion Recognition). Our method utilizes learnable text prompts within the vision-language model CLIP to effectively integrate facial expression and context cues from videos. Experimental results demonstrate that CLIP-CAER substantially outperforms state-of-the-art video-based facial expression recognition methods, which are primarily designed for basic emotions, emphasizing the crucial role of context in accurately recognizing academic emotions. Project page: this https URL

[277] arXiv:2507.00589 [pdf, html, other]
Title: Quantum Circuit Structure Optimization for Quantum Reinforcement Learning
Seok Bin Son, Joongheon Kim
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Reinforcement learning (RL) enables agents to learn optimal policies through environmental interaction. However, RL suffers from reduced learning efficiency due to the curse of dimensionality in high-dimensional spaces. Quantum reinforcement learning (QRL) addresses this issue by leveraging superposition and entanglement in quantum computing, allowing efficient handling of high-dimensional problems with fewer resources. QRL combines quantum neural networks (QNNs) with RL, where the parameterized quantum circuit (PQC) acts as the core computational module. The PQC performs linear and nonlinear transformations through gate operations, similar to hidden layers in classical neural networks. Previous QRL studies, however, have used fixed PQC structures based on empirical intuition without verifying their optimality. This paper proposes a QRL-NAS algorithm that integrates quantum neural architecture search (QNAS) to optimize PQC structures within QRL. Experiments demonstrate that QRL-NAS achieves higher rewards than QRL with fixed circuits, validating its effectiveness and practical utility.

[278] arXiv:2507.00591 [pdf, html, other]
Title: Construction of LDPC convolutional codes with large girth from Latin squares
Elisa Junghans, Julia Lieb
Subjects: Information Theory (cs.IT)

Due to their capacity approaching performance low-density parity-check (LDPC) codes gained a lot of attention in the last years. The parity-check matrix of the codes can be associated with a bipartite graph, called Tanner graph. To decrease the probability of decoding failure it is desirable to have LDPC codes with large girth of the associated Tanner graph. Moreover, to store such codes efficiently, it is desirable to have compact constructions for them. In this paper, we present constructions of LDPC convolutional codes with girth up to $12$ using a special class of Latin squares and several lifting steps, which enables a compact representation of these codes. With these techniques, we can provide constructions for well-performing and efficiently storable time-varying and time-invariant LDPC convolutional codes as well as for LDPC block codes.

[279] arXiv:2507.00593 [pdf, html, other]
Title: Overtake Detection in Trucks Using CAN Bus Signals: A Comparative Study of Machine Learning Methods
Fernando Alonso-Fernandez, Talha Hanif Butt, Prayag Tiwari
Comments: Under review at ESWA
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Safe overtaking manoeuvres in trucks are vital for preventing accidents and ensuring efficient traffic flow. Accurate prediction of such manoeuvres is essential for Advanced Driver Assistance Systems (ADAS) to make timely and informed decisions. In this study, we focus on overtake detection using Controller Area Network (CAN) bus data collected from five in-service trucks provided by the Volvo Group. We evaluate three common classifiers for vehicle manoeuvre detection, Artificial Neural Networks (ANN), Random Forest (RF), and Support Vector Machines (SVM), and analyse how different preprocessing configurations affect performance. We find that variability in traffic conditions strongly influences the signal patterns, particularly in the no-overtake class, affecting classification performance if training data lacks adequate diversity. Since the data were collected under unconstrained, real-world conditions, class diversity cannot be guaranteed a priori. However, training with data from multiple vehicles improves generalisation and reduces condition-specific bias. Our pertruck analysis also reveals that classification accuracy, especially for overtakes, depends on the amount of training data per vehicle. To address this, we apply a score-level fusion strategy, which yields the best per-truck performance across most cases. Overall, we achieve an accuracy via fusion of TNR=93% (True Negative Rate) and TPR=86.5% (True Positive Rate). This research has been part of the BIG FUN project, which explores how Artificial Intelligence can be applied to logged vehicle data to understand and predict driver behaviour, particularly in relation to Camera Monitor Systems (CMS), being introduced as digital replacements for traditional exterior mirrors.

[280] arXiv:2507.00595 [pdf, html, other]
Title: The Secrets Must Not Flow: Scaling Security Verification to Large Codebases (extended version)
Linard Arquint, Samarth Kishor, Jason R. Koenig, Joey Dodds, Daniel Kroening, Peter Müller
Subjects: Cryptography and Security (cs.CR); Programming Languages (cs.PL); Software Engineering (cs.SE)

Existing program verifiers can prove advanced properties about security protocol implementations, but are difficult to scale to large codebases because of the manual effort required. We develop a novel methodology called *Diodon* that addresses this challenge by splitting the codebase into the protocol implementation (the *Core*) and the remainder (the *Application*). This split allows us to apply powerful semi-automated verification techniques to the security-critical Core, while fully-automatic static analyses scale the verification to the entire codebase by ensuring that the Application cannot invalidate the security properties proved for the Core. The static analyses achieve that by proving *I/O independence*, i.e., that the I/O operations within the Application are independent of the Core's security-relevant data (such as keys), and that the Application meets the Core's requirements. We have proved Diodon sound by first showing that we can safely allow the Application to perform I/O independent of the security protocol, and second that manual verification and static analyses soundly compose. We evaluate Diodon on two case studies: an implementation of the signed Diffie-Hellman key exchange and a large (100k+ LoC) production Go codebase implementing a key exchange protocol for which we obtained secrecy and injective agreement guarantees by verifying a Core of about 1% of the code with the auto-active program verifier Gobra in less than three person months.

[281] arXiv:2507.00596 [pdf, html, other]
Title: Gaze3P: Gaze-Based Prediction of User-Perceived Privacy
Mayar Elfares, Pascal Reisert, Ralf Küsters, Andreas Bulling
Subjects: Human-Computer Interaction (cs.HC); Cryptography and Security (cs.CR)

Privacy is a highly subjective concept and perceived variably by different individuals. Previous research on quantifying user-perceived privacy has primarily relied on questionnaires. Furthermore, applying user-perceived privacy to optimise the parameters of privacy-preserving techniques (PPT) remains insufficiently explored. To address these limitations, we introduce Gaze3P -- the first dataset specifically designed to facilitate systematic investigations into user-perceived privacy. Our dataset comprises gaze data from 100 participants and 1,000 stimuli, encompassing a range of private and safe attributes. With Gaze3P, we train a machine learning model to implicitly and dynamically predict perceived privacy from human eye gaze. Through comprehensive experiments, we show that the resulting models achieve high accuracy. Finally, we illustrate how predicted privacy can be used to optimise the parameters of differentially private mechanisms, thereby enhancing their alignment with user expectations.

[282] arXiv:2507.00598 [pdf, html, other]
Title: High-resolution spatial memory requires grid-cell-like neural codes
Madison Cotteret, Christopher J. Kymn, Hugh Greatorex, Martin Ziegler, Elisabetta Chicca, Friedrich T. Sommer
Comments: 14 pages, 4 figures. Supplementary material: 11 pages, 5 figures
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)

Continuous attractor networks (CANs) are widely used to model how the brain temporarily retains continuous behavioural variables via persistent recurrent activity, such as an animal's position in an environment. However, this memory mechanism is very sensitive to even small imperfections, such as noise or heterogeneity, which are both common in biological systems. Previous work has shown that discretising the continuum into a finite set of discrete attractor states provides robustness to these imperfections, but necessarily reduces the resolution of the represented variable, creating a dilemma between stability and resolution. We show that this stability-resolution dilemma is most severe for CANs using unimodal bump-like codes, as in traditional models. To overcome this, we investigate sparse binary distributed codes based on random feature embeddings, in which neurons have spatially-periodic receptive fields. We demonstrate theoretically and with simulations that such grid-cell-like codes enable CANs to achieve both high stability and high resolution simultaneously. The model extends to embedding arbitrary nonlinear manifolds into a CAN, such as spheres or tori, and generalises linear path integration to integration along freely-programmable on-manifold vector fields. Together, this work provides a theory of how the brain could robustly represent continuous variables with high resolution and perform flexible computations over task-relevant manifolds.

[283] arXiv:2507.00600 [pdf, html, other]
Title: A Practical Guide to Interpretable Role-Based Clustering in Multi-Layer Financial Networks
Christian Franssen, Iman van Lelyveld, Bernd Heidergott
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG)

Understanding the functional roles of financial institutions within interconnected markets is critical for effective supervision, systemic risk assessment, and resolution planning. We propose an interpretable role-based clustering approach for multi-layer financial networks, designed to identify the functional positions of institutions across different market segments. Our method follows a general clustering framework defined by proximity measures, cluster evaluation criteria, and algorithm selection. We construct explainable node embeddings based on egonet features that capture both direct and indirect trading relationships within and across market layers. Using transaction-level data from the ECB's Money Market Statistical Reporting (MMSR), we demonstrate how the approach uncovers heterogeneous institutional roles such as market intermediaries, cross-segment connectors, and peripheral lenders or borrowers. The results highlight the flexibility and practical value of role-based clustering in analyzing financial networks and understanding institutional behavior in complex market structures.

[284] arXiv:2507.00601 [pdf, other]
Title: Transferable Modeling Strategies for Low-Resource LLM Tasks: A Prompt and Alignment-Based
Shuangquan Lyu, Yingnan Deng, Guiran Liu, Zhen Qi, Ruotong Wang
Subjects: Computation and Language (cs.CL)

This paper addresses the limited transfer and adaptation capabilities of large language models in low-resource language scenarios. It proposes a unified framework that combines a knowledge transfer module with parameter-efficient fine-tuning strategies. The method introduces knowledge alignment loss and soft prompt tuning to guide the model in effectively absorbing the structural features of target languages or tasks under minimal annotation. This enhances both generalization performance and training stability. The framework includes lightweight adaptation modules to reduce computational costs. During training, it integrates freezing strategies and prompt injection to preserve the model's original knowledge while enabling quick adaptation to new tasks. The study also conducts stability analysis experiments and synthetic pseudo-data transfer experiments to systematically evaluate the method's applicability and robustness across different low-resource tasks. Experimental results show that compared with existing multilingual pre-trained models and mainstream transfer methods, the proposed approach achieves higher performance and stability on cross-lingual tasks such as MLQA, XQuAD, and PAWS-X. It demonstrates particularly strong advantages under extremely data-scarce conditions. The proposed method offers strong generality and scalability. It enhances task-specific adaptability while preserving the general capabilities of large language models. This makes it well-suited for complex semantic modeling and multilingual processing tasks.

[285] arXiv:2507.00603 [pdf, html, other]
Title: World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model
Yupeng Zheng, Pengxuan Yang, Zebin Xing, Qichao Zhang, Yuhang Zheng, Yinfeng Gao, Pengfei Li, Teng Zhang, Zhongpu Xia, Peng Jia, Dongbin Zhao
Comments: ICCV 2025, first version
Subjects: Computer Vision and Pattern Recognition (cs.CV)

End-to-end autonomous driving directly generates planning trajectories from raw sensor data, yet it typically relies on costly perception supervision to extract scene information. A critical research challenge arises: constructing an informative driving world model to enable perception annotation-free, end-to-end planning via self-supervised learning. In this paper, we present World4Drive, an end-to-end autonomous driving framework that employs vision foundation models to build latent world models for generating and evaluating multi-modal planning trajectories. Specifically, World4Drive first extracts scene features, including driving intention and world latent representations enriched with spatial-semantic priors provided by vision foundation models. It then generates multi-modal planning trajectories based on current scene features and driving intentions and predicts multiple intention-driven future states within the latent space. Finally, it introduces a world model selector module to evaluate and select the best trajectory. We achieve perception annotation-free, end-to-end planning through self-supervised alignment between actual future observations and predicted observations reconstructed from the latent space. World4Drive achieves state-of-the-art performance without manual perception annotations on both the open-loop nuScenes and closed-loop NavSim benchmarks, demonstrating an 18.1\% relative reduction in L2 error, 46.7% lower collision rate, and 3.75 faster training convergence. Codes will be accessed at this https URL.

[286] arXiv:2507.00606 [pdf, html, other]
Title: Mixture of Reasonings: Teach Large Language Models to Reason with Adaptive Strategies
Tao Xiong, Xavier Hu, Wenyan Fan, Shengyu Zhang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Large language models (LLMs) excel in complex tasks through advanced prompting techniques like Chain-of-Thought (CoT) and Tree-of-Thought (ToT), but their reliance on manually crafted, task-specific prompts limits adaptability and efficiency. We introduce Mixture of Reasoning (MoR), a training framework that embeds diverse reasoning strategies into LLMs for autonomous, task-adaptive reasoning without external prompt engineering. MoR has two phases: Thought Generation, creating reasoning chain templates with models like GPT-4o, and SFT Dataset Construction, pairing templates with benchmark datasets for supervised this http URL experiments show that MoR significantly enhances performance, with MoR150 achieving 0.730 (2.2% improvement) using CoT prompting and 0.734 (13.5% improvement) compared to baselines. MoR eliminates the need for task-specific prompts, offering a generalizable solution for robust reasoning across diverse tasks.

[287] arXiv:2507.00608 [pdf, html, other]
Title: De-Simplifying Pseudo Labels to Enhancing Domain Adaptive Object Detection
Zehua Fu, Chenguang Liu, Yuyu Chen, Jiaqi Zhou, Qingjie Liu, Yunhong Wang
Comments: Accepted by IEEE Transactions on Intelligent Transportation Systems. 15 pages, 10 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Despite its significant success, object detection in traffic and transportation scenarios requires time-consuming and laborious efforts in acquiring high-quality labeled data. Therefore, Unsupervised Domain Adaptation (UDA) for object detection has recently gained increasing research attention. UDA for object detection has been dominated by domain alignment methods, which achieve top performance. Recently, self-labeling methods have gained popularity due to their simplicity and efficiency. In this paper, we investigate the limitations that prevent self-labeling detectors from achieving commensurate performance with domain alignment methods. Specifically, we identify the high proportion of simple samples during training, i.e., the simple-label bias, as the central cause. We propose a novel approach called De-Simplifying Pseudo Labels (DeSimPL) to mitigate the issue. DeSimPL utilizes an instance-level memory bank to implement an innovative pseudo label updating strategy. Then, adversarial samples are introduced during training to enhance the proportion. Furthermore, we propose an adaptive weighted loss to avoid the model suffering from an abundance of false positive pseudo labels in the late training period. Experimental results demonstrate that DeSimPL effectively reduces the proportion of simple samples during training, leading to a significant performance improvement for self-labeling detectors. Extensive experiments conducted on four benchmarks validate our analysis and conclusions.

[288] arXiv:2507.00609 [pdf, html, other]
Title: On the rank weight hierarchy of $M$-codes
G. Berhuy, J. Molina
Subjects: Information Theory (cs.IT)

We study the rank weight hierarchy of linear codes which are stable under a linear endomorphism defined over the base field, in particular when the endomorphism is cyclic. In this last case, we give a necessary and sufficient condition for such a code to have first rank weight equal to $1$ in terms of its generator polynomial, as well as an explicit formula for its last rank weight.

[289] arXiv:2507.00611 [pdf, html, other]
Title: Residual Reward Models for Preference-based Reinforcement Learning
Chenyang Cao, Miguel Rogel-García, Mohamed Nabail, Xueqian Wang, Nicholas Rhinehart
Comments: 26 pages, 22 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Preference-based Reinforcement Learning (PbRL) provides a way to learn high-performance policies in environments where the reward signal is hard to specify, avoiding heuristic and time-consuming reward design. However, PbRL can suffer from slow convergence speed since it requires training in a reward model. Prior work has proposed learning a reward model from demonstrations and fine-tuning it using preferences. However, when the model is a neural network, using different loss functions for pre-training and fine-tuning can pose challenges to reliable optimization. In this paper, we propose a method to effectively leverage prior knowledge with a Residual Reward Model (RRM). An RRM assumes that the true reward of the environment can be split into a sum of two parts: a prior reward and a learned reward. The prior reward is a term available before training, for example, a user's ``best guess'' reward function, or a reward function learned from inverse reinforcement learning (IRL), and the learned reward is trained with preferences. We introduce state-based and image-based versions of RRM and evaluate them on several tasks in the Meta-World environment suite. Experimental results show that our method substantially improves the performance of a common PbRL method. Our method achieves performance improvements for a variety of different types of prior rewards, including proxy rewards, a reward obtained from IRL, and even a negated version of the proxy reward. We also conduct experiments with a Franka Panda to show that our method leads to superior performance on a real robot. It significantly accelerates policy learning for different tasks, achieving success in fewer steps than the baseline. The videos are presented at this https URL.

[290] arXiv:2507.00612 [pdf, html, other]
Title: Hamiltonicity Parameterized by Mim-Width is (Indeed) Para-NP-Hard
Benjamin Bergougnoux, Lars Jaffke
Comments: 11 pages, 6 figures
Subjects: Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS)

We prove that Hamiltonian Path and Hamiltonian Cycle are NP-hard on graphs of linear mim-width 26, even when a linear order of the input graph with mim-width 26 is provided together with input. This fills a gap left by a broken proof of the para-NP-hardness of Hamiltonicity problems parameterized by mim-width.

[291] arXiv:2507.00617 [pdf, html, other]
Title: Accelerating MPGP-type Methods Through Preconditioning
Jakub Kružík, David Horák
Subjects: Numerical Analysis (math.NA); Optimization and Control (math.OC)

This work investigates the acceleration of MPGP-type algorithms using preconditioning for the solution of quadratic programming problems. The preconditioning needs to be done only on the free set so as not to change the constraints. A variant of preconditioning restricted to the free set is the preconditioning in face. The inner preconditioner in preconditioning in face needs to be recomputed or updated every time the free set changes. Here, we investigate an approximate variant of preconditioning in face that computes the inner preconditioner only once. We analyze the error of the approximate variant and provide numerical experiments demonstrating that very large speedups can be achieved by the approximate variant.

[292] arXiv:2507.00619 [pdf, other]
Title: Gender Differences in International Research Collaboration in European Union
Elsa Fontainha, Tanya Araújo
Comments: 29 pages, 10 figures
Subjects: Social and Information Networks (cs.SI)

This paper investigates International Research Collaboration (IRC) among European Union (EU) countries from 2011 to 2022, with emphasis on gender-based authorship patterns. Drawing from the Web of Science Social Science Citation Index (WoS-SSCI) database, a large dataset of IRC articles was constructed, annotated with categories of authorship based on gender, author affiliation, and COVID-19 subject as topic. Using network science, the study maps collaboration structures and reveals gendered differences in co-authorship networks. Results highlight a substantial rise in IRC over the decade, particularly with the USA and China as key non-EU partners. Articles with at least one female author were consistently less frequent than those with at least one male author. Notably, female-exclusive collaborations showed distinctive network topologies, with more centralized (star-like) patterns and shorter tree diameters. The COVID-19 pandemic further reshaped collaboration dynamics, temporarily reducing the gender gap in IRC but also revealing vulnerabilities in female-dominated research networks. These findings underscore both progress and persistent disparities in the gender dynamics of EU participation in IRC.

[293] arXiv:2507.00623 [pdf, html, other]
Title: Remote Rendering for Virtual Reality: performance comparison of multimedia frameworks and protocols
Daniel Mejías, Inhar Yeregui, Roberto Viola, Miguel Fernández, Mario Montagud
Subjects: Networking and Internet Architecture (cs.NI)

The increasing complexity of Extended Reality (XR) applications demands substantial processing power and high bandwidth communications, often unavailable on lightweight devices. Remote rendering consists of offloading processing tasks to a remote node with a powerful GPU, delivering the rendered content to the end device. The delivery is usually performed through popular streaming protocols such as Web Real-Time Communications (WebRTC), offering a data channel for interactions, or Dynamic Adaptive Streaming over HTTP (DASH), better suitable for scalability. Moreover, new streaming protocols based on QUIC are emerging as potential replacements for WebRTC and DASH and offer benefits like connection migration, stream multiplexing and multipath delivery. This work describes the integration of the two most popular multimedia frameworks, GStreamer and FFmpeg, with a rendering engine acting as a Remote Renderer, and analyzes their performance when offering different protocols for delivering the rendered content to the end device over WIFI or 5G. This solution constitutes a beyond state-of-the-art testbed to conduct cutting-edge research in the XR field.

[294] arXiv:2507.00628 [pdf, html, other]
Title: Price Aware Power Split Control in Heterogeneous Battery Storage Systems
Sheng Yin, Vivek Teja Tanjavooru, Thomas Hamacher, Christoph Goebel, Holger Hesse
Subjects: Systems and Control (eess.SY)

This paper presents a unified framework for the optimal scheduling of battery dispatch and internal power allocation in Battery energy storage systems (BESS). This novel approach integrates both market-based (price-aware) signals and physical system constraints to simultaneously optimize (1) external energy dispatch and (2) internal heterogeneity management of BESS, enhancing its operational economic value and performance. This work compares both model-based Linear Programming (LP) and model-free Reinforcement Learning (RL) approaches for optimization under varying forecast assumptions, using a custom Gym-based simulation environment. The evaluation considers both long-term and short-term performance, focusing on economic savings, State of Charge (SOC) and temperature balancing, and overall system efficiency. In summary, the long-term results show that the RL approach achieved 10% higher system efficiency compared to LP, whereas the latter yielded 33% greater cumulative savings. In terms of internal heterogeneity, the LP approach resulted in lower mean SOC imbalance, while the RL approach achieved better temperature balance between strings. This behavior is further examined in the short-term evaluation, which indicates that LP delivers strong optimization under known and stable conditions, whereas RL demonstrates higher adaptability in dynamic environments, offering potential advantages for real-time BESS control.

[295] arXiv:2507.00631 [pdf, html, other]
Title: Horus: A Protocol for Trustless Delegation Under Uncertainty
David Shi, Kevin Joo
Comments: 9 pages, 1 figure
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Correctness is an emergent property of systems where exposing error is cheaper than committing it. In dynamic, low-trust environments, autonomous AI agents benefit from delegating work to sub-agents, yet correctness cannot be assured through upfront specification or centralized oversight. We propose a protocol that enforces correctness through collateralized claims in a recursive verification game. Tasks are published as intents, and solvers compete to fulfill them. Selected solvers carry out tasks under risk, with correctness checked post hoc by verifiers. Any challenger can challenge a result by staking against it to trigger the verification process. Incorrect agents are slashed and correct opposition is rewarded, with an escalation path that penalizes erroneous verifiers themselves. When incentives are aligned across solvers, challengers, and verifiers, falsification conditions make correctness the Nash equilibrium.

[296] arXiv:2507.00635 [pdf, html, other]
Title: Stable Tracking of Eye Gaze Direction During Ophthalmic Surgery
Tinghe Hong, Shenlin Cai, Boyang Li, Kai Huang
Comments: Accepted by ICRA 2025
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Ophthalmic surgical robots offer superior stability and precision by reducing the natural hand tremors of human surgeons, enabling delicate operations in confined surgical spaces. Despite the advancements in developing vision- and force-based control methods for surgical robots, preoperative navigation remains heavily reliant on manual operation, limiting the consistency and increasing the uncertainty. Existing eye gaze estimation techniques in the surgery, whether traditional or deep learning-based, face challenges including dependence on additional sensors, occlusion issues in surgical environments, and the requirement for facial detection. To address these limitations, this study proposes an innovative eye localization and tracking method that combines machine learning with traditional algorithms, eliminating the requirements of landmarks and maintaining stable iris detection and gaze estimation under varying lighting and shadow conditions. Extensive real-world experiment results show that our proposed method has an average estimation error of 0.58 degrees for eye orientation estimation and 2.08-degree average control error for the robotic arm's movement based on the calculated orientation.

[297] arXiv:2507.00637 [pdf, html, other]
Title: Integrating Network and Attack Graphs for Service-Centric Impact Analysis
Joni Herttuainen, Vesa Kuikka, Kimmo K. Kaski
Comments: 17 pages, 13 figures, submitted for peer-review
Subjects: Cryptography and Security (cs.CR); Social and Information Networks (cs.SI)

We present a novel methodology for modelling, visualising, and analysing cyber threats, attack paths, as well as their impact on user services in enterprise or infrastructure networks of digital devices and services they provide. Using probabilistic methods to track the propagation of an attack through attack graphs, via the service or application layers, and on physical communication networks, our model enables us to analyse cyber attacks at different levels of detail. Understanding the propagation of an attack within a service among microservices and its spread between different services or application servers could help detect and mitigate it early. We demonstrate that this network-based influence spreading modelling approach enables the evaluation of diverse attack scenarios and the development of protection and mitigation measures, taking into account the criticality of services from the user's perspective. This methodology could also aid security specialists and system administrators in making well-informed decisions regarding risk mitigation strategies.

[298] arXiv:2507.00642 [pdf, html, other]
Title: ChatHLS: Towards Systematic Design Automation and Optimization for High-Level Synthesis
Runkai Li, Jia Xiong, Xiuyuan He, Jieru Zhao, Qiang Xu, Xi Wang
Subjects: Hardware Architecture (cs.AR)

The increasing complexity of computational demands has accelerated the adoption of domain-specific accelerators, yet traditional hardware design methodologies remain constrained by prolonged development and verification cycles. High-Level Synthesis (HLS) bridges the gap between software and hardware by enabling hardware design from high-level programming languages. However, its widespread adoption is hindered by strict coding constraints and intricate hardware-specific optimizations, creating significant obstacles for developers. Recent advancements in Large Language Models (LLMs) demonstrate substantial potential in hardware design automation. However, their effectiveness is limited by the scarcity of high-quality datasets, particularly in the context of HLS. To address these challenges, we introduce ChatHLS, an agile HLS design automation and optimization workflow that leverages fine-tuned LLMs integrated within a multi-agent framework for error correction and design optimization. Our extensive evaluations reveal that ChatHLS achieves an average repair pass rate of 82.7% over 612 test cases, outperforming the GPT-4o and Llama3-8B by 19.1% and 63.0%, respectively. Furthermore, ChatHLS delivers performance enhancements ranging from 1.9$\times$ to 14.8$\times$ upon resource-constrained kernels. By enabling sophisticated optimization reasoning within practical computational budgets, ChatHLS attains a 4.9$\times$ geometric mean speedup compared to state-of-the-art DSL-based approaches. These results underscore the potential of ChatHLS in substantially expediting hardware development cycles while maintaining rigorous standards of design reliability and optimization quality.

[299] arXiv:2507.00643 [pdf, html, other]
Title: Decentralized Pliable Index Coding For Federated Learning In Intelligent Transportation Systems
Sadina Kadakkottiri, Narisetty Harish, Nujoom Sageer Karat, Deepthi Paramel Pattathil, Balaji Sundar Rajan
Subjects: Information Theory (cs.IT)

Federated Learning is a promising option for data privacy and security in ITS, because it allows edge devices, Road Side Units (RSUs), and Central Server (CS) to jointly train the machine learning model. Since RSU collects data from the vehicles passing through its range, the local data of each RSU will have a non-IID distribution, which adversely affects the convergence speed and accuracy of FL training. Generating synthetic data locally at individual nodes, followed by data shuffling among the nodes, is a promising approach to address the Non-IID data problem. In this work, we propose pliable index coding (PIC) solutions for efficient data shuffling among the nodes in an FL system. In PIC($S$) problems, a client is satisfied if it can retrieve any $S$ new messages not originally present in its side-information. We particularly consider decentralized pliable index coding problems (DPIC) where the clients communicate among themselves without a central server to model the data shuffling in FL. A class of DPIC, known as Consecutive Decentralized Pliable Index Coding (CDPIC($S$,$K$)), where each client has $K$ consecutive messages as side-information, is considered. For CDPIC($S$,$K$) problems, pliable index code designs are provided for any value of $K$ and $S$, and optimality proofs for some of the cases are established. Further, these CDPIC solutions are applied for data shuffling in FL, to transform the local data distribution towards IID progressively with each transmission, thereby enhancing the performance of FL. The improvement in the accuracy and convergence of the most popular FL technique, FedAvg, and a promising federated submodel technique, CELL (Communication Efficient Lottery Learning), are analysed by providing different degrees of data shuffling using the proposed CDPIC schemes.

[300] arXiv:2507.00644 [pdf, html, other]
Title: Parallel Transmission Aware Co-Design: Enhancing Manipulator Performance Through Actuation-Space Optimization
Rohit Kumar, Melya Boukheddimi, Dennis Mronga, Shivesh Kumar, Frank Kirchner
Subjects: Robotics (cs.RO)

In robotics, structural design and behavior optimization have long been considered separate processes, resulting in the development of systems with limited capabilities. Recently, co-design methods have gained popularity, where bi-level formulations are used to simultaneously optimize the robot design and behavior for specific tasks. However, most implementations assume a serial or tree-type model of the robot, overlooking the fact that many robot platforms incorporate parallel mechanisms. In this paper, we present a novel co-design approach that explicitly incorporates parallel coupling constraints into the dynamic model of the robot. In this framework, an outer optimization loop focuses on the design parameters, in our case the transmission ratios of a parallel belt-driven manipulator, which map the desired torques from the joint space to the actuation space. An inner loop performs trajectory optimization in the actuation space, thus exploiting the entire dynamic range of the manipulator. We compare the proposed method with a conventional co-design approach based on a simplified tree-type model. By taking advantage of the actuation space representation, our approach leads to a significant increase in dynamic payload capacity compared to the conventional co-design implementation.

[301] arXiv:2507.00647 [pdf, html, other]
Title: Cooperative Sheaf Neural Networks
André Ribeiro, Ana Luiza Tenório, Juan Belieni, Amauri H. Souza, Diego Mesquita
Subjects: Machine Learning (cs.LG)

Sheaf diffusion has recently emerged as a promising design pattern for graph representation learning due to its inherent ability to handle heterophilic data and avoid oversmoothing. Meanwhile, cooperative message passing has also been proposed as a way to enhance the flexibility of information diffusion by allowing nodes to independently choose whether to propagate/gather information from/to neighbors. A natural question ensues: is sheaf diffusion capable of exhibiting this cooperative behavior? Here, we provide a negative answer to this question. In particular, we show that existing sheaf diffusion methods fail to achieve cooperative behavior due to the lack of message directionality. To circumvent this limitation, we introduce the notion of cellular sheaves over directed graphs and characterize their in- and out-degree Laplacians. We leverage our construction to propose Cooperative Sheaf Neural Networks (CSNNs). Theoretically, we characterize the receptive field of CSNN and show it allows nodes to selectively attend (listen) to arbitrarily far nodes while ignoring all others in their path, potentially mitigating oversquashing. Our experiments show that CSNN presents overall better performance compared to prior art on sheaf diffusion as well as cooperative graph neural networks.

[302] arXiv:2507.00648 [pdf, html, other]
Title: UMDATrack: Unified Multi-Domain Adaptive Tracking Under Adverse Weather Conditions
Siyuan Yao, Rui Zhu, Ziqi Wang, Wenqi Ren, Yanyang Yan, Xiaochun Cao
Comments: Accepted to ICCV 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Visual object tracking has gained promising progress in past decades. Most of the existing approaches focus on learning target representation in well-conditioned daytime data, while for the unconstrained real-world scenarios with adverse weather conditions, e.g. nighttime or foggy environment, the tremendous domain shift leads to significant performance degradation. In this paper, we propose UMDATrack, which is capable of maintaining high-quality target state prediction under various adverse weather conditions within a unified domain adaptation framework. Specifically, we first use a controllable scenario generator to synthesize a small amount of unlabeled videos (less than 2% frames in source daytime datasets) in multiple weather conditions under the guidance of different text prompts. Afterwards, we design a simple yet effective domain-customized adapter (DCA), allowing the target objects' representation to rapidly adapt to various weather conditions without redundant model updating. Furthermore, to enhance the localization consistency between source and target domains, we propose a target-aware confidence alignment module (TCA) following optimal transport theorem. Extensive experiments demonstrate that UMDATrack can surpass existing advanced visual trackers and lead new state-of-the-art performance by a significant margin. Our code is available at this https URL.

[303] arXiv:2507.00651 [pdf, html, other]
Title: GANs Secretly Perform Approximate Bayesian Model Selection
Maurizio Filippone, Marius P. Linhard
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Generative Adversarial Networks (GANs) are popular and successful generative models. Despite their success, optimization is notoriously challenging and they require regularization against overfitting. In this work, we explain the success and limitations of GANs by interpreting them as probabilistic generative models. This interpretation enables us to view GANs as Bayesian neural networks with partial stochasticity, allowing us to establish conditions of universal approximation. We can then cast the adversarial-style optimization of several variants of GANs as the optimization of a proxy for the marginal likelihood. Taking advantage of the connection between marginal likelihood optimization and Occam's razor, we can define regularization and optimization strategies to smooth the loss landscape and search for solutions with minimum description length, which are associated with flat minima and good generalization. The results on a wide range of experiments indicate that these strategies lead to performance improvements and pave the way to a deeper understanding of regularization strategies for GANs.

[304] arXiv:2507.00653 [pdf, html, other]
Title: Cognitive Load-Aware Inference: A Neuro-Symbolic Framework for Optimizing the Token Economy of Large Language Models
Yilun Zhang
Comments: 23 pages
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

The escalating computational costs of Large Language Model (LLM) inference have become a critical barrier to their widespread and sustainable deployment. While existing optimization strategies are effective, they are predominantly based on statistical heuristics or architectural modifications, lacking a guiding cognitive theory to manage the inference process itself. This paper aims to bridge this gap by introducing a novel paradigm: the Cognitive Load-Aware Inference (CLAI) framework, which operationalizes principles from Cognitive Load Theory (CLT) and neuroscience for LLM inference. We formalize the concepts of Intrinsic Cognitive Load, Extraneous Cognitive Load, and Germane Cognitive Load into quantifiable LLM metrics ($ICL_{LLM}$, $ECL_{LLM}$, and $GCL_{LLM}$), thereby reframing the inference process as a cognitive economics optimization problem: based on the intrinsic complexity of a problem ($ICL_{LLM}$), minimize wasteful computation ($ECL_{LLM}$), and strategically allocate the token budget to productive reasoning ($GCL_{LLM}$). We propose two implementation paths: CLAI-Prompt, a zero-shot method that guides a base LLM through cognitive control steps via a structured meta-prompt, and CLAI-Tune, a fine-tuned model that internalizes these principles for spontaneous cognitive economy. Across a range of benchmarks in complex reasoning, long-context question answering, and code generation, our methods achieve significant reductions in token consumption (up to 45\%) without sacrificing accuracy. Furthermore, CLAI-Tune exhibits an emergent ability to autonomously decompose difficult problems, a key characteristic of human expert cognition. This work demonstrates that by emulating the brain's resource management strategies, we can build more efficient, robust, and capable artificial intelligence systems.

[305] arXiv:2507.00654 [pdf, html, other]
Title: Neural Augmented Kalman Filters for Road Network assisted GNSS positioning
Hans van Gorp, Davide Belli, Amir Jalalirad, Bence Major
Comments: Accepted to ICML 2025 workshop ML4Wireless
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY)

The Global Navigation Satellite System (GNSS) provides critical positioning information globally, but its accuracy in dense urban environments is often compromised by multipath and non-line-of-sight errors. Road network data can be used to reduce the impact of these errors and enhance the accuracy of a positioning system. Previous works employing road network data are either limited to offline applications, or rely on Kalman Filter (KF) heuristics with little flexibility and robustness. We instead propose training a Temporal Graph Neural Network (TGNN) to integrate road network information into a KF. The TGNN is designed to predict the correct road segment and its associated uncertainty to be used in the measurement update step of the KF. We validate our approach with real-world GNSS data and open-source road networks, observing a 29% decrease in positioning error for challenging scenarios compared to a GNSS-only KF. To the best of our knowledge, ours is the first deep learning-based approach jointly employing road network data and GNSS measurements to determine the user position on Earth.

[306] arXiv:2507.00656 [pdf, html, other]
Title: The Rate-Distortion Function for Sampled Cyclostationary Gaussian Processes with Memory and with Bounded Processing Delay: Extended Version with Proofs
Zikun Tan, Ron Dabora, H. Vincent Poor
Comments: accepted by the 2025 IEEE International Symposium on Information Theory (ISIT)
Subjects: Information Theory (cs.IT)

We study the rate-distortion function (RDF) for the lossy compression of discrete-time (DT) wide-sense almost cyclostationary (WSACS) Gaussian processes with memory, arising from sampling continuous-time (CT) wide-sense cyclostationary (WSCS) Gaussian source processes. The importance of this problem arises as such CT processes represent communications signals, and sampling must be applied to facilitate the DT processing associated with their compression. Moreover, the physical characteristics of oscillators imply that the sampling interval is incommensurate with the period of the autocorrelation function (AF) of the physical process, giving rise to the DT WSACS model considered. In addition, to reduce the loss, the sampling interval is generally shorter than the correlation length, and thus, the DT process is correlated as well. The difficulty in the RDF characterization follows from the information-instability of WSACS processes, which renders the traditional information-theoretic tools inapplicable. In this work we utilize the information-spectrum framework to characterize the RDF when a finite and bounded delay is allowed between processing of subsequent source sequences. This scenario extends our previous works which studied settings without processing delays or without memory. Numerical evaluations reveal the impact of scenario parameters on the RDF with asynchronous sampling.

[307] arXiv:2507.00657 [pdf, html, other]
Title: Generative Exaggeration in LLM Social Agents: Consistency, Bias, and Toxicity
Jacopo Nudo, Mario Edoardo Pandolfo, Edoardo Loru, Mattia Samory, Matteo Cinelli, Walter Quattrociocchi
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)

We investigate how Large Language Models (LLMs) behave when simulating political discourse on social media. Leveraging 21 million interactions on X during the 2024 U.S. presidential election, we construct LLM agents based on 1,186 real users, prompting them to reply to politically salient tweets under controlled conditions. Agents are initialized either with minimal ideological cues (Zero Shot) or recent tweet history (Few Shot), allowing one-to-one comparisons with human replies. We evaluate three model families (Gemini, Mistral, and DeepSeek) across linguistic style, ideological consistency, and toxicity. We find that richer contextualization improves internal consistency but also amplifies polarization, stylized signals, and harmful language. We observe an emergent distortion that we call "generation exaggeration": a systematic amplification of salient traits beyond empirical baselines. Our analysis shows that LLMs do not emulate users, they reconstruct them. Their outputs, indeed, reflect internal optimization dynamics more than observed behavior, introducing structural biases that compromise their reliability as social proxies. This challenges their use in content moderation, deliberative simulations, and policy modeling.

[308] arXiv:2507.00659 [pdf, html, other]
Title: LoD-Loc v2: Aerial Visual Localization over Low Level-of-Detail City Models using Explicit Silhouette Alignment
Juelin Zhu, Shuaibang Peng, Long Wang, Hanlin Tan, Yu Liu, Maojun Zhang, Shen Yan
Comments: Accepted by ICCV 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We propose a novel method for aerial visual localization over low Level-of-Detail (LoD) city models. Previous wireframe-alignment-based method LoD-Loc has shown promising localization results leveraging LoD models. However, LoD-Loc mainly relies on high-LoD (LoD3 or LoD2) city models, but the majority of available models and those many countries plan to construct nationwide are low-LoD (LoD1). Consequently, enabling localization on low-LoD city models could unlock drones' potential for global urban localization. To address these issues, we introduce LoD-Loc v2, which employs a coarse-to-fine strategy using explicit silhouette alignment to achieve accurate localization over low-LoD city models in the air. Specifically, given a query image, LoD-Loc v2 first applies a building segmentation network to shape building silhouettes. Then, in the coarse pose selection stage, we construct a pose cost volume by uniformly sampling pose hypotheses around a prior pose to represent the pose probability distribution. Each cost of the volume measures the degree of alignment between the projected and predicted silhouettes. We select the pose with maximum value as the coarse pose. In the fine pose estimation stage, a particle filtering method incorporating a multi-beam tracking approach is used to efficiently explore the hypothesis space and obtain the final pose estimation. To further facilitate research in this field, we release two datasets with LoD1 city models covering 10.7 km , along with real RGB queries and ground-truth pose annotations. Experimental results show that LoD-Loc v2 improves estimation accuracy with high-LoD models and enables localization with low-LoD models for the first time. Moreover, it outperforms state-of-the-art baselines by large margins, even surpassing texture-model-based methods, and broadens the convergence basin to accommodate larger prior errors.

[309] arXiv:2507.00665 [pdf, html, other]
Title: SAFER: Probing Safety in Reward Models with Sparse Autoencoder
Sihang Li, Wei Shi, Ziyuan Xie, Tao Liang, Guojun Ma, Xiang Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Reinforcement learning from human feedback (RLHF) is a key paradigm for aligning large language models (LLMs) with human values, yet the reward models at its core remain largely opaque. In this work, we present sparse Autoencoder For Enhanced Reward model (\textbf{SAFER}), a novel framework for interpreting and improving reward models through mechanistic analysis. Leveraging Sparse Autoencoders (SAEs), we uncover human-interpretable features in reward model activations, enabling insight into safety-relevant decision-making. We apply SAFER to safety-oriented preference datasets and quantify the salience of individual features by activation differences between chosen and rejected responses. Using these feature-level signals, we design targeted data poisoning and denoising strategies. Experiments show that SAFER can precisely degrade or enhance safety alignment with minimal data modification, without sacrificing general chat performance. Our approach contributes to interpreting, auditing and refining reward models in high-stakes LLM alignment tasks. Our codes are available at this https URL. \textit{This paper discusses topics related to large language model safety and may include discussions or examples that highlight potential risks or unsafe outcomes.}

[310] arXiv:2507.00667 [pdf, html, other]
Title: Special measures of smoothness for approximation by sampling operators in $L_p(\Bbb{R}^d)$
Yurii Kolomoitsev
Subjects: Numerical Analysis (math.NA); Classical Analysis and ODEs (math.CA)

Traditional measures of smoothness often fail to provide accurate $L_p$-error estimates for approximation by sampling or interpolation operators, especially for functions with low smoothness. To address this issue, we introduce a modified measure of smoothness that incorporates the local behavior of a function at the sampling points through the use of averaged operators. With this new tool, we obtain matching direct and inverse error estimates for a wide class of sampling operators and functions in $L_p$ spaces. Additionally, we derive a criterion for the convergence of sampling operators in $L_p$, identify conditions that ensure the exact rate of approximation, construct realizations of $K$-functionals based on these operators, and study the smoothness properties of sampling operators. We also demonstrate how our results apply to several well-known operators, including the classical Whittaker-Shannon sampling operator, sampling operators generated by $B$-splines, and those based on the Gaussian.

[311] arXiv:2507.00669 [pdf, html, other]
Title: Audio-3DVG: Unified Audio - Point Cloud Fusion for 3D Visual Grounding
Duc Cao-Dinh, Khai Le-Duc, Anh Dao, Bach Phan Tat, Chris Ngo, Duy M. H. Nguyen, Nguyen X. Khanh, Thanh Nguyen-Tang
Comments: Work in progress, 42 pages
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

3D Visual Grounding (3DVG) involves localizing target objects in 3D point clouds based on natural language. While prior work has made strides using textual descriptions, leveraging spoken language-known as Audio-based 3D Visual Grounding-remains underexplored and challenging. Motivated by advances in automatic speech recognition (ASR) and speech representation learning, we propose Audio-3DVG, a simple yet effective framework that integrates audio and spatial information for enhanced grounding. Rather than treating speech as a monolithic input, we decompose the task into two complementary components. First, we introduce Object Mention Detection, a multi-label classification task that explicitly identifies which objects are referred to in the audio, enabling more structured audio-scene reasoning. Second, we propose an Audio-Guided Attention module that captures interactions between candidate objects and relational speech cues, improving target discrimination in cluttered scenes. To support benchmarking, we synthesize audio descriptions for standard 3DVG datasets, including ScanRefer, Sr3D, and Nr3D. Experimental results demonstrate that Audio-3DVG not only achieves new state-of-the-art performance in audio-based grounding, but also competes with text-based methods-highlighting the promise of integrating spoken language into 3D vision tasks.

[312] arXiv:2507.00672 [pdf, html, other]
Title: Toward Edge General Intelligence with Multiple-Large Language Model (Multi-LLM): Architecture, Trust, and Orchestration
Haoxiang Luo, Yinqiu Liu, Ruichen Zhang, Jiacheng Wang, Gang Sun, Dusit Niyato, Hongfang Yu, Zehui Xiong, Xianbin Wang, Xuemin Shen
Subjects: Networking and Internet Architecture (cs.NI); Distributed, Parallel, and Cluster Computing (cs.DC)

Edge computing enables real-time data processing closer to its source, thus improving the latency and performance of edge-enabled AI applications. However, traditional AI models often fall short when dealing with complex, dynamic tasks that require advanced reasoning and multimodal data processing. This survey explores the integration of multi-LLMs (Large Language Models) to address this in edge computing, where multiple specialized LLMs collaborate to enhance task performance and adaptability in resource-constrained environments. We review the transition from conventional edge AI models to single LLM deployment and, ultimately, to multi-LLM systems. The survey discusses enabling technologies such as dynamic orchestration, resource scheduling, and cross-domain knowledge transfer that are key for multi-LLM implementation. A central focus is on trusted multi-LLM systems, ensuring robust decision-making in environments where reliability and privacy are crucial. We also present multimodal multi-LLM architectures, where multiple LLMs specialize in handling different data modalities, such as text, images, and audio, by integrating their outputs for comprehensive analysis. Finally, we highlight future directions, including improving resource efficiency, trustworthy governance multi-LLM systems, while addressing privacy, trust, and robustness concerns. This survey provides a valuable reference for researchers and practitioners aiming to leverage multi-LLM systems in edge computing applications.

[313] arXiv:2507.00674 [pdf, html, other]
Title: A hyperboloidal method for numerical simulations of multidimensional nonlinear wave equations
Oliver Rinne
Comments: 20 pages, 8 figures, 2 tables
Subjects: Numerical Analysis (math.NA); Mathematical Physics (math-ph); Analysis of PDEs (math.AP)

We consider the scalar wave equation with power nonlinearity in n+1 dimensions. Unlike previous numerical studies, we go beyond the radial case and do not assume any symmetries for n=3, and we only impose an SO(n-1) symmetry in higher dimensions. Our method is based on a hyperboloidal foliation of Minkowski spacetime and conformal compactification. We focus on the late-time power-law decay (tails) of the solutions and compute decay exponents for different spherical harmonic modes, for subcritical, critical and supercritical, focusing and defocusing nonlinear wave equations.

[314] arXiv:2507.00676 [pdf, html, other]
Title: A Unified Transformer-Based Framework with Pretraining For Whole Body Grasping Motion Generation
Edward Effendy, Kuan-Wei Tseng, Rei Kawakami
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Accepted in the ICIP 2025
We present a novel transformer-based framework for whole-body grasping that addresses both pose generation and motion infilling, enabling realistic and stable object interactions. Our pipeline comprises three stages: Grasp Pose Generation for full-body grasp generation, Temporal Infilling for smooth motion continuity, and a LiftUp Transformer that refines downsampled joints back to high-resolution markers. To overcome the scarcity of hand-object interaction data, we introduce a data-efficient Generalized Pretraining stage on large, diverse motion datasets, yielding robust spatio-temporal representations transferable to grasping tasks. Experiments on the GRAB dataset show that our method outperforms state-of-the-art baselines in terms of coherence, stability, and visual realism. The modular design also supports easy adaptation to other human-motion applications.

[315] arXiv:2507.00677 [pdf, html, other]
Title: Learning Steerable Imitation Controllers from Unstructured Animal Motions
Dongho Kang, Jin Cheng, Fatemeh Zargarbashi, Taerim Yoon, Sungjoon Choi, Stelian Coros
Comments: The supplementary video is available at this https URL
Subjects: Robotics (cs.RO)

This paper presents a control framework for legged robots that leverages unstructured real-world animal motion data to generate animal-like and user-steerable behaviors. Our framework learns to follow velocity commands while reproducing the diverse gait patterns in the original dataset. To begin with, animal motion data is transformed into a robot-compatible database using constrained inverse kinematics and model predictive control, bridging the morphological and physical gap between the animal and the robot. Subsequently, a variational autoencoder-based motion synthesis module captures the diverse locomotion patterns in the motion database and generates smooth transitions between them in response to velocity commands. The resulting kinematic motions serve as references for a reinforcement learning-based feedback controller deployed on physical robots. We show that this approach enables a quadruped robot to adaptively switch gaits and accurately track user velocity commands while maintaining the stylistic coherence of the motion data. Additionally, we provide component-wise evaluations to analyze the system's behavior in depth and demonstrate the efficacy of our method for more accurate and reliable motion imitation.

[316] arXiv:2507.00678 [pdf, html, other]
Title: Sectional Kolmogorov N-widths for parameter-dependent function spaces: A general framework with application to parametrized Friedrichs' systems
Christian Engwer, Mario Ohlberger, Lukas Renelt
Comments: 18 pages, 2 figures
Subjects: Numerical Analysis (math.NA)

We investigate parametrized variational problems where for each parameter the solution may originate from a different parameter-dependent function space. Our main motivation is the theory of Friedrichs' systems, a large abstract class of linear PDE-problems whose solutions are sought in operator- (and thus parameter-)dependent graph spaces. Other applications include function spaces on parametrized domains or discretizations involving data-dependent stabilizers. Concerning the set of all parameter-dependent solutions, we argue that in these cases the interpretation as a "solution manifold" widely adopted in the model order reduction community is no longer applicable. Instead, we propose a novel framework based on the theory of fiber bundles and explain how established concepts such as approximability generalize by introducing a Sectional Kolmogorov N-width. Further, we prove exponential approximation rates of this N-width if a norm equivalence criterion is fulfilled. Applying this result to problems with Friedrichs' structure then gives a sufficient criterion that can be easily verified.

[317] arXiv:2507.00686 [pdf, html, other]
Title: A Domain-specific Language and Architecture for Detecting Process Activities from Sensor Streams in IoT
Ronny Seiger, Daniel Locher, Marco Kaufmann, Aaron F. Kurz
Comments: Submitted to Internet of Things (ISSN 2542-6605)
Subjects: Software Engineering (cs.SE); Emerging Technologies (cs.ET)

Modern Internet of Things (IoT) systems are equipped with a plethora of sensors providing real-time data about the current operations of their components, which is crucial for the systems' internal control systems and processes. However, these data are often too fine-grained to derive useful insights into the execution of the larger processes an IoT system might be part of. Process mining has developed advanced approaches for the analysis of business processes that may also be used in the context of IoT. Bringing process mining to IoT requires an event abstraction step to lift the low-level sensor data to the business process level. In this work, we aim to empower domain experts to perform this step using a newly developed domain-specific language (DSL) called Radiant. Radiant supports the specification of patterns within the sensor data that indicate the execution of higher level process activities. These patterns are translated to complex event processing (CEP) applications to be used for detecting activity executions at runtime. We propose a corresponding software architecture for online event abstraction from IoT sensor streams using the CEP applications. We evaluate these applications to monitor activity executions using IoT sensors in smart manufacturing and smart healthcare. The evaluation method and results inform the domain expert about the quality of activity detections and potential for improvement.

[318] arXiv:2507.00687 [pdf, html, other]
Title: Diffusion Classifier Guidance for Non-robust Classifiers
Philipp Vaeth, Dibyanshu Kumar, Benjamin Paassen, Magda Gregorová
Comments: Accepted at ECML 2025
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Classifier guidance is intended to steer a diffusion process such that a given classifier reliably recognizes the generated data point as a certain class. However, most classifier guidance approaches are restricted to robust classifiers, which were specifically trained on the noise of the diffusion forward process. We extend classifier guidance to work with general, non-robust, classifiers that were trained without noise. We analyze the sensitivity of both non-robust and robust classifiers to noise of the diffusion process on the standard CelebA data set, the specialized SportBalls data set and the high-dimensional real-world CelebA-HQ data set. Our findings reveal that non-robust classifiers exhibit significant accuracy degradation under noisy conditions, leading to unstable guidance gradients. To mitigate these issues, we propose a method that utilizes one-step denoised image predictions and implements stabilization techniques inspired by stochastic optimization methods, such as exponential moving averages. Experimental results demonstrate that our approach improves the stability of classifier guidance while maintaining sample diversity and visual quality. This work contributes to advancing conditional sampling techniques in generative models, enabling a broader range of classifiers to be used as guidance classifiers.

[319] arXiv:2507.00690 [pdf, html, other]
Title: Cage-Based Deformation for Transferable and Undefendable Point Cloud Attack
Keke Tang, Ziyong Du, Weilong Peng, Xiaofei Wang, Peican Zhu, Ligang Liu, Zhihong Tian
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

Adversarial attacks on point clouds often impose strict geometric constraints to preserve plausibility; however, such constraints inherently limit transferability and undefendability. While deformation offers an alternative, existing unstructured approaches may introduce unnatural distortions, making adversarial point clouds conspicuous and undermining their plausibility. In this paper, we propose CageAttack, a cage-based deformation framework that produces natural adversarial point clouds. It first constructs a cage around the target object, providing a structured basis for smooth, natural-looking deformation. Perturbations are then applied to the cage vertices, which seamlessly propagate to the point cloud, ensuring that the resulting deformations remain intrinsic to the object and preserve plausibility. Extensive experiments on seven 3D deep neural network classifiers across three datasets show that CageAttack achieves a superior balance among transferability, undefendability, and plausibility, outperforming state-of-the-art methods. Codes will be made public upon acceptance.

[320] arXiv:2507.00693 [pdf, html, other]
Title: Leveraging Large Language Models for Spontaneous Speech-Based Suicide Risk Detection
Yifan Gao, Jiao Fu, Long Guo, Hong Liu
Comments: Accepted to Interspeech 2025
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Early identification of suicide risk is crucial for preventing suicidal behaviors. As a result, the identification and study of patterns and markers related to suicide risk have become a key focus of current research. In this paper, we present the results of our work in the 1st SpeechWellness Challenge (SW1), which aims to explore speech as a non-invasive and easily accessible mental health indicator for identifying adolescents at risk of this http URL approach leverages large language model (LLM) as the primary tool for feature extraction, alongside conventional acoustic and semantic features. The proposed method achieves an accuracy of 74\% on the test set, ranking first in the SW1 challenge. These findings demonstrate the potential of LLM-based methods for analyzing speech in the context of suicide risk assessment.

[321] arXiv:2507.00695 [pdf, html, other]
Title: A Test-Function Approach to Incremental Stability
Daniel Pfrommer, Max Simchowitz, Ali Jadbabaie
Comments: 8 pages
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)

This paper presents a novel framework for analyzing Incremental-Input-to-State Stability ($\delta$ISS) based on the idea of using rewards as "test functions." Whereas control theory traditionally deals with Lyapunov functions that satisfy a time-decrease condition, reinforcement learning (RL) value functions are constructed by exponentially decaying a Lipschitz reward function that may be non-smooth and unbounded on both sides. Thus, these RL-style value functions cannot be directly understood as Lyapunov certificates. We develop a new equivalence between a variant of incremental input-to-state stability of a closed-loop system under given a policy, and the regularity of RL-style value functions under adversarial selection of a Hölder-continuous reward function. This result highlights that the regularity of value functions, and their connection to incremental stability, can be understood in a way that is distinct from the traditional Lyapunov-based approach to certifying stability in control theory.

[322] arXiv:2507.00697 [pdf, html, other]
Title: Analysis of A Mixed Finite Element Method for Poisson's Equation with Rough Boundary Data
Huadong Gao, Yuhui Huang, Wen Xie
Comments: 18 pages,6 figures
Subjects: Numerical Analysis (math.NA)

This paper is concerned with finite element methods for Poisson's equation with rough boundary data. Conventional methods require that the boundary data $g$ of the problem belongs to $H^{1/2} (\partial \Omega)$. However, in many applications one has to consider the case when $g$ is in $L^2(\partial \Omega)$ only. To this end, very weak solutions are considered to establish the well-posedness of the problem. Most previously proposed numerical methods use regularizations of the boundary data. The main purpose of this paper is to use the Raviart--Thomas mixed finite element method to solve the Poisson equation with rough boundary data directly. We prove that the solution to the proposed mixed method converges to the very weak solution. In particular, we prove that the convergence rate of the numerical solution is $O(h^{1/2})$ in convex domains and $O(h^{s-1/2})$ in nonconvex domains, where $s > 1/2$ depends on the geometry of the domain. The analysis is based on a regularized approach and a rigorous estimate for the corresponding dual problem. Numerical experiments confirm the theoretically predicted convergence rates for the proposed mixed method for Poisson's equation with rough boundary data.

[323] arXiv:2507.00698 [pdf, html, other]
Title: Rectifying Magnitude Neglect in Linear Attention
Qihang Fan, Huaibo Huang, Yuang Ai, ran He
Comments: Accepted by ICCV2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)

As the core operator of Transformers, Softmax Attention exhibits excellent global modeling capabilities. However, its quadratic complexity limits its applicability to vision tasks. In contrast, Linear Attention shares a similar formulation with Softmax Attention while achieving linear complexity, enabling efficient global information modeling. Nevertheless, Linear Attention suffers from a significant performance degradation compared to standard Softmax Attention. In this paper, we analyze the underlying causes of this issue based on the formulation of Linear Attention. We find that, unlike Softmax Attention, Linear Attention entirely disregards the magnitude information of the Query. This prevents the attention score distribution from dynamically adapting as the Query scales. As a result, despite its structural similarity to Softmax Attention, Linear Attention exhibits a significantly different attention score distribution. Based on this observation, we propose Magnitude-Aware Linear Attention (MALA), which modifies the computation of Linear Attention to fully incorporate the Query's magnitude. This adjustment allows MALA to generate an attention score distribution that closely resembles Softmax Attention while exhibiting a more well-balanced structure. We evaluate the effectiveness of MALA on multiple tasks, including image classification, object detection, instance segmentation, semantic segmentation, natural language processing, speech recognition, and image generation. Our MALA achieves strong results on all of these tasks. Code will be available at this https URL

[324] arXiv:2507.00699 [pdf, html, other]
Title: A Hierarchical and Evolvable Benchmark for Fine-Grained Code Instruction Following with Multi-Turn Feedback
Guoliang Duan, Mingwei Liu, Yanlin Wang, Chong Wang, Xin Peng, Zibin Zheng
Subjects: Software Engineering (cs.SE)

Large language models (LLMs) have advanced significantly in code generation, yet their ability to follow complex programming instructions with layered and diverse constraints remains underexplored. Existing benchmarks often prioritize functional correctness, overlooking the nuanced requirements found in real-world development. We introduce MultiCodeIF, a comprehensive benchmark designed to evaluate instruction-following in code generation across multiple dimensions: constraint type, hierarchical levels, and iterative refinement. Built upon a structured taxonomy of 9 categories and 27 constraint types, MultiCodeIF enables granular assessment of both functional and non-functional instruction adherence. Using an automated pipeline, ConstraGen, we synthesize and evolve 2,021 code tasks sourced from 14 programming languages, supporting multi-turn evaluation through feedback-driven task variants. Empirical evaluation of six state-of-the-art LLMs uncovers substantial performance disparities. The top-performing model, Claude-3-7-Sonnet, achieves 63.0% average constraint satisfaction, while smaller models like Qwen3-1.7B fall to 44.8%. Models perform well on explicit constraints, but struggle with implicit or abstract constraints. Tasks with multiple hierarchical constraints significantly reduce model success rates, from 54.5% in single-level to just 18.8% in multi-level scenarios. However, structured feedback enables progressive improvement: average constraint satisfaction rises from 63.0% to 83.4% over four iterative refinement rounds. MultiCodeIF provides a scalable, constraint-aware, and feedback-sensitive framework to benchmark LLMs under realistic code generation scenarios, bridging the gap between synthetic evaluations and real-world instruction complexity. The full benchmark dataset, evaluation pipeline, and source code are available at this https URL.

[325] arXiv:2507.00700 [pdf, html, other]
Title: Contrasting Cognitive Styles in Vision-Language Models: Holistic Attention in Japanese Versus Analytical Focus in English
Ahmed Sabir, Azinovič Gasper, Mengsay Loem, Rajesh Sharma
Subjects: Computation and Language (cs.CL)

Cross-cultural research in perception and cognition has shown that individuals from different cultural backgrounds process visual information in distinct ways. East Asians, for example, tend to adopt a holistic perspective, attending to contextual relationships, whereas Westerners often employ an analytical approach, focusing on individual objects and their attributes. In this study, we investigate whether Vision-Language Models (VLMs) trained predominantly on different languages, specifically Japanese and English, exhibit similar culturally grounded attentional patterns. Using comparative analysis of image descriptions, we examine whether these models reflect differences in holistic versus analytic tendencies. Our findings suggest that VLMs not only internalize the structural properties of language but also reproduce cultural behaviors embedded in the training data, indicating that cultural cognition may implicitly shape model outputs.

[326] arXiv:2507.00701 [pdf, html, other]
Title: SCAWaveNet: A Spatial-Channel Attention-based Network for Global Significant Wave Height Retrieval
Chong Zhang, Xichao Liu, Yibing Zhan, Dapeng Tao, Jun Ni
Comments: 16 pages,6 tables,11 figures
Subjects: Machine Learning (cs.LG)

Recent advancements in spaceborne GNSS missions have produced extensive global datasets, providing a robust basis for deep learning-based significant wave height (SWH) retrieval. While existing deep learning models predominantly utilize CYGNSS data with four-channel information, they often adopt single-channel inputs or simple channel concatenation without leveraging the benefits of cross-channel information interaction during training. To address this limitation, a novel spatial-channel attention-based network, namely SCAWaveNet, is proposed for SWH retrieval. Specifically, features from each channel of the DDMs are modeled as independent attention heads, enabling the fusion of spatial and channel-wise information. For auxiliary parameters, a lightweight attention mechanism is designed to assign weights along the spatial and channel dimensions. The final feature integrates both spatial and channel-level characteristics. Model performance is evaluated using four-channel CYGNSS data. When ERA5 is used as a reference, SCAWaveNet achieves an average RMSE of 0.438 m. When using buoy data from NDBC, the average RMSE reaches 0.432 m. Compared to state-of-the-art models, SCAWaveNet reduces the average RMSE by at least 3.52% on the ERA5 dataset and by 5.47% on the NDBC buoy observations. The code is available at this https URL.

[327] arXiv:2507.00707 [pdf, html, other]
Title: BEV-VAE: Multi-view Image Generation with Spatial Consistency for Autonomous Driving
Zeming Chen, Hang Zhao
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Multi-view image generation in autonomous driving demands consistent 3D scene understanding across camera views. Most existing methods treat this problem as a 2D image set generation task, lacking explicit 3D modeling. However, we argue that a structured representation is crucial for scene generation, especially for autonomous driving applications. This paper proposes BEV-VAE for consistent and controllable view synthesis. BEV-VAE first trains a multi-view image variational autoencoder for a compact and unified BEV latent space and then generates the scene with a latent diffusion transformer. BEV-VAE supports arbitrary view generation given camera configurations, and optionally 3D layouts. Experiments on nuScenes and Argoverse 2 (AV2) show strong performance in both 3D consistent reconstruction and generation. The code is available at: this https URL.

[328] arXiv:2507.00708 [pdf, html, other]
Title: On the (In)Approximability of the Monitoring Edge Geodetic Set Problem
Davide Bilò, Giodano Colli, Luca Forlizzi, Stefano Leucci
Comments: arXiv admin note: text overlap with arXiv:2405.13875
Subjects: Data Structures and Algorithms (cs.DS); Computational Complexity (cs.CC)

We study the minimum \emph{Monitoring Edge Geodetic Set} (\megset) problem introduced in [Foucaud et al., CALDAM'23]: given a graph $G$, we say that an edge is monitored by a pair $u,v$ of vertices if \emph{all} shortest paths between $u$ and $v$ traverse $e$; the goal of the problem consists in finding a subset $M$ of vertices of $G$ such that each edge of $G$ is monitored by at least one pair of vertices in $M$, and $|M|$ is minimized.
In this paper, we prove that all polynomial-time approximation algorithms for the minimum \megset problem must have an approximation ratio of $\Omega(\log n)$, unless \p = \np. To the best of our knowledge, this is the first non-constant inapproximability result known for this problem. We also strengthen the known \np-hardness of the problem on $2$-apex graphs by showing that the same result holds for $1$-apex graphs. This leaves open the problem of determining whether the problem remains \np-hard on planar (i.e., $0$-apex) graphs.
On the positive side, we design an algorithm that computes good approximate solutions for hereditary graph classes that admit efficiently computable balanced separators of truly sublinear size. This immediately results in polynomial-time approximation algorithms achieving an approximation ratio of $O(n^{\frac{1}{4}} \sqrt{\log n})$ on planar graphs, graphs with bounded genus, and $k$-apex graphs with $k=O(n^{\frac{1}{4}})$. On graphs with bounded treewidth, we obtain an approximation ratio of $O(\log^{3/2} n)$ for any constant $\varepsilon > 0$. This compares favorably with the best-known approximation algorithm for general graphs, which achieves an approximation ratio of $O(\sqrt{n \log n})$ via a simple reduction to the \textsc{Set Cover} problem.

[329] arXiv:2507.00709 [pdf, html, other]
Title: TopoStreamer: Temporal Lane Segment Topology Reasoning in Autonomous Driving
Yiming Yang, Yueru Luo, Bingkun He, Hongbin Lin, Suzhong Fu, Chao Yan, Kun Tang, Xinrui Yan, Chao Zheng, Shuguang Cui, Zhen Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Lane segment topology reasoning constructs a comprehensive road network by capturing the topological relationships between lane segments and their semantic types. This enables end-to-end autonomous driving systems to perform road-dependent maneuvers such as turning and lane changing. However, the limitations in consistent positional embedding and temporal multiple attribute learning in existing methods hinder accurate roadnet reconstruction. To address these issues, we propose TopoStreamer, an end-to-end temporal perception model for lane segment topology reasoning. Specifically, TopoStreamer introduces three key improvements: streaming attribute constraints, dynamic lane boundary positional encoding, and lane segment denoising. The streaming attribute constraints enforce temporal consistency in both centerline and boundary coordinates, along with their classifications. Meanwhile, dynamic lane boundary positional encoding enhances the learning of up-to-date positional information within queries, while lane segment denoising helps capture diverse lane segment patterns, ultimately improving model performance. Additionally, we assess the accuracy of existing models using a lane boundary classification metric, which serves as a crucial measure for lane-changing scenarios in autonomous driving. On the OpenLane-V2 dataset, TopoStreamer demonstrates significant improvements over state-of-the-art methods, achieving substantial performance gains of +3.4% mAP in lane segment perception and +2.1% OLS in centerline perception tasks.

[330] arXiv:2507.00710 [pdf, html, other]
Title: Robust Task Offloading for UAV-enabled Secure MEC Against Aerial Eavesdropper
Can Cui, ZIye Jia, Chao Dong, Qihui Wu
Subjects: Emerging Technologies (cs.ET)

Unmanned aerial vehicles (UAVs) are recognized as a promising candidate for the multi-access edge computing (MEC) in the future sixth generation communication networks. However, the aerial eavesdropping UAVs (EUAVs) pose a significant security threat to the data offloading. In this paper, we investigate a robust MEC scenario with multiple service UAVs (SUAVs) towards the potential eavesdropping from the EUAV, in which the random parameters such as task complexities are considered in the practical applications. In detail, the problem is formulated to optimize the deployment positions of SUAVs, the connection relationships between GUs and SUAVs, and the offloading ratios. With the uncertain task complexities, the corresponding chance constraints are constructed under the uncertainty set, which is tricky to deal with. Therefore, we first optimize the pre-deployment of SUAVs by the K-means algorithm. Then, the distributionally robust optimization method is employed, and the conditional value at risk is utilized to transform the chance constraints into convex forms, which can be solved via convex toolkits. Finally, the simulation results show that with the consideration of uncertainties, just 5% more energy is consumed compared with the ideal circumstance, which verifies the robustness of the proposed algorithms.

[331] arXiv:2507.00711 [pdf, html, other]
Title: Large Reasoning Models are not thinking straight: on the unreliability of thinking trajectories
Jhouben Cuesta-Ramirez, Samuel Beaussant, Mehdi Mounsif
Comments: Accepted to KONVENS 2025
Subjects: Machine Learning (cs.LG)

Large Language Models (LLMs) trained via Reinforcement Learning (RL) have recently achieved impressive results on reasoning benchmarks. Yet, growing evidence shows that these models often generate longer but ineffective chains of thought (CoTs), calling into question whether benchmark gains reflect real reasoning improvements. We present new evidence of overthinking, where models disregard correct solutions even when explicitly provided, instead continuing to generate unnecessary reasoning steps that often lead to incorrect conclusions. Experiments on three state-of-the-art models using the AIME2024 math benchmark reveal critical limitations in these models ability to integrate corrective information, posing new challenges for achieving robust and interpretable reasoning.

[332] arXiv:2507.00715 [pdf, html, other]
Title: EARN: Efficient Inference Acceleration for LLM-based Generative Recommendation by Register Tokens
Chaoqun Yang, Xinyu Lin, Wenjie Wang, Yongqi Li, Teng Sun, Xianjing Han, Tat-Seng Chua
Comments: Accepted by KDD 2025
Subjects: Information Retrieval (cs.IR)

Large Language Model-based generative recommendation (LLMRec) has achieved notable success, but it suffers from high inference latency due to massive computational overhead and memory pressure of KV Cache. Existing KV Cache reduction methods face critical limitations: cache compression offers marginal acceleration given recommendation tasks' short decoding steps, while prompt compression risks discarding vital interaction history. Through systematic analysis of attention patterns in LLMRec, we uncover two pivotal insights: 1) layer-wise attention sparsity inversion where early layers retain dense informative patterns while later layers exhibit high redundancy, and 2) dual attention sinks phenomenon where attention scores concentrate on both head and tail tokens of input sequences. Motivated by these insights, we propose EARN, an efficient inference framework that leverages the early layers to compress information into register tokens placed at the input sequence boundaries, then focuses solely on these tokens in the subsequent layers. Extensive experiments on three datasets, two LLMRec methods and two LLM architectures demonstrate EARN's superiority, achieving up to 3.79x speedup and 80.8% KV Cache reduction with better accuracy than the general finetuning approach. Our work bridges the efficiency-effectiveness gap in LLMRec, offering practical deployment advantages for industrial scenarios.

[333] arXiv:2507.00716 [pdf, other]
Title: Accelerating Loading WebGraphs in ParaGrapher
Mohsen Koohi Esfahani
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

ParaGrapher is a graph loading API and library that enables graph processing frameworks to load large-scale compressed graphs with minimal overhead. This capability accelerates the design and implementation of new high-performance graph algorithms and their evaluation on a wide range of graphs and across different frameworks. However, our previous study identified two major limitations in ParaGrapher: inefficient utilization of high-bandwidth storage and reduced decompression bandwidth due to increased compression ratios. To address these limitations, we present two optimizations for ParaGrapher in this paper. To improve storage utilization, particularly for high-bandwidth storage, we introduce ParaGrapher-FUSE (PG-Fuse) a filesystem based on the FUSE (Filesystem in User Space). PG-Fuse optimizes storage access by increasing the size of requested blocks, reducing the number of calls to the underlying filesystem, and caching the received blocks in memory for future calls. To improve the decompression bandwidth, we introduce CompBin, a compact binary representation of the CSR format. CompBin facilitates direct accesses to neighbors while preventing storage usage for unused bytes. Our evaluation on 12 real-world and synthetic graphs with up to 128 billion edges shows that PG-Fuse and CompBin achieve up to 7.6 and 21.8 times speedup, respectively.

[334] arXiv:2507.00718 [pdf, html, other]
Title: AI Analyst: Framework and Comprehensive Evaluation of Large Language Models for Financial Time Series Report Generation
Elizabeth Fons, Elena Kochkina, Rachneet Kaur, Zhen Zeng, Berowne Hlavaty, Charese Smiley, Svitlana Vyetrenko, Manuela Veloso
Subjects: Computation and Language (cs.CL)

This paper explores the potential of large language models (LLMs) to generate financial reports from time series data. We propose a framework encompassing prompt engineering, model selection, and evaluation. We introduce an automated highlighting system to categorize information within the generated reports, differentiating between insights derived directly from time series data, stemming from financial reasoning, and those reliant on external knowledge. This approach aids in evaluating the factual grounding and reasoning capabilities of the models. Our experiments, utilizing both data from the real stock market indices and synthetic time series, demonstrate the capability of LLMs to produce coherent and informative financial reports.

[335] arXiv:2507.00721 [pdf, html, other]
Title: UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement
Xiao Zhang, Fei Wei, Yong Wang, Wenda Zhao, Feiyi Li, Xiangxiang Chu
Comments: ICCV2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Zero-shot domain adaptation (ZSDA) presents substantial challenges due to the lack of images in the target domain. Previous approaches leverage Vision-Language Models (VLMs) to tackle this challenge, exploiting their zero-shot learning capabilities. However, these methods primarily address domain distribution shifts and overlook the misalignment between the detection task and VLMs, which rely on manually crafted prompts. To overcome these limitations, we propose the unified prompt and representation enhancement (UPRE) framework, which jointly optimizes both textual prompts and visual representations. Specifically, our approach introduces a multi-view domain prompt that combines linguistic domain priors with detection-specific knowledge, and a visual representation enhancement module that produces domain style variations. Furthermore, we introduce multi-level enhancement strategies, including relative domain distance and positive-negative separation, which align multi-modal representations at the image level and capture diverse visual representations at the instance level, respectively. Extensive experiments conducted on nine benchmark datasets demonstrate the superior performance of our framework in ZSDA detection scenarios. Code is available at this https URL.

[336] arXiv:2507.00723 [pdf, other]
Title: Multi-goal-oriented anisotropic error control and mesh adaptivity for time-dependent convection-dominated problems
Markus Bause, Marius Paul Bruchhäuser, Bernhard Endtmayer, Nils Margenberg, Ioannis Toulopoulos, Thomas Wick
Comments: 14 pages, 5 Figures, 2 Tables. Submitted to PAMM. arXiv admin note: substantial text overlap with arXiv:2504.04951
Subjects: Numerical Analysis (math.NA)

In this work, we present an anisotropic multi-goal error control based on the Dual Weighted Residual (DWR) method for time-dependent convection-diffusion-reaction (CDR) equations. This multi-goal oriented approach allows for an accurate and efficient error control with regard to several quantities of interest simultaneously. Using anisotropic interpolation and restriction operators, we obtain elementwise error indicators in space and time, where the spatial indicators are additionally separated with respect to the single directions. The directional error indicators quantify anisotropy of the solution with respect to the goals, and produce adaptive, anisotropic meshes that efficiently capture layers. To prevent spurious oscillations the streamline upwind Petrov-Galerkin (SUPG) method is applied to stabilize the underlying system in the case of high Péclet numbers. Numerical examples show efficiency and robustness of the proposed approach for several goal quantities using established benchmarks for convection-dominated transport.

[337] arXiv:2507.00724 [pdf, html, other]
Title: Holmes: Towards Effective and Harmless Model Ownership Verification to Personalized Large Vision Models via Decoupling Common Features
Linghui Zhu, Yiming Li, Haiqin Weng, Yan Liu, Tianwei Zhang, Shu-Tao Xia, Zhi Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Large vision models achieve remarkable performance in various downstream tasks, primarily by personalizing pre-trained models through fine-tuning with private and valuable local data, which makes the personalized model a valuable intellectual property for its owner. Similar to the era of traditional DNNs, model stealing attacks also pose significant risks to these personalized models. However, in this paper, we reveal that most existing defense methods (developed for traditional DNNs), typically designed for models trained from scratch, either introduce additional security risks, are prone to misjudgment, or are even ineffective for fine-tuned models. To alleviate these problems, this paper proposes a harmless model ownership verification method for personalized models by decoupling similar common features. In general, our method consists of three main stages. In the first stage, we create shadow models that retain common features of the victim model while disrupting dataset-specific features. We represent the dataset-specific features of the victim model by the output differences between the shadow and victim models. After that, a meta-classifier is trained to identify stolen models by determining whether suspicious models contain the dataset-specific features of the victim. In the third stage, we conduct model ownership verification by hypothesis test to mitigate randomness and enhance robustness. Extensive experiments on benchmark datasets verify the effectiveness of the proposed method in detecting different types of model stealing simultaneously.

[338] arXiv:2507.00725 [pdf, html, other]
Title: Analyzing Time-Varying Scalar Fields using Piecewise-Linear Morse-Cerf Theory
Amritendu Dhar, Apratim Chakraborty, Vijay Natarajan
Subjects: Graphics (cs.GR); Computational Geometry (cs.CG)

Morse-Cerf theory considers a one-parameter family of smooth functions defined on a manifold and studies the evolution of their critical points with the parameter. This paper presents an adaptation of Morse-Cerf theory to a family of piecewise-linear (PL) functions. The vertex diagram and Cerf diagram are introduced as representations of the evolution of critical points of the PL function. The characterization of a crossing in the vertex diagram based on the homology of the lower links of vertices leads to the definition of a topological descriptor for time-varying scalar fields. An algorithm for computing the Cerf diagram and a measure for comparing two Cerf diagrams are also described together with experimental results on time-varying scalar fields.

[339] arXiv:2507.00726 [pdf, html, other]
Title: Can Large Language Models Develop Strategic Reasoning? Post-training Insights from Learning Chess
Dongyoon Hwang, Hojoon Lee, Jaegul Choo, Dongmin Park, Jongho Park
Comments: 27 pages
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

While reinforcement learning (RL) for large language models (LLMs) has shown promise in mathematical reasoning, strategic reasoning for LLMs using RL remains largely unexplored. We investigate whether LLMs can develop strategic reasoning capabilities through RL in chess. To this end, we leverage a chess-pretrained action-value network to provide dense reward on the LLM's output move quality, which can be seen as a form of knowledge distillation. Our experiments show that our distillation-based dense rewards often outperform sparse binary rewards. However, surprisingly, all models plateau far below expert levels. We provide SFT and RL ablations on chess reasoning training and find evidence that this limitation stems from a deficit in the pretrained models' internal understanding of chess--a deficit which RL alone may not be able to fully overcome.

[340] arXiv:2507.00727 [pdf, other]
Title: On Hierarchical Coded Caching with Offline Users
Rashid Ummer N.T., B. Sundar Rajan
Comments: A short version of this is accepted for presentation in 2025 IEEE Information Theory Workshop; 8 pages, one figure
Subjects: Information Theory (cs.IT)

This paper studies a two-layer hierarchical network in which some users are offline during the content delivery phase. A two-layer hierarchical network consists of a single server connected to multiple cache-aided mirror sites, and each mirror site is connected to a distinct set of cache-aided users. A scheme for such a hierarchical system with offline users has been proposed recently but considered a special case where all mirror caches have zero memory, which is a significant limitation. We propose an array known as a hierarchical hotplug placement delivery array (HHPDA), which describes the placement and delivery phases of a coded caching scheme for a general two-layer hierarchical network with offline users. Further, we construct a class of HHPDAs using combinatorial t-designs.

[341] arXiv:2507.00728 [pdf, html, other]
Title: Temporal Orienteering with Changing Fuel Costs
Timothée Corsini, Jessica Enright, Laura Larios-Jones, Kitty Meeks
Comments: 23 pages, 2 figures
Subjects: Discrete Mathematics (cs.DM)

The problem Orienteering asks whether there exists a walk which visits a number of sites without exceeding some fuel budget. In the variant of the problem we consider, the cost of each edge in the walk is dependent on the time we depart one endpoint and the time we arrive at the other endpoint. This mirrors applications such as travel between orbiting objects where fuel costs are dependent on both the departure time and the length of time spent travelling. In defining this problem, we introduce a natural generalisation of the standard notion of temporal graphs: the pair consisting of the graph of the sites and a cost function, in which costs as well as shortest travel times between pairs of objects change over time. We believe this model is likely to be of independent interest. The problem of deciding whether a stated goal is feasible is easily seen to be NP-complete; we investigate three different ways to restrict the input which lead to efficient algorithms. These include the number of times an edge can be used, an analogue of vertex-interval-membership width, and the number of sites to be visited.

[342] arXiv:2507.00733 [pdf, other]
Title: Aleatoric and Epistemic Uncertainty Measures for Ordinal Classification through Binary Reduction
Stefan Haas, Eyke Hüllermeier
Subjects: Machine Learning (cs.LG)

Ordinal classification problems, where labels exhibit a natural order, are prevalent in high-stakes fields such as medicine and finance. Accurate uncertainty quantification, including the decomposition into aleatoric (inherent variability) and epistemic (lack of knowledge) components, is crucial for reliable decision-making. However, existing research has primarily focused on nominal classification and regression. In this paper, we introduce a novel class of measures of aleatoric and epistemic uncertainty in ordinal classification, which is based on a suitable reduction to (entropy- and variance-based) measures for the binary case. These measures effectively capture the trade-off in ordinal classification between exact hit-rate and minimial error distances. We demonstrate the effectiveness of our approach on various tabular ordinal benchmark datasets using ensembles of gradient-boosted trees and multi-layer perceptrons for approximate Bayesian inference. Our method significantly outperforms standard and label-wise entropy and variance-based measures in error detection, as indicated by misclassification rates and mean absolute error. Additionally, the ordinal measures show competitive performance in out-of-distribution (OOD) detection. Our findings highlight the importance of considering the ordinal nature of classification problems when assessing uncertainty.

[343] arXiv:2507.00736 [pdf, html, other]
Title: Ordinality in Discrete-level Question Difficulty Estimation: Introducing Balanced DRPS and OrderedLogitNN
Arthur Thuy, Ekaterina Loginova, Dries F. Benoit
Comments: Published in the EvalLAC'25 workshop at AIED 2025
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Recent years have seen growing interest in Question Difficulty Estimation (QDE) using natural language processing techniques. Question difficulty is often represented using discrete levels, framing the task as ordinal regression due to the inherent ordering from easiest to hardest. However, the literature has neglected the ordinal nature of the task, relying on classification or discretized regression models, with specialized ordinal regression methods remaining unexplored. Furthermore, evaluation metrics are tightly coupled to the modeling paradigm, hindering cross-study comparability. While some metrics fail to account for the ordinal structure of difficulty levels, none adequately address class imbalance, resulting in biased performance assessments. This study addresses these limitations by benchmarking three types of model outputs -- discretized regression, classification, and ordinal regression -- using the balanced Discrete Ranked Probability Score (DRPS), a novel metric that jointly captures ordinality and class imbalance. In addition to using popular ordinal regression methods, we propose OrderedLogitNN, extending the ordered logit model from econometrics to neural networks. We fine-tune BERT on the RACE++ and ARC datasets and find that OrderedLogitNN performs considerably better on complex tasks. The balanced DRPS offers a robust and fair evaluation metric for discrete-level QDE, providing a principled foundation for future research.

[344] arXiv:2507.00739 [pdf, html, other]
Title: Biorthogonal Tunable Wavelet Unit with Lifting Scheme in Convolutional Neural Network
An Le, Hung Nguyen, Sungbal Seo, You-Suk Bae, Truong Nguyen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)

This work introduces a novel biorthogonal tunable wavelet unit constructed using a lifting scheme that relaxes both the orthogonality and equal filter length constraints, providing greater flexibility in filter design. The proposed unit enhances convolution, pooling, and downsampling operations, leading to improved image classification and anomaly detection in convolutional neural networks (CNN). When integrated into an 18-layer residual neural network (ResNet-18), the approach improved classification accuracy on CIFAR-10 by 2.12% and on the Describable Textures Dataset (DTD) by 9.73%, demonstrating its effectiveness in capturing fine-grained details. Similar improvements were observed in ResNet-34. For anomaly detection in the hazelnut category of the MVTec Anomaly Detection dataset, the proposed method achieved competitive and wellbalanced performance in both segmentation and detection tasks, outperforming existing approaches in terms of accuracy and robustness.

[345] arXiv:2507.00740 [pdf, html, other]
Title: Safe Low Bandwidth SPV: A Formal Treatment of Simplified Payment Verification Protocols and Security Bounds
Craig S Wright
Comments: 56 pages 5 images
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)

This paper presents a complete formal specification, protocol description, and mathematical proof structure for Simplified Payment Verification (SPV) as originally defined in the Bitcoin whitepaper \cite{nakamoto2008}. In stark contrast to the misrepresentations proliferated by popular implementations, we show that SPV is not only secure under bounded adversarial assumptions but strictly optimal for digital cash systems requiring scalable and verifiable transaction inclusion. We reconstruct the SPV protocol from first principles, grounding its verification model in symbolic automata, Merkle membership relations, and chain-of-proof dominance predicates. Through rigorous probabilistic and game-theoretic analysis, we derive the economic bounds within which the protocol operates securely and verify its liveness and safety properties under partial connectivity, hostile relay networks, and adversarial propagation delay. Our specification further introduces low-bandwidth optimisations such as adaptive polling and compressed header synchronisation while preserving correctness. This document serves both as a blueprint for secure SPV implementation and a rebuttal of common misconceptions surrounding non-validating clients.

[346] arXiv:2507.00742 [pdf, html, other]
Title: Evaluating LLMs and Prompting Strategies for Automated Hardware Diagnosis from Textual User-Reports
Carlos Caminha, Maria de Lourdes M. Silva, Iago C. Chaves, Felipe T. Brito, Victor A. E. Farias, Javam C. Machado
Comments: To be published in the Proceedings of the Brazilian Integrated Software and Hardware Seminar 2025 (SEMISH 2025)
Subjects: Machine Learning (cs.LG)

Computer manufacturers offer platforms for users to describe device faults using textual reports such as "My screen is flickering". Identifying the faulty component from the report is essential for automating tests and improving user experience. However, such reports are often ambiguous and lack detail, making this task challenging. Large Language Models (LLMs) have shown promise in addressing such issues. This study evaluates 27 open-source models (1B-72B parameters) and 2 proprietary LLMs using four prompting strategies: Zero-Shot, Few-Shot, Chain-of-Thought (CoT), and CoT+Few-Shot (CoT+FS). We conducted 98,948 inferences, processing over 51 million input tokens and generating 13 million output tokens. We achieve f1-score up to 0.76. Results show that three models offer the best balance between size and performance: mistral-small-24b-instruct and two smaller models, llama-3.2-1b-instruct and gemma-2-2b-it, that offer competitive performance with lower VRAM usage, enabling efficient inference on end-user devices as modern laptops or smartphones with NPUs.

[347] arXiv:2507.00748 [pdf, html, other]
Title: Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning
Bob Zhang, Haoran Li, Tao Zhang, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yanbin Hao
Comments: 11 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recently, Multimodal Large Language Models (MLLMs) excel at visual grounding in single-image scenarios with textual references. However, their performance degrades when handling real-world applications involving complex multi-image compositions and multimodal instructions, which reveals limitations in cross-image reasoning and generalization. To address these challenges, we adopt a Reinforcement Learning (RL) based post-training strategy to improve the reasoning performance of MLLMs in multi-image grounding tasks. Our approach begins with synthesizing high-quality chain-of-thought (CoT) data for cold-start initialization, followed by supervised fine-tuning (SFT) using low-rank adaptation (LoRA). The cold-start training stage enables the model to identify correct solutions. Subsequently, we perform rejection sampling using the merged SFT model to curate high-quality RL data and leverage rule-based RL to guide the model toward optimal reasoning paths. Extensive experimental results demonstrate the effectiveness of our approach, achieving +9.04\% improvements on MIG-Bench and +4.98\% improvements on several out-of-domain reasoning grounding benchmarks over the SFT baseline. Furthermore, our approach exhibits strong generalization in multi-image perception, with gains of +3.1\% and +2.4\% over the base model on subsets of the BLINK and MMIU benchmarks, respectively.

[348] arXiv:2507.00752 [pdf, html, other]
Title: Multi-Modal Graph Convolutional Network with Sinusoidal Encoding for Robust Human Action Segmentation
Hao Xing, Kai Zhe Boey, Yuankai Wu, Darius Burschka, Gordon Cheng
Comments: 7 pages, 4 figures, accepted in IROS25, Hangzhou, China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Accurate temporal segmentation of human actions is critical for intelligent robots in collaborative settings, where a precise understanding of sub-activity labels and their temporal structure is essential. However, the inherent noise in both human pose estimation and object detection often leads to over-segmentation errors, disrupting the coherence of action sequences. To address this, we propose a Multi-Modal Graph Convolutional Network (MMGCN) that integrates low-frame-rate (e.g., 1 fps) visual data with high-frame-rate (e.g., 30 fps) motion data (skeleton and object detections) to mitigate fragmentation. Our framework introduces three key contributions. First, a sinusoidal encoding strategy that maps 3D skeleton coordinates into a continuous sin-cos space to enhance spatial representation robustness. Second, a temporal graph fusion module that aligns multi-modal inputs with differing resolutions via hierarchical feature aggregation, Third, inspired by the smooth transitions inherent to human actions, we design SmoothLabelMix, a data augmentation technique that mixes input sequences and labels to generate synthetic training examples with gradual action transitions, enhancing temporal consistency in predictions and reducing over-segmentation artifacts.
Extensive experiments on the Bimanual Actions Dataset, a public benchmark for human-object interaction understanding, demonstrate that our approach outperforms state-of-the-art methods, especially in action segmentation accuracy, achieving F1@10: 94.5% and F1@25: 92.8%.

[349] arXiv:2507.00754 [pdf, html, other]
Title: Language-Unlocked ViT (LUViT): Empowering Self-Supervised Vision Transformers with LLMs
Selim Kuzucu, Muhammad Ferjad Naeem, Anna Kukleva, Federico Tombari, Bernt Schiele
Comments: 26 pages, 6 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The integration of Large Language Model (LLMs) blocks with Vision Transformers (ViTs) holds immense promise for vision-only tasks by leveraging the rich semantic knowledge and reasoning capabilities of LLMs. However, a fundamental challenge lies in the inherent modality mismatch between text-centric pretraining of LLMs and vision-centric training of ViTs. Direct fusion often fails to fully exploit the LLM's potential and suffers from unstable finetuning. As a result, LLM blocks are kept frozen while only the vision components are learned. As a remedy to these challenges, we introduce Language-Unlocked Vision Transformers (LUViT), a novel approach that bridges this modality mismatch through a synergistic pre-training strategy. LUViT co-adapts a ViT backbone and an LLM fusion block by (1) employing Masked Auto-Encoding (MAE) to pre-train the ViT for richer visual representations, and (2) concurrently training Low-Rank Adaptation (LoRA) layers within the LLM block using the MAE objective. This joint optimization guides the ViT to produce LLM-aligned features and the LLM to effectively interpret visual information. We demonstrate through extensive experiments that LUViT significantly improves performance on various downstream vision tasks, showcasing a more effective and efficient pathway to harness LLM knowledge for visual understanding.

[350] arXiv:2507.00756 [pdf, html, other]
Title: Towards Open-World Human Action Segmentation Using Graph Convolutional Networks
Hao Xing, Kai Zhe Boey, Gordon Cheng
Comments: 8 pages, 3 figures, accepted in IROS25, Hangzhou, China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Human-object interaction segmentation is a fundamental task of daily activity understanding, which plays a crucial role in applications such as assistive robotics, healthcare, and autonomous systems. Most existing learning-based methods excel in closed-world action segmentation, they struggle to generalize to open-world scenarios where novel actions emerge. Collecting exhaustive action categories for training is impractical due to the dynamic diversity of human activities, necessitating models that detect and segment out-of-distribution actions without manual annotation. To address this issue, we formally define the open-world action segmentation problem and propose a structured framework for detecting and segmenting unseen actions. Our framework introduces three key innovations: 1) an Enhanced Pyramid Graph Convolutional Network (EPGCN) with a novel decoder module for robust spatiotemporal feature upsampling. 2) Mixup-based training to synthesize out-of-distribution data, eliminating reliance on manual annotations. 3) A novel Temporal Clustering loss that groups in-distribution actions while distancing out-of-distribution samples.
We evaluate our framework on two challenging human-object interaction recognition datasets: Bimanual Actions and 2 Hands and Object (H2O) datasets. Experimental results demonstrate significant improvements over state-of-the-art action segmentation models across multiple open-set evaluation metrics, achieving 16.9% and 34.6% relative gains in open-set segmentation (F1@50) and out-of-distribution detection performances (AUROC), respectively. Additionally, we conduct an in-depth ablation study to assess the impact of each proposed component, identifying the optimal framework configuration for open-world action segmentation.

[351] arXiv:2507.00761 [pdf, html, other]
Title: A Probabilistic Approach to Wildfire Spread Prediction Using a Denoising Diffusion Surrogate Model
Wenbo Yu, Anirbit Ghosh, Tobias Sebastian Finn, Rossella Arcucci, Marc Bocquet, Sibo Cheng
Subjects: Machine Learning (cs.LG)

Thanks to recent advances in generative AI, computers can now simulate realistic and complex natural processes. We apply this capability to predict how wildfires spread, a task made difficult by the unpredictable nature of fire and the variety of environmental conditions it depends on. In this study, We present the first denoising diffusion model for predicting wildfire spread, a new kind of AI framework that learns to simulate fires not just as one fixed outcome, but as a range of possible scenarios. By doing so, it accounts for the inherent uncertainty of wildfire dynamics, a feature that traditional models typically fail to represent. Unlike deterministic approaches that generate a single prediction, our model produces ensembles of forecasts that reflect physically meaningful distributions of where fire might go next. This technology could help us develop smarter, faster, and more reliable tools for anticipating wildfire behavior, aiding decision-makers in fire risk assessment and response planning.

[352] arXiv:2507.00762 [pdf, other]
Title: Leveraging Genetic Algorithms for Efficient Demonstration Generation in Real-World Reinforcement Learning Environments
Tom Maus, Asma Atamna, Tobias Glasmachers
Comments: This article has been submitted to and accepted for presentation at the 11th International Conference on Machine Learning, Optimization, and Data Science (LOD 2025). After publication, it will appear in the official LOD 2025 proceedings
Subjects: Machine Learning (cs.LG)

Reinforcement Learning (RL) has demonstrated significant potential in certain real-world industrial applications, yet its broader deployment remains limited by inherent challenges such as sample inefficiency and unstable learning dynamics. This study investigates the utilization of Genetic Algorithms (GAs) as a mechanism for improving RL performance in an industrially inspired sorting environment. We propose a novel approach in which GA-generated expert demonstrations are used to enhance policy learning. These demonstrations are incorporated into a Deep Q-Network (DQN) replay buffer for experience-based learning and utilized as warm-start trajectories for Proximal Policy Optimization (PPO) agents to accelerate training convergence. Our experiments compare standard RL training with rule-based heuristics, brute-force optimization, and demonstration data, revealing that GA-derived demonstrations significantly improve RL performance. Notably, PPO agents initialized with GA-generated data achieved superior cumulative rewards, highlighting the potential of hybrid learning paradigms, where heuristic search methods complement data-driven RL. The utilized framework is publicly available and enables further research into adaptive RL strategies for real-world applications.

[353] arXiv:2507.00769 [pdf, other]
Title: LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing
Daniel Fein, Sebastian Russo, Violet Xiang, Kabir Jolly, Rafael Rafailov, Nick Haber
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Evaluating creative writing generated by large language models (LLMs) remains challenging because open-ended narratives lack ground truths. Without performant automated evaluation methods, off-the-shelf (OTS) language models are employed as zero-shot judges, yet their reliability is unclear in this context. In pursuit of robust evaluation for creative writing, we introduce LitBench, the first standardized benchmark and paired dataset for creative writing verification, comprising a held-out test set of 2,480 debiased, human-labeled story comparisons drawn from Reddit and a 43,827-pair training corpus of human preference labels. Using LitBench, we (i) benchmark zero-shot LLM judges, (ii) train Bradley Terry and generative reward models, and (iii) conduct an online human study to validate reward model rankings on newly LLM-generated stories. Our benchmark identifies Claude-3.7-Sonnet as the strongest off-the-shelf judge, reaching 73% agreement with human preferences; among trained reward models, Bradley-Terry and Generative reward models both attain an accuracy of 78%, outperforming all off-the-shelf judges. An online human study further confirms that our trained reward models consistently align with human preferences in novel LLM-generated stories. We release LitBench and reward models at this https URL, providing a vetted resource for reliable, automated evaluation and optimization of creative writing systems.

[354] arXiv:2507.00775 [pdf, html, other]
Title: Designing Visualization Widgets for Tangible Data Exploration: A Systematic Review
Haonan Yao, Lingyun Yu, Lijie Yao
Subjects: Human-Computer Interaction (cs.HC)

We present a systematic review on tasks, interactions, and visualization widgets (refer to tangible entities that are used to accomplish data exploration tasks through specific interactions) in the context of tangible data exploration. Tangible widgets have been shown to reduce cognitive load, enable more natural interactions, and support the completion of complex data exploration tasks. Yet, the field lacks a structured understanding of how task types, interaction methods, and widget designs are coordinated, limiting the ability to identify recurring design patterns and opportunities for innovation. To address this gap, we conduct a systematic review to analyze existing work and characterize the current design of data exploration tasks, interactions, and tangible visualization widgets. We next reflect based on our findings and propose a research agenda to inform the development of a future widget design toolkit for tangible data exploration. Our systematic review and supplemental materials are available at this http URL and this http URL.

[355] arXiv:2507.00782 [pdf, other]
Title: A Diagrammatic Calculus for a Functional Model of Natural Language Semantics
Matthieu Pierre Boyer
Comments: 15 pages, preprint before submission to CSL 2026
Subjects: Computation and Language (cs.CL); Programming Languages (cs.PL)

In this paper, we study a functional programming approach to natural language semantics, allowing us to increase the expressivity of a more traditional denotation style. We will formalize a category based type and effect system, and construct a diagrammatic calculus to model parsing and handling of effects, and use it to efficiently compute the denotations for sentences.

[356] arXiv:2507.00783 [pdf, other]
Title: Generative AI and the future of scientometrics: current topics and future questions
Benedetto Lepori, Jens Peter Andersen, Karsten Donnay
Subjects: Computation and Language (cs.CL); Digital Libraries (cs.DL)

The aim of this paper is to review the use of GenAI in scientometrics, and to begin a debate on the broader implications for the field. First, we provide an introduction on GenAI's generative and probabilistic nature as rooted in distributional linguistics. And we relate this to the debate on the extent to which GenAI might be able to mimic human 'reasoning'. Second, we leverage this distinction for a critical engagement with recent experiments using GenAI in scientometrics, including topic labelling, the analysis of citation contexts, predictive applications, scholars' profiling, and research assessment. GenAI shows promise in tasks where language generation dominates, such as labelling, but faces limitations in tasks that require stable semantics, pragmatic reasoning, or structured domain knowledge. However, these results might become quickly outdated. Our recommendation is, therefore, to always strive to systematically compare the performance of different GenAI models for specific tasks. Third, we inquire whether, by generating large amounts of scientific language, GenAI might have a fundamental impact on our field by affecting textual characteristics used to measure science, such as authors, words, and references. We argue that careful empirical work and theoretical reflection will be essential to remain capable of interpreting the evolving patterns of knowledge production.

[357] arXiv:2507.00786 [pdf, html, other]
Title: Snaps: Bloated and Outdated?
Jukka Ruohonen, Qusai Ramadan
Comments: Submitted as a "poster paper" to APSEC
Subjects: Software Engineering (cs.SE)

Snap is an alternative software packaging system developed by Canonical and provided by default in the Ubuntu Linux distribution. Given the heterogeneity of various Linux distributions and their various releases, Snap allows an interoperable delivery of software directly to users. However, concerns and criticism have also been frequently expressed. Regarding this criticism, the paper shows that currently distributed snap packages are indeed on average bloated in terms of their sizes and outdated in terms updating frequencies. With these empirical observations, this short paper contributes to the research domain of software packaging, software packages, and package managers.

[358] arXiv:2507.00788 [pdf, other]
Title: Echoes of AI: Investigating the Downstream Effects of AI Assistants on Software Maintainability
Markus Borg, Dave Hewett, Nadim Hagatulah, Noric Couderc, Emma Söderberg, Donald Graham, Uttam Kini, Dave Farley
Comments: Preprint of study preregistered at ICSME 2025 with In-Principal Acceptance. this https URL
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

[Context] AI assistants, like GitHub Copilot and Cursor, are transforming software engineering. While several studies highlight productivity improvements, their impact on maintainability requires further investigation. [Objective] This study investigates whether co-development with AI assistants affects software maintainability, specifically how easily other developers can evolve the resulting source code. [Method] We conducted a two-phase controlled experiment involving 151 participants, 95% of whom were professional developers. In Phase 1, participants added a new feature to a Java web application, with or without AI assistance. In Phase 2, a randomized controlled trial, new participants evolved these solutions without AI assistance. [Results] AI-assisted development in Phase 1 led to a modest speedup in subsequent evolution and slightly higher average CodeHealth. Although neither difference was significant overall, the increase in CodeHealth was statistically significant when habitual AI users completed Phase 1. For Phase 1, we also observed a significant effect that corroborates previous productivity findings: using an AI assistant yielded a 30.7% median decrease in task completion time. Moreover, for habitual AI users, the mean speedup was 55.9%. [Conclusions] Our study adds to the growing evidence that AI assistants can effectively accelerate development. Moreover, we did not observe warning signs of degraded code-level maintainability. We recommend that future research focus on risks such as code bloat from excessive code generation and the build-up of cognitive debt as developers invest less mental effort during implementation.

[359] arXiv:2507.00789 [pdf, other]
Title: OptiPrune: Boosting Prompt-Image Consistency with Attention-Guided Noise and Dynamic Token Selection
Ziji Lu
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Text-to-image diffusion models often struggle to achieve accurate semantic alignment between generated images and text prompts while maintaining efficiency for deployment on resource-constrained hardware. Existing approaches either incur substantial computational overhead through noise optimization or compromise semantic fidelity by aggressively pruning tokens. In this work, we propose OptiPrune, a unified framework that combines distribution-aware initial noise optimization with similarity-based token pruning to address both challenges simultaneously. Specifically, (1) we introduce a distribution-aware noise optimization module guided by attention scores to steer the initial latent noise toward semantically meaningful regions, mitigating issues such as subject neglect and feature entanglement; (2) we design a hardware-efficient token pruning strategy that selects representative base tokens via patch-wise similarity, injects randomness to enhance generalization, and recovers pruned tokens using maximum similarity copying before attention operations. Our method preserves the Gaussian prior during noise optimization and enables efficient inference without sacrificing alignment quality. Experiments on benchmark datasets, including Animal-Animal, demonstrate that OptiPrune achieves state-of-the-art prompt-image consistency with significantly reduced computational cost.

[360] arXiv:2507.00790 [pdf, html, other]
Title: LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling
Huaqiu Li, Yong Wang, Tongwen Huang, Hailang Huang, Haoqian Wang, Xiangxiang Chu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Unified image restoration is a significantly challenging task in low-level vision. Existing methods either make tailored designs for specific tasks, limiting their generalizability across various types of degradation, or rely on training with paired datasets, thereby suffering from closed-set constraints. To address these issues, we propose a novel, dataset-free, and unified approach through recurrent posterior sampling utilizing a pretrained latent diffusion model. Our method incorporates the multimodal understanding model to provide sematic priors for the generative model under a task-blind condition. Furthermore, it utilizes a lightweight module to align the degraded input with the generated preference of the diffusion model, and employs recurrent refinement for posterior sampling. Extensive experiments demonstrate that our method outperforms state-of-the-art methods, validating its effectiveness and robustness. Our code and data will be available at this https URL.

[361] arXiv:2507.00792 [pdf, html, other]
Title: Real-Time Inverse Kinematics for Generating Multi-Constrained Movements of Virtual Human Characters
Hendric Voss, Stefan Kopp
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Generating accurate and realistic virtual human movements in real-time is of high importance for a variety of applications in computer graphics, interactive virtual environments, robotics, and biomechanics. This paper introduces a novel real-time inverse kinematics (IK) solver specifically designed for realistic human-like movement generation. Leveraging the automatic differentiation and just-in-time compilation of TensorFlow, the proposed solver efficiently handles complex articulated human skeletons with high degrees of freedom. By treating forward and inverse kinematics as differentiable operations, our method effectively addresses common challenges such as error accumulation and complicated joint limits in multi-constrained problems, which are critical for realistic human motion modeling. We demonstrate the solver's effectiveness on the SMPLX human skeleton model, evaluating its performance against widely used iterative-based IK algorithms, like Cyclic Coordinate Descent (CCD), FABRIK, and the nonlinear optimization algorithm IPOPT. Our experiments cover both simple end-effector tasks and sophisticated, multi-constrained problems with realistic joint limits. Results indicate that our IK solver achieves real-time performance, exhibiting rapid convergence, minimal computational overhead per iteration, and improved success rates compared to existing methods. The project code is available at this https URL

[362] arXiv:2507.00797 [pdf, html, other]
Title: VEDA: Efficient LLM Generation Through Voting-based KV Cache Eviction and Dataflow-flexible Accelerator
Zhican Wang, Hongxiang Fan, Haroon Waris, Gang Wang, Zhenyu Li, Jianfei Jiang, Yanan Sun, Guanghui He
Comments: DAC 2025
Subjects: Hardware Architecture (cs.AR)

Large Language Models (LLMs) excel in natural language processing tasks but pose significant computational and memory challenges for edge deployment due to their intensive resource demands. This work addresses the efficiency of LLM inference by algorithm-hardware-dataflow tri-optimizations. We propose a novel voting-based KV cache eviction algorithm, balancing hardware efficiency and algorithm accuracy by adaptively identifying unimportant kv vectors. From a dataflow perspective, we introduce a flexible-product dataflow and a runtime reconfigurable PE array for matrix-vector multiplication. The proposed approach effectively handles the diverse dimensional requirements and solves the challenges of incrementally varying sequence lengths. Additionally, an element-serial scheduling scheme is proposed for nonlinear operations, such as softmax and layer normalization (layernorm). Results demonstrate a substantial reduction in latency, accompanied by a significant decrease in hardware complexity, from O(N) to O(1). The proposed solution is realized in a custom-designed accelerator, VEDA, which outperforms existing hardware platforms. This research represents a significant advancement in LLM inference on resource-constrained edge devices, facilitating real-time processing, enhancing data privacy, and enabling model customization.

[363] arXiv:2507.00802 [pdf, html, other]
Title: TRACE: Temporally Reliable Anatomically-Conditioned 3D CT Generation with Enhanced Efficiency
Minye Shao, Xingyu Miao, Haoran Duan, Zeyu Wang, Jingkun Chen, Yawen Huang, Xian Wu, Jingjing Deng, Yang Long, Yefeng Zheng
Comments: Accepted to MICCAI 2025 (this version is not peer-reviewed; it is the preprint version). MICCAI proceedings DOI will appear here
Subjects: Computer Vision and Pattern Recognition (cs.CV)

3D medical image generation is essential for data augmentation and patient privacy, calling for reliable and efficient models suited for clinical practice. However, current methods suffer from limited anatomical fidelity, restricted axial length, and substantial computational cost, placing them beyond reach for regions with limited resources and infrastructure. We introduce TRACE, a framework that generates 3D medical images with spatiotemporal alignment using a 2D multimodal-conditioned diffusion approach. TRACE models sequential 2D slices as video frame pairs, combining segmentation priors and radiology reports for anatomical alignment, incorporating optical flow to sustain temporal coherence. During inference, an overlapping-frame strategy links frame pairs into a flexible length sequence, reconstructed into a spatiotemporally and anatomically aligned 3D volume. Experimental results demonstrate that TRACE effectively balances computational efficiency with preserving anatomical fidelity and spatiotemporal consistency. Code is available at: this https URL.

[364] arXiv:2507.00803 [pdf, html, other]
Title: Out of the Day Job: Perspectives of Industry Practitioners in Co-Design and Delivery of Software Engineering Courses
Gillian Daniel, Chris Hall, Per Hammer, Alec-Angus Macdonald, Hollie Marwick-Best, Emma McKenzie, George Popa, Derek Somerville, Tim Storer
Subjects: Software Engineering (cs.SE)

Over more than two decades, The University of Glasgow has co-designed and delivered numerous software engineering focused courses with industry partners, covering both technical and discipline specific professional skills. Such collaborations are not unique and many of the benefits are well recognised in the literature. These include enhancing the real-world relevance of curricula, developing student professional networks ahead of graduation and easing recruitment opportunities for employers.
However, there is relatively little scholarship on the perspectives of industry practitioners who participate in course design and delivery. This gap is significant, since the effort invested by practitioners is often substantial and may require ongoing support from both the industry partner and academic institution. Understanding the motivations, expectations and experiences of practitioners who engage in course delivery can guide the formation of future partnerships and ensure their long-term sustainability.
We begin to address this gap by reporting on the outcomes of a retrospective conducted amongst the practitioner coauthors of this paper, with the academic coauthors acting as facilitators. All coauthors have participated in the recent co-design and delivery of software engineering courses, but we choose to focus explicitly on the perspectives of the practitioners. We report on the themes that emerged from the discussions and our resulting recommendations for future collaborations.

[365] arXiv:2507.00807 [pdf, html, other]
Title: A posteriori and a priori error estimates for linearized thin sheet folding
Harbir Antil, Sean P. Carney, Rohit Khandelwal
Subjects: Numerical Analysis (math.NA)

We describe a posteriori error analysis for a discontinuous Galerkin method for a fourth order elliptic interface problem that arises from a linearized model of thin sheet folding. The primary contribution is a local efficiency bound for an estimator that measures the extent to which the interface conditions along the fold are satisfied, which is accomplished by constructing a novel edge bubble function. We subsequently conduct a medius analysis to obtain improved a priori error estimates under the minimal regularity assumption on the exact solution. The performance of the method is illustrated by numerical experiments.

[366] arXiv:2507.00808 [pdf, html, other]
Title: Multi-interaction TTS toward professional recording reproduction
Hiroki Kanagawa, Kenichi Fujita, Aya Watanabe, Yusuke Ijima
Comments: 7 pages,6 figures, Accepted to Speech Synthesis Workshop 2025 (SSW13)
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Voice directors often iteratively refine voice actors' performances by providing feedback to achieve the desired outcome. While this iterative feedback-based refinement process is important in actual recordings, it has been overlooked in text-to-speech synthesis (TTS). As a result, fine-grained style refinement after the initial synthesis is not possible, even though the synthesized speech often deviates from the user's intended style. To address this issue, we propose a TTS method with multi-step interaction that allows users to intuitively and rapidly refine synthetized speech. Our approach models the interaction between the TTS model and its user to emulate the relationship between voice actors and voice directors. Experiments show that the proposed model with its corresponding dataset enable iterative style refinements in accordance with users' directions, thus demonstrating its multi-interaction capability. Sample audios are available: https://ntt-hilab-gensp. this http URL

[367] arXiv:2507.00810 [pdf, html, other]
Title: A Robust Algorithm for Non-IID Machine Learning Problems with Convergence Analysis
Qing Xu, Xiaohua Xuan
Subjects: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)

In this paper, we propose an improved numerical algorithm for solving minimax problems based on nonsmooth optimization, quadratic programming and iterative process. We also provide a rigorous proof of convergence for our algorithm under some mild assumptions, such as gradient continuity and boundedness. Such an algorithm can be widely applied in various fields such as robust optimization, imbalanced learning, etc.

[368] arXiv:2507.00814 [pdf, html, other]
Title: Many LLMs Are More Utilitarian Than One
Anita Keshmirian, Razan Baltaji, Babak Hemmatian, Hadi Asghari, Lav R. Varshney
Comments: 9 pages, 8 Figures, 7 tables
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Moral judgment is integral to large language model (LLM) alignment and social reasoning. As multi-agent systems gain prominence, it becomes crucial to understand how LLMs function collectively during collaboration, compared to individual agents. In human moral judgment, group deliberation leads to a utilitarian boost: a tendency to endorse norm violations that maximize benefits for the greatest number of people despite harms. We study whether a similar dynamic emerges in multi-agent LLM systems. We tested six models on well-established sets of moral dilemmas across two conditions: (1) Solo, where models reasoned independently, and (2) Group, where they engaged in multi-turn discussions in pairs or triads. In personal moral dilemmas, where agents must decide to directly harm one individual to maximize the utility for others, all models found moral violations to be more acceptable when part of a group than individually, similar to human experiments. Some models endorsed actions that maximized overall well-being, even if they benefited strangers over familiar individuals. Others became more willing to violate moral norms in groups. However, while human groups show a similar action bias, the mechanism for their utilitarian boost differs from LLMs. Whereas the human shift comes from heightened sensitivity to decision outcomes, LLM groups show either reduced norm sensitivity or enhanced impartiality. This suggests that while the surface behavior of LLM collectives mimics human group reasoning, the underlying drivers differ. We discuss the implications for AI alignment, multi-agent design, and artificial moral reasoning.

[369] arXiv:2507.00816 [pdf, html, other]
Title: PI-WAN: A Physics-Informed Wind-Adaptive Network for Quadrotor Dynamics Prediction in Unknown Environments
Mengyun Wang, Bo Wang, Yifeng Niu, Chang Wang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Accurate dynamics modeling is essential for quadrotors to achieve precise trajectory tracking in various applications. Traditional physical knowledge-driven modeling methods face substantial limitations in unknown environments characterized by variable payloads, wind disturbances, and external perturbations. On the other hand, data-driven modeling methods suffer from poor generalization when handling out-of-distribution (OoD) data, restricting their effectiveness in unknown scenarios. To address these challenges, we introduce the Physics-Informed Wind-Adaptive Network (PI-WAN), which combines knowledge-driven and data-driven modeling methods by embedding physical constraints directly into the training process for robust quadrotor dynamics learning. Specifically, PI-WAN employs a Temporal Convolutional Network (TCN) architecture that efficiently captures temporal dependencies from historical flight data, while a physics-informed loss function applies physical principles to improve model generalization and robustness across previously unseen conditions. By incorporating real-time prediction results into a model predictive control (MPC) framework, we achieve improvements in closed-loop tracking performance. Comprehensive simulations and real-world flight experiments demonstrate that our approach outperforms baseline methods in terms of prediction accuracy, tracking precision, and robustness to unknown environments.

[370] arXiv:2507.00817 [pdf, html, other]
Title: CAVALRY-V: A Large-Scale Generator Framework for Adversarial Attacks on Video MLLMs
Jiaming Zhang, Rui Hu, Qing Guo, Wei Yang Bryan Lim
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Video Multimodal Large Language Models (V-MLLMs) have shown impressive capabilities in temporal reasoning and cross-modal understanding, yet their vulnerability to adversarial attacks remains underexplored due to unique challenges: complex cross-modal reasoning mechanisms, temporal dependencies, and computational constraints. We present CAVALRY-V (Cross-modal Language-Vision Adversarial Yielding for Videos), a novel framework that directly targets the critical interface between visual perception and language generation in V-MLLMs. Our approach introduces two key innovations: (1) a dual-objective semantic-visual loss function that simultaneously disrupts the model's text generation logits and visual representations to undermine cross-modal integration, and (2) a computationally efficient two-stage generator framework that combines large-scale pre-training for cross-model transferability with specialized fine-tuning for spatiotemporal coherence. Empirical evaluation on comprehensive video understanding benchmarks demonstrates that CAVALRY-V significantly outperforms existing attack methods, achieving 22.8% average improvement over the best baseline attacks on both commercial systems (GPT-4.1, Gemini 2.0) and open-source models (QwenVL-2.5, InternVL-2.5, Llava-Video, Aria, MiniCPM-o-2.6). Our framework achieves flexibility through implicit temporal coherence modeling rather than explicit regularization, enabling significant performance improvements even on image understanding (34.4% average gain). This capability demonstrates CAVALRY-V's potential as a foundational approach for adversarial research across multimodal systems.

[371] arXiv:2507.00821 [pdf, other]
Title: Sensemaking Through Making: Developing Clinical Domain Knowledge by Crafting Synthetic Datasets and Prototyping System Architectures
Mihnea Stefan Calota, Wessel Nieuwenhuys, Janet Yi-Ching Huang, Lin-Lin Chen, Mathias Funk
Subjects: Human-Computer Interaction (cs.HC)

Designers have ample opportunities to impact the healthcare domain. However, hospitals are often closed ecosystems that pose challenges in engaging clinical stakeholders, developing domain knowledge, and accessing relevant systems and data. In this paper, we introduce a making-oriented approach to help designers understand the intricacies of their target healthcare context. Using Remote Patient Monitoring (RPM) as a case study, we explore how manually crafting synthetic datasets based on real-world observations enables designers to learn about complex data-driven healthcare systems. Our process involves observing and modeling the real-world RPM context, crafting synthetic datasets, and iteratively prototyping a simplified RPM system that balances contextual richness and intentional abstraction. Through this iterative process of sensemaking through making, designers can still develop context familiarity when direct access to the actual healthcare system is limited. Our approach emphasizes the value of hands-on interaction with data structures to support designers in understanding opaque healthcare systems.

[372] arXiv:2507.00822 [pdf, html, other]
Title: Instant Particle Size Distribution Measurement Using CNNs Trained on Synthetic Data
Yasser El Jarida, Youssef Iraqi, Loubna Mekouar
Comments: Accepted at the Synthetic Data for Computer Vision Workshop @ CVPR 2025. 10 pages, 5 figures. Code available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Accurate particle size distribution (PSD) measurement is important in industries such as mining, pharmaceuticals, and fertilizer manufacturing, significantly influencing product quality and operational efficiency. Traditional PSD methods like sieve analysis and laser diffraction are manual, time-consuming, and limited by particle overlap. Recent developments in convolutional neural networks (CNNs) enable automated, real-time PSD estimation directly from particle images. In this work, we present a CNN-based methodology trained on realistic synthetic particle imagery generated using Blender's advanced rendering capabilities. Synthetic data sets using this method can replicate various industrial scenarios by systematically varying particle shapes, textures, lighting, and spatial arrangements that closely resemble the actual configurations. We evaluated three CNN-based architectures, ResNet-50, InceptionV3, and EfficientNet-B0, for predicting critical PSD parameters (d10, d50, d90). Results demonstrated comparable accuracy across models, with EfficientNet-B0 achieving the best computational efficiency suitable for real-time industrial deployment. This approach shows the effectiveness of realistic synthetic data for robust CNN training, which offers significant potential for automated industrial PSD monitoring. The code is released at : this https URL

[373] arXiv:2507.00824 [pdf, html, other]
Title: PANDAS: Peer-to-peer, Adaptive Networking for Data Availability Sampling within Ethereum Consensus Timebounds
Matthieu Pigaglio, Onur Ascigil, Michał Król, Sergi Rene, Felix Lange, Kaleem Peeroo, Ramin Sadre, Vladimir Stankovic, Etienne Rivière
Comments: 14 pages, 10 figures, 1 algorithm, 1 table, and 18 plots
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI); Performance (cs.PF)

Layer-2 protocols can assist Ethereum's limited throughput, but globally broadcasting layer-2 data limits their scalability. The Danksharding evolution of Ethereum aims to support the selective distribution of layer-2 data, whose availability in the network is verified using randomized data availability sampling (DAS). Integrating DAS into Ethereum's consensus process is challenging, as pieces of layer-2 data must be disseminated and sampled within four seconds of the beginning of each consensus slot. No existing solution can support dissemination and sampling under such strict time bounds.
We propose PANDAS, a practical approach to integrate DAS with Ethereum under Danksharding's requirements without modifying its protocols for consensus and node discovery. PANDAS disseminates layer-2 data and samples its availability using lightweight, direct exchanges. Its design accounts for message loss, node failures, and unresponsive participants while anticipating the need to scale out the Ethereum network. Our evaluation of PANDAS's prototype in a 1,000-node cluster and simulations for up to 20,000 peers shows that it allows layer-2 data dissemination and sampling under planetary-scale latencies within the 4-second deadline.

[374] arXiv:2507.00825 [pdf, html, other]
Title: High-Frequency Semantics and Geometric Priors for End-to-End Detection Transformers in Challenging UAV Imagery
Hongxing Peng, Lide Chen, Hui Zhu, Yan Chen
Comments: 14 pages, 9 figures, to appear in KBS
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Unmanned Aerial Vehicle-based Object Detection (UAV-OD) faces substantial challenges, including small target sizes, high-density distributions, and cluttered backgrounds in UAV imagery. Current algorithms often depend on hand-crafted components like anchor boxes, which demand fine-tuning and exhibit limited generalization, and Non-Maximum Suppression (NMS), which is threshold-sensitive and prone to misclassifying dense objects. These generic architectures thus struggle to adapt to aerial imaging characteristics, resulting in performance limitations. Moreover, emerging end-to-end frameworks have yet to effectively mitigate these aerial-specific this http URL address these issues, we propose HEGS-DETR, a comprehensively enhanced, real-time Detection Transformer framework tailored for UAVs. First, we introduce the High-Frequency Enhanced Semantics Network (HFESNet) as a novel backbone. HFESNet preserves critical high-frequency spatial details to extract robust semantic features, thereby improving discriminative capability for small and occluded targets in complex backgrounds. Second, our Efficient Small Object Pyramid (ESOP) strategy strategically fuses high-resolution feature maps with minimal computational overhead, significantly boosting small object detection. Finally, the proposed Selective Query Recollection (SQR) and Geometry-Aware Positional Encoding (GAPE) modules enhance the detector's decoder stability and localization accuracy, effectively optimizing bounding boxes and providing explicit spatial priors for dense scenes. Experiments on the VisDrone dataset demonstrate that HEGS-DETR achieves a 5.1\% AP$_{50}$ and 3.8\% AP increase over the baseline, while maintaining real-time speed and reducing parameter count by 4M.

[375] arXiv:2507.00826 [pdf, html, other]
Title: Getting Dynamic Line Ratings into Markets
Zhiyi Zhou, Christoph Graf, Yury Dvorkin
Subjects: Systems and Control (eess.SY)

Static transmission line ratings may lead to underutilization of line capacity due to overly conservative (worst-case) assumptions. Grid-enhancing technologies (GETs) such as dynamic line ratings (DLRs), which adjust line capacity based on real-time conditions, are a techno-economically viable alternative to increase the utilization of existing power lines. Nonetheless, their adoption has been slow, partly due to the absence of operational tools that effectively account for simultaneous impacts on dispatch and pricing. In this paper, we represent transmission capacity with DLRs as a stock-like resource with time-variant interdependency, which is modeled via an approximation of line temperature evolution process, decoupling the impacts of ambient weather conditions and power flow on transmission line temperature and thus capacity. We integrate DLRs into a multi-period DC optimal power flow problem, with chance constrains addressing correlated uncertainty in DLRs and renewable generation. This yields non-convex problems that we transform into a tractable convex form by linearization. We derive locational marginal energy and ancillary services prices consistent with a competitive equilibrium. Numerical experiments on the 11-zone and 1814-node NYISO systems demonstrate its performance, including impacts on dispatch, pricing, and marginal carbon emissions.

[376] arXiv:2507.00827 [pdf, other]
Title: A Technique for the Detection of PDF Tampering or Forgery
Gabriel Grobler, Sheunesu Makura, Hein Venter
Comments: 19 Pages, 5 figures, published in Online Proceedings of the South African Institute of Computer Scientists and Information Technologists 2024 Conference, ISSN 2959-8877
Subjects: Cryptography and Security (cs.CR)

Tampering or forgery of digital documents has become widespread, most commonly through altering images without any malicious intent such as enhancing the overall appearance of the image. However, there are occasions when tampering of digital documents can have negative consequences, such as financial fraud and reputational damage. Tampering can occur through altering a digital document's text or editing an image's pixels. Many techniques have been developed to detect whether changes have been made to a document. Most of these techniques rely on generating hashes or watermarking the document. These techniques, however, have limitations in that they cannot detect alterations to portable document format (PDF) signatures or other non-visual aspects, such as metadata. This paper presents a new technique that can be used to detect tampering within a PDF document by utilizing the PDF document's file page objects. The technique employs a prototype that can detect changes to a PDF document, such as changes made to the text, images, or metadata of the said file.

[377] arXiv:2507.00828 [pdf, html, other]
Title: ProxAnn: Use-Oriented Evaluations of Topic Models and Document Clustering
Alexander Hoyle, Lorena Calvo-Bartolomé, Jordan Boyd-Graber, Philip Resnik
Comments: Accepted to ACL 2025 (Main)
Subjects: Computation and Language (cs.CL)

Topic model and document-clustering evaluations either use automated metrics that align poorly with human preferences or require expert labels that are intractable to scale. We design a scalable human evaluation protocol and a corresponding automated approximation that reflect practitioners' real-world usage of models. Annotators -- or an LLM-based proxy -- review text items assigned to a topic or cluster, infer a category for the group, then apply that category to other documents. Using this protocol, we collect extensive crowdworker annotations of outputs from a diverse set of topic models on two datasets. We then use these annotations to validate automated proxies, finding that the best LLM proxies are statistically indistinguishable from a human annotator and can therefore serve as a reasonable substitute in automated evaluations. Package, web interface, and data are at this https URL

[378] arXiv:2507.00829 [pdf, html, other]
Title: On the Surprising Efficacy of LLMs for Penetration-Testing
Andreas Happe, Jürgen Cito
Subjects: Cryptography and Security (cs.CR)

This paper presents a critical examination of the surprising efficacy of Large Language Models (LLMs) in penetration testing. The paper thoroughly reviews the evolution of LLMs and their rapidly expanding capabilities which render them increasingly suitable for complex penetration testing operations. It systematically details the historical adoption of LLMs in both academic research and industry, showcasing their application across various offensive security tasks and covering broader phases of the cyber kill chain. Crucially, the analysis also extends to the observed adoption of LLMs by malicious actors, underscoring the inherent dual-use challenge of this technology within the security landscape.
The unexpected effectiveness of LLMs in this context is elucidated by several key factors: the strong alignment between penetration testing's reliance on pattern-matching and LLMs' core strengths, their inherent capacity to manage uncertainty in dynamic environments, and cost-effective access to competent pre-trained models through LLM providers.
The current landscape of LLM-aided penetration testing is categorized into interactive 'vibe-hacking' and the emergence of fully autonomous systems. The paper identifies and discusses significant obstacles impeding wider adoption and safe deployment. These include critical issues concerning model reliability and stability, paramount safety and security concerns, substantial monetary and ecological costs, implications for privacy and digital sovereignty, complex questions of accountability, and profound ethical dilemmas. This comprehensive review and analysis provides a foundation for discussion on future research directions and the development of robust safeguards at the intersection of AI and security.

[379] arXiv:2507.00833 [pdf, other]
Title: HumanoidGen: Data Generation for Bimanual Dexterous Manipulation via LLM Reasoning
Zhi Jing, Siyuan Yang, Jicong Ao, Ting Xiao, Yugang Jiang, Chenjia Bai
Comments: Project Page: this https URL
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

For robotic manipulation, existing robotics datasets and simulation benchmarks predominantly cater to robot-arm platforms. However, for humanoid robots equipped with dual arms and dexterous hands, simulation tasks and high-quality demonstrations are notably lacking. Bimanual dexterous manipulation is inherently more complex, as it requires coordinated arm movements and hand operations, making autonomous data collection challenging. This paper presents HumanoidGen, an automated task creation and demonstration collection framework that leverages atomic dexterous operations and LLM reasoning to generate relational constraints. Specifically, we provide spatial annotations for both assets and dexterous hands based on the atomic operations, and perform an LLM planner to generate a chain of actionable spatial constraints for arm movements based on object affordances and scenes. To further improve planning ability, we employ a variant of Monte Carlo tree search to enhance LLM reasoning for long-horizon tasks and insufficient annotation. In experiments, we create a novel benchmark with augmented scenarios to evaluate the quality of the collected data. The results show that the performance of the 2D and 3D diffusion policies can scale with the generated dataset. Project page is this https URL.

[380] arXiv:2507.00838 [pdf, html, other]
Title: Stylometry recognizes human and LLM-generated texts in short samples
Karol Przystalski, Jan K. Argasiński, Iwona Grabska-Gradzińska, Jeremi K. Ochab
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

The paper explores stylometry as a method to distinguish between texts created by Large Language Models (LLMs) and humans, addressing issues of model attribution, intellectual property, and ethical AI use. Stylometry has been used extensively to characterise the style and attribute authorship of texts. By applying it to LLM-generated texts, we identify their emergent writing patterns. The paper involves creating a benchmark dataset based on Wikipedia, with (a) human-written term summaries, (b) texts generated purely by LLMs (GPT-3.5/4, LLaMa 2/3, Orca, and Falcon), (c) processed through multiple text summarisation methods (T5, BART, Gensim, and Sumy), and (d) rephrasing methods (Dipper, T5). The 10-sentence long texts were classified by tree-based models (decision trees and LightGBM) using human-designed (StyloMetrix) and n-gram-based (our own pipeline) stylometric features that encode lexical, grammatical, syntactic, and punctuation patterns. The cross-validated results reached a performance of up to .87 Matthews correlation coefficient in the multiclass scenario with 7 classes, and accuracy between .79 and 1. in binary classification, with the particular example of Wikipedia and GPT-4 reaching up to .98 accuracy on a balanced dataset. Shapley Additive Explanations pinpointed features characteristic of the encyclopaedic text type, individual overused words, as well as a greater grammatical standardisation of LLMs with respect to human-written texts. These results show -- crucially, in the context of the increasingly sophisticated LLMs -- that it is possible to distinguish machine- from human-generated texts at least for a well-defined text type.

[381] arXiv:2507.00839 [pdf, html, other]
Title: RapidStore: An Efficient Dynamic Graph Storage System for Concurrent Queries
Chiyu Hao, Jixian Su, Shixuan Sun, Hao Zhang, Sen Gao, Jianwen Zhao, Chenyi Zhang, Jieru Zhao, Chen Chen, Minyi Guo
Comments: 17 pages, 18 figures
Subjects: Databases (cs.DB)

Dynamic graph storage systems are essential for real-time applications such as social networks and recommendation, where graph data continuously evolves. However, they face significant challenges in efficiently handling concurrent read and write operations. We find that existing methods suffer from write queries interfering with read efficiency, substantial time and space overhead due to per-edge versioning, and an inability to balance performance, such as slow searches under concurrent workloads. To address these issues, we propose RapidStore, a holistic approach for efficient in-memory dynamic graph storage designed for read-intensive workloads. Our key idea is to exploit the characteristics of graph queries through a decoupled system design that separates the management of read and write queries and decouples version data from graph data. Particularly, we design an efficient dynamic graph store to cooperate with the graph concurrency control mechanism. Experimental results demonstrate that RapidStore enables fast and scalable concurrent graph queries, effectively balancing the performance of inserts, searches, and scans, and significantly improving efficiency in dynamic graph storage systems.

[382] arXiv:2507.00841 [pdf, html, other]
Title: SafeMobile: Chain-level Jailbreak Detection and Automated Evaluation for Multimodal Mobile Agents
Siyuan Liang, Tianmeng Fang, Zhe Liu, Aishan Liu, Yan Xiao, Jinyuan He, Ee-Chien Chang, Xiaochun Cao
Comments: 12 pages
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

With the wide application of multimodal foundation models in intelligent agent systems, scenarios such as mobile device control, intelligent assistant interaction, and multimodal task execution are gradually relying on such large model-driven agents. However, the related systems are also increasingly exposed to potential jailbreak risks. Attackers may induce the agents to bypass the original behavioral constraints through specific inputs, and then trigger certain risky and sensitive operations, such as modifying settings, executing unauthorized commands, or impersonating user identities, which brings new challenges to system security. Existing security measures for intelligent agents still have limitations when facing complex interactions, especially in detecting potentially risky behaviors across multiple rounds of conversations or sequences of tasks. In addition, an efficient and consistent automated methodology to assist in assessing and determining the impact of such risks is currently lacking. This work explores the security issues surrounding mobile multimodal agents, attempts to construct a risk discrimination mechanism by incorporating behavioral sequence information, and designs an automated assisted assessment scheme based on a large language model. Through preliminary validation in several representative high-risk tasks, the results show that the method can improve the recognition of risky behaviors to some extent and assist in reducing the probability of agents being jailbroken. We hope that this study can provide some valuable references for the security risk modeling and protection of multimodal intelligent agent systems.

[383] arXiv:2507.00845 [pdf, html, other]
Title: Do Echo Top Heights Improve Deep Learning Nowcasts?
Peter Pavlík, Marc Schleiss, Anna Bou Ezzeddine, Viera Rozinajová
Comments: Pre-review version of an article accepted at Transactions on Large-Scale Data and Knowledge-Centered Systems
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Precipitation nowcasting -- the short-term prediction of rainfall using recent radar observations -- is critical for weather-sensitive sectors such as transportation, agriculture, and disaster mitigation. While recent deep learning models have shown promise in improving nowcasting skill, most approaches rely solely on 2D radar reflectivity fields, discarding valuable vertical information available in the full 3D radar volume. In this work, we explore the use of Echo Top Height (ETH), a 2D projection indicating the maximum altitude of radar reflectivity above a given threshold, as an auxiliary input variable for deep learning-based nowcasting. We examine the relationship between ETH and radar reflectivity, confirming its relevance for predicting rainfall intensity. We implement a single-pass 3D U-Net that processes both the radar reflectivity and ETH as separate input channels. While our models are able to leverage ETH to improve skill at low rain-rate thresholds, results are inconsistent at higher intensities and the models with ETH systematically underestimate precipitation intensity. Three case studies are used to illustrate how ETH can help in some cases, but also confuse the models and increase the error variance. Nonetheless, the study serves as a foundation for critically assessing the potential contribution of additional variables to nowcasting performance.

[384] arXiv:2507.00846 [pdf, html, other]
Title: BoltzNCE: Learning Likelihoods for Boltzmann Generation with Stochastic Interpolants and Noise Contrastive Estimation
Rishal Aggrwal, Jacky Chen, Nicholas M. Boffi, David Ryan Koes
Comments: 19 pages, 25 figures, submitted to NeurIPS 2025
Subjects: Machine Learning (cs.LG); Biological Physics (physics.bio-ph)

Efficient sampling from the Boltzmann distribution defined by an energy function is a key challenge in modeling physical systems such as molecules. Boltzmann Generators tackle this by leveraging Continuous Normalizing Flows that transform a simple prior into a distribution that can be reweighted to match the Boltzmann distribution using sample likelihoods. However, obtaining likelihoods requires computing costly Jacobians during integration, making it impractical for large molecular systems. To overcome this, we propose learning the likelihood of the generated distribution via an energy-based model trained with noise contrastive estimation and score matching. By using stochastic interpolants to anneal between the prior and generated distributions, we combine both the objective functions to efficiently learn the density function. On the alanine dipeptide system, we demonstrate that our method yields free energy profiles and energy distributions comparable to those obtained with exact likelihoods. Additionally, we show that free energy differences between metastable states can be estimated accurately with orders-of-magnitude speedup.

[385] arXiv:2507.00847 [pdf, html, other]
Title: Stealtooth: Breaking Bluetooth Security Abusing Silent Automatic Pairing
Keiichiro Kimura, Hiroki Kuzuno, Yoshiaki Shiraishi, Masakatu Morii
Comments: 13 pages, 6 figures. We plan to extend our evaluation to additional device categories. Responsible disclosure completed
Subjects: Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)

Bluetooth is a pervasive wireless communication technology used by billions of devices for short-range connectivity. The security of Bluetooth relies on the pairing process, where devices establish shared long-term keys for secure communications. However, many commercial Bluetooth devices implement automatic pairing functions to improve user convenience, creating a previously unexplored attack surface.
We present Stealtooth, a novel attack that abuses unknown vulnerabilities in the automatic pairing functions in commercial Bluetooth devices to achieve completely silent device link key overwriting. The Stealtooth attack leverages the fact that Bluetooth audio devices automatically transition to pairing mode under specific conditions, enabling attackers to hijack pairing processes without user awareness or specialized tools. We also extend the attack into the MitM Stealtooth attack, combining automatic pairing abuse with power-saving mode techniques to enable man-in-the-middle attacks.
We evaluate the attacks against 10 commercial Bluetooth devices from major manufacturers, demonstrating widespread vulnerabilities across diverse device types and manufacturers. Our practical implementation requires only commodity hardware and open-source software, highlighting the low barrier to entry for attackers.
We propose defenses both device and protocol levels, including enhanced user notifications and standardized automatic pairing guidelines. Our findings reveal a critical tension between security and usability, showing that current automatic pairing implementations create systematic vulnerabilities. We responsibly disclosed our findings to affected vendors, with several already releasing patches.

[386] arXiv:2507.00848 [pdf, other]
Title: Quantum Approximate Optimization Algorithm for Spatiotemporal Forecasting of HIV Clusters
Don Roosan, Saif Nirzhor, Rubayat Khan, Fahmida Hai, Mohammad Rifat Haidar
Comments: Conference details can be found here: this https URL
Subjects: Machine Learning (cs.LG); Molecular Networks (q-bio.MN)

HIV epidemiological data is increasingly complex, requiring advanced computation for accurate cluster detection and forecasting. We employed quantum-accelerated machine learning to analyze HIV prevalence at the ZIP-code level using AIDSVu and synthetic SDoH data for 2022. Our approach compared classical clustering (DBSCAN, HDBSCAN) with a quantum approximate optimization algorithm (QAOA), developed a hybrid quantum-classical neural network for HIV prevalence forecasting, and used quantum Bayesian networks to explore causal links between SDoH factors and HIV incidence. The QAOA-based method achieved 92% accuracy in cluster detection within 1.6 seconds, outperforming classical algorithms. Meanwhile, the hybrid quantum-classical neural network predicted HIV prevalence with 94% accuracy, surpassing a purely classical counterpart. Quantum Bayesian analysis identified housing instability as a key driver of HIV cluster emergence and expansion, with stigma exerting a geographically variable influence. These quantum-enhanced methods deliver greater precision and efficiency in HIV surveillance while illuminating critical causal pathways. This work can guide targeted interventions, optimize resource allocation for PrEP, and address structural inequities fueling HIV transmission.

[387] arXiv:2507.00849 [pdf, html, other]
Title: UAVD-Mamba: Deformable Token Fusion Vision Mamba for Multimodal UAV Detection
Wei Li, Jiaman Tang, Yang Li, Beihao Xia, Ligang Tan, Hongmao Qin
Comments: The paper was accepted by the 36th IEEE Intelligent Vehicles Symposium (IEEE IV 2025)
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Unmanned Aerial Vehicle (UAV) object detection has been widely used in traffic management, agriculture, emergency rescue, etc. However, it faces significant challenges, including occlusions, small object sizes, and irregular shapes. These challenges highlight the necessity for a robust and efficient multimodal UAV object detection method. Mamba has demonstrated considerable potential in multimodal image fusion. Leveraging this, we propose UAVD-Mamba, a multimodal UAV object detection framework based on Mamba architectures. To improve geometric adaptability, we propose the Deformable Token Mamba Block (DTMB) to generate deformable tokens by incorporating adaptive patches from deformable convolutions alongside normal patches from normal convolutions, which serve as the inputs to the Mamba Block. To optimize the multimodal feature complementarity, we design two separate DTMBs for the RGB and infrared (IR) modalities, with the outputs from both DTMBs integrated into the Mamba Block for feature extraction and into the Fusion Mamba Block for feature fusion. Additionally, to improve multiscale object detection, especially for small objects, we stack four DTMBs at different scales to produce multiscale feature representations, which are then sent to the Detection Neck for Mamba (DNM). The DNM module, inspired by the YOLO series, includes modifications to the SPPF and C3K2 of YOLOv11 to better handle the multiscale features. In particular, we employ cross-enhanced spatial attention before the DTMB and cross-channel attention after the Fusion Mamba Block to extract more discriminative features. Experimental results on the DroneVehicle dataset show that our method outperforms the baseline OAFA method by 3.6% in the mAP metric. Codes will be released at this https URL.

[388] arXiv:2507.00851 [pdf, html, other]
Title: Aligning Learning and Endogenous Decision-Making
Rares Cristian, Pavithra Harsha, Georgia Perakis, Brian Quanz
Subjects: Machine Learning (cs.LG)

Many of the observations we make are biased by our decisions. For instance, the demand of items is impacted by the prices set, and online checkout choices are influenced by the assortments presented. The challenge in decision-making under this setting is the lack of counterfactual information, and the need to learn it instead. We introduce an end-to-end method under endogenous uncertainty to train ML models to be aware of their downstream, enabling their effective use in the decision-making stage. We further introduce a robust optimization variant that accounts for uncertainty in ML models -- specifically by constructing uncertainty sets over the space of ML models and optimizing actions to protect against worst-case predictions. We prove guarantees that this robust approach can capture near-optimal decisions with high probability as a function of data. Besides this, we also introduce a new class of two-stage stochastic optimization problems to the end-to-end learning framework that can now be addressed through our framework. Here, the first stage is an information-gathering problem to decide which random variable to poll and gain information about before making a second-stage decision based off of it. We present several computational experiments for pricing and inventory assortment/recommendation problems. We compare against existing methods in online learning/bandits/offline reinforcement learning and show our approach has consistent improved performance over these. Just as in the endogenous setting, the model's prediction also depends on the first-stage decision made. While this decision does not affect the random variable in this setting, it does affect the correct point forecast that should be made.

[389] arXiv:2507.00852 [pdf, html, other]
Title: Robust Component Detection for Flexible Manufacturing: A Deep Learning Approach to Tray-Free Object Recognition under Variable Lighting
Fatemeh Sadat Daneshmand
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Flexible manufacturing systems in Industry 4.0 require robots capable of handling objects in unstructured environments without rigid positioning constraints. This paper presents a computer vision system that enables industrial robots to detect and grasp pen components in arbitrary orientations without requiring structured trays, while maintaining robust performance under varying lighting conditions. We implement and evaluate a Mask R-CNN-based approach on a complete pen manufacturing line at ZHAW, addressing three critical challenges: object detection without positional constraints, robustness to extreme lighting variations, and reliable performance with cost-effective cameras. Our system achieves 95% detection accuracy across diverse lighting conditions while eliminating the need for structured component placement, demonstrating a 30% reduction in setup time and significant improvement in manufacturing flexibility. The approach is validated through extensive testing under four distinct lighting scenarios, showing practical applicability for real-world industrial deployment.

[390] arXiv:2507.00855 [pdf, html, other]
Title: A New Family of Thread to Core Allocation Policies for an SMT ARM Processor
Marta Navarro, Josué Feliu, Salvador Petit, María E. Gómez, Julio Sahuquillo
Comments: 13 pages
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Hardware Architecture (cs.AR)

Modern high-performance servers commonly integrate Simultaneous Multithreading (SMT) processors, which efficiently boosts throughput over single-threaded cores. Optimizing performance in SMT processors faces challenges due to the inter-application interference within each SMT core. To mitigate the interference, thread-to-core (T2C) allocation policies play a pivotal role. State-of-the-art T2C policies work in two steps: i) building a per-application performance stack using performance counters and ii) building performance prediction models to identify the best pairs of applications to run on each core.
This paper explores distinct ways to build the performance stack in ARM processors and introduces the Instructions and Stalls Cycles (ISC) stack, a novel approach to overcome ARM PMU limitations. The ISC stacks are used as inputs for a performance prediction model to estimate the applications' performance considering the inter-application interference. The accuracy of the prediction model (second step) depends on the accuracy of the performance stack (first step); thus, the higher the accuracy of the performance stack, the higher the potential performance gains obtained by the T2C allocation policy.
This paper presents SYNPA as a family of T2C allocation policies. Experimental results show that $SYNPA4$, the best-performing SYNPA variant, outperforms turnaround time by 38\% over Linux, which represents 3$\times$ the gains achieved by the state-of-the-art policies for ARM processors. Furthermore, the multiple discussions and refinements presented throughout this paper can be applied to other SMT processors from distinct vendors and are aimed at helping performance analysts build performance stacks for accurate performance estimates in real processors.

[391] arXiv:2507.00856 [pdf, html, other]
Title: Enhancing Vehicular Platooning with Wireless Federated Learning: A Resource-Aware Control Framework
Beining Wu, Jun Huang, Qiang Duan, Liang Dong, Zhipeng Cai
Comments: Under review at IEEE Transactions on Networking
Subjects: Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)

This paper aims to enhance the performance of Vehicular Platooning (VP) systems integrated with Wireless Federated Learning (WFL). In highly dynamic environments, vehicular platoons experience frequent communication changes and resource constraints, which significantly affect information exchange and learning model synchronization. To address these challenges, we first formulate WFL in VP as a joint optimization problem that simultaneously considers Age of Information (AoI) and Federated Learning Model Drift (FLMD) to ensure timely and accurate control. Through theoretical analysis, we examine the impact of FLMD on convergence performance and develop a two-stage Resource-Aware Control framework (RACE). The first stage employs a Lagrangian dual decomposition method for resource configuration, while the second stage implements a multi-agent deep reinforcement learning approach for vehicle selection. The approach integrates Multi-Head Self-Attention and Long Short-Term Memory networks to capture spatiotemporal correlations in communication states. Experimental results demonstrate that, compared to baseline methods, the proposed framework improves AoI optimization by up to 45%, accelerates learning convergence, and adapts more effectively to dynamic VP environments on the AI4MARS dataset.

[392] arXiv:2507.00861 [pdf, html, other]
Title: SafeMap: Robust HD Map Construction from Incomplete Observations
Xiaoshuai Hao, Lingdong Kong, Rong Yin, Pengwei Wang, Jing Zhang, Yunfeng Diao, Shu Zhao
Comments: Accepted by ICML 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Robust high-definition (HD) map construction is vital for autonomous driving, yet existing methods often struggle with incomplete multi-view camera data. This paper presents SafeMap, a novel framework specifically designed to secure accuracy even when certain camera views are missing. SafeMap integrates two key components: the Gaussian-based Perspective View Reconstruction (G-PVR) module and the Distillation-based Bird's-Eye-View (BEV) Correction (D-BEVC) module. G-PVR leverages prior knowledge of view importance to dynamically prioritize the most informative regions based on the relationships among available camera views. Furthermore, D-BEVC utilizes panoramic BEV features to correct the BEV representations derived from incomplete observations. Together, these components facilitate the end-to-end map reconstruction and robust HD map generation. SafeMap is easy to implement and integrates seamlessly into existing systems, offering a plug-and-play solution for enhanced robustness. Experimental results demonstrate that SafeMap significantly outperforms previous methods in both complete and incomplete scenarios, highlighting its superior performance and reliability.

[393] arXiv:2507.00862 [pdf, html, other]
Title: Machine Learning-based Early Detection of Potato Sprouting Using Electrophysiological Signals
Davide Andreoletti, Aris Marcolongo, Natasa Sarafijanovic Djukic, Julien Roulet, Stefano Billeter, Andrzej Kurenda, Margot Visse-Mansiaux, Brice Dupuis, Carrol Annette Plummer, Beatrice Paoli, Omran Ayoub
Comments: 8 pages, 7 figures
Subjects: Machine Learning (cs.LG)

Accurately predicting potato sprouting before the emergence of any visual signs is critical for effective storage management, as sprouting degrades both the commercial and nutritional value of tubers. Effective forecasting allows for the precise application of anti-sprouting chemicals (ASCs), minimizing waste and reducing costs. This need has become even more pressing following the ban on Isopropyl N-(3-chlorophenyl) carbamate (CIPC) or Chlorpropham due to health and environmental concerns, which has led to the adoption of significantly more expensive alternative ASCs. Existing approaches primarily rely on visual identification, which only detects sprouting after morphological changes have occurred, limiting their effectiveness for proactive management. A reliable early prediction method is therefore essential to enable timely intervention and improve the efficiency of post-harvest storage strategies, where early refers to detecting sprouting before any visible signs appear. In this work, we address the problem of early prediction of potato sprouting. To this end, we propose a novel machine learning (ML)-based approach that enables early prediction of potato sprouting using electrophysiological signals recorded from tubers using proprietary sensors. Our approach preprocesses the recorded signals, extracts relevant features from the wavelet domain, and trains supervised ML models for early sprouting detection. Additionally, we incorporate uncertainty quantification techniques to enhance predictions. Experimental results demonstrate promising performance in the early detection of potato sprouting by accurately predicting the exact day of sprouting for a subset of potatoes and while showing acceptable average error across all potatoes. Despite promising results, further refinements are necessary to minimize prediction errors, particularly in reducing the maximum observed deviations.

[394] arXiv:2507.00868 [pdf, html, other]
Title: Is Visual in-Context Learning for Compositional Medical Tasks within Reach?
Simon Reiß, Zdravko Marinov, Alexander Jaus, Constantin Seibold, M. Saquib Sarfraz, Erik Rodner, Rainer Stiefelhagen
Comments: Accepted to ICCV 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)

In this paper, we explore the potential of visual in-context learning to enable a single model to handle multiple tasks and adapt to new tasks during test time without re-training. Unlike previous approaches, our focus is on training in-context learners to adapt to sequences of tasks, rather than individual tasks. Our goal is to solve complex tasks that involve multiple intermediate steps using a single model, allowing users to define entire vision pipelines flexibly at test time. To achieve this, we first examine the properties and limitations of visual in-context learning architectures, with a particular focus on the role of codebooks. We then introduce a novel method for training in-context learners using a synthetic compositional task generation engine. This engine bootstraps task sequences from arbitrary segmentation datasets, enabling the training of visual in-context learners for compositional tasks. Additionally, we investigate different masking-based training objectives to gather insights into how to train models better for solving complex, compositional tasks. Our exploration not only provides important insights especially for multi-modal medical task sequences but also highlights challenges that need to be addressed.

[395] arXiv:2507.00875 [pdf, html, other]
Title: TransLaw: Benchmarking Large Language Models in Multi-Agent Simulation of the Collaborative Translation
Xi Xuan, King-kui Sin, Yufei Zhou, Chunyu Kit
Comments: arXiv admin note: text overlap with arXiv:2501.09444; text overlap with arXiv:2409.20288 by other authors
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)

Multi-agent systems empowered by large language models (LLMs) have demonstrated remarkable capabilities in a wide range of downstream applications, including machine translation. However, the potential of LLMs in translating Hong Kong legal judgments remains uncertain due to challenges such as intricate legal terminology, culturally embedded nuances, and strict linguistic structures. In this work, we introduce TransLaw, a novel multi-agent framework implemented for real-world Hong Kong case law translation. It employs three specialized agents, namely, Translator, Annotator, and Proofreader, to collaboratively produce translations for high accuracy in legal meaning, appropriateness in style, and adequate coherence and cohesion in structure. This framework supports customizable LLM configurations and achieves tremendous cost reduction compared to professional human translation services. We evaluated its performance using 13 open-source and commercial LLMs as agents and obtained interesting findings, including that it surpasses GPT-4o in legal semantic accuracy, structural coherence, and stylistic fidelity, yet trails human experts in contextualizing complex terminology and stylistic naturalness. Our platform website is available at CityUHK, and our bilingual judgment corpus used for the evaluation is available at Hugging Face.

[396] arXiv:2507.00877 [pdf, html, other]
Title: Verifiable Natural Language to Linear Temporal Logic Translation: A Benchmark Dataset and Evaluation Suite
William H English, Chase Walker, Dominic Simon, Sumit Kumar Jha, Rickard Ewetz
Subjects: Systems and Control (eess.SY); Computation and Language (cs.CL)

Empirical evaluation of state-of-the-art natural-language (NL) to temporal-logic (TL) translation systems reveals near-perfect performance on existing benchmarks. However, current studies measure only the accuracy of the translation of NL logic into formal TL, ignoring a system's capacity to ground atomic propositions into new scenarios or environments. This is a critical feature, necessary for the verification of resulting formulas in a concrete state space. Consequently, most NL-to-TL translation frameworks propose their own bespoke dataset in which the correct grounding is known a-priori, inflating performance metrics and neglecting the need for extensible, domain-general systems. In this paper, we introduce the Verifiable Linear Temporal Logic Benchmark ( VLTL-Bench), a unifying benchmark that measures verification and verifiability of automated NL-to-LTL translation. The dataset consists of three unique state spaces and thousands of diverse natural language specifications and corresponding formal specifications in temporal logic. Moreover, the benchmark contains sample traces to validate the temporal logic expressions. While the benchmark directly supports end-to-end evaluation, we observe that many frameworks decompose the process into i) lifting, ii) grounding, iii) translation, and iv) verification. The benchmark provides ground truths after each of these steps to enable researches to improve and evaluate different substeps of the overall problem. To encourage methodologically sound advances in verifiable NL-to-LTL translation approaches, we release VLTL-Bench here: this https URL bench.

[397] arXiv:2507.00880 [pdf, html, other]
Title: NN-Former: Rethinking Graph Structure in Neural Architecture Representation
Ruihan Xu, Haokui Zhang, Yaowei Wang, Wei Zeng, Shiliang Zhang
Comments: Accepted to CVPR 2025. Code is avaiable at this https URL
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

The growing use of deep learning necessitates efficient network design and deployment, making neural predictors vital for estimating attributes such as accuracy and latency. Recently, Graph Neural Networks (GNNs) and transformers have shown promising performance in representing neural architectures. However, each of both methods has its disadvantages. GNNs lack the capabilities to represent complicated features, while transformers face poor generalization when the depth of architecture grows. To mitigate the above issues, we rethink neural architecture topology and show that sibling nodes are pivotal while overlooked in previous research. We thus propose a novel predictor leveraging the strengths of GNNs and transformers to learn the enhanced topology. We introduce a novel token mixer that considers siblings, and a new channel mixer named bidirectional graph isomorphism feed-forward network. Our approach consistently achieves promising performance in both accuracy and latency prediction, providing valuable insights for learning Directed Acyclic Graph (DAG) topology. The code is available at this https URL.

[398] arXiv:2507.00881 [pdf, html, other]
Title: Towards Difficulty-Aware Analysis of Deep Neural Networks
Linhao Meng, Stef van den Elzen, Anna Vilanova
Subjects: Human-Computer Interaction (cs.HC)

Traditional instance-based model analysis focuses mainly on misclassified instances. However, this approach overlooks the varying difficulty associated with different instances. Ideally, a robust model should recognize and reflect the challenges presented by intrinsically difficult instances. It is also valuable to investigate whether the difficulty perceived by the model aligns with that perceived by humans. To address this, we propose incorporating instance difficulty into the deep neural network evaluation process, specifically for supervised classification tasks on image data. Specifically, we consider difficulty measures from three perspectives -- data, model, and human -- to facilitate comprehensive evaluation and comparison. Additionally, we develop an interactive visual tool, DifficultyEyes, to support the identification of instances of interest based on various difficulty patterns and to aid in analyzing potential data or model issues. Case studies demonstrate the effectiveness of our approach.

[399] arXiv:2507.00882 [pdf, html, other]
Title: I Move Therefore I Learn: Experience-Based Traversability in Outdoor Robotics
Miguel Ángel de Miguel, Jorge Beltrán, Juan S. Cely, Francisco Martín, Juan Carlos Manzanares, Alberto García
Subjects: Robotics (cs.RO)

Accurate traversability estimation is essential for safe and effective navigation of outdoor robots operating in complex environments. This paper introduces a novel experience-based method that allows robots to autonomously learn which terrains are traversable based on prior navigation experience, without relying on extensive pre-labeled datasets. The approach integrates elevation and texture data into multi-layered grid maps, which are processed using a variational autoencoder (VAE) trained on a generic texture dataset. During an initial teleoperated phase, the robot collects sensory data while moving around the environment. These experiences are encoded into compact feature vectors and clustered using the BIRCH algorithm to represent traversable terrain areas efficiently. In deployment, the robot compares new terrain patches to its learned feature clusters to assess traversability in real time. The proposed method does not require training with data from the targeted scenarios, generalizes across diverse surfaces and platforms, and dynamically adapts as new terrains are encountered. Extensive evaluations on both synthetic benchmarks and real-world scenarios with wheeled and legged robots demonstrate its effectiveness, robustness, and superior adaptability compared to state-of-the-art approaches.

[400] arXiv:2507.00883 [pdf, html, other]
Title: Mathematics Isn't Culture-Free: Probing Cultural Gaps via Entity and Scenario Perturbations
Aditya Tomar, Nihar Ranjan Sahoo, Ashish Mittal, Rudra Murthy, Pushpak Bhattacharyya
Subjects: Computation and Language (cs.CL)

Although mathematics is often considered culturally neutral, the way mathematical problems are presented can carry implicit cultural context. Existing benchmarks like GSM8K are predominantly rooted in Western norms, including names, currencies, and everyday scenarios. In this work, we create culturally adapted variants of the GSM8K test set for five regions Africa, India, China, Korea, and Japan using prompt-based transformations followed by manual verification. We evaluate six large language models (LLMs), ranging from 8B to 72B parameters, across five prompting strategies to assess their robustness to cultural variation in math problem presentation. Our findings reveal a consistent performance gap: models perform best on the original US-centric dataset and comparatively worse on culturally adapted versions. However, models with reasoning capabilities are more resilient to these shifts, suggesting that deeper reasoning helps bridge cultural presentation gaps in mathematical tasks

[401] arXiv:2507.00885 [pdf, html, other]
Title: Scaling Laws Are Unreliable for Downstream Tasks: A Reality Check
Nicholas Lourie, Michael Y. Hu, Kyunghyun Cho
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Downstream scaling laws aim to predict task performance at larger scales from pretraining losses at smaller scales. Whether this prediction should be possible is unclear: some works demonstrate that task performance follows clear linear scaling trends under transformation, whereas others point out fundamental challenges to downstream scaling laws, such as emergence and inverse scaling. In this work, we conduct a meta-analysis of existing data on downstream scaling laws, finding that close fit to linear scaling laws only occurs in a minority of cases: 39% of the time. Furthermore, seemingly benign changes to the experimental setting can completely change the scaling trend. Our analysis underscores the need to understand the conditions under which scaling laws succeed. To fully model the relationship between pretraining loss and downstream task performance, we must embrace the cases in which scaling behavior deviates from linear trends.

[402] arXiv:2507.00886 [pdf, html, other]
Title: GaussianVLM: Scene-centric 3D Vision-Language Models using Language-aligned Gaussian Splats for Embodied Reasoning and Beyond
Anna-Maria Halacheva, Jan-Nico Zaech, Xi Wang, Danda Pani Paudel, Luc Van Gool
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

As multimodal language models advance, their application to 3D scene understanding is a fast-growing frontier, driving the development of 3D Vision-Language Models (VLMs). Current methods show strong dependence on object detectors, introducing processing bottlenecks and limitations in taxonomic flexibility. To address these limitations, we propose a scene-centric 3D VLM for 3D Gaussian splat scenes that employs language- and task-aware scene representations. Our approach directly embeds rich linguistic features into the 3D scene representation by associating language with each Gaussian primitive, achieving early modality alignment. To process the resulting dense representations, we introduce a dual sparsifier that distills them into compact, task-relevant tokens via task-guided and location-guided pathways, producing sparse, task-aware global and local scene tokens. Notably, we present the first Gaussian splatting-based VLM, leveraging photorealistic 3D representations derived from standard RGB images, demonstrating strong generalization: it improves performance of prior 3D VLM five folds, in out-of-the-domain settings.

[403] arXiv:2507.00891 [pdf, html, other]
Title: MemeCMD: An Automatically Generated Chinese Multi-turn Dialogue Dataset with Contextually Retrieved Memes
Yuheng Wang, Xianhe Tang, Pufeng Huang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Memes are widely used in online social interactions, providing vivid, intuitive, and often humorous means to express intentions and emotions. Existing dialogue datasets are predominantly limited to either manually annotated or pure-text conversations, lacking the expressiveness and contextual nuance that multimodal interactions this http URL address these challenges, we introduce MemeCMD, an automatically generated Chinese Multi-turn Dialogue dataset with contextually retrieved memes. Our dataset combines a large-scale, MLLM-annotated meme library with dialogues auto-generated by dual agents across diverse scenarios. We introduce a retrieval framework and adaptive threshold to ensure contextually relevant, naturally spaced meme usage. Experiments demonstrate the effectiveness of our approach in generating contextually appropriate and diverse meme-incorporated dialogues, offering a scalable and privacy-preserving resource for advancing multimodal conversational AI.

[404] arXiv:2507.00896 [pdf, html, other]
Title: QUIC Delay Control: an implementation of congestion and delay control
Saverio Mascolo, Andrea Vittorio Balillo, Gioacchino Manfredi, Davide D'Agostino, Luca De Cicco
Comments: 8 pages, 9 figures
Subjects: Networking and Internet Architecture (cs.NI)

A new congestion and delay control algorithm named QUIC Delay Control (QUIC-DC) is proposed for controlling not only congestion but also the queuing delay encountered along the forward communication path. The core idea is to estimate the one-way queuing delay of a connection to trigger an early reaction to congestion. This idea, along with the TCP Westwood+ congestion control algorithm, has been implemented in QUIC-DC and compared with QUIC Cubic, BBRv2, NewReno, Westwood+. The results obtained in both emulated and real network connections show that QUIC-DC can significantly reduce packet losses along with end-to-end communication delays, while preserving network utilization, features that are both very useful for real-time applications.

[405] arXiv:2507.00898 [pdf, html, other]
Title: ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models
Zifu Wan, Ce Zhang, Silong Yong, Martin Q. Ma, Simon Stepputtis, Louis-Philippe Morency, Deva Ramanan, Katia Sycara, Yaqi Xie
Comments: Accepted by ICCV 2025. Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Recent Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses. Although they have achieved remarkable performance across a range of multi-modal tasks, they face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in real-world applications. Existing work has explored contrastive decoding approaches to mitigate this issue, where the output of the original LVLM is compared and contrasted with that of a perturbed version. However, these methods require two or more queries that slow down LVLM response generation, making them less suitable for real-time applications. To overcome this limitation, we propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment. Specifically, we enhance textual outputs by selectively amplifying crucial textual information using a text-to-visual entropy ratio for each token. Extensive experimental results demonstrate that our proposed ONLY consistently outperforms state-of-the-art methods across various benchmarks while requiring minimal implementation effort and computational cost. Code is available at this https URL.

[406] arXiv:2507.00899 [pdf, html, other]
Title: TABASCO: A Fast, Simplified Model for Molecular Generation with Improved Physical Quality
Carlos Vonessen, Charles Harris, Miruna Cretu, Pietro Liò
Subjects: Machine Learning (cs.LG)

State-of-the-art models for 3D molecular generation are based on significant inductive biases, SE(3), permutation equivariance to respect symmetry and graph message-passing networks to capture local chemistry, yet the generated molecules still struggle with physical plausibility. We introduce TABASCO which relaxes these assumptions: The model has a standard non-equivariant transformer architecture, treats atoms in a molecule as sequences and reconstructs bonds deterministically after generation. The absence of equivariant layers and message passing allows us to significantly simplify the model architecture and scale data throughput. On the GEOM-Drugs benchmark TABASCO achieves state-of-the-art PoseBusters validity and delivers inference roughly 10x faster than the strongest baseline, while exhibiting emergent rotational equivariance despite symmetry not being hard-coded. Our work offers a blueprint for training minimalist, high-throughput generative models suited to specialised tasks such as structure- and pharmacophore-based drug design. We provide a link to our implementation at this http URL.

[407] arXiv:2507.00902 [pdf, html, other]
Title: Constellation as a Service: Tailored Connectivity Management in Direct-Satellite-to-Device Networks
Feng Wang, Shengyu Zhang, Een-Kee Hong, Tony Q.S. Quek
Comments: To appear in IEEE Communications Magazine
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)

Direct-satellite-to-device (DS2D) communication is emerging as a promising solution for global mobile service extension, leveraging the deployment of satellite constellations. However, the challenge of managing DS2D connectivity for multi-constellations becomes outstanding, including high interference and frequent handovers caused by multi-coverage overlap and rapid satellite movement. Moreover, existing approaches primarily operate within single-constellation shell, which inherently limits the ability to exploit the vast potential of multi-constellation connectivity provision, resulting in suboptimal DS2D service performances. To address these challenges, this article proposes a Constellation as a Service (CaaS) framework, which treats the entire multi-constellation infrastructure as a shared resource pool and dynamically forms optimal sub-constellations (SCs) for each DS2D service region. The formation of each SC integrates satellites from various orbits to provide tailored connectivity based on user demands, guided by two innovative strategies: predictive satellite beamforming using generative artificial intelligence (GenAI) and pre-configured handover path for efficient satellite access and mobility management. Simulation results demonstrate that CaaS significantly improves satellite service rates while reducing handover overhead, making it an efficient and continuable solution for managing DS2D connectivity in multi-constellation environments.

[408] arXiv:2507.00907 [pdf, other]
Title: The Age of Sensorial Zero Trust: Why We Can No Longer Trust Our Senses
Fabio Correa Xavier
Comments: 14 pages
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

In a world where deepfakes and cloned voices are emerging as sophisticated attack vectors, organizations require a new security mindset: Sensorial Zero Trust [9]. This article presents a scientific analysis of the need to systematically doubt information perceived through the senses, establishing rigorous verification protocols to mitigate the risks of fraud based on generative artificial intelligence. Key concepts, such as Out-of-Band verification, Vision-Language Models (VLMs) as forensic collaborators, cryptographic provenance, and human training, are integrated into a framework that extends Zero Trust principles to human sensory information. The approach is grounded in empirical findings and academic research, emphasizing that in an era of AI-generated realities, even our eyes and ears can no longer be implicitly trusted without verification. Leaders are called to foster a culture of methodological skepticism to protect organizational integrity in this new threat landscape.

[409] arXiv:2507.00909 [pdf, html, other]
Title: Turning AI Data Centers into Grid-Interactive Assets: Results from a Field Demonstration in Phoenix, Arizona
Philip Colangelo, Ayse K. Coskun, Jack Megrue, Ciaran Roberts, Shayan Sengupta, Varun Sivaram, Ethan Tiao, Aroon Vijaykar, Chris Williams, Daniel C. Wilson, Zack MacFarland, Daniel Dreiling, Nathan Morey, Anuja Ratnayake, Baskar Vairamohan
Comments: 10 pages, 6 figures, 1 table
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Performance (cs.PF); Systems and Control (eess.SY)

Artificial intelligence (AI) is fueling exponential electricity demand growth, threatening grid reliability, raising prices for communities paying for new energy infrastructure, and stunting AI innovation as data centers wait for interconnection to constrained grids. This paper presents the first field demonstration, in collaboration with major corporate partners, of a software-only approach--Emerald Conductor--that transforms AI data centers into flexible grid resources that can efficiently and immediately harness existing power systems without massive infrastructure buildout. Conducted at a 256-GPU cluster running representative AI workloads within a commercial, hyperscale cloud data center in Phoenix, Arizona, the trial achieved a 25% reduction in cluster power usage for three hours during peak grid events while maintaining AI quality of service (QoS) guarantees. By orchestrating AI workloads based on real-time grid signals without hardware modifications or energy storage, this platform reimagines data centers as grid-interactive assets that enhance grid reliability, advance affordability, and accelerate AI's development.

[410] arXiv:2507.00911 [pdf, html, other]
Title: The Cognate Data Bottleneck in Language Phylogenetics
Luise Häuser, Alexandros Stamatakis
Subjects: Computation and Language (cs.CL); Populations and Evolution (q-bio.PE)

To fully exploit the potential of computational phylogenetic methods for cognate data one needs to leverage specific (complex) models an machine learning-based techniques. However, both approaches require datasets that are substantially larger than the manually collected cognate data currently available. To the best of our knowledge, there exists no feasible approach to automatically generate larger cognate datasets. We substantiate this claim by automatically extracting datasets from BabelNet, a large multilingual encyclopedic dictionary. We demonstrate that phylogenetic inferences on the respective character matrices yield trees that are largely inconsistent with the established gold standard ground truth trees. We also discuss why we consider it as being unlikely to be able to extract more suitable character matrices from other multilingual resources. Phylogenetic data analysis approaches that require larger datasets can therefore not be applied to cognate data. Thus, it remains an open question how, and if these computational approaches can be applied in historical linguistics.

[411] arXiv:2507.00914 [pdf, html, other]
Title: Large Language Model Powered Intelligent Urban Agents: Concepts, Capabilities, and Applications
Jindong Han, Yansong Ning, Zirui Yuan, Hang Ni, Fan Liu, Tengfei Lyu, Hao Liu
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)

The long-standing vision of intelligent cities is to create efficient, livable, and sustainable urban environments using big data and artificial intelligence technologies. Recently, the advent of Large Language Models (LLMs) has opened new ways toward realizing this vision. With powerful semantic understanding and reasoning capabilities, LLMs can be deployed as intelligent agents capable of autonomously solving complex problems across domains. In this article, we focus on Urban LLM Agents, which are LLM-powered agents that are semi-embodied within the hybrid cyber-physical-social space of cities and used for system-level urban decision-making. First, we introduce the concept of urban LLM agents, discussing their unique capabilities and features. Second, we survey the current research landscape from the perspective of agent workflows, encompassing urban sensing, memory management, reasoning, execution, and learning. Third, we categorize the application domains of urban LLM agents into five groups: urban planning, transportation, environment, public safety, and urban society, presenting representative works in each group. Finally, we discuss trustworthiness and evaluation issues that are critical for real-world deployment, and identify several open problems for future research. This survey aims to establish a foundation for the emerging field of urban LLM agents and to provide a roadmap for advancing the intersection of LLMs and urban intelligence. A curated list of relevant papers and open-source resources is maintained and continuously updated at this https URL.

[412] arXiv:2507.00915 [pdf, other]
Title: MichelangeRoll: Sculpting Rational Distributions Exactly and Efficiently
Jui-Hsiang Shao, Hsin-Po Wang
Comments: 13 pages, 7 figures, RANDOM says no so here
Subjects: Information Theory (cs.IT)

Simulating an arbitrary discrete distribution $D \in [0, 1]^n$ using fair coin tosses incurs trade-offs between entropy complexity and space and time complexity. Shannon's theory suggests that $H(D)$ tosses are necessary and sufficient, but does not guarantee exact distribution. Knuth and Yao showed that a decision tree consumes fewer than $H(D) + 2$ tosses for one exact sample. Drapper and Saad's recent work addresses the space and time aspect, showing that $H(D) + 2$ tosses, $O(n \log(n) \log(m))$ memory, and $O(H(D))$ operations are all it costs, where $m$ is the common denominator of the probability masses in $D$ and $n$ is the number of possible outcomes.
In this paper, MichelangeRoll recycles leftover entropy to break the "$+2$" barrier. With $O((n + 1/\varepsilon) \log(m/\varepsilon))$ memory, the entropy cost of generating a ongoing sequence of $D$ is reduced to $H(D) + \varepsilon$ per sample.

[413] arXiv:2507.00916 [pdf, html, other]
Title: Masks make discriminative models great again!
Tianshi Cao, Marie-Julie Rakotosaona, Ben Poole, Federico Tombari, Michael Niemeyer
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We present Image2GS, a novel approach that addresses the challenging problem of reconstructing photorealistic 3D scenes from a single image by focusing specifically on the image-to-3D lifting component of the reconstruction process. By decoupling the lifting problem (converting an image to a 3D model representing what is visible) from the completion problem (hallucinating content not present in the input), we create a more deterministic task suitable for discriminative models. Our method employs visibility masks derived from optimized 3D Gaussian splats to exclude areas not visible from the source view during training. This masked training strategy significantly improves reconstruction quality in visible regions compared to strong baselines. Notably, despite being trained only on masked regions, Image2GS remains competitive with state-of-the-art discriminative models trained on full target images when evaluated on complete scenes. Our findings highlight the fundamental struggle discriminative models face when fitting unseen regions and demonstrate the advantages of addressing image-to-3D lifting as a distinct problem with specialized techniques.

[414] arXiv:2507.00917 [pdf, html, other]
Title: A Survey: Learning Embodied Intelligence from Physical Simulators and World Models
Xiaoxiao Long, Qingrui Zhao, Kaiwen Zhang, Zihao Zhang, Dingrui Wang, Yumeng Liu, Zhengjie Shu, Yi Lu, Shouzheng Wang, Xinzhe Wei, Wei Li, Wei Yin, Yao Yao, Jia Pan, Qiu Shen, Ruigang Yang, Xun Cao, Qionghai Dai
Comments: this https URL
Subjects: Robotics (cs.RO)

The pursuit of artificial general intelligence (AGI) has placed embodied intelligence at the forefront of robotics research. Embodied intelligence focuses on agents capable of perceiving, reasoning, and acting within the physical world. Achieving robust embodied intelligence requires not only advanced perception and control, but also the ability to ground abstract cognition in real-world interactions. Two foundational technologies, physical simulators and world models, have emerged as critical enablers in this quest. Physical simulators provide controlled, high-fidelity environments for training and evaluating robotic agents, allowing safe and efficient development of complex behaviors. In contrast, world models empower robots with internal representations of their surroundings, enabling predictive planning and adaptive decision-making beyond direct sensory input. This survey systematically reviews recent advances in learning embodied AI through the integration of physical simulators and world models. We analyze their complementary roles in enhancing autonomy, adaptability, and generalization in intelligent robots, and discuss the interplay between external simulation and internal modeling in bridging the gap between simulated training and real-world deployment. By synthesizing current progress and identifying open challenges, this survey aims to provide a comprehensive perspective on the path toward more capable and generalizable embodied AI systems. We also maintain an active repository that contains up-to-date literature and open-source projects at this https URL.

[415] arXiv:2507.00920 [pdf, html, other]
Title: Privacy-Preserving Quantized Federated Learning with Diverse Precision
Dang Qua Nguyen, Morteza Hashemi, Erik Perrins, Sergiy A. Vorobyov, David J. Love, Taejoon Kim
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)

Federated learning (FL) has emerged as a promising paradigm for distributed machine learning, enabling collaborative training of a global model across multiple local devices without requiring them to share raw data. Despite its advancements, FL is limited by factors such as: (i) privacy risks arising from the unprotected transmission of local model updates to the fusion center (FC) and (ii) decreased learning utility caused by heterogeneity in model quantization resolution across participating devices. Prior work typically addresses only one of these challenges because maintaining learning utility under both privacy risks and quantization heterogeneity is a non-trivial task. In this paper, our aim is therefore to improve the learning utility of a privacy-preserving FL that allows clusters of devices with different quantization resolutions to participate in each FL round. Specifically, we introduce a novel stochastic quantizer (SQ) that is designed to simultaneously achieve differential privacy (DP) and minimum quantization error. Notably, the proposed SQ guarantees bounded distortion, unlike other DP approaches. To address quantization heterogeneity, we introduce a cluster size optimization technique combined with a linear fusion approach to enhance model aggregation accuracy. Numerical simulations validate the benefits of our approach in terms of privacy protection and learning utility compared to the conventional LaplaceSQ-FL algorithm.

[416] arXiv:2507.00926 [pdf, other]
Title: HyperFusion: Hierarchical Multimodal Ensemble Learning for Social Media Popularity Prediction
Liliang Ye (1), Yunyao Zhang (1), Yafeng Wu (1), Yi-Ping Phoebe Chen (2), Junqing Yu (1), Wei Yang (1), Zikai Song (1) ((1) Huazhong University of Science and Technology, Wuhan, China, (2) La Trobe University, Melbourne, Australia)
Subjects: Multimedia (cs.MM); Machine Learning (cs.LG)

Social media popularity prediction plays a crucial role in content optimization, marketing strategies, and user engagement enhancement across digital platforms. However, predicting post popularity remains challenging due to the complex interplay between visual, textual, temporal, and user behavioral factors. This paper presents HyperFusion, a hierarchical multimodal ensemble learning framework for social media popularity prediction. Our approach employs a three-tier fusion architecture that progressively integrates features across abstraction levels: visual representations from CLIP encoders, textual embeddings from transformer models, and temporal-spatial metadata with user characteristics. The framework implements a hierarchical ensemble strategy combining CatBoost, TabNet, and custom multi-layer perceptrons. To address limited labeled data, we propose a two-stage training methodology with pseudo-labeling and iterative refinement. We introduce novel cross-modal similarity measures and hierarchical clustering features that capture inter-modal dependencies. Experimental results demonstrate that HyperFusion achieves competitive performance on the SMP challenge dataset. Our team achieved third place in the SMP Challenge 2025 (Image Track). The source code is available at this https URL.

[417] arXiv:2507.00927 [pdf, html, other]
Title: Understanding Generalization in Node and Link Prediction
Antonis Vasileiou, Timo Stoll, Christopher Morris
Comments: arXiv admin note: text overlap with arXiv:2412.07106
Subjects: Machine Learning (cs.LG)

Using message-passing graph neural networks (MPNNs) for node and link prediction is crucial in various scientific and industrial domains, which has led to the development of diverse MPNN architectures. Besides working well in practical settings, their ability to generalize beyond the training set remains poorly understood. While some studies have explored MPNNs' generalization in graph-level prediction tasks, much less attention has been given to node- and link-level predictions. Existing works often rely on unrealistic i.i.d.\@ assumptions, overlooking possible correlations between nodes or links, and assuming fixed aggregation and impractical loss functions while neglecting the influence of graph structure. In this work, we introduce a unified framework to analyze the generalization properties of MPNNs in inductive and transductive node and link prediction settings, incorporating diverse architectural parameters and loss functions and quantifying the influence of graph structure. Additionally, our proposed generalization framework can be applied beyond graphs to any classification task under the inductive or transductive setting. Our empirical study supports our theoretical insights, deepening our understanding of MPNNs' generalization capabilities in these tasks.

[418] arXiv:2507.00930 [pdf, html, other]
Title: Inverse matroid optimization under subset constraints
Kristóf íBérczi, Lydia Mirabel Mendoza-Cadena, José Soto
Comments: 20 pages
Subjects: Data Structures and Algorithms (cs.DS); Discrete Mathematics (cs.DM)

In the Inverse Matroid problem, we are given a matroid, a fixed basis $B$, and an initial weight function, and the goal is to minimally modify the weights -- measured by some function -- so that $B$ becomes a maximum-weight basis. The problem arises naturally in settings where one wishes to explain or enforce a given solution by minimally perturbing the input.
We extend this classical problem by replacing the fixed basis with a subset $S_0$ of the ground set and imposing various structural constraints on the set of maximum-weight bases relative to $S_0$. Specifically, we study six variants: (A) Inverse Matroid Exists, where $S_0$ must contain at least one maximum-weight basis; (B) Inverse Matroid All, where all bases contained in $S_0$ are maximum-weight; and (C) Inverse Matroid Only, where $S_0$ contains exactly the maximum-weight bases, along with their natural negated counterparts.
For all variants, we develop combinatorial polynomial-time algorithms under the $\ell_\infty$-norm. A key ingredient is a refined min-max theorem for Inverse Matroid under the $\ell_\infty$-norm, which enables simpler and faster algorithms than previous approaches and may be of independent combinatorial interest. Our work significantly broadens the range of inverse optimization problems on matroids that can be solved efficiently, especially those that constrain the structure of optimal solutions through subset inclusion or exclusion.

[419] arXiv:2507.00937 [pdf, html, other]
Title: RaGNNarok: A Light-Weight Graph Neural Network for Enhancing Radar Point Clouds on Unmanned Ground Vehicles
David Hunt, Shaocheng Luo, Spencer Hallyburton, Shafii Nillongo, Yi Li, Tingjun Chen, Miroslav Pajic
Comments: 8 pages, accepted by IROS 2025
Subjects: Robotics (cs.RO); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Low-cost indoor mobile robots have gained popularity with the increasing adoption of automation in homes and commercial spaces. However, existing lidar and camera-based solutions have limitations such as poor performance in visually obscured environments, high computational overhead for data processing, and high costs for lidars. In contrast, mmWave radar sensors offer a cost-effective and lightweight alternative, providing accurate ranging regardless of visibility. However, existing radar-based localization suffers from sparse point cloud generation, noise, and false detections. Thus, in this work, we introduce RaGNNarok, a real-time, lightweight, and generalizable graph neural network (GNN)-based framework to enhance radar point clouds, even in complex and dynamic environments. With an inference time of just 7.3 ms on the low-cost Raspberry Pi 5, RaGNNarok runs efficiently even on such resource-constrained devices, requiring no additional computational resources. We evaluate its performance across key tasks, including localization, SLAM, and autonomous navigation, in three different environments. Our results demonstrate strong reliability and generalizability, making RaGNNarok a robust solution for low-cost indoor mobile robots.

[420] arXiv:2507.00938 [pdf, html, other]
Title: WebArXiv: Evaluating Multimodal Agents on Time-Invariant arXiv Tasks
Zihao Sun, Meng Fang, Ling Chen
Comments: 10 pages, 9 figures, 4 tables
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB)

Recent progress in large language models (LLMs) has enabled the development of autonomous web agents capable of navigating and interacting with real websites. However, evaluating such agents remains challenging due to the instability and inconsistency of existing benchmarks, which often rely on dynamic content or oversimplified simulations. In this work, we introduce WebArXiv, a static and time-invariant benchmark comprising 275 web-based tasks grounded in the arXiv platform. WebArXiv ensures reproducible and reliable evaluation by anchoring tasks in fixed web snapshots with deterministic ground truths and standardized action trajectories. Through behavioral analysis, we identify a common failure mode, Rigid History Reflection, where agents over-rely on fixed interaction histories. To address this, we propose a lightweight dynamic reflection mechanism that allows agents to selectively retrieve relevant past steps during decision-making. We evaluate ten state-of-the-art web agents on WebArXiv. Results demonstrate clear performance differences across agents and validate the effectiveness of our proposed reflection strategy.

[421] arXiv:2507.00942 [pdf, html, other]
Title: Optimal Feedback Schemes for Dirty Paper Channels With State Estimation at the Receiver
Dengfeng Xia, Han Deng, Haonan Zhang, Fan Cheng, Bin Dai, Liuguo Yin
Comments: This paper will be presented at the 2025 IEEE Information Theory Workshop (ITW)
Subjects: Information Theory (cs.IT)

In the literature, it has been shown that feedback does not increase the optimal rate-distortion region of the dirty paper channel with state estimation at the receiver (SE-R). On the other hand, it is well-known that feedback helps to construct low-complexity coding schemes in Gaussian channels, such as the elegant Schalkwijk-Kailath (SK) feedback scheme. This motivates us to explore capacity-achieving SK-type schemes in dirty paper channels with SE-R and feedback. In this paper, we first propose a capacity-achieving feedback scheme for the dirty paper channel with SE-R (DPC-SE-R), which combines the superposition coding and the classical SK-type scheme. Then, we extend this scheme to the dirty paper multiple-access channel with SE-R and feedback, and also show the extended scheme is capacity-achieving. Finally, we discuss how to extend our scheme to a noisy state observation case of the DPC-SE-R. However, the capacity-achieving SK-type scheme for such a case remains unknown.

[422] arXiv:2507.00945 [pdf, html, other]
Title: Time Series Foundation Models are Flow Predictors
Massimiliano Luca, Ciro Beneduce, Bruno Lepri
Comments: arXiv admin note: text overlap with arXiv:2203.07372
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY)

We investigate the effectiveness of time series foundation models (TSFMs) for crowd flow prediction, focusing on Moirai and TimesFM. Evaluated on three real-world mobility datasets-Bike NYC, Taxi Beijing, and Spanish national OD flows-these models are deployed in a strict zero-shot setting, using only the temporal evolution of each OD flow and no explicit spatial information. Moirai and TimesFM outperform both statistical and deep learning baselines, achieving up to 33% lower RMSE, 39% lower MAE and up to 49% higher CPC compared to state-of-the-art competitors. Our results highlight the practical value of TSFMs for accurate, scalable flow prediction, even in scenarios with limited annotated data or missing spatial context.

[423] arXiv:2507.00949 [pdf, html, other]
Title: How Fast Can Graph Computations Go on Fine-grained Parallel Architectures
Yuqing Wang, Charles Colley, Brian Wheatman, Jiya Su, David F. Gleich, Andrew A. Chien
Comments: 13 pages, 11 figures, 6 tables
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Hardware Architecture (cs.AR)

Large-scale graph problems are of critical and growing importance and historically parallel architectures have provided little support. In the spirit of co-design, we explore the question, How fast can graph computing go on a fine-grained architecture? We explore the possibilities of an architecture optimized for fine-grained parallelism, natural programming, and the irregularity and skew found in real-world graphs. Using two graph benchmarks, PageRank (PR) and Breadth-First Search (BFS), we evaluate a Fine-Grained Graph architecture, UpDown, to explore what performance codesign can achieve. To demonstrate programmability, we wrote five variants of these algorithms. Simulations of up to 256 nodes (524,288 lanes) and projections to 16,384 nodes (33M lanes) show the UpDown system can achieve 637K GTEPS PR and 989K GTEPS BFS on RMAT, exceeding the best prior results by 5x and 100x respectively.

[424] arXiv:2507.00950 [pdf, html, other]
Title: MVP: Winning Solution to SMP Challenge 2025 Video Track
Liliang Ye (1), Yunyao Zhang (1), Yafeng Wu (1), Yi-Ping Phoebe Chen (2), Junqing Yu (1), Wei Yang (1), Zikai Song (1) ((1) Huazhong University of Science and Technology, Wuhan, China, (2) La Trobe University, Melbourne, Australia)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)

Social media platforms serve as central hubs for content dissemination, opinion expression, and public engagement across diverse modalities. Accurately predicting the popularity of social media videos enables valuable applications in content recommendation, trend detection, and audience engagement. In this paper, we present Multimodal Video Predictor (MVP), our winning solution to the Video Track of the SMP Challenge 2025. MVP constructs expressive post representations by integrating deep video features extracted from pretrained models with user metadata and contextual information. The framework applies systematic preprocessing techniques, including log-transformations and outlier removal, to improve model robustness. A gradient-boosted regression model is trained to capture complex patterns across modalities. Our approach ranked first in the official evaluation of the Video Track, demonstrating its effectiveness and reliability for multimodal video popularity prediction on social platforms. The source code is available at this https URL.

[425] arXiv:2507.00951 [pdf, html, other]
Title: Thinking Beyond Tokens: From Brain-Inspired Intelligence to Cognitive Foundations for Artificial General Intelligence and its Societal Impact
Rizwan Qureshi, Ranjan Sapkota, Abbas Shah, Amgad Muneer, Anas Zafar, Ashmal Vayani, Maged Shoman, Abdelrahman B. M. Eldaly, Kai Zhang, Ferhat Sadak, Shaina Raza, Xinqi Fan, Ravid Shwartz-Ziv, Hong Yan, Vinjia Jain, Aman Chadha, Manoj Karkee, Jia Wu, Philip Torr, Seyedali Mirjalili
Subjects: Artificial Intelligence (cs.AI)

Can machines truly think, reason and act in domains like humans? This enduring question continues to shape the pursuit of Artificial General Intelligence (AGI). Despite the growing capabilities of models such as GPT-4.5, DeepSeek, Claude 3.5 Sonnet, Phi-4, and Grok 3, which exhibit multimodal fluency and partial reasoning, these systems remain fundamentally limited by their reliance on token-level prediction and lack of grounded agency. This paper offers a cross-disciplinary synthesis of AGI development, spanning artificial intelligence, cognitive neuroscience, psychology, generative models, and agent-based systems. We analyze the architectural and cognitive foundations of general intelligence, highlighting the role of modular reasoning, persistent memory, and multi-agent coordination. In particular, we emphasize the rise of Agentic RAG frameworks that combine retrieval, planning, and dynamic tool use to enable more adaptive behavior. We discuss generalization strategies, including information compression, test-time adaptation, and training-free methods, as critical pathways toward flexible, domain-agnostic intelligence. Vision-Language Models (VLMs) are reexamined not just as perception modules but as evolving interfaces for embodied understanding and collaborative task completion. We also argue that true intelligence arises not from scale alone but from the integration of memory and reasoning: an orchestration of modular, interactive, and self-improving components where compression enables adaptive behavior. Drawing on advances in neurosymbolic systems, reinforcement learning, and cognitive scaffolding, we explore how recent architectures begin to bridge the gap between statistical learning and goal-directed cognition. Finally, we identify key scientific, technical, and ethical challenges on the path to AGI.

[426] arXiv:2507.00961 [pdf, html, other]
Title: Digital Collections Explorer: An Open-Source, Multimodal Viewer for Searching Digital Collections
Ying-Hsiang Huang, Benjamin Charles Germain Lee
Comments: 14 pages, 8 figures, 2 tables
Subjects: Digital Libraries (cs.DL); Information Retrieval (cs.IR)

We present Digital Collections Explorer, a web-based, open-source exploratory search platform that leverages CLIP (Contrastive Language-Image Pre-training) for enhanced visual discovery of digital collections. Our Digital Collections Explorer can be installed locally and configured to run on a visual collection of interest on disk in just a few steps. Building upon recent advances in multimodal search techniques, our interface enables natural language queries and reverse image searches over digital collections with visual features. This paper describes the system's architecture, implementation, and application to various cultural heritage collections, demonstrating its potential for democratizing access to digital archives, especially those with impoverished metadata. We present case studies with maps, photographs, and PDFs extracted from web archives in order to demonstrate the flexibility of the Digital Collections Explorer, as well as its ease of use. We demonstrate that the Digital Collections Explorer scales to hundreds of thousands of images on a MacBook Pro with an M4 chip. Lastly, we host a public demo of Digital Collections Explorer.

[427] arXiv:2507.00963 [pdf, html, other]
Title: Social Robots for People with Dementia: A Literature Review on Deception from Design to Perception
Fan Wang, Giulia Perugia, Yuan Feng, Wijnand IJsselsteijn
Subjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Robotics (cs.RO)

As social robots increasingly enter dementia care, concerns about deception, intentional or not, are gaining attention. Yet, how robotic design cues might elicit misleading perceptions in people with dementia, and how these perceptions arise, remains insufficiently understood. In this scoping review, we examined 26 empirical studies on interactions between people with dementia and physical social robots. We identify four key design cue categories that may influence deceptive impressions: cues resembling physiological signs (e.g., simulated breathing), social intentions (e.g., playful movement), familiar beings (e.g., animal-like form and sound), and, to a lesser extent, cues that reveal artificiality. Thematic analysis of user responses reveals that people with dementia often attribute biological, social, and mental capacities to robots, dynamically shifting between awareness and illusion. These findings underscore the fluctuating nature of ontological perception in dementia contexts. Existing definitions of robotic deception often rest on philosophical or behaviorist premises, but rarely engage with the cognitive mechanisms involved. We propose an empirically grounded definition: robotic deception occurs when Type 1 (automatic, heuristic) processing dominates over Type 2 (deliberative, analytic) reasoning, leading to misinterpretation of a robot's artificial nature. This dual-process perspective highlights the ethical complexity of social robots in dementia care and calls for design approaches that are not only engaging, but also epistemically respectful.

[428] arXiv:2507.00964 [pdf, html, other]
Title: Benchmarking the Discovery Engine
Jack Foxabbott, Arush Tagade, Andrew Cusick, Robbie McCorkell, Leo McKee-Reid, Jugal Patel, Jamie Rumbelow, Jessica Rumbelow, Zohreh Shams
Comments: 16 pages, 8 figures, benchmarks Discovery Engine on five scientific datasets (medicine, materials science, climate, air quality, social science)
Subjects: Machine Learning (cs.LG)

The Discovery Engine is a general purpose automated system for scientific discovery, which combines machine learning with state-of-the-art ML interpretability to enable rapid and robust scientific insight across diverse datasets. In this paper, we benchmark the Discovery Engine against five recent peer-reviewed scientific publications applying machine learning across medicine, materials science, social science, and environmental science. In each case, the Discovery Engine matches or exceeds prior predictive performance while also generating deeper, more actionable insights through rich interpretability artefacts. These results demonstrate its potential as a new standard for automated, interpretable scientific modelling that enables complex knowledge discovery from data.

[429] arXiv:2507.00965 [pdf, html, other]
Title: Scalable Feature Learning on Huge Knowledge Graphs for Downstream Machine Learning
Félix Lefebvre, Gaël Varoquaux
Subjects: Machine Learning (cs.LG)

Many machine learning tasks can benefit from external knowledge. Large knowledge graphs store such knowledge, and embedding methods can be used to distill it into ready-to-use vector representations for downstream applications. For this purpose, current models have however two limitations: they are primarily optimized for link prediction, via local contrastive learning, and they struggle to scale to the largest graphs due to GPU memory limits. To address these, we introduce SEPAL: a Scalable Embedding Propagation ALgorithm for large knowledge graphs designed to produce high-quality embeddings for downstream tasks at scale. The key idea of SEPAL is to enforce global embedding alignment by optimizing embeddings only on a small core of entities, and then propagating them to the rest of the graph via message passing. We evaluate SEPAL on 7 large-scale knowledge graphs and 46 downstream machine learning tasks. Our results show that SEPAL significantly outperforms previous methods on downstream tasks. In addition, SEPAL scales up its base embedding model, enabling fitting huge knowledge graphs on commodity hardware.

[430] arXiv:2507.00966 [pdf, html, other]
Title: MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement
Nikolai Lund Kühne, Jesper Jensen, Jan Østergaard, Zheng-Hua Tan
Comments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing for possible publication
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

With the advent of new sequence models like Mamba and xLSTM, several studies have shown that these models match or outperform state-of-the-art models in single-channel speech enhancement, automatic speech recognition, and self-supervised audio representation learning. However, prior research has demonstrated that sequence models like LSTM and Mamba tend to overfit to the training set. To address this issue, previous works have shown that adding self-attention to LSTMs substantially improves generalization performance for single-channel speech enhancement. Nevertheless, neither the concept of hybrid Mamba and time-frequency attention models nor their generalization performance have been explored for speech enhancement. In this paper, we propose a novel hybrid architecture, MambAttention, which combines Mamba and shared time- and frequency-multi-head attention modules for generalizable single-channel speech enhancement. To train our model, we introduce VoiceBank+Demand Extended (VB-DemandEx), a dataset inspired by VoiceBank+Demand but with more challenging noise types and lower signal-to-noise ratios. Trained on VB-DemandEx, our proposed MambAttention model significantly outperforms existing state-of-the-art LSTM-, xLSTM-, Mamba-, and Conformer-based systems of similar complexity across all reported metrics on two out-of-domain datasets: DNS 2020 and EARS-WHAM_v2, while matching their performance on the in-domain dataset VB-DemandEx. Ablation studies highlight the role of weight sharing between the time- and frequency-multi-head attention modules for generalization performance. Finally, we explore integrating the shared time- and frequency-multi-head attention modules with LSTM and xLSTM, which yields a notable performance improvement on the out-of-domain datasets. However, our MambAttention model remains superior on both out-of-domain datasets across all reported evaluation metrics.

[431] arXiv:2507.00969 [pdf, html, other]
Title: Surgical Neural Radiance Fields from One Image
Alberto Neri, Maximilan Fehrentz, Veronica Penza, Leonardo S. Mattos, Nazim Haouchine
Journal-ref: Int J CARS (2025)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Purpose: Neural Radiance Fields (NeRF) offer exceptional capabilities for 3D reconstruction and view synthesis, yet their reliance on extensive multi-view data limits their application in surgical intraoperative settings where only limited data is available. In particular, collecting such extensive data intraoperatively is impractical due to time constraints. This work addresses this challenge by leveraging a single intraoperative image and preoperative data to train NeRF efficiently for surgical scenarios.
Methods: We leverage preoperative MRI data to define the set of camera viewpoints and images needed for robust and unobstructed training. Intraoperatively, the appearance of the surgical image is transferred to the pre-constructed training set through neural style transfer, specifically combining WTC2 and STROTSS to prevent over-stylization. This process enables the creation of a dataset for instant and fast single-image NeRF training.
Results: The method is evaluated with four clinical neurosurgical cases. Quantitative comparisons to NeRF models trained on real surgical microscope images demonstrate strong synthesis agreement, with similarity metrics indicating high reconstruction fidelity and stylistic alignment. When compared with ground truth, our method demonstrates high structural similarity, confirming good reconstruction quality and texture preservation.
Conclusion: Our approach demonstrates the feasibility of single-image NeRF training in surgical settings, overcoming the limitations of traditional multi-view methods.

[432] arXiv:2507.00971 [pdf, other]
Title: Reasoning as an Adaptive Defense for Safety
Taeyoun Kim, Fahim Tajwar, Aditi Raghunathan, Aviral Kumar
Comments: 42 pages, 11 Figures, 7 Tables
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Reasoning methods that adaptively allocate test-time compute have advanced LLM performance on easy to verify domains such as math and code. In this work, we study how to utilize this approach to train models that exhibit a degree of robustness to safety vulnerabilities, and show that doing so can provide benefits. We build a recipe called $\textit{TARS}$ (Training Adaptive Reasoners for Safety), a reinforcement learning (RL) approach that trains models to reason about safety using chain-of-thought traces and a reward signal that balances safety with task completion. To build TARS, we identify three critical design choices: (1) a "lightweight" warmstart SFT stage, (2) a mix of harmful, harmless, and ambiguous prompts to prevent shortcut behaviors such as too many refusals, and (3) a reward function to prevent degeneration of reasoning capabilities during training. Models trained with TARS exhibit adaptive behaviors by spending more compute on ambiguous queries, leading to better safety-refusal trade-offs. They also internally learn to better distinguish between safe and unsafe prompts and attain greater robustness to both white-box (e.g., GCG) and black-box attacks (e.g., PAIR). Overall, our work provides an effective, open recipe for training LLMs against jailbreaks and harmful requests by reasoning per prompt.

[433] arXiv:2507.00976 [pdf, html, other]
Title: Anatomy of High-Performance Column-Pivoted QR Decomposition
Maksim Melnichenko, Riley Murray, William Killian, James Demmel, Michael W. Mahoney, Piotr Luszczek, Mark Gates
Comments: v1: 33 pages in the body, 7 pages in the appendices, 17 figures
Subjects: Mathematical Software (cs.MS); Numerical Analysis (math.NA)

We introduce an algorithmic framework for performing QR factorization with column pivoting (QRCP) on general matrices. The framework enables the design of practical QRCP algorithms through user-controlled choices for the core subroutines. We provide a comprehensive overview of how to navigate these choices on modern hardware platforms, offering detailed descriptions of alternative methods for both CPUs and GPUs. The practical QRCP algorithms developed within this framework are implemented as part of the open-source RandLAPACK library. Our empirical evaluation demonstrates that, on a dual AMD EPYC 9734 system, the proposed method achieves performance improvements of up to two orders of magnitude over LAPACK's standard QRCP routine and greatly surpasses the performance of the current state-of-the-art randomized QRCP algorithm. Additionally, on an NVIDIA H100 GPU, our method attains approximately 65 percent of the performance of cuSOLVER's unpivoted QR factorization.

[434] arXiv:2507.00979 [pdf, other]
Title: Enhancing LLM Agent Safety via Causal Influence Prompting
Dongyoon Hahm, Woogyeol Jin, June Suk Choi, Sungsoo Ahn, Kimin Lee
Comments: Accepted at ACL 2025 Findings, Source code: this https URL
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

As autonomous agents powered by large language models (LLMs) continue to demonstrate potential across various assistive tasks, ensuring their safe and reliable behavior is crucial for preventing unintended consequences. In this work, we introduce CIP, a novel technique that leverages causal influence diagrams (CIDs) to identify and mitigate risks arising from agent decision-making. CIDs provide a structured representation of cause-and-effect relationships, enabling agents to anticipate harmful outcomes and make safer decisions. Our approach consists of three key steps: (1) initializing a CID based on task specifications to outline the decision-making process, (2) guiding agent interactions with the environment using the CID, and (3) iteratively refining the CID based on observed behaviors and outcomes. Experimental results demonstrate that our method effectively enhances safety in both code execution and mobile device control tasks.

[435] arXiv:2507.00980 [pdf, html, other]
Title: RTMap: Real-Time Recursive Mapping with Change Detection and Localization
Yuheng Du, Sheng Yang, Lingxuan Wang, Zhenghua Hou, Chengying Cai, Zhitao Tan, Mingxia Chen, Shi-Sheng Huang, Qiang Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)

While recent online HD mapping methods relieve burdened offline pipelines and solve map freshness, they remain limited by perceptual inaccuracies, occlusion in dense traffic, and an inability to fuse multi-agent observations. We propose RTMap to enhance these single-traversal methods by persistently crowdsourcing a multi-traversal HD map as a self-evolutional memory. On onboard agents, RTMap simultaneously addresses three core challenges in an end-to-end fashion: (1) Uncertainty-aware positional modeling for HD map elements, (2) probabilistic-aware localization w.r.t. the crowdsourced prior-map, and (3) real-time detection for possible road structural changes. Experiments on several public autonomous driving datasets demonstrate our solid performance on both the prior-aided map quality and the localization accuracy, demonstrating our effectiveness of robustly serving downstream prediction and planning modules while gradually improving the accuracy and freshness of the crowdsourced prior-map asynchronously. Our source-code will be made publicly available at this https URL (Camera ready version incorporating reviewer suggestions will be updated soon).

[436] arXiv:2507.00981 [pdf, html, other]
Title: Evaluating Robustness of Monocular Depth Estimation with Procedural Scene Perturbations
Jack Nugent, Siyang Wu, Zeyu Ma, Beining Han, Meenal Parakh, Abhishek Joshi, Lingjie Mei, Alexander Raistrick, Xinyuan Li, Jia Deng
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recent years have witnessed substantial progress on monocular depth estimation, particularly as measured by the success of large models on standard benchmarks. However, performance on standard benchmarks does not offer a complete assessment, because most evaluate accuracy but not robustness. In this work, we introduce PDE (Procedural Depth Evaluation), a new benchmark which enables systematic robustness evaluation. PDE uses procedural generation to create 3D scenes that test robustness to various controlled perturbations, including object, camera, material and lighting changes. Our analysis yields interesting findings on what perturbations are challenging for state-of-the-art depth models, which we hope will inform further research. Code and data are available at this https URL.

[437] arXiv:2507.00984 [pdf, html, other]
Title: Box Pose and Shape Estimation and Domain Adaptation for Large-Scale Warehouse Automation
Xihang Yu, Rajat Talak, Jingnan Shi, Ulrich Viereck, Igor Gilitschenski, Luca Carlone
Comments: 12 pages, 6 figures. This work will be presented at the 19th International Symposium on Experimental Robotics (ISER2025)
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Modern warehouse automation systems rely on fleets of intelligent robots that generate vast amounts of data -- most of which remains unannotated. This paper develops a self-supervised domain adaptation pipeline that leverages real-world, unlabeled data to improve perception models without requiring manual annotations. Our work focuses specifically on estimating the pose and shape of boxes and presents a correct-and-certify pipeline for self-supervised box pose and shape estimation. We extensively evaluate our approach across a range of simulated and real industrial settings, including adaptation to a large-scale real-world dataset of 50,000 images. The self-supervised model significantly outperforms models trained solely in simulation and shows substantial improvements over a zero-shot 3D bounding box estimation baseline.

[438] arXiv:2507.00985 [pdf, html, other]
Title: Discourse Heuristics For Paradoxically Moral Self-Correction
Guangliang Liu, Zimo Qi, Xitong Zhang, Kristen Marie Johnson
Subjects: Computation and Language (cs.CL)

Moral self-correction has emerged as a promising approach for aligning the output of Large Language Models (LLMs) with human moral values. However, moral self-correction techniques are subject to two primary paradoxes. First, despite empirical and theoretical evidence to support the effectiveness of self-correction, this LLM capability only operates at a superficial level. Second, while LLMs possess the capability of self-diagnosing immoral aspects of their output, they struggle to identify the cause of this moral inconsistency during their self-correction process. To better understand and address these paradoxes, we analyze the discourse constructions in fine-tuning corpora designed to enhance moral self-correction, uncovering the existence of the heuristics underlying effective constructions. We demonstrate that moral self-correction relies on discourse constructions that reflect heuristic shortcuts, and that the presence of these heuristic shortcuts during self-correction leads to inconsistency when attempting to enhance both self-correction and self-diagnosis capabilities jointly. Based on our findings, we propose a solution to improve moral self-correction by leveraging the heuristics of curated datasets. We also highlight the generalization challenges of this capability, particularly in terms of learning from situated context and model scales.

[439] arXiv:2507.00990 [pdf, html, other]
Title: Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations
Shivansh Patel, Shraddhaa Mohan, Hanlin Mai, Unnat Jain, Svetlana Lazebnik, Yunzhu Li
Comments: Project Page: this https URL
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

This work introduces Robots Imitating Generated Videos (RIGVid), a system that enables robots to perform complex manipulation tasks--such as pouring, wiping, and mixing--purely by imitating AI-generated videos, without requiring any physical demonstrations or robot-specific training. Given a language command and an initial scene image, a video diffusion model generates potential demonstration videos, and a vision-language model (VLM) automatically filters out results that do not follow the command. A 6D pose tracker then extracts object trajectories from the video, and the trajectories are retargeted to the robot in an embodiment-agnostic fashion. Through extensive real-world evaluations, we show that filtered generated videos are as effective as real demonstrations, and that performance improves with generation quality. We also show that relying on generated videos outperforms more compact alternatives such as keypoint prediction using VLMs, and that strong 6D pose tracking outperforms other ways to extract trajectories, such as dense feature point tracking. These findings suggest that videos produced by a state-of-the-art off-the-shelf model can offer an effective source of supervision for robotic manipulation.

[440] arXiv:2507.00992 [pdf, html, other]
Title: UniGlyph: Unified Segmentation-Conditioned Diffusion for Precise Visual Text Synthesis
Yuanrui Wang, Cong Han, YafeiLi, Zhipeng Jin, Xiawei Li, SiNan Du, Wen Tao, Yi Yang, shuanglong li, Chun Yuan, Liu Lin
Comments: Accepted by ICCV 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Text-to-image generation has greatly advanced content creation, yet accurately rendering visual text remains a key challenge due to blurred glyphs, semantic drift, and limited style control. Existing methods often rely on pre-rendered glyph images as conditions, but these struggle to retain original font styles and color cues, necessitating complex multi-branch designs that increase model overhead and reduce flexibility. To address these issues, we propose a segmentation-guided framework that uses pixel-level visual text masks -- rich in glyph shape, color, and spatial detail -- as unified conditional inputs. Our method introduces two core components: (1) a fine-tuned bilingual segmentation model for precise text mask extraction, and (2) a streamlined diffusion model augmented with adaptive glyph conditioning and a region-specific loss to preserve textual fidelity in both content and style. Our approach achieves state-of-the-art performance on the AnyText benchmark, significantly surpassing prior methods in both Chinese and English settings. To enable more rigorous evaluation, we also introduce two new benchmarks: GlyphMM-benchmark for testing layout and glyph consistency in complex typesetting, and MiniText-benchmark for assessing generation quality in small-scale text regions. Experimental results show that our model outperforms existing methods by a large margin in both scenarios, particularly excelling at small text rendering and complex layout preservation, validating its strong generalization and deployment readiness.

[441] arXiv:2507.00994 [pdf, html, other]
Title: Should We Still Pretrain Encoders with Masked Language Modeling?
Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Manuel Faysse, Duarte M. Alves, Emmanuel Malherbe, André F. T. Martins, Céline Hudelot, Pierre Colombo
Comments: 23 pages, 10 figures, 17 tables
Subjects: Computation and Language (cs.CL)

Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 30 models ranging from 210 million to 1 billion parameters, and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM, achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models (from the existing LLM ecosystem), reducing the computational burden needed to train best-in-class encoder models. We release all project artifacts at this https URL to foster further research.

[442] arXiv:2507.00997 [pdf, html, other]
Title: Geometrization of Higher-Order Linear Control Laws for Attitude Control on $\mathsf{SO(3)}$
Farooq Aslam, Hafiz Zeeshan Iqbal Khan, Muhammad Farooq Haydar, Suhail Akhtar, Jamshed Riaz
Comments: 14 pages, 5 figures
Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)

This paper presents a theoretical framework for analyzing the stability of higher-order geometric nonlinear control laws for attitude control on the Special Orthogonal Group $\mathrm{SO(3)}$. In particular, the paper extends existing results on the analysis of PID-type geometric nonlinear control laws to more general higher-order dynamic state-feedback compensators on $\mathrm{SO(3)}$. The candidate Lyapunov function is motivated by quadratic Lyapunov functions of the form $V(x)=x^{\top}Px$ typically considered in the analysis of linear time-invariant (LTI) systems. The stability analysis is carried out in two steps. In the first step, a sufficient condition is obtained for the positive definiteness of the candidate Lyapunov function, and a necessary and sufficient condition for the negative definiteness of the corresponding Lyapunov rate. These conditions ensure that the desired equilibrium is almost globally asymptotically stable (AGAS). In the second step, a convex relaxation of the proposed conditions is used to obtain sufficient conditions in the form of linear matrix inequalities (LMIs). Overall, the approach is motivated by the widespread use of LMI-based analysis and design tools for LTI systems. To reduce conservatism, matrix gains are considered for the controller gains as well as the Lyapunov function coefficients. The applicability of the approach to practical problems is illustrated by designing and analyzing a 21-state geometric nonlinear attitude control law for a multicopter.

[443] arXiv:2507.00999 [pdf, html, other]
Title: La Leaderboard: A Large Language Model Leaderboard for Spanish Varieties and Languages of Spain and Latin America
María Grandury, Javier Aula-Blasco, Júlia Falcão, Clémentine Fourrier, Miguel González, Gonzalo Martínez, Gonzalo Santamaría, Rodrigo Agerri, Nuria Aldama, Luis Chiruzzo, Javier Conde, Helena Gómez, Marta Guerrero, Guido Ivetta, Natalia López, Flor Miriam Plaza-del-Arco, María Teresa Martín-Valdivia, Helena Montoro, Carmen Muñoz, Pedro Reviriego, Leire Rosado, Alejandro Vaca, María Estrella Vallecillo-Rodríguez, Jorge Vallego, Irune Zubiaga
Comments: Accepted at ACL 2025 Main
Subjects: Computation and Language (cs.CL)

Leaderboards showcase the current capabilities and limitations of Large Language Models (LLMs). To motivate the development of LLMs that represent the linguistic and cultural diversity of the Spanish-speaking community, we present La Leaderboard, the first open-source leaderboard to evaluate generative LLMs in languages and language varieties of Spain and Latin America. La Leaderboard is a community-driven project that aims to establish an evaluation standard for everyone interested in developing LLMs for the Spanish-speaking community. This initial version combines 66 datasets in Basque, Catalan, Galician, and different Spanish varieties, showcasing the evaluation results of 50 models. To encourage community-driven development of leaderboards in other languages, we explain our methodology, including guidance on selecting the most suitable evaluation setup for each downstream task. In particular, we provide a rationale for using fewer few-shot examples than typically found in the literature, aiming to reduce environmental impact and facilitate access to reproducible results for a broader research community.

[444] arXiv:2507.01001 [pdf, other]
Title: SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks
Yilun Zhao, Kaiyan Zhang, Tiansheng Hu, Sihong Wu, Ronan Le Bras, Taira Anderson, Jonathan Bragg, Joseph Chee Chang, Jesse Dodge, Matt Latzke, Yixin Liu, Charles McGrady, Xiangru Tang, Zihang Wang, Chen Zhao, Hannaneh Hajishirzi, Doug Downey, Arman Cohan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature tasks. Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the research community directly, following the Chatbot Arena evaluation approach of community voting on model comparisons. By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks that demand literature-grounded, long-form responses. The platform currently supports 23 open-source and proprietary foundation models and has collected over 13,000 votes from trusted researchers across diverse scientific domains. We analyze the data collected so far and confirm that the submitted questions are diverse, aligned with real-world literature needs, and that participating researchers demonstrate strong self-consistency and inter-annotator agreement in their evaluations. We discuss the results and insights based on the model ranking leaderboard. To further promote research in building model-based automated evaluation systems for literature tasks, we release SciArena-Eval, a meta-evaluation benchmark based on our collected preference data. The benchmark measures the accuracy of models in judging answer quality by comparing their pairwise assessments with human votes. Our experiments highlight the benchmark's challenges and emphasize the need for more reliable automated evaluation methods.

[445] arXiv:2507.01003 [pdf, html, other]
Title: Description of the Training Process of Neural Networks via Ergodic Theorem : Ghost nodes
Eun-Ji Park, Sangwon Yun
Comments: 9 pages, 2 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Recent studies have proposed interpreting the training process from an ergodic perspective. Building on this foundation we present a unified framework for understanding and accelerating the training of deep neural networks via stochastic gradient descent. By analyzing the geometric landscape of the objective function we introduce a practical diagnostic, the running estimate of the largest Lyapunov exponent, which provably distinguishes genuine convergence toward stable minimizers from mere statistical stabilization near saddle points. We then propose a ghost category extension for standard classifiers that adds auxiliary ghost output nodes so the model gains extra descent directions that open a lateral corridor around narrow loss barriers and enable the optimizer to bypass poor basins during the early training phase. We show that this extension strictly reduces approximation error and that after sufficient convergence the ghost dimensions collapse and the extended model's invariant law coincides with that of the original and there exists a path in the enlarged parameter space along which the total loss does not increase while the original loss decreases by an arbitrary margin. Taken together these results provide a principled architecture level intervention that accelerates early stage trainability while preserving asymptotic behavior.

[446] arXiv:2507.01004 [pdf, other]
Title: ZeCO: Zero Communication Overhead Sequence Parallelism for Linear Attention
Yuhong Chou, Zehao Liu, Ruijie Zhu, Xinyi Wan, Tianjian Li, Congying Chu, Qian Liu, Jibin Wu, Zejun Ma
Subjects: Machine Learning (cs.LG)

Linear attention mechanisms deliver significant advantages for Large Language Models (LLMs) by providing linear computational complexity, enabling efficient processing of ultra-long sequences (e.g., 1M context). However, existing Sequence Parallelism (SP) methods, essential for distributing these workloads across devices, become the primary bottleneck due to substantial communication overhead. In this paper, we introduce ZeCO (Zero Communication Overhead) sequence parallelism for linear attention models, a new SP method designed to overcome these limitations and achieve end-to-end near-linear scalability for long sequence training. For example, training a model with a 1M sequence length across 64 devices using ZeCO takes roughly the same time as training with an 16k sequence on a single device. At the heart of ZeCO lies All-Scan, a new collective communication primitive. All-Scan provides each SP rank with precisely the initial operator state it requires while maintaining a minimal communication footprint, effectively eliminating communication overhead. Theoretically, we prove the optimaity of ZeCO, showing that it introduces only negligible time and space overhead. Empirically, we compare the communication costs of different sequence parallelism strategies and demonstrate that All-Scan achieves the fastest communication in SP scenarios. Specifically, on 256 GPUs with an 8M sequence length, ZeCO achieves a 60\% speedup compared to the current state-of-the-art (SOTA) SP method. We believe ZeCO establishes a clear path toward efficiently training next-generation LLMs on previously intractable sequence lengths.

[447] arXiv:2507.01006 [pdf, html, other]
Title: GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Jiazheng Xu, Jiali Chen, Jing Chen, Jinhao Chen, Jinghao Lin, Jinjiang Wang, Junjie Chen, Leqi Lei, Leyi Pan, Mingzhi Zhang, Qinkai Zheng, Sheng Yang, Shi Zhong, Shiyu Huang, Shuyuan Zhao, Siyan Xue, Shangqin Tu, Shengbiao Meng, Tianshu Zhang, Tianwei Luo, Tianxiang Hao, Tianle Gong, Wenkai Li, Wei Jia, Xin Lyu, Xuancheng Huang, Yanling Wang, Yadong Xue, Yanfeng Wang, Yifan An, Yifan Du, Yiming Shi, Yiheng Huang, Yilin Niu, Yuan Wang, Yuanchang Yue, Yuchen Li, Yutao Zhang, Yuxuan Zhang, Zhanxiao Du, Zhenyu Hou, Zhao Xue, Zhengxiao Du, Zihan Wang, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Minlie Huang, Yuxiao Dong, Jie Tang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

We present GLM-4.1V-Thinking, a vision-language model (VLM) designed to advance general-purpose multimodal reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. Reinforcement Learning with Curriculum Sampling (RLCS) then unlocks the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document understanding, among others. To facilitate research in this field, we open-source GLM-4.1V-9B-Thinking, which achieves state-of-the-art performance among models of comparable size. In a comprehensive evaluation across 28 public benchmarks, our model outperforms Qwen2.5-VL-7B on nearly all tasks and achieves comparable or even superior performance on 18 benchmarks relative to the significantly larger Qwen2.5-VL-72B. Notably, GLM-4.1V-9B-Thinking also demonstrates competitive or superior performance compared to closed-source models such as GPT-4o on challenging tasks including long document understanding and STEM reasoning, further underscoring its strong capabilities. Code, models and more information are released at this https URL.

[448] arXiv:2507.01008 [pdf, html, other]
Title: DexWrist: A Robotic Wrist for Constrained and Dynamic Manipulation
Martin Peticco, Gabriella Ulloa, John Marangola, Pulkit Agrawal
Comments: More details about the wrist can be found at: this http URL
Subjects: Robotics (cs.RO)

We present the DexWrist, a compliant robotic wrist designed to advance robotic manipulation in highly-constrained environments, enable dynamic tasks, and speed up data collection. DexWrist is designed to be close to the functional capabilities of the human wrist and achieves mechanical compliance and a greater workspace as compared to existing robotic wrist designs. The DexWrist can supercharge policy learning by (i) enabling faster teleoperation and therefore making data collection more scalable; (ii) completing tasks in fewer steps which reduces trajectory lengths and therefore can ease policy learning; (iii) DexWrist is designed to be torque transparent with easily simulatable kinematics for simulated data collection; and (iv) most importantly expands the workspace of manipulation for approaching highly cluttered scenes and tasks. More details about the wrist can be found at: this http URL.

[449] arXiv:2507.01009 [pdf, html, other]
Title: ShapeEmbed: a self-supervised learning framework for 2D contour quantification
Anna Foix Romero, Craig Russell, Alexander Krull, Virginie Uhlmann
Subjects: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)

The shape of objects is an important source of visual information in a wide range of applications. One of the core challenges of shape quantification is to ensure that the extracted measurements remain invariant to transformations that preserve an object's intrinsic geometry, such as changing its size, orientation, and position in the image. In this work, we introduce ShapeEmbed, a self-supervised representation learning framework designed to encode the contour of objects in 2D images, represented as a Euclidean distance matrix, into a shape descriptor that is invariant to translation, scaling, rotation, reflection, and point indexing. Our approach overcomes the limitations of traditional shape descriptors while improving upon existing state-of-the-art autoencoder-based approaches. We demonstrate that the descriptors learned by our framework outperform their competitors in shape classification tasks on natural and biological images. We envision our approach to be of particular relevance to biological imaging applications.

[450] arXiv:2507.01012 [pdf, html, other]
Title: DAM-VSR: Disentanglement of Appearance and Motion for Video Super-Resolution
Zhe Kong, Le Li, Yong Zhang, Feng Gao, Shaoshu Yang, Tao Wang, Kaihao Zhang, Zhuoliang Kang, Xiaoming Wei, Guanying Chen, Wenhan Luo
Comments: Accepted by ACM SIGGRAPH 2025, Homepage: this https URL Github: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Real-world video super-resolution (VSR) presents significant challenges due to complex and unpredictable degradations. Although some recent methods utilize image diffusion models for VSR and have shown improved detail generation capabilities, they still struggle to produce temporally consistent frames. We attempt to use Stable Video Diffusion (SVD) combined with ControlNet to address this issue. However, due to the intrinsic image-animation characteristics of SVD, it is challenging to generate fine details using only low-quality videos. To tackle this problem, we propose DAM-VSR, an appearance and motion disentanglement framework for VSR. This framework disentangles VSR into appearance enhancement and motion control problems. Specifically, appearance enhancement is achieved through reference image super-resolution, while motion control is achieved through video ControlNet. This disentanglement fully leverages the generative prior of video diffusion models and the detail generation capabilities of image super-resolution models. Furthermore, equipped with the proposed motion-aligned bidirectional sampling strategy, DAM-VSR can conduct VSR on longer input videos. DAM-VSR achieves state-of-the-art performance on real-world data and AIGC data, demonstrating its powerful detail generation capabilities.

[451] arXiv:2507.01016 [pdf, html, other]
Title: VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers
Yating Wang, Haoyi Zhu, Mingyu Liu, Jiange Yang, Hao-Shu Fang, Tong He
Comments: Accepted by ICCV 2025
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

In this paper, we introduce an innovative vector quantization based action tokenizer built upon the largest-scale action trajectory dataset to date, leveraging over 100 times more data than previous approaches. This extensive dataset enables our tokenizer to capture rich spatiotemporal dynamics, resulting in a model that not only accelerates inference but also generates smoother and more coherent action outputs. Once trained, the tokenizer can be seamlessly adapted to a wide range of downstream tasks in a zero-shot manner, from short-horizon reactive behaviors to long-horizon planning. A key finding of our work is that the domain gap between synthetic and real action trajectories is marginal, allowing us to effectively utilize a vast amount of synthetic data during training without compromising real-world performance. To validate our approach, we conducted extensive experiments in both simulated environments and on real robotic platforms. The results demonstrate that as the volume of synthetic trajectory data increases, the performance of our tokenizer on downstream tasks improves significantly-most notably, achieving up to a 30% higher success rate on two real-world tasks in long-horizon scenarios. These findings highlight the potential of our action tokenizer as a robust and scalable solution for real-time embodied intelligence systems, paving the way for more efficient and reliable robotic control in diverse application this http URL website: this https URL

[452] arXiv:2507.01017 [pdf, html, other]
Title: A Comprehensive Review of Human Error in Risk-Informed Decision Making: Integrating Human Reliability Assessment, Artificial Intelligence, and Human Performance Models
Xingyu Xiao, Hongxu Zhu, Jingang Liang, Jiejuan Tong, Haitao Wang
Subjects: Human-Computer Interaction (cs.HC)

Human error remains a dominant risk driver in safety-critical sectors such as nuclear power, aviation, and healthcare, where seemingly minor mistakes can cascade into catastrophic outcomes. Although decades of research have produced a rich repertoire of mitigation techniques, persistent limitations: scarce high-quality data, algorithmic opacity, and residual reliance on expert judgment, continue to constrain progress. This review synthesizes recent advances at the intersection of risk-informed decision making, human reliability assessment (HRA), artificial intelligence (AI), and cognitive science to clarify how their convergence can curb human-error risk. We first categorize the principal forms of human error observed in complex sociotechnical environments and outline their quantitative impact on system reliability. Next, we examine risk-informed frameworks that embed HRA within probabilistic and data-driven methodologies, highlighting successes and gaps. We then survey cognitive and human-performance models, detailing how mechanistic accounts of perception, memory, and decision-making enrich error prediction and complement HRA metrics. Building on these foundations, we critically assess AI-enabled techniques for real-time error detection, operator-state estimation, and AI-augmented HRA workflows. Across these strands, a recurring insight emerges: integrating cognitive models with AI-based analytics inside risk-informed HRA pipelines markedly enhances predictive fidelity, yet doing so demands richer datasets, transparent algorithms, and rigorous validation. Finally, we identify promising research directions, coupling resilience engineering concepts with grounded theory, operationalizing the iceberg model of incident causation, and establishing cross-domain data consortia, to foster a multidisciplinary paradigm that elevates human reliability in high-stakes systems.

Cross submissions (showing 77 of 77 entries)

[453] arXiv:1809.08467 (cross-list from math.CO) [pdf, other]
Title: On Bivariegated Graphs and Line Graphs
Ranjan N. Naik
Subjects: Combinatorics (math.CO); Discrete Mathematics (cs.DM)

This note is on the structures of line graphs and 2-variegated graphs. We have given here solutions of some graph equations involving line graphs and 2-variegated graphs. In addition, a characterization of potentially 2-variegated line graphic degree sequences is given.

[454] arXiv:1809.08472 (cross-list from math.CO) [pdf, other]
Title: On Intersection Graphs of Graphs and Hypergraphs: A Survey
Ranjan N. Naik
Subjects: Combinatorics (math.CO); Discrete Mathematics (cs.DM)

In this survey, we have attempted to show some developmental milestones on the characterizations of intersection graphs of hypergraphs. The theory of intersection graphs of hypergraphs has been a classical topic in the theory of special graphs. To conclude, at the end, we have listed some open problems posed by various authors whose work has contributed to this survey and also the new trends coming out of intersection graphs. Keywords: Hypergraphs, Intersection graphs, Line graphs, Representative graphs, Derived graphs, Algorithms (ALG), Forbidden induced subgraphs (FIS), Krausz partitions, Eigenvalues.

[455] arXiv:2503.04995 (cross-list from eess.AS) [pdf, html, other]
Title: Musical Source Separation of Brazilian Percussion
Richa Namballa, Giovana Morais, Magdalena Fuentes
Comments: 2 pages + references, 1 figure, 1 table, Extended Abstracts for the Late-Breaking Demo Session of the 25th International Society for Music Information Retrieval Conference
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)

Musical source separation (MSS) has recently seen a big breakthrough in separating instruments from a mixture in the context of Western music, but research on non-Western instruments is still limited due to a lack of data. In this demo, we use an existing dataset of Brazilian sama percussion to create artificial mixtures for training a U-Net model to separate the surdo drum, a traditional instrument in samba. Despite limited training data, the model effectively isolates the surdo, given the drum's repetitive patterns and its characteristic low-pitched timbre. These results suggest that MSS systems can be successfully harnessed to work in more culturally-inclusive scenarios without the need of collecting extensive amounts of data.

[456] arXiv:2507.00051 (cross-list from eess.IV) [pdf, html, other]
Title: Real-Time Guidewire Tip Tracking Using a Siamese Network for Image-Guided Endovascular Procedures
Tianliang Yao, Zhiqiang Pei, Yong Li, Yixuan Yuan, Peng Qi
Comments: This paper has been accepted by Advanced Intelligent Systems
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

An ever-growing incorporation of AI solutions into clinical practices enhances the efficiency and effectiveness of healthcare services. This paper focuses on guidewire tip tracking tasks during image-guided therapy for cardiovascular diseases, aiding physicians in improving diagnostic and therapeutic quality. A novel tracking framework based on a Siamese network with dual attention mechanisms combines self- and cross-attention strategies for robust guidewire tip tracking. This design handles visual ambiguities, tissue deformations, and imaging artifacts through enhanced spatial-temporal feature learning. Validation occurred on 3 randomly selected clinical digital subtraction angiography (DSA) sequences from a dataset of 15 sequences, covering multiple interventional scenarios. The results indicate a mean localization error of 0.421 $\pm$ 0.138 mm, with a maximum error of 1.736 mm, and a mean Intersection over Union (IoU) of 0.782. The framework maintains an average processing speed of 57.2 frames per second, meeting the temporal demands of endovascular imaging. Further validations with robotic platforms for automating diagnostics and therapies in clinical routines yielded tracking errors of 0.708 $\pm$ 0.695 mm and 0.148 $\pm$ 0.057 mm in two distinct experimental scenarios.

[457] arXiv:2507.00062 (cross-list from physics.soc-ph) [pdf, html, other]
Title: Enhancing Car-Following Models with Bike Dynamics for Improved Traffic Simulation
Nico Ostendorf, Keno Garlichs, Lars Wolf
Subjects: Physics and Society (physics.soc-ph); Systems and Control (eess.SY)

Road traffic simulations are crucial for establishing safe and efficient traffic environments. They are used to test various road applications before real-world implementation. SUMO is a well-known simulator for road networks and intermodal traffic, often used in conjunction with other tools to test various types of applications. Realistic simulations require accurate movement models for different road users, such as cars, bicycles, and buses. While realistic models are already implemented for most vehicle types, bicycles, which are essential for achieving safe and efficient traffic, can only be modeled as slow vehicles or fast pedestrians at present. This paper introduces the Realistic Bicycle Dynamics Model (RBDM), the first dedicated bicycle model for SUMO, addressing this significant gap. Leveraging real-world bicycle data from the SimRa dataset, the RBDM implements realistic speed, acceleration, and deceleration behaviors of bicycles in urban scenarios. The evaluation is conducted using the Monaco SUMO traffic scenario and a newly generated Berlin scenario in SUMO. The RBDM significantly outperforms the existing slow-vehicle approximation in SUMO, aligning more closely with real-world data. These results underscore the necessity of a realistic bicycle movement model for accurate simulations, given the significant differences in the movement profiles of bicycles, cars, and pedestrians. Furthermore, the model is tested for its ability to generalize to disparate scenarios and urban topologies, which is dependent on the manner and geographical region in which the SimRa data were gathered. In addition, recommendations are provided for how it could be adapted for use in different city topologies.

[458] arXiv:2507.00065 (cross-list from quant-ph) [pdf, html, other]
Title: Segmentation-Based Regression for Quantum Neural Networks
James C. Hateley
Subjects: Quantum Physics (quant-ph); Numerical Analysis (math.NA)

Recent advances in quantum hardware motivate the development of algorithmic frameworks that integrate quantum sampling with classical inference. This work introduces a segmentation-based regression method tailored to quantum neural networks (QNNs), where real-valued outputs are encoded as base-b digit sequences and inferred through greedy digitwise optimization. By casting the regression task as a constrained combinatorial problem over a structured digit lattice, the method replaces continuous inference with interpretable and tractable updates. A hybrid quantum-classical architecture is employed: quantum circuits generate candidate digits through projective measurement, while classical forward models evaluate these candidates based on task-specific error functionals. We formalize the algorithm from first principles, derive convergence and complexity bounds, and demonstrate its effectiveness on inverse problems involving PDE-constrained models. The resulting framework provides a robust, high-precision interface between quantum outputs and continuous scientific inference.

[459] arXiv:2507.00067 (cross-list from physics.soc-ph) [pdf, other]
Title: The gradual transformation of inland countries -- human plowing, horse plowing and equity incentives
Hongfa Zi, Zhen Liu
Comments: 9 pages,2 figures
Subjects: Physics and Society (physics.soc-ph); Computational Engineering, Finance, and Science (cs.CE); General Economics (econ.GN)

Many modern countries have not learned their lessons and often hope for the wisdom of later generations, resulting in them only possessing modern technology and difficult to iterate ancient civilizations. At present, there is no way to tell how we should learn from history and promote the gradual upgrading of civilization. Therefore, we must tell the history of civilization's progress and the means of governance, learn from experience to improve the comprehensive strength and survival ability of civilization, and achieve an optimal solution for the tempering brought by conflicts and the reduction of internal conflicts. Firstly, we must follow the footsteps of history and explore the reasons for the long-term stability of each country in conflict, including providing economic benefits to the people and means of suppressing them; then, use mathematical methods to demonstrate how we can achieve the optimal solution at the current stage. After analysis, we can conclude that the civilization transformed from human plowing to horse plowing can easily suppress the resistance of the people and provide them with the ability to resist; The selection of rulers should consider multiple institutional aspects, such as exams, elections, and drawing lots; Economic development follows a lognormal distribution and can be adjusted by expected value and variance. Using a lognormal distribution with the maximum value to divide equity can adjust the wealth gap.

[460] arXiv:2507.00076 (cross-list from astro-ph.IM) [pdf, html, other]
Title: Time Invariant Sensor Tasking for Catalog Maintenance of LEO Space objects using Stochastic Geometry
Partha Chowdhury, Harsha M, Chinni Prabhunath Georg, Arun Balaji Buduru, Sanat K Biswas
Comments: This work has been accepted and presented at the 35th AAS/AIAA Space Flight Mechanics Meeting, 2025, Kaua'i, Hawai
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Robotics (cs.RO); Systems and Control (eess.SY); Mathematical Physics (math-ph)

Catalog maintenance of space objects by limited number of ground-based sensors presents a formidable challenging task to the space community. This article presents a methodology for time-invariant tracking and surveillance of space objects in low Earth orbit (LEO) by optimally directing ground sensors. Our methodology aims to maximize the expected number of space objects from a set of ground stations by utilizing concepts from stochastic geometry, particularly the Poisson point process. We have provided a systematic framework to understand visibility patterns and enhance the efficiency of tracking multiple objects simultaneously. Our approach contributes to more informed decision-making in space operations, ultimately supporting efforts to maintain safety and sustainability in LEO.

[461] arXiv:2507.00088 (cross-list from physics.soc-ph) [pdf, html, other]
Title: How large language models judge and influence human cooperation
Alexandre S. Pires, Laurens Samson, Sennay Ghebreab, Fernando P. Santos
Subjects: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)

Humans increasingly rely on large language models (LLMs) to support decisions in social settings. Previous work suggests that such tools shape people's moral and political judgements. However, the long-term implications of LLM-based social decision-making remain unknown. How will human cooperation be affected when the assessment of social interactions relies on language models? This is a pressing question, as human cooperation is often driven by indirect reciprocity, reputations, and the capacity to judge interactions of others. Here, we assess how state-of-the-art LLMs judge cooperative actions. We provide 21 different LLMs with an extensive set of examples where individuals cooperate -- or refuse cooperating -- in a range of social contexts, and ask how these interactions should be judged. Furthermore, through an evolutionary game-theoretical model, we evaluate cooperation dynamics in populations where the extracted LLM-driven judgements prevail, assessing the long-term impact of LLMs on human prosociality. We observe a remarkable agreement in evaluating cooperation against good opponents. On the other hand, we notice within- and between-model variance when judging cooperation with ill-reputed individuals. We show that the differences revealed between models can significantly impact the prevalence of cooperation. Finally, we test prompts to steer LLM norms, showing that such interventions can shape LLM judgements, particularly through goal-oriented prompts. Our research connects LLM-based advices and long-term social dynamics, and highlights the need to carefully align LLM norms in order to preserve human cooperation.

[462] arXiv:2507.00095 (cross-list from quant-ph) [pdf, html, other]
Title: Authentication of Continuous-Variable Quantum Messages
Mehmet Hüseyin Temel, Boris Škorić
Comments: 15 pages
Subjects: Quantum Physics (quant-ph); Cryptography and Security (cs.CR)

We introduce the first quantum authentication scheme for continuous-variable states. Our scheme is based on trap states, and is an adaptation of a discrete-variable scheme by Broadbent et al. (arXiv:1211.1080), but with more freedom in choosing the number of traps. We provide a security proof, mostly following the approach of Broadbent and Wainewright (arXiv:1607.03075). As a necessary ingredient for the proof we derive the continuous-variable analogue of the Pauli Twirl.

[463] arXiv:2507.00103 (cross-list from physics.soc-ph) [pdf, html, other]
Title: Gender and Discipline Shape Length, Content and Tone of Grant Peer Review Reports
Stefan Müller, Gabriel Okasa, Michaela Strinzel, Anne Jorstad, Katrin Milzow, Matthias Egger
Subjects: Physics and Society (physics.soc-ph); Digital Libraries (cs.DL)

Peer review by experts is central to the evaluation of grant proposals, but little is known about how gender and disciplinary differences shape the content and tone of grant peer review reports. We analyzed 39,280 review reports submitted to the Swiss National Science Foundation between 2016 and 2023, covering 11,385 proposals for project funding across 21 disciplines from the Social Sciences and Humanities (SSH), Life Sciences (LS), and Mathematics, Informatics, Natural Sciences, and Technology (MINT). Using supervised machine learning, we classified over 1.3 million sentences by evaluation criteria and sentiment. Reviews in SSH were significantly longer and more critical, with less focus on the applicant's track record, while those in MINT were more concise and positive, with a higher focus on the track record, as compared to those in LS. Compared to male reviewers, female reviewers write longer reviews that more closely align with the evaluation criteria and express more positive sentiments. Female applicants tend to receive reviews with slightly more positive sentiment than male applicants. Gender and disciplinary culture influence how grant proposals are reviewed - shaping the tone, length, and focus of peer review reports. These differences have important implications for fairness and consistency in research funding.

[464] arXiv:2507.00155 (cross-list from eess.AS) [pdf, html, other]
Title: Do Music Source Separation Models Preserve Spatial Information in Binaural Audio?
Richa Namballa, Agnieszka Roginska, Magdalena Fuentes
Comments: 6 pages + references, 4 figures, 2 tables, 26th International Society for Music Information Retrieval (ISMIR) Conference
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD); Signal Processing (eess.SP)

Binaural audio remains underexplored within the music information retrieval community. Motivated by the rising popularity of virtual and augmented reality experiences as well as potential applications to accessibility, we investigate how well existing music source separation (MSS) models perform on binaural audio. Although these models process two-channel inputs, it is unclear how effectively they retain spatial information. In this work, we evaluate how several popular MSS models preserve spatial information on both standard stereo and novel binaural datasets. Our binaural data is synthesized using stems from MUSDB18-HQ and open-source head-related transfer functions by positioning instrument sources randomly along the horizontal plane. We then assess the spatial quality of the separated stems using signal processing and interaural cue-based metrics. Our results show that stereo MSS models fail to preserve the spatial information critical for maintaining the immersive quality of binaural audio, and that the degradation depends on model architecture as well as the target instrument. Finally, we highlight valuable opportunities for future work at the intersection of MSS and immersive audio.

[465] arXiv:2507.00185 (cross-list from eess.IV) [pdf, other]
Title: Multimodal, Multi-Disease Medical Imaging Foundation Model (MerMED-FM)
Yang Zhou, Chrystie Wan Ning Quek, Jun Zhou, Yan Wang, Yang Bai, Yuhe Ke, Jie Yao, Laura Gutierrez, Zhen Ling Teo, Darren Shu Jeng Ting, Brian T. Soetikno, Christopher S. Nielsen, Tobias Elze, Zengxiang Li, Linh Le Dinh, Lionel Tim-Ee Cheng, Tran Nguyen Tuan Anh, Chee Leong Cheng, Tien Yin Wong, Nan Liu, Iain Beehuat Tan, Tony Kiat Hon Lim, Rick Siow Mong Goh, Yong Liu, Daniel Shu Wei Ting
Comments: 42 pages, 3 composite figures, 4 tables
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Current artificial intelligence models for medical imaging are predominantly single modality and single disease. Attempts to create multimodal and multi-disease models have resulted in inconsistent clinical accuracy. Furthermore, training these models typically requires large, labour-intensive, well-labelled datasets. We developed MerMED-FM, a state-of-the-art multimodal, multi-specialty foundation model trained using self-supervised learning and a memory module. MerMED-FM was trained on 3.3 million medical images from over ten specialties and seven modalities, including computed tomography (CT), chest X-rays (CXR), ultrasound (US), pathology patches, color fundus photography (CFP), optical coherence tomography (OCT) and dermatology images. MerMED-FM was evaluated across multiple diseases and compared against existing foundational models. Strong performance was achieved across all modalities, with AUROCs of 0.988 (OCT); 0.982 (pathology); 0.951 (US); 0.943 (CT); 0.931 (skin); 0.894 (CFP); 0.858 (CXR). MerMED-FM has the potential to be a highly adaptable, versatile, cross-specialty foundation model that enables robust medical imaging interpretation across diverse medical disciplines.

[466] arXiv:2507.00206 (cross-list from eess.IV) [pdf, html, other]
Title: Towards 3D Semantic Image Synthesis for Medical Imaging
Wenwu Tang, Khaled Seyam, Bin Yang
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

In the medical domain, acquiring large datasets is challenging due to both accessibility issues and stringent privacy regulations. Consequently, data availability and privacy protection are major obstacles to applying machine learning in medical imaging. To address this, our study proposes the Med-LSDM (Latent Semantic Diffusion Model), which operates directly in the 3D domain and leverages de-identified semantic maps to generate synthetic data as a method of privacy preservation and data augmentation. Unlike many existing methods that focus on generating 2D slices, Med-LSDM is designed specifically for 3D semantic image synthesis, making it well-suited for applications requiring full volumetric data. Med-LSDM incorporates a guiding mechanism that controls the 3D image generation process by applying a diffusion model within the latent space of a pre-trained VQ-GAN. By operating in the compressed latent space, the model significantly reduces computational complexity while still preserving critical 3D spatial details. Our approach demonstrates strong performance in 3D semantic medical image synthesis, achieving a 3D-FID score of 0.0054 on the conditional Duke Breast dataset and similar Dice scores (0.70964) to those of real images (0.71496). These results demonstrate that the synthetic data from our model have a small domain gap with real data and are useful for data augmentation.

[467] arXiv:2507.00209 (cross-list from eess.IV) [pdf, html, other]
Title: SurgiSR4K: A High-Resolution Endoscopic Video Dataset for Robotic-Assisted Minimally Invasive Procedures
Fengyi Jiang, Xiaorui Zhang, Lingbo Jin, Ruixing Liang, Yuxin Chen, Adi Chola Venkatesh, Jason Culman, Tiantian Wu, Lirong Shao, Wenqing Sun, Cong Gao, Hallie McNamara, Jingpei Lu, Omid Mohareri
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

High-resolution imaging is crucial for enhancing visual clarity and enabling precise computer-assisted guidance in minimally invasive surgery (MIS). Despite the increasing adoption of 4K endoscopic systems, there remains a significant gap in publicly available native 4K datasets tailored specifically for robotic-assisted MIS. We introduce SurgiSR4K, the first publicly accessible surgical imaging and video dataset captured at a native 4K resolution, representing realistic conditions of robotic-assisted procedures. SurgiSR4K comprises diverse visual scenarios including specular reflections, tool occlusions, bleeding, and soft tissue deformations, meticulously designed to reflect common challenges faced during laparoscopic and robotic surgeries. This dataset opens up possibilities for a broad range of computer vision tasks that might benefit from high resolution data, such as super resolution (SR), smoke removal, surgical instrument detection, 3D tissue reconstruction, monocular depth estimation, instance segmentation, novel view synthesis, and vision-language model (VLM) development. SurgiSR4K provides a robust foundation for advancing research in high-resolution surgical imaging and fosters the development of intelligent imaging technologies aimed at enhancing performance, safety, and usability in image-guided robotic surgeries.

[468] arXiv:2507.00225 (cross-list from hep-ph) [pdf, html, other]
Title: Discovering the underlying analytic structure within Standard Model constants using artificial intelligence
S. V. Chekanov, H. Kjellerstrand
Comments: 42 pages, 10 tables
Subjects: High Energy Physics - Phenomenology (hep-ph); Artificial Intelligence (cs.AI); Data Analysis, Statistics and Probability (physics.data-an)

This paper presents a search for underlying analytic structures among the fundamental parameters of the Standard Model (SM) using symbolic regression and genetic programming. We identify the simplest analytic relationships connecting pairs of these constants and report several notable observations based on about a thousand expressions with relative precision better than 1%. These results may serve as valuable inputs for model builders and artificial intelligence methods aimed at uncovering hidden patterns among the SM constants, or potentially used as building blocks for a deeper underlying law that connects all parameters of the SM through a small set of fundamental constants.

[469] arXiv:2507.00227 (cross-list from eess.AS) [pdf, html, other]
Title: Investigating Stochastic Methods for Prosody Modeling in Speech Synthesis
Paul Mayer, Florian Lux, Alejandro Pérez-González-de-Martos, Angelina Elizarova, Lindsey Vanderlyn, Dirk Väth, Ngoc Thang Vu
Comments: Accepted at Interspeech 2025
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)

While generative methods have progressed rapidly in recent years, generating expressive prosody for an utterance remains a challenging task in text-to-speech synthesis. This is particularly true for systems that model prosody explicitly through parameters such as pitch, energy, and duration, which is commonly done for the sake of interpretability and controllability. In this work, we investigate the effectiveness of stochastic methods for this task, including Normalizing Flows, Conditional Flow Matching, and Rectified Flows. We compare these methods to a traditional deterministic baseline, as well as to real human realizations. Our extensive subjective and objective evaluations demonstrate that stochastic methods produce natural prosody on par with human speakers by capturing the variability inherent in human speech. Further, they open up additional controllability options by allowing the sampling temperature to be tuned.

[470] arXiv:2507.00231 (cross-list from physics.med-ph) [pdf, other]
Title: Observation of Blood Flow in Major Neck Vessels Modulated 1 by Physiological Maneuvers
Gennadi Saiko, Timothy Burton, Faraz Sadrzadeh-Afsharazar, Shota Yamashita, Kenshin Shimono, Yasuyuki Kakihana, Alexandre Douplik
Subjects: Medical Physics (physics.med-ph); Signal Processing (eess.SP); Systems and Control (eess.SY)

Large neck vessels (carotid artery and internal jugular vein, IJV) offer a unique opportunity to monitor hemodynamics non-invasively by optical means. The primary shortcoming of past work has been the focus on healthy volunteers in normal physiological conditions and well-controlled environments. To drive the technology closer to the bedside, testing is required under more re-alistic conditions, including in pathologies and real-world environments (e.g., similar toICU or emergency care settings). The primary goal of the current work was to extend the range of physiological maneuvers for blood flow modulation by introducing new maneuvers and ob-serving PPG response to them. The data from the necks of two healthy volunteers in a supine position were collected by clinical PPG and in-house built PPG sensors, accompanied by ECG signal collection. Seven maneuvers (abdominojugular test, breath holding, Valsalva, proximal occlusion of right IJV, distal occlusion of right IJV, proximal occlusion of left IJV, distal occlusion of left IJV) were performed in sequence with 1 min allocated for each maneuver. The 1 min was split into three segments: baseline (15 s), experiment (15 s), and recovery (30 s). Thus, the overall du-ration of the experiment was 7 min. AC amplitude from clinical PPG, DC amplitudes from in-house built PPG, and ECG signal were compared during all seven physiological maneuvers. Newly proposed maneuvers (Valsalva and IJV occlusions) demonstrated modulation of blood flow, which was more significant than previously reported maneuvers (abdominojugular test and breath holding). The proposed physiological maneuvers demonstrate high potential as instruments for modulating blood flow in major neck vessels.

[471] arXiv:2507.00249 (cross-list from econ.TH) [pdf, html, other]
Title: Endogenous Network Structures with Precision and Dimension Choices
Nikhil Kumar
Subjects: Theoretical Economics (econ.TH); Social and Information Networks (cs.SI)

This paper presents a social learning model where the network structure is endogenously determined by signal precision and dimension choices. Agents not only choose the precision of their signals and what dimension of the state to learn about, but these decisions directly determine the underlying network structure on which social learning occurs. We show that under a fixed network structure, the optimal precision choice is sublinear in the agent's stationary influence in the network, and this individually optimal choice is worse than the socially optimal choice by a factor of $n^{1/3}$. Under a dynamic network structure, we specify the network by defining a kernel distance between agents, which then determines how much weight agents place on one another. Agents choose dimensions to learn about such that their choice minimizes the squared sum of influences of all agents: a network with equally distributed influence across agents is ideal.

[472] arXiv:2507.00252 (cross-list from math.CO) [pdf, html, other]
Title: Compact Representation of Semilinear and Terrain-like Graphs
Jean Cardinal, Yelena Yuditsky
Subjects: Combinatorics (math.CO); Computational Geometry (cs.CG); Discrete Mathematics (cs.DM)

We consider the existence and construction of \textit{biclique covers} of graphs, consisting of coverings of their edge sets by complete bipartite graphs. The \textit{size} of such a cover is the sum of the sizes of the bicliques. Small-size biclique covers of graphs are ubiquitous in computational geometry, and have been shown to be useful compact representations of graphs. We give a brief survey of classical and recent results on biclique covers and their applications, and give new families of graphs having biclique covers of near-linear size.
In particular, we show that semilinear graphs, whose edges are defined by linear relations in bounded dimensional space, always have biclique covers of size $O(n\polylog n)$. This generalizes many previously known results on special classes of graphs including interval graphs, permutation graphs, and graphs of bounded boxicity, but also new classes such as intersection graphs of L-shapes in the plane. It also directly implies the bounds for Zarankiewicz's problem derived by Basit, Chernikov, Starchenko, Tao, and Tran (\textit{Forum Math. Sigma}, 2021).
We also consider capped graphs, also known as terrain-like graphs, defined as ordered graphs forbidding a certain ordered pattern on four vertices. Terrain-like graphs contain the induced subgraphs of terrain visibility graphs. We give an elementary proof that these graphs admit biclique partitions of size $O(n\log^3 n)$. This provides a simple combinatorial analogue of a classical result from Agarwal, Alon, Aronov, and Suri on polygon visibility graphs (\textit{Discrete Comput. Geom.} 1994).
Finally, we prove that there exists families of unit disk graphs on $n$ vertices that do not admit biclique coverings of size $o(n^{4/3})$, showing that we are unlikely to improve on Szemerédi-Trotter type incidence bounds for higher-degree semialgebraic graphs.

[473] arXiv:2507.00254 (cross-list from quant-ph) [pdf, html, other]
Title: Fully Parallelized BP Decoding for Quantum LDPC Codes Can Outperform BP-OSD
Ming Wang, Ang Li, Frank Mueller
Subjects: Quantum Physics (quant-ph); Information Theory (cs.IT)

In this work, we propose a lightweight decoder based solely on belief-propagation (BP), augmented with a speculative post-processing strategy inspired by classical Chase decoding. Our method identifies unreliable bits via BP oscillation statistics, generates a set of modified test patterns, and decodes them in parallel using low-iteration BP. We demonstrate that our approach can achieve logical error rates comparable to or even better than BP-OSD, but has lower latency over its parallelization for a variety of bivariate bicycle codes, which significantly reduces decoding complexity.

[474] arXiv:2507.00255 (cross-list from quant-ph) [pdf, html, other]
Title: Robustness Analysis for Quantum Systems Controlled by Continuous-Time Pulses
Sean Patrick O'Neil, Edmond Jonckheere, Sophie Schirmer
Comments: 6 pages, 2 figures
Subjects: Quantum Physics (quant-ph); Systems and Control (eess.SY)

Differential sensitivity techniques originally developed to study the robustness of energy landscape controllers are generalized to the important case of closed quantum systems subject to continuously varying controls. Vanishing sensitivity to parameter variation is shown to coincide with perfect fidelity, as was the case for time-invariant controls. Bounds on the magnitude of the differential sensitivity to any parameter variation are derived based simply on knowledge of the system Hamiltonian and the maximum size of the control inputs

[475] arXiv:2507.00260 (cross-list from stat.ML) [pdf, html, other]
Title: Disentangled Feature Importance
Jin-Hong Du, Kathryn Roeder, Larry Wasserman
Comments: 26 main and 29 supplementary pages
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)

Feature importance quantification faces a fundamental challenge: when predictors are correlated, standard methods systematically underestimate their contributions. We prove that major existing approaches target identical population functionals under squared-error loss, revealing why they share this correlation-induced bias.
To address this limitation, we introduce \emph{Disentangled Feature Importance (DFI)}, a nonparametric generalization of the classical $R^2$ decomposition via optimal transport. DFI transforms correlated features into independent latent variables using a transport map, eliminating correlation distortion. Importance is computed in this disentangled space and attributed back through the transport map's sensitivity. DFI provides a principled decomposition of importance scores that sum to the total predictive variability for latent additive models and to interaction-weighted functional ANOVA variances more generally, under arbitrary feature dependencies.
We develop a comprehensive semiparametric theory for DFI. For general transport maps, we establish root-$n$ consistency and asymptotic normality of importance estimators in the latent space, which extends to the original feature space for the Bures-Wasserstein map. Notably, our estimators achieve second-order estimation error, which vanishes if both regression function and transport map estimation errors are $o_{\mathbb{P}}(n^{-1/4})$. By design, DFI avoids the computational burden of repeated submodel refitting and the challenges of conditional covariate distribution estimation, thereby achieving computational efficiency.

[476] arXiv:2507.00269 (cross-list from q-bio.NC) [pdf, other]
Title: Feature Integration Spaces: Joint Training Reveals Dual Encoding in Neural Network Representations
Omar Claflin
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)

Current sparse autoencoder (SAE) approaches to neural network interpretability assume that activations can be decomposed through linear superposition into sparse, interpretable features. Despite high reconstruction fidelity, SAEs consistently fail to eliminate polysemanticity and exhibit pathological behavioral errors. We propose that neural networks encode information in two complementary spaces compressed into the same substrate: feature identity and feature integration. To test this dual encoding hypothesis, we develop sequential and joint-training architectures to capture identity and integration patterns simultaneously. Joint training achieves 41.3% reconstruction improvement and 51.6% reduction in KL divergence errors. This architecture spontaneously develops bimodal feature organization: low squared norm features contributing to integration pathways and the rest contributing directly to the residual. Small nonlinear components (3% of parameters) achieve 16.5% standalone improvements, demonstrating parameter-efficient capture of computational relationships crucial for behavior. Additionally, intervention experiments using 2x2 factorial stimulus designs demonstrated that integration features exhibit selective sensitivity to experimental manipulations and produce systematic behavioral effects on model outputs, including significant interaction effects across semantic dimensions. This work provides systematic evidence for (1) dual encoding in neural representations, (2) meaningful nonlinearly encoded feature interactions, and (3) introduces an architectural paradigm shift from post-hoc feature analysis to integrated computational design, establishing foundations for next-generation SAEs.

[477] arXiv:2507.00279 (cross-list from econ.GN) [pdf, html, other]
Title: Satellite and Mobile Phone Data Reveal How Violence Affects Seasonal Migration in Afghanistan
Xiao Hui Tai, Suraj R. Nair, Shikhar Mehra, Joshua E. Blumenstock
Subjects: General Economics (econ.GN); Social and Information Networks (cs.SI)

Seasonal migration plays a critical role in stabilizing rural economies and sustaining the livelihoods of agricultural households. Violence and civil conflict have long been thought to disrupt these labor flows, but this hypothesis has historically been hard to test given the lack of reliable data on migration in conflict zones. Focusing on Afghanistan in the 8-year period prior to the Taliban's takeover in 2021, we first demonstrate how satellite imagery can be used to infer the timing of the opium harvest, which employs a large number of seasonal workers in relatively well-paid jobs. We then use a dataset of nationwide mobile phone records to characterize the migration response to this harvest, and examine whether and how violence and civil conflict disrupt this migration. We find that, on average, districts with high levels of poppy cultivation receive significantly more seasonal migrants than districts with no poppy cultivation. These labor flows are surprisingly resilient to idiosyncratic violent events at the source or destination, including extreme violence resulting in large numbers of fatalities. However, seasonal migration is affected by longer-term patterns of conflict, such as the extent of Taliban control in origin and destination locations.

[478] arXiv:2507.00288 (cross-list from econ.TH) [pdf, other]
Title: Reconfiguring Digital Accountability: AI-Powered Innovations and Transnational Governance in a Postnational Accounting Context
Claire Li, David Freeborn
Comments: 22 pages
Subjects: Theoretical Economics (econ.TH); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)

This study explores how AI-powered digital innovations are reshaping organisational accountability in a transnational governance context. As AI systems increasingly mediate decision-making in domains such as auditing and financial reporting, traditional mechanisms of accountability, based on control, transparency, and auditability, are being destabilised. We integrate the Technology Acceptance Model (TAM), Actor-Network Theory (ANT), and institutional theory to examine how organisations adopt AI technologies in response to regulatory, ethical, and cultural pressures that transcend national boundaries. We argue that accountability is co-constructed within global socio-technical networks, shaped not only by user perceptions but also by governance logics and normative expectations. Extending TAM, we incorporate compliance and legitimacy as key factors in perceived usefulness and usability. Drawing on ANT, we reconceptualise accountability as a relational and emergent property of networked assemblages. We propose two organisational strategies including internal governance reconfiguration and external actor-network engagement to foster responsible, legitimate, and globally accepted AI adoption in the accounting domain.

[479] arXiv:2507.00298 (cross-list from stat.ML) [pdf, html, other]
Title: Enhancing Interpretability in Generative Modeling: Statistically Disentangled Latent Spaces Guided by Generative Factors in Scientific Datasets
Arkaprabha Ganguli, Nesar Ramachandra, Julie Bessac, Emil Constantinescu
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

This study addresses the challenge of statistically extracting generative factors from complex, high-dimensional datasets in unsupervised or semi-supervised settings. We investigate encoder-decoder-based generative models for nonlinear dimensionality reduction, focusing on disentangling low-dimensional latent variables corresponding to independent physical factors. Introducing Aux-VAE, a novel architecture within the classical Variational Autoencoder framework, we achieve disentanglement with minimal modifications to the standard VAE loss function by leveraging prior statistical knowledge through auxiliary variables. These variables guide the shaping of the latent space by aligning latent factors with learned auxiliary variables. We validate the efficacy of Aux-VAE through comparative assessments on multiple datasets, including astronomical simulations.

[480] arXiv:2507.00313 (cross-list from math.CO) [pdf, html, other]
Title: Faces in rectilinear drawings of complete graphs
Martin Balko, Anna Brötzner, Fabian Klute, Josef Tkadlec
Comments: 21 pages, 11 figures
Journal-ref: European Journal of Combinatorics, 2025
Subjects: Combinatorics (math.CO); Discrete Mathematics (cs.DM)

We initiate the study of extremal problems about faces in convex rectilinear drawings of~$K_n$, that is, drawings where vertices are represented by points in the plane in convex position and edges by line segments between the points representing the end-vertices. We show that if a convex rectilinear drawing of $K_n$ does not contain a common interior point of at least three edges, then there is always a face forming a convex 5-gon while there are such drawings without any face forming a convex $k$-gon with $k \geq 6$.
A convex rectilinear drawing of $K_n$ is \emph{regular} if its vertices correspond to vertices of a regular convex $n$-gon. We characterize positive integers $n$ for which regular drawings of $K_n$ contain a face forming a convex 5-gon.
To our knowledge, this type of problems has not been considered in the literature before and so we also pose several new natural open problems.

[481] arXiv:2507.00361 (cross-list from math.OC) [pdf, html, other]
Title: Affine-Invariant Global Non-Asymptotic Convergence Analysis of BFGS under Self-Concordance
Qiujiang Jin, Aryan Mokhtari
Subjects: Optimization and Control (math.OC); Numerical Analysis (math.NA)

In this paper, we establish global non-asymptotic convergence guarantees for the BFGS quasi-Newton method without requiring strong convexity or the Lipschitz continuity of the gradient or Hessian. Instead, we consider the setting where the objective function is strictly convex and strongly self-concordant. For an arbitrary initial point and any arbitrary positive-definite initial Hessian approximation, we prove global linear and superlinear convergence guarantees for BFGS when the step size is determined using a line search scheme satisfying the weak Wolfe conditions. Moreover, all our global guarantees are affine-invariant, with the convergence rates depending solely on the initial error and the strongly self-concordant constant. Our results extend the global non-asymptotic convergence theory of BFGS beyond traditional assumptions and, for the first time, establish affine-invariant convergence guarantees aligning with the inherent affine invariance of the BFGS method.

[482] arXiv:2507.00398 (cross-list from eess.IV) [pdf, html, other]
Title: Accurate and Efficient Fetal Birth Weight Estimation from 3D Ultrasound
Jian Wang, Qiongying Ni, Hongkui Yu, Ruixuan Yao, Jinqiao Ying, Bin Zhang, Xingyi Yang, Jin Peng, Jiongquan Chen, Junxuan Yu, Wenlong Shi, Chaoyu Chen, Zhongnuo Yan, Mingyuan Luo, Gaocheng Cai, Dong Ni, Jing Lu, Xin Yang
Comments: Accepted by MICCAI 2025
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Accurate fetal birth weight (FBW) estimation is essential for optimizing delivery decisions and reducing perinatal mortality. However, clinical methods for FBW estimation are inefficient, operator-dependent, and challenging to apply in cases of complex fetal anatomy. Existing deep learning methods are based on 2D standard ultrasound (US) images or videos that lack spatial information, limiting their prediction accuracy. In this study, we propose the first method for directly estimating FBW from 3D fetal US volumes. Our approach integrates a multi-scale feature fusion network (MFFN) and a synthetic sample-based learning framework (SSLF). The MFFN effectively extracts and fuses multi-scale features under sparse supervision by incorporating channel attention, spatial attention, and a ranking-based loss function. SSLF generates synthetic samples by simply combining fetal head and abdomen data from different fetuses, utilizing semi-supervised learning to improve prediction performance. Experimental results demonstrate that our method achieves superior performance, with a mean absolute error of $166.4\pm155.9$ $g$ and a mean absolute percentage error of $5.1\pm4.6$%, outperforming existing methods and approaching the accuracy of a senior doctor. Code is available at: this https URL.

[483] arXiv:2507.00400 (cross-list from quant-ph) [pdf, html, other]
Title: Logarithmic Depth Decomposition of Approximate Multi-Controlled Single-Qubit Gates Without Ancilla Qubits
Jefferson D. S. Silva, Adenilton J. da Silva
Comments: 6 pages, 10 figures
Subjects: Quantum Physics (quant-ph); Computational Complexity (cs.CC)

The synthesis of quantum operators involves decomposing general quantum gates into the gate set supported by a given quantum device. Multi-controlled gates are essential components in this process. In this work, we present improved decompositions of multi-controlled NOT gates with logarithmic depth using a single ancilla qubit, while also reducing the constant factors in the circuit depth compared to previous work. We optimize a previously proposed decomposition of multi-target, multi-controlled special unitary SU(2) gates by identifying the presence of a conditionally clean qubit. Additionally, we introduce the best-known decomposition of multi-controlled approximate unitary U(2) gates without using ancilla qubits. This approach significantly reduces the overall circuit depth and CNOT count while preserving an adjustable error parameter, yielding a more efficient and scalable solution for synthesizing large controlled-unitary gates. Our method is particularly suitable for both NISQ and fault-tolerant quantum architectures. All software developed in this project is freely available.

[484] arXiv:2507.00402 (cross-list from stat.ML) [pdf, html, other]
Title: GRAND: Graph Release with Assured Node Differential Privacy
Suqing Liu, Xuan Bi, Tianxi Li
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)

Differential privacy is a well-established framework for safeguarding sensitive information in data. While extensively applied across various domains, its application to network data -- particularly at the node level -- remains underexplored. Existing methods for node-level privacy either focus exclusively on query-based approaches, which restrict output to pre-specified network statistics, or fail to preserve key structural properties of the network. In this work, we propose GRAND (Graph Release with Assured Node Differential privacy), which is, to the best of our knowledge, the first network release mechanism that releases entire networks while ensuring node-level differential privacy and preserving structural properties. Under a broad class of latent space models, we show that the released network asymptotically follows the same distribution as the original network. The effectiveness of the approach is evaluated through extensive experiments on both synthetic and real-world datasets.

[485] arXiv:2507.00407 (cross-list from physics.chem-ph) [pdf, html, other]
Title: Augmenting Molecular Graphs with Geometries via Machine Learning Interatomic Potentials
Cong Fu, Yuchao Lin, Zachary Krueger, Haiyang Yu, Maho Nakata, Jianwen Xie, Emine Kucukbenli, Xiaofeng Qian, Shuiwang Ji
Subjects: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)

Accurate molecular property predictions require 3D geometries, which are typically obtained using expensive methods such as density functional theory (DFT). Here, we attempt to obtain molecular geometries by relying solely on machine learning interatomic potential (MLIP) models. To this end, we first curate a large-scale molecular relaxation dataset comprising 3.5 million molecules and 300 million snapshots. Then MLIP foundation models are trained with supervised learning to predict energy and forces given 3D molecular structures. Once trained, we show that the foundation models can be used in different ways to obtain geometries either explicitly or implicitly. First, it can be used to obtain low-energy 3D geometries via geometry optimization, providing relaxed 3D geometries for downstream molecular property predictions. To mitigate potential biases and enhance downstream predictions, we introduce geometry fine-tuning based on the relaxed 3D geometries. Second, the foundation models can be directly fine-tuned for property prediction when ground truth 3D geometries are available. Our results demonstrate that MLIP foundation models trained on relaxation data can provide valuable molecular geometries that benefit property predictions.

[486] arXiv:2507.00408 (cross-list from physics.chem-ph) [pdf, html, other]
Title: Learning collective variables that respect permutational symmetry
Jiaxin Yuan, Shashank Sule, Yeuk Yin Lam, Maria Cameron
Comments: 66 pages, 35 figures, 18 tables
Subjects: Chemical Physics (physics.chem-ph); Numerical Analysis (math.NA)

In addition to translational and rotational symmetries, clusters of identical interacting particles possess permutational symmetry. Coarse-grained models for such systems are instrumental in identifying metastable states, providing an effective description of their dynamics, and estimating transition rates. We propose a numerical framework for learning collective variables that respect translational, rotational, and permutational symmetries, and for estimating transition rates and residence times. It combines a sort-based featurization, residence manifold learning in the feature space, and learning collective variables with autoencoders whose loss function utilizes the orthogonality relationship (Legoll and Lelievre, 2010). The committor of the resulting reduced model is used as the reaction coordinate in the forward flux sampling and to design a control for sampling the transition path process. We offer two case studies, the Lennard-Jones-7 in 2D and the Lennard-Jones-8 in 3D. The transition rates and residence times computed with the aid of the reduced models agree with those obtained via brute-force methods.

[487] arXiv:2507.00419 (cross-list from physics.geo-ph) [pdf, html, other]
Title: Geological Everything Model 3D: A Promptable Foundation Model for Unified and Zero-Shot Subsurface Understanding
Yimin Dou, Xinming Wu, Nathan L Bangs, Harpreet Singh Sethi, Jintao Li, Hang Gao, Zhixiang Guo
Subjects: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI)

Understanding Earth's subsurface is critical for energy transition, natural hazard mitigation, and planetary science. Yet subsurface analysis remains fragmented, with separate models required for structural interpretation, stratigraphic analysis, geobody segmentation, and property modeling-each tightly coupled to specific data distributions and task formulations. We introduce the Geological Everything Model 3D (GEM), a unified generative architecture that reformulates all these tasks as prompt-conditioned inference along latent structural frameworks derived from subsurface imaging. This formulation moves beyond task-specific models by enabling a shared inference mechanism, where GEM propagates human-provided prompts-such as well logs, masks, or structural sketches-along inferred structural frameworks to produce geologically coherent outputs. Through this mechanism, GEM achieves zero-shot generalization across tasks with heterogeneous prompt types, without retraining for new tasks or data sources. This capability emerges from a two-stage training process that combines self-supervised representation learning on large-scale field seismic data with adversarial fine-tuning using mixed prompts and labels across diverse subsurface tasks. GEM demonstrates broad applicability across surveys and tasks, including Martian radar stratigraphy analysis, structural interpretation in subduction zones, full seismic stratigraphic interpretation, geobody delineation, and property modeling. By bridging expert knowledge with generative reasoning in a structurally aware manner, GEM lays the foundation for scalable, human-in-the-loop geophysical AI-transitioning from fragmented pipelines to a vertically integrated, promptable reasoning system. Project page: this https URL

[488] arXiv:2507.00458 (cross-list from eess.AS) [pdf, html, other]
Title: Mitigating Language Mismatch in SSL-Based Speaker Anonymization
Zhe Zhang, Wen-Chin Huang, Xin Wang, Xiaoxiao Miao, Junichi Yamagishi
Comments: Accepted to Interspeech 2025
Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)

Speaker anonymization aims to protect speaker identity while preserving content information and the intelligibility of speech. However, most speaker anonymization systems (SASs) are developed and evaluated using only English, resulting in degraded utility for other languages. This paper investigates language mismatch in SASs for Japanese and Mandarin speech. First, we fine-tune a self-supervised learning (SSL)-based content encoder with Japanese speech to verify effective language adaptation. Then, we propose fine-tuning a multilingual SSL model with Japanese speech and evaluating the SAS in Japanese and Mandarin. Downstream experiments show that fine-tuning an English-only SSL model with the target language enhances intelligibility while maintaining privacy and that multilingual SSL further extends SASs' utility across different languages. These findings highlight the importance of language adaptation and multilingual pre-training of SSLs for robust multilingual speaker anonymization.

[489] arXiv:2507.00459 (cross-list from cond-mat.mtrl-sci) [pdf, other]
Title: Process-aware and high-fidelity microstructure generation using stable diffusion
Hoang Cuong Phan, Minh Tien Tran, Chihun Lee, Hoheok Kim, Sehyok Oh, Dong-Kyu Kim, Ho Won Lee
Comments: 46 pages, 13 figures, 5 tables, 3rd Word Congress on Artificial Intelligence in Materials & Manufacturing 2025
Subjects: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)

Synthesizing realistic microstructure images conditioned on processing parameters is crucial for understanding process-structure relationships in materials design. However, this task remains challenging due to limited training micrographs and the continuous nature of processing variables. To overcome these challenges, we present a novel process-aware generative modeling approach based on Stable Diffusion 3.5 Large (SD3.5-Large), a state-of-the-art text-to-image diffusion model adapted for microstructure generation. Our method introduces numeric-aware embeddings that encode continuous variables (annealing temperature, time, and magnification) directly into the model's conditioning, enabling controlled image generation under specified process conditions and capturing process-driven microstructural variations. To address data scarcity and computational constraints, we fine-tune only a small fraction of the model's weights via DreamBooth and Low-Rank Adaptation (LoRA), efficiently transferring the pre-trained model to the materials domain. We validate realism using a semantic segmentation model based on a fine-tuned U-Net with a VGG16 encoder on 24 labeled micrographs. It achieves 97.1% accuracy and 85.7% mean IoU, outperforming previous methods. Quantitative analyses using physical descriptors and spatial statistics show strong agreement between synthetic and real microstructures. Specifically, two-point correlation and lineal-path errors remain below 2.1% and 0.6%, respectively. Our method represents the first adaptation of SD3.5-Large for process-aware microstructure generation, offering a scalable approach for data-driven materials design.

[490] arXiv:2507.00482 (cross-list from physics.optics) [pdf, other]
Title: Physics-Aware Style Transfer for Adaptive Holographic Reconstruction
Chanseok Lee, Fakhriyya Mammadova, Jiseong Barg, Mooseok Jang
Comments: Keywords: holographic imaging, style transfer, phase retrieval, deep learning
Subjects: Optics (physics.optics); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Inline holographic imaging presents an ill-posed inverse problem of reconstructing objects' complex amplitude from recorded diffraction patterns. Although recent deep learning approaches have shown promise over classical phase retrieval algorithms, they often require high-quality ground truth datasets of complex amplitude maps to achieve a statistical inverse mapping operation between the two domains. Here, we present a physics-aware style transfer approach that interprets the object-to-sensor distance as an implicit style within diffraction patterns. Using the style domain as the intermediate domain to construct cyclic image translation, we show that the inverse mapping operation can be learned in an adaptive manner only with datasets composed of intensity measurements. We further demonstrate its biomedical applicability by reconstructing the morphology of dynamically flowing red blood cells, highlighting its potential for real-time, label-free imaging. As a framework that leverages physical cues inherently embedded in measurements, the presented method offers a practical learning strategy for imaging applications where ground truth is difficult or impossible to obtain.

[491] arXiv:2507.00511 (cross-list from eess.IV) [pdf, html, other]
Title: Medical Image Segmentation Using Advanced Unet: VMSE-Unet and VM-Unet CBAM+
Sayandeep Kanrar, Raja Piyush, Qaiser Razi, Debanshi Chakraborty, Vikas Hassija, GSS Chalapathi
Comments: under review
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

In this paper, we present the VMSE U-Net and VM-Unet CBAM+ model, two cutting-edge deep learning architectures designed to enhance medical image segmentation. Our approach integrates Squeeze-and-Excitation (SE) and Convolutional Block Attention Module (CBAM) techniques into the traditional VM U-Net framework, significantly improving segmentation accuracy, feature localization, and computational efficiency. Both models show superior performance compared to the baseline VM-Unet across multiple datasets. Notably, VMSEUnet achieves the highest accuracy, IoU, precision, and recall while maintaining low loss values. It also exhibits exceptional computational efficiency with faster inference times and lower memory usage on both GPU and CPU. Overall, the study suggests that the enhanced architecture VMSE-Unet is a valuable tool for medical image analysis. These findings highlight its potential for real-world clinical applications, emphasizing the importance of further research to optimize accuracy, robustness, and computational efficiency.

[492] arXiv:2507.00514 (cross-list from astro-ph.CO) [pdf, html, other]
Title: Simulation-Efficient Cosmological Inference with Multi-Fidelity SBI
Leander Thiele, Adrian E. Bayer, Naoya Takeishi
Comments: 5 pages, 4 figures; accepted at ICML-colocated ML4Astro 2025 workshop
Subjects: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG)

The simulation cost for cosmological simulation-based inference can be decreased by combining simulation sets of varying fidelity. We propose an approach to such multi-fidelity inference based on feature matching and knowledge distillation. Our method results in improved posterior quality, particularly for small simulation budgets and difficult inference problems.

[493] arXiv:2507.00546 (cross-list from physics.app-ph) [pdf, html, other]
Title: Inverse Design in Nanophotonics via Representation Learning
Reza Marzban, Ali Adibi, Raphael Pestourie
Subjects: Applied Physics (physics.app-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optics (physics.optics)

Inverse design in nanophotonics, the computational discovery of structures achieving targeted electromagnetic (EM) responses, has become a key tool for recent optical advances. Traditional intuition-driven or iterative optimization methods struggle with the inherently high-dimensional, non-convex design spaces and the substantial computational demands of EM simulations. Recently, machine learning (ML) has emerged to address these bottlenecks effectively. This review frames ML-enhanced inverse design methodologies through the lens of representation learning, classifying them into two categories: output-side and input-side approaches. Output-side methods use ML to learn a representation in the solution space to create a differentiable solver that accelerates optimization. Conversely, input-side techniques employ ML to learn compact, latent-space representations of feasible device geometries, enabling efficient global exploration through generative models. Each strategy presents unique trade-offs in data requirements, generalization capacity, and novel design discovery potentials. Hybrid frameworks that combine physics-based optimization with data-driven representations help escape poor local optima, improve scalability, and facilitate knowledge transfer. We conclude by highlighting open challenges and opportunities, emphasizing complexity management, geometry-independent representations, integration of fabrication constraints, and advancements in multiphysics co-designs.

[494] arXiv:2507.00569 (cross-list from math.CO) [pdf, html, other]
Title: Linear rank-metric intersecting codes
Daniele Bartoli, Martino Borello, Giuseppe Marino, Martin Scotti
Comments: 17 pages, 1 figure
Subjects: Combinatorics (math.CO); Information Theory (cs.IT)

In this paper we introduce and investigate rank-metric intersecting codes, a new class of linear codes in the rank-metric context, inspired by the well-studied notion of intersecting codes in the Hamming metric. A rank-metric code is said to be intersecting if any two nonzero codewords have supports intersecting non trivially. We explore this class from both a coding-theoretic and geometric perspective, highlighting its relationship with minimal codes, MRD codes, and Hamming-metric intersecting codes. We derive structural properties, sufficient conditions based on minimum distance, and geometric characterizations in terms of 2-spannable $q$-systems. We establish upper and lower bounds on code parameters and show some constructions, which leave a range of unexplored parameters. Finally, we connect rank-intersecting codes to other combinatorial structures such as $(2,1)$-separating systems and frameproof codes.

[495] arXiv:2507.00582 (cross-list from eess.IV) [pdf, html, other]
Title: Bridging Classical and Learning-based Iterative Registration through Deep Equilibrium Models
Yi Zhang, Yidong Zhao, Qian Tao
Comments: Submitted version. Accepted by MICCAI 2025
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Deformable medical image registration is traditionally formulated as an optimization problem. While classical methods solve this problem iteratively, recent learning-based approaches use recurrent neural networks (RNNs) to mimic this process by unrolling the prediction of deformation fields in a fixed number of steps. However, classical methods typically converge after sufficient iterations, but learning-based unrolling methods lack a theoretical convergence guarantee and show instability empirically. In addition, unrolling methods have a practical bottleneck at training time: GPU memory usage grows linearly with the unrolling steps due to backpropagation through time (BPTT). To address both theoretical and practical challenges, we propose DEQReg, a novel registration framework based on Deep Equilibrium Models (DEQ), which formulates registration as an equilibrium-seeking problem, establishing a natural connection between classical optimization and learning-based unrolling methods. DEQReg maintains constant memory usage, enabling theoretically unlimited iteration steps. Through extensive evaluation on the public brain MRI and lung CT datasets, we show that DEQReg can achieve competitive registration performance, while substantially reducing memory consumption compared to state-of-the-art unrolling methods. We also reveal an intriguing phenomenon: the performance of existing unrolling methods first increases slightly then degrades irreversibly when the inference steps go beyond the training configuration. In contrast, DEQReg achieves stable convergence with its inbuilt equilibrium-seeking mechanism, bridging the gap between classical optimization-based and modern learning-based registration methods.

[496] arXiv:2507.00613 (cross-list from eess.IV) [pdf, html, other]
Title: Physics-Informed Neural ODEs for Temporal Dynamics Modeling in Cardiac T1 Mapping
Nuno Capitão, Yi Zhang, Yidong Zhao, Qian Tao
Comments: Submitted version. Accepted at MICCAI 2025
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)

Spin-lattice relaxation time ($T_1$) is an important biomarker in cardiac parametric mapping for characterizing myocardial tissue and diagnosing cardiomyopathies. Conventional Modified Look-Locker Inversion Recovery (MOLLI) acquires 11 breath-hold baseline images with interleaved rest periods to ensure mapping accuracy. However, prolonged scanning can be challenging for patients with poor breathholds, often leading to motion artifacts that degrade image quality. In addition, $T_1$ mapping requires voxel-wise nonlinear fitting to a signal recovery model involving an iterative estimation process. Recent studies have proposed deep-learning approaches for rapid $T_1$ mapping using shortened sequences to reduce acquisition time for patient comfort. Nevertheless, existing methods overlook important physics constraints, limiting interpretability and generalization. In this work, we present an accelerated, end-to-end $T_1$ mapping framework leveraging Physics-Informed Neural Ordinary Differential Equations (ODEs) to model temporal dynamics and address these challenges. Our method achieves high-accuracy $T_1$ estimation from a sparse subset of baseline images and ensures efficient null index estimation at test time. Specifically, we develop a continuous-time LSTM-ODE model to enable selective Look-Locker (LL) data acquisition with arbitrary time lags. Experimental results show superior performance in $T_1$ estimation for both native and post-contrast sequences and demonstrate the strong benefit of our physics-based formulation over direct data-driven $T_1$ priors.

[497] arXiv:2507.00616 (cross-list from math.DG) [pdf, html, other]
Title: Geometric Gaussian Approximations of Probability Distributions
Nathaël Da Costa, Bálint Mucsányi, Philipp Hennig
Subjects: Differential Geometry (math.DG); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)

Approximating complex probability distributions, such as Bayesian posterior distributions, is of central interest in many applications. We study the expressivity of geometric Gaussian approximations. These consist of approximations by Gaussian pushforwards through diffeomorphisms or Riemannian exponential maps. We first review these two different kinds of geometric Gaussian approximations. Then we explore their relationship to one another. We further provide a constructive proof that such geometric Gaussian approximations are universal, in that they can capture any probability distribution. Finally, we discuss whether, given a family of probability distributions, a common diffeomorphism can be found to obtain uniformly high-quality geometric Gaussian approximations for that family.

[498] arXiv:2507.00624 (cross-list from math.FA) [pdf, html, other]
Title: A Simple Proof of Nehari's Theorem Based on Duality
Cristian R. Rojas
Comments: 11 pages
Subjects: Functional Analysis (math.FA); Systems and Control (eess.SY)

In this technical note we provide a simple proof of Nehari's theorem on the optimal approximation by $H_\infty$ functions, based on convex duality.

[499] arXiv:2507.00629 (cross-list from cond-mat.dis-nn) [pdf, html, other]
Title: Generalization performance of narrow one-hidden layer networks in the teacher-student setting
Jean Barbier, Federica Gerace, Alessandro Ingrosso, Clarissa Lauditi, Enrico M. Malatesta, Gibbs Nwemadji, Rodrigo Pérez Ortiz
Comments: 34 pages, figures
Subjects: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)

Understanding the generalization abilities of neural networks for simple input-output distributions is crucial to account for their learning performance on real datasets. The classical teacher-student setting, where a network is trained from data obtained thanks to a label-generating teacher model, serves as a perfect theoretical test bed. In this context, a complete theoretical account of the performance of fully connected one-hidden layer networks in the presence of generic activation functions is lacking. In this work, we develop such a general theory for narrow networks, i.e. networks with a large number of hidden units, yet much smaller than the input dimension. Using methods from statistical physics, we provide closed-form expressions for the typical performance of both finite temperature (Bayesian) and empirical risk minimization estimators, in terms of a small number of weight statistics. In doing so, we highlight the presence of a transition where hidden neurons specialize when the number of samples is sufficiently large and proportional to the number of parameters of the network. Our theory accurately predicts the generalization error of neural networks trained on regression or classification tasks with either noisy full-batch gradient descent (Langevin dynamics) or full-batch gradient descent.

[500] arXiv:2507.00640 (cross-list from stat.ML) [pdf, html, other]
Title: Forward Reverse Kernel Regression for the Schrödinger bridge problem
Denis Belomestny, John. Schoenmakers
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)

In this paper, we study the Schrödinger Bridge Problem (SBP), which is central to entropic optimal transport. For general reference processes and begin--endpoint distributions, we propose a forward-reverse iterative Monte Carlo procedure to approximate the Schrödinger potentials in a nonparametric way. In particular, we use kernel based Monte Carlo regression in the context of Picard iteration of a corresponding fixed point problem. By preserving in the iteration positivity and contractivity in a Hilbert metric sense, we develop a provably convergent algorithm. Furthermore, we provide convergence rates for the potential estimates and prove their optimality. Finally, as an application, we propose a non-nested Monte Carlo procedure for the final dimensional distributions of the Schrödinger Bridge process, based on the constructed potentials and the forward-reverse simulation method for conditional diffusions.

[501] arXiv:2507.00641 (cross-list from nlin.AO) [pdf, html, other]
Title: Hebbian Physics Networks: A Self-Organizing Computational Architecture Based on Local Physical Laws
Gunjan Auti, Hirofumi Daiguji, Gouhei Tanaka
Comments: 6 pages, 2 figures, 2 supplementary videos
Subjects: Adaptation and Self-Organizing Systems (nlin.AO); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)

Traditional machine learning approaches in physics rely on global optimization, limiting interpretability and enforcing physical constraints externally. We introduce the Hebbian Physics Network (HPN), a self-organizing computational framework in which learning emerges from local Hebbian updates driven by violations of conservation laws. Grounded in non-equilibrium thermodynamics and inspired by Prigogine/'s theory of dissipative structures, HPNs eliminate the need for global loss functions by encoding physical laws directly into the system/'s local dynamics. Residuals - quantified imbalances in continuity, momentum, or energy - serve as thermodynamic signals that drive weight adaptation through generalized Hebbian plasticity. We demonstrate this approach on incompressible fluid flow and continuum diffusion, where physically consistent structures emerge from random initial conditions without supervision. HPNs reframe computation as a residual-driven thermodynamic process, offering an interpretable, scalable, and physically grounded alternative for modeling complex dynamical systems.

[502] arXiv:2507.00645 (cross-list from math.AP) [pdf, html, other]
Title: A convex lifting approach for the Calderón problem
Giovanni S. Alberti, Romain Petit, Simone Sanna
Subjects: Analysis of PDEs (math.AP); Functional Analysis (math.FA); Numerical Analysis (math.NA); Optimization and Control (math.OC)

The Calderón problem consists in recovering an unknown coefficient of a partial differential equation from boundary measurements of its solution. These measurements give rise to a highly nonlinear forward operator. As a consequence, the development of reconstruction methods for this inverse problem is challenging, as they usually suffer from the problem of local convergence. To circumvent this issue, we propose an alternative approach based on lifting and convex relaxation techniques, that have been successfully developed for solving finite-dimensional quadratic inverse problems. This leads to a convex optimization problem whose solution coincides with the sought-after coefficient, provided that a non-degenerate source condition holds. We demonstrate the validity of our approach on a toy model where the solution of the partial differential equation is known everywhere in the domain. In this simplified setting, we verify that the non-degenerate source condition holds under certain assumptions on the unknown coefficient. We leave the investigation of its validity in the Calderón setting for future works.

[503] arXiv:2507.00660 (cross-list from eess.IV) [pdf, html, other]
Title: MTCNet: Motion and Topology Consistency Guided Learning for Mitral Valve Segmentationin 4D Ultrasound
Rusi Chen, Yuanting Yang, Jiezhi Yao, Hongning Song, Ji Zhang, Yongsong Zhou, Yuhao Huang, Ronghao Yang, Dan Jia, Yuhan Zhang, Xing Tao, Haoran Dou, Qing Zhou, Xin Yang, Dong Ni
Comments: Accepted by MICCAI 2025
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Mitral regurgitation is one of the most prevalent cardiac disorders. Four-dimensional (4D) ultrasound has emerged as the primary imaging modality for assessing dynamic valvular morphology. However, 4D mitral valve (MV) analysis remains challenging due to limited phase annotations, severe motion artifacts, and poor imaging quality. Yet, the absence of inter-phase dependency in existing methods hinders 4D MV analysis. To bridge this gap, we propose a Motion-Topology guided consistency network (MTCNet) for accurate 4D MV ultrasound segmentation in semi-supervised learning (SSL). MTCNet requires only sparse end-diastolic and end-systolic annotations. First, we design a cross-phase motion-guided consistency learning strategy, utilizing a bi-directional attention memory bank to propagate spatio-temporal features. This enables MTCNet to achieve excellent performance both per- and inter-phase. Second, we devise a novel topology-guided correlation regularization that explores physical prior knowledge to maintain anatomically plausible. Therefore, MTCNet can effectively leverage structural correspondence between labeled and unlabeled phases. Extensive evaluations on the first largest 4D MV dataset, with 1408 phases from 160 patients, show that MTCNet performs superior cross-phase consistency compared to other advanced methods (Dice: 87.30%, HD: 1.75mm). Both the code and the dataset are available at this https URL.

[504] arXiv:2507.00670 (cross-list from eess.IV) [pdf, html, other]
Title: Mind the Detail: Uncovering Clinically Relevant Image Details in Accelerated MRI with Semantically Diverse Reconstructions
Jan Nikolas Morshuis, Christian Schlarmann, Thomas Küstner, Christian F. Baumgartner, Matthias Hein
Comments: MICCAI 2025
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

In recent years, accelerated MRI reconstruction based on deep learning has led to significant improvements in image quality with impressive results for high acceleration factors. However, from a clinical perspective image quality is only secondary; much more important is that all clinically relevant information is preserved in the reconstruction from heavily undersampled data. In this paper, we show that existing techniques, even when considering resampling for diffusion-based reconstruction, can fail to reconstruct small and rare pathologies, thus leading to potentially wrong diagnosis decisions (false negatives). To uncover the potentially missing clinical information we propose ``Semantically Diverse Reconstructions'' (\SDR), a method which, given an original reconstruction, generates novel reconstructions with enhanced semantic variability while all of them are fully consistent with the measured data. To evaluate \SDR automatically we train an object detector on the fastMRI+ dataset. We show that \SDR significantly reduces the chance of false-negative diagnoses (higher recall) and improves mean average precision compared to the original reconstructions. The code is available on this https URL

[505] arXiv:2507.00671 (cross-list from stat.CO) [pdf, html, other]
Title: Harnessing the Power of Reinforcement Learning for Adaptive MCMC
Congye Wang, Matthew A. Fisher, Heishiro Kanagawa, Wilson Chen, Chris. J. Oates
Subjects: Computation (stat.CO); Machine Learning (cs.LG); Machine Learning (stat.ML)

Sampling algorithms drive probabilistic machine learning, and recent years have seen an explosion in the diversity of tools for this task. However, the increasing sophistication of sampling algorithms is correlated with an increase in the tuning burden. There is now a greater need than ever to treat the tuning of samplers as a learning task in its own right. In a conceptual breakthrough, Wang et al (2025) formulated Metropolis-Hastings as a Markov decision process, opening up the possibility for adaptive tuning using Reinforcement Learning (RL). Their emphasis was on theoretical foundations; realising the practical benefit of Reinforcement Learning Metropolis-Hastings (RLMH) was left for subsequent work. The purpose of this paper is twofold: First, we observe the surprising result that natural choices of reward, such as the acceptance rate, or the expected squared jump distance, provide insufficient signal for training RLMH. Instead, we propose a novel reward based on the contrastive divergence, whose superior performance in the context of RLMH is demonstrated. Second, we explore the potential of RLMH and present adaptive gradient-based samplers that balance flexibility of the Markov transition kernel with learnability of the associated RL task. A comprehensive simulation study using the posteriordb benchmark supports the practical effectiveness of RLMH.

[506] arXiv:2507.00673 (cross-list from eess.IV) [pdf, html, other]
Title: Prompt2SegCXR:Prompt to Segment All Organs and Diseases in Chest X-rays
Abduz Zami, Shadman Sobhan, Rounaq Hossain, Md. Sawran Sorker, Mohiuddin Ahmed, Md. Redwan Hossain
Comments: 29 Pages
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Image segmentation plays a vital role in the medical field by isolating organs or regions of interest from surrounding areas. Traditionally, segmentation models are trained on a specific organ or a disease, limiting their ability to handle other organs and diseases. At present, few advanced models can perform multi-organ or multi-disease segmentation, offering greater flexibility. Also, recently, prompt-based image segmentation has gained attention as a more flexible approach. It allows models to segment areas based on user-provided prompts. Despite these advances, there has been no dedicated work on prompt-based interactive multi-organ and multi-disease segmentation, especially for Chest X-rays. This work presents two main contributions: first, generating doodle prompts by medical experts of a collection of datasets from multiple sources with 23 classes, including 6 organs and 17 diseases, specifically designed for prompt-based Chest X-ray segmentation. Second, we introduce Prompt2SegCXR, a lightweight model for accurately segmenting multiple organs and diseases from Chest X-rays. The model incorporates multi-stage feature fusion, enabling it to combine features from various network layers for better spatial and semantic understanding, enhancing segmentation accuracy. Compared to existing pre-trained models for prompt-based image segmentation, our model scores well, providing a reliable solution for segmenting Chest X-rays based on user prompts.

[507] arXiv:2507.00675 (cross-list from physics.soc-ph) [pdf, html, other]
Title: Can Machines Philosophize?
Michele Pizzochero, Giorgia Dellaferrera
Subjects: Physics and Society (physics.soc-ph); Computers and Society (cs.CY); History and Philosophy of Physics (physics.hist-ph)

Inspired by the Turing test, we present a novel methodological framework to assess the extent to which a population of machines mirrors the philosophical views of a population of humans. The framework consists of three steps: (i) instructing machines to impersonate each human in the population, reflecting their backgrounds and beliefs, (ii) administering a questionnaire covering various philosophical positions to both humans and machines, and (iii) statistically analyzing the resulting responses. We apply this methodology to the debate on scientific realism, a long-standing philosophical inquiry exploring the relationship between science and reality. By considering the outcome of a survey of over 500 human participants, including both physicists and philosophers of science, we generate their machine personas using an artificial intelligence engine based on a large-language generative model. We reveal that the philosophical views of a population of machines are, on average, similar to those endorsed by a population of humans, irrespective of whether they are physicists or philosophers of science. As compared to humans, however, machines exhibit a weaker inclination toward scientific realism and a stronger coherence in their philosophical positions. Given the observed similarities between the populations of humans and machines, this methodological framework may offer unprecedented opportunities for advancing research in experimental philosophy by replacing human participants with their machine-impersonated counterparts, possibly mitigating the efficiency and reproducibility issues that affect survey-based empirical studies.

[508] arXiv:2507.00683 (cross-list from cond-mat.mtrl-sci) [pdf, html, other]
Title: Testing the spin-bath view of self-attention: A Hamiltonian analysis of GPT-2 Transformer
Satadeep Bhattacharjee, Seung-Cheol Lee
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)

The recently proposed physics-based framework by Huo and Johnson~\cite{huo2024capturing} models the attention mechanism of Large Language Models (LLMs) as an interacting two-body spin system, offering a first-principles explanation for phenomena like repetition and bias. Building on this hypothesis, we extract the complete Query-Key weight matrices from a production-grade GPT-2 model and derive the corresponding effective Hamiltonian for every attention head. From these Hamiltonians we obtain analytic \textit{phase boundaries} logit gap criteria that predict which token should dominate the next-token distribution for a given context. A systematic evaluation on 144 heads across 20 factual-recall prompts reveals a strong negative correlation between the theoretical logit gaps and the model's empirical token rankings ($r\approx-0.70$, $p<10^{-3}$).Targeted ablations further show that suppressing the heads most aligned with the spin-bath predictions induces the anticipated shifts in output probabilities, confirming a causal link rather than a coincidental association. Taken together, our findings provide the first strong empirical evidence for the spin-bath analogy in a production-grade model. This validation not only furnishes a tractable, physics-inspired lens for interpretability but also provides the groundwork for novel generative models, bridging the gap between theoretical condensed matter physics and AI.

[509] arXiv:2507.00717 (cross-list from math.OC) [pdf, html, other]
Title: General Perturbation Resilient Dynamic String-Averaging for Inconsistent Problems with Superiorization
Kay Barshad, Yair Censor
Comments: 31 pages. Accepted for publication in Journal of Optimization Theory and Applications
Subjects: Optimization and Control (math.OC); Functional Analysis (math.FA); Numerical Analysis (math.NA)

In this paper we introduce a General Dynamic String-Averaging (GDSA) iterative scheme and investigate its convergence properties in the inconsistent case, that is, when the input operators don't have a common fixed point. The Dynamic String-Averaging Projection (DSAP) algorithm itself was introduced in an 2013 paper, where its strong convergence and bounded perturbation resilience were studied in the consistent case (that is, when the sets under consideration had a nonempty intersection). Results involving combination of the DSAP method with superiorization, were presented in 2015. The proof of the weak convergence of our GDSA method is based on the notion of "strong coherence" of sequences of operators that was introduced in 2019. This is an improvement of the property of "coherence" of sequences of operators introduced in 2001 by Bauschke and Combettes. Strong coherence provides a more convenient sufficient convergence condition for methods that employ infinite sequences of operators and it turns out to be a useful general tool when applied to proving the convergence of many iterative methods. In this paper we combine the ideas of both dynamic string-averaging and strong coherence, in order to analyze our GDSA method for a general class of operators and its bounded perturbation resilience in the inconsistent case with weak and strong convergence. We then discuss an application of the GDSA method to the Superiorization Methodology, developing results on the behavior of its superiorized version.

[510] arXiv:2507.00719 (cross-list from physics.flu-dyn) [pdf, html, other]
Title: Guided Unconditional and Conditional Generative Models for Super-Resolution and Inference of Quasi-Geostrophic Turbulence
Anantha Narayanan Suresh Babu, Akhil Sadam, Pierre F.J. Lermusiaux
Comments: 56 pages, 23 figures, 7 tables
Subjects: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Geophysics (physics.geo-ph)

Typically, numerical simulations of the ocean, weather, and climate are coarse, and observations are sparse and gappy. In this work, we apply four generative diffusion modeling approaches to super-resolution and inference of forced two-dimensional quasi-geostrophic turbulence on the beta-plane from coarse, sparse, and gappy observations. Two guided approaches minimally adapt a pre-trained unconditional model: SDEdit modifies the initial condition, and Diffusion Posterior Sampling (DPS) modifies the reverse diffusion process score. The other two conditional approaches, a vanilla variant and classifier-free guidance, require training with paired high-resolution and observation data. We consider eight test cases spanning: two regimes, eddy and anisotropic-jet turbulence; two Reynolds numbers, 10^3 and 10^4; and two observation types, 4x coarse-resolution fields and coarse, sparse and gappy observations. Our comprehensive skill metrics include norms of the reconstructed vorticity fields, turbulence statistical quantities, and quantification of the super-resolved probabilistic ensembles and their errors. We also study the sensitivity to tuning parameters such as guidance strength. Results show that SDEdit generates unphysical fields, while DPS generates reasonable reconstructions at low computational cost but with smoothed fine-scale features. Both conditional approaches require re-training, but they reconstruct missing fine-scale features, are cycle-consistent with observations, and possess the correct statistics such as energy spectra. Further, their mean model errors are highly correlated with and predictable from their ensemble standard deviations. Results highlight the trade-offs between ease of implementation, fidelity (sharpness), and cycle-consistency of the diffusion models, and offer practical guidance for deployment in geophysical inverse problems.

[511] arXiv:2507.00720 (cross-list from physics.flu-dyn) [pdf, html, other]
Title: A simplified unified wave-particle method for diatomic gases with rotational and vibrational non-equilibrium
Sirui Yang, Chengwen Zhong, Ningchao Ding, Junzhe Cao, He Zhang, Congshan Zhuo, Sha Liu
Subjects: Fluid Dynamics (physics.flu-dyn); Numerical Analysis (math.NA)

The hypersonic flow around near-space vehicles constitutes a multi-scale flow problem. Due to insufficient molecular collisions to achieve equilibrium, rarefied gas effects are present in the flow field. Thus, numerical methods capable of accurately resolving multi-scale flows are required. Furthermore, high-temperature gas effects in hypersonic flows mean vibrational excitation of polyatomic molecules. Consequently, numerical methods accounting for non-equilibrium in rotational and vibrational internal energy modes are required. This study derives a quantified model-competition (QMC) mechanism for diatomic gases with rotational and vibrational non-equilibrium, starting from integral solutions of kinetic model equations with rotational and vibrational energy. The QMC mechanism categorize collisional and free-transport particles in cell, applying computational weighting based on their local scale regimes. We developed a simplified unified wave-particle (SUWP) method for diatomic gases based on QMC mechanism. For the macroscopic of the method, a three-temperature model accounting for rotational and vibrational energy is incorporated into both the kinetic inviscid flux scheme and {Navier-Stokes} solvers. For the microscopic of the method, a collisionless DSMC solver is employed to resolve non-equilibrium flow physics. This work validates the proposed SUWP method with rotational and vibrational non-equilibrium through benchmark cases, including shock tube, shock structures, flow past a cylinder, Apollo 6 command module and space station Mir. Compared to the DSMC and deterministic methods, the SUWP method exhibits favorable computational efficiency while maintaining accuracy.

[512] arXiv:2507.00743 (cross-list from eess.IV) [pdf, html, other]
Title: Tunable Wavelet Unit based Convolutional Neural Network in Optical Coherence Tomography Analysis Enhancement for Classifying Type of Epiretinal Membrane Surgery
An Le, Nehal Mehta, William Freeman, Ines Nagel, Melanie Tran, Anna Heinke, Akshay Agnihotri, Lingyun Cheng, Dirk-Uwe Bartsch, Hung Nguyen, Truong Nguyen, Cheolhong An
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)

In this study, we developed deep learning-based method to classify the type of surgery performed for epiretinal membrane (ERM) removal, either internal limiting membrane (ILM) removal or ERM-alone removal. Our model, based on the ResNet18 convolutional neural network (CNN) architecture, utilizes postoperative optical coherence tomography (OCT) center scans as inputs. We evaluated the model using both original scans and scans preprocessed with energy crop and wavelet denoising, achieving 72% accuracy on preprocessed inputs, outperforming the 66% accuracy achieved on original scans. To further improve accuracy, we integrated tunable wavelet units with two key adaptations: Orthogonal Lattice-based Wavelet Units (OrthLatt-UwU) and Perfect Reconstruction Relaxation-based Wavelet Units (PR-Relax-UwU). These units allowed the model to automatically adjust filter coefficients during training and were incorporated into downsampling, stride-two convolution, and pooling layers, enhancing its ability to distinguish between ERM-ILM removal and ERM-alone removal, with OrthLattUwU boosting accuracy to 76% and PR-Relax-UwU increasing performance to 78%. Performance comparisons showed that our AI model outperformed a trained human grader, who achieved only 50% accuracy in classifying the removal surgery types from postoperative OCT scans. These findings highlight the potential of CNN based models to improve clinical decision-making by providing more accurate and reliable classifications. To the best of our knowledge, this is the first work to employ tunable wavelets for classifying different types of ERM removal surgery.

[513] arXiv:2507.00747 (cross-list from math.DS) [pdf, html, other]
Title: SINDy on slow manifolds
Diemen Delgado-Cano, Erick Kracht, Urban Fasel, Benjamin Herrmann
Comments: 18 pages, 6 figures, to be submitted to Nonlinear Dynamics (Springer)
Subjects: Dynamical Systems (math.DS); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)

The sparse identification of nonlinear dynamics (SINDy) has been established as an effective method to learn interpretable models of dynamical systems from data. However, for high-dimensional slow-fast dynamical systems, the regression problem becomes simultaneously computationally intractable and ill-conditioned. Although, in principle, modeling only the dynamics evolving on the underlying slow manifold addresses both of these challenges, the truncated fast variables have to be compensated by including higher-order nonlinearities as candidate terms for the model, leading to an explosive growth in the size of the SINDy library. In this work, we develop a SINDy variant that is able to robustly and efficiently identify slow-fast dynamics in two steps: (i) identify the slow manifold, that is, an algebraic equation for the fast variables as functions of the slow ones, and (ii) learn a model for the dynamics of the slow variables restricted to the manifold. Critically, the equation learned in (i) is leveraged to build a manifold-informed function library for (ii) that contains only essential higher-order nonlinearites as candidate terms. Rather than containing all monomials of up to a certain degree, the resulting custom library is a sparse subset of the latter that is tailored to the specific problem at hand. The approach is demonstrated on numerical examples of a snap-through buckling beam and the flow over a NACA 0012 airfoil. We find that our method significantly reduces both the condition number and the size of the SINDy library, thus enabling accurate identification of the dynamics on slow manifolds.

[514] arXiv:2507.00755 (cross-list from eess.AS) [pdf, other]
Title: LearnAFE: Circuit-Algorithm Co-design Framework for Learnable Audio Analog Front-End
Jinhai Hu, Zhongyi Zhang, Cong Sheng Leow, Wang Ling Goh, Yuan Gao
Comments: 11 pages, 15 figures, accepted for publication on IEEE Transactions on Circuits and Systems I: Regular Papers
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)

This paper presents a circuit-algorithm co-design framework for learnable analog front-end (AFE) in audio signal classification. Designing AFE and backend classifiers separately is a common practice but non-ideal, as shown in this paper. Instead, this paper proposes a joint optimization of the backend classifier with the AFE's transfer function to achieve system-level optimum. More specifically, the transfer function parameters of an analog bandpass filter (BPF) bank are tuned in a signal-to-noise ratio (SNR)-aware training loop for the classifier. Using a co-design loss function LBPF, this work shows superior optimization of both the filter bank and the classifier. Implemented in open-source SKY130 130nm CMOS process, the optimized design achieved 90.5%-94.2% accuracy for 10-keyword classification task across a wide range of input signal SNR from 5 dB to 20 dB, with only 22k classifier parameters. Compared to conventional approach, the proposed audio AFE achieves 8.7% and 12.9% reduction in power and capacitor area respectively.

[515] arXiv:2507.00780 (cross-list from eess.IV) [pdf, other]
Title: Research on Improving the High Precision and Lightweight Diabetic Retinopathy Detection of YOLOv8n
Fei Yuhuan, Sun Xufei, Zang Ran, Wang Gengchen, Su Meng, Liu Fenghao
Comments: in Chinese language
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Early detection and diagnosis of diabetic retinopathy is one of the current research focuses in ophthalmology. However, due to the subtle features of micro-lesions and their susceptibility to background interference, ex-isting detection methods still face many challenges in terms of accuracy and robustness. To address these issues, a lightweight and high-precision detection model based on the improved YOLOv8n, named YOLO-KFG, is proposed. Firstly, a new dynamic convolution KWConv and C2f-KW module are designed to improve the backbone network, enhancing the model's ability to perceive micro-lesions. Secondly, a fea-ture-focused diffusion pyramid network FDPN is designed to fully integrate multi-scale context information, further improving the model's ability to perceive micro-lesions. Finally, a lightweight shared detection head GSDHead is designed to reduce the model's parameter count, making it more deployable on re-source-constrained devices. Experimental results show that compared with the base model YOLOv8n, the improved model reduces the parameter count by 20.7%, increases [email protected] by 4.1%, and improves the recall rate by 7.9%. Compared with single-stage mainstream algorithms such as YOLOv5n and YOLOv10n, YOLO-KFG demonstrates significant advantages in both detection accuracy and efficiency.

[516] arXiv:2507.00823 (cross-list from quant-ph) [pdf, other]
Title: Quantum Speedups for Polynomial-Time Dynamic Programming Algorithms
Susanna Caroppo, Giordano Da Lozzo, Giuseppe Di Battista, Michael T. Goodrich, Martin Nöllenburg
Comments: This is the extended version of a paper to appear at the 19th Algorithms and Data Structures Symposium (WADS 2025)
Subjects: Quantum Physics (quant-ph); Data Structures and Algorithms (cs.DS)

We introduce a quantum dynamic programming framework that allows us to directly extend to the quantum realm a large body of classical dynamic programming algorithms. The corresponding quantum dynamic programming algorithms retain the same space complexity as their classical counterpart, while achieving a computational speedup. For a combinatorial (search or optimization) problem $\mathcal P$ and an instance $I$ of $\mathcal P$, such a speedup can be expressed in terms of the average degree $\delta$ of the dependency digraph $G_{\mathcal{P}}(I)$ of $I$, determined by a recursive formulation of $\mathcal P$. The nodes of this graph are the subproblems of $\mathcal P$ induced by $I$ and its arcs are directed from each subproblem to those on whose solution it relies. In particular, our framework allows us to solve the considered problems in $\tilde{O}(|V(G_{\mathcal{P}}(I))| \sqrt{\delta})$ time. As an example, we obtain a quantum version of the Bellman-Ford algorithm for computing shortest paths from a single source vertex to all the other vertices in a weighted $n$-vertex digraph with $m$ edges that runs in $\tilde{O}(n\sqrt{nm})$ time, which improves the best known classical upper bound when $m \in \Omega(n^{1.4})$.

[517] arXiv:2507.00832 (cross-list from eess.IV) [pdf, other]
Title: Automated anatomy-based post-processing reduces false positives and improved interpretability of deep learning intracranial aneurysm detection
Jisoo Kim, Chu-Hsuan Lin, Alberto Ceballos-Arroyo, Ping Liu, Huaizu Jiang, Shrikanth Yadav, Qi Wan, Lei Qin, Geoffrey S Young
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Introduction: Deep learning (DL) models can help detect intracranial aneurysms on CTA, but high false positive (FP) rates remain a barrier to clinical translation, despite improvement in model architectures and strategies like detection threshold tuning. We employed an automated, anatomy-based, heuristic-learning hybrid artery-vein segmentation post-processing method to further reduce FPs. Methods: Two DL models, CPM-Net and a deformable 3D convolutional neural network-transformer hybrid (3D-CNN-TR), were trained with 1,186 open-source CTAs (1,373 annotated aneurysms), and evaluated with 143 held-out private CTAs (218 annotated aneurysms). Brain, artery, vein, and cavernous venous sinus (CVS) segmentation masks were applied to remove possible FPs in the DL outputs that overlapped with: (1) brain mask; (2) vein mask; (3) vein more than artery masks; (4) brain plus vein mask; (5) brain plus vein more than artery masks. Results: CPM-Net yielded 139 true-positives (TP); 79 false-negative (FN); 126 FP. 3D-CNN-TR yielded 179 TP; 39 FN; 182 FP. FPs were commonly extracranial (CPM-Net 27.3%; 3D-CNN-TR 42.3%), venous (CPM-Net 56.3%; 3D-CNN-TR 29.1%), arterial (CPM-Net 11.9%; 3D-CNN-TR 53.3%), and non-vascular (CPM-Net 25.4%; 3D-CNN-TR 9.3%) structures. Method 5 performed best, reducing CPM-Net FP by 70.6% (89/126) and 3D-CNN-TR FP by 51.6% (94/182), without reducing TP, lowering the FP/case rate from 0.88 to 0.26 for CPM-NET, and from 1.27 to 0.62 for the 3D-CNN-TR. Conclusion: Anatomy-based, interpretable post-processing can improve DL-based aneurysm detection model performance. More broadly, automated, domain-informed, hybrid heuristic-learning processing holds promise for improving the performance and clinical acceptance of aneurysm detection models.

[518] arXiv:2507.00853 (cross-list from math.OC) [pdf, html, other]
Title: Ranking Quantilized Mean-Field Games with an Application to Early-Stage Venture Investments
Rinel Foguen Tchuendom, Dena Firoozi, Michèle Breton
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY); Mathematical Finance (q-fin.MF)

Quantilized mean-field game models involve quantiles of the population's distribution. We study a class of such games with a capacity for ranking games, where the performance of each agent is evaluated based on its terminal state relative to the population's $\alpha$-quantile value, $\alpha \in (0,1)$. This evaluation criterion is designed to select the top $(1-\alpha)\%$ performing agents. We provide two formulations for this competition: a target-based formulation and a threshold-based formulation. In the former and latter formulations, to satisfy the selection condition, each agent aims for its terminal state to be \textit{exactly} equal and \textit{at least} equal to the population's $\alpha$-quantile value, respectively.
For the target-based formulation, we obtain an analytic solution and demonstrate the $\epsilon$-Nash property for the asymptotic best-response strategies in the $N$-player game. Specifically, the quantilized mean-field consistency condition is expressed as a set of forward-backward ordinary differential equations, characterizing the $\alpha$-quantile value at equilibrium. For the threshold-based formulation, we obtain a semi-explicit solution and numerically solve the resulting quantilized mean-field consistency condition.
Subsequently, we propose a new application in the context of early-stage venture investments, where a venture capital firm financially supports a group of start-up companies engaged in a competition over a finite time horizon, with the goal of selecting a percentage of top-ranking ones to receive the next round of funding at the end of the time horizon. We present the results and interpretations of numerical experiments for both formulations discussed in this context and show that the target-based formulation provides a very good approximation for the threshold-based formulation.

[519] arXiv:2507.00858 (cross-list from q-bio.PE) [pdf, html, other]
Title: The Evolution of Altruistic Rationality Provides a Solution to Social Dilemmas via Rational Reciprocity
Mohammad Salahshour, Iain D. Couzin
Comments: 25 pages, 5 figures
Subjects: Populations and Evolution (q-bio.PE); Computer Science and Game Theory (cs.GT); Physics and Society (physics.soc-ph)

Decades of scientific inquiry have sought to understand how evolution fosters cooperation, a concept seemingly at odds with the belief that evolution should produce rational, self-interested individuals. Most previous work has focused on the evolution of cooperation among boundedly rational individuals whose decisions are governed by behavioral rules that do not need to be rational. Here, using an evolutionary model, we study how altruism can evolve in a community of rational agents and promote cooperation. We show that in both well-mixed and structured populations, a population of objectively rational agents is readily invaded by mutant individuals who make rational decisions but evolve a distorted (i.e., subjective) perception of their payoffs. This promotes behavioral diversity and gives rise to the evolution of rational, other-regarding agents who naturally solve all the known strategic problems of two-person, two-strategy games by perceiving their games as pure coordination games.

[520] arXiv:2507.00866 (cross-list from astro-ph.IM) [pdf, html, other]
Title: Template-Fitting Meets Deep Learning: Redshift Estimation Using Physics-Guided Neural Networks
Jonas Chris Ferrao, Dickson Dias, Pranav Naik, Glory D'Cruz, Anish Naik, Siya Khandeparkar, Manisha Gokuldas Fal Dessai
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)

Accurate photometric redshift estimation is critical for observational cosmology, especially in large-scale surveys where spectroscopic measurements are impractical. Traditional approaches include template fitting and machine learning, each with distinct strengths and limitations. We present a hybrid method that integrates template fitting with deep learning using physics-guided neural networks. By embedding spectral energy distribution templates into the network architecture, our model encodes physical priors into the training process. The system employs a multimodal design, incorporating cross-attention mechanisms to fuse photometric and image data, along with Bayesian layers for uncertainty estimation. We evaluate our model on the publicly available PREML dataset, which includes approximately 400,000 galaxies from the Hyper Suprime-Cam PDR3 release, with 5-band photometry, multi-band imaging, and spectroscopic redshifts. Our approach achieves an RMS error of 0.0507, a 3-sigma catastrophic outlier rate of 0.13%, and a bias of 0.0028. The model satisfies two of the three LSST photometric redshift requirements for redshifts below 3. These results highlight the potential of combining physically motivated templates with data-driven models for robust redshift estimation in upcoming cosmological surveys.

[521] arXiv:2507.00871 (cross-list from math.OC) [pdf, html, other]
Title: Swarm-based optimization with jumps: a kinetic BGK framework and convergence analysis
Giacomo Borghi, Hyesung Im, Lorenzo Pareschi
Subjects: Optimization and Control (math.OC); Numerical Analysis (math.NA)

Metaheuristic algorithms are powerful tools for global optimization, particularly for non-convex and non-differentiable problems where exact methods are often impractical. Particle-based optimization methods, inspired by swarm intelligence principles, have shown effectiveness due to their ability to balance exploration and exploitation within the search space. In this work, we introduce a novel particle-based optimization algorithm where velocities are updated via random jumps, a strategy commonly used to enhance stochastic exploration. We formalize this approach by describing the dynamics through a kinetic modelling of BGK type, offering a unified framework that accommodates general noise distributions, including heavy-tailed ones like Cauchy. Under suitable parameter scaling, the model reduces to the Consensus-Based Optimization (CBO) dynamics. For non-degenerate Gaussian noise in bounded domains, we prove propagation of chaos and convergence towards minimizers. Numerical results on benchmark problems validate the approach and highlight its connection to CBO.

[522] arXiv:2507.00894 (cross-list from stat.ML) [pdf, html, other]
Title: An in depth look at the Procrustes-Wasserstein distance: properties and barycenters
Davide Adamo, Marco Corneli, Manon Vuillien, Emmanuelle Vila
Comments: 16 pages
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Due to its invariance to rigid transformations such as rotations and reflections, Procrustes-Wasserstein (PW) was introduced in the literature as an optimal transport (OT) distance, alternative to Wasserstein and more suited to tasks such as the alignment and comparison of point clouds. Having that application in mind, we carefully build a space of discrete probability measures and show that over that space PW actually is a distance. Algorithms to solve the PW problems already exist, however we extend the PW framework by discussing and testing several initialization strategies. We then introduce the notion of PW barycenter and detail an algorithm to estimate it from the data. The result is a new method to compute representative shapes from a collection of point clouds. We benchmark our method against existing OT approaches, demonstrating superior performance in scenarios requiring precise alignment and shape preservation. We finally show the usefulness of the PW barycenters in an archaeological context. Our results highlight the potential of PW in boosting 2D and 3D point cloud analysis for machine learning and computational geometry applications.

[523] arXiv:2507.00903 (cross-list from eess.IV) [pdf, other]
Title: Deep learning-based segmentation of T1 and T2 cardiac MRI maps for automated disease detection
Andreea Bianca Popescu, Andreas Seitz, Heiko Mahrholdt, Jens Wetzl, Athira Jacob, Lucian Mihai Itu, Constantin Suciu, Teodora Chitiboi
Comments: This work has been submitted for consideration at European Radiology (Springer). Upon acceptance, this preprint will be updated with the journal reference
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Objectives Parametric tissue mapping enables quantitative cardiac tissue characterization but is limited by inter-observer variability during manual delineation. Traditional approaches relying on average relaxation values and single cutoffs may oversimplify myocardial complexity. This study evaluates whether deep learning (DL) can achieve segmentation accuracy comparable to inter-observer variability, explores the utility of statistical features beyond mean T1/T2 values, and assesses whether machine learning (ML) combining multiple features enhances disease detection. Materials & Methods T1 and T2 maps were manually segmented. The test subset was independently annotated by two observers, and inter-observer variability was assessed. A DL model was trained to segment left ventricle blood pool and myocardium. Average (A), lower quartile (LQ), median (M), and upper quartile (UQ) were computed for the myocardial pixels and employed in classification by applying cutoffs or in ML. Dice similarity coefficient (DICE) and mean absolute percentage error evaluated segmentation performance. Bland-Altman plots assessed inter-user and model-observer agreement. Receiver operating characteristic analysis determined optimal cutoffs. Pearson correlation compared features from model and manual segmentations. F1-score, precision, and recall evaluated classification performance. Wilcoxon test assessed differences between classification methods, with p < 0.05 considered statistically significant. Results 144 subjects were split into training (100), validation (15) and evaluation (29) subsets. Segmentation model achieved a DICE of 85.4%, surpassing inter-observer agreement. Random forest applied to all features increased F1-score (92.7%, p < 0.001). Conclusion DL facilitates segmentation of T1/ T2 maps. Combining multiple features with ML improves disease detection.

[524] arXiv:2507.00941 (cross-list from math.CO) [pdf, html, other]
Title: Seeing is not believing in limited visibility cops and robbers
Bojan Bašić, Alfie Davies, Aleksa Džuklevski, Strahinja Gvozdić, Yannick Mogge
Comments: 30 pages, 11 figures
Subjects: Combinatorics (math.CO); Discrete Mathematics (cs.DM)

We consider the model of limited visibility Cops and Robbers, where the cops can only see within their $l$-neighbourhood. We prove that the number of cops needed to see the robber can be arbitrarily smaller than the number needed to capture the robber, answering an open question from the literature. We then consider how close we can get to seeing the robber when we do not have enough cops, along with a probabilistic interpretation.

[525] arXiv:2507.00953 (cross-list from q-bio.BM) [pdf, other]
Title: From Sentences to Sequences: Rethinking Languages in Biological System
Ke Liu, Shuanke Shen, Hao Chen
Subjects: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)

The paradigm of large language models in natural language processing (NLP) has also shown promise in modeling biological languages, including proteins, RNA, and DNA. Both the auto-regressive generation paradigm and evaluation metrics have been transferred from NLP to biological sequence modeling. However, the intrinsic structural correlations in natural and biological languages differ fundamentally. Therefore, we revisit the notion of language in biological systems to better understand how NLP successes can be effectively translated to biological domains. By treating the 3D structure of biomolecules as the semantic content of a sentence and accounting for the strong correlations between residues or bases, we highlight the importance of structural evaluation and demonstrate the applicability of the auto-regressive paradigm in biological language modeling. Code can be found at \href{this https URL}{this http URL}

[526] arXiv:2507.00957 (cross-list from astro-ph.SR) [pdf, html, other]
Title: Atmospheric model-trained machine learning selection and classification of ultracool TY dwarfs
Ankit Biswas
Comments: 12 pages, 9 figures, to be published in Monthly Notices of the Royal Astronomical Society
Subjects: Solar and Stellar Astrophysics (astro-ph.SR); Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)

The T and Y spectral classes represent the coolest and lowest-mass population of brown dwarfs, yet their census remains incomplete due to limited statistics. Existing detection frameworks are often constrained to identifying M, L, and early T dwarfs, owing to the sparse observational sample of ultracool dwarfs (UCDs) at later types. This paper presents a novel machine learning framework capable of detecting and classifying late-T and Y dwarfs, trained entirely on synthetic photometry from atmospheric models. Utilizing grids from the ATMO 2020 and Sonora Bobcat models, I produce a training dataset over two orders of magnitude larger than any empirical set of >T6 UCDs. Polynomial color relations fitted to the model photometry are used to assign spectral types to these synthetic models, which in turn train an ensemble of classifiers to identify and classify the spectral type of late UCDs. The model is highly performant when validating on both synthetic and empirical datasets, verifying catalogs of known UCDs with object classification metrics >99% and an average spectral type precision within 0.35 +/- 0.37 subtypes. Application of the model to a 1.5 degree region around Pisces and the UKIDSS UDS field results in the discovery of one previously uncatalogued T8.2 candidate, demonstrating the ability of this model-trained approach in discovering faint, late-type UCDs from photometric catalogs.

[527] arXiv:2507.00983 (cross-list from eess.IV) [pdf, html, other]
Title: DMCIE: Diffusion Model with Concatenation of Inputs and Errors to Improve the Accuracy of the Segmentation of Brain Tumors in MRI Images
Sara Yavari, Rahul Nitin Pandya, Jacob Furst
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Accurate segmentation of brain tumors in MRI scans is essential for reliable clinical diagnosis and effective treatment planning. Recently, diffusion models have demonstrated remarkable effectiveness in image generation and segmentation tasks. This paper introduces a novel approach to corrective segmentation based on diffusion models. We propose DMCIE (Diffusion Model with Concatenation of Inputs and Errors), a novel framework for accurate brain tumor segmentation in multi-modal MRI scans. We employ a 3D U-Net to generate an initial segmentation mask, from which an error map is generated by identifying the differences between the prediction and the ground truth. The error map, concatenated with the original MRI images, are used to guide a diffusion model. Using multimodal MRI inputs (T1, T1ce, T2, FLAIR), DMCIE effectively enhances segmentation accuracy by focusing on misclassified regions, guided by the original inputs. Evaluated on the BraTS2020 dataset, DMCIE outperforms several state-of-the-art diffusion-based segmentation methods, achieving a Dice Score of 93.46 and an HD95 of 5.94 mm. These results highlight the effectiveness of error-guided diffusion in producing precise and reliable brain tumor segmentations.

[528] arXiv:2507.00991 (cross-list from math.AP) [pdf, html, other]
Title: Stable skeleton integral equations for general coefficient Helmholtz transmission problems
Benedikt Gräßle, Ralf Hiptmair, Stefan Sauter
Subjects: Analysis of PDEs (math.AP); Numerical Analysis (math.NA)

A novel variational formulation of layer potentials and boundary integral operators generalizes their classical construction by Green's functions, which are not explicitly available for Helmholtz problems with variable coefficients. Wavenumber explicit estimates and properties like jump conditions follow directly from their variational definition and enable a non-local (``integral'') formulation of acoustic transmission problems (TP) with piecewise Lipschitz coefficients. We obtain the well-posedness of the integral equations directly from the stability of the underlying TP. The simultaneous analysis for general dimensions and complex wavenumbers (in this paper) imposes an artificial boundary on the external Helmholtz problem and employs recent insights into the associated Dirichlet-to-Neumann map.

[529] arXiv:2507.00993 (cross-list from eess.IV) [pdf, html, other]
Title: Advancing Lung Disease Diagnosis in 3D CT Scans
Qingqiu Li, Runtian Yuan, Junlin Hou, Jilan Xu, Yuejie Zhang, Rui Feng, Hao Chen
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

To enable more accurate diagnosis of lung disease in chest CT scans, we propose a straightforward yet effective model. Firstly, we analyze the characteristics of 3D CT scans and remove non-lung regions, which helps the model focus on lesion-related areas and reduces computational cost. We adopt ResNeSt50 as a strong feature extractor, and use a weighted cross-entropy loss to mitigate class imbalance, especially for the underrepresented squamous cell carcinoma category. Our model achieves a Macro F1 Score of 0.80 on the validation set of the Fair Disease Diagnosis Challenge, demonstrating its strong performance in distinguishing between different lung conditions.

Replacement submissions (showing 283 of 283 entries)

[530] arXiv:2203.09952 (replaced) [pdf, html, other]
Title: Conquering Ghosts: Relation Learning for Information Reliability Representation and End-to-End Robust Navigation
Kefan Jin, Xingyao Han
Subjects: Artificial Intelligence (cs.AI)

Environmental disturbances, such as sensor data noises, various lighting conditions, challenging weathers and external adversarial perturbations, are inevitable in real self-driving applications. Existing researches and testings have shown that they can severely influence the vehicles perception ability and performance, one of the main issue is the false positive detection, i.e., the ghost object which is not real existed or occurs in the wrong position (such as a non-existent vehicle). Traditional navigation methods tend to avoid every detected objects for safety, however, avoiding a ghost object may lead the vehicle into a even more dangerous situation, such as a sudden break on the highway. Considering the various disturbance types, it is difficult to address this issue at the perceptual aspect. A potential solution is to detect the ghost through relation learning among the whole scenario and develop an integrated end-to-end navigation system. Our underlying logic is that the behavior of all vehicles in the scene is influenced by their neighbors, and normal vehicles behave in a logical way, while ghost vehicles do not. By learning the spatio-temporal relation among surrounding vehicles, an information reliability representation is learned for each detected vehicle and then a robot navigation network is developed. In contrast to existing works, we encourage the network to learn how to represent the reliability and how to aggregate all the information with uncertainties by itself, thus increasing the efficiency and generalizability. To the best of the authors knowledge, this paper provides the first work on using graph relation learning to achieve end-to-end robust navigation in the presence of ghost vehicles. Simulation results in the CARLA platform demonstrate the feasibility and effectiveness of the proposed method in various scenarios.

[531] arXiv:2210.02773 (replaced) [pdf, other]
Title: Computing Threshold Budgets in Discrete-Bidding Games
Guy Avni, Suman Sadhukhan
Comments: Journal version for TheoretiCS of a paper published at FSTTCS 2022. Corrects typos that previously appeared in crucial places
Journal-ref: TheoretiCS, Volume 4 (2025), Article 5, 1-46
Subjects: Formal Languages and Automata Theory (cs.FL); Computer Science and Game Theory (cs.GT)

In a two-player zero-sum graph game, the players move a token throughout a graph to produce an infinite play, which determines the winner of the game. Bidding games are graph games in which in each turn, an auction (bidding) determines which player moves the token: the players have budgets, and in each turn, both players simultaneously submit bids that do not exceed their available budgets, the higher bidder moves the token, and pays the bid to the lower bidder (called Richman bidding). We focus on discrete-bidding games, in which, motivated by practical applications, the granularity of the players' bids is restricted, e.g., bids must be given in cents.
A central quantity in bidding games is threshold budgets: a necessary and sufficient initial budget for winning the game. Previously, thresholds were shown to exist in parity games, but their structure was only understood for reachability games. Moreover, the previously-known algorithms have a worst-case exponential running time for both reachability and parity objectives, and output strategies that use exponential memory. We describe two algorithms for finding threshold budgets in parity discrete-bidding games. The first is a fixed-point algorithm. It reveals, for the first time, the structure of threshold budgets in parity discrete-bidding games. Based on this structure, we develop a second algorithm that shows that the problem of finding threshold budgets is in NP and coNP for both reachability and parity objectives. Moreover, our algorithm constructs strategies that use only linear memory.
This is a corrected version of the paper (arXiv:2210.02773v4) published originally on Jan 22, 2025.

[532] arXiv:2210.06230 (replaced) [pdf, html, other]
Title: Quasi-symbolic Semantic Geometry over Transformer-based Variational AutoEncoder
Yingji Zhang, Danilo S. Carvalho, André Freitas
Comments: CoNLL2025 (Best Paper nomination)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Formal/symbolic semantics can provide canonical, rigid controllability and interpretability to sentence representations due to their \textit{localisation} or \textit{composition} property. How can we deliver such property to the current distributional sentence representations to control and interpret the generation of language models (LMs)? In this work, we theoretically frame the sentence semantics as the composition of \textit{semantic role - word content} features and propose the formal semantic geometry. To inject such geometry into Transformer-based LMs (i.e. GPT2), we deploy Transformer-based Variational AutoEncoder with a supervision approach, where the sentence generation can be manipulated and explained over low-dimensional latent Gaussian space. In addition, we propose a new probing algorithm to guide the movement of sentence vectors over such geometry. Experimental results reveal that the formal semantic geometry can potentially deliver better control and interpretation to sentence generation.

[533] arXiv:2301.09997 (replaced) [pdf, other]
Title: Higher-Order Weakest Precondition Transformers via a CPS Transformation
Satoshi Kura
Subjects: Logic in Computer Science (cs.LO)

Weakest preconditions are a useful notion for program verification as they reduce a problem of program verification to a problem of constraint solving. Category-theoretic generalisations of weakest preconditions have been studied to capture various computational effects and various properties in a unified framework. In this paper, we propose a novel and general relationship between weakest precondition transformers and CPS transformations for higher-order functional languages with general computational effects and recursion. Technically, this gives a syntactic counterpart of the categorically-defined generic weakest precondition transformer in [Aguirre & Katsumata, 2020]. The usefulness of our results is threefold. (1) Since CPS transformations purify effectful programs, various verification problems for effectful programs can be reduced to verification problems for pure programs. This syntactic reduction makes it easier to solve the verification problems and potentially facilitates combinations with other sophisticated verification methods tailored for pure programs. (2) We capture two existing verification methods, namely, verification of event sequences [Kobayashi et al., 2018] and expected cost [Avanzini et al., 2021] as instances of our framework. (3) Our results streamline the process of extending weakest precondition transformers for imperative programs to those for higher-order programs. We show two such extensions: analysis of higher moments of cost and the conditional weakest pre-expectation for higher-order probabilistic programs. These extensions demonstrate that our theoretical framework can produce novel verification methods.

[534] arXiv:2302.14368 (replaced) [pdf, html, other]
Title: Enhanced Controllability of Diffusion Models via Feature Disentanglement and Realism-Enhanced Sampling Methods
Wonwoong Cho, Hareesh Ravi, Midhun Harikumar, Vinh Khuc, Krishna Kumar Singh, Jingwan Lu, David I. Inouye, Ajinkya Kale
Comments: ECCV 2024; Code is available at this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)

As Diffusion Models have shown promising performance, a lot of efforts have been made to improve the controllability of Diffusion Models. However, how to train Diffusion Models to have the disentangled latent spaces and how to naturally incorporate the disentangled conditions during the sampling process have been underexplored. In this paper, we present a training framework for feature disentanglement of Diffusion Models (FDiff). We further propose two sampling methods that can boost the realism of our Diffusion Models and also enhance the controllability. Concisely, we train Diffusion Models conditioned on two latent features, a spatial content mask, and a flattened style embedding. We rely on the inductive bias of the denoising process of Diffusion Models to encode pose/layout information in the content feature and semantic/style information in the style feature. Regarding the sampling methods, we first generalize Composable Diffusion Models (GCDM) by breaking the conditional independence assumption to allow for some dependence between conditional inputs, which is shown to be effective in realistic generation in our experiments. Second, we propose timestep-dependent weight scheduling for content and style features to further improve the performance. We also observe better controllability of our proposed methods compared to existing methods in image manipulation and image translation.

[535] arXiv:2303.10270 (replaced) [pdf, other]
Title: Adaptive Modified RISE Control for Quadrotors: Enhancing Trajectory Tracking Through Uncertainty Compensation
Kevin Johnston, Musabbir Ahmed Arrafi, Krishna B Kidambi, Madhur Tiwari
Comments: This paper is under review
Subjects: Systems and Control (eess.SY)

This paper presents an adaptive modified Robust Inverse of Signum Error (AM-RISE) control method, which achieves reliable trajectory tracking control for a quadrotor unmanned aerial vehicle. The proposed method systematically accounts for gyroscopic effects, rotor dynamics, parametric uncertainties, and external disturbances, ensuring robust performance across varying trajectory speeds. Through novel mathematical manipulation in the error system development, the quadrotor dynamics are expressed in a control-oriented form, which explicitly incorporates the uncertainty in the gyroscopic term and control actuation term. An adaptive modified RISE law is then designed to stabilize both the position and attitude loops of the quadrotor system. A rigorous Lyapunov-based analysis is utilized to prove asymptotic trajectory tracking, where the region of convergence can be made arbitrarily large through judicious control gain selection. Moreover, the stability analysis formally addresses gyroscopic effects and actuator uncertainty. To illustrate the performance of the control law, comparative numerical simulation results are provided, which demonstrate the improved closed-loop performance achieved under varying levels of parametric uncertainty, disturbance magnitudes and trajectory speeds.

[536] arXiv:2307.16579 (replaced) [pdf, html, other]
Title: Contrastive Conditional Latent Diffusion for Audio-visual Segmentation
Yuxin Mao, Jing Zhang, Mochu Xiang, Yunqiu Lv, Dong Li, Yiran Zhong, Yuchao Dai
Journal-ref: IEEE Transactions on Image Processing 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)

We propose a contrastive conditional latent diffusion model for audio-visual segmentation (AVS) to thoroughly investigate the impact of audio, where the correlation between audio and the final segmentation map is modeled to guarantee the strong correlation between them. To achieve semantic-correlated representation learning, our framework incorporates a latent diffusion model. The diffusion model learns the conditional generation process of the ground-truth segmentation map, resulting in ground-truth aware inference during the denoising process at the test stage. As our model is conditional, it is vital to ensure that the conditional variable contributes to the model output. We thus extensively model the contribution of the audio signal by minimizing the density ratio between the conditional probability of the multimodal data, e.g. conditioned on the audio-visual data, and that of the unimodal data, e.g. conditioned on the audio data only. In this way, our latent diffusion model via density ratio optimization explicitly maximizes the contribution of audio for AVS, which can then be achieved with contrastive learning as a constraint, where the diffusion part serves as the main objective to achieve maximum likelihood estimation, and the density ratio optimization part imposes the constraint. By adopting this latent diffusion model via contrastive learning, we effectively enhance the contribution of audio for AVS. The effectiveness of our solution is validated through experimental results on the benchmark dataset. Code and results are online via our project page: this https URL.

[537] arXiv:2310.02277 (replaced) [pdf, html, other]
Title: Junk DNA Hypothesis: Pruning Small Pre-Trained Weights Irreversibly and Monotonically Impairs "Difficult" Downstream Tasks in LLMs
Lu Yin, Ajay Jaiswal, Shiwei Liu, Souvik Kundu, Zhangyang Wang
Comments: Published at ICML 2024
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

We present Junk DNA Hypothesis by adopting a novel task-centric angle for the pre-trained weights of large language models (LLMs). It has been believed that weights in LLMs contain significant redundancy, leading to the conception that a considerable chunk of the parameters can be removed by pruning without compromising performance. Contrary to this belief, this paper presents a counter-argument: small-magnitude weights of pre-trained model weights encode vital knowledge essential for tackling difficult downstream tasks - manifested as the monotonic relationship between the performance drop of downstream tasks across the difficulty spectrum, as we prune more pre-trained weights by magnitude. Moreover, we reveal that these seemingly inconsequential weights can result in irreparable loss of knowledge and performance degradation in difficult tasks, even when downstream continual training is allowed. Interestingly, our evaluations show that the other popular compression, namely quantization, fails to exhibit similar monotonic effect and does not as convincingly disentangle this task-difficulty information. To study formally, we introduce several quantifiable metrics to gauge the downstream task difficulty: (1) within the same task category, and (2) across different task categories. Our extensive experiments substantiate the Junk DNA Hypothesis across a diverse range of model sizes, tasks, datasets, and even pruning methods. Codes are available at: this https URL.

[538] arXiv:2310.05175 (replaced) [pdf, html, other]
Title: Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity
Lu Yin, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Gen Li, Ajay Jaiswal, Mykola Pechenizkiy, Yi Liang, Michael Bendersky, Zhangyang Wang, Shiwei Liu
Comments: Published at ICML 2024
Subjects: Machine Learning (cs.LG)

Large Language Models (LLMs), renowned for their remarkable performance across diverse domains, present a challenge when it comes to practical deployment due to their colossal model size. In response to this challenge, efforts have been directed toward the application of traditional network pruning techniques to LLMs, uncovering a massive number of parameters that can be pruned in one-shot without hurting performance. Prevailing LLM pruning strategies have consistently adhered to the practice of uniformly pruning all layers at equivalent sparsity, resulting in robust performance. However, this observation stands in contrast to the prevailing trends observed in the field of vision models, where non-uniform layerwise sparsity typically yields stronger results. To understand the underlying reasons for this disparity, we conduct a comprehensive study and discover a strong correlation with the emergence of activation outliers in LLMs. Inspired by this finding, we introduce a novel LLM pruning methodology that incorporates a tailored set of non-uniform layerwise sparsity ratios, termed as Outlier Weighed Layerwise sparsity (OWL). The sparsity ratio of OWL is proportional to the outlier ratio observed within each layer, facilitating a more effective alignment between layerwise weight sparsity and outlier ratios. Our empirical evaluation, conducted across the LLaMA-V1 family and OPT, spanning various benchmarks, demonstrates the distinct advantages offered by OWL over previous methods. For instance, OWL exhibits a remarkable performance gain, surpassing the state-of-the-art Wanda and SparseGPT by 61.22 and 6.80 perplexity at a high sparsity level of 70%, respectively, while delivering 2.6x end-to-end inference speed-up in the DeepSparse inference engine. Codes are available at this https URL.

[539] arXiv:2311.07438 (replaced) [pdf, html, other]
Title: Hardest Monotone Functions for Evolutionary Algorithms
Marc Kaufmann, Maxime Larcher, Johannes Lengler, Oliver Sieberling
Comments: revised journal version
Subjects: Neural and Evolutionary Computing (cs.NE); Probability (math.PR)

In this paper we revisit the question how hard it can be for the $(1+1)$ Evolutionary Algorithm to optimize monotone pseudo-Boolean functions. By introducing a more pessimistic stochastic process, the partially-ordered evolutionary algorithm (PO-EA) model, Jansen first proved a runtime bound of $O(n^{3/2})$. More recently, Lengler, Martinsson and Steger improved this upper bound to $O(n \log^2 n)$ by an entropy compression argument. In this work, we analyze monotone functions that may adversarially vary at each step of the optimization, so-called dynamic monotone functions. We introduce the function Switching Dynamic BinVal (SDBV) and prove, using a combinatorial argument, that for the $(1+1)$-EA with any mutation rate $p \in [0,1]$, SDBV is drift minimizing within the class of dynamic monotone functions. We further show that the $(1+1)$-EA optimizes SDBV in $\Theta(n^{3/2})$ generations. Therefore, our construction provides the first explicit example which realizes the pessimism of the \poea model. Our simulations demonstrate matching runtimes for both the static and the self-adjusting $(1,\lambda)$-EA and $(1+\lambda)$-EA. Moreover, devising an example for fixed dimension, we illustrate that drift minimization does not equal maximal runtime beyond asymptotic analysis.

[540] arXiv:2311.09511 (replaced) [pdf, html, other]
Title: Identifying Systems with Symmetries using Equivariant Autoregressive Reservoir Computers
Fredy Vides, Idelfonso B. R. Nogueira, Gabriela Lopez Gutierrez, Lendy Banegas, Evelyn Flores
Comments: The views expressed in the article do not necessarily represent the views of the National Commission of Banks and Insurance Companies of Honduras
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)

The investigation reported in this document focuses on identifying systems with symmetries using equivariant autoregressive reservoir computers. General results in structured matrix approximation theory are presented, exploring a two-fold approach. Firstly, a comprehensive examination of generic symmetry-preserving nonlinear time delay embedding is conducted. This involves analyzing time series data sampled from an equivariant system under study. Secondly, sparse least-squares methods are applied to discern approximate representations of the output coupling matrices. These matrices play a critical role in determining the nonlinear autoregressive representation of an equivariant system. The structural characteristics of these matrices are dictated by the set of symmetries inherent in the system. The document outlines prototypical algorithms derived from the described techniques, offering insight into their practical applications. Emphasis is placed on the significant improvement on structured identification precision when compared to classical reservoir computing methods for the simulation of equivariant dynamical systems.

[541] arXiv:2311.10248 (replaced) [pdf, html, other]
Title: Identifying the Truth of Global Model: A Generic Solution to Defend Against Byzantine and Backdoor Attacks in Federated Learning (full version)
Sheldon C. Ebron, Meiying Zhang, Kan Yang
Comments: Accepted to ACISP 2025. This is the full version
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)

Federated Learning (FL) enables multiple parties to train machine learning models collaboratively without sharing the raw training data. However, the federated nature of FL enables malicious clients to influence a trained model by injecting error model updates via Byzantine or backdoor attacks. To detect malicious model updates, a typical approach is to measure the distance between each model update and a \textit{ground-truth model update}. To find such \textit{ground-truth model updates}, existing defenses either require a benign root dataset on the server (e.g., FLTrust) or simply use trimmed mean or median as the threshold for clipping (e.g., FLAME). However, such benign root datasets are impractical, and the trimmed mean or median may also eliminate contributions from these underrepresented datasets.
In this paper, we propose a generic solution, namely FedTruth, to defend against model poisoning attacks in FL, where the \textit{ground-truth model update} (i.e., the global model update) will be estimated among all the model updates with dynamic aggregation weights. Specifically, FedTruth does not have specific assumptions on the benign or malicious data distribution or access to a benign root dataset. Moreover, FedTruth considers the potential contributions from all benign clients. Our empirical results show that FedTruth can reduce the impacts of poisoned model updates against both Byzantine and backdoor attacks, and is also efficient in large-scale FL systems.

[542] arXiv:2311.17592 (replaced) [pdf, other]
Title: Robust Correlated Equilibrium: Definition and Computation
Rahul Misra, Rafał Wisniewski, Carsten Skovmose Kallesøe, Manuela L. Bujorianu
Comments: Preprint submitted to Automatica
Subjects: Systems and Control (eess.SY); Multiagent Systems (cs.MA); Machine Learning (stat.ML)

We study N-player finite games with costs perturbed due to time-varying disturbances in the underlying system and to that end, we propose the concept of Robust Correlated Equilibrium that generalizes the definition of Correlated Equilibrium. Conditions under which the Robust Correlated Equilibrium exists are specified, and a decentralized algorithm for learning strategies that are optimal in the sense of Robust Correlated Equilibrium is proposed. The primary contribution of the paper is the convergence analysis of the algorithm and to that end, we propose a modification of the celebrated Blackwell's Approachability theorem to games with costs that are not just time-average, as in the original Blackwell's Approachability Theorem, but also include the time-average of previous algorithm iterates. The designed algorithm is applied to a practical water distribution network with pumps being the controllers and their costs being perturbed by uncertain consumption due to the consumers. Simulation results show that each controller achieves no regret, and empirical distributions converge to the Robust Correlated Equilibrium.

[543] arXiv:2401.06657 (replaced) [pdf, html, other]
Title: How Resilient is QUIC to Security and Privacy Attacks?
Jayasree Sengupta, Debasmita Dey, Simone Ferlin-Reiter, Nirnay Ghosh, Vaibhav Bajpai
Comments: 7 pages, 1 figure, 1 table
Subjects: Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)

QUIC has rapidly evolved into a cornerstone transport protocol for secure, low-latency communications, yet its deployment continues to expose critical security and privacy vulnerabilities, particularly during connection establishment phases and via traffic analysis. This paper systematically revisits a comprehensive set of attacks on QUIC and emerging privacy threats. Building upon these observations, we critically analyze recent IETF mitigation efforts, including TLS Encrypted Client Hello (ECH), Oblivious HTTP (OHTTP) and MASQUE. We analyze how these mechanisms enhance privacy while introducing new operational risks, particularly under adversarial load. Additionally, we discuss emerging challenges posed by post-quantum cryptographic (PQC) handshakes, including handshake expansion and metadata leakage risks. Our analysis highlights ongoing gaps between theoretical defenses and practical deployments, and proposes new research directions focused on adaptive privacy mechanisms. Building on these insights, we propose future directions to ensure long-term security of QUIC and aim to guide its evolution as a robust, privacy-preserving, and resilient transport foundation for the next-generation Internet.

[544] arXiv:2401.14640 (replaced) [pdf, html, other]
Title: Can LLMs Evaluate Complex Attribution in QA? Automatic Benchmarking using Knowledge Graphs
Nan Hu, Jiaoyan Chen, Yike Wu, Guilin Qi, Hongru Wang, Sheng Bi, Yongrui Chen, Tongtong Wu, Jeff Z. Pan
Comments: Accepted to ACL 2025 (Main Conference)
Subjects: Computation and Language (cs.CL)

Attributed Question Answering (AQA) has attracted wide attention, but there are still several limitations in evaluating the attributions, including lacking fine-grained attribution categories, relying on manual annotations, and failing to compare attributions with only subtle differences. To bridge these gaps, we introduce Complex Attributed Question Answering (CAQA), a large-scale benchmark containing comprehensive attribution categories, automatically generated using Knowledge Graphs (KGs), and complex attribution scenarios. We have conducted extensive experiments to verify the effectiveness of CAQA, including the benchmarking of 25 automatic evaluators, their comparison with human evaluators, the testing of LLM evaluators fine-tuned by CAQA and so on. These experiments also lead to a series of important findings that can benefit the future research of AQA. All the codes and data are publicly accessible at this https URL.

[545] arXiv:2402.01020 (replaced) [pdf, html, other]
Title: Quantifying analogy of concepts via ologs and wiring diagrams
Jason Lo
Comments: 30 pages. Minor updates to Section 5
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM); Combinatorics (math.CO); Category Theory (math.CT)

We build on the theory of ontology logs (ologs) created by Spivak and Kent, and define a notion of wiring diagrams. In this article, a wiring diagram is a finite directed labelled graph. The labels correspond to types in an olog; they can also be interpreted as readings of sensors in an autonomous system. As such, wiring diagrams can be used as a framework for an autonomous system to form abstract concepts. We show that the graphs underlying skeleton wiring diagrams form a category. This allows skeleton wiring diagrams to be compared and manipulated using techniques from both graph theory and category theory. We also extend the usual definition of graph edit distance to the case of wiring diagrams by using operations only available to wiring diagrams, leading to a metric on the set of all skeleton wiring diagrams. In the end, we give an extended example on calculating the distance between two concepts represented by wiring diagrams, and explain how to apply our framework to any application domain.

[546] arXiv:2402.10665 (replaced) [pdf, html, other]
Title: Soft Dice Confidence: A Near-Optimal Confidence Estimator for Selective Prediction in Semantic Segmentation
Bruno Laboissiere Camargos Borges, Bruno Machado Pacheco, Danilo Silva
Comments: 42 pages, 9 figures
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Selective prediction augments a model with the option to abstain from providing unreliable predictions. The key ingredient is a confidence score function, which should be directly related to the conditional risk. In the case of binary semantic segmentation, existing score functions either ignore the particularities of the evaluation metric or demand additional held-out data for tuning. We propose the Soft Dice Confidence (SDC), a simple, tuning-free confidence score function that directly aligns with the Dice coefficient metric. We prove that, under conditional independence, the SDC is near optimal: we establish upper and lower bounds on the ratio between the SDC and the ideal (intractable) confidence score function and show that these bounds are very close to 1. Experiments on six public medical-imaging benchmarks and on synthetic data corroborate our theoretical findings. In fact, SDC outperformed all prior confidence estimators from the literature in all of our experiments, including those that rely on additional data. These results position SDC as a reliable and efficient confidence estimator for selective prediction in semantic segmentation.

[547] arXiv:2402.10747 (replaced) [pdf, html, other]
Title: Fully Differentiable Lagrangian Convolutional Neural Network for Physics-Informed Precipitation Nowcasting
Peter Pavlík, Martin Výboh, Anna Bou Ezzeddine, Viera Rozinajová
Comments: Submitted to Applied Computing and Geosciences
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

This paper presents a convolutional neural network model for precipitation nowcasting that combines data-driven learning with physics-informed domain knowledge. We propose LUPIN, a Lagrangian Double U-Net for Physics-Informed Nowcasting, that draws from existing extrapolation-based nowcasting methods. It consists of a U-Net that dynamically produces mesoscale advection motion fields, a differentiable semi-Lagrangian extrapolation operator, and an advection-free U-Net capturing the growth and decay of precipitation over time. Using our approach, we successfully implement the Lagrangian convolutional neural network for precipitation nowcasting in a fully differentiable and GPU-accelerated manner. This allows for end-to-end training and inference, including the data-driven Lagrangian coordinate system transformation of the data at runtime. We evaluate the model and compare it with other related AI-based models both quantitatively and qualitatively in an extreme event case study. Based on our evaluation, LUPIN matches and even exceeds the performance of the chosen benchmarks, opening the door for other Lagrangian machine learning models.

[548] arXiv:2403.14023 (replaced) [pdf, other]
Title: A system capable of verifiably and privately screening global DNA synthesis
Carsten Baum (1 and 2), Jens Berlips (3), Walther Chen (3), Helena Cozzarini (3), Hongrui Cui (4), Ivan Damgård (1), Jiangbin Dong (5), Kevin M. Esvelt (3 and 6), Leonard Foner (3), Mingyu Gao (5 and 12), Dana Gretton (3 and 6), Martin Kysel (3), Juanru Li (4), Xiang Li (5), Omer Paneth (7), Ronald L. Rivest (7), Francesca Sage-Ling (3), Adi Shamir (8), Yue Shen (10), Meicen Sun (11), Vinod Vaikuntanathan (7), Lynn Van Hauwe (3), Theia Vogel (3), Benjamin Weinstein-Raun (3), Yun Wang (10), Daniel Wichs (9), Stephen Wooster (3), Andrew C. Yao (3 and 5 and 12), Yu Yu (4 and 12), Haoling Zhang (10), Kaiyi Zhang (4) ((1) Department of Computer Science, Aarhus University, Denmark, (2) DTU Compute, Technical University of Denmark, Denmark, (3) SecureDNA Foundation, Switzerland, (4) Department of Computer Science and Engineering, Shanghai Jiao Tong University, China, (5) Institute for Interdisciplinary Information Sciences, Tsinghua University, China, (6) Media Lab, Massachusetts Institute of Technology, USA, (7) Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, USA, (8) Department of Applied Mathematics, Weizmann Institute of Science, Israel, (9) Department of Computer Science, Northeastern University, USA, (10) China National GeneBank, China, (11) Department of Political Science, Massachusetts Institute of Technology, USA, (12) Shanghai Qi Zhi Institute, China)
Comments: Main text 12 pages, 5 figures. 4 supplementary figures and 2 supplementary tables. 5 appendices. Total 37 pages. Direct correspondence to: Ivan B. Damgård (ivan@cs.this http URL), Andrew C. Yao (andrewcyao@mail.this http URL), Kevin M. Esvelt ([email protected])
Subjects: Cryptography and Security (cs.CR)

Printing custom DNA sequences is essential to scientific and biomedical research, but the technology can be used to manufacture plagues as well as cures. Just as ink printers recognize and reject attempts to counterfeit money, DNA synthesizers and assemblers should deny unauthorized requests to make viral DNA that could be misused. There are three complications. First, we don't need to quickly update printers to deal with newly discovered currencies, whereas we regularly learn of new potential pandemic viruses and other biological threats. Second, convincing counterfeit bills can't be printed in small pieces and taped together, while preventing the distributed synthesis and subsequent re-assembly of controlled sequences will require tracking which DNA fragments have been ordered across all providers and benchtop devices while protecting legitimate customer privacy. Finally, counterfeiting can at worst undermine faith in currency, whereas unauthorized DNA synthesis could be used to deliberately cause pandemics. Here we describe SecureDNA, a free, privacy-preserving, and fully automated system capable of verifiably screening all DNA synthesis orders of 30+ nucleotides against an up-to-date database of controlled sequences, and its operational performance and specificity when applied to 67 million nucleotides of DNA synthesized by providers in the United States, Europe, and China.

[549] arXiv:2404.09158 (replaced) [pdf, html, other]
Title: StreakNet-Arch: An Anti-scattering Network-based Architecture for Underwater Carrier LiDAR-Radar Imaging
Xuelong Li, Hongjun An, Haofei Zhao, Guangying Li, Bo Liu, Xing Wang, Guanghua Cheng, Guojun Wu, Zhe Sun
Comments: Accepted by IEEE Transactions on Image Processing (T-IP)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

In this paper, we introduce StreakNet-Arch, a real-time, end-to-end binary-classification framework based on our self-developed Underwater Carrier LiDAR-Radar (UCLR) that embeds Self-Attention and our novel Double Branch Cross Attention (DBC-Attention) to enhance scatter suppression. Under controlled water tank validation conditions, StreakNet-Arch with Self-Attention or DBC-Attention outperforms traditional bandpass filtering and achieves higher $F_1$ scores than learning-based MP networks and CNNs at comparable model size and complexity. Real-time benchmarks on an NVIDIA RTX 3060 show a constant Average Imaging Time (54 to 84 ms) regardless of frame count, versus a linear increase (58 to 1,257 ms) for conventional methods. To facilitate further research, we contribute a publicly available streak-tube camera image dataset contains 2,695,168 real-world underwater 3D point cloud data. More importantly, we validate our UCLR system in a South China Sea trial, reaching an error of 46mm for 3D target at 1,000 m depth and 20 m range. Source code and data are available at this https URL .

[550] arXiv:2404.15917 (replaced) [pdf, other]
Title: Hardness and Tight Approximations of Demand Strip Packing
Klaus Jansen, Malin Rau, Malte Tutas
Subjects: Data Structures and Algorithms (cs.DS)

We settle the pseudo-polynomial complexity of the Demand Strip Packing (DSP) problem: Given a strip of fixed width and a set of items with widths and heights, the items must be placed inside the strip with the objective of minimizing the peak height. This problem has gained significant scientific interest due to its relevance in smart grids[Deppert et al.\ APPROX'21, Gálvez et al.\ APPROX'21]. Smart Grids are a modern form of electrical grid that provide opportunities for optimization. They are forecast to impact the future of energy provision significantly. Algorithms running in pseudo-polynomial time lend themselves to these applications as considered time intervals, such as days, are small. Moreover, such algorithms can provide superior approximation guarantees over those running in polynomial time. Consequently, they evoke scientific interest in related problems.
We prove that Demand Strip Packing is strongly NP-hard for approximation ratios below $5/4$. Through this proof, we provide novel insights into the relation of packing and scheduling problems. Using these insights, we show a series of frameworks that solve both Demand Strip Packing and Parallel Task Scheduling optimally when increasing the strip's width or number of machines. Such alterations to problems are known as resource augmentation. Applications are found when penalty costs are prohibitively large. Finally, we provide a pseudo-polynomial time approximation algorithm for DSP with an approximation ratio of $(5/4+\varepsilon)$, which is nearly optimal assuming $P\neq NP$. The construction of this algorithm provides several insights into the structure of DSP solutions and uses novel techniques to restructure optimal solutions.

[551] arXiv:2405.03264 (replaced) [pdf, other]
Title: Delooping presented groups in homotopy type theory
Camil Champin, Samuel Mimram, Emile Oleon
Subjects: Logic in Computer Science (cs.LO); Category Theory (math.CT)

Homotopy type theory is a logical setting based on Martin-Löf type theory in which one can perform geometric constructions and proofs in a synthetic way. Namely, types can be interpreted as spaces up to homotopy and proofs as homotopy invariant constructions. In this context, loop spaces of pointed connected groupoids provide a natural representation of groups, and any group can be obtained as the loop space of such a type, which is then called a delooping of the group. There are two main methods for constructing the delooping of an arbitrary group G. The first one consists in describing it as a pointed higher inductive type, whereas the second one consists in taking the connected component of the principal G-torsor in the type of sets equipped with an action of G. We show here that, when a presentation (or even a generating set) is known for the group, simpler variants of those constructions can be used to build deloopings. The resulting types are more amenable to computations and lead to simpler meta-theoretic reasoning. Finally, we develop a type theoretical notion of 2-polygraph, which allows manipulating higher inductive types such as the ones involved in the description of deloopings. These allow us to investigate in this context a construction for the Cayley graph of a generated group and show that it encodes the relations of the group, as well as a Cayley complex which encodes relations between relations. Many of the developments performed in the article have been formalized using the cubical version of the Agda proof assistant.

[552] arXiv:2405.05769 (replaced) [pdf, html, other]
Title: Exploring Text-Guided Single Image Editing for Remote Sensing Images
Fangzhou Han, Lingyu Si, Zhizhuo Jiang, Hongwei Dong, Lamei Zhang, Yu Liu, Hao Chen, Bo Du
Comments: 17 pages, 18 figures, Accepted by IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Artificial intelligence generative content (AIGC) has significantly impacted image generation in the field of remote sensing. However, the equally important area of remote sensing image (RSI) editing has not received sufficient attention. Deep learning based editing methods generally involve two sequential stages: generation and editing. For natural images, these stages primarily rely on generative backbones pre-trained on large-scale benchmark datasets and text guidance facilitated by vision-language models (VLMs). However, it become less viable for RSIs: First, existing generative RSI benchmark datasets do not fully capture the diversity of RSIs, and is often inadequate for universal editing tasks. Second, the single text semantic corresponds to multiple image semantics, leading to the introduction of incorrect semantics. To solve above problems, this paper proposes a text-guided RSI editing method and can be trained using only a single image. A multi-scale training approach is adopted to preserve consistency without the need for training on extensive benchmarks, while leveraging RSI pre-trained VLMs and prompt ensembling (PE) to ensure accuracy and controllability. Experimental results on multiple RSI editing tasks show that the proposed method offers significant advantages in both CLIP scores and subjective evaluations compared to existing methods. Additionally, we explore the ability of the edited RSIs to support disaster assessment tasks in order to validate their practicality. Codes will be released at this https URL.

[553] arXiv:2406.00772 (replaced) [pdf, html, other]
Title: Unsupervised contrastive analysis for anomaly detection in brain MRIs via conditional diffusion models
Cristiano Patrício, Carlo Alberto Barbano, Attilio Fiandrotti, Riccardo Renzulli, Marco Grangetto, Luis F. Teixeira, João C. Neves
Comments: Under consideration at Pattern Recognition Letters
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Contrastive Analysis (CA) detects anomalies by contrasting patterns unique to a target group (e.g., unhealthy subjects) from those in a background group (e.g., healthy subjects). In the context of brain MRIs, existing CA approaches rely on supervised contrastive learning or variational autoencoders (VAEs) using both healthy and unhealthy data, but such reliance on target samples is challenging in clinical settings. Unsupervised Anomaly Detection (UAD) offers an alternative by learning a reference representation of healthy anatomy without the need for target samples. Deviations from this reference distribution can indicate potential anomalies. In this context, diffusion models have been increasingly adopted in UAD due to their superior performance in image generation compared to VAEs. Nonetheless, precisely reconstructing the anatomy of the brain remains a challenge. In this work, we propose an unsupervised framework to improve the reconstruction quality by training a self-supervised contrastive encoder on healthy images to extract meaningful anatomical features. These features are used to condition a diffusion model to reconstruct the healthy appearance of a given image, enabling interpretable anomaly localization via pixel-wise comparison. We validate our approach through a proof-of-concept on a facial image dataset and further demonstrate its effectiveness on four brain MRI datasets, achieving state-of-the-art anomaly localization performance on the NOVA benchmark.

[554] arXiv:2406.04370 (replaced) [pdf, html, other]
Title: Large Language Model Confidence Estimation via Black-Box Access
Tejaswini Pedapati, Amit Dhurandhar, Soumya Ghosh, Soham Dan, Prasanna Sattigeri
Comments: Accepted to TMLR 2025
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Estimating uncertainty or confidence in the responses of a model can be significant in evaluating trust not only in the responses, but also in the model as a whole. In this paper, we explore the problem of estimating confidence for responses of large language models (LLMs) with simply black-box or query access to them. We propose a simple and extensible framework where, we engineer novel features and train a (interpretable) model (viz. logistic regression) on these features to estimate the confidence. We empirically demonstrate that our simple framework is effective in estimating confidence of Flan-ul2, Llama-13b, Mistral-7b and GPT-4 on four benchmark Q\&A tasks as well as of Pegasus-large and BART-large on two benchmark summarization tasks with it surpassing baselines by even over $10\%$ (on AUROC) in some cases. Additionally, our interpretable approach provides insight into features that are predictive of confidence, leading to the interesting and useful discovery that our confidence models built for one LLM generalize zero-shot across others on a given dataset.

[555] arXiv:2406.04814 (replaced) [pdf, html, other]
Title: Lifelong Learning of Video Diffusion Models From a Single Video Stream
Jason Yoo, Yingchen He, Saeid Naderiparizi, Dylan Green, Gido M. van de Ven, Geoff Pleiss, Frank Wood
Comments: Video samples are available here: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

This work demonstrates that training autoregressive video diffusion models from a single video stream$\unicode{x2013}$resembling the experience of embodied agents$\unicode{x2013}$is not only possible, but can also be as effective as standard offline training given the same number of gradient steps. Our work further reveals that this main result can be achieved using experience replay methods that only retain a subset of the preceding video stream. To support training and evaluation in this setting, we introduce four new datasets for streaming lifelong generative video modeling: Lifelong Bouncing Balls, Lifelong 3D Maze, Lifelong Drive, and Lifelong PLAICraft, each consisting of one million consecutive frames from environments of increasing complexity.

[556] arXiv:2407.19342 (replaced) [pdf, html, other]
Title: Parameter-Efficient Fine-Tuning via Circular Convolution
Aochuan Chen, Jiashun Cheng, Zijing Liu, Ziqi Gao, Fugee Tsung, Yu Li, Jia Li
Comments: ACL 2025
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Low-Rank Adaptation (LoRA) has gained popularity for fine-tuning large foundation models, leveraging low-rank matrices $\mathbf{A}$ and $\mathbf{B}$ to represent weight changes (i.e., $\Delta \mathbf{W} = \mathbf{B} \mathbf{A}$). This method reduces trainable parameters and mitigates heavy memory consumption associated with full delta matrices by sequentially multiplying $\mathbf{A}$ and $\mathbf{B}$ with the activation. Despite its success, the intrinsic low-rank characteristic may limit its performance. Although several variants have been proposed to address this issue, they often overlook the crucial computational and memory efficiency brought by LoRA. In this paper, we propose Circular Convolution Adaptation (C$^3$A), which not only achieves high-rank adaptation with enhanced performance but also excels in both computational power and memory utilization. Extensive experiments demonstrate that C$^3$A consistently outperforms LoRA and its variants across various fine-tuning tasks.

[557] arXiv:2408.05975 (replaced) [pdf, html, other]
Title: A Metascience Study of the Low-Code Scientific Field
Mauro Dalle Lucca Tosi, Javier Luis Cánovas Izquierdo, Jordi Cabot
Journal-ref: Journal of Object Technology. Vol. 24, No. 2, 2025. Licensed under Attribution 4.0 International (CC BY 4.0)
Subjects: Software Engineering (cs.SE); Digital Libraries (cs.DL)

In the last years, model-related publications have been exploring the application of modeling techniques across various domains. Initially focused on UML and the Model-Driven Architecture approach, the literature has been evolving towards the usage of more general concepts such as Model-Driven Development or Model-Driven Engineering. More recently, however, the term "low-code" has taken the modeling field by storm, largely due to its association with several highly popular development platforms. The research community is still discussing the differences and commonalities between this emerging term and previous modeling-related concepts, as well as the broader implications of low-code on the modeling field. In this paper, we present a metascience study of Low-Code. Our study follows a two-fold approach: (1) to analyze the composition and growth (e.g., size, diversity, venues, and topics) of the emerging Low-Code community; and (2) to explore how these aspects differ from those of the "classical" model-driven community. Ultimately, we hope to trigger a discussion on the current state and potential future trajectory of the low-code community, as well as the opportunities for collaboration and synergies between the low-code and modeling communities.

[558] arXiv:2408.10774 (replaced) [pdf, html, other]
Title: Flexora: Flexible Low Rank Adaptation for Large Language Models
Chenxing Wei, Yao Shu, Ying Tiffany He, Fei Richard Yu
Comments: 40 pages, 15 figures
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Large Language Models (LLMs) are driving advancements in artificial intelligence by increasing the scale of model parameters, which has significantly enhanced generalization ability and unlocked new capabilities in practice. However, their performance in specific downstream tasks is usually hindered by their knowledge boundaries on these tasks. Thus, fine-tuning techniques, especially the widely used Low-Rank Adaptation (LoRA) method, have been introduced to expand the boundaries on these tasks, whereas LoRA would underperform on certain tasks owing to its potential overfitting on these tasks. To overcome this overfitting and improve the performance of LoRA, we propose the flexible low rank adaptation (Flexora) method to automatically and flexibly select the most important layers needing to be fine-tuned to achieve the best performance on different downstream tasks. Specifically, Flexora firstly frames this layer selection problem as a well-defined hyperparameter optimization (HPO) problem, then addresses it using the unrolled differentiation (UD) method, and finally selects the most useful layers based on the optimized hyperparameters. Our extensive experiments on many pretrained models and natural language tasks show that Flexora is able to consistently improve over the existing baselines, indicating the effectiveness of our Flexora in practice. We additionally provide insightful theoretical results and many ablation studies to deliver a comprehensive understanding of our Flexora.

[559] arXiv:2408.15742 (replaced) [pdf, html, other]
Title: On the impact of coordinated fleets size on traffic efficiency
Tommaso Toso, Francesca Parise, Paolo Frasca, Alain Y. Kibangou
Subjects: Computer Science and Game Theory (cs.GT)

We investigate a traffic assignment problem on a transportation network, considering both the demands of individual drivers and of a large fleet controlled by a central operator (minimizing the fleet's average travel time). We formulate this problem as a two-player convex game and we study how the size of the coordinated fleet, measured in terms of share of the total demand, influences the Price of Anarchy (PoA). We show that, for two-terminal networks, there are cases in which the fleet must reach a minimum share before actually affecting the PoA, which otherwise remains unchanged. Moreover, for parallel networks we prove that, under suitable assumptions, the PoA is monotonically non-increasing in the fleet share.

[560] arXiv:2409.04212 (replaced) [pdf, html, other]
Title: Improving the Parameter Dependency for High-Multiplicity Scheduling on Uniform Machines
Klaus Jansen, Kai Kahler, Lis Pirotton, Malte Tutas
Subjects: Data Structures and Algorithms (cs.DS)

Block-structured integer linear programs (ILPs) play an important role in various application fields. We address $n$-fold ILPs where the matrix $A$ has a specific structure, i.e., where the blocks in the lower part of A consist only of the row vectors $(1, ... ,1)$.
In this paper, we propose an approach tailored to exactly these combinatorial $n$-folds. We utilize a divide and conquer approach to separate the original problem such that the right-hand side iteratively decreases in size. We show that this decrease in size can be calculated such that we only need to consider a bounded amount of possible right-hand sides. This, in turn, lets us efficiently combine solutions of the smaller right-hand sides to solve the original problem. We can decide the feasibility of, and also optimally solve, such problems in time $(nr\Delta)^{O(r)} \log(\|b\|_\infty)$, where $n$ is the number of blocks, $r$ the number of rows in the upper blocks and $\Delta=\|A\|_\infty$.
We complement the algorithm by discussing applications of the $n$-fold ILPs with the specific structure we require. We consider the problems of (i) scheduling on uniform machines, (ii) closest string and (iii) (graph) imbalance.
Regarding (i), our algorithm results in running times of $p_{\max}^{O(d)}\text{poly}(I)$, matching a lower bound derived via ETH.
For (ii) we achieve running times matching the current state-of-the-art in the general case. In contrast to the state-of-the-art, our result can leverage a bounded number of column-types to yield an improved running time.
For (iii), we improve the parameter dependency on the size of the vertex cover.

[561] arXiv:2409.07995 (replaced) [pdf, html, other]
Title: Depth Matters: Exploring Deep Interactions of RGB-D for Semantic Segmentation in Traffic Scenes
Siyu Chen, Ting Han, Changshe Zhang, Weiquan Liu, Jinhe Su, Zongyue Wang, Guorong Cai
Comments: Accepted by IROS 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)

RGB-D has gradually become a crucial data source for understanding complex scenes in assisted driving. However, existing studies have paid insufficient attention to the intrinsic spatial properties of depth maps. This oversight significantly impacts the attention representation, leading to prediction errors caused by attention shift issues. To this end, we propose a novel learnable Depth interaction Pyramid Transformer (DiPFormer) to explore the effectiveness of depth. Firstly, we introduce Depth Spatial-Aware Optimization (Depth SAO) as offset to represent real-world spatial relationships. Secondly, the similarity in the feature space of RGB-D is learned by Depth Linear Cross-Attention (Depth LCA) to clarify spatial differences at the pixel level. Finally, an MLP Decoder is utilized to effectively fuse multi-scale features for meeting real-time requirements. Comprehensive experiments demonstrate that the proposed DiPFormer significantly addresses the issue of attention misalignment in both road detection (+7.5%) and semantic segmentation (+4.9% / +1.5%) tasks. DiPFormer achieves state-of-the-art performance on the KITTI (97.57% F-score on KITTI road and 68.74% mIoU on KITTI-360) and Cityscapes (83.4% mIoU) datasets.

[562] arXiv:2409.09111 (replaced) [pdf, other]
Title: Transformers from Diffusion: A Unified Framework for Neural Message Passing
Qitian Wu, David Wipf, Junchi Yan
Comments: Published in Journal of Machine Learning Research (JMLR). Extended from DIFFormer in ICLR 2023
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Learning representations for structured data with certain geometries (e.g., observed or unobserved) is a fundamental challenge, wherein message passing neural networks (MPNNs) have become a de facto class of model solutions. In this paper, inspired by physical systems, we propose an energy-constrained diffusion model, which integrates the inductive bias of diffusion on manifolds with layer-wise constraints of energy minimization. We identify that the diffusion operators have a one-to-one correspondence with the energy functions implicitly descended by the diffusion process, and the finite-difference iteration for solving the energy-constrained diffusion system induces the propagation layers of various types of MPNNs operating on observed or latent structures. This leads to a unified mathematical framework for common neural architectures whose computational flows can be cast as message passing (or its special case), including MLPs, GNNs, and Transformers. Building on these insights, we devise a new class of neural message passing models, dubbed diffusion-inspired Transformers (DIFFormer), whose global attention layers are derived from the principled energy-constrained diffusion framework. Across diverse datasets ranging from real-world networks to images, texts, and physical particles, we demonstrate that the new model achieves promising performance in scenarios where the data structures are observed (as a graph), partially observed, or entirely unobserved.

[563] arXiv:2409.12446 (replaced) [pdf, other]
Title: Neural Networks Generalize on Low Complexity Data
Sourav Chatterjee, Timothy Sudijono
Comments: 37 pages. V4: sharpened results and typos fixed
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)

We show that feedforward neural networks with ReLU activation generalize on low complexity data, suitably defined. Given i.i.d.~data generated from a simple programming language, the minimum description length (MDL) feedforward neural network which interpolates the data generalizes with high probability. We define this simple programming language, along with a notion of description length of such networks. We provide several examples on basic computational tasks, such as checking primality of a natural number. For primality testing, our theorem shows the following and more. Suppose that we draw an i.i.d.~sample of $n$ numbers uniformly at random from $1$ to $N$. For each number $x_i$, let $y_i = 1$ if $x_i$ is a prime and $0$ if it is not. Then, the interpolating MDL network accurately answers, with error probability $1- O((\ln N)/n)$, whether a newly drawn number between $1$ and $N$ is a prime or not. Note that the network is not designed to detect primes; minimum description learning discovers a network which does so. Extensions to noisy data are also discussed, suggesting that MDL neural network interpolators can demonstrate tempered overfitting.

[564] arXiv:2409.15128 (replaced) [pdf, html, other]
Title: The Number of Trials Matters in Infinite-Horizon General-Utility Markov Decision Processes
Pedro P. Santos, Alberto Sardinha, Francisco S. Melo
Subjects: Machine Learning (cs.LG)

The general-utility Markov decision processes (GUMDPs) framework generalizes the MDPs framework by considering objective functions that depend on the frequency of visitation of state-action pairs induced by a given policy. In this work, we contribute with the first analysis on the impact of the number of trials, i.e., the number of randomly sampled trajectories, in infinite-horizon GUMDPs. We show that, as opposed to standard MDPs, the number of trials plays a key-role in infinite-horizon GUMDPs and the expected performance of a given policy depends, in general, on the number of trials. We consider both discounted and average GUMDPs, where the objective function depends, respectively, on discounted and average frequencies of visitation of state-action pairs. First, we study policy evaluation under discounted GUMDPs, proving lower and upper bounds on the mismatch between the finite and infinite trials formulations for GUMDPs. Second, we address average GUMDPs, studying how different classes of GUMDPs impact the mismatch between the finite and infinite trials formulations. Third, we provide a set of empirical results to support our claims, highlighting how the number of trajectories and the structure of the underlying GUMDP influence policy evaluation.

[565] arXiv:2409.20302 (replaced) [pdf, html, other]
Title: OM4OV: Leveraging Ontology Matching for Ontology Versioning
Zhangcheng Qiang, Kerry Taylor, Weiqing Wang
Comments: 15 pages, 8 figures, 1 table
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Due to the dynamic nature of the Semantic Web, version control is necessary to capture time-varying information, particularly for widely used ontologies. Despite the long-standing recognition of ontology versioning (OV) as a crucial component for efficient ontology management, the growing size of ontologies and accumulating errors caused by manual labour overwhelm current OV approaches. In this paper, we propose a fresh approach to performing OV using existing ontology matching (OM) techniques and systems. We introduce a unified OM4OV pipeline. From an OM perspective, we reconstruct a new task formulation and measurements for OV tasks. Building upon the prior alignment(s) from OM, we propose a pipeline optimisation method called the cross-reference (CR) mechanism to enhance overall OV performance. We experimentally validate the OM4OV pipeline and the cross-reference mechanism in an OV testbed originating from the Ontology Alignment Evaluation Initiative (OAEI) datasets. We also discuss insights into OM used for OV tasks, where some apparent false mappings detected by OV systems are not actually untrue.

[566] arXiv:2410.00548 (replaced) [pdf, html, other]
Title: The complexity of separability for semilinear sets and Parikh automata
Elias Rojas Collins, Chris Köcher, Georg Zetzsche
Comments: accepted for MFCS 2025
Subjects: Formal Languages and Automata Theory (cs.FL)

In a \emph{separability problem}, we are given two sets $K$ and $L$ from a class $\mathcal{C}$, and we want to decide whether there exists a set $S$ from a class $\mathcal{S}$ such that $K\subseteq S$ and $S\cap L=\emptyset$. In this case, we speak of \emph{separability of sets in $\mathcal{C}$ by sets in $\mathcal{S}$}.
We study two types of separability problems. First, we consider separability of semilinear sets (i.e. subsets of $\mathbb{N}^d$ for some $d$) by sets definable by quantifier-free monadic Presburger formulas (or equivalently, the recognizable subsets of $\mathbb{N}^d$). Here, a formula is monadic if each atom uses at most one variable. Second, we consider separability of languages of Parikh automata by regular languages. A Parikh automaton is a machine with access to counters that can only be incremented, and have to meet a semilinear constraint at the end of the run. Both of these separability problems are known to be decidable with elementary complexity.
Our main results are that both problems are coNP-complete. In the case of semilinear sets, coNP-completeness holds regardless of whether the input sets are specified by existential Presburger formulas, quantifier-free formulas, or semilinear representations. Our results imply that recognizable separability of rational subsets of $\Sigma^*\times\mathbb{N}^d$ (shown decidable by Choffrut and Grigorieff) is coNP-complete as well. Another application is that regularity of deterministic Parikh automata (where the target set is specified using a quantifier-free Presburger formula) is coNP-complete as well.

[567] arXiv:2410.01141 (replaced) [pdf, html, other]
Title: Evaluating Deduplication Techniques for Economic Research Paper Titles with a Focus on Semantic Similarity using NLP and LLMs
Doohee You, S Fraiberger
Comments: 6 pages, 1 figure
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

This study investigates efficient deduplication techniques for a large NLP dataset of economic research paper titles. We explore various pairing methods alongside established distance measures (Levenshtein distance, cosine similarity) and a sBERT model for semantic evaluation. Our findings suggest a potentially low prevalence of duplicates based on the observed semantic similarity across different methods. Further exploration with a human-annotated ground truth set is completed for a more conclusive assessment. The result supports findings from the NLP, LLM based distance metrics.

[568] arXiv:2410.03637 (replaced) [pdf, html, other]
Title: On the Cost of Consecutive Estimation Error: Significance-Aware Non-linear Aging
Jiping Luo, Nikolaos Pappas
Comments: This paper has been accepted for publication in the IEEE Transactions on Information Theory
Subjects: Information Theory (cs.IT); Systems and Control (eess.SY)

This paper considers the semantics-aware remote state estimation of an asymmetric Markov chain with prioritized states. Due to resource constraints, the sensor needs to trade between estimation quality and communication cost. The aim is to exploit the significance of information through the history of system realizations to determine the optimal timing of transmission, thereby reducing the amount of uninformative data transmitted in the network. To this end, we introduce a new metric, the significance-aware Age of Consecutive Error (AoCE), that captures two semantic attributes: the significance of estimation error and the cost of consecutive error. Different costs and non-linear age functions are assigned to different estimation errors to account for their relative importance to system performance. We identify the optimal transmission problem as a countably infinite state Markov decision process (MDP) with unbounded costs. We first give sufficient conditions on the age functions, source pattern, and channel reliability so that an optimal policy exists to have bounded average costs. We show that the optimal policy exhibits a switching structure. That is, the sensor triggers a transmission only when the system has been trapped in an error for a certain number of consecutive time slots. We also provide sufficient conditions under which the switching policy degenerates into a simple threshold policy, i.e., featuring identical thresholds for all estimation errors. Furthermore, we exploit the structural properties and develop a structured policy iteration (SPI) algorithm that considerably reduces computation overhead. Numerical results show that the optimal policy outperforms the classic rule-, distortion- and age-based policies. An important takeaway is that the more semantic attributes we utilize, the fewer transmissions are needed.

[569] arXiv:2410.05255 (replaced) [pdf, html, other]
Title: Bridging SFT and DPO for Diffusion Model Alignment with Self-Sampling Preference Optimization
Daoan Zhang, Guangchen Lan, Dong-Jun Han, Wenlin Yao, Xiaoman Pan, Hongming Zhang, Mingxiao Li, Pengcheng Chen, Yu Dong, Christopher Brinton, Jiebo Luo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Existing post-training techniques are broadly categorized into supervised fine-tuning (SFT) and reinforcement learning (RL) methods; the former is stable during training but suffers from limited generalization, while the latter, despite its stronger generalization capability, relies on additional preference data or reward models and carries the risk of reward exploitation. In order to preserve the advantages of both SFT and RL -- namely, eliminating the need for paired data and reward models while retaining the training stability of SFT and the generalization ability of RL -- a new alignment method, Self-Sampling Preference Optimization (SSPO), is proposed in this paper. SSPO introduces a Random Checkpoint Replay (RCR) strategy that utilizes historical checkpoints to construct paired data, thereby effectively mitigating overfitting. Simultaneously, a Self-Sampling Regularization (SSR) strategy is employed to dynamically evaluate the quality of generated samples; when the generated samples are more likely to be winning samples, the approach automatically switches from DPO (Direct Preference Optimization) to SFT, ensuring that the training process accurately reflects the quality of the samples. Experimental results demonstrate that SSPO not only outperforms existing methods on text-to-image benchmarks, but its effectiveness has also been validated in text-to-video tasks. We validate SSPO across both text-to-image and text-to-video benchmarks. SSPO surpasses all previous approaches on the text-to-image benchmarks and demonstrates outstanding performance on the text-to-video benchmarks.

[570] arXiv:2410.06889 (replaced) [pdf, html, other]
Title: Subspace method of moments for ab initio 3-D single-particle cryo-EM reconstruction
Jeremy Hoskins, Yuehaw Khoo, Oscar Mickelin, Amit Singer, Yuguan Wang
Subjects: Numerical Analysis (math.NA)

Cryo-electron microscopy (cryo-EM) is a widely-used technique for recovering the 3-D structure of biological molecules from a large number of experimentally generated noisy 2-D tomographic projection images of the 3-D structure, taken from unknown viewing angles. Through computationally intensive algorithms, these observed images are processed to reconstruct the 3-D structures. Many popular computational methods rely on estimating the unknown angles as part of the reconstruction process, which becomes particularly challenging at low signal-to-noise ratios. The method of moments (MoM) offers an alternative approach that circumvents the estimation of viewing orientations of individual projection images by instead estimating the underlying distribution of the viewing angles, and is robust to noise given sufficiently many images. However, the method of moments typically entails computing high-order moments of the projection images, incurring significant computational and memory costs. To mitigate this, we propose a new approach called the subspace method of moments (subspace MoM), which compresses the first three moments using data-driven low-rank tensor techniques as well as expansion into a suitable function basis. The compressed moments can be efficiently computed from the set of projection images using numerical quadrature and can be employed to jointly reconstruct the 3-D structure and the distribution of viewing orientations. We illustrate the practical applicability of the subspace MoM through numerical experiments using up to the third-order moment on synthetic datasets with a simplified cryo-EM image formation model, which significantly improves the resolution of MoM reconstructions compared to previous approaches.

[571] arXiv:2410.10458 (replaced) [pdf, html, other]
Title: The Fujita exponent for finite difference approximations of nonlocal and local semilinear blow-up problems
Félix del Teso, Raúl Ferreira
Comments: 4 figures
Subjects: Numerical Analysis (math.NA)

We study monotone finite difference approximations for a broad class of reaction-diffusion problems, incorporating general symmetric Lévy operators. By employing an adaptive time-stepping discretization, we derive the discrete Fujita critical exponent for these problems. Additionally, under general consistency assumptions, we establish the convergence of discrete blow-up times to their continuous counterparts. As complementary results, we also present the asymptotic-in-time behavior of discrete heat-type equations as well as an extensive analysis of discrete eigenvalue problems.

[572] arXiv:2410.11281 (replaced) [pdf, html, other]
Title: DynaCLR: Contrastive Learning of Cellular Dynamics with Temporal Regularization
Eduardo Hirata-Miyasaki, Soorya Pradeep, Ziwen Liu, Alishba Imran, Taylla Milena Theodoro, Ivan E. Ivanov, Sudip Khadka, See-Chi Lee, Michelle Grunberg, Hunter Woosley, Madhura Bhave, Carolina Arias, Shalin B. Mehta
Comments: 30 pages, 6 figures, 13 appendix figures, 5 videos (ancillary files)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)

We report DynaCLR, a self-supervised method for embedding cell and organelle Dynamics via Contrastive Learning of Representations of time-lapse images. DynaCLR integrates single-cell tracking and time-aware contrastive sampling to learn robust, temporally regularized representations of cell dynamics. DynaCLR embeddings generalize effectively to in-distribution and out-of-distribution datasets, and can be used for several downstream tasks with sparse human annotations. We demonstrate efficient annotations of cell states with a human-in-the-loop using fluorescence and label-free imaging channels. DynaCLR method enables diverse downstream biological analyses: classification of cell division and infection, clustering heterogeneous cell migration patterns, cross-modal distillation of cell states from fluorescence to label-free channel, alignment of asynchronous cellular responses and broken cell tracks, and discovering organelle response due to infection. DynaCLR is a flexible method for comparative analyses of dynamic cellular responses to pharmacological, microbial, and genetic perturbations. We provide PyTorch-based implementations of the model training and inference pipeline (this https URL) and a GUI (this https URL) for the visualization and annotation of trajectories of cells in the real space and the embedding space.

[573] arXiv:2410.14038 (replaced) [pdf, html, other]
Title: Sliding Puzzles Gym: A Scalable Benchmark for State Representation in Visual Reinforcement Learning
Bryan L. M. de Oliveira, Luana G. B. Martins, Bruno Brandão, Murilo L. da Luz, Telma W. de L. Soares, Luckeciano C. Melo
Comments: Accepted at ICML 2025
Subjects: Machine Learning (cs.LG)

Effective visual representation learning is crucial for reinforcement learning (RL) agents to extract task-relevant information from raw sensory inputs and generalize across diverse environments. However, existing RL benchmarks lack the ability to systematically evaluate representation learning capabilities in isolation from other learning challenges. To address this gap, we introduce the Sliding Puzzles Gym (SPGym), a novel benchmark that transforms the classic 8-tile puzzle into a visual RL task with images drawn from arbitrarily large datasets. SPGym's key innovation lies in its ability to precisely control representation learning complexity through adjustable grid sizes and image pools, while maintaining fixed environment dynamics, observation, and action spaces. This design enables researchers to isolate and scale the visual representation challenge independently of other learning components. Through extensive experiments with model-free and model-based RL algorithms, we uncover fundamental limitations in current methods' ability to handle visual diversity. As we increase the pool of possible images, all algorithms exhibit in- and out-of-distribution performance degradation, with sophisticated representation learning techniques often underperforming simpler approaches like data augmentation. These findings highlight critical gaps in visual representation learning for RL and establish SPGym as a valuable tool for driving progress in robust, generalizable decision-making systems.

[574] arXiv:2410.14405 (replaced) [pdf, other]
Title: Fact Recall, Heuristics or Pure Guesswork? Precise Interpretations of Language Models for Fact Completion
Denitsa Saynova, Lovisa Hagström, Moa Johansson, Richard Johansson, Marco Kuhlmann
Comments: accepted to ACL Findings 2025
Subjects: Computation and Language (cs.CL)

Language models (LMs) can make a correct prediction based on many possible signals in a prompt, not all corresponding to recall of factual associations. However, current interpretations of LMs fail to take this into account. For example, given the query "Astrid Lindgren was born in" with the corresponding completion "Sweden", no difference is made between whether the prediction was based on knowing where the author was born or assuming that a person with a Swedish-sounding name was born in Sweden. In this paper, we present a model-specific recipe - PrISM - for constructing datasets with examples of four different prediction scenarios: generic language modeling, guesswork, heuristics recall and exact fact recall. We apply two popular interpretability methods to the scenarios: causal tracing (CT) and information flow analysis. We find that both yield distinct results for each scenario. Results for exact fact recall and generic language modeling scenarios confirm previous conclusions about the importance of mid-range MLP sublayers for fact recall, while results for guesswork and heuristics indicate a critical role of late last token position MLP sublayers. In summary, we contribute resources for a more extensive and granular study of fact completion in LMs, together with analyses that provide a more nuanced understanding of how LMs process fact-related queries.

[575] arXiv:2410.14649 (replaced) [pdf, html, other]
Title: EvoPress: Accurate Dynamic Model Compression via Evolutionary Search
Oliver Sieberling, Denis Kuznedelev, Eldar Kurtic, Dan Alistarh
Comments: ICML camera-ready
Subjects: Machine Learning (cs.LG)

The high computational costs of large language models (LLMs) have led to a flurry of research on LLM compression, via methods such as quantization, sparsification, or structured pruning. A new frontier in this area is given by dynamic, non-uniform compression methods, which adjust the compression levels (e.g., sparsity) per-block or even per-layer in order to minimize accuracy loss, while guaranteeing a global compression threshold. Yet, current methods rely on estimating the importance of a given layer, implicitly assuming that layers contribute independently to the overall compression error. We begin from the motivating observation that this independence assumption does not generally hold for LLM compression: pruning a model further may even significantly recover performance. To address this, we propose EvoPress, a novel evolutionary framework for dynamic LLM compression. By formulating dynamic compression as a general optimization problem, EvoPress identifies optimal compression profiles in a highly efficient manner, and generalizes across diverse models and compression techniques. Via EvoPress, we achieve state-of-the-art performance for dynamic compression of Llama, Mistral, and Phi models, setting new benchmarks for structural pruning (block/layer dropping), unstructured sparsity, and quantization with dynamic bitwidths. Our code is available at this https URL}.

[576] arXiv:2411.00442 (replaced) [pdf, html, other]
Title: Revealing Floating-Point Accumulation Orders in Software/Hardware Implementations
Peichen Xie, Yanjie Gao, Yang Wang, Jilong Xue
Comments: Camera-ready for USENIX ATC 2025
Subjects: Numerical Analysis (math.NA); Mathematical Software (cs.MS); Software Engineering (cs.SE)

Accumulation-based operations, such as summation and matrix multiplication, are fundamental to numerous computational domains. However, their accumulation orders are often undocumented in existing software and hardware implementations, making it difficult for developers to ensure consistent results across systems. To address this issue, we introduce FPRev, a diagnostic tool designed to reveal the accumulation order in the software and hardware implementations through numerical testing. With FPRev, developers can identify and compare accumulation orders, enabling developers to create reproducible software and verify implementation equivalence.
FPRev is a testing-based tool that non-intrusively reveals the accumulation order by analyzing the outputs of the tested implementation for distinct specially designed inputs. Employing FPRev, we showcase the accumulation orders of popular libraries (such as NumPy and PyTorch) on CPUs and GPUs (including GPUs with specialized matrix accelerators such as Tensor Cores). We also validate the efficiency of FPRev through extensive experiments. FPRev exhibits a lower time complexity compared to the basic solution. FPRev is open-sourced at this https URL.

[577] arXiv:2411.04037 (replaced) [pdf, html, other]
Title: Investigating the heterogenous effects of a massive content moderation intervention via Difference-in-Differences
Lorenzo Cima, Benedetta Tessa, Stefano Cresci, Amaury Trujillo, Marco Avvenuti
Comments: arXiv admin note: text overlap with arXiv:2401.11254 This work is an extension of this conference paper: Cima, L., Trujillo, A., Avvenuti, M., & Cresci, S. (2024, May). The Great Ban: Efficacy and Unintended Consequences of a Massive Deplatforming Operation on Reddit. In Companion Publication of the 16th ACM Web Science Conference (pp. 85-93)
Journal-ref: Online Social Networks and Media (OSNEM), 2025
Subjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

In today's online environments, users encounter harm and abuse on a daily basis. Therefore, content moderation is crucial to ensure their safety and well-being. However, the effectiveness of many moderation interventions is still uncertain. Here, we apply a causal inference approach to shed light on the effectiveness of The Great Ban, a massive social media deplatforming intervention on Reddit. We analyze 53M comments shared by nearly 34K users, providing in-depth results on both the intended and unintended consequences of the ban. Our causal analyses reveal that 15.6% of the moderated users abandoned the platform while the remaining ones decreased their overall toxicity by 4.1%. Nonetheless, a small subset of users exhibited marked increases in both the intensity and volume of toxic behavior, particularly among those whose activity levels changed after the intervention. However, these reactions were not accompanied by greater activity or engagement, suggesting that even the most toxic users maintained a limited overall impact. Our findings bring to light new insights on the effectiveness of deplatforming moderation interventions. Furthermore, they also contribute to informing future content moderation strategies and regulations.

[578] arXiv:2411.04403 (replaced) [pdf, html, other]
Title: Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers
Zhichao Geng, Yiwen Wang, Dongyu Ru, Yang Yang
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Learned sparse retrieval, which can efficiently perform retrieval through mature inverted-index engines, has garnered growing attention in recent years. Particularly, the inference-free sparse retrievers are attractive as they eliminate online model inference in the retrieval phase thereby avoids huge computational cost, offering reasonable throughput and latency. However, even the state-of-the-art (SOTA) inference-free sparse models lag far behind in terms of search relevance when compared to both sparse and dense siamese models. Towards competitive search relevance for inference-free sparse retrievers, we argue that they deserve dedicated training methods other than using same ones with siamese encoders. In this paper, we propose two different approaches for performance improvement. First, we propose an IDF-aware penalty for the matching function that suppresses the contribution of low-IDF tokens and increases the model's focus on informative terms. Moreover, we propose a heterogeneous ensemble knowledge distillation framework that combines siamese dense and sparse retrievers to generate supervisory signals during the pre-training phase. The ensemble framework of dense and sparse retriever capitalizes on their strengths respectively, providing a strong upper bound for knowledge distillation. To concur the diverse feedback from heterogeneous supervisors, we normalize and then aggregate the outputs of the teacher models to eliminate score scale differences. On the BEIR benchmark, our model outperforms existing SOTA inference-free sparse model by \textbf{3.3 NDCG@10 score}. It exhibits search relevance comparable to siamese sparse retrievers and client-side latency only \textbf{1.1x that of BM25}.

[579] arXiv:2411.09105 (replaced) [pdf, html, other]
Title: VideoCogQA: A Controllable Benchmark for Evaluating Cognitive Abilities in Video-Language Models
Chenglin Li, Qianglong Chen, Zhi Li, Feng Tao, Yin Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Recent advancements in Large Video-Language Models (LVLMs) have led to promising results in multimodal video understanding. However, it remains unclear whether these models possess the cognitive capabilities required for high-level tasks, particularly those involving symbolic and abstract perception. Existing benchmarks typically rely on real-world, annotated videos, which lack control over video content and inherent difficulty, limiting their diagnostic power. To bridge this gap, we propose VideoCogQA, a scalable and fully controllable benchmark inspired by game-world environments, designed to evaluate the cognitive abilities of LVLMs. By generating synthetic videos via a programmatic engine, VideoCogQA allows fine-grained control over visual elements, temporal dynamics, and task difficulty. This approach enables a focused evaluation of video cognitive abilities, independent of prior knowledge from visual scene semantics. The dataset includes 800 videos and 3,280 question-answer pairs, featuring tasks related to abstract concepts, symbolic elements, and multimodal integration, with varying levels of difficulty. Experimental results show that even state-of-the-art (SOTA) models, such as GPT-4o, achieve an average performance of 48.8% on tasks involving abstract concepts. Additionally, performance drops by 15% as task complexity increases, highlighting the challenges LVLMs face in maintaining consistent performance. Through this work, we hope to show the limitations of current LVLMs and offer insights into how they can more effectively emulate human cognitive processes in the future.

[580] arXiv:2411.11194 (replaced) [pdf, other]
Title: Careless Whisper: Exploiting Silent Delivery Receipts to Monitor Users on Mobile Instant Messengers
Gabriel K. Gegenhuber, Maximilian Günther, Markus Maier, Aljosha Judmayer, Florian Holzbauer, Philipp É. Frenzel, Johanna Ullrich
Subjects: Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)

With over 3 billion users globally, mobile instant messaging apps have become indispensable for both personal and professional communication. Besides plain messaging, many services implement additional features such as delivery and read receipts informing a user when a message has successfully reached its target. This paper highlights that delivery receipts can pose significant privacy risks to users. We use specifically crafted messages that trigger delivery receipts allowing any user to be pinged without their knowledge or consent. By using this technique at high frequency, we demonstrate how an attacker could extract private information such as the online and activity status of a victim, e.g., screen on/off. Moreover, we can infer the number of currently active user devices and their operating system, as well as launch resource exhaustion attacks, such as draining a user's battery or data allowance, all without generating any notification on the target side. Due to the widespread adoption of vulnerable messengers (WhatsApp and Signal) and the fact that any user can be targeted simply by knowing their phone number, we argue for a design change to address this issue.

[581] arXiv:2411.12787 (replaced) [pdf, html, other]
Title: From Holistic to Localized: Local Enhanced Adapters for Efficient Visual Instruction Fine-Tuning
Pengkun Jiao, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, Yu-Gang Jiang
Comments: ICCV 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Efficient Visual Instruction Fine-Tuning (EVIT) seeks to adapt Multimodal Large Language Models (MLLMs) to downstream tasks with minimal computational overhead. However, as task diversity and complexity increase, EVIT faces significant challenges in resolving data conflicts. To address this limitation, we propose the Dual Low-Rank Adaptation (Dual-LoRA), a holistic-to-local framework that enhances the adapter's capacity to address data conflict through dual structural optimization. Specifically, we utilize two subspaces: a skill space for stable, holistic knowledge retention, and a rank-rectified task space that locally activates the holistic knowledge. Additionally, we introduce Visual Cue Enhancement (VCE), a multi-level local feature aggregation module designed to enrich the vision-language projection with local details. Our approach is both memory- and time-efficient, requiring only 1.16$\times$ the inference time of the standard LoRA method (with injection into the query and value projection layers), and just 73\% of the inference time of a 4-expert LoRA-MoE. Extensive experiments on various downstream tasks and general MLLM benchmarks validate the effectiveness of our proposed methods.

[582] arXiv:2411.13536 (replaced) [pdf, html, other]
Title: Identity Preserving 3D Head Stylization with Multiview Score Distillation
Bahri Batuhan Bilecen, Ahmet Berke Gokmen, Furkan Guzelant, Aysegul Dundar
Comments: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Multimedia (cs.MM)

3D head stylization transforms realistic facial features into artistic representations, enhancing user engagement across gaming and virtual reality applications. While 3D-aware generators have made significant advancements, many 3D stylization methods primarily provide near-frontal views and struggle to preserve the unique identities of original subjects, often resulting in outputs that lack diversity and individuality. This paper addresses these challenges by leveraging the PanoHead model, synthesizing images from a comprehensive 360-degree perspective. We propose a novel framework that employs negative log-likelihood distillation (LD) to enhance identity preservation and improve stylization quality. By integrating multi-view grid score and mirror gradients within the 3D GAN architecture and introducing a score rank weighing technique, our approach achieves substantial qualitative and quantitative improvements. Our findings not only advance the state of 3D head stylization but also provide valuable insights into effective distillation processes between diffusion models and GANs, focusing on the critical issue of identity preservation. Please visit the this https URL for more visuals.

[583] arXiv:2411.13949 (replaced) [pdf, html, other]
Title: SMoLoRA: Exploring and Defying Dual Catastrophic Forgetting in Continual Visual Instruction Tuning
Ziqi Wang, Chang Che, Qi Wang, Yangyang Li, Zenglin Shi, Meng Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Visual instruction tuning (VIT) enables multimodal large language models (MLLMs) to effectively handle a wide range of vision tasks by framing them as language-based instructions. Building on this, continual visual instruction tuning (CVIT) extends the capability of MLLMs to incrementally learn new tasks, accommodating evolving functionalities. While prior work has advanced CVIT through the development of new benchmarks and approaches to mitigate catastrophic forgetting, these efforts largely follow traditional continual learning paradigms, neglecting the unique challenges specific to CVIT. We identify a dual form of catastrophic forgetting in CVIT, where MLLMs not only forget previously learned visual understanding but also experience a decline in instruction following abilities as they acquire new tasks. To address this, we introduce the Separable Mixture of Low-Rank Adaptation (SMoLoRA) framework, which employs separable routing through two distinct modules-one for visual understanding and another for instruction following. This dual-routing design enables specialized adaptation in both domains, preventing forgetting while improving performance. Furthermore, we propose a new CVIT benchmark that goes beyond existing benchmarks by additionally evaluating a model's ability to generalize to unseen tasks and handle diverse instructions across various tasks. Extensive experiments demonstrate that SMoLoRA outperforms existing methods in mitigating dual forgetting, improving generalization to unseen tasks, and ensuring robustness in following diverse instructions. Code is available at this https URL.

[584] arXiv:2411.14003 (replaced) [pdf, html, other]
Title: Generative Intervention Models for Causal Perturbation Modeling
Nora Schneider, Lars Lorch, Niki Kilbertus, Bernhard Schölkopf, Andreas Krause
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We consider the problem of predicting perturbation effects via causal models. In many applications, it is a priori unknown which mechanisms of a system are modified by an external perturbation, even though the features of the perturbation are available. For example, in genomics, some properties of a drug may be known, but not their causal effects on the regulatory pathways of cells. We propose a generative intervention model (GIM) that learns to map these perturbation features to distributions over atomic interventions in a jointly-estimated causal model. Contrary to prior approaches, this enables us to predict the distribution shifts of unseen perturbation features while gaining insights about their mechanistic effects in the underlying data-generating process. On synthetic data and scRNA-seq drug perturbation data, GIMs achieve robust out-of-distribution predictions on par with unstructured approaches, while effectively inferring the underlying perturbation mechanisms, often better than other causal inference methods.

[585] arXiv:2411.17766 (replaced) [pdf, html, other]
Title: Integrating Dual Prototypes for Task-Wise Adaption in Pre-Trained Model-Based Class-Incremental Learning
Zhiming Xu, Suorong Yang, Baile Xu, Furao Shen, Jian Zhao
Comments: 10 pages,9 figures,2 tables
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Class-incremental learning (CIL) aims to acquire new classes while conserving historical knowledge incrementally. Despite existing pre-trained model (PTM) based methods performing excellently in CIL, it is better to fine-tune them on downstream incremental tasks with massive patterns unknown to PTMs. However, using task streams for fine-tuning could lead to \textit{catastrophic forgetting} that will erase the knowledge in PTMs. This paper proposes the Dual Prototype network for Task-wise Adaption (DPTA) of PTM-based CIL. For each incremental learning task, an adapter module is built to fine-tune the PTM, where the center-adapt loss forces the representation to be more centrally clustered and class separable. The dual prototype network improves the prediction process by enabling test-time adapter selection, where the raw prototypes deduce several possible task indexes of test samples to select suitable adapter modules for PTM, and the augmented prototypes that could separate highly correlated classes are utilized to determine the final result. Experiments on several benchmark datasets demonstrate the excellent performance of DPTA. Code is available in this https URL

[586] arXiv:2411.19278 (replaced) [pdf, html, other]
Title: OMNI-DC: Highly Robust Depth Completion with Multiresolution Depth Integration
Yiming Zuo, Willow Yang, Zeyu Ma, Jia Deng
Comments: Accepted to ICCV 2025. Added additional results and ablations
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Depth completion (DC) aims to predict a dense depth map from an RGB image and a sparse depth map. Existing DC methods generalize poorly to new datasets or unseen sparse depth patterns, limiting their real-world applications. We propose OMNI-DC, a highly robust DC model that generalizes well zero-shot to various datasets. The key design is a novel Multi-resolution Depth Integrator, allowing our model to deal with very sparse depth inputs. We also introduce a novel Laplacian loss to model the ambiguity in the training process. Moreover, we train OMNI-DC on a mixture of high-quality datasets with a scale normalization technique and synthetic depth patterns. Extensive experiments on 7 datasets show consistent improvements over baselines, reducing errors by as much as 43%. Codes and checkpoints are available at this https URL.

[587] arXiv:2412.01530 (replaced) [pdf, html, other]
Title: Generative AI-based data augmentation for improved bioacoustic classification in noisy environments
Anthony Gibbons, Emma King, Ian Donohue, Andrew Parnell
Comments: 20 pages, 3 tables, 6 figures
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS); Applications (stat.AP)

1. Obtaining data to train robust artificial intelligence (AI)-based models for species classification can be challenging, particularly for rare species. Data augmentation can boost classification accuracy by increasing the diversity of training data and is cheaper to obtain than expert-labelled data. However, many classic image-based augmentation techniques are not suitable for audio spectrograms. 2. We investigate two generative AI models as data augmentation tools to synthesise spectrograms and supplement audio data: Auxiliary Classifier Generative Adversarial Networks (ACGAN) and Denoising Diffusion Probabilistic Models (DDPMs). The latter performed particularly well in terms of both realism of generated spectrograms and accuracy in a resulting classification task. 3. Alongside these new approaches, we present a new audio data set of 640 hours of bird calls from wind farm sites in Ireland, approximately 800 samples of which have been labelled by experts. Wind farm data are particularly challenging for classification models given the background wind and turbine noise. 4. Training an ensemble of classification models on real and synthetic data combined gave 92.6% accuracy (and 90.5% with just the real data) when compared with highly confident BirdNET predictions. 5. Our approach can be used to augment acoustic signals for more species and other land-use types, and has the potential to bring about a step-change in our capacity to develop reliable AI-based detection of rare species. Our code is available at this https URL.

[588] arXiv:2412.03704 (replaced) [pdf, html, other]
Title: Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension
Xiyao Wang, Zhengyuan Yang, Linjie Li, Hongjin Lu, Yuancheng Xu, Chung-Ching Lin, Kevin Lin, Furong Huang, Lijuan Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Despite significant advancements in vision-language models (VLMs), there lacks effective approaches to enhance response quality by scaling inference-time computation. This capability is known to be a core step towards the self-improving models in recent large language model studies. In this paper, we present Vision Value Model (VisVM) that can guide VLM inference-time search to generate responses with better visual comprehension. Specifically, VisVM not only evaluates the generated sentence quality in the current search step, but also anticipates the quality of subsequent sentences that may result from the current step, thus providing a long-term value. In this way, VisVM steers VLMs away from generating sentences prone to hallucinations or insufficient detail, thereby producing higher quality responses. Experimental results demonstrate that VisVM-guided search significantly enhances VLMs' ability to generate descriptive captions with richer visual details and fewer hallucinations, compared with greedy decoding and search methods with other visual reward signals. Furthermore, we find that self-training the model with the VisVM-guided captions improve VLM's performance across a wide range of multimodal benchmarks, indicating the potential for developing self-improving VLMs. Our value model and code are available at this https URL.

[589] arXiv:2412.05576 (replaced) [pdf, html, other]
Title: STONet: A neural operator for modeling solute transport in micro-cracked reservoirs
Ehsan Haghighat, Mohammad Hesan Adeli, S Mohammad Mousavi, Ruben Juanes
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Neural and Evolutionary Computing (cs.NE); Fluid Dynamics (physics.flu-dyn)

In this work, we introduce a novel neural operator, the Solute Transport Operator Network (STONet), to efficiently model contaminant transport in micro-cracked porous media. STONet's model architecture is specifically designed for this problem and uniquely integrates an enriched DeepONet structure with a transformer-based multi-head attention mechanism, enhancing performance without incurring additional computational overhead compared to existing neural operators. The model combines different networks to encode heterogeneous properties effectively and predict the rate of change of the concentration field to accurately model the transport process. The training data is obtained using finite element (FEM) simulations by random sampling of micro-fracture distributions and applied pressure boundary conditions, which capture diverse scenarios of fracture densities, orientations, apertures, lengths, and balance of pressure-driven to density-driven flow. Our numerical experiments demonstrate that, once trained, STONet achieves accurate predictions, with relative errors typically below 1% compared with FEM simulations while reducing runtime by approximately two orders of magnitude. This type of computational efficiency facilitates building digital twins for rapid assessment of subsurface contamination risks and optimization of environmental remediation strategies. The data and code for the paper will be published at this https URL.

[590] arXiv:2412.06432 (replaced) [pdf, html, other]
Title: Integrating Expert Labels into LLM-based Emission Goal Detection: Example Selection vs Automatic Prompt Design
Marco Wrzalik, Adrian Ulges, Anne Uersfeld, Florian Faust, Viola Campos
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

We address the detection of emission reduction goals in corporate reports, an important task for monitoring companies' progress in addressing climate change. Specifically, we focus on the issue of integrating expert feedback in the form of labeled example passages into LLM-based pipelines, and compare the two strategies of (1) a dynamic selection of few-shot examples and (2) the automatic optimization of the prompt by the LLM itself. Our findings on a public dataset of 769 climate-related passages from real-world business reports indicate that automatic prompt optimization is the superior approach, while combining both methods provides only limited benefit. Qualitative results indicate that optimized prompts do indeed capture many intricacies of the targeted emission goal extraction task.

[591] arXiv:2412.08328 (replaced) [pdf, html, other]
Title: Thévenin Equivalent Parameters Identification Based on Statistical Characteristics of System Ambient Data
Boying Zhou, Chen Shen, Kexuan Tang
Subjects: Systems and Control (eess.SY)

This paper proposes a novel method for identifying Thévenin equivalent parameters (TEP) in power system, based on the statistical characteristics of the system's stochastic response. The method leverages stochastic fluctuation data under steady-state grid conditions and applies sliding window techniques to compute sensitivity parameters between voltage magnitude, current magnitude and power. This enables high-accuracy and robust TEP identification. In contrast to traditional methods, the proposed approach does not rely on large disturbances or probing signals but instead utilizes the natural fluctuation behavior of the system. Additionally, the method supports distributed implementation using local measurements of voltage magnitude, current magnitude, and power, offering significant practical value for engineering applications. The theoretical analysis demonstrates the method's robustness in the presence of low signal-to-noise ratio (SNR), asynchronous measurements, and data collinearity issues. Simulation results further confirm the effectiveness of the proposed method in diverse practical scenarios, demonstrating its ability to consistently provide accurate and reliable identification of TEP using system ambient data.

[592] arXiv:2412.09238 (replaced) [pdf, html, other]
Title: Disturbance-Adaptive Data-Driven Predictive Control: Trading Comfort Violations for Savings in Building Climate Control
Jicheng Shi, Christophe Salzmann, Colin N. Jones
Subjects: Systems and Control (eess.SY)

Model Predictive Control (MPC) has demonstrated significant potential in improving energy efficiency in building climate control, outperforming traditional controllers commonly used in modern building management systems. Among MPC variants, Data-driven Predictive Control (DPC) offers the advantage of modeling building dynamics directly from data, thereby substantially reducing commissioning efforts. However, inevitable model uncertainties and measurement noise can result in comfort violations, even with dedicated MPC setups. This paper introduces a Disturbance-Adaptive DPC (DAD-DPC) framework that ensures asymptotic satisfaction of predefined violation bounds without knowing the uncertainty and noise distributions. The framework employs a data-driven pipeline based on Willems' Fundamental Lemma and conformal prediction for application in building climate control. The proposed DAD-DPC framework was validated through four building cases using the high-fidelity BOPTEST simulation platform and an occupied campus building, Polydome. DAD-DPC successfully regulated the average comfort violations to meet pre-defined bounds. Notably, the 5%-violation DAD-DPC setup achieved 30.1%/11.2%/27.1%/20.5% energy savings compared to default controllers across four cases. These results demonstrate the framework's effectiveness in balancing energy consumption and comfort violations, offering a practical solution for building climate control applications.

[593] arXiv:2412.14373 (replaced) [pdf, html, other]
Title: ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling
William Han, Chaojing Duan, Michael A. Rosenberg, Emerson Liu, Ding Zhao
Comments: 38 pages, 9 figures
Subjects: Computation and Language (cs.CL); Signal Processing (eess.SP)

Large Language Models (LLMs) have demonstrated exceptional versatility across domains, including applications to electrocardiograms (ECGs). A growing body of work focuses on generating text from multi-channeled ECG signals and corresponding textual prompts. Existing approaches often involve a two-stage process: pretraining an ECG-specific encoder with a self-supervised learning (SSL) objective, followed by finetuning an LLM for natural language generation (NLG) using encoder-derived features. However, these methods face two key limitations: inefficiency due to multi-stage training and challenges in interpreting encoder-generated features. To overcome these issues, we propose ECG-Byte, an adapted byte pair encoding (BPE) tokenizer pipeline for autoregressive language modeling of ECGs. ECG-Byte compresses and encodes ECG signals into tokens, enabling direct end-to-end LLM training by combining ECG and text tokens. This approach enhances interpretability, as ECG tokens can be directly mapped back to the original signals. Leveraging ECG-Byte, we achieve competitive NLG performance while training 3 times faster and using just 48\% of the data required by traditional two-stage methods.

[594] arXiv:2412.18241 (replaced) [pdf, html, other]
Title: An Automatic Graph Construction Framework based on Large Language Models for Recommendation
Rong Shan, Jianghao Lin, Chenxu Zhu, Bo Chen, Menghui Zhu, Kangning Zhang, Jieming Zhu, Ruiming Tang, Yong Yu, Weinan Zhang
Comments: Accepted by KDD'25
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Graph neural networks (GNNs) have emerged as state-of-the-art methods to learn from graph-structured data for recommendation. However, most existing GNN-based recommendation methods focus on the optimization of model structures and learning strategies based on pre-defined graphs, neglecting the importance of the graph construction stage. Earlier works for graph construction usually rely on speciffic rules or crowdsourcing, which are either too simplistic or too labor-intensive. Recent works start to utilize large language models (LLMs) to automate the graph construction, in view of their abundant open-world knowledge and remarkable reasoning capabilities. Nevertheless, they generally suffer from two limitations: (1) invisibility of global view (e.g., overlooking contextual information) and (2) construction inefficiency. To this end, we introduce AutoGraph, an automatic graph construction framework based on LLMs for recommendation. Specifically, we first use LLMs to infer the user preference and item knowledge, which is encoded as semantic vectors. Next, we employ vector quantization to extract the latent factors from the semantic vectors. The latent factors are then incorporated as extra nodes to link the user/item nodes, resulting in a graph with in-depth global-view semantics. We further design metapath-based message aggregation to effectively aggregate the semantic and collaborative information. The framework is model-agnostic and compatible with different backbone models. Extensive experiments on three real-world datasets demonstrate the efficacy and efffciency of AutoGraph compared to existing baseline methods. We have deployed AutoGraph in Huawei advertising platform, and gain a 2.69% improvement on RPM and a 7.31% improvement on eCPM in the online A/B test. Currently AutoGraph has been used as the main trafffc model, serving hundreds of millions of people.

[595] arXiv:2412.19351 (replaced) [pdf, other]
Title: ETTA: Elucidating the Design Space of Text-to-Audio Models
Sang-gil Lee, Zhifeng Kong, Arushi Goel, Sungwon Kim, Rafael Valle, Bryan Catanzaro
Comments: ICML 2025. Demo: this https URL Code: this https URL
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Recent years have seen significant progress in Text-To-Audio (TTA) synthesis, enabling users to enrich their creative workflows with synthetic audio generated from natural language prompts. Despite this progress, the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks are not well understood. With the purpose of providing a holistic understanding of the design space of TTA models, we set up a large-scale empirical experiment focused on diffusion and flow matching models. Our contributions include: 1) AF-Synthetic, a large dataset of high quality synthetic captions obtained from an audio understanding model; 2) a systematic comparison of different architectural, training, and inference design choices for TTA models; 3) an analysis of sampling methods and their Pareto curves with respect to generation quality and inference speed. We leverage the knowledge obtained from this extensive analysis to propose our best model dubbed Elucidated Text-To-Audio (ETTA). When evaluated on AudioCaps and MusicCaps, ETTA provides improvements over the baselines trained on publicly available data, while being competitive with models trained on proprietary data. Finally, we show ETTA's improved ability to generate creative audio following complex and imaginative captions -- a task that is more challenging than current benchmarks.

[596] arXiv:2412.20771 (replaced) [pdf, html, other]
Title: Optimally Decoding Two-Dimensional Reed-Solomon Codes Against Deletion Errors
Shubhransh Singhvi
Subjects: Information Theory (cs.IT)

Constructing Reed-Solomon (RS) codes that can correct insertion and deletion (ins-del) errors has been the focus of several recent studies. However, efficient decoding algorithms for such codes have received less attention and remain a significant open problem. In this work, we take a first step toward addressing this problem by designing a decoding algorithm for the case of $2$-dimensional RS codes that can correct deletions up to the half-Singleton bound and is optimal in terms of field operations.

[597] arXiv:2501.01144 (replaced) [pdf, other]
Title: BlockDialect: Block-wise Fine-grained Mixed Format Quantization for Energy-Efficient LLM Inference
Wonsuk Jang, Thierry Tambe
Comments: ICML 2025
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

The rapidly increasing size of large language models (LLMs) presents significant challenges in memory usage and computational costs. Quantizing both weights and activations can address these issues, with hardware-supported fine-grained scaling emerging as a promising solution to mitigate outliers. However, existing methods struggle to capture nuanced block data distributions. We propose BlockDialect, a block-wise fine-grained mixed format technique that assigns a per-block optimal number format from a formatbook for better data representation. Additionally, we introduce DialectFP4, a formatbook of FP4 variants (akin to dialects) that adapt to diverse data distributions. To leverage this efficiently, we propose a two-stage approach for online DialectFP4 activation quantization. Importantly, DialectFP4 ensures energy efficiency by selecting representable values as scaled integers compatible with low-precision integer arithmetic. BlockDialect achieves 10.78% (7.48%) accuracy gain on the LLaMA3-8B (LLaMA2-7B) model compared to MXFP4 format with lower bit usage per data, while being only 5.45% (2.69%) below full precision even when quantizing full-path matrix multiplication. Focusing on how to represent over how to scale, our work presents a promising path for energy-efficient LLM inference.

[598] arXiv:2501.01855 (replaced) [pdf, html, other]
Title: UAV-DETR: Efficient End-to-End Object Detection for Unmanned Aerial Vehicle Imagery
Huaxiang Zhang, Kai Liu, Zhongxue Gan, Guo-Niu Zhu
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Unmanned aerial vehicle object detection (UAV-OD) has been widely used in various scenarios. However, most existing UAV-OD algorithms rely on manually designed components, which require extensive tuning. End-to-end models that do not depend on such manually designed components are mainly designed for natural images, which are less effective for UAV imagery. To address such challenges, this paper proposes an efficient detection transformer (DETR) framework tailored for UAV imagery, i.e., UAV-DETR. The framework includes a multi-scale feature fusion with frequency enhancement module, which captures both spatial and frequency information at different scales. In addition, a frequency-focused down-sampling module is presented to retain critical spatial details during down-sampling. A semantic alignment and calibration module is developed to align and fuse features from different fusion paths. Experimental results demonstrate the effectiveness and generalization of our approach across various UAV imagery datasets. On the VisDrone dataset, our method improves AP by 3.1\% and $\text{AP}_{50}$ by 4.2\% over the baseline. Similar enhancements are observed on the UAVVaste dataset. The project page: this https URL

[599] arXiv:2501.04032 (replaced) [pdf, html, other]
Title: Efficient Computation of Collatz Sequence Stopping Times: A Novel Algorithmic Approach
Eyob Solomon Getachew, Beakal Gizachew Assefa
Journal-ref: Published in: IEEE Access ( Volume: 13), Page(s): 41210 - 41220, Date of Publication: 05 March 2025
Subjects: Mathematical Software (cs.MS)

The Collatz conjecture, which posits that any positive integer will eventually reach 1 through a specific iterative process, is a classic unsolved problem in mathematics. This research focuses on designing an efficient algorithm to compute the stopping time of numbers in the Collatz sequence, achieving significant computational improvements. By leveraging structural patterns in the Collatz tree, the proposed algorithm minimizes redundant operations and optimizes computational steps. Unlike prior methods, it efficiently handles extremely large numbers without requiring advanced techniques such as memoization or parallelization. Experimental evaluations confirm computational efficiency improvements of approximately 28% over state-of-the-art methods. These findings underscore the algorithm's scalability and robustness, providing a foundation for future large-scale verification of the conjecture and potential applications in computational mathematics.

[600] arXiv:2501.08044 (replaced) [pdf, html, other]
Title: UFGraphFR: Graph Federation Recommendation System based on User Text description features
Xudong Wang, Qingbo Hao, Xu Cheng, Yingyuan Xiao
Subjects: Machine Learning (cs.LG)

Federated learning has emerged as a key paradigm in privacy-preserving computing due to its "data usable but not visible" property, enabling users to collaboratively train models without sharing raw data. Motivated by this, federated recommendation systems offer a promising architecture that balances user privacy with recommendation accuracy through distributed collaborative learning. However, existing federated recommendation methods often neglect the underlying semantic or behavioral relationships between users during parameter aggregation, which limits their recommendation effectiveness. To overcome this limitation, graph-based federated recommendation systems have been proposed to leverage neighborhood information. Yet, conventional graph construction methods usually require access to raw user data or explicit social links, which contradicts the strict privacy requirements of federated learning. In this work, we propose UFGraphFR (User Text-feature-based Graph Federated Recommendation), a novel personalized federated recommendation framework that constructs a user graph based on clients' locally embedded text features. Our core assumption is that users with similar textual feature descriptions exhibit similar preferences. Accordingly, UFGraphFR introduces two key components: (1) a privacy-preserving user relationship graph constructed from the joint embedding layer's weight matrix without leaking raw user attributes; (2) a Transformer-based architecture to model temporal dependencies in user-item interaction sequences. Experimental results on benchmark datasets such as MovieLens and HetRec2011 demonstrate that UFGraphFR achieves recommendation accuracy comparable to both centralized and state-of-the-art federated baselines while preserving user privacy. The code is available at: this https URL.

[601] arXiv:2501.09310 (replaced) [pdf, html, other]
Title: A Study of In-Context-Learning-Based Text-to-SQL Errors
Jiawei Shen, Chengcheng Wan, Ruoyi Qiao, Jiazhen Zou, Hang Xu, Yuchen Shao, Yueling Zhang, Weikai Miao, Geguang Pu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Large language models (LLMs) have been adopted to perform text-to-SQL tasks, utilizing their in-context learning (ICL) capability to translate natural language questions into structured query language (SQL). However, such a technique faces correctness problems and requires efficient repairing solutions. In this paper, we conduct the first comprehensive study of text-to-SQL errors. Our study covers four representative ICL-based techniques, five basic repairing methods, two benchmarks, and two LLM settings. We find that text-to-SQL errors are widespread and summarize 29 error types of 7 categories. We also find that existing repairing attempts have limited correctness improvement at the cost of high computational overhead with many mis-repairs. Based on the findings, we propose MapleRepair, a novel text-to-SQL error detection and repairing framework. The evaluation demonstrates that MapleRepair outperforms existing solutions by repairing 13.8% more queries with neglectable mis-repairs and 67.4% less overhead.

[602] arXiv:2501.10736 (replaced) [pdf, html, other]
Title: Semi-supervised Semantic Segmentation for Remote Sensing Images via Multi-scale Uncertainty Consistency and Cross-Teacher-Student Attention
Shanwen Wang, Xin Sun, Changrui Chen, Danfeng Hong, Jungong Han
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Semi-supervised learning offers an appealing solution for remote sensing (RS) image segmentation to relieve the burden of labor-intensive pixel-level labeling. However, RS images pose unique challenges, including rich multi-scale features and high inter-class similarity. To address these problems, this paper proposes a novel semi-supervised Multi-Scale Uncertainty and Cross-Teacher-Student Attention (MUCA) model for RS image semantic segmentation tasks. Specifically, MUCA constrains the consistency among feature maps at different layers of the network by introducing a multi-scale uncertainty consistency regularization. It improves the multi-scale learning capability of semi-supervised algorithms on unlabeled data. Additionally, MUCA utilizes a Cross-Teacher-Student attention mechanism to guide the student network, guiding the student network to construct more discriminative feature representations through complementary features from the teacher network. This design effectively integrates weak and strong augmentations (WA and SA) to further boost segmentation performance. To verify the effectiveness of our model, we conduct extensive experiments on ISPRS-Potsdam and LoveDA datasets. The experimental results show the superiority of our method over state-of-the-art semi-supervised methods. Notably, our model excels in distinguishing highly similar objects, showcasing its potential for advancing semi-supervised RS image segmentation tasks.

[603] arXiv:2501.11922 (replaced) [pdf, html, other]
Title: On the convergence of two-step modified Newton method for nonsymmetric algebraic Riccati equations from transport theory
Juan Liang, Yonghui Ling
Comments: 29pages
Subjects: Numerical Analysis (math.NA)

This paper is concerned with the convergence of a two-step modified Newton method for solving the nonlinear system arising from the minimal nonnegative solution of nonsymmetric algebraic Riccati equations from neutron transport theory. We show the monotonic convergence of the two-step modified Newton method under mild assumptions. When the Jacobian of the nonlinear operator at the minimal positive solution is singular, we present a convergence analysis of the two-step modified Newton method in this context. Numerical experiments are conducted to demonstrate that the proposed method yields comparable results to several existing Newton-type methods and that it brings a significant reduction in computation time for nearly singular and large-scale problems.

[604] arXiv:2501.13094 (replaced) [pdf, html, other]
Title: Robust Representation Consistency Model via Contrastive Denoising
Jiachen Lei, Julius Berner, Jiongxiao Wang, Zhongzhu Chen, Zhongjia Ba, Kui Ren, Jun Zhu, Anima Anandkumar
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Robustness is essential for deep neural networks, especially in security-sensitive applications. To this end, randomized smoothing provides theoretical guarantees for certifying robustness against adversarial perturbations. Recently, diffusion models have been successfully employed for randomized smoothing to purify noise-perturbed samples before making predictions with a standard classifier. While these methods excel at small perturbation radii, they struggle with larger perturbations and incur a significant computational overhead during inference compared to classical methods. To address this, we reformulate the generative modeling task along the diffusion trajectories in pixel space as a discriminative task in the latent space. Specifically, we use instance discrimination to achieve consistent representations along the trajectories by aligning temporally adjacent points. After fine-tuning based on the learned representations, our model enables implicit denoising-then-classification via a single prediction, substantially reducing inference costs. We conduct extensive experiments on various datasets and achieve state-of-the-art performance with minimal computation budget during inference. For example, our method outperforms the certified accuracy of diffusion-based methods on ImageNet across all perturbation radii by 5.3% on average, with up to 11.6% at larger radii, while reducing inference costs by 85$\times$ on average. Codes are available at: this https URL.

[605] arXiv:2501.15582 (replaced) [pdf, html, other]
Title: A Benchmarking Platform for DDR4 Memory Performance in Data-Center-Class FPGAs
Andrea Galimberti, Gabriele Montanaro, Andrea Motta, Federico Proverbio, Davide Zoni
Comments: 5 pages, 5 figures
Journal-ref: 2025 IEEE International Symposium on Circuits and Systems (ISCAS), London, United Kingdom, 2025, pp. 1-5
Subjects: Hardware Architecture (cs.AR)

FPGAs are increasingly utilized in data centers due to their capacity to exploit data parallelism in computationally intensive workloads. Furthermore, the processing of modern data center workloads requires moving vast amounts of data, making it essential to optimize data exchange between FPGAs and memory. This paper introduces a novel benchmarking platform for the evaluation of DDR4 memory performance in data-center-class FPGAs. The proposed solution features highly configurable traffic generation with complex memory access patterns defined at run time and can be flexibly instantiated on the target FPGA to support multiple memory channels and varying data rates. An extensive experimental campaign, targeting the AMD Kintex UltraScale 115 FPGA and encompassing up to three memory channels with data rates ranging from 1600 to 2400 MT/s and various memory traffic configurations, demonstrates the benchmarking platform's capability to effectively evaluate DDR4 memory performance.

[606] arXiv:2501.16197 (replaced) [pdf, other]
Title: HERITRACE: A User-Friendly Semantic Data Editor with Change Tracking and Provenance Management for Cultural Heritage Institutions
Arcangelo Massari (1), Silvio Peroni (1) ((1) Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy)
Comments: 25 pages, 5 figures, 2 tables, 1 listing, submitted to Umanistica Digitale
Subjects: Digital Libraries (cs.DL)

HERITRACE is a data editor designed for galleries, libraries, archives and museums, aimed at simplifying data curation while enabling non-technical domain experts to manage data intuitively without losing its semantic integrity. While the semantic nature of RDF can pose a barrier to data curation due to its complexity, HERITRACE conceals this intricacy while preserving the advantages of semantic representation. The system natively supports provenance management and change tracking, ensuring transparency and accountability throughout the curation process. Although HERITRACE functions effectively out of the box, it offers a straightforward customization interface for technical staff, enabling adaptation to the specific data model required by a given collection. Current applications include the ParaText project, and its adoption is already planned for OpenCitations. Future developments will focus on integrating the RDF Mapping Language (RML) to enhance compatibility with non-RDF data formats, further expanding its applicability in digital heritage management.

[607] arXiv:2501.16362 (replaced) [pdf, other]
Title: A novel Trunk Branch-net PINN for flow and heat transfer prediction in porous medium
Haoyun Xing, Kaiyan Jin, Guice Yao, Jin Zhao, Dichu Xu, Dongsheng Wen
Comments: 33 pages, 24 figures,
Subjects: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)

A novel Trunk-Branch (TB)-net physics-informed neural network (PINN) architecture is developed, which is a PINN-based method incorporating trunk and branch nets to capture both global and local features. The aim is to solve four main classes of problems: forward flow problem, forward heat transfer problem, inverse heat transfer problem, and transfer learning problem within the porous medium, which are notoriously complex that could not be handled by origin PINN. In the proposed TB-net PINN architecture, a Fully-connected Neural Network (FNN) is used as the trunk net, followed by separated FNNs as the branch nets with respect to outputs, and automatic differentiation is performed for partial derivatives of outputs with respect to inputs by considering various physical loss. The effectiveness and flexibility of the novel TB-net PINN architecture is demonstrated through a collection of forward problems, and transfer learning validates the feasibility of resource reuse. Combining with the superiority over traditional numerical methods in solving inverse problems, the proposed TB-net PINN shows its great potential for practical engineering applications.

[608] arXiv:2501.16529 (replaced) [pdf, html, other]
Title: An artificial viscosity approach to high order entropy stable discontinuous Galerkin methods
Jesse Chan
Subjects: Numerical Analysis (math.NA)

Entropy stable discontinuous Galerkin (DG) methods improve the robustness of high order DG simulations of nonlinear conservation laws. These methods yield a semi-discrete entropy inequality, and rely on an algebraic flux differencing formulation which involves both summation-by-parts (SBP) discretization matrices and entropy conservative two-point finite volume fluxes. However, explicit expressions for such two-point finite volume fluxes may not be available for all systems, or may be computationally expensive to compute.
This paper proposes an alternative approach to constructing entropy stable DG methods using an entropy correction artificial viscosity, where the artificial viscosity coefficient is determined based on the local violation of a cell entropy inequality and the local entropy dissipation. The resulting method \bnote{is a modification of the entropy correction introduced by Abgrall, Offner, and Ranocha (2022) in "Reinterpretation and Extension of Entropy Correction Terms for Residual Distribution and Discontinuous Galerkin Schemes: Application to Structure Preserving Discretization", and recovers the same global semi-discrete entropy inequality that is satisfied by entropy stable flux differencing DG methods. The entropy correction artificial viscosity coefficients are parameter-free and locally computable over each cell, and the resulting artificial viscosity preserves both high order accuracy and a hyperbolic maximum stable time-step size under explicit time-stepping.

[609] arXiv:2501.17126 (replaced) [pdf, other]
Title: ECLYPSE: a Python Framework for Simulation and Emulation of the Cloud-Edge Continuum
Jacopo Massa, Valerio De Caro, Stefano Forti, Patrizio Dazzi, Davide Bacciu, Antonio Brogi
Subjects: Networking and Internet Architecture (cs.NI)

The Cloud-Edge continuum enhances application performance by bringing computation closer to data sources. However, it presents considerable challenges in managing resources and determining service placement, as these tasks require navigating diverse, dynamic environments characterised by fluctuating network conditions. Addressing these challenges calls for tools combining simulation and emulation of Cloud-Edge systems to rigorously assess novel application and resource management strategies. In this paper, we introduce ECLYPSE, a Python-based framework that enables the simulation and emulation of the Cloud-Edge continuum via adaptable resource allocation and service placement models. ECLYPSE features an event-driven architecture for dynamically adapting network configurations and resources. It also supports seamless transitions between simulated and emulated setups. In this work, ECLYPSE capabilities are illustrated over three use cases, showing how the framework supports rapid prototyping across diverse experimental settings.

[610] arXiv:2501.17706 (replaced) [pdf, html, other]
Title: Source-Channel Separation Theorems for Distortion Perception Coding
Chao Tian, Jun Chen, Krishna Narayanan
Comments: 1 figure, 6 pages
Subjects: Information Theory (cs.IT)

It is well known that separation between lossy source coding and channel coding is asymptotically optimal under classical additive distortion measures. Recently, coding under a new class of quality considerations, often referred to as perception or realism, has attracted significant attention due to its close connection to neural generative models and semantic communications. In this work, we revisit source-channel separation under the consideration of distortion-perception. We show that when the perception quality is measured on the block level, i.e., in the strong-sense, the optimality of separation still holds when common randomness is shared between the encoder and the decoder; however, separation is no longer optimal when such common randomness is not available. In contrast, when the perception quality is the average per-symbol measure, i.e., in the weak-sense, the optimality of separation holds regardless of the availability of common randomness.

[611] arXiv:2502.00300 (replaced) [pdf, other]
Title: Uncertainty Quantification of Wind Gust Predictions in the Northeast United States: An Evidential Neural Network and Explainable Artificial Intelligence Approach
Israt Jahan, John S. Schreck, David John Gagne, Charlie Becker, Marina Astitha
Journal-ref: Environmental Modelling & Software, Volume 193, 2025, 106595, ISSN 1364-8152
Subjects: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (stat.ML)

Machine learning algorithms have shown promise in reducing bias in wind gust predictions, while still underpredicting high gusts. Uncertainty quantification (UQ) supports this issue by identifying when predictions are reliable or need cautious interpretation. Using data from 61 extratropical storms in the Northeastern USA, we introduce evidential neural network (ENN) as a novel approach for UQ in gust predictions, leveraging atmospheric variables from the Weather Research and Forecasting (WRF) model. Explainable AI techniques suggested that key predictive features contributed to higher uncertainty, which correlated strongly with storm intensity and spatial gust gradients. Compared to WRF, ENN demonstrated a 47% reduction in RMSE and allowed the construction of gust prediction intervals without an ensemble, successfully capturing at least 95% of observed gusts at 179 out of 266 stations. From an operational perspective, providing gust forecasts with quantified uncertainty enhances stakeholders' confidence in risk assessment and response planning for extreme gust events.

[612] arXiv:2502.02091 (replaced) [pdf, html, other]
Title: Instruct-4DGS: Efficient Dynamic Scene Editing via 4D Gaussian-based Static-Dynamic Separation
Joohyun Kwon, Hanbyel Cho, Junmo Kim
Comments: Accepted to CVPR 2025. The first two authors contributed equally
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recent 4D dynamic scene editing methods require editing thousands of 2D images used for dynamic scene synthesis and updating the entire scene with additional training loops, resulting in several hours of processing to edit a single dynamic scene. Therefore, these methods are not scalable with respect to the temporal dimension of the dynamic scene (i.e., the number of timesteps). In this work, we propose Instruct-4DGS, an efficient dynamic scene editing method that is more scalable in terms of temporal dimension. To achieve computational efficiency, we leverage a 4D Gaussian representation that models a 4D dynamic scene by combining static 3D Gaussians with a Hexplane-based deformation field, which captures dynamic information. We then perform editing solely on the static 3D Gaussians, which is the minimal but sufficient component required for visual editing. To resolve the misalignment between the edited 3D Gaussians and the deformation field, which may arise from the editing process, we introduce a refinement stage using a score distillation mechanism. Extensive editing results demonstrate that Instruct-4DGS is efficient, reducing editing time by more than half compared to existing methods while achieving high-quality edits that better follow user instructions. Code and results: this https URL

[613] arXiv:2502.02869 (replaced) [pdf, html, other]
Title: Towards Large-Scale In-Context Reinforcement Learning by Meta-Training in Randomized Worlds
Fan Wang, Pengtao Shao, Yiming Zhang, Bo Yu, Shaoshan Liu, Ning Ding, Yang Cao, Yu Kang, Haifeng Wang
Comments: Preprint
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

In-Context Reinforcement Learning (ICRL) enables agents to learn automatically and on-the-fly from their interactive experiences. However, a major challenge in scaling up ICRL is the lack of scalable task collections. To address this, we propose the procedurally generated tabular Markov Decision Processes, named AnyMDP. Through a carefully designed randomization process, AnyMDP is capable of generating high-quality tasks on a large scale while maintaining relatively low structural biases. To facilitate efficient meta-training at scale, we further introduce step-wise supervision and induce prior information in the ICRL this http URL results demonstrate that, with a sufficiently large scale of AnyMDP tasks, the proposed model can generalize to tasks that were not considered in the training set. The scalable task set provided by AnyMDP also enables a more thorough empirical investigation of the relationship between data distribution and ICRL performance. We further show that the generalization of ICRL potentially comes at the cost of increased task diversity and longer adaptation periods. This finding carries critical implications for scaling robust ICRL capabilities, highlighting the necessity of diverse and extensive task design, and prioritizing asymptotic performance over few-shot adaptation.

[614] arXiv:2502.03628 (replaced) [pdf, html, other]
Title: The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering
Zhuowei Li, Haizhou Shi, Yunhe Gao, Di Liu, Zhenting Wang, Yuxiao Chen, Ting Liu, Long Zhao, Hao Wang, Dimitris N. Metaxas
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Large Vision-Language Models (LVLMs) can reason effectively over both textual and visual inputs, but they tend to hallucinate syntactically coherent yet visually ungrounded contents. In this paper, we investigate the internal dynamics of hallucination by examining the tokens logits ranking throughout the generation process, revealing three key patterns in how LVLMs process information: (1) gradual visual information loss - visually grounded tokens gradually become less favored throughout generation, and (2) early excitation - semantically meaningful tokens achieve peak activation in the layers earlier than the final layer. (3) hidden genuine information - visually grounded tokens though not being eventually decoded still retain relatively high rankings at inference. Based on these insights, we propose VISTA (Visual Information Steering with Token-logit Augmentation), a training-free inference-time intervention framework that reduces hallucination while promoting genuine information. VISTA works by combining two complementary approaches: reinforcing visual information in activation space and leveraging early layer activations to promote semantically meaningful decoding. Compared to existing methods, VISTA requires no external supervision and is applicable to various decoding strategies. Extensive experiments show that VISTA on average reduces hallucination by about 40% on evaluated open-ended generation task, and it consistently outperforms existing methods on four benchmarks across four architectures under three decoding strategies. Code is available at this https URL.

[615] arXiv:2502.04388 (replaced) [pdf, html, other]
Title: Position: Emergent Machina Sapiens Urge Rethinking Multi-Agent Paradigms
Hepeng Li, Yuhong Liu, Jun Yan, Jie Gao, Xiaoou Yang
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)

Artificial Intelligence (AI) agents capable of autonomous learning and independent decision-making hold great promise for addressing complex challenges across various critical infrastructure domains, including transportation, energy systems, and manufacturing. However, the surge in the design and deployment of AI systems, driven by various stakeholders with distinct and unaligned objectives, introduces a crucial challenge: How can uncoordinated AI systems coexist and evolve harmoniously in shared environments without creating chaos or compromising safety? To address this, we advocate for a fundamental rethinking of existing multi-agent frameworks, such as multi-agent systems and game theory, which are largely limited to predefined rules and static objective structures. We posit that AI agents should be empowered to adjust their objectives dynamically, make compromises, form coalitions, and safely compete or cooperate through evolving relationships and social feedback. Through two case studies in critical infrastructure applications, we call for a shift toward the emergent, self-organizing, and context-aware nature of these multi-agentic AI systems.

[616] arXiv:2502.04616 (replaced) [pdf, html, other]
Title: Energy dissipation law and maximum bound principle-preserving linear BDF2 schemes with variable steps for the Allen-Cahn equation
Bingyin Zhang, Hongfei Fu, Rihui Lan, Shusen Xie
Comments: 32 pages,31 figures
Subjects: Numerical Analysis (math.NA)

In this paper, we propose and analyze a linear, structure-preserving scalar auxiliary variable (SAV) method for solving the Allen--Cahn equation based on the second-order backward differentiation formula (BDF2) with variable time steps. To this end, we first design a novel and essential auxiliary functional that serves twofold functions: (i) ensuring that a first-order approximation to the auxiliary variable, which is essentially important for deriving the unconditional energy dissipation law, does not affect the second-order temporal accuracy of the phase function $\phi$; and (ii) allowing us to develop effective stabilization terms that are helpful to establish the MBP-preserving linear methods. Together with this novel functional and standard central difference stencil, we then propose a linear, second-order variable-step BDF2 type stabilized exponential SAV scheme, namely BDF2-sESAV-I, which is shown to preserve both the discrete modified energy dissipation law under the temporal stepsize ratio $ 0 < r_{k} := \tau_{k}/\tau_{k-1} < 4.864 - \delta $ with a positive constant $\delta$ and the MBP under $ 0 < r_{k} < 1 + \sqrt{2} $. Moreover, an analysis of the approximation to the original energy by the modified one is presented. With the help of the kernel recombination technique, optimal $ H^{1}$- and $ L^{\infty}$-norm error estimates of the variable-step BDF2-sESAV-I scheme are rigorously established. Numerical examples are carried out to verify the theoretical results and demonstrate the effectiveness and efficiency of the proposed scheme.

[617] arXiv:2502.04640 (replaced) [pdf, html, other]
Title: Building Rome with Convex Optimization
Haoyu Han, Heng Yang
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)

Global bundle adjustment is made easy by depth prediction and convex optimization. We (i) propose a scaled bundle adjustment (SBA) formulation that lifts 2D keypoint measurements to 3D with learned depth, (ii) design an empirically tight convex semidfinite program (SDP) relaxation that solves SBA to certfiable global optimality, (iii) solve the SDP relaxations at extreme scale with Burer-Monteiro factorization and a CUDA-based trust-region Riemannian optimizer (dubbed XM), (iv) build a structure from motion (SfM) pipeline with XM as the optimization engine and show that XM-SfM compares favorably with existing pipelines in terms of reconstruction quality while being significantly faster, more scalable, and initialization-free.

[618] arXiv:2502.05795 (replaced) [pdf, html, other]
Title: The Curse of Depth in Large Language Models
Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, Shiwei Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

In this paper, we introduce the Curse of Depth, a concept that highlights, explains, and addresses the recent observation in modern Large Language Models (LLMs) where nearly half of the layers are less effective than expected. We first confirm the wide existence of this phenomenon across the most popular families of LLMs such as Llama, Mistral, DeepSeek, and Qwen. Our analysis, theoretically and empirically, identifies that the underlying reason for the ineffectiveness of deep layers in LLMs is the widespread usage of Pre-Layer Normalization (Pre-LN). While Pre-LN stabilizes the training of Transformer LLMs, its output variance exponentially grows with the model depth, which undesirably causes the derivative of the deep Transformer blocks to be an identity matrix, and therefore barely contributes to the training. To resolve this training pitfall, we propose LayerNorm Scaling (LNS), which scales the variance of output of the layer normalization inversely by the square root of its depth. This simple modification mitigates the output variance explosion of deeper Transformer layers, improving their contribution. Across a wide range of model sizes (130M to 7B), our experiments show that LNS consistently outperforms previous normalization and scaling techniques in enhancing LLM pre-training performance. Moreover, this improvement seamlessly carries over to supervised fine-tuning. All these gains can be attributed to the fact that LayerNorm Scaling enables deeper layers to contribute more effectively during training. Our code is available at \href{this https URL}{LayerNorm-Scaling}.

[619] arXiv:2502.07007 (replaced) [pdf, other]
Title: Grounding Creativity in Physics: A Brief Survey of Physical Priors in AIGC
Siwei Meng, Yawei Luo, Ping Liu
Comments: Accepted by IJCAI 2025 Survey Track
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recent advancements in AI-generated content have significantly improved the realism of 3D and 4D generation. However, most existing methods prioritize appearance consistency while neglecting underlying physical principles, leading to artifacts such as unrealistic deformations, unstable dynamics, and implausible objects interactions. Incorporating physics priors into generative models has become a crucial research direction to enhance structural integrity and motion realism. This survey provides a review of physics-aware generative methods, systematically analyzing how physical constraints are integrated into 3D and 4D generation. First, we examine recent works in incorporating physical priors into static and dynamic 3D generation, categorizing methods based on representation types, including vision-based, NeRF-based, and Gaussian Splatting-based approaches. Second, we explore emerging techniques in 4D generation, focusing on methods that model temporal dynamics with physical simulations. Finally, we conduct a comparative analysis of major methods, highlighting their strengths, limitations, and suitability for different materials and motion dynamics. By presenting an in-depth analysis of physics-grounded AIGC, this survey aims to bridge the gap between generative models and physical realism, providing insights that inspire future research in physically consistent content generation.

[620] arXiv:2502.07707 (replaced) [pdf, html, other]
Title: PRVQL: Progressive Knowledge-guided Refinement for Robust Egocentric Visual Query Localization
Bing Fan, Yunhe Feng, Yapeng Tian, James Chenhao Liang, Yuewei Lin, Yan Huang, Heng Fan
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Egocentric visual query localization (EgoVQL) focuses on localizing the target of interest in space and time from first-person videos, given a visual query. Despite recent progressive, existing methods often struggle to handle severe object appearance changes and cluttering background in the video due to lacking sufficient target cues, leading to degradation. Addressing this, we introduce PRVQL, a novel Progressive knowledge-guided Refinement framework for EgoVQL. The core is to continuously exploit target-relevant knowledge directly from videos and utilize it as guidance to refine both query and video features for improving target localization. Our PRVQL contains multiple processing stages. The target knowledge from one stage, comprising appearance and spatial knowledge extracted via two specially designed knowledge learning modules, are utilized as guidance to refine the query and videos features for the next stage, which are used to generate more accurate knowledge for further feature refinement. With such a progressive process, target knowledge in PRVQL can be gradually improved, which, in turn, leads to better refined query and video features for localization in the final stage. Compared to previous methods, our PRVQL, besides the given object cues, enjoys additional crucial target information from a video as guidance to refine features, and hence enhances EgoVQL in complicated scenes. In our experiments on challenging Ego4D, PRVQL achieves state-of-the-art result and largely surpasses other methods, showing its efficacy. Our code, model and results will be released at this https URL.

[621] arXiv:2502.11620 (replaced) [pdf, html, other]
Title: Assessing Correctness in LLM-Based Code Generation via Uncertainty Estimation
Arindam Sharma, Cristina David
Comments: 18 pages and 3 References Pages
Subjects: Software Engineering (cs.SE)

In this work, we explore uncertainty estimation as a proxy for correctness in LLM-generated code. To this end, we adapt two state-of-the-art techniques from natural language generation -- one based on entropy and another on mutual information -- to the domain of code generation. Given the distinct semantic properties of code, we introduce modifications, including a semantic equivalence check based on symbolic execution. Our findings indicate a strong correlation between the uncertainty computed through these techniques and correctness, highlighting the potential of uncertainty estimation for quality assessment. Additionally, we propose a simplified version of the entropy-based method that assumes a uniform distribution over the LLM's responses, demonstrating comparable effectiveness. Using these techniques, we develop an abstention policy that prevents the model from making predictions when uncertainty is high, reducing incorrect outputs to near zero. Our evaluation on the LiveCodeBench shows that our approach significantly outperforms a baseline relying solely on LLM-reported log-probabilities.

[622] arXiv:2502.14051 (replaced) [pdf, html, other]
Title: RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
Payman Behnam, Yaosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Yu, Alexey Tumanov
Comments: ICML 2025
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Transformer-based Large Language Models rely critically on the KV cache to efficiently handle extended contexts during the decode phase. Yet, the size of the KV cache grows proportionally with the input length, burdening both memory bandwidth and capacity as decoding progresses. To address this challenge, we present RocketKV, a training-free KV cache compression strategy containing two consecutive stages. In the first stage, it performs coarse-grain permanent KV cache eviction on the input sequence tokens. In the second stage, it adopts a hybrid sparse attention method to conduct fine-grain top-k sparse attention, approximating the attention scores by leveraging both head and sequence dimensionality reductions. We show that RocketKV provides a compression ratio of up to 400$\times$, end-to-end speedup of up to 3.7$\times$ as well as peak memory reduction of up to 32.6% in the decode phase on an NVIDIA A100 GPU compared to the full KV cache baseline, while achieving negligible accuracy loss on a variety of long-context tasks. We also propose a variant of RocketKV for multi-turn scenarios, which consistently outperforms other existing methods and achieves accuracy nearly on par with an oracle top-k attention scheme.

[623] arXiv:2502.14156 (replaced) [pdf, html, other]
Title: Mixed Signals: A Diverse Point Cloud Dataset for Heterogeneous LiDAR V2X Collaboration
Katie Z Luo, Minh-Quan Dao, Zhenzhen Liu, Mark Campbell, Wei-Lun Chao, Kilian Q. Weinberger, Ezio Malis, Vincent Fremont, Bharath Hariharan, Mao Shan, Stewart Worrall, Julie Stephany Berrio Perez
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Vehicle-to-everything (V2X) collaborative perception has emerged as a promising solution to address the limitations of single-vehicle perception systems. However, existing V2X datasets are limited in scope, diversity, and quality. To address these gaps, we present Mixed Signals, a comprehensive V2X dataset featuring 45.1k point clouds and 240.6k bounding boxes collected from three connected autonomous vehicles (CAVs) equipped with two different configurations of LiDAR sensors, plus a roadside unit with dual LiDARs. Our dataset provides point clouds and bounding box annotations across 10 classes, ensuring reliable data for perception training. We provide detailed statistical analysis on the quality of our dataset and extensively benchmark existing V2X methods on it. The Mixed Signals dataset is ready-to-use, with precise alignment and consistent annotations across time and viewpoints. Dataset website is available at this https URL.

[624] arXiv:2502.14351 (replaced) [pdf, html, other]
Title: SegAnyPET: Universal Promptable Segmentation from Positron Emission Tomography Images
Yichi Zhang, Le Xue, Wenbo Zhang, Lanlan Li, Yuchen Liu, Chen Jiang, Yuan Cheng, Yuan Qi
Comments: Accept for ICCV 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Positron Emission Tomography (PET) is a powerful molecular imaging tool that plays a crucial role in modern medical diagnostics by visualizing radio-tracer distribution to reveal physiological processes. Accurate organ segmentation from PET images is essential for comprehensive multi-systemic analysis of interactions between different organs and pathologies. Existing segmentation methods are limited by insufficient annotation data and varying levels of annotation, resulting in weak generalization ability and difficulty in clinical application. Recent developments in segmentation foundation models have shown superior versatility across diverse segmentation tasks. Despite the efforts of medical adaptations, these works primarily focus on structural medical images with detailed physiological structural information and exhibit limited generalization performance on molecular PET imaging. In this paper, we collect and construct PETS-5k, the largest PET segmentation dataset to date, comprising 5,731 three-dimensional whole-body PET images and encompassing over 1.3M 2D images. Based on the established dataset, we develop SegAnyPET, a modality-specific 3D foundation model for universal promptable segmentation from PET images. To issue the challenge of discrepant annotation quality, we adopt a cross prompting confident learning (CPCL) strategy with an uncertainty-guided self-rectification process to robustly learn segmentation from high-quality labeled data and low-quality noisy labeled data for promptable segmentation. Experimental results demonstrate that SegAnyPET can segment seen and unseen target organs using only one or a few prompt points, outperforming state-of-the-art foundation models and task-specific fully supervised models with higher accuracy and strong generalization ability for universal segmentation.

[625] arXiv:2502.16889 (replaced) [pdf, html, other]
Title: Beyond Diagnostic Performance: Revealing and Quantifying Ethical Risks in Pathology Foundation Models
Weiping Lin, Shen Liu, Runchen Zhu, Yixuan Lin, Baoshun Wang, Liansheng Wang
Comments: 33 pages,5 figure,23 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Pathology foundation models (PFMs), as large-scale pre-trained models tailored for computational pathology, have significantly advanced a wide range of applications. Their ability to leverage prior knowledge from massive datasets has streamlined the development of intelligent pathology models. However, we identify several critical and interrelated ethical risks that remain underexplored, yet must be addressed to enable the safe translation of PFMs from lab to clinic. These include the potential leakage of patient-sensitive attributes, disparities in model performance across demographic and institutional subgroups, and the reliance on diagnosis-irrelevant features that undermine clinical reliability. In this study, we pioneer the quantitative analysis for ethical risks in PFMs, including privacy leakage, clinical reliability, and group fairness. Specifically, we propose an evaluation framework that systematically measures key dimensions of ethical concern: the degree to which patient-sensitive attributes can be inferred from model representations, the extent of performance disparities across demographic and institutional subgroups, and the influence of diagnostically irrelevant features on model decisions. We further investigate the underlying causes of these ethical risks in PFMs and empirically validate our findings. Then we offer insights into potential directions for mitigating such risks, aiming to inform the development of more ethically robust PFMs. This work provides the first quantitative and systematic evaluation of ethical risks in PFMs. Our findings highlight the urgent need for ethical safeguards in PFMs and offer actionable insights for building more trustworthy and clinically robust PFMs. To facilitate future research and deployment, we will release the assessment framework as an online toolkit to support the development, auditing, and deployment of ethically robust PFMs.

[626] arXiv:2502.18325 (replaced) [pdf, html, other]
Title: A Unified Bayesian Perspective for Conventional and Robust Adaptive Filters
Leszek Szczecinski, Jacob Benesty, Eduardo Vinicius Kuhn
Subjects: Information Retrieval (cs.IR); Statistics Theory (math.ST)

In this work, we present a new perspective on the origin and interpretation of adaptive filters. By applying Bayesian principles of recursive inference from the state-space model and using a series of simplifications regarding the structure of the solution, we can present, in a unified framework, derivations of many adaptive filters that depend on the probabilistic model of the measurement noise. In particular, under a Gaussian model, we obtain solutions well-known in the literature (such as LMS, NLMS, or Kalman filter), while using non-Gaussian noise, we derive new adaptive algorithms. Notably, under the assumption of Laplacian noise, we obtain a family of robust filters of which the sign-error algorithm is a well-known member, while other algorithms, derived effortlessly in the proposed framework, are entirely new. Numerical examples are shown to illustrate the properties and provide a better insight into the performance of the derived adaptive filters.

[627] arXiv:2502.18390 (replaced) [pdf, html, other]
Title: Unbent Collections of Orthogonal Drawings
Todor Antić, Giuseppe Liotta, Tomáš Masařík, Giacomo Ortali, Matthias Pfretzschner, Peter Stumpf, Alexander Wolff, Johannes Zink
Comments: 37 pages, 18 figures
Subjects: Computational Geometry (cs.CG); Discrete Mathematics (cs.DM); Data Structures and Algorithms (cs.DS); Combinatorics (math.CO)

Recently, there has been interest in representing single graphs by multiple drawings; for example, using graph stories, storyplans, or uncrossed collections. In this paper, we apply this idea to orthogonal graph drawing. Due to the orthogonal drawing style, we focus on 4-graphs, that is, graphs of maximum degree 4. We restrict ourselves to plane graphs, that is, planar graphs whose embedding is fixed. Our goal is to represent any plane 4-graph $G$ by an unbent collection, that is, a collection of orthogonal drawings of $G$ that adhere to the embedding of $G$ and ensure that each edge of $G$ is drawn without bends in at least one of the drawings. We investigate two objectives. First, we consider minimizing the number of drawings in an unbent collection. We prove that every plane 4-graph can be represented by a collection with at most three drawings, which is tight. We also give necessary and sufficient conditions for a graph to admit an unbent collection of size $2$. Second, we consider minimizing the total number of bends over all drawings in an unbent collection. We show that this problem is NP-hard and give a 3-approximation algorithm. For the special case of plane triconnected cubic graphs, we show how to compute minimum-bend collections in linear time.

[628] arXiv:2503.02134 (replaced) [pdf, html, other]
Title: Enabling mixed-precision in spectral element codes
Yanxiang Chen, Pablo de Oliveira Castro, Paolo Bientinesi, Niclas Jansson, Roman Iakymchuk
Comments: arXiv admin note: text overlap with arXiv:2405.11065
Subjects: Mathematical Software (cs.MS); Distributed, Parallel, and Cluster Computing (cs.DC)

Mixed-precision computing has the potential to significantly reduce the cost of exascale computations, but determining when and how to implement it in programs can be challenging. In this article, we propose a methodology for enabling mixed-precision with the help of computer arithmetic tools, roofline model, and computer arithmetic techniques. As case studies, we consider Nekbone, a mini-application for the Computational Fluid Dynamics (CFD) solver Nek5000, and a modern Neko CFD application. With the help of the Verificarlo tool and computer arithmetic techniques, we introduce a strategy to address stagnation issues in the preconditioned Conjugate Gradient method in Nekbone and apply these insights to implement a mixed-precision version of Neko. We evaluate the derived mixed-precision versions of these codes by combining metrics in three dimensions: accuracy, time-to-solution, and energy-to-solution. Notably, mixed-precision in Nekbone reduces time-to-solution by roughly 1.62x and energy-to-solution by 2.43x on MareNostrum 5, while in the real-world Neko application, the gain is up to 1.3x in both time and energy, with the accuracy that matches double-precision results.

[629] arXiv:2503.02424 (replaced) [pdf, html, other]
Title: Exploring Intrinsic Normal Prototypes within a Single Image for Universal Anomaly Detection
Wei Luo, Yunkang Cao, Haiming Yao, Xiaotian Zhang, Jianan Lou, Yuqi Cheng, Weiming Shen, Wenyong Yu
Comments: Accepted by CVPR2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Anomaly detection (AD) is essential for industrial inspection, yet existing methods typically rely on ``comparing'' test images to normal references from a training set. However, variations in appearance and positioning often complicate the alignment of these references with the test image, limiting detection accuracy. We observe that most anomalies manifest as local variations, meaning that even within anomalous images, valuable normal information remains. We argue that this information is useful and may be more aligned with the anomalies since both the anomalies and the normal information originate from the same image. Therefore, rather than relying on external normality from the training set, we propose INP-Former, a novel method that extracts Intrinsic Normal Prototypes (INPs) directly from the test image. Specifically, we introduce the INP Extractor, which linearly combines normal tokens to represent INPs. We further propose an INP Coherence Loss to ensure INPs can faithfully represent normality for the testing image. These INPs then guide the INP-Guided Decoder to reconstruct only normal tokens, with reconstruction errors serving as anomaly scores. Additionally, we propose a Soft Mining Loss to prioritize hard-to-optimize samples during training. INP-Former achieves state-of-the-art performance in single-class, multi-class, and few-shot AD tasks across MVTec-AD, VisA, and Real-IAD, positioning it as a versatile and universal solution for AD. Remarkably, INP-Former also demonstrates some zero-shot AD capability. Code is available at:this https URL.

[630] arXiv:2503.03040 (replaced) [pdf, html, other]
Title: SAGE: Steering Dialog Generation with Future-Aware State-Action Augmentation
Yizhe Zhang, Navdeep Jaitly
Comments: 9 pages main text
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Recent advances in large language models have demonstrated impressive capabilities in task-oriented applications, yet building emotionally intelligent chatbots that can engage in natural, strategic conversations remains a challenge. We present a novel approach called SAGE that uses latent variables to control long-horizon behavior in dialogue generation. At the core of our method is the State-Action Chain (SAC), which augments standard language model fine-tuning by introducing latent variables that encapsulate emotional states and conversational strategies between dialogue turns. During inference, these variables are generated before each response, enabling coarse-grained control over dialogue progression while maintaining natural interaction patterns. We also introduce a self-improvement pipeline that leverages dialogue tree search, LLM-based reward modeling, and targeted fine-tuning to optimize conversational trajectories. Our experimental results show that models trained with this approach demonstrate improved performance in emotional intelligence metrics while maintaining strong capabilities on LLM benchmarks. The discrete nature of our latent variables facilitates search-based strategies and provides a foundation for future applications of reinforcement learning to dialogue systems, where learning can occur at the state level rather than the token level. this https URL

[631] arXiv:2503.04038 (replaced) [pdf, html, other]
Title: Autonomous Robotic Bone Micro-Milling System with Automatic Calibration and 3D Surface Fitting
Enduo Zhao, Xiaofeng Lin, Yifan Wang, Kanako Harada
Comments: 8 pages, 8 figures, submitted to RA-L
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)

Automating bone micro-milling using a robotic system presents challenges due to the uncertainties in both the external and internal features of bone tissue. For example, during mouse cranial window creation, a circular path with a radius of 2 to 4 mm needs to be milled on the mouse skull using a microdrill. The uneven surface and non-uniform thickness of the mouse skull make it difficult to fully automate this process, requiring the system to possess advanced perceptual and adaptive capabilities. In this study, we address this challenge by integrating a Microscopic Stereo Camera System (MSCS) into the robotic bone micro-milling system and proposing a novel pre-measurement pipeline for the target surface. Starting from uncalibrated cameras, the pipeline enables automatic calibration and 3D surface fitting through a convolutional neural network (CNN)-based keypoint detection. Combined with the existing feedback-based system, we develop the world's first autonomous robotic bone micro-milling system capable of rapidly, in real-time, and accurately perceiving and adapting to surface unevenness and non-uniform thickness, thereby enabling an end-to-end autonomous cranial window creation workflow without human assistance. Validation experiments on euthanized mice demonstrate that the improved system achieves a success rate of 85.7% and an average milling time of 2.1 minutes, showing not only significant performance improvements over the previous system but also exceptional accuracy, speed, and stability compared to human operators.

[632] arXiv:2503.06385 (replaced) [pdf, html, other]
Title: A Good Start Matters: Enhancing Continual Learning with Data-Driven Weight Initialization
Md Yousuf Harun, Christopher Kanan
Comments: Accepted to the Conference on Lifelong Learning Agents (CoLLAs) 2025
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

To adapt to real-world data streams, continual learning (CL) systems must rapidly learn new concepts while preserving and utilizing prior knowledge. When it comes to adding new information to continually-trained deep neural networks (DNNs), classifier weights for newly encountered categories are typically initialized randomly, leading to high initial training loss (spikes) and instability. Consequently, achieving optimal convergence and accuracy requires prolonged training, increasing computational costs. Inspired by Neural Collapse (NC), we propose a weight initialization strategy to improve learning efficiency in CL. In DNNs trained with mean-squared-error, NC gives rise to a Least-Square (LS) classifier in the last layer, whose weights can be analytically derived from learned features. We leverage this LS formulation to initialize classifier weights in a data-driven manner, aligning them with the feature distribution rather than using random initialization. Our method mitigates initial loss spikes and accelerates adaptation to new tasks. We evaluate our approach in large-scale CL settings, demonstrating faster adaptation and improved CL performance.

[633] arXiv:2503.07330 (replaced) [pdf, html, other]
Title: Mitigating Hallucinations in YOLO-based Object Detection Models: A Revisit to Out-of-Distribution Detection
Weicheng He, Changshun Wu, Chih-Hong Cheng, Xiaowei Huang, Saddek Bensalem
Comments: Camera-ready version for IROS 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Object detection systems must reliably perceive objects of interest without being overly confident to ensure safe decision-making in dynamic environments. Filtering techniques based on out-of-distribution (OoD) detection are commonly added as an extra safeguard to filter hallucinations caused by overconfidence in novel objects. Nevertheless, evaluating YOLO-family detectors and their filters under existing OoD benchmarks often leads to unsatisfactory performance. This paper studies the underlying reasons for performance bottlenecks and proposes a methodology to improve performance fundamentally. Our first contribution is a calibration of all existing evaluation results: Although images in existing OoD benchmark datasets are claimed not to have objects within in-distribution (ID) classes (i.e., categories defined in the training dataset), around 13% of objects detected by the object detector are actually ID objects. Dually, the ID dataset containing OoD objects can also negatively impact the decision boundary of filters. These ultimately lead to a significantly imprecise performance estimation. Our second contribution is to consider the task of hallucination reduction as a joint pipeline of detectors and filters. By developing a methodology to carefully synthesize an OoD dataset that semantically resembles the objects to be detected, and using the crafted OoD dataset in the fine-tuning of YOLO detectors to suppress the objectness score, we achieve a 88% reduction in overall hallucination error with a combined fine-tuned detection and filtering system on the self-driving benchmark BDD-100K. Our code and dataset are available at: this https URL.

[634] arXiv:2503.08271 (replaced) [pdf, html, other]
Title: LangTime: A Language-Guided Unified Model for Time Series Forecasting with Proximal Policy Optimization
Wenzhe Niu, Zongxia Xie, Yanru Sun, Wei He, Man Xu, Chao Hao
Subjects: Machine Learning (cs.LG)

Recent research has shown an increasing interest in utilizing pre-trained large language models (LLMs) for a variety of time series applications. However, there are three main challenges when using LLMs as foundational models for time series forecasting: (1) Cross-domain generalization. (2) Cross-modality alignment. (3) Error accumulation in autoregressive frameworks. To address these challenges, we proposed LangTime, a language-guided unified model for time series forecasting that incorporates cross-domain pre-training with reinforcement learning-based fine-tuning. Specifically, LangTime constructs Temporal Comprehension Prompts (TCPs), which include dataset-wise and channel-wise instructions, to facilitate domain adaptation and condense time series into a single token, enabling LLMs to understand better and align temporal data. To improve autoregressive forecasting, we introduce TimePPO, a reinforcement learning-based fine-tuning algorithm. TimePPO mitigates error accumulation by leveraging a multidimensional rewards function tailored for time series and a repeat-based value estimation strategy. Extensive experiments demonstrate that LangTime achieves state-of-the-art cross-domain forecasting performance, while TimePPO fine-tuning effectively enhances the stability and accuracy of autoregressive forecasting.

[635] arXiv:2503.08358 (replaced) [pdf, html, other]
Title: DG16M: A Large-Scale Dataset for Dual-Arm Grasping with Force-Optimized Grasps
Md Faizal Karim, Mohammed Saad Hashmi, Shreya Bollimuntha, Mahesh Reddy Tapeti, Gaurav Singh, Nagamanikandan Govindan, K Madhava Krishna
Subjects: Robotics (cs.RO)

Dual-arm robotic grasping is crucial for handling large objects that require stable and coordinated manipulation. While single-arm grasping has been extensively studied, datasets tailored for dual-arm settings remain scarce. We introduce a large-scale dataset of 16 million dual-arm grasps, evaluated under improved force-closure constraints. Additionally, we develop a benchmark dataset containing 300 objects with approximately 30,000 grasps, evaluated in a physics simulation environment, providing a better grasp quality assessment for dual-arm grasp synthesis methods. Finally, we demonstrate the effectiveness of our dataset by training a Dual-Arm Grasp Classifier network that outperforms the state-of-the-art methods by 15\%, achieving higher grasp success rates and improved generalization across objects.

[636] arXiv:2503.08904 (replaced) [pdf, other]
Title: Towards Efficient Parametric State Estimation in Circulating Fuel Reactors with Shallow Recurrent Decoder Networks
Stefano Riva, Carolina Introini, J. Nathan Kutz, Antonio Cammi
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computational Physics (physics.comp-ph)

The recent developments in data-driven methods have paved the way to new methodologies to provide accurate state reconstruction of engineering systems; nuclear reactors represent particularly challenging applications for this task due to the complexity of the strongly coupled physics involved and the extremely harsh and hostile environments, especially for new technologies such as Generation-IV reactors. Data-driven techniques can combine different sources of information, including computational proxy models and local noisy measurements on the system, to robustly estimate the state. This work leverages the novel Shallow Recurrent Decoder architecture to infer the entire state vector (including neutron fluxes, precursors concentrations, temperature, pressure and velocity) of a reactor from three out-of-core time-series neutron flux measurements alone. In particular, this work extends the standard architecture to treat parametric time-series data, ensuring the possibility of investigating different accidental scenarios and showing the capabilities of this approach to provide an accurate state estimation in various operating conditions. This paper considers as a test case the Molten Salt Fast Reactor (MSFR), a Generation-IV reactor concept, characterised by strong coupling between the neutronics and the thermal hydraulics due to the liquid nature of the fuel. The promising results of this work are further strengthened by the possibility of quantifying the uncertainty associated with the state estimation, due to the considerably low training cost. The accurate reconstruction of every characteristic field in real-time makes this approach suitable for monitoring and control purposes in the framework of a reactor digital twin.

[637] arXiv:2503.09850 (replaced) [pdf, html, other]
Title: TabNSA: Native Sparse Attention for Efficient Tabular Data Learning
Ali Eslamian, Qiang Cheng
Comments: 26 pages, 11 tables
Subjects: Machine Learning (cs.LG)

Tabular data poses unique challenges for deep learning due to its heterogeneous feature types, lack of spatial structure, and often limited sample sizes. We propose TabNSA, a novel deep learning framework that integrates Native Sparse Attention (NSA) with a TabMixer backbone to efficiently model tabular data. TabNSA tackles computational and representational challenges by dynamically focusing on relevant feature subsets per instance. The NSA module employs a hierarchical sparse attention mechanism, including token compression, selective preservation, and localized sliding windows, to significantly reduce the quadratic complexity of standard attention operations while addressing feature heterogeneity. Complementing this, the TabMixer backbone captures complex, non-linear dependencies through parallel multilayer perceptron (MLP) branches with independent parameters. These modules are synergistically combined via element-wise summation and mean pooling, enabling TabNSA to model both global context and fine-grained interactions. Extensive experiments across supervised and transfer learning settings show that TabNSA consistently outperforms state-of-the-art deep learning models. Furthermore, by augmenting TabNSA with a fine-tuned large language model (LLM), we enable it to effectively address Few-Shot Learning challenges through language-guided generalization on diverse tabular benchmarks.

[638] arXiv:2503.10345 (replaced) [pdf, html, other]
Title: Mirror Online Conformal Prediction with Intermittent Feedback
Bowen Wang, Matteo Zecchin, Osvaldo Simeone
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)

Online conformal prediction enables the runtime calibration of a pre-trained artificial intelligence model using feedback on its performance. Calibration is achieved through set predictions that are updated via online rules so as to ensure long-term coverage guarantees. While recent research has demonstrated the benefits of incorporating prior knowledge into the calibration process, this has come at the cost of replacing coverage guarantees with less tangible regret guarantees based on the quantile loss. This work introduces intermittent mirror online conformal prediction (IM-OCP), a novel runtime calibration framework that integrates prior knowledge, operates under potentially intermittent feedback, and features minimal memory complexity. IM-OCP guarantees long-term coverage and sub-linear regret, both of which hold deterministically for any given data sequence and in expectation with respect to the intermittent feedback.

[639] arXiv:2503.11801 (replaced) [pdf, html, other]
Title: Diffuse-CLoC: Guided Diffusion for Physics-based Character Look-ahead Control
Xiaoyu Huang, Takara Truong, Yunbo Zhang, Fangzhou Yu, Jean Pierre Sleiman, Jessica Hodgins, Koushil Sreenath, Farbod Farshidian
Subjects: Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)

We present Diffuse-CLoC, a guided diffusion framework for physics-based look-ahead control that enables intuitive, steerable, and physically realistic motion generation. While existing kinematics motion generation with diffusion models offer intuitive steering capabilities with inference-time conditioning, they often fail to produce physically viable motions. In contrast, recent diffusion-based control policies have shown promise in generating physically realizable motion sequences, but the lack of kinematics prediction limits their steerability. Diffuse-CLoC addresses these challenges through a key insight: modeling the joint distribution of states and actions within a single diffusion model makes action generation steerable by conditioning it on the predicted states. This approach allows us to leverage established conditioning techniques from kinematic motion generation while producing physically realistic motions. As a result, we achieve planning capabilities without the need for a high-level planner. Our method handles a diverse set of unseen long-horizon downstream tasks through a single pre-trained model, including static and dynamic obstacle avoidance, motion in-betweening, and task-space control. Experimental results show that our method significantly outperforms the traditional hierarchical framework of high-level motion diffusion and low-level tracking.

[640] arXiv:2503.13327 (replaced) [pdf, html, other]
Title: Edit Transfer: Learning Image Editing via Vision In-Context Relations
Lan Chen, Qi Mao, Yuchao Gu, Mike Zheng Shou
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We introduce a new setting, Edit Transfer, where a model learns a transformation from just a single source-target example and applies it to a new query image. While text-based methods excel at semantic manipulations through textual prompts, they often struggle with precise geometric details (e.g., poses and viewpoint changes). Reference-based editing, on the other hand, typically focuses on style or appearance and fails at non-rigid transformations. By explicitly learning the editing transformation from a source-target pair, Edit Transfer mitigates the limitations of both text-only and appearance-centric references. Drawing inspiration from in-context learning in large language models, we propose a visual relation in-context learning paradigm, building upon a DiT-based text-to-image model. We arrange the edited example and the query image into a unified four-panel composite, then apply lightweight LoRA fine-tuning to capture complex spatial transformations from minimal examples. Despite using only 42 training samples, Edit Transfer substantially outperforms state-of-the-art TIE and RIE methods on diverse non-rigid scenarios, demonstrating the effectiveness of few-shot visual relation learning.

[641] arXiv:2503.13504 (replaced) [pdf, html, other]
Title: CoCMT: Communication-Efficient Cross-Modal Transformer for Collaborative Perception
Rujia Wang, Xiangbo Gao, Hao Xiang, Runsheng Xu, Zhengzhong Tu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Multi-agent collaborative perception enhances each agent perceptual capabilities by sharing sensing information to cooperatively perform robot perception tasks. This approach has proven effective in addressing challenges such as sensor deficiencies, occlusions, and long-range perception. However, existing representative collaborative perception systems transmit intermediate feature maps, such as bird-eye view (BEV) representations, which contain a significant amount of non-critical information, leading to high communication bandwidth requirements. To enhance communication efficiency while preserving perception capability, we introduce CoCMT, an object-query-based collaboration framework that optimizes communication bandwidth by selectively extracting and transmitting essential features. Within CoCMT, we introduce the Efficient Query Transformer (EQFormer) to effectively fuse multi-agent object queries and implement a synergistic deep supervision to enhance the positive reinforcement between stages, leading to improved overall performance. Experiments on OPV2V and V2V4Real datasets show CoCMT outperforms state-of-the-art methods while drastically reducing communication needs. On V2V4Real, our model (Top-50 object queries) requires only 0.416 Mb bandwidth, 83 times less than SOTA methods, while improving AP70 by 1.1 percent. This efficiency breakthrough enables practical collaborative perception deployment in bandwidth-constrained environments without sacrificing detection accuracy.

[642] arXiv:2503.15044 (replaced) [pdf, html, other]
Title: SPADE: Structured Prompting Augmentation for Dialogue Enhancement in Machine-Generated Text Detection
Haoyi Li, Angela Yifei Yuan, Soyeon Caren Han, Christopher Leckie
Comments: ACL LLMSEC
Subjects: Computation and Language (cs.CL)

The increasing capability of large language models (LLMs) to generate synthetic content has heightened concerns about their misuse, driving the development of Machine-Generated Text (MGT) detection models. However, these detectors face significant challenges due to the lack of high-quality synthetic datasets for training. To address this issue, we propose SPADE, a structured framework for detecting synthetic dialogues using prompt-based positive and negative samples. Our proposed methods yield 14 new dialogue datasets, which we benchmark against eight MGT detection models. The results demonstrate improved generalization performance when utilizing a mixed dataset produced by proposed augmentation frameworks, offering a practical approach to enhancing LLM application security. Considering that real-world agents lack knowledge of future opponent utterances, we simulate online dialogue detection and examine the relationship between chat history length and detection accuracy. Our open-source datasets, code and prompts can be downloaded from this https URL.

[643] arXiv:2503.18549 (replaced) [pdf, html, other]
Title: RLCAD: Reinforcement Learning Training Gym for Revolution Involved CAD Command Sequence Generation
Xiaolong Yin, Xingyu Lu, Jiahang Shen, Jingzhe Ni, Hailong Li, Ruofeng Tong, Min Tang, Peng Du
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

A CAD command sequence is a typical parametric design paradigm in 3D CAD systems where a model is constructed by overlaying 2D sketches with operations such as extrusion, revolution, and Boolean operations. Although there is growing academic interest in the automatic generation of command sequences, existing methods and datasets only support operations such as 2D sketching, extrusion,and Boolean operations. This limitation makes it challenging to represent more complex geometries. In this paper, we present a reinforcement learning (RL) training environment (gym) built on a CAD geometric engine. Given an input boundary representation (B-Rep) geometry, the policy network in the RL algorithm generates an action. This action, along with previously generated actions, is processed within the gym to produce the corresponding CAD geometry, which is then fed back into the policy network. The rewards, determined by the difference between the generated and target geometries within the gym, are used to update the RL network. Our method supports operations beyond sketches, Boolean, and extrusion, including revolution operations. With this training gym, we achieve state-of-the-art (SOTA) quality in generating command sequences from B-Rep geometries.

[644] arXiv:2503.21248 (replaced) [pdf, html, other]
Title: ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition
Yujie Liu, Zonglin Yang, Tong Xie, Jinjie Ni, Ben Gao, Yuqiang Li, Shixiang Tang, Wanli Ouyang, Erik Cambria, Dongzhan Zhou
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)

Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined due to the lack of a dedicated benchmark. To address this gap, we introduce the first large-scale benchmark for evaluating LLMs with a near-sufficient set of sub-tasks of scientific discovery: inspiration retrieval, hypothesis composition, and hypothesis ranking. We develop an automated framework that extracts critical components - research questions, background surveys, inspirations, and hypotheses - from scientific papers across 12 disciplines, with expert validation confirming its accuracy. To prevent data contamination, we focus exclusively on papers published in 2024, ensuring minimal overlap with LLM pretraining data. Our evaluation reveals that LLMs perform well in retrieving inspirations, an out-of-distribution task, suggesting their ability to surface novel knowledge associations. This positions LLMs as "research hypothesis mines", capable of facilitating automated scientific discovery by generating innovative hypotheses at scale with minimal human intervention.

[645] arXiv:2503.21393 (replaced) [pdf, html, other]
Title: An evaluation of LLMs and Google Translate for translation of selected Indian languages via sentiment and semantic analyses
Rohitash Chandra, Aryan Chaudhari, Yeshwanth Rayavarapu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Large Language models (LLMs) have been prominent for language translation, including low-resource languages. There has been limited study on the assessment of the quality of translations generated by LLMs, including Gemini, GPT, and Google Translate. This study addresses this limitation by using semantic and sentiment analysis of selected LLMs for Indian languages, including Sanskrit, Telugu and Hindi. We select prominent texts (Bhagavad Gita, Tamas and Maha Prasthanam ) that have been well translated by experts and use LLMs to generate their translations into English, and provide a comparison with selected expert (human) translations. Our investigation revealed that while LLMs have made significant progress in translation accuracy, challenges remain in preserving sentiment and semantic integrity, especially in metaphorical and philosophical contexts for texts such as the Bhagavad Gita. The sentiment analysis revealed that GPT models are better at preserving the sentiment polarity for the given texts when compared to human (expert) translation. The results revealed that GPT models are generally better at maintaining the sentiment and semantics when compared to Google Translate. This study could help in the development of accurate and culturally sensitive translation systems for large language models.

[646] arXiv:2504.04065 (replaced) [pdf, html, other]
Title: Enabling Collaborative Parametric Knowledge Calibration for Retrieval-Augmented Vision Question Answering
Jiaqi Deng, Kaize Shi, Zonghan Wu, Huan Huo, Dingxian Wang, Guandong Xu
Comments: 10 pages, 5 figures, Under Review
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)

Knowledge-based Vision Question Answering (KB-VQA) systems address complex visual-grounded questions with knowledge retrieved from external knowledge bases. The tasks of knowledge retrieval and answer generation tasks both necessitate precise multimodal understanding of question context and external knowledge. However, existing methods treat these two stages as separate modules with limited interaction during training, which hinders bi-directional parametric knowledge sharing, ultimately leading to suboptimal performance. To fully exploit the cross-task synergy in KB-VQA, we propose a unified retrieval-augmented VQA framework with collaborative parametric knowledge calibration. The proposed framework can effectively adapt general multimodal pre-trained models for fine-grained, knowledge-intensive tasks while enabling the retriever and generator to collaboratively enhance and share their parametric knowledge during both training and inference. To enhance fine-grained understanding of questions and external documents, we also integrate late interaction mechanism into the proposed training framework. Additionally, we introduce a reflective-answering mechanism that allows the model to explicitly evaluate and refine its knowledge boundary. Our approach achieves competitive performance against state-of-the-art models, delivering a significant 4.7\% improvement in answering accuracy, and brings an average 7.5\% boost in base MLLMs' VQA performance.

[647] arXiv:2504.05250 (replaced) [pdf, html, other]
Title: PEAKS: Selecting Key Training Examples Incrementally via Prediction Error Anchored by Kernel Similarity
Mustafa Burak Gurbuz, Xingyu Zheng, Constantine Dovrolis
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

As deep learning continues to be driven by ever-larger datasets, understanding which examples are most important for generalization has become a critical question. While progress in data selection continues, emerging applications require studying this problem in dynamic contexts. To bridge this gap, we pose the Incremental Data Selection (IDS) problem, where examples arrive as a continuous stream, and need to be selected without access to the full data source. In this setting, the learner must incrementally build a training dataset of predefined size while simultaneously learning the underlying task. We find that in IDS, the impact of a new sample on the model state depends fundamentally on both its geometric relationship in the feature space and its prediction error. Leveraging this insight, we propose PEAKS (Prediction Error Anchored by Kernel Similarity), an efficient data selection method tailored for IDS. Our comprehensive evaluations demonstrate that PEAKS consistently outperforms existing selection strategies. Furthermore, PEAKS yields increasingly better performance returns than random selection as training data size grows on real-world datasets. The code is available at this https URL.

[648] arXiv:2504.05288 (replaced) [pdf, other]
Title: Seeking and Updating with Live Visual Knowledge
Mingyang Fu, Yuyang Peng, Dongping Chen, Zetong Zhou, Benlin Liu, Yao Wan, Zhou Zhao, Philip S. Yu, Ranjay Krishna
Comments: Preprint. Under Review
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

The visual world around us constantly evolves, from real-time news and social media trends to global infrastructure changes visible through satellite imagery and augmented reality enhancements. However, Multimodal Large Language Models (MLLMs), which automate many tasks, struggle to stay current, limited by the cutoff dates in their fixed training datasets. To quantify this stagnation, we introduce LiveVQA, the first-of-its-kind dataset featuring 107,143 samples and 12 categories data specifically designed to support research in both seeking and updating with live visual knowledge. Drawing from recent news articles, video platforms, and academic publications in April 2024-May 2025, LiveVQA enables evaluation of how models handle latest visual information beyond their knowledge boundaries and how current methods help to update them. Our comprehensive benchmarking of 17 state-of-the-art MLLMs reveals significant performance gaps on content beyond knowledge cutoff, and tool-use or agentic visual seeking framework drastically gain an average of 327% improvement. Furthermore, we explore parameter-efficient fine-tuning (PEFT) methods to update MLLMs with new visual knowledge. We dive deeply to the critical balance between adapter capacity and model capability when updating MLLMs with new visual knowledge. All the experimental dataset and source code are publicly available at: this https URL.

[649] arXiv:2504.06722 (replaced) [pdf, html, other]
Title: Plastic tensor networks for interpretable generative modeling
Katsuya O. Akamatsu, Kenji Harada, Tsuyoshi Okubo, Naoki Kawashima
Comments: 18 pages, 17 figures
Subjects: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech)

A structural optimization scheme for a single-layer nonnegative adaptive tensor tree (NATT) that models a target probability distribution is proposed as an alternative paradigm for generative modeling. The NATT scheme, by construction, automatically searches for a tree structure that best fits a given discrete dataset whose features serve as inputs, and has the advantage that it is interpretable as a probabilistic graphical model. We consider the NATT scheme and a recently proposed Born machine adaptive tensor tree (BMATT) optimization scheme and demonstrate their effectiveness on a variety of generative modeling tasks where the objective is to infer the hidden structure of a provided dataset. Our results show that in terms of minimizing the negative log-likelihood, the single-layer scheme has model performance comparable to the Born machine scheme, though not better. The tasks include deducing the structure of binary bitwise operations, learning the internal structure of random Bayesian networks given only visible sites, and a real-world example related to hierarchical clustering where a cladogram is constructed from mitochondrial DNA sequences. In doing so, we also show the importance of the choice of network topology and the versatility of a least-mutual information criterion in selecting a candidate structure for a tensor tree, as well as discuss aspects of these tensor tree generative models including their information content and interpretability.

[650] arXiv:2504.07416 (replaced) [pdf, html, other]
Title: RadZero: Similarity-Based Cross-Attention for Explainable Vision-Language Alignment in Radiology with Zero-Shot Multi-Task Capability
Jonggwon Park, Soobum Kim, Byungmu Yoon, Kyoyun Choi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Recent advancements in multi-modal models have significantly improved vision-language (VL) alignment in radiology. However, existing approaches struggle to effectively utilize complex radiology reports for learning and offer limited interpretability through attention probability visualizations. To address these challenges, we introduce RadZero, a novel framework for VL alignment in radiology with zero-shot multi-task capability. A key component of our approach is VL-CABS (Vision-Language Cross-Attention Based on Similarity), which aligns text embeddings with local image features for interpretable, fine-grained VL reasoning. RadZero leverages large language models to extract concise semantic sentences from radiology reports and employs multi-positive contrastive training to effectively capture relationships between images and multiple relevant textual descriptions. It uses a pre-trained vision encoder with additional trainable Transformer layers, allowing efficient high-resolution image processing. By computing similarity between text embeddings and local image patch features, VL-CABS enables zero-shot inference with similarity probability for classification, and pixel-level VL similarity maps for grounding and segmentation. Experimental results on public chest radiograph benchmarks show that RadZero outperforms state-of-the-art methods in zero-shot classification, grounding, and segmentation. Furthermore, VL similarity map analysis highlights the potential of VL-CABS for improving explainability in VL alignment. Additionally, qualitative evaluation demonstrates RadZero's capability for open-vocabulary semantic segmentation, further validating its effectiveness in medical imaging.

[651] arXiv:2504.08811 (replaced) [pdf, html, other]
Title: Analogical Learning for Cross-Scenario Generalization: Framework and Application to Intelligent Localization
Zirui Chen, Zhaoyang Zhang, Ziqing Xing, Ridong Li, Zhaohui Yang, Richeng Jin, Chongwen Huang, Yuzhi Yang, Mérouane Debbah
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Signal Processing (eess.SP)

Existing learning models often exhibit poor generalization when deployed across diverse scenarios. It is primarily due to that the underlying reference frame of the data varies with the deployment environment and settings. However, despite that data of each scenario has a distinct reference frame, its generation generally follows common underlying physical rules. Based on this understanding, this article proposes a deep learning framework named analogical learning (AL), which implicitly retrieves the reference frame information associated with a scenario and then to make accurate prediction by relative analogy with other scenarios. Specifically, we design a bipartite neural network called Mateformer. Its first part captures the relativity within multiple latent feature spaces between the input data and a small amount of embedded data from the studied scenario, while its second part uses this relativity to guide the nonlinear analogy. We apply AL to the typical multi-scenario learning problem of intelligent wireless localization in cellular networks. Extensive experiments validate AL's superiority across three key dimensions. First, it achieves state-of-the-art accuracy in single-scenario benchmarks. Second, it demonstrates stable transferability between different scenarios, avoiding catastrophic forgetting. Finally, and most importantly, it robustly adapts to new, unseen scenarios--including dynamic weather and traffic conditions--without any tuning. All data and code are available at this https URL.

[652] arXiv:2504.09296 (replaced) [pdf, html, other]
Title: Look and Talk: Seamless AI Assistant Interaction with Gaze-Triggered Activation
Zhang Qing, Rekimoto Jun
Subjects: Human-Computer Interaction (cs.HC)

Engaging with AI assistants to gather essential information in a timely manner is becoming increasingly common. Traditional activation methods, like wake words such as Hey Siri, Ok Google, and Hey Alexa, are constrained by technical challenges such as false activations, recognition errors, and discomfort in public settings. Similarly, activating AI systems via physical buttons imposes strict interactive limitations as it demands particular physical actions, which hinders fluid and spontaneous communication with AI. Our approach employs eye-tracking technology within AR glasses to discern a user's intention to engage with the AI assistant. By sustaining eye contact on a virtual AI avatar for a specific time, users can initiate an interaction silently and without using their hands. Preliminary user feedback suggests that this technique is relatively intuitive, natural, and less obtrusive, highlighting its potential for integrating AI assistants fluidly into everyday interactions.

[653] arXiv:2504.12652 (replaced) [pdf, html, other]
Title: AdaptoVision: A Multi-Resolution Image Recognition Model for Robust and Scalable Classification
Md. Sanaullah Chowdhury Lameya Sabrin
Subjects: Computer Vision and Pattern Recognition (cs.CV)

This paper introduces AdaptoVision, a novel convolutional neural network (CNN) architecture designed to efficiently balance computational complexity and classification accuracy. By leveraging enhanced residual units, depth-wise separable convolutions, and hierarchical skip connections, AdaptoVision significantly reduces parameter count and computational requirements while preserving competitive performance across various benchmark and medical image datasets. Extensive experimentation demonstrates that AdaptoVision achieves state-of-the-art on BreakHis dataset and comparable accuracy levels, notably 95.3\% on CIFAR-10 and 85.77\% on CIFAR-100, without relying on any pretrained weights. The model's streamlined architecture and strategic simplifications promote effective feature extraction and robust generalization, making it particularly suitable for deployment in real-time and resource-constrained environments.

[654] arXiv:2504.12948 (replaced) [pdf, html, other]
Title: Algorithms for the Shortest Vector Problem in $2$-dimensional Lattices, Revisited
Lihao Zhao, Chengliang Tian, Jingguo Bi, Guangwu Xu, Jia Yu
Subjects: Computational Geometry (cs.CG); Cryptography and Security (cs.CR)

Efficiently solving the Shortest Vector Problem (SVP) in two-dimensional lattices holds practical significance in cryptography and computational geometry. While simpler than its high-dimensional counterpart, two-dimensional SVP motivates scalable solutions for high-dimensional lattices and benefits applications like sequence cipher cryptanalysis involving large integers. In this work, we first propose a novel definition of reduced bases and develop an efficient adaptive lattice reduction algorithm \textbf{CrossEuc} that strategically applies the Euclidean algorithm across dimensions. Building on this framework, we introduce \textbf{HVec}, a vectorized generalization of the Half-GCD algorithm originally defined for integers, which can efficiently halve the bit-length of two vectors and may have independent interest. By iteratively invoking \textbf{HVec}, our optimized algorithm \textbf{HVecSBP} achieves a reduced basis in $O(\log n M(n) )$ time for arbitrary input bases with bit-length $n$, where $M(n)$ denotes the cost of multiplying two $n$-bit integers. Compared to existing algorithms, our design is applicable to general forms of input lattices, eliminating the cost of pre-converting input bases to Hermite Normal Form (HNF). The comprehensive experimental results demonstrate that for the input lattice bases in HNF, the optimized algorithm \textbf{HVecSBP} achieves at least a $13.5\times$ efficiency improvement compared to existing methods. For general-form input lattice bases, converting them to HNF before applying \textbf{HVecSBP} offers only marginal advantages in extreme cases where the two basis vectors are nearly degenerate. However, as the linear dependency between input basis vectors decreases, directly employing \textbf{HVecSBP} yields increasingly significant efficiency gains, outperforming hybrid approaches that rely on prior \textbf{HNF} conversion.

[655] arXiv:2504.13800 (replaced) [pdf, html, other]
Title: Unified Manipulability and Compliance Analysis of Modular Soft-Rigid Hybrid Fingers
Jianshu Zhou, Boyuan Liang, Junda Huang, Masayoshi Tomizuka
Subjects: Robotics (cs.RO)

This paper presents a unified framework to analyze the manipulability and compliance of modular soft-rigid hybrid robotic fingers. The approach applies to both hydraulic and pneumatic actuation systems. A Jacobian-based formulation maps actuator inputs to joint and task-space responses. Hydraulic actuators are modeled under incompressible assumptions, while pneumatic actuators are described using nonlinear pressure-volume relations. The framework enables consistent evaluation of manipulability ellipsoids and compliance matrices across actuation modes. We validate the analysis using two representative hands: DexCo (hydraulic) and Edgy-2 (pneumatic). Results highlight actuation-dependent trade-offs in dexterity and passive stiffness. These findings provide insights for structure-aware design and actuator selection in soft-rigid robotic fingers.

[656] arXiv:2504.15162 (replaced) [pdf, html, other]
Title: To Offload or Not To Offload: Model-driven Comparison of Edge-native and On-device Processing
Nathan Ng, David Irwin, Ananthram Swami, Don Towsley, Prashant Shenoy
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)

Computational offloading is a promising approach for overcoming resource constraints on client devices by moving some or all of an application's computations to remote servers. With the advent of specialized hardware accelerators, client devices are now able to perform fast local processing of specific tasks, such as machine learning inference, reducing the need for offloading computations. However, edge servers with accelerators also offer faster processing for offloaded tasks than was previously possible. In this paper, we present an analytic and experimental comparison of on-device processing and edge offloading for a range of accelerator, network, and application workload scenarios, with the goal of understanding when to use local on-device processing and when to offload computations. We present models that leverage analytical queuing results to capture the effects of dynamic factors such as the performance gap between the device and edge server, network variability, server load, and multi-tenancy on the edge server. We experimentally demonstrate the accuracy of our models for a range of hardware and application scenarios and show that our models achieve a mean absolute percentage error of 2.2% compared to observed latencies. We use our models to develop an adaptive resource manager for intelligent offloading and show its efficacy in the presence of variable network conditions and dynamic multi-tenant edge settings.

[657] arXiv:2504.17269 (replaced) [pdf, other]
Title: Towards Generalized and Training-Free Text-Guided Semantic Manipulation
Yu Hong, Xiao Cai, Pengpeng Zeng, Shuai Zhang, Jingkuan Song, Lianli Gao, Heng Tao Shen
Comments: Project Page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Text-guided semantic manipulation refers to semantically editing an image generated from a source prompt to match a target prompt, enabling the desired semantic changes (e.g., addition, removal, and style transfer) while preserving irrelevant contents. With the powerful generative capabilities of the diffusion model, the task has shown the potential to generate high-fidelity visual content. Nevertheless, existing methods either typically require time-consuming fine-tuning (inefficient), fail to accomplish multiple semantic manipulations (poorly extensible), and/or lack support for different modality tasks (limited generalizability). Upon further investigation, we find that the geometric properties of noises in the diffusion model are strongly correlated with the semantic changes. Motivated by this, we propose a novel $\textit{GTF}$ for text-guided semantic manipulation, which has the following attractive capabilities: 1) $\textbf{Generalized}$: our $\textit{GTF}$ supports multiple semantic manipulations (e.g., addition, removal, and style transfer) and can be seamlessly integrated into all diffusion-based methods (i.e., Plug-and-play) across different modalities (i.e., modality-agnostic); and 2) $\textbf{Training-free}$: $\textit{GTF}$ produces high-fidelity results via simply controlling the geometric relationship between noises without tuning or optimization. Our extensive experiments demonstrate the efficacy of our approach, highlighting its potential to advance the state-of-the-art in semantics manipulation.

[658] arXiv:2504.17924 (replaced) [pdf, html, other]
Title: Learning Attentive Neural Processes for Planning with Pushing Actions
Atharv Jain, Seiji Shaw, Nicholas Roy
Comments: Presented at 2025 RoboReps Workshop
Subjects: Robotics (cs.RO)

Our goal is to enable robots to plan sequences of tabletop actions to push a block with unknown physical properties to a desired goal pose. We approach this problem by learning the constituent models of a Partially-Observable Markov Decision Process (POMDP), where the robot can observe the outcome of a push, but the physical properties of the block that govern the dynamics remain unknown. A common solution approach is to train an observation model in a supervised fashion, and do inference with a general inference technique such as particle filters. However, supervised training requires knowledge of the relevant physical properties that determine the problem dynamics, which we do not assume to be known. Planning also requires simulating many belief updates, which becomes expensive when using particle filters to represent the belief. We propose to learn an Attentive Neural Process that computes the belief over a learned latent representation of the relevant physical properties given a history of actions. To address the pushing planning problem, we integrate a trained Neural Process with a double-progressive widening sampling strategy. Simulation results indicate that Neural Process Tree with Double Progressive Widening (NPT-DPW) generates better-performing plans faster than traditional particle-filter methods that use a supervised-trained observation model, even in complex pushing scenarios.

[659] arXiv:2504.19856 (replaced) [pdf, html, other]
Title: Efficient Domain-adaptive Continual Pretraining for the Process Industry in the German Language
Anastasia Zhukova, Christian E. Matt, Bela Gipp
Comments: accepted to TSD 2025
Subjects: Computation and Language (cs.CL)

Domain-adaptive continual pretraining (DAPT) is a state-of-the-art technique that further trains a language model (LM) on its pretraining task, e.g., masked language modeling (MLM), when common domain adaptation via LM fine-tuning is not possible due to a lack of labeled task data. Although popular, MLM requires a significant corpus of domain-related data, which is difficult to obtain for specific domains in languages other than English, such as the process industry in the German language. This paper introduces an efficient approach called ICL-augmented pretraining or ICL-APT that leverages in-context learning (ICL) and k-nearest neighbors (kNN) to augment target data with domain-related and in-domain texts, significantly reducing GPU time while maintaining strong model performance. Our results show that the best configuration of ICL-APT performed better than the state-of-the-art DAPT by 28.7% (7.87 points) and requires almost 4 times less GPU-computing time, providing a cost-effective solution for industries with limited computational capacity. The findings highlight the broader applicability of this framework to other low-resource industries, making NLP-based solutions more accessible and feasible in production environments.

[660] arXiv:2504.20367 (replaced) [pdf, html, other]
Title: Theoretical Grid-Forming Extreme of Inverters
Qianxi Tang, Li Peng
Comments: 4 pages, 2 figures, letter, This work has been submitted to the IEEE for possible publication
Subjects: Systems and Control (eess.SY)

What are the theoretical and physical limits of a grid-forming inverter? This letter proposes that the extreme grid-forming ability of inverters is limited by their dc-side, ac-side, circuit topology dynamics, but not control. While many papers focus on how to improve grid-forming inverters stability, power sharing, inertia emulation, fault response, few, if any, formally define the fundamental theoretical limits or extremes of grid-forming behavior. It seems that the grid-forming can be improved endlessly. No physical system can support a grid indefinitely without limitations, especially under increasing levels of disturbance or uncertainty. Therefore, this boundary is explicitly shown by a mathematical expression in this letter. Consequently, the results show that relatively low dc-side voltage and high active power injection could damage the grid-forming ability. Poor consideration of dc-side, ac-side, and circuit topology dynamics in real practice will cause jeopardizing oscillation even by the theoretical best grid-forming control strategy.

[661] arXiv:2505.00384 (replaced) [pdf, html, other]
Title: Improving the scalability of a high-order atmospheric dynamics solver based on the deal.II library
Giuseppe Orlando, Tommaso Benacchio, Luca Bonaventura
Subjects: Numerical Analysis (math.NA); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)

We present recent advances on the massively parallel performance of a numerical scheme for atmosphere dynamics applications based on the this http URL library. The implicit-explicit discontinuous finite element scheme is based on a matrix-free approach, meaning that no global sparse matrix is built and only the action of the linear operators on a vector is actually implemented. Following a profiling analysis, we focus on the performance optimization of the numerical method and describe the impact of different preconditioning and solving techniques in this framework. Moreover, we show how the use of the latest version of the this http URL library and of suitable execution flags can improve the parallel performance.

[662] arXiv:2505.00703 (replaced) [pdf, html, other]
Title: T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, Hongsheng Li
Comments: Project Page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Recent advancements in large language models have demonstrated how chain-of-thought (CoT) and reinforcement learning (RL) can improve performance. However, applying such reasoning strategies to the visual generation domain remains largely unexplored. In this paper, we present T2I-R1, a novel reasoning-enhanced text-to-image generation model, powered by RL with a bi-level CoT reasoning process. Specifically, we identify two levels of CoT that can be utilized to enhance different stages of generation: (1) the semantic-level CoT for high-level planning of the prompt and (2) the token-level CoT for low-level pixel processing during patch-by-patch generation. To better coordinate these two levels of CoT, we introduce BiCoT-GRPO with an ensemble of generation rewards, which seamlessly optimizes both generation CoTs within the same training step. By applying our reasoning strategies to the baseline model, Janus-Pro, we achieve superior performance with 13% improvement on T2I-CompBench and 19% improvement on the WISE benchmark, even surpassing the state-of-the-art model FLUX.1. Code is available at: this https URL

[663] arXiv:2505.00949 (replaced) [pdf, html, other]
Title: Llama-Nemotron: Efficient Reasoning Models
Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, Ido Shahaf, Oren Tropp, Ehud Karpas, Ran Zilberstein, Jiaqi Zeng, Soumye Singhal, Alexander Bukharin, Yian Zhang, Tugrul Konuk, Gerald Shen, Ameya Sunil Mahabaleshwarkar, Bilal Kartal, Yoshi Suhara, Olivier Delalleau, Zijia Chen, Zhilin Wang, David Mosallanezhad, Adi Renduchintala, Haifeng Qian, Dima Rekesh, Fei Jia, Somshubra Majumdar, Vahid Noroozi, Wasi Uddin Ahmad, Sean Narenthiran, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Siddhartha Jain, Igor Gitman, Ivan Moshkov, Wei Du, Shubham Toshniwal, George Armstrong, Branislav Kisacanin, Matvei Novikov, Daria Gitman, Evelina Bakhturina, Prasoon Varshney, Makesh Narsimhan, Jane Polak Scowcroft, John Kamalu, Dan Su, Kezhi Kong, Markus Kliegl, Rabeeh Karimi, Ying Lin, Sanjeev Satheesh, Jupinder Parmar, Pritam Gundecha, Brandon Norick, Joseph Jennings, Shrimai Prabhumoye, Syeda Nahida Akter, Mostofa Patwary, Abhinav Khattar, Deepak Narayanan, Roger Waleffe, Jimmy Zhang, Bor-Yiing Su, Guyue Huang, Terry Kong, Parth Chadha, Sahil Jain, Christine Harvey, Elad Segal, Jining Huang, Sergey Kashirsky, Robert McQueen, Izzy Putterman, George Lam, Arun Venkatesan, Sherry Wu, Vinh Nguyen, Manoj Kilaru, Andrew Wang, Anna Warno, Abhilash Somasamudramath, Sandip Bhaskar, Maka Dong, Nave Assaf, Shahar Mor, Omer Ullman Argov, Scot Junkin, Oleksandr Romanenko, Pedro Larroy, Monika Katariya, Marco Rovinelli, Viji Balas, Nicholas Edelman
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

We introduce the Llama-Nemotron series of models, an open family of heterogeneous reasoning models that deliver exceptional reasoning capabilities, inference efficiency, and an open license for enterprise use. The family comes in three sizes -- Nano (8B), Super (49B), and Ultra (253B) -- and performs competitively with state-of-the-art reasoning models such as DeepSeek-R1 while offering superior inference throughput and memory efficiency. In this report, we discuss the training procedure for these models, which entails using neural architecture search from Llama 3 models for accelerated inference, knowledge distillation, and continued pretraining, followed by a reasoning-focused post-training stage consisting of two main parts: supervised fine-tuning and large scale reinforcement learning. Llama-Nemotron models are the first open-source models to support a dynamic reasoning toggle, allowing users to switch between standard chat and reasoning modes during inference. To further support open research and facilitate model development, we provide the following resources: 1. We release the Llama-Nemotron reasoning models -- LN-Nano, LN-Super, and LN-Ultra -- under the commercially permissive NVIDIA Open Model License Agreement. 2. We release the complete post-training dataset: Llama-Nemotron-Post-Training-Dataset. 3. We also release our training codebases: NeMo, NeMo-Aligner, and Megatron-LM.

[664] arXiv:2505.02093 (replaced) [pdf, html, other]
Title: A Deep Learning-Aided Approach for Estimating Field Permeability Map by Fusing Well Logs, Well Tests, and Seismic Data
Grigoriy Shutov, Viktor Duplyakov, Shadfar Davoodi, Anton Morozov, Dmitriy Popkov, Kirill Pavlenko, Albert Vainshtein, Viktor Kotezhekov, Sergey Kaygorodov, Boris Belozerov, Mars M Khasanov, Vladimir Vanovskiy, Andrei Osiptsov, Evgeny Burnaev
Subjects: Computational Engineering, Finance, and Science (cs.CE)

Obtaining reliable permeability maps of oil reservoirs is crucial for building a robust and accurate reservoir simulation model and, therefore, designing effective recovery strategies. This problem, however, remains challenging, as it requires the integration of various data sources by experts from different disciplines. Moreover, there are no sources to provide direct information about the inter-well space. In this work, a new method based on the data-fusion approach is proposed for predicting two-dimensional permeability maps on the whole reservoir area. This method utilizes non-parametric regression with a custom kernel shape accounting for different data sources: well logs, well tests, and seismics. A convolutional neural network is developed to process seismic data and then incorporate it with other sources. A multi-stage data fusion procedure helps to artificially increase the training dataset for the seismic interpretation model and finally to construct the adequate permeability map. The proposed methodology of permeability map construction from different sources was tested on a real oil reservoir located in Western Siberia. The results demonstrate that the developed map perfectly corresponds to the permeability estimations in the wells, and the inter-well space permeability predictions are considerably improved through the incorporation of the seismic data.

[665] arXiv:2505.02952 (replaced) [pdf, other]
Title: Iterative Resolution of Prompt Ambiguities Using a Progressive Cutting-Search Approach
Fabrizio Marozzo
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Generative AI systems have revolutionized human interaction by enabling natural language-based coding and problem solving. However, the inherent ambiguity of natural language often leads to imprecise instructions, forcing users to iteratively test, correct, and resubmit their prompts. We propose an iterative approach that systematically narrows down these ambiguities through a structured series of clarification questions and alternative solution proposals, illustrated with input/output examples as well. Once every uncertainty is resolved, a final, precise solution is generated. Evaluated on a diverse dataset spanning coding, data analysis, and creative writing, our method demonstrates superior accuracy, competitive resolution times, and higher user satisfaction compared to conventional one-shot solutions, which typically require multiple manual iterations to achieve a correct output.

[666] arXiv:2505.03529 (replaced) [pdf, html, other]
Title: SKALD: Scalable K-Anonymisation for Large Datasets
Kailash Reddy, Novoneel Chakraborty, Amogh Dharmavaram, Anshoo Tandon
Comments: 7 pages, 3 figures, 3 tables
Subjects: Information Theory (cs.IT); Cryptography and Security (cs.CR)

Data privacy and anonymisation are critical concerns in today's data-driven society, particularly when handling personal and sensitive user data. Regulatory frameworks worldwide recommend privacy-preserving protocols such as k-anonymisation to de-identify releases of tabular data. Available hardware resources provide an upper bound on the maximum size of dataset that can be processed at a time. Large datasets with sizes exceeding this upper bound must be broken up into smaller data chunks for processing. In these cases, standard k-anonymisation tools such as ARX can only operate on a per-chunk basis. This paper proposes SKALD, a novel algorithm for performing k-anonymisation on large datasets with limited RAM. Our SKALD algorithm offers multi-fold performance improvement over standard k-anonymisation methods by extracting and combining sufficient statistics from each chunk during processing to ensure successful k-anonymisation while providing better utility.

[667] arXiv:2505.04203 (replaced) [pdf, html, other]
Title: ELGAR: Expressive Cello Performance Motion Generation for Audio Rendition
Zhiping Qiu, Yitong Jin, Yuan Wang, Yi Shi, Chongwu Wang, Chao Tan, Xiaobing Li, Feng Yu, Tao Yu, Qionghai Dai
Journal-ref: SIGGRAPH 2025
Subjects: Graphics (cs.GR); Sound (cs.SD); Audio and Speech Processing (eess.AS)

The art of instrument performance stands as a vivid manifestation of human creativity and emotion. Nonetheless, generating instrument performance motions is a highly challenging task, as it requires not only capturing intricate movements but also reconstructing the complex dynamics of the performer-instrument interaction. While existing works primarily focus on modeling partial body motions, we propose Expressive ceLlo performance motion Generation for Audio Rendition (ELGAR), a state-of-the-art diffusion-based framework for whole-body fine-grained instrument performance motion generation solely from audio. To emphasize the interactive nature of the instrument performance, we introduce Hand Interactive Contact Loss (HICL) and Bow Interactive Contact Loss (BICL), which effectively guarantee the authenticity of the interplay. Moreover, to better evaluate whether the generated motions align with the semantic context of the music audio, we design novel metrics specifically for string instrument performance motion generation, including finger-contact distance, bow-string distance, and bowing score. Extensive evaluations and ablation studies are conducted to validate the efficacy of the proposed methods. In addition, we put forward a motion generation dataset SPD-GEN, collated and normalized from the MoCap dataset SPD. As demonstrated, ELGAR has shown great potential in generating instrument performance motions with complicated and fast interactions, which will promote further development in areas such as animation, music education, interactive art creation, etc.

[668] arXiv:2505.05469 (replaced) [pdf, html, other]
Title: Generating Physically Stable and Buildable Brick Structures from Text
Ava Pun, Kangle Deng, Ruixuan Liu, Deva Ramanan, Changliu Liu, Jun-Yan Zhu
Comments: Project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

We introduce BrickGPT, the first approach for generating physically stable interconnecting brick assembly models from text prompts. To achieve this, we construct a large-scale, physically stable dataset of brick structures, along with their associated captions, and train an autoregressive large language model to predict the next brick to add via next-token prediction. To improve the stability of the resulting designs, we employ an efficient validity check and physics-aware rollback during autoregressive inference, which prunes infeasible token predictions using physics laws and assembly constraints. Our experiments show that BrickGPT produces stable, diverse, and aesthetically pleasing brick structures that align closely with the input text prompts. We also develop a text-based brick texturing method to generate colored and textured designs. We show that our designs can be assembled manually by humans and automatically by robotic arms. We release our new dataset, StableText2Brick, containing over 47,000 brick structures of over 28,000 unique 3D objects accompanied by detailed captions, along with our code and models at the project website: this https URL.

[669] arXiv:2505.06043 (replaced) [pdf, html, other]
Title: Triangular preconditioners for double saddle point linear systems arising in the mixed form of poroelasticity equations
Luca Bergamaschi, Massimiliano Ferronato, Angeles Martinez
Subjects: Numerical Analysis (math.NA)

In this paper, we study a class of inexact block triangular preconditioners for double saddle-point symmetric linear systems arising from the mixed finite element and mixed hybrid finite element discretization of Biot's poroelasticity equations. We develop a spectral analysis of the preconditioned matrix, showing that the complex eigenvalues lie in a circle of center $(1,0)$ and radius smaller than 1. In contrast, the real eigenvalues are described in terms of the roots of a third-degree polynomial with real coefficients. The results of numerical experiments are reported to show the quality of the theoretical bounds and illustrate the efficiency of the proposed preconditioners used with GMRES, especially in comparison with similar block diagonal preconditioning strategies along with the MINRES iteration.

[670] arXiv:2505.06096 (replaced) [pdf, html, other]
Title: Free and Fair Hardware: A Pathway to Copyright Infringement-Free Verilog Generation using LLMs
Sam Bush, Matthew DeLorenzo, Phat Tieu, Jeyavijayan Rajendran
Comments: Accepted at DAC 2025
Subjects: Artificial Intelligence (cs.AI)

Limitations in Large Language Model (LLM) capabilities for hardware design tasks, such as generating functional Verilog codes, have motivated various fine-tuning optimizations utilizing curated hardware datasets from open-source repositories. However, these datasets remain limited in size and contain minimal checks on licensing for reuse, resulting in potential copyright violations by fine-tuned LLMs. Therefore, we propose an evaluation benchmark to estimate the risk of Verilog-trained LLMs to generate copyright-protected codes. To minimize this risk, we present an open-source Verilog dataset, FreeSet, containing over 220k files, along with the automated dataset curation framework utilized to provide additional guarantees of fair-use Verilog data. We then execute an LLM fine-tuning framework consisting of continual pre-training, resulting in a fine-tuned Llama model for Verilog, FreeV. Our results indicate that FreeV demonstrates the smallest risk of copyright-infringement among prior works, with only a 3% violation rate. Furthermore, experimental results demonstrate improvements in Verilog generation functionality over its baseline model, improving VerilogEval pass@10 rates by over 10%.

[671] arXiv:2505.08000 (replaced) [pdf, html, other]
Title: Limitations to Computing Quadratic Functions on Reed-Solomon Encoded Data
Keller Blackwell, Mary Wootters
Subjects: Information Theory (cs.IT)

We study the problem of low-bandwidth non-linear computation on Reed-Solomon encoded data. Given an $[n,k]$ Reed-Solomon encoding of a message vector $\vec{f} \in \mathbb{F}_q^k$, and a polynomial $g \in \mathbb{F}_q[X_1, X_2, \ldots, X_k]$, a user wishing to evaluate $g(\vec{f})$ is given local query access to each codeword symbol. The query response is allowed to be the output of an arbitrary function evaluated locally on the codeword symbol, and the user's aim is to minimize the total information downloaded in order to compute $g(\vec{f})$.
We show that when $k=2$ and $q = p^e$ for prime $p > 2$, then any scheme that evaluates the quadratic monomial $g(X_1, X_2) := X_1 X_2$ must download at least $2 \log_2(q-1) - 3$ bits of information; compare this with the na\"ıve scheme of Reed-Solomon interpolation which recovers $\vec{f}$ in its entirety, which downloads $2 \log_2(q) $ bits. Our result shows that dimension-2 Reed-Solomon codes do not admit any meaningful low-bandwidth scheme for the evaluation of quadratic functions over the encoded data. This contrasts sharply with prior work for low-bandwidth evaluation of \emph{linear} functions $g(\vec{f})$ over Reed-Solomon encoded data, for which it is possible to substantially improve upon the na\"ıve bound of $k \log_2(q) $.

[672] arXiv:2505.12147 (replaced) [pdf, other]
Title: Causal Machine Learning in IoT-based Engineering Problems: A Tool Comparison in the Case of Household Energy Consumption
Nikolaos-Lysias Kosioris, Sotirios Nikoletseas, Gavrilis Filios, Stefanos Panagiotou
Subjects: Machine Learning (cs.LG)

The rapid increase in computing power and the ability to store Big Data in the infrastructure has enabled predictions in a large variety of domains by Machine Learning. However, in many cases, existing Machine Learning tools are considered insufficient or incorrect since they exploit only probabilistic dependencies rather than inference logic. Causal Machine Learning methods seem to close this gap. In this paper, two prevalent tools based on Causal Machine Learning methods are compared, as well as their mathematical underpinning background. The operation of the tools is demonstrated by examining their response to 18 queries, based on the IDEAL Household Energy Dataset, published by the University of Edinburgh. First, it was important to evaluate the causal relations assumption that allowed the use of this approach; this was based on the preexisting scientific knowledge of the domain and was implemented by use of the in-built validation tools. Results were encouraging and may easily be extended to other domains.

[673] arXiv:2505.12514 (replaced) [pdf, html, other]
Title: Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought
Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, Yuandong Tian
Comments: 26 pages, 7 figures
Subjects: Machine Learning (cs.LG)

Large Language Models (LLMs) have demonstrated remarkable performance in many applications, including challenging reasoning problems via chain-of-thoughts (CoTs) techniques that generate ``thinking tokens'' before answering the questions. While existing theoretical works demonstrate that CoTs with discrete tokens boost the capability of LLMs, recent work on continuous CoTs lacks a theoretical understanding of why it outperforms discrete counterparts in various reasoning tasks such as directed graph reachability, a fundamental graph reasoning problem that includes many practical domain applications as special cases. In this paper, we prove that a two-layer transformer with $D$ steps of continuous CoTs can solve the directed graph reachability problem, where $D$ is the diameter of the graph, while the best known result of constant-depth transformers with discrete CoTs requires $O(n^2)$ decoding steps where $n$ is the number of vertices ($D<n$). In our construction, each continuous thought vector is a superposition state that encodes multiple search frontiers simultaneously (i.e., parallel breadth-first search (BFS)), while discrete CoTs must choose a single path sampled from the superposition state, which leads to sequential search that requires many more steps and may be trapped into local solutions. We also performed extensive experiments to verify that our theoretical construction aligns well with the empirical solution obtained via training dynamics. Notably, encoding of multiple search frontiers as a superposition state automatically emerges in training continuous CoTs, without explicit supervision to guide the model to explore multiple paths simultaneously.

[674] arXiv:2505.16211 (replaced) [pdf, html, other]
Title: AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models
Kai Li, Can Shen, Yile Liu, Jirui Han, Kelong Zheng, Xuechao Zou, Zhe Wang, Xingjian Du, Shun Zhang, Hanjun Luo, Yingbin Jin, Xinxin Xing, Ziyang Ma, Yue Liu, Xiaojun Jia, Yifan Zhang, Junfeng Fang, Kun Wang, Yibo Yan, Haoyang Li, Yiming Li, Xiaobin Zhuang, Yang Liu, Haibo Hu, Zhizheng Wu, Xiaolin Hu, Eng-Siong Chng, XiaoFeng Wang, Wenyuan Xu, Wei Dong, Xinfeng Li
Comments: Technical Report
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

The rapid advancement and expanding applications of Audio Large Language Models (ALLMs) demand a rigorous understanding of their trustworthiness. However, systematic research on evaluating these models, particularly concerning risks unique to the audio modality, remains largely unexplored. Existing evaluation frameworks primarily focus on the text modality or address only a restricted set of safety dimensions, failing to adequately account for the unique characteristics and application scenarios inherent to the audio modality. We introduce AudioTrust-the first multifaceted trustworthiness evaluation framework and benchmark specifically designed for ALLMs. AudioTrust facilitates assessments across six key dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. To comprehensively evaluate these dimensions, AudioTrust is structured around 18 distinct experimental setups. Its core is a meticulously constructed dataset of over 4,420 audio/text samples, drawn from real-world scenarios (e.g., daily conversations, emergency calls, voice assistant interactions), specifically designed to probe the multifaceted trustworthiness of ALLMs. For assessment, the benchmark carefully designs 9 audio-specific evaluation metrics, and we employ a large-scale automated pipeline for objective and scalable scoring of model outputs. Experimental results reveal the trustworthiness boundaries and limitations of current state-of-the-art open-source and closed-source ALLMs when confronted with various high-risk audio scenarios, offering valuable insights for the secure and trustworthy deployment of future audio models. Our platform and benchmark are available at this https URL.

[675] arXiv:2505.16459 (replaced) [pdf, other]
Title: MMMR: Benchmarking Massive Multi-Modal Reasoning Tasks
Guiyao Tie, Xueyang Zhou, Tianhe Gu, Ruihang Zhang, Chaoran Hu, Sizhe Zhang, Mengqu Sun, Yan Zhang, Pan Zhou, Lichao Sun
Comments: 39 pages, 28 figures, 4 tables
Subjects: Artificial Intelligence (cs.AI)

Recent advances in Multi-Modal Large Language Models (MLLMs) have enabled unified processing of language, vision, and structured inputs, opening the door to complex tasks such as logical deduction, spatial reasoning, and scientific analysis. Despite their promise, the reasoning capabilities of MLLMs, particularly those augmented with intermediate thinking traces (MLLMs-T), remain poorly understood and lack standardized evaluation benchmarks. Existing work focuses primarily on perception or final answer correctness, offering limited insight into how models reason or fail across modalities. To address this gap, we introduce the MMMR, a new benchmark designed to rigorously evaluate multi-modal reasoning with explicit thinking. The MMMR comprises 1) a high-difficulty dataset of 1,083 questions spanning six diverse reasoning types with symbolic depth and multi-hop demands and 2) a modular Reasoning Trace Evaluation Pipeline (RTEP) for assessing reasoning quality beyond accuracy through metrics like relevance, consistency, and structured error annotations. Empirical results show that MLLMs-T overall outperform non-thinking counterparts, but even top models like Claude-3.7-Sonnet and Gemini-2.5 Pro suffer from reasoning pathologies such as inconsistency and overthinking. This benchmark reveals persistent gaps between accuracy and reasoning quality and provides an actionable evaluation pipeline for future model development. Overall, the MMMR offers a scalable foundation for evaluating, comparing, and improving the next generation of multi-modal reasoning systems.

[676] arXiv:2505.16722 (replaced) [pdf, html, other]
Title: Breaking mBad! Supervised Fine-tuning for Cross-Lingual Detoxification
Himanshu Beniwal, Youngwoo Kim, Maarten Sap, Soham Dan, Thomas Hartvigsen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

As large language models (LLMs) become increasingly prevalent in global applications, ensuring that they are toxicity-free across diverse linguistic contexts remains a critical challenge. We explore "Cross-lingual Detoxification", a cross-lingual paradigm that mitigates toxicity, enabling detoxification capabilities to transfer between high and low-resource languages across different script families. We analyze cross-lingual detoxification's effectiveness through 392 extensive settings to evaluate toxicity reduction in cross-distribution settings with limited data and investigate how mitigation impacts model performance on non-toxic tasks, revealing trade-offs between safety and knowledge preservation. Our code and dataset are publicly available at this https URL.

[677] arXiv:2505.17080 (replaced) [pdf, html, other]
Title: Not Minds, but Signs: Reframing LLMs through Semiotics
Davide Picca
Subjects: Computation and Language (cs.CL)

This paper challenges the prevailing tendency to frame Large Language Models (LLMs) as cognitive systems, arguing instead for a semiotic perspective that situates these models within the broader dynamics of sign manipulation and meaning-making. Rather than assuming that LLMs understand language or simulate human thought, we propose that their primary function is to recombine, recontextualize, and circulate linguistic forms based on probabilistic associations. By shifting from a cognitivist to a semiotic framework, we avoid anthropomorphism and gain a more precise understanding of how LLMs participate in cultural processes, not by thinking, but by generating texts that invite interpretation. Through theoretical analysis and practical examples, the paper demonstrates how LLMs function as semiotic agents whose outputs can be treated as interpretive acts, open to contextual negotiation and critical reflection. We explore applications in literature, philosophy, education, and cultural production, emphasizing how LLMs can serve as tools for creativity, dialogue, and critical inquiry. The semiotic paradigm foregrounds the situated, contingent, and socially embedded nature of meaning, offering a more rigorous and ethically aware framework for studying and using LLMs. Ultimately, this approach reframes LLMs as technological participants in an ongoing ecology of signs. They do not possess minds, but they alter how we read, write, and make meaning, compelling us to reconsider the foundations of language, interpretation, and the role of artificial systems in the production of knowledge.

[678] arXiv:2505.17117 (replaced) [pdf, html, other]
Title: From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning
Chen Shani, Dan Jurafsky, Yann LeCun, Ravid Shwartz-Ziv
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT)

Humans organize knowledge into compact categories through semantic compression by mapping diverse instances to abstract representations while preserving meaning (e.g., robin and blue jay are both birds; most birds can fly). These concepts reflect a trade-off between expressive fidelity and representational simplicity. Large Language Models (LLMs) demonstrate remarkable linguistic abilities, yet whether their internal representations strike a human-like trade-off between compression and semantic fidelity is unclear. We introduce a novel information-theoretic framework, drawing from Rate-Distortion Theory and the Information Bottleneck principle, to quantitatively compare these strategies. Analyzing token embeddings from a diverse suite of LLMs against seminal human categorization benchmarks, we uncover key divergences. While LLMs form broad conceptual categories that align with human judgment, they struggle to capture the fine-grained semantic distinctions crucial for human understanding. More fundamentally, LLMs demonstrate a strong bias towards aggressive statistical compression, whereas human conceptual systems appear to prioritize adaptive nuance and contextual richness, even if this results in lower compressional efficiency by our measures. These findings illuminate critical differences between current AI and human cognitive architectures, guiding pathways toward LLMs with more human-aligned conceptual representations.

[679] arXiv:2505.18232 (replaced) [pdf, html, other]
Title: Two-Stage Regularization-Based Structured Pruning for LLMs
Mingkuan Feng, Jinyang Wu, Siyuan Liu, Shuai Zhang, Ruihan Jin, Feihu Che, Pengpeng Shao, Zhengqi Wen, Jianhua Tao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

The deployment of large language models (LLMs) is largely hindered by their large number of parameters. Structural pruning has emerged as a promising solution. Prior structured pruning methods directly remove unimportant parameters based on certain metrics, which often causes knowledge loss and necessitates extensive retraining. To overcome this, we introduce a novel pruning method TRSP: Two-Stage Regularization-Based Structured Pruning for LLMs. Specifically, we multiply the output of each transformer layer by an initial learnable weight and iteratively learn these weights by adding their $\ell_1$-norm as a regularization term to the loss function, serving as the first-stage regularization. Subsequently, we apply additional regularization to the difference between the output and input of layers with smaller weights, encouraging the shift of knowledge to the preserved layers. This serves as the second-stage regularization. TRSP retains more knowledge and better preserves model performance than direct parameter elimination. Through extensive experimentation we show that TRSP outperforms strong layer-wise structured pruning methods without requiring retraining. As a layer-wise pruning method, it delivers notable end-to-end acceleration, making it a promising solution for efficient LLM deployment.

[680] arXiv:2505.19955 (replaced) [pdf, html, other]
Title: MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research
Hui Chen, Miao Xiong, Yujie Lu, Wei Han, Ailin Deng, Yufei He, Jiaying Wu, Yibo Li, Yue Liu, Bryan Hooi
Comments: 42 pages, 9 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Recent advancements in AI agents have demonstrated their growing potential to drive and support scientific discovery. In this work, we introduce MLR-Bench, a comprehensive benchmark for evaluating AI agents on open-ended machine learning research. MLR-Bench includes three key components: (1) 201 research tasks sourced from NeurIPS, ICLR, and ICML workshops covering diverse ML topics; (2) MLR-Judge, an automated evaluation framework combining LLM-based reviewers with carefully designed review rubrics to assess research quality; and (3) MLR-Agent, a modular agent scaffold capable of completing research tasks through four stages: idea generation, proposal formulation, experimentation, and paper writing. Our framework supports both stepwise assessment across these distinct research stages, and end-to-end evaluation of the final research paper. We then use MLR-Bench to evaluate six frontier LLMs and an advanced coding agent, finding that while LLMs are effective at generating coherent ideas and well-structured papers, current coding agents frequently (e.g., in 80% of the cases) produce fabricated or invalidated experimental results--posing a major barrier to scientific reliability. We validate MLR-Judge through human evaluation, showing high agreement with expert reviewers, supporting its potential as a scalable tool for research evaluation. We open-source MLR-Bench to help the community benchmark, diagnose, and improve AI research agents toward trustworthy and transparent scientific discovery.

[681] arXiv:2505.20094 (replaced) [pdf, html, other]
Title: SwarmThinkers: Learning Physically Consistent Atomic KMC Transitions at Scale
Qi Li, Kun Li, Haozhi Han, Honghui Shang, Xinfu He, Yunquan Zhang, Hong An, Ting Cao, Mao Yang
Subjects: Artificial Intelligence (cs.AI)

Can a scientific simulation system be physically consistent, interpretable by design, and scalable across regimes--all at once? Despite decades of progress, this trifecta remains elusive. Classical methods like Kinetic Monte Carlo ensure thermodynamic accuracy but scale poorly; learning-based methods offer efficiency but often sacrifice physical consistency and interpretability. We present SwarmThinkers, a reinforcement learning framework that recasts atomic-scale simulation as a physically grounded swarm intelligence system. Each diffusing particle is modeled as a local decision-making agent that selects transitions via a shared policy network trained under thermodynamic constraints. A reweighting mechanism fuses learned preferences with transition rates, preserving statistical fidelity while enabling interpretable, step-wise decision making. Training follows a centralized-training, decentralized-execution paradigm, allowing the policy to generalize across system sizes, concentrations, and temperatures without retraining. On a benchmark simulating radiation-induced Fe-Cu alloy precipitation, SwarmThinkers is the first system to achieve full-scale, physically consistent simulation on a single A100 GPU, previously attainable only via OpenKMC on a supercomputer. It delivers up to 4963x (3185x on average) faster computation with 485x lower memory usage. By treating particles as decision-makers, not passive samplers, SwarmThinkers marks a paradigm shift in scientific simulation--one that unifies physical consistency, interpretability, and scalability through agent-driven intelligence.

[682] arXiv:2505.20170 (replaced) [pdf, other]
Title: Program of Equations Thoughts to Solve Algebra Word Problems
Yunze Lin
Comments: Withdrawn pending institutional authorization and core revisions to address methodological inconsistencies in Sections 3-4
Subjects: Artificial Intelligence (cs.AI)

Solving algebraic word problems (AWPs) has recently emerged as an important natural language processing task. Recently, large language models (LLMs) have demonstrated powerful mathematical capabilities, and the Chain-of-Thought technique, which guides LLMs through step-by-step reasoning, has yielded impressive results. However, this reasoning ability is limited by the computational weaknesses of LLMs themselves, where calculation errors can accumulate, leading to incorrect final answers. To address this, we propose Program of Equations Thoughts (POET), which transforms the task of generating step-by-step reasoning answers into a two-stage task of predicting equations and generating code, offloading complex computations to a Python interpreter to avoid calculation errors in LLMs. Furthermore, we propose Zero-shot POET, which utilizes a manually designed template to enable LLMs to directly generate Python code for one-step solving. Our method achieves accuracies of 95.3% and 98.0% on the PEN and ALG514 datasets, respectively, setting a new state-of-the-art (SOTA). Zero-shot POET also achieves the SOTA result of 95.5% on the DRAW-1K dataset.

[683] arXiv:2505.20485 (replaced) [pdf, html, other]
Title: Avoid Forgetting by Preserving Global Knowledge Gradients in Federated Learning with Non-IID Data
Abhijit Chunduru, Majid Morafah, Mahdi Morafah, Vishnu Pandi Chellapandi, Ang Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)

The inevitable presence of data heterogeneity has made federated learning very challenging. There are numerous methods to deal with this issue, such as local regularization, better model fusion techniques, and data sharing. Though effective, they lack a deep understanding of how data heterogeneity can affect the global decision boundary. In this paper, we bridge this gap by performing an experimental analysis of the learned decision boundary using a toy example. Our observations are surprising: (1) we find that the existing methods suffer from forgetting and clients forget the global decision boundary and only learn the perfect local one, and (2) this happens regardless of the initial weights, and clients forget the global decision boundary even starting from pre-trained optimal weights. In this paper, we present FedProj, a federated learning framework that robustly learns the global decision boundary and avoids its forgetting during local training. To achieve better ensemble knowledge fusion, we design a novel server-side ensemble knowledge transfer loss to further calibrate the learned global decision boundary. To alleviate the issue of learned global decision boundary forgetting, we further propose leveraging an episodic memory of average ensemble logits on a public unlabeled dataset to regulate the gradient updates at each step of local training. Experimental results demonstrate that FedProj outperforms state-of-the-art methods by a large margin.

[684] arXiv:2505.21725 (replaced) [pdf, html, other]
Title: Lazarus Group Targets Crypto-Wallets and Financial Data while employing new Tradecrafts
Alessio Di Santo
Subjects: Cryptography and Security (cs.CR); Operating Systems (cs.OS)

This report presents a comprehensive analysis of a malicious software sample, detailing its architecture, behavioral characteristics, and underlying intent. Through static and dynamic examination, the malware core functionalities, including persistence mechanisms, command-and-control communication, and data exfiltration routines, are identified and its supporting infrastructure is mapped. By correlating observed indicators of compromise with known techniques, tactics, and procedures, this analysis situates the sample within the broader context of contemporary threat campaigns and infers the capabilities and motivations of its likely threat actor.
Building on these findings, actionable threat intelligence is provided to support proactive defenses. Threat hunting teams receive precise detection hypotheses for uncovering latent adversarial presence, while monitoring systems can refine alert logic to detect anomalous activity in real time. Finally, the report discusses how this structured intelligence enhances predictive risk assessments, informs vulnerability prioritization, and strengthens organizational resilience against advanced persistent threats. By integrating detailed technical insights with strategic threat landscape mapping, this malware analysis report not only reconstructs past adversary actions but also establishes a robust foundation for anticipating and mitigating future attacks.

[685] arXiv:2505.22663 (replaced) [pdf, html, other]
Title: Training Free Stylized Abstraction
Aimon Rahman, Kartik Narayan, Vishal M. Patel
Comments: Project Page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Stylized abstraction synthesizes visually exaggerated yet semantically faithful representations of subjects, balancing recognizability with perceptual distortion. Unlike image-to-image translation, which prioritizes structural fidelity, stylized abstraction demands selective retention of identity cues while embracing stylistic divergence, especially challenging for out-of-distribution individuals. We propose a training-free framework that generates stylized abstractions from a single image using inference-time scaling in vision-language models (VLLMs) to extract identity-relevant features, and a novel cross-domain rectified flow inversion strategy that reconstructs structure based on style-dependent priors. Our method adapts structural restoration dynamically through style-aware temporal scheduling, enabling high-fidelity reconstructions that honor both subject and style. It supports multi-round abstraction-aware generation without fine-tuning. To evaluate this task, we introduce StyleBench, a GPT-based human-aligned metric suited for abstract styles where pixel-level similarity fails. Experiments across diverse abstraction (e.g., LEGO, knitted dolls, South Park) show strong generalization to unseen identities and styles in a fully open-source setup.

[686] arXiv:2505.23153 (replaced) [pdf, html, other]
Title: Conceptual Framework Toward Embodied Collective Adaptive Intelligence
Fan Wang, Shaoshan Liu
Subjects: Artificial Intelligence (cs.AI)

Collective Adaptive Intelligence (CAI) represent a transformative approach in embodied AI, wherein numerous autonomous agents collaborate, adapt, and self-organize to navigate complex, dynamic environments. By enabling systems to reconfigure themselves in response to unforeseen challenges, CAI facilitate robust performance in real-world scenarios. This article introduces a conceptual framework for designing and analyzing CAI. It delineates key attributes including task generalization, resilience, scalability, and self-assembly, aiming to bridge theoretical foundations with practical methodologies for engineering adaptive, emergent intelligence. By providing a structured foundation for understanding and implementing CAI, this work seeks to guide researchers and practitioners in developing more resilient, scalable, and adaptable AI systems across various domains.

[687] arXiv:2505.23452 (replaced) [pdf, html, other]
Title: What About Emotions? Guiding Fine-Grained Emotion Extraction from Mobile App Reviews
Quim Motger, Marc Oriol, Max Tiessler, Xavier Franch, Jordi Marco
Comments: Accepted at the 33rd IEEE International Requirements Engineering 2025 conference
Subjects: Information Retrieval (cs.IR); Software Engineering (cs.SE)

Opinion mining plays a vital role in analysing user feedback and extracting insights from textual data. While most research focuses on sentiment polarity (e.g., positive, negative, neutral), fine-grained emotion classification in app reviews remains underexplored. Fine-grained emotion classification is thus needed to better understand users' affective responses and support downstream tasks such as feature-emotion analysis, user-oriented release planning, and issue triaging. This paper addresses this gap by identifying and addressing the challenges and limitations in fine-grained emotion analysis in the context of app reviews. Our study adapts Plutchik's emotion taxonomy to app reviews by developing a structured annotation framework and dataset. Through an iterative human annotation process, we define clear annotation guidelines and document key challenges in emotion classification. Additionally, we evaluate the feasibility of automating emotion annotation using large language models, assessing their cost-effectiveness and agreement with human-labelled data. Our findings reveal that while large language models significantly reduce manual effort and maintain substantial agreement with human annotators, full automation remains challenging due to the complexity of emotional interpretation. This work contributes to opinion mining in requirements engineering by providing structured guidelines, an annotated dataset, and insights for developing automated pipelines to capture the complexity of emotions in app reviews.

[688] arXiv:2505.24625 (replaced) [pdf, html, other]
Title: Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors
Duo Zheng, Shijia Huang, Yanyang Li, Liwei Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Previous research has investigated the application of Multimodal Large Language Models (MLLMs) in understanding 3D scenes by interpreting them as videos. These approaches generally depend on comprehensive 3D data inputs, such as point clouds or reconstructed Bird's-Eye View (BEV) maps. In our research, we advance this field by enhancing the capability of MLLMs to understand and reason in 3D spaces directly from video data, without the need for additional 3D input. We propose a novel and efficient method, the Video-3D Geometry Large Language Model (VG LLM). Our approach employs a 3D visual geometry encoder that extracts 3D prior information from video sequences. This information is integrated with visual tokens and fed into the MLLM. Extensive experiments have shown that our method has achieved substantial improvements in various tasks related to 3D scene understanding and spatial reasoning, all directly learned from video sources. Impressively, our 4B model, which does not rely on explicit 3D data inputs, achieves competitive results compared to existing state-of-the-art methods, and even surpasses the Gemini-1.5-Pro in the VSI-Bench evaluations.

[689] arXiv:2505.24778 (replaced) [pdf, html, other]
Title: Revisiting Epistemic Markers in Confidence Estimation: Can Markers Accurately Reflect Large Language Models' Uncertainty?
Jiayu Liu, Qing Zong, Weiqi Wang, Yangqiu Song
Comments: ACL2025 Main
Subjects: Computation and Language (cs.CL)

As large language models (LLMs) are increasingly used in high-stakes domains, accurately assessing their confidence is crucial. Humans typically express confidence through epistemic markers (e.g., "fairly confident") instead of numerical values. However, it remains unclear whether LLMs consistently use these markers to reflect their intrinsic confidence due to the difficulty of quantifying uncertainty associated with various markers. To address this gap, we first define marker confidence as the observed accuracy when a model employs an epistemic marker. We evaluate its stability across multiple question-answering datasets in both in-distribution and out-of-distribution settings for open-source and proprietary LLMs. Our results show that while markers generalize well within the same distribution, their confidence is inconsistent in out-of-distribution scenarios. These findings raise significant concerns about the reliability of epistemic markers for confidence estimation, underscoring the need for improved alignment between marker based confidence and actual model uncertainty. Our code is available at this https URL.

[690] arXiv:2505.24852 (replaced) [pdf, html, other]
Title: Chameleon: A MatMul-Free Temporal Convolutional Network Accelerator for End-to-End Few-Shot and Continual Learning from Sequential Data
Douwe den Blanken, Charlotte Frenkel
Comments: 14 pages, 7 figures; added FSL power consumption measurements at 100 kHz clock speed, fixed typos
Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG)

On-device learning at the edge enables low-latency, private personalization with improved long-term robustness and reduced maintenance costs. Yet, achieving scalable, low-power end-to-end on-chip learning, especially from real-world sequential data with a limited number of examples, is an open challenge. Indeed, accelerators supporting error backpropagation optimize for learning performance at the expense of inference efficiency, while simplified learning algorithms often fail to reach acceptable accuracy targets. In this work, we present Chameleon, leveraging three key contributions to solve these challenges. (i) A unified learning and inference architecture supports few-shot learning (FSL), continual learning (CL) and inference at only 0.5% area overhead to the inference logic. (ii) Long temporal dependencies are efficiently captured with temporal convolutional networks (TCNs), enabling the first demonstration of end-to-end on-chip FSL and CL on sequential data and inference on 16-kHz raw audio. (iii) A dual-mode, matrix-multiplication-free compute array allows either matching the power consumption of state-of-the-art inference-only keyword spotting (KWS) accelerators or enabling $4.3\times$ higher peak GOPS. Fabricated in 40-nm CMOS, Chameleon sets new accuracy records on Omniglot for end-to-end on-chip FSL (96.8%, 5-way 1-shot, 98.8%, 5-way 5-shot) and CL (82.2% final accuracy for learning 250 classes with 10 shots), while maintaining an inference accuracy of 93.3% on the 12-class Google Speech Commands dataset at an extreme-edge power budget of 3.1 $\mu$W.

[691] arXiv:2506.01208 (replaced) [pdf, html, other]
Title: Multiresolution Analysis and Statistical Thresholding on Dynamic Networks
Raphaël Romero, Tijl De Bie, Nick Heard, Alexander Modell
Subjects: Machine Learning (cs.LG)

Detecting structural change in dynamic network data has wide-ranging applications. Existing approaches typically divide the data into time bins, extract network features within each bin, and then compare these features over time. This introduces an inherent tradeoff between temporal resolution and the statistical stability of the extracted features. Despite this tradeoff, reminiscent of time-frequency tradeoffs in signal processing, most methods rely on a fixed temporal resolution. Choosing an appropriate resolution parameter is typically difficult and can be especially problematic in domains like cybersecurity, where anomalous behavior may emerge at multiple time scales. We address this challenge by proposing ANIE (Adaptive Network Intensity Estimation), a multi-resolution framework designed to automatically identify the time scales at which network structure evolves, enabling the joint detection of both rapid and gradual changes. Modeling interactions as Poisson processes, our method proceeds in two steps: (1) estimating a low-dimensional subspace of node behavior, and (2) deriving a set of novel empirical affinity coefficients that quantify change in interaction intensity between latent factors and support statistical testing for structural change across time scales. We provide theoretical guarantees for subspace estimation and the asymptotic behavior of the affinity coefficients, enabling model-based change detection. Experiments on synthetic networks show that ANIE adapts to the appropriate time resolution and is able to capture sharp structural changes while remaining robust to noise. Furthermore, applications to real-world data showcase the practical benefits of ANIE's multiresolution approach to detecting structural change over fixed resolution methods.

[692] arXiv:2506.01325 (replaced) [pdf, html, other]
Title: Understanding the Identity-Transformation Approach in OIDC-Compatible Privacy-Preserving SSO Services
Jingqiang Lin, Baitao Zhang, Wei Wang, Quanwei Cai, Jiwu Jing, Huiyang He
Subjects: Cryptography and Security (cs.CR)

OpenID Connect (OIDC) enables a user with commercial-off-the-shelf browsers to log into multiple websites, called relying parties (RPs), by her username and credential set up in another trusted web system, called the identity provider (IdP). Identity transformations are proposed in UppreSSO to provide OIDC-compatible SSO services, preventing both IdP-based login tracing and RP-based identity linkage. While security and privacy of SSO services in UppreSSO have been proved, several essential issues of this identity-transformation approach are not well studied. In this paper, we comprehensively investigate the approach as below. Firstly, several suggestions for the efficient integration of identity transformations in OIDC-compatible SSO are explained. Then, we uncover the relationship between identity-transformations in SSO and oblivious pseudo-random functions (OPRFs), and present two variations of the properties required for SSO security as well as the privacy requirements, to analyze existing OPRF protocols. Finally, new identity transformations different from those designed in UppreSSO, are constructed based on OPRFs, satisfying different variations of SSO security requirements. To the best of our knowledge, this is the first time to uncover the relationship between identity transformations in OIDC-compatible privacy-preserving SSO services and OPRFs, and prove the SSO-related properties (i.e., key-identifier freeness, RP designation and user identification) of OPRF protocols, in addition to the basic properties of correctness, obliviousness and pseudo-randomness.

[693] arXiv:2506.02007 (replaced) [pdf, html, other]
Title: eACGM: Non-instrumented Performance Tracing and Anomaly Detection towards Machine Learning Systems
Ruilin Xu, Zongxuan Xie, Pengfei Chen
Comments: IWQoS 2025 (Camera-Ready Version)
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)

We present eACGM, a full-stack AI/ML system monitoring framework based on eBPF. eACGM collects real-time performance data from key hardware components, including the GPU and network communication layer, as well as from key software stacks such as CUDA, Python, and PyTorch, all without requiring any code instrumentation or modifications. Additionally, it leverages libnvml to gather process-level GPU resource usage information. By applying a Gaussian Mixture Model (GMM) to the collected multidimensional performance metrics for statistical modeling and clustering analysis, eACGM effectively identifies complex failure modes, such as latency anomalies, hardware failures, and communication inefficiencies, enabling rapid diagnosis of system bottlenecks and abnormal behaviors.
To evaluate eACGM's effectiveness and practicality, we conducted extensive empirical studies and case analyses in multi-node distributed training scenarios. The results demonstrate that eACGM, while maintaining a non-intrusive and low-overhead profile, successfully captures critical performance anomalies during model training and inference. Its stable anomaly detection performance and comprehensive monitoring capabilities validate its applicability and scalability in real-world production environments, providing strong support for performance optimization and fault diagnosis in large-scale AI/ML systems.

[694] arXiv:2506.02205 (replaced) [pdf, html, other]
Title: Bregman Centroid Guided Cross-Entropy Method
Yuliang Gu, Hongpeng Cao, Marco Caccamo, Naira Hovakimyan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)

The Cross-Entropy Method (CEM) is a widely adopted trajectory optimizer in model-based reinforcement learning (MBRL), but its unimodal sampling strategy often leads to premature convergence in multimodal landscapes. In this work, we propose Bregman Centroid Guided CEM ($\mathcal{BC}$-EvoCEM), a lightweight enhancement to ensemble CEM that leverages $\textit{Bregman centroids}$ for principled information aggregation and diversity control. $\textbf{$\mathcal{BC}$-EvoCEM}$ computes a performance-weighted Bregman centroid across CEM workers and updates the least contributing ones by sampling within a trust region around the centroid. Leveraging the duality between Bregman divergences and exponential family distributions, we show that $\textbf{$\mathcal{BC}$-EvoCEM}$ integrates seamlessly into standard CEM pipelines with negligible overhead. Empirical results on synthetic benchmarks, a cluttered navigation task, and full MBRL pipelines demonstrate that $\textbf{$\mathcal{BC}$-EvoCEM}$ enhances both convergence and solution quality, providing a simple yet effective upgrade for CEM.

[695] arXiv:2506.06946 (replaced) [pdf, html, other]
Title: Making a Pipeline Production-Ready: Challenges and Lessons Learned in the Healthcare Domain
Daniel Angelo Esteves Lawand (1), Lucas Quaresma Medina Lam (1), Roberto Oliveira Bolgheroni (1), Renato Cordeiro Ferreira (1,2,3,4), Alfredo Goldman (1), Marcelo Finger (1) ((1) University of São Paulo, (2) Jheronimus Academy of Data Science, (3) Technical University of Eindhoven, (4) Tilburg University)
Comments: 8 pages, 3 figures (2 diagrams, 2 code listings), accepted to the workshop SADIS 2025
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Deploying a Machine Learning (ML) training pipeline into production requires good software engineering practices. Unfortunately, the typical data science workflow often leads to code that lacks critical software quality attributes. This experience report investigates this problem in SPIRA, a project whose goal is to create an ML-Enabled System (MLES) to pre-diagnose insufficiency respiratory via speech analysis. This paper presents an overview of the architecture of the MLES, then compares three versions of its Continuous Training subsystem: from a proof of concept Big Ball of Mud (v1), to a design pattern-based Modular Monolith (v2), to a test-driven set of Microservices (v3) Each version improved its overall extensibility, maintainability, robustness, and resiliency. The paper shares challenges and lessons learned in this process, offering insights for researchers and practitioners seeking to productionize their pipelines.

[696] arXiv:2506.10217 (replaced) [pdf, other]
Title: Data-Centric Safety and Ethical Measures for Data and AI Governance
Srija Chakraborty
Comments: Paper accepted and presented at the AAAI 2025 Workshop on Datasets and Evaluators of AI Safety this https URL
Subjects: Computers and Society (cs.CY)

Datasets play a key role in imparting advanced capabilities to artificial intelligence (AI) foundation models that can be adapted to various downstream tasks. These downstream applications can introduce both beneficial and harmful capabilities -- resulting in dual use AI foundation models, with various technical and regulatory approaches to monitor and manage these risks. However, despite the crucial role of datasets, responsible dataset design and ensuring data-centric safety and ethical practices have received less attention. In this study, we pro-pose responsible dataset design framework that encompasses various stages in the AI and dataset lifecycle to enhance safety measures and reduce the risk of AI misuse due to low quality, unsafe and unethical data content. This framework is domain agnostic, suitable for adoption for various applications and can promote responsible practices in dataset creation, use, and sharing to facilitate red teaming, minimize risks, and increase trust in AI models.

[697] arXiv:2506.10685 (replaced) [pdf, html, other]
Title: Defensive Adversarial CAPTCHA: A Semantics-Driven Framework for Natural Adversarial Example Generation
Xia Du, Xiaoyuan Liu, Jizhe Zhou, Zheng Lin, Chi-man Pun, Cong Wu, Tao Li, Zhe Chen, Wei Ni, Jun Luo
Comments: 13 pages, 6 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

Traditional CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart) schemes are increasingly vulnerable to automated attacks powered by deep neural networks (DNNs). Existing adversarial attack methods often rely on the original image characteristics, resulting in distortions that hinder human interpretation and limit their applicability in scenarios where no initial input images are available. To address these challenges, we propose the Unsourced Adversarial CAPTCHA (DAC), a novel framework that generates high-fidelity adversarial examples guided by attacker-specified semantics information. Leveraging a Large Language Model (LLM), DAC enhances CAPTCHA diversity and enriches the semantic information. To address various application scenarios, we examine the white-box targeted attack scenario and the black box untargeted attack scenario. For target attacks, we introduce two latent noise variables that are alternately guided in the diffusion step to achieve robust inversion. The synergy between gradient guidance and latent variable optimization achieved in this way ensures that the generated adversarial examples not only accurately align with the target conditions but also achieve optimal performance in terms of distributional consistency and attack effectiveness. In untargeted attacks, especially for black-box scenarios, we introduce bi-path unsourced adversarial CAPTCHA (BP-DAC), a two-step optimization strategy employing multimodal gradients and bi-path optimization for efficient misclassification. Experiments show that the defensive adversarial CAPTCHA generated by BP-DAC is able to defend against most of the unknown models, and the generated CAPTCHA is indistinguishable to both humans and DNNs.

[698] arXiv:2506.10755 (replaced) [pdf, html, other]
Title: Quantifying Azure RBAC Wildcard Overreach
Christophe Parisel
Subjects: Cryptography and Security (cs.CR)

Azure RBAC leverages wildcard permissions to simplify policy authoring, but this abstraction often obscures the actual set of allowed operations and undermines least-privilege guarantees. We introduce Belshazaar, a two-stage framework that targets both the effective permission set problem and the evaluation of wildcards permissions spread. First, we formalize Azure action syntax via a context free grammar and implement a compiler that expands any wildcard into its explicit action set. Second, we define an ultrametric diameter metric to quantify semantic overreach in wildcard scenarios. Applied to Microsoft s official catalog of 15481 actions, Belshazaar reveals that about 50 percent of actions admit a cross Resource Provider reach when associated with non obvious wildcards, and that effective permissions sets are effectively computable. These findings demonstrate that wildcard patterns can introduce substantial privilege bloat, and that our approach offers a scalable, semantics driven path toward tighter, least-privilege RBAC policies in Azure environments.

[699] arXiv:2506.10967 (replaced) [pdf, html, other]
Title: Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs
Qizhe Zhang, Mengzhen Liu, Lichen Li, Ming Lu, Yuan Zhang, Junwen Pan, Qi She, Shanghang Zhang
Comments: 22 pages, 5 figures, code: this https URL, project page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

In multimodal large language models (MLLMs), the length of input visual tokens is often significantly greater than that of their textual counterparts, leading to a high inference cost. Many works aim to address this issue by removing redundant visual tokens. However, current approaches either rely on attention-based pruning, which retains numerous duplicate tokens, or use similarity-based pruning, overlooking the instruction relevance, consequently causing suboptimal performance. In this paper, we go beyond attention or similarity by proposing a novel visual token pruning method named CDPruner, which maximizes the conditional diversity of retained tokens. We first define the conditional similarity between visual tokens conditioned on the instruction, and then reformulate the token pruning problem with determinantal point process (DPP) to maximize the conditional diversity of the selected subset. The proposed CDPruner is training-free and model-agnostic, allowing easy application to various MLLMs. Extensive experiments across diverse MLLMs show that CDPruner establishes new state-of-the-art on various vision-language benchmarks. By maximizing conditional diversity through DPP, the selected subset better represents the input images while closely adhering to user instructions, thereby preserving strong performance even with high reduction ratios. When applied to LLaVA, CDPruner reduces FLOPs by 95\% and CUDA latency by 78\%, while maintaining 94\% of the original accuracy. Our code is available at this https URL.

[700] arXiv:2506.11013 (replaced) [pdf, html, other]
Title: Toward a Brazilian Research Agenda in Quantum Software Engineering: A Systematic Mapping Study
Filipe Fernandes, Cláudia Werner
Comments: 11 pages, 13 figures
Subjects: Software Engineering (cs.SE)

Context: Quantum Software Engineering (QSE) has emerged as a promising discipline to support the development of quantum applications by integrating quantum computing principles with established software engineering practices. Problem: Despite recent growth, QSE still lacks standardized methodologies, tools, and guidelines. Moreover, countries like Brazil have had minimal representation in the development of this emerging field. Objective: This study aims to map the current state of QSE by identifying research trends, contributions, and gaps that can inform future investigations and strategic initiatives. Methodology: A systematic mapping study was conducted across major scientific databases, retrieving 3,219 studies. After applying inclusion and exclusion criteria, 3,052 studies were excluded, resulting in 167 selected for analysis. The publications were classified by study type, research type, and alignment with SWEBOK knowledge areas. Results: Most studies focused on Software Engineering Models and Methods, Software Architecture, and Software Testing. Conceptual and technical proposals were predominant, while empirical validations remained limited. Conclusions: QSE is still a maturing field. Advancing it requires standardization, more empirical research, and greater participation from developing countries. As its main contribution, this study proposes a Brazilian Research Agenda in QSE to guide national efforts and foster the development of a strong local scientific community.

[701] arXiv:2506.11504 (replaced) [pdf, html, other]
Title: Symmetric Sliding-Mode Control of Grid-Forming Inverters With Precision Region Under AC and DC Sides Varying
Qianxi Tang, Li Peng, Xuefeng Wang, Xinchen Yao
Comments: 10 pages, 10 figures. This work has been submitted to the IEEE for possible publication
Subjects: Systems and Control (eess.SY)

Voltage regulation under conventional grid-forming controllers is tightly coupled to power sharing and dc-link dynamics. Consequently, its tracking accuracy deteriorates during grid faults, sudden power sharing changes, or dc-bus voltage varying. To address this issue, a symmetric sliding-mode control (SSMC) method is developed and its voltage precision region is derived. It illustrates how much ac-side power dynamics and dc-link voltage varying can be decoupled from the voltage regulation task, which helps predict when an abnormal entangling appears. While conventional sliding-mode controls address voltage-tracking error through complex sliding surface designs, repetitive correction techniques or special reaching laws, this work identifies that the error at power-line frequency primarily stem from the asymmetry property of inverters with the delay effect and the computational inaccuracy. Guided by this insight, an asymmetry compensation structure is proposed, which avoids added design complexity and directly mitigates voltage tracking error. Furthermore, the control design is supported by a physical and quantitative explanation, aiding in parameter tuning. Simulation and experimental results demonstrate that the proposed method achieves faster tracking responses while maintaining robust and more accurate tracking under both dc-link voltage and ac-side current variations. Conventional grid-forming and classical sliding-mode controllers, which handle these variations separately, cannot match this combined speed and robustness. Furthermore, the voltage precision region is explicitly verified.

[702] arXiv:2506.11999 (replaced) [pdf, html, other]
Title: Generative Representational Learning of Foundation Models for Recommendation
Zheli Zhou, Chenxu Zhu, Jianghao Lin, Bo Chen, Ruiming Tang, Weinan Zhang, Yong Yu
Comments: Project page is available at this https URL
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Developing a single foundation model with the capability to excel across diverse tasks has been a long-standing objective in the field of artificial intelligence. As the wave of general-purpose foundation models sweeps across various domains, their influence has significantly extended to the field of recommendation systems. While recent efforts have explored recommendation foundation models for various generative tasks, they often overlook crucial embedding tasks and struggle with the complexities of multi-task learning, including knowledge sharing & conflict resolution, and convergence speed inconsistencies. To address these limitations, we introduce RecFound, a generative representational learning framework for recommendation foundation models. We construct the first comprehensive dataset for recommendation foundation models covering both generative and embedding tasks across diverse scenarios. Based on this dataset, we propose a novel multi-task training scheme featuring a Task-wise Mixture of Low-rank Experts (TMoLE) to handle knowledge sharing & conflict, a Step-wise Convergence-oriented Sample Scheduler (S2Sched) to address inconsistent convergence, and a Model Merge module to balance the performance across tasks. Experiments demonstrate that RecFound achieves state-of-the-art performance across various recommendation tasks, outperforming existing baselines.

[703] arXiv:2506.12036 (replaced) [pdf, html, other]
Title: A Minimalist Method for Fine-tuning Text-to-Image Diffusion Models
Yanting Miao, William Loh, Pacal Poupart, Suraj Kothawade
Comments: 17 pages, 6 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Recent work uses reinforcement learning (RL) to fine-tune text-to-image diffusion models, improving text-image alignment and sample quality. However, existing approaches introduce unnecessary complexity: they cache the full sampling trajectory, depend on differentiable reward models or large preference datasets, or require specialized guidance techniques. Motivated by the "golden noise" hypothesis -- that certain initial noise samples can consistently yield superior alignment -- we introduce Noise PPO, a minimalist RL algorithm that leaves the pre-trained diffusion model entirely frozen and learns a prompt-conditioned initial noise generator. Our approach requires no trajectory storage, reward backpropagation, or complex guidance tricks. Extensive experiments show that optimizing the initial noise distribution consistently improves alignment and sample quality over the original model, with the most significant gains at low inference steps. As the number of inference steps increases, the benefit of noise optimization diminishes but remains present. These findings clarify the scope and limitations of the golden noise hypothesis and reinforce the practical value of minimalist RL fine-tuning for diffusion models.

[704] arXiv:2506.12747 (replaced) [pdf, html, other]
Title: Unleashing Diffusion and State Space Models for Medical Image Segmentation
Rong Wu, Ziqi Chen, Liming Zhong, Heng Li, Hai Shu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Existing segmentation models trained on a single medical imaging dataset often lack robustness when encountering unseen organs or tumors. Developing a robust model capable of identifying rare or novel tumor categories not present during training is crucial for advancing medical imaging applications. We propose DSM, a novel framework that leverages diffusion and state space models to segment unseen tumor categories beyond the training data. DSM utilizes two sets of object queries trained within modified attention decoders to enhance classification accuracy. Initially, the model learns organ queries using an object-aware feature grouping strategy to capture organ-level visual features. It then refines tumor queries by focusing on diffusion-based visual prompts, enabling precise segmentation of previously unseen tumors. Furthermore, we incorporate diffusion-guided feature fusion to improve semantic segmentation performance. By integrating CLIP text embeddings, DSM captures category-sensitive classes to improve linguistic transfer knowledge, thereby enhancing the model's robustness across diverse scenarios and multi-label tasks. Extensive experiments demonstrate the superior performance of DSM in various tumor segmentation tasks. Code is available at this https URL.

[705] arXiv:2506.13759 (replaced) [pdf, html, other]
Title: Discrete Diffusion in Large Language and Multimodal Models: A Survey
Runpeng Yu, Qi Li, Xinchao Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

In this work, we provide a systematic survey of Discrete Diffusion Language Models (dLLMs) and Discrete Diffusion Multimodal Language Models (dMLLMs). Unlike autoregressive (AR) models, dLLMs and dMLLMs adopt a multi-token, parallel decoding paradigm using full attention and a denoising-based generation strategy. This paradigm naturally enables parallel generation, fine-grained output controllability, and dynamic, response-aware perception. These capabilities are previously difficult to achieve with AR models. Recently, a growing number of industrial-scale proprietary d(M)LLMs, as well as a large number of open-source academic d(M)LLMs, have demonstrated performance comparable to their autoregressive counterparts, while achieving up to 10x acceleration in inference speed.
The advancement of discrete diffusion LLMs and MLLMs has been largely driven by progress in two domains. The first is the development of autoregressive LLMs and MLLMs, which has accumulated vast amounts of data, benchmarks, and foundational infrastructure for training and inference. The second contributing domain is the evolution of the mathematical models underlying discrete diffusion. Together, these advancements have catalyzed a surge in dLLMs and dMLLMs research in early 2025.
In this work, we present a comprehensive overview of the research in the dLLM and dMLLM domains. We trace the historical development of dLLMs and dMLLMs, formalize the underlying mathematical frameworks, and categorize representative models. We further analyze key techniques for training and inference, and summarize emerging applications across language, vision-language, and biological domains. We conclude by discussing future directions for research and deployment.
Paper collection: this https URL

[706] arXiv:2506.14810 (replaced) [pdf, html, other]
Title: Intelligent Routing for Sparse Demand Forecasting: A Comparative Evaluation of Selection Strategies
Qiwen Zhang
Comments: 8 pages, 4 figures, conference
Subjects: Machine Learning (cs.LG)

Sparse and intermittent demand forecasting in supply chains presents a critical challenge, as frequent zero-demand periods hinder traditional model accuracy and impact inventory management. We propose and evaluate a Model-Router framework that dynamically selects the most suitable forecasting model-spanning classical, ML, and DL methods for each product based on its unique demand pattern. By comparing rule-based, LightGBM, and InceptionTime routers, our approach learns to assign appropriate forecasting strategies, effectively differentiating between smooth, lumpy, or intermittent demand regimes to optimize predictions. Experiments on the large-scale Favorita dataset show our deep learning (Inception Time) router improves forecasting accuracy by up to 11.8% (NWRMSLE) over strong, single-model benchmarks with 4.67x faster inference time. Ultimately, these gains in forecasting precision will drive substantial reductions in both stockouts and wasteful excess inventory, underscoring the critical role of intelligent, adaptive Al in optimizing contemporary supply chain operations.

[707] arXiv:2506.15709 (replaced) [pdf, html, other]
Title: Studying and Improving Graph Neural Network-based Motif Estimation
Pedro C. Vieira, Miguel E. P. Silva, Pedro Manuel Pinto Ribeiro
Comments: This manuscript represents a revised version from the paper on this https URL. Still a work in progress. Comments are welcome! 23 pages (12 main text + references), 9 figures, 5 tables. (First update: Fix broken links, references and text review.)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Graph Neural Networks (GNNs) are a predominant method for graph representation learning. However, beyond subgraph frequency estimation, their application to network motif significance-profile (SP) prediction remains under-explored, with no established benchmarks in the literature. We propose to address this problem, framing SP estimation as a task independent of subgraph frequency estimation. Our approach shifts from frequency counting to direct SP estimation and modulates the problem as multitarget regression. The reformulation is optimised for interpretability, stability and scalability on large graphs. We validate our method using a large synthetic dataset and further test it on real-world graphs. Our experiments reveal that 1-WL limited models struggle to make precise estimations of SPs. However, they can generalise to approximate the graph generation processes of networks by comparing their predicted SP with the ones originating from synthetic generators. This first study on GNN-based motif estimation also hints at how using direct SP estimation can help go past the theoretical limitations that motif estimation faces when performed through subgraph counting.

[708] arXiv:2506.16571 (replaced) [pdf, html, other]
Title: Capturing Visualization Design Rationale
Maeve Hutchinson, Radu Jianu, Aidan Slingsby, Jo Wood, Pranava Madhyastha
Comments: To be presented at IEEE VIS 2025
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)

Prior natural language datasets for data visualization have focused on tasks such as visualization literacy assessment, insight generation, and visualization generation from natural language instructions. These studies often rely on controlled setups with purpose-built visualizations and artificially constructed questions. As a result, they tend to prioritize the interpretation of visualizations, focusing on decoding visualizations rather than understanding their encoding. In this paper, we present a new dataset and methodology for probing visualization design rationale through natural language. We leverage a unique source of real-world visualizations and natural language narratives: literate visualization notebooks created by students as part of a data visualization course. These notebooks combine visual artifacts with design exposition, in which students make explicit the rationale behind their design decisions. We also use large language models (LLMs) to generate and categorize question-answer-rationale triples from the narratives and articulations in the notebooks. We then carefully validate the triples and curate a dataset that captures and distills the visualization design choices and corresponding rationales of the students.

[709] arXiv:2506.17336 (replaced) [pdf, html, other]
Title: Privacy-Preserving LLM Interaction with Socratic Chain-of-Thought Reasoning and Homomorphically Encrypted Vector Databases
Yubeen Bae, Minchan Kim, Jaejin Lee, Sangbum Kim, Jaehyung Kim, Yejin Choi, Niloofar Mireshghallah
Comments: 29 pages
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Large language models (LLMs) are increasingly used as personal agents, accessing sensitive user data such as calendars, emails, and medical records. Users currently face a trade-off: They can send private records, many of which are stored in remote databases, to powerful but untrusted LLM providers, increasing their exposure risk. Alternatively, they can run less powerful models locally on trusted devices. We bridge this gap. Our Socratic Chain-of-Thought Reasoning first sends a generic, non-private user query to a powerful, untrusted LLM, which generates a Chain-of-Thought (CoT) prompt and detailed sub-queries without accessing user data. Next, we embed these sub-queries and perform encrypted sub-second semantic search using our Homomorphically Encrypted Vector Database across one million entries of a single user's private data. This represents a realistic scale of personal documents, emails, and records accumulated over years of digital activity. Finally, we feed the CoT prompt and the decrypted records to a local language model and generate the final response. On the LoCoMo long-context QA benchmark, our hybrid framework, combining GPT-4o with a local Llama-3.2-1B model, outperforms using GPT-4o alone by up to 7.1 percentage points. This demonstrates a first step toward systems where tasks are decomposed and split between untrusted strong LLMs and weak local ones, preserving user privacy.

[710] arXiv:2506.17765 (replaced) [pdf, html, other]
Title: CARTS: Collaborative Agents for Recommendation Textual Summarization
Jiao Chen, Kehui Yao, Reza Yousefi Maragheh, Kai Zhao, Jianpeng Xu, Jason Cho, Evren Korpeoglu, Sushant Kumar, Kannan Achan
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Current recommendation systems often require some form of textual data summarization, such as generating concise and coherent titles for product carousels or other grouped item displays. While large language models have shown promise in NLP domains for textual summarization, these approaches do not directly apply to recommendation systems, where explanations must be highly relevant to the core features of item sets, adhere to strict word limit constraints. In this paper, we propose CARTS (Collaborative Agents for Recommendation Textual Summarization), a multi-agent LLM framework designed for structured summarization in recommendation systems. CARTS decomposes the task into three stages-Generation Augmented Generation (GAG), refinement circle, and arbitration, where successive agent roles are responsible for extracting salient item features, iteratively refining candidate titles based on relevance and length feedback, and selecting the final title through a collaborative arbitration process. Experiments on large-scale e-commerce data and live A/B testing show that CARTS significantly outperforms single-pass and chain-of-thought LLM baselines, delivering higher title relevance and improved user engagement metrics.

[711] arXiv:2506.18494 (replaced) [pdf, html, other]
Title: Distribution of codewords on the faces of a hypercube and new combinatorial identities
Jamolidin K. Abdurakhmanov
Subjects: Discrete Mathematics (cs.DM)

We present a novel framework for studying combinatorial identities through the geometric lens of subset distributions in q-valued cubes. By analyzing how elements of arbitrary subsets are distributed among the faces of the cube E_q^n, we discover new combinatorial identities with geometric significance. We prove that for any subset A contained in E_2^n, the rank function satisfies refined bounds that lead to exact computations for small cardinalities. Specifically, we show that for odd cardinalities, the lower bound is 4D_A/(|A|^2-1) where D_A is the sum of all pairwise Hamming distances in A. Our main theorem establishes identities connecting the number of k-dimensional faces containing exactly e elements of a subset to binomial sums over all subsets of specified cardinality. This yields a parametric family of identities where classical results emerge as special cases. As applications, we derive a geometric interpretation of Vandermonde's identity by examining faces of q-valued cubes, revealing that this classical result naturally arises from counting element distributions. We also obtain a completely new identity for even-weight vectors: (2^(k-1) - 1) times 2^(n-1) times binomial(n,k) equals the sum over i from 1 to floor(n/2) of binomial(n,2i) times binomial(n-2i,k-2i). This identity, valid for all 1 <= k <= n, demonstrates how geometric perspectives can uncover hidden combinatorial relationships. Our framework provides a unified approach for generating new identities and understanding existing ones through subset rank analysis.

[712] arXiv:2506.18541 (replaced) [pdf, other]
Title: Deciding Termination of Simple Randomized Loops
Éléanore Meyer, Jürgen Giesl
Subjects: Logic in Computer Science (cs.LO)

We show that universal positive almost sure termination (UPAST) is decidable for a class of simple randomized programs, i.e., it is decidable whether the expected runtime of such a program is finite for all inputs. Our class contains all programs that consist of a single loop, with a linear loop guard and a loop body composed of two linear commuting and diagonalizable updates. In each iteration of the loop, the update to be carried out is picked at random, according to a fixed probability. We show the decidability of UPAST for this class of programs, where the program's variables and inputs may range over various sub-semirings of the real numbers. In this way, we extend a line of research initiated by Tiwari in 2004 into the realm of randomized programs.

[713] arXiv:2506.18548 (replaced) [pdf, html, other]
Title: Rethinking Click Models in Light of Carousel Interfaces: Theory-Based Categorization and Design of Click Models
Jingwei Kang, Maarten de Rijke, Santiago de Leon-Martinez, Harrie Oosterhuis
Comments: Accepted by ICTIR 2025
Subjects: Information Retrieval (cs.IR)

Click models are a well-established for modeling user interactions with web interfaces. Previous work has mainly focused on traditional single-list web search settings; this includes existing surveys that introduced categorizations based on the first generation of probabilistic graphical model (PGM) click models that have become standard. However, these categorizations have become outdated, as their conceptualizations are unable to meaningfully compare PGM with neural network (NN) click models nor generalize to newer interfaces, such as carousel interfaces. We argue that this outdated view fails to adequately explain the fundamentals of click model designs, thus hindering the development of novel click models.
This work reconsiders what should be the fundamental concepts in click model design, grounding them - unlike previous approaches - in their mathematical properties. We propose three fundamental key-design choices that explain what statistical patterns a click model can capture, and thus indirectly, what user behaviors they can capture. Based on these choices, we create a novel click model taxonomy that allows a meaningful comparison of all existing click models; this is the first taxonomy of single-list, grid and carousel click models that includes PGMs and NNs. Finally, we show how our conceptualization provides a foundation for future click model design by an example derivation of a novel design for carousel interfaces.

[714] arXiv:2506.18710 (replaced) [pdf, html, other]
Title: Benchmarking the Pedagogical Knowledge of Large Language Models
Maxime Lelièvre, Amy Waldock, Meng Liu, Natalia Valdés Aspillaga, Alasdair Mackintosh, María José Ogando Portela, Jared Lee, Paul Atherton, Robin A. A. Ince, Oliver G. B. Garrod
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Benchmarks like Massive Multitask Language Understanding (MMLU) have played a pivotal role in evaluating AI's knowledge and abilities across diverse domains. However, existing benchmarks predominantly focus on content knowledge, leaving a critical gap in assessing models' understanding of pedagogy - the method and practice of teaching. This paper introduces The Pedagogy Benchmark, a novel dataset designed to evaluate large language models on their Cross-Domain Pedagogical Knowledge (CDPK) and Special Education Needs and Disability (SEND) pedagogical knowledge. These benchmarks are built on a carefully curated set of questions sourced from professional development exams for teachers, which cover a range of pedagogical subdomains such as teaching strategies and assessment methods. Here we outline the methodology and development of these benchmarks. We report results for 97 models, with accuracies spanning a range from 28% to 89% on the pedagogical knowledge questions. We consider the relationship between cost and accuracy and chart the progression of the Pareto value frontier over time. We provide online leaderboards at this https URL which are updated with new models and allow interactive exploration and filtering based on various model properties, such as cost per token and open-vs-closed weights, as well as looking at performance in different subjects. LLMs and generative AI have tremendous potential to influence education and help to address the global learning crisis. Education-focused benchmarks are crucial to measure models' capacities to understand pedagogical concepts, respond appropriately to learners' needs, and support effective teaching practices across diverse contexts. They are needed for informing the responsible and evidence-based deployment of LLMs and LLM-based tools in educational settings, and for guiding both development and policy decisions.

[715] arXiv:2506.19089 (replaced) [pdf, html, other]
Title: Language Models Might Not Understand You: Evaluating Theory of Mind via Story Prompting
Nathaniel Getachew, Abulhair Saparov
Comments: 14 pages, 11 figures
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

We introduce $\texttt{StorySim}$, a programmable framework for synthetically generating stories to evaluate the theory of mind (ToM) and world modeling (WM) capabilities of large language models (LLMs). Unlike prior benchmarks that may suffer from contamination in pretraining data, $\texttt{StorySim}$ produces novel, compositional story prompts anchored by a highly controllable $\texttt{Storyboard}$, enabling precise manipulation of character perspectives and events. We use this framework to design first- and second-order ToM tasks alongside WM tasks that control for the ability to track and model mental states. Our experiments across a suite of state-of-the-art LLMs reveal that most models perform better on WM tasks than ToM tasks, and that models tend to perform better reasoning with humans compared to inanimate objects. Additionally, our framework enabled us to find evidence of heuristic behavior such as recency bias and an over-reliance on earlier events in the story. All code for generating data and evaluations is freely available.

[716] arXiv:2506.19283 (replaced) [pdf, html, other]
Title: AirV2X: Unified Air-Ground Vehicle-to-Everything Collaboration
Xiangbo Gao, Yuheng Wu, Xuewen Luo, Keshu Wu, Xinghao Chen, Yuping Wang, Chenxi Liu, Yang Zhou, Zhengzhong Tu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

While multi-vehicular collaborative driving demonstrates clear advantages over single-vehicle autonomy, traditional infrastructure-based V2X systems remain constrained by substantial deployment costs and the creation of "uncovered danger zones" in rural and suburban areas. We present AirV2X-Perception, a large-scale dataset that leverages Unmanned Aerial Vehicles (UAVs) as a flexible alternative or complement to fixed Road-Side Units (RSUs). Drones offer unique advantages over ground-based perception: complementary bird's-eye-views that reduce occlusions, dynamic positioning capabilities that enable hovering, patrolling, and escorting navigation rules, and significantly lower deployment costs compared to fixed infrastructure. Our dataset comprises 6.73 hours of drone-assisted driving scenarios across urban, suburban, and rural environments with varied weather and lighting conditions. The AirV2X-Perception dataset facilitates the development and standardized evaluation of Vehicle-to-Drone (V2D) algorithms, addressing a critical gap in the rapidly expanding field of aerial-assisted autonomous driving systems. The dataset and development kits are open-sourced at this https URL.

[717] arXiv:2506.19288 (replaced) [pdf, html, other]
Title: Da Yu: Towards USV-Based Image Captioning for Waterway Surveillance and Scene Understanding
Runwei Guan, Ningwei Ouyang, Tianhao Xu, Shaofeng Liang, Wei Dai, Yafeng Sun, Shang Gao, Songning Lai, Shanliang Yao, Xuming Hu, Ryan Wen Liu, Yutao Yue, Hui Xiong
Comments: 14 pages, 13 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Automated waterway environment perception is crucial for enabling unmanned surface vessels (USVs) to understand their surroundings and make informed decisions. Most existing waterway perception models primarily focus on instance-level object perception paradigms (e.g., detection, segmentation). However, due to the complexity of waterway environments, current perception datasets and models fail to achieve global semantic understanding of waterways, limiting large-scale monitoring and structured log generation. With the advancement of vision-language models (VLMs), we leverage image captioning to introduce WaterCaption, the first captioning dataset specifically designed for waterway environments. WaterCaption focuses on fine-grained, multi-region long-text descriptions, providing a new research direction for visual geo-understanding and spatial scene cognition. Exactly, it includes 20.2k image-text pair data with 1.8 million vocabulary size. Additionally, we propose Da Yu, an edge-deployable multi-modal large language model for USVs, where we propose a novel vision-to-language projector called Nano Transformer Adaptor (NTA). NTA effectively balances computational efficiency with the capacity for both global and fine-grained local modeling of visual features, thereby significantly enhancing the model's ability to generate long-form textual outputs. Da Yu achieves an optimal balance between performance and efficiency, surpassing state-of-the-art models on WaterCaption and several other captioning benchmarks.

[718] arXiv:2506.19676 (replaced) [pdf, html, other]
Title: A Survey of LLM-Driven AI Agent Communication: Protocols, Security Risks, and Defense Countermeasures
Dezhang Kong, Shi Lin, Zhenhua Xu, Zhebo Wang, Minghao Li, Yufeng Li, Yilun Zhang, Hujin Peng, Zeyang Sha, Yuyuan Li, Changting Lin, Xun Wang, Xuan Liu, Ningyu Zhang, Chaochao Chen, Muhammad Khurram Khan, Meng Han
Subjects: Cryptography and Security (cs.CR)

In recent years, Large-Language-Model-driven AI agents have exhibited unprecedented intelligence and adaptability, and are rapidly changing human production and life. Nowadays, agents are undergoing a new round of evolution. They no longer act as an isolated island like LLMs. Instead, they start to communicate with diverse external entities, such as other agents and tools, to perform more complex tasks collectively. Under this trend, agent communication is regarded as a foundational pillar of the future AI ecosystem, and many organizations have intensively begun to design related communication protocols (e.g., Anthropic's MCP and Google's A2A) within the recent few months. However, this new field exposes significant security hazards, which can cause severe damage to real-world scenarios. To help researchers quickly figure out this promising topic and benefit the future agent communication development, this paper presents a comprehensive survey of agent communication security. More precisely, we first present a clear definition of agent communication and categorize the entire lifecycle of agent communication into three stages: user-agent interaction, agent-agent communication, and agent-environment communication. Next, for each communication phase, we dissect related protocols and analyze the security risks according to the communication characteristics. Then, we summarize and outlook on the possible defense countermeasures for each risk. In addition, we conduct experiments using MCP and A2A to help readers better understand the novel vulnerabilities brought by agent communication. Finally, we discuss open issues and future directions in this promising research field.

[719] arXiv:2506.20259 (replaced) [pdf, html, other]
Title: Generating and Customizing Robotic Arm Trajectories using Neural Networks
Andrej Lúčny, Matilde Antonj, Carlo Mazzola, Hana Hornáčková, Igor Farkaš
Comments: The code is released at this https URL
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

We introduce a neural network approach for generating and customizing the trajectory of a robotic arm, that guarantees precision and repeatability. To highlight the potential of this novel method, we describe the design and implementation of the technique and show its application in an experimental setting of cognitive robotics. In this scenario, the NICO robot was characterized by the ability to point to specific points in space with precise linear movements, increasing the predictability of the robotic action during its interaction with humans. To achieve this goal, the neural network computes the forward kinematics of the robot arm. By integrating it with a generator of joint angles, another neural network was developed and trained on an artificial dataset created from suitable start and end poses of the robotic arm. Through the computation of angular velocities, the robot was characterized by its ability to perform the movement, and the quality of its action was evaluated in terms of shape and accuracy. Thanks to its broad applicability, our approach successfully generates precise trajectories that could be customized in their shape and adapted to different settings.

[720] arXiv:2506.20687 (replaced) [pdf, html, other]
Title: Review of Three Variants of the k-d Tree
Russell A. Brown
Comments: 29 pages, 11 figures, one listing, one table
Subjects: Data Structures and Algorithms (cs.DS)

The original description of the k-d tree recognized that rebalancing techniques, such as used to build an AVL tree or a red-black tree, are not applicable to a k-d tree. Hence, in order to build a balanced k-d tree, it is necessary to find the median of a set of data for each recursive subdivision of that set. The sort or selection used to find the median, and the technique used to partition the set about that median, strongly influence the computational complexity of building a k-d tree. This article describes and contrasts three variants of the k-d tree that differ in their technique used to partition the set, and compares the performance of those variants. In addition, dual-threaded execution is proposed and analyzed for one of the three variants.

[721] arXiv:2506.20702 (replaced) [pdf, other]
Title: The Singapore Consensus on Global AI Safety Research Priorities
Yoshua Bengio, Tegan Maharaj, Luke Ong, Stuart Russell, Dawn Song, Max Tegmark, Lan Xue, Ya-Qin Zhang, Stephen Casper, Wan Sie Lee, Sören Mindermann, Vanessa Wilfred, Vidhisha Balachandran, Fazl Barez, Michael Belinsky, Imane Bello, Malo Bourgon, Mark Brakel, Siméon Campos, Duncan Cass-Beggs, Jiahao Chen, Rumman Chowdhury, Kuan Chua Seah, Jeff Clune, Juntao Dai, Agnes Delaborde, Nouha Dziri, Francisco Eiras, Joshua Engels, Jinyu Fan, Adam Gleave, Noah Goodman, Fynn Heide, Johannes Heidecke, Dan Hendrycks, Cyrus Hodes, Bryan Low Kian Hsiang, Minlie Huang, Sami Jawhar, Wang Jingyu, Adam Tauman Kalai, Meindert Kamphuis, Mohan Kankanhalli, Subhash Kantamneni, Mathias Bonde Kirk, Thomas Kwa, Jeffrey Ladish, Kwok-Yan Lam, Wan Lee Sie, Taewhi Lee, Xiaojian Li, Jiajun Liu, Chaochao Lu, Yifan Mai, Richard Mallah, Julian Michael, Nick Moës, Simon Möller, Kihyuk Nam, Kwan Yee Ng, Mark Nitzberg, Besmira Nushi, Seán O hÉigeartaigh, Alejandro Ortega, Pierre Peigné, James Petrie, Benjamin Prud'Homme, Reihaneh Rabbany, Nayat Sanchez-Pi, Sarah Schwettmann, Buck Shlegeris, Saad Siddiqui, Aradhana Sinha, Martín Soto, Cheston Tan, Dong Ting, William Tjhi, Robert Trager, Brian Tse, Anthony Tung K. H., Vanessa Wilfred, John Willes, Denise Wong, Wei Xu, Rongwu Xu, Yi Zeng, HongJiang Zhang, Djordje Žikelić
Comments: Final report from the "2025 Singapore Conference on AI (SCAI)" held April 26: this https URL
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Rapidly improving AI capabilities and autonomy hold significant promise of transformation, but are also driving vigorous debate on how to ensure that AI is safe, i.e., trustworthy, reliable, and secure. Building a trusted ecosystem is therefore essential -- it helps people embrace AI with confidence and gives maximal space for innovation while avoiding backlash.
The "2025 Singapore Conference on AI (SCAI): International Scientific Exchange on AI Safety" aimed to support research in this space by bringing together AI scientists across geographies to identify and synthesise research priorities in AI safety. This resulting report builds on the International AI Safety Report chaired by Yoshua Bengio and backed by 33 governments. By adopting a defence-in-depth model, this report organises AI safety research domains into three types: challenges with creating trustworthy AI systems (Development), challenges with evaluating their risks (Assessment), and challenges with monitoring and intervening after deployment (Control).

[722] arXiv:2506.21096 (replaced) [pdf, html, other]
Title: DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning
Kang He, Yuzhe Ding, Haining Wang, Fei Li, Chong Teng, Donghong Ji
Comments: Accepted by ACL 2025 Findings
Subjects: Computation and Language (cs.CL)

Previous multimodal sentence representation learning methods have achieved impressive performance. However, most approaches focus on aligning images and text at a coarse level, facing two critical challenges:cross-modal misalignment bias and intra-modal semantic divergence, which significantly degrade sentence representation quality. To address these challenges, we propose DALR (Dual-level Alignment Learning for Multimodal Sentence Representation). For cross-modal alignment, we propose a consistency learning module that softens negative samples and utilizes semantic similarity from an auxiliary task to achieve fine-grained cross-modal alignment. Additionally, we contend that sentence relationships go beyond binary positive-negative labels, exhibiting a more intricate ranking structure. To better capture these relationships and enhance representation quality, we integrate ranking distillation with global intra-modal alignment learning. Comprehensive experiments on semantic textual similarity (STS) and transfer (TR) tasks validate the effectiveness of our approach, consistently demonstrating its superiority over state-of-the-art baselines.

[723] arXiv:2506.21098 (replaced) [pdf, html, other]
Title: ComRAG: Retrieval-Augmented Generation with Dynamic Vector Stores for Real-time Community Question Answering in Industry
Qinwen Chen, Wenbiao Tao, Zhiwei Zhu, Mingfan Xi, Liangzhong Guo, Yuan Wang, Wei Wang, Yunshi Lan
Comments: 7 pages, 4 figures. Accepted at ACL 2025 Industry Track
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Community Question Answering (CQA) platforms can be deemed as important knowledge bases in community, but effectively leveraging historical interactions and domain knowledge in real-time remains a challenge. Existing methods often underutilize external knowledge, fail to incorporate dynamic historical QA context, or lack memory mechanisms suited for industrial deployment. We propose ComRAG, a retrieval-augmented generation framework for real-time industrial CQA that integrates static knowledge with dynamic historical QA pairs via a centroid-based memory mechanism designed for retrieval, generation, and efficient storage. Evaluated on three industrial CQA datasets, ComRAG consistently outperforms all baselines--achieving up to 25.9% improvement in vector similarity, reducing latency by 8.7% to 23.3%, and lowering chunk growth from 20.23% to 2.06% over iterations.

[724] arXiv:2506.21319 (replaced) [pdf, html, other]
Title: A Dataset for Enhancing MLLMs in Visualization Understanding and Reconstruction
Can Liu, Chunlin Da, Xiaoxiao Long, Yuxiao Yang, Yu Zhang, Yong Wang
Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)

Current multimodal large language models (MLLMs), while effective in natural image understanding, struggle with visualization understanding due to their inability to decode the data-to-visual mapping and extract structured information. To address these challenges, we propose SimVec, a compact and structured vector format that encodes chart elements, including mark types, positions, and sizes. Then, we present a new visualization dataset, which consists of bitmap images of charts, their corresponding SimVec representations, and data-centric question-answering pairs, each accompanied by explanatory chain-of-thought sentences. We fine-tune state-of-the-art MLLMs using our dataset. The experimental results show that fine-tuning leads to substantial improvements in data-centric reasoning tasks compared to their zero-shot versions. SimVec also enables MLLMs to accurately and compactly reconstruct chart structures from images. Our dataset and code are available at: this https URL.

[725] arXiv:2506.21329 (replaced) [pdf, html, other]
Title: Active Inference AI Systems for Scientific Discovery
Karthik Duraisamy
Subjects: Artificial Intelligence (cs.AI); Physics and Society (physics.soc-ph)

The rapid evolution of artificial intelligence has led to expectations of transformative scientific discovery, yet current systems remain fundamentally limited by their operational architectures, brittle reasoning mechanisms, and their separation from experimental reality. Building on earlier work, we contend that progress in AI-driven science now depends on closing three fundamental gaps -- the abstraction gap, the reasoning gap, and the reality gap -- rather than on model size/data/test time compute. Scientific reasoning demands internal representations that support simulation of actions and response, causal structures that distinguish correlation from mechanism, and continuous calibration. We define active inference AI systems for scientific discovery as those that (i) maintain long-lived research memories grounded in causal self-supervised foundation models, (ii) symbolic or neuro-symbolic planners equipped with Bayesian guardrails, (iii) grow persistent knowledge graphs where thinking generates novel conceptual nodes, reasoning establishes causal edges, and real-world interaction prunes false connections while strengthening verified pathways, and (iv) refine their internal representations through closed-loop interaction with both high-fidelity simulators and automated laboratories - an operational loop where mental simulation guides action and empirical surprise reshapes understanding. In essence, we outline an architecture where discovery arises from the interplay between internal models that enable counterfactual reasoning and external validation that grounds hypotheses in reality. It is also argued that the inherent ambiguity in feedback from simulations and experiments, and underlying uncertainties makes human judgment indispensable, not as a temporary scaffold but as a permanent architectural component.

[726] arXiv:2506.21541 (replaced) [pdf, html, other]
Title: StruMamba3D: Exploring Structural Mamba for Self-supervised Point Cloud Representation Learning
Chuxin Wang, Yixin Zha, Wenfei Yang, Tianzhu Zhang
Comments: Accepted by ICCV 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recently, Mamba-based methods have demonstrated impressive performance in point cloud representation learning by leveraging State Space Model (SSM) with the efficient context modeling ability and linear complexity. However, these methods still face two key issues that limit the potential of SSM: Destroying the adjacency of 3D points during SSM processing and failing to retain long-sequence memory as the input length increases in downstream tasks. To address these issues, we propose StruMamba3D, a novel paradigm for self-supervised point cloud representation learning. It enjoys several merits. First, we design spatial states and use them as proxies to preserve spatial dependencies among points. Second, we enhance the SSM with a state-wise update strategy and incorporate a lightweight convolution to facilitate interactions between spatial states for efficient structure modeling. Third, our method reduces the sensitivity of pre-trained Mamba-based models to varying input lengths by introducing a sequence length-adaptive strategy. Experimental results across four downstream tasks showcase the superior performance of our method. In addition, our method attains the SOTA 95.1% accuracy on ModelNet40 and 92.75% accuracy on the most challenging split of ScanObjectNN without voting strategy.

[727] arXiv:2506.21980 (replaced) [pdf, html, other]
Title: R1-Track: Direct Application of MLLMs to Visual Object Tracking via Reinforcement Learning
Biao Wang, Wenwen Li
Comments: 7 pages, 2 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Visual single object tracking aims to continuously localize and estimate the scale of a target in subsequent video frames, given only its initial state in the first frame. This task has traditionally been framed as a template matching problem, evolving through major phases including correlation filters, two-stream networks, and one-stream networks with significant progress achieved. However, these methods typically require explicit classification and regression modeling, depend on supervised training with large-scale datasets, and are limited to the single task of tracking, lacking flexibility. In recent years, multi-modal large language models (MLLMs) have advanced rapidly. Open-source models like Qwen2.5-VL, a flagship MLLMs with strong foundational capabilities, demonstrate excellent performance in grounding tasks. This has spurred interest in applying such models directly to visual tracking. However, experiments reveal that Qwen2.5-VL struggles with template matching between image pairs (i.e., tracking tasks). Inspired by deepseek-R1, we fine-tuned Qwen2.5-VL using the group relative policy optimization (GRPO) reinforcement learning method on a small-scale dataset with a rule-based reward function. The resulting model, R1-Track, achieved notable performance on the GOT-10k benchmark. R1-Track supports flexible initialization via bounding boxes or text descriptions while retaining most of the original model's general capabilities. And we further discuss potential improvements for R1-Track. This rough technical report summarizes our findings as of May 2025.

[728] arXiv:2506.21997 (replaced) [pdf, html, other]
Title: Binned semiparametric Bayesian networks
Rafael Sojo, Javier Díaz-Rozo, Concha Bielza, Pedro Larrañaga
Comments: Submitted to Information Sciences
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

This paper introduces a new type of probabilistic semiparametric model that takes advantage of data binning to reduce the computational cost of kernel density estimation in nonparametric distributions. Two new conditional probability distributions are developed for the new binned semiparametric Bayesian networks, the sparse binned kernel density estimation and the Fourier kernel density estimation. These two probability distributions address the curse of dimensionality, which typically impacts binned models, by using sparse tensors and restricting the number of parent nodes in conditional probability calculations. To evaluate the proposal, we perform a complexity analysis and conduct several comparative experiments using synthetic data and datasets from the UCI Machine Learning repository. The experiments include different binning rules, parent restrictions, grid sizes, and number of instances to get a holistic view of the model's behavior. As a result, our binned semiparametric Bayesian networks achieve structural learning and log-likelihood estimations with no statistically significant differences compared to the semiparametric Bayesian networks, but at a much higher speed. Thus, the new binned semiparametric Bayesian networks prove to be a reliable and more efficient alternative to their non-binned counterparts.

[729] arXiv:2506.22061 (replaced) [pdf, html, other]
Title: Negated String Containment is Decidable (Technical Report)
Vojtěch Havlena, Michal Hečko, Lukáš Holík, Ondřej Lengál
Subjects: Logic in Computer Science (cs.LO); Formal Languages and Automata Theory (cs.FL)

We provide a positive answer to a long-standing open question of the decidability of the not-contains string predicate. Not-contains is practically relevant, for instance in symbolic execution of string manipulating programs. Particularly, we show that the predicate $\neg\mathit{Contains}(x_1 \ldots x_n, y_1 \ldots y_m)$, where $x_1 \ldots x_n$ and $y_1 \ldots y_m$ are sequences of string variables constrained by regular languages, is decidable. Decidability of a not-contains predicate combined with chain-free word equations and regular membership constraints follows.

[730] arXiv:2506.22099 (replaced) [pdf, html, other]
Title: BézierGS: Dynamic Urban Scene Reconstruction with Bézier Curve Gaussian Splatting
Zipei Ma, Junzhe Jiang, Yurui Chen, Li Zhang
Comments: Accepted at ICCV 2025, Project Page: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The realistic reconstruction of street scenes is critical for developing real-world simulators in autonomous driving. Most existing methods rely on object pose annotations, using these poses to reconstruct dynamic objects and move them during the rendering process. This dependence on high-precision object annotations limits large-scale and extensive scene reconstruction. To address this challenge, we propose Bézier curve Gaussian splatting (BézierGS), which represents the motion trajectories of dynamic objects using learnable Bézier curves. This approach fully leverages the temporal information of dynamic objects and, through learnable curve modeling, automatically corrects pose errors. By introducing additional supervision on dynamic object rendering and inter-curve consistency constraints, we achieve reasonable and accurate separation and reconstruction of scene elements. Extensive experiments on the Waymo Open Dataset and the nuPlan benchmark demonstrate that BézierGS outperforms state-of-the-art alternatives in both dynamic and static scene components reconstruction and novel view synthesis.

[731] arXiv:2506.22403 (replaced) [pdf, other]
Title: HyperCLOVA X THINK Technical Report
NAVER Cloud HyperCLOVA X Team
Comments: 50 pages, 13 figures; fixed figures in the appendix
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

We introduce HyperCLOVA X THINK, the first reasoning-focused large language model in the HyperCLOVA X family, pre-trained on roughly $6$ trillion high-quality Korean, and English tokens, augmented with targeted synthetic Korean data. It was implemented as a compute-memory-balanced Peri-LN Transformer scaled with $\mu$P, pre-trained through a three-stage curriculum that expands the context window to $128$K tokens, and post-trained via supervised fine-tuning with Reinforcement Learning from Verifiable Rewards supports both detailed rationale and concise-answer modes. It delivers competitive performance against similarly sized models on Korea-focused benchmarks such as KMMLU, CSAT, KoBALT-700, HAERAE-1.0, and KoBigBench, while preserving robust bilingual consistency and translation quality. In addition, a vision-augmented variant matches or exceeds GPT-4.1 on the KCSAT STEM benchmark, all of which are achieved with substantially lower training compute than existing models of similar sizes. We also present a pruning and distillation technique that will soon be applied to HyperCLOVA X THINK for an open-source and business-friendly foundation model. Altogether, these capabilities position HyperCLOVA X THINK as a robust foundation for Korean AI innovation and a valuable resource for the global research community.

[732] arXiv:2506.22419 (replaced) [pdf, other]
Title: The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements
Bingchen Zhao, Despoina Magka, Minqi Jiang, Xian Li, Roberta Raileanu, Tatiana Shavrina, Jean-Christophe Gagnon-Audet, Kelvin Niu, Shagun Sodhani, Michael Shvartsman, Andrei Lupu, Alisia Lupidi, Edan Toledo, Karen Hambardzumyan, Martin Josifoski, Thomas Foster, Lucia Cipolina-Kun, Abhishek Charnalia, Derek Dunfield, Alexander H. Miller, Oisin Mac Aodha, Jakob Foerster, Yoram Bachrach
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Rapid advancements in large language models (LLMs) have the potential to assist in scientific progress. A critical capability toward this endeavor is the ability to reproduce existing work. To evaluate the ability of AI agents to reproduce results in an active research area, we introduce the Automated LLM Speedrunning Benchmark, leveraging the research community contributions on the NanoGPT speedrun, a competition to train a GPT-2 model in the shortest time. Each of the 19 speedrun tasks provides the agent with the previous records training script, optionally paired with one of three hint formats, ranging from pseudocode to paper-like descriptions of the new records improvements. Records execute quickly by design and speedrun improvements encompass diverse code-level changes, ranging from high-level algorithmic advancements to hardware-aware optimizations. These features make the benchmark both accessible and realistic for the frontier problem of improving LLM training. We find that recent reasoning LLMs combined with SoTA scaffolds struggle to reimplement already-known innovations in our benchmark, even when given detailed hints. Our benchmark thus provides a simple, non-saturated measure of an LLMs ability to automate scientific reproduction, a necessary (but not sufficient) skill for an autonomous research agent.

[733] arXiv:2506.22523 (replaced) [pdf, other]
Title: Red Teaming for Generative AI, Report on a Copyright-Focused Exercise Completed in an Academic Medical Center
James Wen, Sahil Nalawade, Zhiwei Liang, Catherine Bielick, Marisa Ferrara Boston, Alexander Chowdhury, Adele Collin, Luigi De Angelis, Jacob Ellen, Heather Frase, Rodrigo R. Gameiro, Juan Manuel Gutierrez, Pooja Kadam, Murat Keceli, Srikanth Krishnamurthy, Anne Kwok, Yanan Lance Lu, Heather Mattie, Liam G. McCoy, Katherine Miller, Allison C. Morgan, Marlene Louisa Moerig, Trang Nguyen, Alexander Owen-Post, Alex D. Ruiz, Sreekar Reddy Puchala, Soujanya Samineni, Takeshi Tohyama, Varun Ullanat, Carmine Valenza, Camilo Velez, Pengcheng Wang, Anna Wuest, Yuxiang Zhou, Yingde Zhu, Jason M. Johnson, Naomi Lenane, Jennifer Willcox, Francis J. Vitiello, Leo Anthony G. Celi, Renato Umeton
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Background: Generative artificial intelligence (AI) deployment in healthcare settings raises copyright compliance concerns. Dana-Farber Cancer Institute implemented GPT4DFCI, an internal generative AI tool utilizing OpenAI models, that is approved for enterprise use in research and operations. Given (i) the exceptionally broad adoption of the tool in our organization, (ii) our research mission, and (iii) the shared responsibility model required by Microsoft OpenAI products, we deemed rigorous copyright compliance testing necessary.
Case Description: We conducted a structured red teaming exercise in Nov. 2024, with 42 participants from academic, industry, and government institutions. Four teams attempted to extract copyrighted content from GPT4DFCI across four domains: literary works, news articles, scientific publications, and access-restricted clinical notes. Teams successfully extracted verbatim book dedications and near-exact passages through indirect prompting strategies. News article extraction failed despite jailbreak attempts. Scientific article reproduction yielded only high-level summaries. Clinical note testing revealed appropriate privacy safeguards with data reformatting rather than reproduction.
Discussion: The successful extraction of literary content indicates potential copyright material presence in training data, necessitating enhanced inference-time filtering. Differential success rates across content types suggest varying protective mechanisms. The event led to implementation of a copyright-specific meta-prompt in GPT4DFCI; this mitigation is in production since Jan. 2025.
Conclusion: Systematic red teaming revealed specific vulnerabilities in generative AI copyright compliance, leading to concrete mitigation strategies. Academic medical institutions deploying generative AI must implement continuous testing protocols to ensure legal and ethical compliance.

[734] arXiv:2506.22554 (replaced) [pdf, html, other]
Title: Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset
Vasu Agrawal, Akinniyi Akinyemi, Kathryn Alvero, Morteza Behrooz, Julia Buffalini, Fabio Maria Carlucci, Joy Chen, Junming Chen, Zhang Chen, Shiyang Cheng, Praveen Chowdary, Joe Chuang, Antony D'Avirro, Jon Daly, Ning Dong, Mark Duppenthaler, Cynthia Gao, Jeff Girard, Martin Gleize, Sahir Gomez, Hongyu Gong, Srivathsan Govindarajan, Brandon Han, Sen He, Denise Hernandez, Yordan Hristov, Rongjie Huang, Hirofumi Inaguma, Somya Jain, Raj Janardhan, Qingyao Jia, Christopher Klaiber, Dejan Kovachev, Moneish Kumar, Hang Li, Yilei Li, Pavel Litvin, Wei Liu, Guangyao Ma, Jing Ma, Martin Ma, Xutai Ma, Lucas Mantovani, Sagar Miglani, Sreyas Mohan, Louis-Philippe Morency, Evonne Ng, Kam-Woh Ng, Tu Anh Nguyen, Amia Oberai, Benjamin Peloquin, Juan Pino, Jovan Popovic, Omid Poursaeed, Fabian Prada, Alice Rakotoarison, Rakesh Ranjan, Alexander Richard, Christophe Ropers, Safiyyah Saleem, Vasu Sharma, Alex Shcherbyna, Jia Shen, Jie Shen, Anastasis Stathopoulos, Anna Sun, Paden Tomasello, Tuan Tran, Arina Turkatenko, Bo Wan, Chao Wang, Jeff Wang, Mary Williamson, Carleigh Wood, Tao Xiang, Yilin Yang, Julien Yao, Chen Zhang, Jiemin Zhang, Xinyue Zhang, Jason Zheng, Pavlo Zhyzheria, Jan Zikes, Michael Zollhoefer
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Human communication involves a complex interplay of verbal and nonverbal signals, essential for conveying meaning and achieving interpersonal goals. To develop socially intelligent AI technologies, it is crucial to develop models that can both comprehend and generate dyadic behavioral dynamics. To this end, we introduce the Seamless Interaction Dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage from over 4,000 participants in diverse contexts. This dataset enables the development of AI technologies that understand dyadic embodied dynamics, unlocking breakthroughs in virtual agents, telepresence experiences, and multimodal content analysis tools. We also develop a suite of models that utilize the dataset to generate dyadic motion gestures and facial expressions aligned with human speech. These models can take as input both the speech and visual behavior of their interlocutors. We present a variant with speech from an LLM model and integrations with 2D and 3D rendering methods, bringing us closer to interactive virtual agents. Additionally, we describe controllable variants of our motion models that can adapt emotional responses and expressivity levels, as well as generating more semantically-relevant gestures. Finally, we discuss methods for assessing the quality of these dyadic motion models, which are demonstrating the potential for more intuitive and responsive human-AI interactions.

[735] arXiv:2506.22615 (replaced) [pdf, html, other]
Title: Error Estimates for the Arnoldi Approximation of a Matrix Square Root
James H. Adler, Xiaozhe Hu, Wenxiao Pan, Zhongqin Xue
Subjects: Numerical Analysis (math.NA)

The Arnoldi process provides an efficient framework for approximating functions of a matrix applied to a vector, i.e., of the form $f(M)\mathbf{b}$, by repeated matrix-vector multiplications. In this paper, we derive an \textit{a priori} error estimate for approximating the action of a matrix square root using the Arnoldi process, where the integral representation of the error is reformulated in terms of the error for solving the linear system $M\mathbf{x}=\mathbf{b}$. The results extend the error analysis of the Lanczos method for Hermitian matrices in [Chen et al., SIAM J. Matrix Anal. Appl., 2022] to non-Hermitian cases. Furthermore, to make the method applicable to large-scale problems, we assume that the matrices are preprocessed utilizing data-sparse approximations preserving positive definiteness, and then establish a refined error bound in this setting. The numerical results on matrices with different structures demonstrate that our theoretical analysis yields a reliable upper bound. Finally, simulations on large-scale matrices arising in particulate suspensions validate the effectiveness and practicality of the approach.

[736] arXiv:2506.22664 (replaced) [pdf, other]
Title: Hybrid Explicit-Implicit Predictor-Corrector Exponential Time-Differencing Multistep Padé Schemes for Semilinear Parabolic Equations with Time-Delay
Haishen Dai, Huan Lei
Comments: The paper requires substantial revisions for clarity and correctness
Subjects: Numerical Analysis (math.NA)

In this paper, we propose and analyze ETD-Multistep-Padé (ETD-MS-Padé) and ETD Implicit Multistep-Padé (ETD-IMS-Padé) for semilinear parabolic delay differential equations with smooth solutions. In our previous work [15], we proposed ETD-RK-Padé scheme to compute high-order numerical solutions for nonlinear parabolic reaction-diffusion equation with constant time delay. However, the based ETD-RK numerical scheme in [15] is very complex and the corresponding calculation program is also very complicated. We propose in this paper ETD-MS-Padé and ETD-IMS-Padé schemes for the solution of semilinear parabolic equations with delay. We synergize the ETD-MS-Padé with ETD-IMS-Padé to construct efficient predictor-corrector scheme. This new predictor-corrector scheme will become an important tool for solving the numerical solutions of parabolic differential equations. Remarkably, we also conducted experiments in Table$10$ to compare the numerical results of the predictor-corrector scheme with the EERK scheme proposed in paper [42]. The predictor-corrector scheme demonstrated better convergence.
The main idea is to employ an ETD-based Adams multistep extrapolation for the time integration of the corresponding equation. To overcome the well-known numerical instability associated with computing the exponential operator, we utilize the Padé approach to approximate this exponential operator. This methodology leads to the development of the ETD-MS-Padé and ETD-IMS-Padé schemes, applicable even for arbitrary time orders. We validate the ETD-MS1,2,3,4-Padé schemes and ETD-IMS2,3,4 schemes through numerical experiments.

[737] arXiv:2506.22698 (replaced) [pdf, other]
Title: Text Production and Comprehension by Human and Artificial Intelligence: Interdisciplinary Workshop Report
Emily Dux Speltz
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

This report synthesizes the outcomes of a recent interdisciplinary workshop that brought together leading experts in cognitive psychology, language learning, and artificial intelligence (AI)-based natural language processing (NLP). The workshop, funded by the National Science Foundation, aimed to address a critical knowledge gap in our understanding of the relationship between AI language models and human cognitive processes in text comprehension and composition. Through collaborative dialogue across cognitive, linguistic, and technological perspectives, workshop participants examined the underlying processes involved when humans produce and comprehend text, and how AI can both inform our understanding of these processes and augment human capabilities. The workshop revealed emerging patterns in the relationship between large language models (LLMs) and human cognition, with highlights on both the capabilities of LLMs and their limitations in fully replicating human-like language understanding and generation. Key findings include the potential of LLMs to offer insights into human language processing, the increasing alignment between LLM behavior and human language processing when models are fine-tuned with human feedback, and the opportunities and challenges presented by human-AI collaboration in language tasks. By synthesizing these findings, this report aims to guide future research, development, and implementation of LLMs in cognitive psychology, linguistics, and education. It emphasizes the importance of ethical considerations and responsible use of AI technologies while striving to enhance human capabilities in text comprehension and production through effective human-AI collaboration.

[738] arXiv:2506.22729 (replaced) [pdf, html, other]
Title: Persistence Paradox in Dynamic Science
Honglin Bao, Kai Li
Subjects: Digital Libraries (cs.DL); Computers and Society (cs.CY); Machine Learning (cs.LG)

Persistence is often regarded as a virtue in science. In this paper, however, we challenge this conventional view by highlighting its contextual nature, particularly how persistence can become a liability during periods of paradigm shift. We focus on the deep learning revolution catalyzed by AlexNet in 2012. Analyzing the 20-year career trajectories of over 5,000 scientists who were active in top machine learning venues during the preceding decade, we examine how their research focus and output evolved. We first uncover a dynamic period in which leading venues increasingly prioritized cutting-edge deep learning developments that displaced relatively traditional statistical learning methods. Scientists responded to these changes in markedly different ways. Those who were previously successful or affiliated with old teams adapted more slowly, experiencing what we term a rigidity penalty - a reluctance to embrace new directions leading to a decline in scientific impact, as measured by citation percentile rank. In contrast, scientists who pursued strategic adaptation - selectively pivoting toward emerging trends while preserving weak connections to prior expertise - reaped the greatest benefits. Taken together, our macro- and micro-level findings show that scientific breakthroughs act as mechanisms that reconfigure power structures within a field.

[739] arXiv:2506.22773 (replaced) [pdf, html, other]
Title: Not All Water Consumption Is Equal: A Water Stress Weighted Metric for Sustainable Computing
Yanran Wu, Inez Hua, Yi Ding
Comments: 7 pages, 9 figures, The 4th Workshop on Sustainable Computer Systems (HotCarbon'25), Cambridge, MA, July 10-11th, 2025
Journal-ref: ACM SIGEnergy Energy Informatics Review (EIR), Volume 5 Issue 2, July 2025
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Hardware Architecture (cs.AR); Computers and Society (cs.CY); Machine Learning (cs.LG)

Water consumption is an increasingly critical dimension of computing sustainability, especially as AI workloads rapidly scale. However, current water impact assessment often overlooks where and when water stress is more severe. To fill in this gap, we present SCARF, the first general framework that evaluates water impact of computing by factoring in both spatial and temporal variations in water stress. SCARF calculates an Adjusted Water Impact (AWI) metric that considers both consumption volume and local water stress over time. Through three case studies on LLM serving, datacenters, and semiconductor fabrication plants, we show the hidden opportunities for reducing water impact by optimizing location and time choices, paving the way for water-sustainable computing. The code is available at this https URL.

[740] arXiv:2506.22774 (replaced) [pdf, other]
Title: Bridging Ethical Principles and Algorithmic Methods: An Alternative Approach for Assessing Trustworthiness in AI Systems
Michael Papademas, Xenia Ziouvelou, Antonis Troumpoukis, Vangelis Karkaletsis
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Artificial Intelligence (AI) technology epitomizes the complex challenges posed by human-made artifacts, particularly those widely integrated into society and exert significant influence, highlighting potential benefits and their negative consequences. While other technologies may also pose substantial risks, AI's pervasive reach makes its societal effects especially profound. The complexity of AI systems, coupled with their remarkable capabilities, can lead to a reliance on technologies that operate beyond direct human oversight or understanding. To mitigate the risks that arise, several theoretical tools and guidelines have been developed, alongside efforts to create technological tools aimed at safeguarding Trustworthy AI. The guidelines take a more holistic view of the issue but fail to provide techniques for quantifying trustworthiness. Conversely, while technological tools are better at achieving such quantification, they lack a holistic perspective, focusing instead on specific aspects of Trustworthy AI. This paper aims to introduce an assessment method that combines the ethical components of Trustworthy AI with the algorithmic processes of PageRank and TrustRank. The goal is to establish an assessment framework that minimizes the subjectivity inherent in the self-assessment techniques prevalent in the field by introducing algorithmic criteria. The application of our approach indicates that a holistic assessment of an AI system's trustworthiness can be achieved by providing quantitative insights while considering the theoretical content of relevant guidelines.

[741] arXiv:2506.22807 (replaced) [pdf, html, other]
Title: FreqDGT: Frequency-Adaptive Dynamic Graph Networks with Transformer for Cross-subject EEG Emotion Recognition
Yueyang Li, Shengyu Gong, Weiming Zeng, Nizhuan Wang, Wai Ting Siok
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Electroencephalography (EEG) serves as a reliable and objective signal for emotion recognition in affective brain-computer interfaces, offering unique advantages through its high temporal resolution and ability to capture authentic emotional states that cannot be consciously controlled. However, cross-subject generalization remains a fundamental challenge due to individual variability, cognitive traits, and emotional responses. We propose FreqDGT, a frequency-adaptive dynamic graph transformer that systematically addresses these limitations through an integrated framework. FreqDGT introduces frequency-adaptive processing (FAP) to dynamically weight emotion-relevant frequency bands based on neuroscientific evidence, employs adaptive dynamic graph learning (ADGL) to learn input-specific brain connectivity patterns, and implements multi-scale temporal disentanglement network (MTDN) that combines hierarchical temporal transformers with adversarial feature disentanglement to capture both temporal dynamics and ensure cross-subject robustness. Comprehensive experiments demonstrate that FreqDGT significantly improves cross-subject emotion recognition accuracy, confirming the effectiveness of integrating frequency-adaptive, spatial-dynamic, and temporal-hierarchical modeling while ensuring robustness to individual differences. The code is available at this https URL.

[742] arXiv:2506.22828 (replaced) [pdf, html, other]
Title: Model-theoretic Forcing in Transition Algebra
Go Hashimoto, Daniel Găină
Subjects: Logic in Computer Science (cs.LO)

We study Löwenheim-Skolem and Omitting Types theorems in Transition Algebra, a logical system obtained by enhancing many sorted first-order logic with features from dynamic logic. The sentences we consider include compositions, unions, and transitive closures of transition relations, which are treated similarly to actions in dynamic logics to define necessity and possibility operators. We show that Upward Löwenheim-Skolem theorem, any form of compactness, and joint Robinson consistency property fail due to the expressivity of transitive closures of transitions. In this non-compact many-sorted logical system, we develop a forcing technique method by generalizing the classical method of forcing used by Keisler to prove Omitting Types theorem. Instead of working within a single signature, we work with a directed diagram of signatures, which allows us to establish Downward Löwenheim-Skolem and Omitting Types theorems despite the fact that models interpret sorts as sets, possibly empty. Building on a complete system of proof rules for Transition Algebra, we extend it with additional proof rules to reason about constructor-based and/or finite transition algebras. We then establish the completeness of this extended system for a fragment of Transition Algebra obtained by restricting models to constructor-based and/or finite transition algebras.

[743] arXiv:2506.22832 (replaced) [pdf, html, other]
Title: Listener-Rewarded Thinking in VLMs for Image Preferences
Alexander Gambashidze, Li Pengyi, Matvey Skripkin, Andrey Galichin, Anton Gusarov, Konstantin Sobolev, Andrey Kuznetsov, Ivan Oseledets
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Training robust and generalizable reward models for human visual preferences is essential for aligning text-to-image and text-to-video generative models with human intent. However, current reward models often fail to generalize, and supervised fine-tuning leads to memorization, demanding complex annotation pipelines. While reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), improves generalization, we uncover a key failure mode: a significant drop in reasoning accuracy occurs when a model's reasoning trace contradicts that of an independent, frozen vision-language model ("listener") evaluating the same output. To address this, we introduce a listener-augmented GRPO framework. Here, the listener re-evaluates the reasoner's chain-of-thought to provide a dense, calibrated confidence score, shaping the RL reward signal. This encourages the reasoner not only to answer correctly, but to produce explanations that are persuasive to an independent model. Our listener-shaped reward scheme achieves best accuracy on the ImageReward benchmark (67.4%), significantly improves out-of-distribution (OOD) performance on a large-scale human preference dataset (1.2M votes, up to +6% over naive reasoner), and reduces reasoning contradictions compared to strong GRPO and SFT baselines. These results demonstrate that listener-based rewards provide a scalable, data-efficient path to aligning vision-language models with nuanced human preferences. We will release our reasoning model here: this https URL.

[744] arXiv:2506.22919 (replaced) [pdf, html, other]
Title: Hecto: Modular Sparse Experts for Adaptive and Interpretable Reasoning
Sanskar Pandey, Ruhaan Chopra, Saad Murtaza Bhat, Ark Abhyudaya
Subjects: Artificial Intelligence (cs.AI)

Mixture-of-Experts (MoE) models enable conditional computation by routing inputs to specialized experts, but these experts rely on identical inductive biases, thus limiting representational diversity. This static computation pathway is inefficient for inputs that require different types of reasoning and limits specialization and interpretability. We propose Hecto, a lightweight MoE architecture that leverages architectural heterogeneity by combining a GRU expert for temporal reasoning and an FFNN expert for static abstraction under a sparse Top-1 gating mechanism. Evaluated on three reasoning benchmarks (AG News, SST-2, HotpotQA) and a regression task (STS-B), Hecto matches or closely trails homogeneous baselines in performance despite receiving isolated input representations, while achieving clear expert specialization, with each expert aligning to distinct reasoning types (temporal vs static). At larger batch sizes, Hecto exhibits improved performance, benefiting from relaxed computational constraints that allow its heterogeneous architecture to optimize more effectively. Ablation results isolate architectural diversity as the source of Hecto's stability and interpretability across diverse reasoning tasks. Overall, Hecto establishes itself as a new benchmark for conditional computation, offering a principled framework for specialized reasoning in low-resource regimes with its model strength derived from principled specialization.

[745] arXiv:2506.22968 (replaced) [pdf, html, other]
Title: Against 'softmaxing' culture
Daniel Mwesigwa
Comments: 7 pages
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

AI is flattening culture. Evaluations of "culture" are showing the myriad ways in which large AI models are homogenizing language and culture, averaging out rich linguistic differences into generic expressions. I call this phenomenon "softmaxing culture,'' and it is one of the fundamental challenges facing AI evaluations today. Efforts to improve and strengthen evaluations of culture are central to the project of cultural alignment in large AI systems. This position paper argues that machine learning (ML) and human-computer interaction (HCI) approaches to evaluation are limited. I propose two key conceptual shifts. First, instead of asking "what is culture?" at the start of system evaluations, I propose beginning with the question: "when is culture?" Second, while I acknowledge the philosophical claim that cultural universals exist, the challenge is not simply to describe them, but to situate them in relation to their particulars. Taken together, these conceptual shifts invite evaluation approaches that move beyond technical requirements toward perspectives that are more responsive to the complexities of culture.

[746] arXiv:2506.22971 (replaced) [pdf, html, other]
Title: Hierarchical Decentralized Stochastic Control for Cyber-Physical Systems
Kesav Kaza, Ramachandran Anantharaman, Rahul Meshram
Comments: 6 pages, 2 figures
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Optimization and Control (math.OC)

This paper presents a two-timescale hierarchical decentralized architecture for control of Cyber-Physical Systems. The architecture consists of $N$ independent sub-processes, a global controller, and $N$ local controllers, each formulated as a Markov Decision Process (MDP). The global controller, operating at a slower timescale optimizes the infinite-horizon discounted cumulative reward under budget constraints. For the local controllers, operating at a faster timescale, we propose two different optimization frameworks, namely the COpt and FOpt. In the COpt framework, the local controller also optimizes an infinite-horizon MDP, while in the FOpt framework, the local controller optimizes a finite-horizon MDP. The FOpt framework mimics a federal structure, where the local controllers have more autonomy in their decision making. First, the existence of stationary deterministic optimal policies for both these frameworks is established. Then, various relationships between the two frameworks are studied, including a bound on the difference between the two optimal value functions. Additionally, sufficiency conditions are provided such that the two frameworks lead to the same optimal values.

[747] arXiv:2506.23044 (replaced) [pdf, other]
Title: Ovis-U1 Technical Report
Guo-Hua Wang, Shanshan Zhao, Xinjie Zhang, Liangfu Cao, Pengxin Zhan, Lunhao Duan, Shiyin Lu, Minghao Fu, Xiaohao Chen, Jianshan Zhao, Yang Li, Qing-Guo Chen
Comments: An unified model for multimodal understanding, text-to-image generation, and image editing. GitHub: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

In this report, we introduce Ovis-U1, a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing capabilities. Building on the foundation of the Ovis series, Ovis-U1 incorporates a diffusion-based visual decoder paired with a bidirectional token refiner, enabling image generation tasks comparable to leading models like GPT-4o. Unlike some previous models that use a frozen MLLM for generation tasks, Ovis-U1 utilizes a new unified training approach starting from a language model. Compared to training solely on understanding or generation tasks, unified training yields better performance, demonstrating the enhancement achieved by integrating these two tasks. Ovis-U1 achieves a score of 69.6 on the OpenCompass Multi-modal Academic Benchmark, surpassing recent state-of-the-art models such as Ristretto-3B and SAIL-VL-1.5-2B. In text-to-image generation, it excels with scores of 83.72 and 0.89 on the DPG-Bench and GenEval benchmarks, respectively. For image editing, it achieves 4.00 and 6.42 on the ImgEdit-Bench and GEdit-Bench-EN, respectively. As the initial version of the Ovis unified model series, Ovis-U1 pushes the boundaries of multimodal understanding, generation, and editing.

[748] arXiv:2506.23116 (replaced) [pdf, other]
Title: A User Experience 3.0 (UX 3.0) Paradigm Framework: Designing for Human-Centered AI Experiences
Wei Xu
Subjects: Human-Computer Interaction (cs.HC)

User experience (UX) practices have evolved in stages and are entering a transformative phase (UX 3.0), driven by AI technologies and shifting user needs. Human-centered AI (HCAI) experiences are emerging, necessitating new UX approaches to support UX practices in the AI era. We propose a UX 3.0 paradigm framework to respond and guide UX practices in developing HCAI systems.

[749] arXiv:2506.23126 (replaced) [pdf, html, other]
Title: ParticleFormer: A 3D Point Cloud World Model for Multi-Object, Multi-Material Robotic Manipulation
Suning Huang, Qianzhong Chen, Xiaohan Zhang, Jiankai Sun, Mac Schwager
Subjects: Robotics (cs.RO)

3D world models (i.e., learning-based 3D dynamics models) offer a promising approach to generalizable robotic manipulation by capturing the underlying physics of environment evolution conditioned on robot actions. However, existing 3D world models are primarily limited to single-material dynamics using a particle-based Graph Neural Network model, and often require time-consuming 3D scene reconstruction to obtain 3D particle tracks for training. In this work, we present ParticleFormer, a Transformer-based point cloud world model trained with a hybrid point cloud reconstruction loss, supervising both global and local dynamics features in multi-material, multi-object robot interactions. ParticleFormer captures fine-grained multi-object interactions between rigid, deformable, and flexible materials, trained directly from real-world robot perception data without an elaborate scene reconstruction. We demonstrate the model's effectiveness both in 3D scene forecasting tasks, and in downstream manipulation tasks using a Model Predictive Control (MPC) policy. In addition, we extend existing dynamics learning benchmarks to include diverse multi-material, multi-object interaction scenarios. We validate our method on six simulation and three real-world experiments, where it consistently outperforms leading baselines by achieving superior dynamics prediction accuracy and less rollout error in downstream visuomotor tasks. Experimental videos are available at this https URL.

[750] arXiv:2506.23129 (replaced) [pdf, html, other]
Title: Flatness-based Finite-Horizon Multi-UAV Formation Trajectory Planning and Directionally Aware Collision Avoidance Tracking
Hossein B. Jond, Logan Beaver, Martin Jiroušek, Naiemeh Ahmadlou, Veli Bakırcıoğlu, Martin Saska
Comments: Accepted for Journal of the Franklin Institute
Subjects: Robotics (cs.RO)

Optimal collision-free formation control of the unmanned aerial vehicle (UAV) is a challenge. The state-of-the-art optimal control approaches often rely on numerical methods sensitive to initial guesses. This paper presents an innovative collision-free finite-time formation control scheme for multiple UAVs leveraging the differential flatness of the UAV dynamics, eliminating the need for numerical methods. We formulate a finite-time optimal control problem to plan a formation trajectory for feasible initial states. This optimal control problem in formation trajectory planning involves a collective performance index to meet the formation requirements to achieve relative positions and velocity consensus. It is solved by applying Pontryagin's principle. Subsequently, a collision-constrained regulating problem is addressed to ensure collision-free tracking of the planned formation trajectory. The tracking problem incorporates a directionally aware collision avoidance strategy that prioritizes avoiding UAVs in the forward path and relative approach. It assigns lower priority to those on the sides with an oblique relative approach, disregarding UAVs behind and not in the relative approach. The high-fidelity simulation results validate the effectiveness of the proposed control scheme.

[751] arXiv:2506.23137 (replaced) [pdf, html, other]
Title: Flow-Modulated Scoring for Semantic-Aware Knowledge Graph Completion
Siyuan Li, Ruitong Liu, Yan Wen, Te Sun
Comments: 10 pages
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Effective modeling of multifaceted relations is pivotal for Knowledge Graph Completion (KGC). However, a majority of existing approaches are predicated on static, embedding-based scoring, exhibiting inherent limitations in capturing contextual dependencies and relational dynamics. Addressing this gap, we propose the Flow-Modulated Scoring (FMS) framework. FMS comprises two principal components: (1) a semantic context learning module that encodes context-sensitive entity representations, and (2) a conditional flow-matching module designed to learn the dynamic transformation from a head to a tail embedding, governed by the aforementioned context. The resultant predictive vector field, representing the context-informed relational path, serves to dynamically refine the initial static score of an entity pair. Through this synergy of context-aware static representations and conditioned dynamic information, FMS facilitates a more profound modeling of relational semantics. Comprehensive evaluations on several standard benchmarks demonstrate that our proposed method surpasses prior state-of-the-art results.

[752] arXiv:2506.23146 (replaced) [pdf, other]
Title: Learning-to-Context Slope: Evaluating In-Context Learning Effectiveness Beyond Performance Illusions
Dingzriui Wang, Xuanliang Zhang, Keyan Xu, Qingfu Zhu, Wanxiang Che, Yang Deng
Subjects: Computation and Language (cs.CL)

In-context learning (ICL) has emerged as an effective approach to enhance the performance of large language models (LLMs). However, its effectiveness varies significantly across models and tasks, posing challenges for practitioners to determine when ICL reliably improves performance. Current evaluation approaches, reliant on performance change after applying ICL, suffer from low reliability, poor attribution, and impracticality in data-insufficient scenarios. We propose the Learning-to-Context Slope (LCS), a novel metric that quantifies ICL effectiveness by modeling the slope between learning gain (loss decrease from demonstrations) and contextual relevance (demonstration-input relevance). LCS addresses key limitations of performance-based metrics: (1) it captures continuous loss changes even when outputs are incorrect, improving reliability; (2) its formulation attributes ICL failures to weak contextual alignment (inability to adapt inputs to demonstrations) or strong output calibration (self-verification of correctness); and (3) it minimizes reliance on labeled data via synthetic evaluation. Extensive experiments demonstrate that LCS strongly correlates with performance improvements in labeled settings and reliably reflects true effectiveness in biased or data-scarce scenarios. Further analysis reveals actionable thresholds for LCS and identifies model capabilities critical to ICL success.

[753] arXiv:2506.23152 (replaced) [pdf, html, other]
Title: DexH2R: A Benchmark for Dynamic Dexterous Grasping in Human-to-Robot Handover
Youzhuo Wang, Jiayi Ye, Chuyang Xiao, Yiming Zhong, Heng Tao, Hang Yu, Yumeng Liu, Jingyi Yu, Yuexin Ma
Subjects: Robotics (cs.RO)

Handover between a human and a dexterous robotic hand is a fundamental yet challenging task in human-robot collaboration. It requires handling dynamic environments and a wide variety of objects and demands robust and adaptive grasping strategies. However, progress in developing effective dynamic dexterous grasping methods is limited by the absence of high-quality, real-world human-to-robot handover datasets. Existing datasets primarily focus on grasping static objects or rely on synthesized handover motions, which differ significantly from real-world robot motion patterns, creating a substantial gap in applicability. In this paper, we introduce DexH2R, a comprehensive real-world dataset for human-to-robot handovers, built on a dexterous robotic hand. Our dataset captures a diverse range of interactive objects, dynamic motion patterns, rich visual sensor data, and detailed annotations. Additionally, to ensure natural and human-like dexterous motions, we utilize teleoperation for data collection, enabling the robot's movements to align with human behaviors and habits, which is a crucial characteristic for intelligent humanoid robots. Furthermore, we propose an effective solution, DynamicGrasp, for human-to-robot handover and evaluate various state-of-the-art approaches, including auto-regressive models and diffusion policy methods, providing a thorough comparison and analysis. We believe our benchmark will drive advancements in human-to-robot handover research by offering a high-quality dataset, effective solutions, and comprehensive evaluation metrics.

[754] arXiv:2506.23431 (replaced) [pdf, html, other]
Title: Pipelined Decoder for Efficient Context-Aware Text Generation
Zixian Huang, Chenxu Niu, Yu Gu, Gengyang Xiao, Xinwei Huang, Gong Cheng
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

As the basis of generative AI, an autoregressive model requires the generation of a new token depending on all the previously generated tokens, which brings high quality but also restricts the model to generate tokens one by one, forming a bottleneck limiting the generation speed. In this paper, we propose a new decoder architecture that efficiently generates text in parallel for context-aware generation tasks. Our proposed pipelined decoder initiates the generation of multiple subsequences simultaneously, and, at each time-step, it generates a new token for each subsequence to realize parallelism. Experiments on multiple text generation tasks, including question answering, text summarization, and keyphrase generation, show that our pipelined decoder significantly improves the generation speed without a significant loss of generation quality or additional memory consumption.

[755] arXiv:2506.23458 (replaced) [pdf, html, other]
Title: Neuro-Informed Joint Learning Enhances Cognitive Workload Decoding in Portable BCIs
Xiaoxiao Yang, Chao Feng, Jiancheng Chen
Comments: 2 pages short paper
Subjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Portable and wearable consumer-grade electroencephalography (EEG) devices, like Muse headbands, offer unprecedented mobility for daily brain-computer interface (BCI) applications, including cognitive load detection. However, the exacerbated non-stationarity in portable EEG signals constrains data fidelity and decoding accuracy, creating a fundamental trade-off between portability and performance. To mitigate such limitation, we propose MuseCogNet (Muse-based Cognitive Network), a unified joint learning framework integrating self-supervised and supervised training paradigms. In particular, we introduce an EEG-grounded self-supervised reconstruction loss based on average pooling to capture robust neurophysiological patterns, while cross-entropy loss refines task-specific cognitive discriminants. This joint learning framework resembles the bottom-up and top-down attention in humans, enabling MuseCogNet to significantly outperform state-of-the-art methods on a publicly available Muse dataset and establish an implementable pathway for neurocognitive monitoring in ecological settings.

[756] arXiv:2506.23491 (replaced) [pdf, html, other]
Title: ZonUI-3B: A Lightweight Vision-Language Model for Cross-Resolution GUI Grounding
ZongHan Hsieh, Tzer-Jen Wei, ShengJing Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

This paper introduces ZonUI-3B, a lightweight Vision-Language Model (VLM) specifically designed for Graphical User Interface grounding tasks, achieving performance competitive with significantly larger models. Unlike large-scale VLMs (>7B parameters) that are computationally intensive and impractical for consumer-grade hardware, ZonUI-3B delivers strong grounding accuracy while being fully trainable on a single GPU (RTX 4090). The model incorporates several key innovations: (i) combine cross-platform, multi-resolution dataset of 24K examples from diverse sources including mobile, desktop, and web GUI screenshots to effectively address data scarcity in high-resolution desktop environments; (ii) a two-stage fine-tuning strategy, where initial cross-platform training establishes robust GUI understanding, followed by specialized fine-tuning on high-resolution data to significantly enhance model adaptability; and (iii) data curation and redundancy reduction strategies, demonstrating that randomly sampling a smaller subset with reduced redundancy achieves performance comparable to larger datasets, emphasizing data diversity over sheer volume. Empirical evaluation on standard GUI grounding benchmarks-including ScreenSpot, ScreenSpot-v2, and the challenging ScreenSpot-Pro, highlights ZonUI-3B's exceptional accuracy, achieving 84.9% on ScreenSpot and 86.4% on ScreenSpot-v2, surpassing prior models under 4B parameters. Ablation studies validate the critical role of balanced sampling and two-stage fine-tuning in enhancing robustness, particularly in high-resolution desktop scenarios. The ZonUI-3B is available at: this https URL

[757] arXiv:2506.23520 (replaced) [pdf, html, other]
Title: ChemActor: Enhancing Automated Extraction of Chemical Synthesis Actions with LLM-Generated Data
Yu Zhang, Ruijie Yu, Jidong Tian, Feng Zhu, Jiapeng Liu, Xiaokang Yang, Yaohui Jin, Yanyan Xu
Subjects: Artificial Intelligence (cs.AI)

With the increasing interest in robotic synthesis in the context of organic chemistry, the automated extraction of chemical procedures from literature is critical. However, this task remains challenging due to the inherent ambiguity of chemical language and the high cost of human annotation required for developing reliable computer-aided extraction protocols. Here, we present ChemActor, a fully fine-tuned large language model (LLM), as a chemical executor to convert between unstructured experimental procedures and structured action sequences. We propose a sequential LLM-generated data framework to address the challenges of insufficient and low-quality annotated data. This framework integrates a data selection module that selects data based on distribution divergence, with a general-purpose LLM, to generate machine-executable actions from a single molecule input. Additionally, we introduce a novel multi-round LLMs circle review metric, which reflects the model's advanced understanding of chemical experimental procedures. Extensive experiments on reaction-to-description (R2D) and description-to-action (D2A) tasks demonstrate that ChemActor, augmented by LLM-generated data, achieves state-of-the-art performance, outperforming the baseline model by 10%. The code is available at: this https URL.

[758] arXiv:2506.23544 (replaced) [pdf, html, other]
Title: Both Asymptotic and Non-Asymptotic Convergence of Quasi-Hyperbolic Momentum using Increasing Batch Size
Kento Imaizumi, Hideaki Iiduka
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)

Momentum methods were originally introduced for their superiority to stochastic gradient descent (SGD) in deterministic settings with convex objective functions. However, despite their widespread application to deep neural networks -- a representative case of stochastic nonconvex optimization -- the theoretical justification for their effectiveness in such settings remains limited. Quasi-hyperbolic momentum (QHM) is an algorithm that generalizes various momentum methods and has been studied to better understand the class of momentum-based algorithms as a whole. In this paper, we provide both asymptotic and non-asymptotic convergence results for mini-batch QHM with an increasing batch size. We show that achieving asymptotic convergence requires either a decaying learning rate or an increasing batch size. Since a decaying learning rate adversely affects non-asymptotic convergence, we demonstrate that using mini-batch QHM with an increasing batch size -- without decaying the learning rate -- can be a more effective strategy. Our experiments show that even a finite increase in batch size can provide benefits for training neural networks.

[759] arXiv:2506.23638 (replaced) [pdf, html, other]
Title: Simple Approximations for General Spanner Problems
Fritz Bökler, Markus Chimani, Henning Jasper
Subjects: Data Structures and Algorithms (cs.DS); Discrete Mathematics (cs.DM); Combinatorics (math.CO)

Consider a graph with n nodes and m edges, independent edge weights and lengths, and arbitrary distance demands for node pairs. The spanner problem asks for a minimum-weight subgraph that satisfies these demands via sufficiently short paths w.r.t. the edge lengths. For multiplicative alpha-spanners (where demands equal alpha times the original distances) and assuming that each edge's weight equals its length, the simple Greedy heuristic by Althöfer et al. (1993) is known to yield strong solutions, both in theory and practice. To obtain guarantees in more general settings, recent approximations typically abandon this simplicity and practicality. Still, so far, there is no known non-trivial approximation algorithm for the spanner problem in its most general form. We provide two surprisingly simple approximations algorithms. In general, our Augmented Greedy achieves the first unconditional approximation ratio of m, which is non-trivial due to the independence of weights and lengths. Crucially, it maintains all size and weight guarantees Greedy is known for, i.e., in the aforementioned multiplicative alpha-spanner scenario and even for additive +beta-spanners. Further, it generalizes some of these size guarantees to derive new weight guarantees. Our second approach, Randomized Rounding, establishes a graph transformation that allows a simple rounding scheme over a standard multicommodity flow LP. It yields an O(n log n)-approximation, assuming integer lengths and polynomially bounded distance demands. The only other known approximation guarantee in this general setting requires several complex subalgorithms and analyses, yet we match it up to a factor of O(n^{1/5-eps}) using standard tools. Further, on bounded-degree graphs, we yield the first O(log n) approximation ratio for constant-bounded distance demands (beyond multiplicative 2-spanners in unit-length graphs).

[760] arXiv:2506.23657 (replaced) [pdf, html, other]
Title: Towards Markerless Intraoperative Tracking of Deformable Spine Tissue
Connor Daly, Elettra Marconi, Marco Riva, Jinendra Ekanayake, Daniel S. Elson, Ferdinando Rodriguez y Baena
Comments: An improved version of this manuscript was accepted to MICCAI
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Consumer-grade RGB-D imaging for intraoperative orthopedic tissue tracking is a promising method with high translational potential. Unlike bone-mounted tracking devices, markerless tracking can reduce operating time and complexity. However, its use has been limited to cadaveric studies. This paper introduces the first real-world clinical RGB-D dataset for spine surgery and develops SpineAlign, a system for capturing deformation between preoperative and intraoperative spine states. We also present an intraoperative segmentation network trained on this data and introduce CorrespondNet, a multi-task framework for predicting key regions for registration in both intraoperative and preoperative scenes.

[761] arXiv:2506.23743 (replaced) [pdf, html, other]
Title: Positional Bias in Binary Question Answering: How Uncertainty Shapes Model Preferences
Tiziano Labruna, Simone Gallo, Giovanni Da San Martino
Subjects: Computation and Language (cs.CL)

Positional bias in binary question answering occurs when a model systematically favors one choice over another based solely on the ordering of presented options. In this study, we quantify and analyze positional bias across five large language models under varying degrees of answer uncertainty. We re-adapted the SQuAD-it dataset by adding an extra incorrect answer option and then created multiple versions with progressively less context and more out-of-context answers, yielding datasets that range from low to high uncertainty. Additionally, we evaluate two naturally higher-uncertainty benchmarks: (1) WebGPT - question pairs with unequal human-assigned quality scores, and (2) Winning Arguments - where models predict the more persuasive argument in Reddit's r/ChangeMyView exchanges. Across each dataset, the order of the "correct" (or higher-quality/persuasive) option is systematically flipped (first placed in position 1, then in position 2) to compute both Preference Fairness and Position Consistency. We observe that positional bias is nearly absent under low-uncertainty conditions, but grows exponentially when it becomes doubtful to decide which option is correct.

[762] arXiv:2506.23777 (replaced) [pdf, html, other]
Title: Synthetically Expressive: Evaluating gesture and voice for emotion and empathy in VR and 2D scenarios
Haoyang Du, Kiran Chhatre, Christopher Peters, Brian Keegan, Rachel McDonnell, Cathy Ennis
Subjects: Graphics (cs.GR)

The creation of virtual humans increasingly leverages automated synthesis of speech and gestures, enabling expressive, adaptable agents that effectively engage users. However, the independent development of voice and gesture generation technologies, alongside the growing popularity of virtual reality (VR), presents significant questions about the integration of these signals and their ability to convey emotional detail in immersive environments. In this paper, we evaluate the influence of real and synthetic gestures and speech, alongside varying levels of immersion (VR vs. 2D displays) and emotional contexts (positive, neutral, negative) on user perceptions. We investigate how immersion affects the perceived match between gestures and speech and the impact on key aspects of user experience, including emotional and empathetic responses and the sense of co-presence. Our findings indicate that while VR enhances the perception of natural gesture-voice pairings, it does not similarly improve synthetic ones - amplifying the perceptual gap between them. These results highlight the need to reassess gesture appropriateness and refine AI-driven synthesis for immersive environments. Supplementary video: this https URL

[763] arXiv:2506.23800 (replaced) [pdf, html, other]
Title: Towards the Training of Deeper Predictive Coding Neural Networks
Chang Qi, Matteo Forasassi, Thomas Lukasiewicz, Tommaso Salvatori
Comments: 18 Pages, 7 figures
Subjects: Machine Learning (cs.LG)

Predictive coding networks trained with equilibrium propagation are neural models that perform inference through an iterative energy minimization process. Previous studies have demonstrated their effectiveness in shallow architectures, but show significant performance degradation when depth exceeds five to seven layers. In this work, we show that the reason behind this degradation is due to exponentially imbalanced errors between layers during weight updates, and predictions from the previous layer not being effective in guiding updates in deeper layers. We address the first issue by introducing two novel methods to optimize the latent variables that use precision-weighting to re-balance the distribution of energy among layers during the `relaxation phase', and the second issue by proposing a novel weight update mechanism that reduces error accumulation in deeper layers. Empirically, we test our methods on a large number of image classification tasks, resulting in large improvements in test accuracy across networks with more than seven layers, with performances comparable to those of backprop on similar models. These findings suggest that a better understanding of the relaxation phase is important to train models using equilibrium propagation at scale, and open new possibilities for their application in complex tasks.

[764] arXiv:2506.23815 (replaced) [pdf, other]
Title: The Impact of AI on Educational Assessment: A Framework for Constructive Alignment
Patrick Stokkink
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

The influence of Artificial Intelligence (AI), and specifically Large Language Models (LLM), on education is continuously increasing. These models are frequently used by students, giving rise to the question whether current forms of assessment are still a valid way to evaluate student performance and comprehension. The theoretical framework developed in this paper is grounded in Constructive Alignment (CA) theory and Bloom's taxonomy for defining learning objectives. We argue that AI influences learning objectives of different Bloom levels in a different way, and assessment has to be adopted accordingly. Furthermore, in line with Bloom's vision, formative and summative assessment should be aligned on whether the use of AI is permitted or not.
Although lecturers tend to agree that education and assessment need to be adapted to the presence of AI, a strong bias exists on the extent to which lecturers want to allow for AI in assessment. This bias is caused by a lecturer's familiarity with AI and specifically whether they use it themselves. To avoid this bias, we propose structured guidelines on a university or faculty level, to foster alignment among the staff. Besides that, we argue that teaching staff should be trained on the capabilities and limitations of AI tools. In this way, they are better able to adapt their assessment methods.

[765] arXiv:2506.23897 (replaced) [pdf, html, other]
Title: PriOr-Flow: Enhancing Primitive Panoramic Optical Flow with Orthogonal View
Longliang Liu, Miaojie Feng, Junda Cheng, Jijun Xiang, Xuan Zhu, Xin Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Panoramic optical flow enables a comprehensive understanding of temporal dynamics across wide fields of view. However, severe distortions caused by sphere-to-plane projections, such as the equirectangular projection (ERP), significantly degrade the performance of conventional perspective-based optical flow methods, especially in polar regions. To address this challenge, we propose PriOr-Flow, a novel dual-branch framework that leverages the low-distortion nature of the orthogonal view to enhance optical flow estimation in these regions. Specifically, we introduce the Dual-Cost Collaborative Lookup (DCCL) operator, which jointly retrieves correlation information from both the primitive and orthogonal cost volumes, effectively mitigating distortion noise during cost volume construction. Furthermore, our Ortho-Driven Distortion Compensation (ODDC) module iteratively refines motion features from both branches, further suppressing polar distortions. Extensive experiments demonstrate that PriOr-Flow is compatible with various perspective-based iterative optical flow methods and consistently achieves state-of-the-art performance on publicly available panoramic optical flow datasets, setting a new benchmark for wide-field motion estimation. The code is publicly available at: this https URL.

[766] arXiv:2506.23918 (replaced) [pdf, html, other]
Title: Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers
Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, Linjie Li, Yu Cheng, Heng Ji, Junxian He, Yi R. Fung
Comments: We maintain a real-time GitHub repository tracking progress at: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Recent progress in multimodal reasoning has been significantly advanced by textual Chain-of-Thought (CoT), a paradigm where models conduct reasoning within language. This text-centric approach, however, treats vision as a static, initial context, creating a fundamental "semantic gap" between rich perceptual data and discrete symbolic thought. Human cognition often transcends language, utilizing vision as a dynamic mental sketchpad. A similar evolution is now unfolding in AI, marking a fundamental paradigm shift from models that merely think about images to those that can truly think with images. This emerging paradigm is characterized by models leveraging visual information as intermediate steps in their thought process, transforming vision from a passive input into a dynamic, manipulable cognitive workspace. In this survey, we chart this evolution of intelligence along a trajectory of increasing cognitive autonomy, which unfolds across three key stages: from external tool exploration, through programmatic manipulation, to intrinsic imagination. To structure this rapidly evolving field, our survey makes four key contributions. (1) We establish the foundational principles of the think with image paradigm and its three-stage framework. (2) We provide a comprehensive review of the core methods that characterize each stage of this roadmap. (3) We analyze the critical landscape of evaluation benchmarks and transformative applications. (4) We identify significant challenges and outline promising future directions. By providing this structured overview, we aim to offer a clear roadmap for future research towards more powerful and human-aligned multimodal AI.

[767] arXiv:2506.23940 (replaced) [pdf, html, other]
Title: Graft: Integrating the Domain Knowledge via Efficient Parameter Synergy for MLLMs
Yang Dai, Jianxiang An, Tianwei Lin, Hongyang He, Hongzhe Huang, Wenqiao Zhang, Zheqi Lv, Siliang Tang, Yueting Zhuang
Subjects: Computation and Language (cs.CL)

Multimodal Large Language Models (MLLMs) have achieved success across various domains. However, their applicability tends to degrade when confronted with different types of data inputs, especially for MLLMs that have been fine-tuned for specific tasks. Despite its importance, the study of knowledge sharing among domain-specific MLLMs--such as those trained for mathematics or code--remains largely underexplored. To address the fragmentation of knowledge across domain-specialized MLLMs, we propose a unified parameter integration framework that enables modular composition of expert capabilities. Our method is grounded in a novel Compatibility-Aware Parameter Splicing (CAPS) strategy, which leverages both local functional attribution and global information-theoretic signals to guide selective parameter fusion. By extending this mechanism to the low-rank adaptation layer granularity, we ensure efficient integration with minimal inference overhead. Furthermore, we introduce a domain compatibility scoring mechanism that quantifies inter-expert alignment at the activation level and correlates with downstream task utility. This principled fusion protocol allows the final model to synergize heterogeneous expertise while preserving structural modularity. Extensive evaluations across diverse multimodal benchmarks validate the effectiveness of our framework, offering a scalable path toward compositional, domain-adaptive MLLMs.

[768] arXiv:2506.23944 (replaced) [pdf, other]
Title: Adapt Your Body: Mitigating Proprioception Shifts in Imitation Learning
Fuhang Kuang, Jiacheng You, Yingdong Hu, Tong Zhang, Chuan Wen, Yang Gao
Comments: Need further modification
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Imitation learning models for robotic tasks typically rely on multi-modal inputs, such as RGB images, language, and proprioceptive states. While proprioception is intuitively important for decision-making and obstacle avoidance, simply incorporating all proprioceptive states leads to a surprising degradation in imitation learning performance. In this work, we identify the underlying issue as the proprioception shift problem, where the distributions of proprioceptive states diverge significantly between training and deployment. To address this challenge, we propose a domain adaptation framework that bridges the gap by utilizing rollout data collected during deployment. Using Wasserstein distance, we quantify the discrepancy between expert and rollout proprioceptive states and minimize this gap by adding noise to both sets of states, proportional to the Wasserstein distance. This strategy enhances robustness against proprioception shifts by aligning the training and deployment distributions. Experiments on robotic manipulation tasks demonstrate the efficacy of our method, enabling the imitation policy to leverage proprioception while mitigating its adverse effects. Our approach outperforms the naive solution which discards proprioception, and other baselines designed to address distributional shifts.

[769] arXiv:2506.23952 (replaced) [pdf, other]
Title: Autonomy by Design: Preserving Human Autonomy in AI Decision-Support
Stefan Buijsman, Sarah Carter, Juan Pablo Bermúdez
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); General Economics (econ.GN)

AI systems increasingly support human decision-making across domains of professional, skill-based, and personal activity. While previous work has examined how AI might affect human autonomy globally, the effects of AI on domain-specific autonomy -- the capacity for self-governed action within defined realms of skill or expertise -- remain understudied. We analyze how AI decision-support systems affect two key components of domain-specific autonomy: skilled competence (the ability to make informed judgments within one's domain) and authentic value-formation (the capacity to form genuine domain-relevant values and preferences). By engaging with prior investigations and analyzing empirical cases across medical, financial, and educational domains, we demonstrate how the absence of reliable failure indicators and the potential for unconscious value shifts can erode domain-specific autonomy both immediately and over time. We then develop a constructive framework for autonomy-preserving AI support systems. We propose specific socio-technical design patterns -- including careful role specification, implementation of defeater mechanisms, and support for reflective practice -- that can help maintain domain-specific autonomy while leveraging AI capabilities. This framework provides concrete guidance for developing AI systems that enhance rather than diminish human agency within specialized domains of action.

[770] arXiv:2506.23986 (replaced) [pdf, html, other]
Title: StreamFlow: Streaming Flow Matching with Block-wise Guided Attention Mask for Speech Token Decoding
Dake Guo, Jixun Yao, Linhan Ma, He Wang, Lei Xie
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

Recent advancements in discrete token-based speech generation have highlighted the importance of token-to-waveform generation for audio quality, particularly in real-time interactions. Traditional frameworks integrating semantic tokens with flow matching (FM) struggle with streaming capabilities due to their reliance on a global receptive field. Additionally, directly implementing token-by-token streaming speech generation often results in degraded audio quality. To address these challenges, we propose StreamFlow, a novel neural architecture that facilitates streaming flow matching with diffusion transformers (DiT). To mitigate the long-sequence extrapolation issues arising from lengthy historical dependencies, we design a local block-wise receptive field strategy. Specifically, the sequence is first segmented into blocks, and we introduce block-wise attention masks that enable the current block to receive information from the previous or subsequent block. These attention masks are combined hierarchically across different DiT-blocks to regulate the receptive field of DiTs. Both subjective and objective experimental results demonstrate that our approach achieves performance comparable to non-streaming methods while surpassing other streaming methods in terms of speech quality, all the while effectively managing inference time during long-sequence generation. Furthermore, our method achieves a notable first-packet latency of only 180 ms.\footnote{Speech samples: this https URL}

[771] arXiv:2506.24117 (replaced) [pdf, html, other]
Title: Intertextual Parallel Detection in Biblical Hebrew: A Transformer-Based Benchmark
David M. Smiley
Subjects: Computation and Language (cs.CL)

Identifying parallel passages in biblical Hebrew (BH) is central to biblical scholarship for understanding intertextual relationships. Traditional methods rely on manual comparison, a labor-intensive process prone to human error. This study evaluates the potential of pre-trained transformer-based language models, including E5, AlephBERT, MPNet, and LaBSE, for detecting textual parallels in the Hebrew Bible. Focusing on known parallels between Samuel/Kings and Chronicles, I assessed each model's capability to generate word embeddings distinguishing parallel from non-parallel passages. Using cosine similarity and Wasserstein Distance measures, I found that E5 and AlephBERT show promise; E5 excels in parallel detection, while AlephBERT demonstrates stronger non-parallel differentiation. These findings indicate that pre-trained models can enhance the efficiency and accuracy of detecting intertextual parallels in ancient texts, suggesting broader applications for ancient language studies.

[772] arXiv:2506.24119 (replaced) [pdf, html, other]
Title: SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning
Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, Wee Sun Lee, Natasha Jaques
Comments: Work in Progress
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision. Through self-play, SPIRAL generates an infinite curriculum of progressively challenging problems as models must constantly adapt to stronger opponents. To enable this self-play training at scale, We implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. Using SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Training Qwen3-4B-Base on Kuhn Poker alone achieves 8.6% improvement on math and 8.4% on general reasoning, outperforming SFT on 25,000 expert game trajectories. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) further enhances performance as each game develops distinct reasoning strengths. Applying SPIRAL to a strong reasoning model (DeepSeek-R1-Distill-Qwen-7B) can still lead to 2.0% average improvement. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.

[773] arXiv:2506.24124 (replaced) [pdf, html, other]
Title: Teaching Time Series to See and Speak: Forecasting with Aligned Visual and Textual Perspectives
Sixun Dong, Wei Fan, Teresa Wu, Yanjie Fu
Comments: Code: this https URL
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Time series forecasting traditionally relies on unimodal numerical inputs, which often struggle to capture high-level semantic patterns due to their dense and unstructured nature. While recent approaches have explored representing time series as text using large language models (LLMs), these methods remain limited by the discrete nature of token sequences and lack the perceptual intuition humans typically apply, such as interpreting visual patterns. In this paper, we propose a multimodal contrastive learning framework that transforms raw time series into structured visual and textual perspectives. Rather than using natural language or real-world images, we construct both modalities directly from numerical sequences. We then align these views in a shared semantic space via contrastive learning, enabling the model to capture richer and more complementary representations. Furthermore, we introduce a variate selection module that leverages the aligned representations to identify the most informative variables for multivariate forecasting. Extensive experiments on fifteen short-term and six long-term forecasting benchmarks demonstrate that our approach consistently outperforms strong unimodal and cross-modal baselines, highlighting the effectiveness of multimodal alignment in enhancing time series forecasting. Code is available at: this https URL.

[774] arXiv:2310.16463 (replaced) [pdf, html, other]
Title: Constructing disjoint Steiner trees in Sierpiński graphs
Chenxu Yang, Ping Li, Yaping Mao, Eddie Cheng, Ralf Klasing
Comments: Steiner Tree; Generalized Connectivity; Sierpiński Graph
Subjects: Combinatorics (math.CO); Data Structures and Algorithms (cs.DS)

Let $G$ be a graph and $S\subseteq V(G)$ with $|S|\geq 2$. Then the trees $T_1, T_2, \cdots, T_\ell$ in $G$ are \emph{internally disjoint Steiner trees} connecting $S$ (or $S$-Steiner trees) if $E(T_i) \cap E(T_j )=\emptyset$ and $V(T_i)\cap V(T_j)=S$ for every pair of distinct integers $i,j$, $1 \leq i, j \leq \ell$. Similarly, if we only have the condition $E(T_i) \cap E(T_j )=\emptyset$ but without the condition $V(T_i)\cap V(T_j)=S$, then they are \emph{edge-disjoint Steiner trees}. The \emph{generalized $k$-connectivity}, denoted by $\kappa_k(G)$, of a graph $G$, is defined as $\kappa_k(G)=\min\{\kappa_G(S)|S \subseteq V(G) \ \textrm{and} \ |S|=k \}$, where $\kappa_G(S)$ is the maximum number of internally disjoint $S$-Steiner trees. The \emph{generalized local edge-connectivity} $\lambda_{G}(S)$ is the maximum number of edge-disjoint Steiner trees connecting $S$ in $G$. The {\it generalized $k$-edge-connectivity} $\lambda_k(G)$ of $G$ is defined as $\lambda_k(G)=\min\{\lambda_{G}(S)\,|\,S\subseteq V(G) \ and \ |S|=k\}$. These measures are generalizations of the concepts of connectivity and edge-connectivity, and they and can be used as measures of vulnerability of networks. It is, in general, difficult to compute these generalized connectivities. However, there are precise results for some special classes of graphs. In this paper, we obtain the exact value of $\lambda_{k}(S(n,\ell))$ for $3\leq k\leq \ell^n$, and the exact value of $\kappa_{k}(S(n,\ell))$ for $3\leq k\leq \ell$, where $S(n, \ell)$ is the Sierpiński graphs with order $\ell^n$. As a direct consequence, these graphs provide additional interesting examples when $\lambda_{k}(S(n,\ell))=\kappa_{k}(S(n,\ell))$. We also study the some network properties of Sierpiński graphs.

[775] arXiv:2401.03302 (replaced) [pdf, html, other]
Title: Realism in Action: Anomaly-Aware Diagnosis of Brain Tumors from Medical Images Using YOLOv8 and DeiT
Seyed Mohammad Hossein Hashemi, Leila Safari, Mohsen Hooshmand, Amirhossein Dadashzadeh Taromi
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)

Reliable diagnosis of brain tumors remains challenging due to low clinical incidence rates of such cases. However, this low rate is neglected in most of proposed methods. We propose a clinically inspired framework for anomaly-resilient tumor detection and classification. Detection leverages YOLOv8n fine-tuned on a realistically imbalanced dataset (1:9 tumor-to-normal ratio; 30,000 MRI slices from 81 patients). In addition, we propose a novel Patient-to-Patient (PTP) metric that evaluates diagnostic reliability at the patient level. Classification employs knowledge distillation: a Data Efficient Image Transformer (DeiT) student model is distilled from a ResNet152 teacher. The distilled ViT achieves an F1-score of 0.92 within 20 epochs, matching near teacher performance (F1=0.97) with significantly reduced computational resources. This end-to-end framework demonstrates high robustness in clinically representative anomaly-distributed data, offering a viable tool that adheres to realistic situations in clinics.

[776] arXiv:2401.16776 (replaced) [pdf, html, other]
Title: Leveraging Nested MLMC for Sequential Neural Posterior Estimation with Intractable Likelihoods
Xiliang Yang, Yifei Xiong, Zhijian He
Comments: 30 pages, 6 figures
Subjects: Computation (stat.CO); Machine Learning (cs.LG); Machine Learning (stat.ML)

There has been a growing interest in studying sequential neural posterior estimation (SNPE) techniques for their advantages in dealing with simulation-based models with intractable likelihoods. They are devoted to learning the posterior from adaptively proposed simulations using neural network-based conditional density estimators. As a SNPE technique, the automatic posterior transformation (APT) method proposed by Greenberg et al. (2019) performs notably and scales to high dimensional data. However, the APT method bears the computation of an expectation of the logarithm of an intractable normalizing constant, i.e., a nested expectation. Although atomic APT was proposed to solve this by discretizing the normalizing constant, it remains challenging to analyze the convergence of learning. In this paper, we propose a nested APT method to estimate the involved nested expectation instead. This facilitates establishing the convergence analysis. Since the nested estimators for the loss function and its gradient are biased, we make use of unbiased multi-level Monte Carlo (MLMC) estimators for debiasing. To further reduce the excessive variance of the unbiased estimators, this paper also develops some truncated MLMC estimators by taking account of the trade-off between the bias and the average cost. Numerical experiments for approximating complex posteriors with multimodal in moderate dimensions are provided.

[777] arXiv:2405.15643 (replaced) [pdf, html, other]
Title: An Unconditional Representation of the Conditional Score in Infinite-Dimensional Linear Inverse Problems
Fabian Schneider, Duc-Lam Duong, Matti Lassas, Maarten V. de Hoop, Tapio Helin
Comments: Title changed, main text substantially revised, including new experiments, method acronym changed, references added. 34 pages, 11 figures, 2tables
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Analysis of PDEs (math.AP); Numerical Analysis (math.NA); Probability (math.PR)

Score-based diffusion models (SDMs) have emerged as a powerful tool for sampling from the posterior distribution in Bayesian inverse problems. However, existing methods often require multiple evaluations of the forward mapping to generate a single sample, resulting in significant computational costs for large-scale inverse problems. To address this, we propose an unconditional representation of the conditional score-function (UCoS) tailored to linear inverse problems, which avoids forward model evaluations during sampling by shifting computational effort to an offline training phase. In this phase, a task-dependent score function is learned based on the linear forward operator. Crucially, we show that the conditional score can be derived exactly from a trained (unconditional) score using affine transformations, eliminating the need for conditional score approximations. Our approach is formulated in infinite-dimensional function spaces, making it inherently discretization-invariant. We support this formulation with a rigorous convergence analysis that justifies UCoS beyond any specific discretization. Finally we validate UCoS through high-dimensional computed tomography (CT) and image deblurring experiments, demonstrating both scalability and accuracy.

[778] arXiv:2405.16594 (replaced) [pdf, html, other]
Title: Training-Conditional Coverage Bounds under Covariate Shift
Mehrdad Pournaderi, Yu Xiang
Comments: arXiv admin note: text overlap with arXiv:2404.13731
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Conformal prediction methodology has recently been extended to the covariate shift setting, where the distribution of covariates differs between training and test data. While existing results ensure that the prediction sets from these methods achieve marginal coverage above a nominal level, their coverage rate conditional on the training dataset (referred to as training-conditional coverage) remains unexplored. In this paper, we address this gap by deriving upper bounds on the tail of the training-conditional coverage distribution, offering probably approximately correct (PAC) guarantees for these methods. Our results quantify the relationship between the quality of the prediction sets and the severity of distributional changes, and can potentially be used to compute more efficient prediction sets.

[779] arXiv:2407.14153 (replaced) [pdf, html, other]
Title: De-LightSAM: Modality-Decoupled Lightweight SAM for Generalizable Medical Segmentation
Qing Xu, Jiaxuan Li, Xiangjian He, Chenxin Li, Fiseha B. Tesem, Wenting Duan, Zhen Chen, Rong Qu, Jonathan M. Garibaldi, Chang Wen Chen
Comments: Under Review
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

The universality of deep neural networks across different modalities and their generalization capabilities to unseen domains play an essential role in medical image segmentation. The recent segment anything model (SAM) has demonstrated strong adaptability across diverse natural scenarios. However, the huge computational costs, demand for manual annotations as prompts and conflict-prone decoding process of SAM degrade its generalization capabilities in medical scenarios. To address these limitations, we propose a modality-decoupled lightweight SAM for domain-generalized medical image segmentation, named De-LightSAM. Specifically, we first devise a lightweight domain-controllable image encoder (DC-Encoder) that produces discriminative visual features for diverse modalities. Further, we introduce the self-patch prompt generator (SP-Generator) to automatically generate high-quality dense prompt embeddings for guiding segmentation decoding. Finally, we design the query-decoupled modality decoder (QM-Decoder) that leverages a one-to-one strategy to provide an independent decoding channel for every modality, preventing mutual knowledge interference of different modalities. Moreover, we design a multi-modal decoupled knowledge distillation (MDKD) strategy to leverage robust common knowledge to complement domain-specific medical feature representations. Extensive experiments indicate that De-LightSAM outperforms state-of-the-arts in diverse medical imaging segmentation tasks, displaying superior modality universality and generalization capabilities. Especially, De-LightSAM uses only 2.0% parameters compared to SAM-H. The source code is available at this https URL.

[780] arXiv:2407.21021 (replaced) [pdf, html, other]
Title: Asymptotic Formulation of the Role of Shear Loads on Multi-Layered Thin Shells and Classification of Their Deformation Modes
Xiwei Pan (1), Yichao Zhu (1,2) ((1) Department of Engineering Mechanics, Dalian University of Technology, (2) State Key Laboratory of Structural Analysis, Optimization and CAE Software for Industrial Equipment)
Comments: 41 pages, 12 figures
Subjects: Materials Science (cond-mat.mtrl-sci); Numerical Analysis (math.NA)

Shell structures are generally modeled based on kinematic hypotheses, where some of the parameters are preferentially evaluated in a phenomenological manner. In this article, asymptotic analysis against the underlying three-dimensional equation system is considered so as to provide a rational framework for modeling and interpreting the deformation behavior of multi-layered thin shells (MTSs). Capable of accurately predicting both overall stiffness and detailed stress distribution, the proposed shell theory shows its distinguishing features at least in the following aspects. Firstly, it naturally introduces a rule for classifying the deformation modes of MTSs based on the magnitude of the maximum dimensionless principal curvature. Secondly, for each class, the hierarchy in the order of the involved field quantities is examined, and it is shown that when the product of the maximum principal curvature and the characteristic shell size reaches the magnitude of unity or larger, the resulting shell theory cannot be treated by natural extension of plate theories. Lastly, it is demonstrated that, for moderate shear forces and comparable material properties, a leading-order multi-layered shell theory derived from asymptotic analysis should suffice to output satisfactory predictions over the shell stiffness, as well as its internal stress distribution. Numerical examples of the deformation and strength analysis for MTSs are also presented to show the reliability of the leading-order model.

[781] arXiv:2408.01868 (replaced) [pdf, html, other]
Title: Meta-Posterior Consistency for the Bayesian Inference of Metastable System
Zachary P Adams, Sayan Mukherjee
Comments: 32 pages, 3 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)

The vast majority of the literature on learning dynamical systems or stochastic processes from time series has focused on stable or ergodic systems, for both Bayesian and frequentist inference procedures. However, most real-world systems are only metastable, that is, the dynamics appear to be stable on some time scale, but are in fact unstable over longer time scales. Consistency of inference for metastable systems may not be possible, but one can ask about metaconsistency: Do inference procedures converge when observations are taken over a large but finite time interval, but diverge on longer time scales? In this paper we introduce, discuss, and quantify metaconsistency in a Bayesian framework. We discuss how metaconsistency can be exploited to efficiently infer a model for a sub-system of a larger system, where inference on the global behavior may require much more data, or there is no theoretical guarantee as to the asymptotic success of inference procedures. We also discuss the relation between metaconsistency and the spectral properties of the model dynamical system in the case of uniformly ergodic and non-ergodic diffusions.

[782] arXiv:2408.16553 (replaced) [pdf, html, other]
Title: Downscaling Neural Network for Coastal Simulations
Zhi-Song Liu, Markus Buttner, Vadym Aizinger, Andreas Rupp
Comments: 13 pages, 12 figures
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG)

Learning the fine-scale details of a coastal ocean simulation from a coarse representation is a challenging task. For real-world applications, high-resolution simulations are necessary to advance understanding of many coastal processes, specifically, to predict flooding resulting from tsunamis and storm surges. We propose a Downscaling Neural Network for Coastal Simulation (DNNCS) for spatiotemporal enhancement to efficiently learn the high-resolution numerical solution. Given images of coastal simulations produced on low-resolution computational meshes using low polynomial order discontinuous Galerkin discretizations and a coarse temporal resolution, the proposed DNNCS learns to produce high-resolution free surface elevation and velocity visualizations in both time and space. To efficiently model the dynamic changes over time and space, we propose grid-aware spatiotemporal attention to project the temporal features to the spatial domain for non-local feature matching. The coordinate information is also utilized via positional encoding. For the final reconstruction, we use the spatiotemporal bilinear operation to interpolate the missing frames and then expand the feature maps to the frequency domain for residual mapping. Besides data-driven losses, the proposed physics-informed loss guarantees gradient consistency and momentum changes. Their combination contributes to the overall 24% improvements in Root Mean Square Error (RMSE). To train the proposed model, we propose a novel coastal simulation dataset and use it for model optimization and evaluation. Our method shows superior downscaling quality and fast computation compared to the state-of-the-art methods.

[783] arXiv:2411.04775 (replaced) [pdf, html, other]
Title: Learning dynamical systems from data: Gradient-based dictionary optimization
Mohammad Tabish, Neil K. Chada, Stefan Klus
Subjects: Dynamical Systems (math.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)

The Koopman operator plays a crucial role in analyzing the global behavior of dynamical systems. Existing data-driven methods for approximating the Koopman operator or discovering the governing equations of the underlying system typically require a fixed set of basis functions, also called dictionary. The optimal choice of basis functions is highly problem-dependent and often requires domain knowledge. We present a novel gradient descent-based optimization framework for learning suitable and interpretable basis functions from data and show how it can be used in combination with EDMD, SINDy, and PDE-FIND. We illustrate the efficacy of the proposed approach with the aid of various benchmark problems such as the Ornstein-Uhlenbeck process, Chua's circuit, a nonlinear heat equation, as well as protein-folding data.

[784] arXiv:2411.04946 (replaced) [pdf, html, other]
Title: SPGD: Steepest Perturbed Gradient Descent Optimization
Amir M. Vahedi, Horea T. Ilies
Comments: 28 pages, 26 figures, submitted to Journal of Mechanical Design
Subjects: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Mathematical Physics (math-ph)

Optimization algorithms are pivotal in advancing various scientific and industrial fields but often encounter obstacles such as trapping in local minima, saddle points, and plateaus (flat regions), which makes the convergence to reasonable or near-optimal solutions particularly challenging. This paper presents the Steepest Perturbed Gradient Descent (SPGD), a novel algorithm that innovatively combines the principles of the gradient descent method with periodic uniform perturbation sampling to effectively circumvent these impediments and lead to better solutions whenever possible. SPGD is distinctively designed to generate a set of candidate solutions and select the one exhibiting the steepest loss difference relative to the current solution. It enhances the traditional gradient descent approach by integrating a strategic exploration mechanism that significantly increases the likelihood of escaping sub-optimal local minima and navigating complex optimization landscapes effectively. Our approach not only retains the directed efficiency of gradient descent but also leverages the exploratory benefits of stochastic perturbations, thus enabling a more comprehensive search for global optima across diverse problem spaces. We demonstrate the efficacy of SPGD in solving the 3D component packing problem, an NP-hard challenge. Preliminary results show a substantial improvement over four established methods, particularly on response surfaces with complex topographies and in multidimensional non-convex continuous optimization problems. Comparative analyses with established 2D benchmark functions highlight SPGD's superior performance, showcasing its ability to navigate complex optimization landscapes. These results emphasize SPGD's potential as a versatile tool for a wide range of optimization problems.

[785] arXiv:2411.17485 (replaced) [pdf, html, other]
Title: Storing overlapping associative memories on latent manifolds in low-rank spiking networks
William F. Podlaski, Christian K. Machens
Comments: 17 pages, 5 figures; accepted to NeurIPS 2024 Workshop on Symmetry and Geometry in Neural Representations (NeurReps 2024)
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Associative memory architectures such as the Hopfield network have long been important conceptual and theoretical models for neuroscience and artificial intelligence. However, translating these abstract models into spiking neural networks has been surprisingly difficult. Indeed, much previous work has been restricted to storing a small number of primarily non-overlapping memories in large networks, thereby limiting their scalability. Here, we revisit the associative memory problem in light of recent advances in understanding spike-based computation. Using a recently-established geometric framework, we show that the spiking activity for a large class of all-inhibitory networks is situated on a low-dimensional, convex, and piecewise-linear manifold, with dynamics that move along the manifold. We then map the associative memory problem onto these dynamics, and demonstrate how the vertices of a hypercubic manifold can be used to store stable, overlapping activity patterns with a direct correspondence to the original Hopfield model. We propose several learning rules, and demonstrate a linear scaling of the storage capacity with the number of neurons, as well as robust pattern completion abilities. Overall, this work serves as a case study to demonstrate the effectiveness of using a geometrical perspective to design dynamics on neural manifolds, with implications for neuroscience and machine learning.

[786] arXiv:2411.19906 (replaced) [pdf, html, other]
Title: A Graph-Based Classical and Quantum Approach to Deterministic L-System Inference
Ali Lotfi, Ian McQuillan, Steven Rayan
Comments: 17 pages, 1 figure
Subjects: Quantum Physics (quant-ph); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)

L-systems can be made to model and create simulations of many biological processes, such as plant development. Finding an L-system for a given process is typically solved by hand, by experts, in a massively time-consuming process. It would be significant if this could be done automatically from data, such as from sequences of images. In this paper, we are interested in inferring a particular type of L-system, deterministic context-free L-system (D0L-system) from a sequence of strings. We introduce the characteristic graph of a sequence of strings, which we then utilize to translate our problem (inferring D0L-systems) in polynomial time into the maximum independent set problem (MIS) and the SAT problem. After that, we offer a classical exact algorithm and an approximate quantum algorithm for the problem.

[787] arXiv:2412.06959 (replaced) [pdf, html, other]
Title: Geological and Well prior assisted full waveform inversion using conditional diffusion models
Fu Wang, Xinquan Huang, Tariq Alkhalifah
Subjects: Geophysics (physics.geo-ph); Machine Learning (cs.LG)

Full waveform inversion (FWI) often faces challenges due to inadequate seismic observations, resulting in band-limited and geologically inaccurate inversion results. Incorporating prior information from potential velocity distributions, well-log information, and our geological knowledge and expectations can significantly improve FWI convergence to a realistic model. While diffusion-regularized FWI has shown improved performance compared to conventional FWI by incorporating the velocity distribution prior, it can benefit even more by incorporating well-log information and other geological knowledge priors. To leverage this fact, we propose a geological class and well-information prior-assisted FWI using conditional diffusion models. This method seamlessly integrates multi-modal information into FWI, simultaneously achieving data fitting and universal geologic and geophysics prior matching, which is often not achieved with traditional regularization methods. Specifically, we propose to combine conditional diffusion models with FWI, where we integrate well-log data and geological class conditions into these conditional diffusion models using classifier-free guidance for multi-modal prior matching beyond the original velocity distribution prior. Numerical experiments on the OpenFWI datasets and field marine data demonstrate the effectiveness of our method compared to conventional FWI and the unconditional diffusion-regularized FWI.

[788] arXiv:2412.08453 (replaced) [pdf, html, other]
Title: On best approximation by multivariate ridge functions with applications to generalized translation networks
Paul Geuchen, Palina Salanevich, Olov Schavemaker, Felix Voigtlaender
Subjects: Functional Analysis (math.FA); Machine Learning (cs.LG); Machine Learning (stat.ML)

In this paper, we prove sharp upper and lower bounds for the approximation of Sobolev functions by sums of multivariate ridge functions, i.e., for approximation by functions of the form $\mathbb{R}^d \ni x \mapsto \sum_{k=1}^n \varrho_k(A_k x) \in \mathbb{R}$ with $\varrho_k : \mathbb{R}^\ell \to \mathbb{R}$ and $A_k \in \mathbb{R}^{\ell \times d}$. We show that the order of approximation asymptotically behaves as $n^{-r/(d-\ell)}$, where $r$ is the regularity (order of differentiability) of the Sobolev functions to be approximated. Our lower bound even holds when approximating $L^\infty$-Sobolev functions of regularity $r$ with error measured in $L^1$, while our upper bound applies to the approximation of $L^p$-Sobolev functions in $L^p$ for any $1 \leq p \leq \infty$. These bounds generalize well-known results regarding the approximation properties of univariate ridge functions to the multivariate case. We use our results to obtain sharp asymptotic bounds for the approximation of Sobolev functions using generalized translation networks and complex-valued neural networks.

[789] arXiv:2501.02144 (replaced) [pdf, other]
Title: Establishing baselines for generative discovery of inorganic crystals
Nathan J. Szymanski, Christopher J. Bartel
Subjects: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)

Generative artificial intelligence offers a promising avenue for materials discovery, yet its advantages over traditional methods remain unclear. In this work, we introduce and benchmark two baseline approaches - random enumeration of charge-balanced prototypes and data-driven ion exchange of known compounds - against four generative techniques based on diffusion models, variational autoencoders, and large language models. Our results show that established methods such as ion exchange are better at generating novel materials that are stable, although many of these closely resemble known compounds. In contrast, generative models excel at proposing novel structural frameworks and, when sufficient training data is available, can more effectively target properties such as electronic band gap and bulk modulus. To enhance the performance of both the baseline and generative approaches, we implement a post-generation screening step in which all proposed structures are passed through stability and property filters from pre-trained machine learning models including universal interatomic potentials. This low-cost filtering step leads to substantial improvement in the success rates of all methods, remains computationally efficient, and ultimately provides a practical pathway toward more effective generative strategies for materials discovery. By establishing baselines for comparison, this work highlights opportunities for continued advancement of generative models, especially for the targeted generation of novel materials that are thermodynamically stable.

[790] arXiv:2501.16243 (replaced) [pdf, html, other]
Title: Accelerating Quantum Reinforcement Learning with a Quantum Natural Policy Gradient Based Approach
Yang Xu, Vaneet Aggarwal
Comments: Proceedings of the 42nd International Conference on Machine Learning
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

We address the problem of quantum reinforcement learning (QRL) under model-free settings with quantum oracle access to the Markov Decision Process (MDP). This paper introduces a Quantum Natural Policy Gradient (QNPG) algorithm, which replaces the random sampling used in classical Natural Policy Gradient (NPG) estimators with a deterministic gradient estimation approach, enabling seamless integration into quantum systems. While this modification introduces a bounded bias in the estimator, the bias decays exponentially with increasing truncation levels. This paper demonstrates that the proposed QNPG algorithm achieves a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-1.5})$ for queries to the quantum oracle, significantly improving the classical lower bound of $\tilde{\mathcal{O}}(\epsilon^{-2})$ for queries to the MDP.

[791] arXiv:2502.03551 (replaced) [pdf, html, other]
Title: Gradient Descent Algorithm in Hilbert Spaces under Stationary Markov Chains with $ϕ$- and $β$-Mixing
Priyanka Roy, Susanne Saminger-Platz
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Functional Analysis (math.FA)

In this paper, we study a strictly stationary Markov chain gradient descent algorithm operating in general Hilbert spaces. Our analysis focuses on the mixing coefficients of the underlying process, specifically the $\phi$- and $\beta$-mixing coefficients. Under these assumptions, we derive probabilistic upper bounds on the convergence behavior of the algorithm based on the exponential as well as the polynomial decay of the mixing coefficients.

[792] arXiv:2502.11900 (replaced) [pdf, html, other]
Title: Ansatz-free Hamiltonian learning with Heisenberg-limited scaling
Hong-Ye Hu, Muzhou Ma, Weiyuan Gong, Qi Ye, Yu Tong, Steven T. Flammia, Susanne F. Yelin
Comments: Updated version with expanded explanations, added pseudocode, and new numerical demonstrations. 10 pages, 4 figures. HYH and MM contributed equally
Subjects: Quantum Physics (quant-ph); Information Theory (cs.IT); Machine Learning (cs.LG)

Learning the unknown interactions that govern a quantum system is crucial for quantum information processing, device benchmarking, and quantum sensing. The problem, known as Hamiltonian learning, is well understood under the assumption that interactions are local, but this assumption may not hold for arbitrary Hamiltonians. Previous methods all require high-order inverse polynomial dependency with precision, unable to surpass the standard quantum limit and reach the gold standard Heisenberg-limited scaling. Whether Heisenberg-limited Hamiltonian learning is possible without prior assumptions about the interaction structures, a challenge we term \emph{ansatz-free Hamiltonian learning}, remains an open question. In this work, we present a quantum algorithm to learn arbitrary sparse Hamiltonians without any structure constraints using only black-box queries of the system's real-time evolution and minimal digital controls to attain Heisenberg-limited scaling in estimation error. Our method is also resilient to state-preparation-and-measurement errors, enhancing its practical feasibility. We numerically demonstrate our ansatz-free protocol for learning physical Hamiltonians and validating analog quantum simulations, benchmarking our performance against the state-of-the-art Heisenberg-limited learning approach. Moreover, we establish a fundamental trade-off between total evolution time and quantum control on learning arbitrary interactions, revealing the intrinsic interplay between controllability and total evolution time complexity for any learning algorithm. These results pave the way for further exploration into Heisenberg-limited Hamiltonian learning in complex quantum systems under minimal assumptions, potentially enabling new benchmarking and verification protocols.

[793] arXiv:2502.13030 (replaced) [pdf, html, other]
Title: Conformal Inference under High-Dimensional Covariate Shifts via Likelihood-Ratio Regularization
Sunay Joshi, Shayan Kiyani, George Pappas, Edgar Dobriban, Hamed Hassani
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

We consider the problem of conformal prediction under covariate shift. Given labeled data from a source domain and unlabeled data from a covariate shifted target domain, we seek to construct prediction sets with valid marginal coverage in the target domain. Most existing methods require estimating the unknown likelihood ratio function, which can be prohibitive for high-dimensional data such as images. To address this challenge, we introduce the likelihood ratio regularized quantile regression (LR-QR) algorithm, which combines the pinball loss with a novel choice of regularization in order to construct a threshold function without directly estimating the unknown likelihood ratio. We show that the LR-QR method has coverage at the desired level in the target domain, up to a small error term that we can control. Our proofs draw on a novel analysis of coverage via stability bounds from learning theory. Our experiments demonstrate that the LR-QR algorithm outperforms existing methods on high-dimensional prediction tasks, including a regression task for the Communities and Crime dataset, an image classification task from the WILDS repository, and an LLM question-answering task on the MMLU benchmark.

[794] arXiv:2503.02859 (replaced) [pdf, html, other]
Title: Unsupervised Attributed Dynamic Network Embedding with Stability Guarantees
Emma Ceccherini, Ian Gallagher, Andrew Jones, Daniel Lawson
Comments: 28 pages, 5 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Stability for dynamic network embeddings ensures that nodes behaving the same at different times receive the same embedding, allowing comparison of nodes in the network across time. We present attributed unfolded adjacency spectral embedding (AUASE), a stable unsupervised representation learning framework for dynamic networks in which nodes are attributed with time-varying covariate information. To establish stability, we prove uniform convergence to an associated latent position model. We quantify the benefits of our dynamic embedding by comparing with state-of-the-art network representation learning methods on four real attributed networks. To the best of our knowledge, AUASE is the only attributed dynamic embedding that satisfies stability guarantees without the need for ground truth labels, which we demonstrate provides significant improvements for link prediction and node classification.

[795] arXiv:2504.09748 (replaced) [pdf, html, other]
Title: Level-set topology optimisation with unfitted finite elements and automatic shape differentiation
Zachary J. Wegert, Jordi Manyer, Connor Mallon, Santiago Badia, Vivien J. Challis
Subjects: Optimization and Control (math.OC); Numerical Analysis (math.NA)

In this paper we develop automatic shape differentiation techniques for unfitted discretisations and link these to recent advances in shape calculus for unfitted methods. We extend existing analytic shape calculus results to the case where the domain boundary intersects with the boundary of the background domain. We further show that we can recover these analytic derivatives to machine precision regardless of the mesh size using the developed automatic shape differentiation techniques, drastically reducing the burden associated with the analytic derivation of these quantities. In addition, we show that we can also recover the symmetric shape Hessian. We implement these techniques for both serial and distributed computing frameworks in the Julia package GridapTopOpt and the wider Gridap ecosystem. As part of this implementation we propose a novel graph-based approach for isolated volume detection. We demonstrate the applicability of the unfitted automatic shape differentiation framework and our implementation by considering the three-dimensional minimum compliance topology optimisation of a linear elastic wheel and of a linear elastic structure in a fluid-structure interaction problem with Stokes flow. The implementation is general and allows GridapTopOpt to solve a wider range of problems on unstructured meshes without analytic calculation of shape derivatives and avoiding issues that arise when material properties are smoothed at the domain boundary. The software is open source and available at this https URL.

[796] arXiv:2505.09438 (replaced) [pdf, other]
Title: Evaluating GPT- and Reasoning-based Large Language Models on Physics Olympiad Problems: Surpassing Human Performance and Implications for Educational Assessment
Paul Tschisgale, Holger Maus, Fabian Kieser, Ben Kroehs, Stefan Petersen, Peter Wulff
Subjects: Physics Education (physics.ed-ph); Artificial Intelligence (cs.AI)

Large language models (LLMs) are now widely accessible, reaching learners at all educational levels. This development has raised concerns that their use may circumvent essential learning processes and compromise the integrity of established assessment formats. In physics education, where problem solving plays a central role in instruction and assessment, it is therefore essential to understand the physics-specific problem-solving capabilities of LLMs. Such understanding is key to informing responsible and pedagogically sound approaches to integrating LLMs into instruction and assessment. This study therefore compares the problem-solving performance of a general-purpose LLM (GPT-4o, using varying prompting techniques) and a reasoning-optimized model (o1-preview) with that of participants of the German Physics Olympiad, based on a set of well-defined Olympiad problems. In addition to evaluating the correctness of the generated solutions, the study analyzes characteristic strengths and limitations of LLM-generated solutions. The findings of this study indicate that both tested LLMs (GPT-4o and o1-preview) demonstrate advanced problem-solving capabilities on Olympiad-type physics problems, on average outperforming the human participants. Prompting techniques had little effect on GPT-4o's performance, while o1-preview almost consistently outperformed both GPT-4o and the human benchmark. Based on these findings, the study discusses implications for the design of summative and formative assessment in physics education, including how to uphold assessment integrity and support students in critically engaging with LLMs.

[797] arXiv:2505.12578 (replaced) [pdf, html, other]
Title: Stacked conformal prediction
Paulo C. Marques F
Comments: 12 pages, 2 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

We consider a method for conformalizing a stacked ensemble of predictive models, showing that the potentially simple form of the meta-learner at the top of the stack enables a procedure with manageable computational cost that achieves approximate marginal validity without requiring the use of a separate calibration sample. Empirical results indicate that the method compares favorably to a standard inductive alternative.

[798] arXiv:2505.14518 (replaced) [pdf, html, other]
Title: Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples
Chun-Yi Kuan, Hung-yi Lee
Comments: Accepted to Interspeech 2025. Project Website: this https URL
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)

Recent advancements in audio-aware large language models (ALLMs) enable them to process and understand audio inputs. However, these models often hallucinate non-existent sound events, reducing their reliability in real-world applications. To address this, we propose LISTEN (Learning to Identify Sounds Through Extended Negative Samples), a contrastive-like training method that enhances ALLMs' ability to distinguish between present and absent sounds using synthesized data from the backbone LLM. Unlike prior approaches, our method requires no modification to LLM parameters and efficiently integrates audio representations via a lightweight adapter. Experiments show that LISTEN effectively mitigates hallucinations while maintaining impressive performance on existing audio question and reasoning benchmarks. At the same time, it is more efficient in both data and computation.

[799] arXiv:2505.18182 (replaced) [pdf, other]
Title: Machine Learning-Based Analysis of ECG and PCG Signals for Rheumatic Heart Disease Detection: A Scoping Review (2015-2025)
Damilare Emmanuel Olatunji, Julius Dona Zannu, Carine Pierrette Mukamakuza, Godbright Nixon Uiso, Chol Buol, Mona Mamoun Mubarak Aman, John Bosco Thuo, Nchofon Tagha Ghogomu, Evelyne Umubyeyi
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)

AI-powered stethoscopes offer a promising alternative for screening rheumatic heart disease (RHD), particularly in regions with limited diagnostic infrastructure. Early detection is vital, yet echocardiography, the gold standard tool, remains largely inaccessible in low-resource settings due to cost and workforce constraints. This review systematically examines machine learning (ML) applications from 2015 to 2025 that analyze electrocardiogram (ECG) and phonocardiogram (PCG) data to support accessible, scalable screening of all RHD variants in relation to the World Heart Federation's "25 by 25" goal to reduce RHD mortality. Using PRISMA-ScR guidelines, 37 peer-reviewed studies were selected from PubMed, IEEE Xplore, Scopus, and Embase. Convolutional neural networks (CNNs) dominate recent efforts, achieving a median accuracy of 97.75%, F1-score of 0.95, and AUROC of 0.89. However, challenges remain: 73% of studies used single-center datasets, 81.1% relied on private data, only 10.8% were externally validated, and none assessed cost-effectiveness. Although 45.9% originated from endemic regions, few addressed demographic diversity or implementation feasibility. These gaps underscore the disconnect between model performance and clinical readiness. Bridging this divide requires standardized benchmark datasets, prospective trials in endemic areas, and broader validation. If these issues are addressed, AI-augmented auscultation could transform cardiovascular diagnostics in underserved populations, thereby aiding early detection. This review also offers practical recommendations for building accessible ML-based RHD screening tools, aiming to close the diagnostic gap in low-resource settings where conventional auscultation may miss up to 90% of cases and echocardiography remains out of reach.

[800] arXiv:2506.07844 (replaced) [pdf, html, other]
Title: Conditional Local Independence Testing for Dynamic Causal Discovery
Mingzhou Liu, Xinwei Sun, Yizhou Wang
Comments: Working paper
Subjects: Methodology (stat.ME); Machine Learning (cs.LG)

Inferring causal relationships from dynamical systems is the central interest of many scientific inquiries. Conditional Local Independence (CLI), which describes whether the evolution of one process is influenced by another process given additional processes, is important for causal learning in such systems. However, existing CLI tests were limited to counting processes. In this paper, we propose a nonparametric CLT test for Itô processes. Specifically, we first introduce a testing statistic based on the Local Covariance Measure (LCM) by constructing a martingale from the conditional expectation of the process of interest. For estimation, we propose an efficient estimator based on the optimal filtering equation, which can achieve root-N consistency. To establish the asymptotic level and power of the test, we relax the restrictive boundedness condition to a moment bound condition, which is practical for Itô processes. We verify the proposed test in synthetic and real-world experiments.

[801] arXiv:2506.09711 (replaced) [pdf, html, other]
Title: Non-Euclidean dual gradient ascent for entropically regularized linear and semidefinite programming
Yuhang Cai, Michael Lindsey
Subjects: Optimization and Control (math.OC); Numerical Analysis (math.NA)

We present an optimization framework that exhibits dimension-independent convergence on a broad class of semidefinite programs (SDPs). Our approach first regularizes the primal problem with the von Neumann entropy, then solve the regularized problem using dual gradient ascent with respect to a problem-adapted norm. In particular, we show that the dual gradient norm converges to zero at a rate independent of the ambient dimension and, via rounding arguments, construct primal-feasible solutions in certain special cases. We also derive explicit convergence rates for the objective. In order to achieve optimal computational scaling, we must accommodate the use of stochastic gradients constructed via randomized trace estimators. Throughout we illustrate the generality of our framework via three important special cases -- the Goemans-Williamson SDP relaxation of the Max-Cut problem, the optimal transport linear program, and several SDP relaxations of the permutation synchronization problem. Numerical experiments confirm that our methods achieve dimension-independent convergence in practice.

[802] arXiv:2506.09730 (replaced) [pdf, html, other]
Title: Empirical and computer-aided robustness analysis of long-step and accelerated methods in smooth convex optimization
Pierre Vernimmen, François Glineur
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)

This work assesses both empirically and theoretically, using the performance estimation methodology, how robust different first-order optimization methods are when subject to relative inexactness in their gradient computations. Relative inexactness occurs, for example, when compressing the gradient using fewer bits of information, which happens when dealing with large-scale problems on GPUs. Three major families of methods are analyzed: constant step gradient descent, long-step methods, and accelerated methods. The latter two are first shown to be theoretically not robust to inexactness. Then, a semi-heuristic shortening factor is introduced to improve their theoretical guarantees. All methods are subsequently tested on a concrete inexact problem, with two different types of relative inexactness, and it is observed that both accelerated methods are much more robust than expected, and that the shortening factor significantly helps the long-step methods. In the end, all shortened methods appear to be promising, even in this inexact setting.

[803] arXiv:2506.10230 (replaced) [pdf, html, other]
Title: Prompt-Guided Latent Diffusion with Predictive Class Conditioning for 3D Prostate MRI Generation
Emerson P. Grabke, Masoom A. Haider, Babak Taati
Comments: MAH and BT are co-senior authors on the work. This work has been submitted to the IEEE for possible publication
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Objective: Latent diffusion models (LDM) could alleviate data scarcity challenges affecting machine learning development for medical imaging. However, medical LDM strategies typically rely on short-prompt text encoders, non-medical LDMs, or large data volumes. These strategies can limit performance and scientific accessibility. We propose a novel LDM conditioning approach to address these limitations. Methods: We propose Class-Conditioned Efficient Large Language model Adapter (CCELLA), a novel dual-head conditioning approach that simultaneously conditions the LDM U-Net with free-text clinical reports and radiology classification. We also propose a data-efficient LDM framework centered around CCELLA and a proposed joint loss function. We first evaluate our method on 3D prostate MRI against state-of-the-art. We then augment a downstream classifier model training dataset with synthetic images from our method. Results: Our method achieves a 3D FID score of 0.025 on a size-limited 3D prostate MRI dataset, significantly outperforming a recent foundation model with FID 0.071. When training a classifier for prostate cancer prediction, adding synthetic images generated by our method during training improves classifier accuracy from 69% to 74%. Training a classifier solely on our method's synthetic images achieved comparable performance to training on real images alone. Conclusion: We show that our method improved both synthetic image quality and downstream classifier performance using limited data and minimal human annotation. Significance: The proposed CCELLA-centric framework enables radiology report and class-conditioned LDM training for high-quality medical image synthesis given limited data volume and human data annotation, improving LDM performance and scientific accessibility. Code from this study will be available at this https URL

[804] arXiv:2506.10354 (replaced) [pdf, other]
Title: Revisiting mean estimation over $\ell_p$ balls: Is the MLE optimal?
Liviu Aolaritei, Michael I. Jordan, Reese Pathak, Annie Ulichney
Comments: 43 pages, 3 figures
Subjects: Statistics Theory (math.ST); Information Theory (cs.IT)

We revisit the problem of mean estimation in the Gaussian sequence model with $\ell_p$ constraints for $p \in [0, \infty]$. We demonstrate two phenomena for the behavior of the maximum likelihood estimator (MLE), which depend on the noise level, the radius of the (quasi)norm constraint, the dimension, and the norm index $p$. First, if $p$ lies between $0$ and $1 + \Theta(\tfrac{1}{\log d})$, inclusive, or if it is greater than or equal to $2$, the MLE is minimax rate-optimal for all noise levels and all constraint radii. On the other hand, for the remaining norm indices -- namely, if $p$ lies between $1 + \Theta(\tfrac{1}{\log d})$ and $2$ -- here is a more striking behavior: the MLE is minimax rate-suboptimal, despite its nonlinearity in the observations, for essentially all noise levels and constraint radii for which nonlinear estimates are necessary for minimax-optimal estimation. Our results imply that when given $n$ independent and identically distributed Gaussian samples, the MLE can be suboptimal by a polynomial factor in the sample size. Our lower bounds are constructive: whenever the MLE is rate-suboptimal, we provide explicit instances on which the MLE provably incurs suboptimal risk. Finally, in the non-convex case -- namely when $p < 1$ -- we develop sharp local Gaussian width bounds, which may be of independent interest.

[805] arXiv:2506.12269 (replaced) [pdf, html, other]
Title: ICME 2025 Grand Challenge on Video Super-Resolution for Video Conferencing
Babak Naderi, Ross Cutler, Juhee Cho, Nabakumar Khongbantabam, Dejan Ivkovic
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Super-Resolution (SR) is a critical task in computer vision, focusing on reconstructing high-resolution (HR) images from low-resolution (LR) inputs. The field has seen significant progress through various challenges, particularly in single-image SR. Video Super-Resolution (VSR) extends this to the temporal domain, aiming to enhance video quality using methods like local, uni-, bi-directional propagation, or traditional upscaling followed by restoration. This challenge addresses VSR for conferencing, where LR videos are encoded with H.265 at fixed QPs. The goal is to upscale videos by a specific factor, providing HR outputs with enhanced perceptual quality under a low-delay scenario using causal models. The challenge included three tracks: general-purpose videos, talking head videos, and screen content videos, with separate datasets provided by the organizers for training, validation, and testing. We open-sourced a new screen content dataset for the SR task in this challenge. Submissions were evaluated through subjective tests using a crowdsourced implementation of the ITU-T Rec P.910.

[806] arXiv:2506.14461 (replaced) [pdf, other]
Title: Collaborative Charging Scheduling via Balanced Bounding Box Methods
Fangting Zhou, Balazs Kulcsar, Jiaming Wu
Subjects: Optimization and Control (math.OC); Computational Engineering, Finance, and Science (cs.CE)

Electric mobility faces several challenges, most notably the high cost of infrastructure development and the underutilization of charging stations. The concept of shared charging offers a promising solution. The paper explores sustainable urban logistics through horizontal collaboration between two fleet operators and addresses a scheduling problem for the shared use of charging stations. To tackle this, the study formulates a collaborative scheduling problem as a bi-objective nonlinear integer programming model, in which each company aims to minimize its own costs, creating inherent conflicts that require trade-offs. The Balanced Bounding Box Methods (B3Ms) are introduced in order to efficiently derive the efficient frontier, identifying a reduced set of representative solutions. These methods enhance computational efficiency by selectively disregarding closely positioned and competing solutions, preserving the diversity and representativeness of the solutions over the efficient frontier. To determine the final solution and ensure balanced collaboration, cooperative bargaining methods are applied. Numerical case studies demonstrate the viability and scalability of the developed methods, showing that the B3Ms can significantly reduce computational time while maintaining the integrity of the frontier. These methods, along with cooperative bargaining, provide an effective framework for solving various bi-objective optimization problems, extending beyond the collaborative scheduling problem presented here.

[807] arXiv:2506.17064 (replaced) [pdf, html, other]
Title: Generative Modeling of Full-Atom Protein Conformations using Latent Diffusion on Graph Embeddings
Aditya Sengar, Ali Hariri, Daniel Probst, Patrick Barth, Pierre Vandergheynst
Comments: 10 pages (main text), 4 figures, 2 tables. Submitted to NeurIPS 2025. Code and data are publicly available
Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG)

Generating diverse, all-atom conformational ensembles of dynamic proteins such as G-protein-coupled receptors (GPCRs) is critical for understanding their function, yet most generative models simplify atomic detail or ignore conformational diversity altogether. We present latent diffusion for full protein generation (LD-FPG), a framework that constructs complete all-atom protein structures, including every side-chain heavy atom, directly from molecular dynamics (MD) trajectories. LD-FPG employs a Chebyshev graph neural network (ChebNet) to obtain low-dimensional latent embeddings of protein conformations, which are processed using three pooling strategies: blind, sequential and residue-based. A diffusion model trained on these latent representations generates new samples that a decoder, optionally regularized by dihedral-angle losses, maps back to Cartesian coordinates. Using D2R-MD, a 2-microsecond MD trajectory (12 000 frames) of the human dopamine D2 receptor in a membrane environment, the sequential and residue-based pooling strategy reproduces the reference ensemble with high structural fidelity (all-atom lDDT of approximately 0.7; C-alpha-lDDT of approximately 0.8) and recovers backbone and side-chain dihedral-angle distributions with a Jensen-Shannon divergence of less than 0.03 compared to the MD data. LD-FPG thereby offers a practical route to system-specific, all-atom ensemble generation for large proteins, providing a promising tool for structure-based therapeutic design on complex, dynamic targets. The D2R-MD dataset and our implementation are freely available to facilitate further research.

[808] arXiv:2506.20470 (replaced) [pdf, html, other]
Title: Pivot probabilities and norm effects in Gaussian elimination for $β$-ensembles
Kenji Gunawan, John Peca-Medlin
Subjects: Probability (math.PR); Numerical Analysis (math.NA)

We analyze pivot probabilities in Gaussian elimination with partial pivoting (GEPP) for $2 \times 2$ random matrix ensembles. For GUE matrices, we resolve a previously reported discrepancy between theoretical predictions and empirical observations by deriving the exact pivot probability under standard LAPACK-style implementations. We further show that Dumitriu-Edelman tridiagonal $\beta$-ensembles agree with the earlier theoretical expectations.

[809] arXiv:2506.22397 (replaced) [pdf, other]
Title: Dehazing Light Microscopy Images with Guided Conditional Flow Matching: finding a sweet spot between fidelity and realism
Anirban Ray, Ashesh, Florian Jug
Comments: 4 figures, 10 pages + refs, 40 pages total (including supplement), 24 supplementary figures
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Fluorescence microscopy is a major driver of scientific progress in the life sciences. Although high-end confocal microscopes are capable of filtering out-of-focus light, cheaper and more accessible microscopy modalities, such as widefield microscopy, can not, which consequently leads to hazy image data. Computational dehazing is trying to combine the best of both worlds, leading to cheap microscopy but crisp-looking images. The perception-distortion trade-off tells us that we can optimize either for data fidelity, e.g. low MSE or high PSNR, or for data realism, measured by perceptual metrics such as LPIPS or FID. Existing methods either prioritize fidelity at the expense of realism, or produce perceptually convincing results that lack quantitative accuracy. In this work, we propose HazeMatching, a novel iterative method for dehazing light microscopy images, which effectively balances these objectives. Our goal was to find a balanced trade-off between the fidelity of the dehazing results and the realism of individual predictions (samples). We achieve this by adapting the conditional flow matching framework by guiding the generative process with a hazy observation in the conditional velocity field. We evaluate HazeMatching on 5 datasets, covering both synthetic and real data, assessing both distortion and perceptual quality. Our method is compared against 7 baselines, achieving a consistent balance between fidelity and realism on average. Additionally, with calibration analysis, we show that HazeMatching produces well-calibrated predictions. Note that our method does not need an explicit degradation operator to exist, making it easily applicable on real microscopy data. All data used for training and evaluation and our code will be publicly available under a permissive license.

[810] arXiv:2506.22568 (replaced) [pdf, html, other]
Title: Maximum Dispersion, Maximum Concentration: Enhancing the Quality of MOP Solutions
Gladston Moreira, Ivan Meneghini, Elizabeth Wanner
Comments: 11 pages
Subjects: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV)

Multi-objective optimization problems (MOPs) often require a trade-off between conflicting objectives, maximizing diversity and convergence in the objective space. This study presents an approach to improve the quality of MOP solutions by optimizing the dispersion in the decision space and the convergence in a specific region of the objective space. Our approach defines a Region of Interest (ROI) based on a cone representing the decision maker's preferences in the objective space, while enhancing the dispersion of solutions in the decision space using a uniformity measure. Combining solution concentration in the objective space with dispersion in the decision space intensifies the search for Pareto-optimal solutions while increasing solution diversity. When combined, these characteristics improve the quality of solutions and avoid the bias caused by clustering solutions in a specific region of the decision space. Preliminary experiments suggest that this method enhances multi-objective optimization by generating solutions that effectively balance dispersion and concentration, thereby mitigating bias in the decision space.

[811] arXiv:2506.22704 (replaced) [pdf, other]
Title: Beyond Code: The Multidimensional Impacts of Large Language Models in Software Development
Sardar Bonabi, Sarah Bana, Vijay Gurbaxani, Tingting Nian
Subjects: General Economics (econ.GN); Artificial Intelligence (cs.AI)

Large language models (LLMs) are poised to significantly impact software development, especially in the Open-Source Software (OSS) sector. To understand this impact, we first outline the mechanisms through which LLMs may influence OSS through code development, collaborative knowledge transfer, and skill development. We then empirically examine how LLMs affect OSS developers' work in these three key areas. Leveraging a natural experiment from a temporary ChatGPT ban in Italy, we employ a Difference-in-Differences framework with two-way fixed effects to analyze data from all OSS developers on GitHub in three similar countries, Italy, France, and Portugal, totaling 88,022 users. We find that access to ChatGPT increases developer productivity by 6.4%, knowledge sharing by 9.6%, and skill acquisition by 8.4%. These benefits vary significantly by user experience level: novice developers primarily experience productivity gains, whereas more experienced developers benefit more from improved knowledge sharing and accelerated skill acquisition. In addition, we find that LLM-assisted learning is highly context-dependent, with the greatest benefits observed in technically complex, fragmented, or rapidly evolving contexts. We show that the productivity effects of LLMs extend beyond direct code generation to include enhanced collaborative learning and knowledge exchange among developers, dynamics that are essential for gaining a holistic understanding of LLMs' impact in OSS. Our findings offer critical managerial implications: strategically deploying LLMs can accelerate novice developers' onboarding and productivity, empower intermediate developers to foster knowledge sharing and collaboration, and support rapid skill acquisition, together enhancing long-term organizational productivity and agility.

[812] arXiv:2506.23309 (replaced) [pdf, html, other]
Title: SurgTPGS: Semantic 3D Surgical Scene Understanding with Text Promptable Gaussian Splatting
Yiming Huang, Long Bai, Beilei Cui, Kun Yuan, Guankun Wang, Mobarak I. Hoque, Nicolas Padoy, Nassir Navab, Hongliang Ren
Comments: MICCAI 2025. Project Page: this https URL
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

In contemporary surgical research and practice, accurately comprehending 3D surgical scenes with text-promptable capabilities is particularly crucial for surgical planning and real-time intra-operative guidance, where precisely identifying and interacting with surgical tools and anatomical structures is paramount. However, existing works focus on surgical vision-language model (VLM), 3D reconstruction, and segmentation separately, lacking support for real-time text-promptable 3D queries. In this paper, we present SurgTPGS, a novel text-promptable Gaussian Splatting method to fill this gap. We introduce a 3D semantics feature learning strategy incorporating the Segment Anything model and state-of-the-art vision-language models. We extract the segmented language features for 3D surgical scene reconstruction, enabling a more in-depth understanding of the complex surgical environment. We also propose semantic-aware deformation tracking to capture the seamless deformation of semantic features, providing a more precise reconstruction for both texture and semantic features. Furthermore, we present semantic region-aware optimization, which utilizes regional-based semantic information to supervise the training, particularly promoting the reconstruction quality and semantic smoothness. We conduct comprehensive experiments on two real-world surgical datasets to demonstrate the superiority of SurgTPGS over state-of-the-art methods, highlighting its potential to revolutionize surgical practices. SurgTPGS paves the way for developing next-generation intelligent surgical systems by enhancing surgical precision and safety. Our code is available at: this https URL.

Total of 812 entries
Showing up to 2000 entries per page: fewer | more | all
  • About
  • Help
  • contact arXivClick here to contact arXiv Contact
  • subscribe to arXiv mailingsClick here to subscribe Subscribe
  • Copyright
  • Privacy Policy
  • Web Accessibility Assistance
  • arXiv Operational Status
    Get status notifications via email or slack