Long-term Traffic Simulation with Interleaved Autoregressive Motion and Scenario Generation

Authors: Xiuyu Yang, Shuhan Tan, Philipp Krähenbühl

Abstract: An ideal traffic simulator replicates the realistic long-term point-to-point trip that a self-driving system experiences during deployment. Prior models and benchmarks focus on closed-loop motion simulation for initial agents in a scene. This is problematic for long-term simulation. Agents enter and exit the scene as the ego vehicle enters new regions. We propose InfGen, a unified next-token prediction model that performs interleaved closed-loop motion simulation and scene generation. InfGen automatically switches between closed-loop motion simulation and scene generation mode. It enables stable long-term rollout simulation. InfGen performs at the state-of-the-art in short-term (9s) traffic simulation, and significantly outperforms all other methods in long-term (30s) simulation. The code and model of InfGen will be released at https://orangesodahub.github.io/InfGen

Submitted 20 June, 2025; originally announced June 2025.

Comments: Preprint. Project page: https://orangesodahub.github.io/InfGen Code: https://github.com/OrangeSodahub/infgen

arXiv:2506.17204

Network Sparsity Unlocks the Scaling Potential of Deep Reinforcement Learning

Authors: Guozheng Ma, Lu Li, Zilin Wang, Li Shen, Pierre-Luc Bacon, Dacheng Tao

Abstract: Effectively scaling up deep reinforcement learning models has proven notoriously difficult due to network pathologies during training, motivating various targeted interventions such as periodic reset and architectural advances such as layer normalization. Instead of pursuing more complex modifications, we show that introducing static network sparsity alone can unlock further scaling potential beyond their dense counterparts with state-of-the-art architectures. This is achieved through simple one-shot random pruning, where a predetermined percentage of network weights are randomly removed once before training. Our analysis reveals that, in contrast to naively scaling up dense DRL networks, such sparse networks achieve both higher parameter efficiency for network expressivity and stronger resistance to optimization challenges like plasticity loss and gradient interference. We further extend our evaluation to visual and streaming RL scenarios, demonstrating the consistent benefits of network sparsity.

Submitted 20 June, 2025; originally announced June 2025.

Comments: Accepted to ICML 2025
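
The abstract above describes one-shot random pruning: a fixed fraction of weights is removed at random once, before training, and the mask then stays static. The following is a minimal sketch of that idea only, not the authors' implementation. The function names, the nested-list weight layout, and the seeding scheme are all assumptions made for illustration.

```python
import random


def one_shot_random_mask(shape, sparsity, seed=0):
    """Build a fixed 0/1 mask in which a `sparsity` fraction of entries is 0.

    Hypothetical helper: the mask is drawn once and never updated, which is
    what "static" sparsity means here.
    """
    rows, cols = shape
    total = rows * cols
    n_pruned = int(round(sparsity * total))
    rng = random.Random(seed)
    # Sample flat indices without replacement so exactly n_pruned are removed.
    pruned = set(rng.sample(range(total), n_pruned))
    return [
        [0 if r * cols + c in pruned else 1 for c in range(cols)]
        for r in range(rows)
    ]


def apply_mask(weights, mask):
    """Zero out pruned weights; call again after each update to keep them at zero."""
    return [
        [w * m for w, m in zip(w_row, m_row)]
        for w_row, m_row in zip(weights, mask)
    ]


# Prune half of a 4x5 weight matrix once, before any training step.
mask = one_shot_random_mask((4, 5), sparsity=0.5)
weights = [[1.0] * 5 for _ in range(4)]
sparse_weights = apply_mask(weights, mask)
```

In a training loop, the same mask would be reapplied after every optimizer step so that pruned weights never regrow.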

arXiv:2506.17197

Schrödinger Bridge Matching for Tree-Structured Costs and Entropic Wasserstein Barycentres

Authors: Samuel Howard, Peter Potaptchik, George Deligiannidis

Abstract: Recent advances in flow-based generative modelling have provided scalable methods for computing the Schrödinger Bridge (SB) between distributions, a dynamic form of entropy-regularised Optimal Transport (OT) for the quadratic cost. The successful Iterative Markovian Fitting (IMF) procedure solves the SB problem via sequential bridge-matching steps, presenting an elegant and practical approach with many favourable properties over the more traditional Iterative Proportional Fitting (IPF) procedure. Beyond the standard setting, optimal transport can be generalised to the multi-marginal case in which the objective is to minimise a cost defined over several marginal distributions. Of particular importance are costs defined over a tree structure, from which Wasserstein barycentres can be recovered as a special case. In this work, we extend the IMF procedure to solve for the tree-structured SB problem. Our resulting algorithm inherits the many advantages of IMF over IPF approaches in the tree-based setting. In the specific case of Wasserstein barycentres, our approach can be viewed as extending fixed-point approaches for barycentre computation to the case of flow-based entropic OT solvers.

Submitted 20 June, 2025; originally announced June 2025.

Comments: Preprint

arXiv:2506.17184

Judo: A User-Friendly Open-Source Package for Sampling-Based Model Predictive Control

Authors: Albert H. Li, Brandon Hung, Aaron D. Ames, Jiuguang Wang, Simon Le Cleac'h, Preston Culbertson

Abstract: Recent advancements in parallel simulation and successful robotic applications are spurring a resurgence in sampling-based model predictive control. To build on this progress, however, the robotics community needs common tooling for prototyping, evaluating, and deploying sampling-based controllers. We introduce Judo, a software package designed to address this need. To facilitate rapid prototyping and evaluation, Judo provides robust implementations of common sampling-based MPC algorithms and standardized benchmark tasks. It further emphasizes usability with simple but extensible interfaces for controller and task definitions, asynchronous execution for straightforward simulation-to-hardware transfer, and a highly customizable interactive GUI for tuning controllers interactively. While written in Python, the software leverages MuJoCo as its physics backend to achieve real-time performance, which we validate across both consumer and server-grade hardware. Code at https://github.com/bdaiinstitute/judo.

Submitted 20 June, 2025; originally announced June 2025.

Comments: Accepted at the 2025 RSS Workshop on Fast Motion Planning and Control in the Era of Parallelism. 5 Pages

arXiv:2506.17154

Global Microprocessor Correctness in the Presence of Transient Execution

Authors: Andrew T. Walter, Konstantinos Athanasiou, Panagiotis Manolios

Abstract: Correctness for microprocessors is generally understood to be conformance with the associated instruction set architecture (ISA). This is the basis for one of the most important abstractions in computer science, allowing hardware designers to develop highly-optimized processors that are functionally "equivalent" to an ideal processor that executes instructions atomically. This specification is almost always informal, e.g., commercial microprocessors generally do not come with conformance specifications. In this paper, we advocate for the use of formal specifications, using the theory of refinement. We introduce notions of correctness that can be used to deal with transient execution attacks, including Meltdown and Spectre. Such attacks have shown that ubiquitous microprocessor optimizations, appearing in numerous processors for decades, are inherently buggy. Unlike alternative approaches that use non-interference properties, our notion of correctness is global, meaning it is a single specification that formalizes conformance, includes functional correctness, and is parameterized by a microarchitecture. We introduce action skipping refinement, a new type of refinement, and we describe how our notions of refinement can be decomposed into properties that are more amenable to automated verification using the concept of shared-resource commitment refinement maps. We do this in the context of formal, fully executable bit- and cycle-accurate models of an ISA and a microprocessor. Finally, we show how light-weight formal methods based on property-based testing can be used to identify transient execution bugs.

Submitted 20 June, 2025; originally announced June 2025.

arXiv:2506.17142

A Note on Proper Relational Structures

Authors: Adam Bjorndahl, Philip Sink

Abstract: In this note we provide an algorithm for translating relational structures into "proper" relational structures, i.e., those such that there is no pair of worlds w and u such that w is accessible from u for every agent. In particular, our method of translation preserves many classical properties of relational structures, such as transitivity and the Euclidean property. As a result, this method of translation has many applications in the literature on Simplicial Semantics for modal logic, where the creation of proper canonical relational structures is a common step in proofs of completeness.

Submitted 20 June, 2025; originally announced June 2025.

MSC Class: 03B42 ACM Class: F.4.1

arXiv:2506.17137

On the Theory of Conditional Feature Alignment for Unsupervised Domain-Adaptive Counting

Authors: Zhuonan Liang, Dongnan Liu, Jianan Fan, Yaxuan Song, Qiang Qu, Yu Yao, Peng Fu, Weidong Cai

Abstract: Object counting models suffer when deployed across domains with differing density variety, since density shifts are inherently task-relevant and violate standard domain adaptation assumptions. To address this, we propose a theoretical framework of conditional feature alignment. We first formalize the notion of conditional divergence by partitioning each domain into subsets (e.g., object vs. background) and measuring divergences per condition. We then derive a joint error bound showing that, under discrete label spaces treated as condition sets, aligning distributions conditionally leads to tighter bounds on the combined source-target decision error than unconditional alignment. These insights motivate a general conditional adaptation principle: by preserving task-relevant variations while filtering out nuisance shifts, one can achieve superior cross-domain generalization for counting. We provide both a theoretical analysis, defining conditional divergence and proving its benefit in lowering joint error, and a practical adaptation strategy that preserves task-relevant information in unsupervised domain-adaptive counting. We demonstrate the effectiveness of our approach through extensive experiments on multiple counting datasets with varying density distributions. The results show that our method outperforms existing unsupervised domain adaptation methods, empirically validating the theoretical insights on conditional feature alignment.

Submitted 20 June, 2025; originally announced June 2025.

Comments: 18 pages, 5 figures, 8 tables

arXiv:2506.17135

No Scratch Quantum Computing by Reducing Qubit Overhead for Efficient Arithmetics

Authors: Omid Faizy, Norbert Wehn, Paul Lukowicz, Maximilian Kiefer-Emmanouilidis

Abstract: Quantum arithmetic computation requires a substantial number of scratch qubits to stay reversible. These operations necessitate qubit and gate resources equivalent to those needed for the larger of the input or output registers due to state encoding. Quantum Hamiltonian Computing (QHC) introduces a novel approach by encoding input for logic operations within a single rotating quantum gate. This innovation reduces the required qubit register $N$ to the size of the output states $O$, where $N = \log_2 O$. Leveraging QHC principles, we present reversible half-adder and full-adder circuits that compress the standard Toffoli + CNOT layout [Vedral et al., PRA, 54, 11, (1996)] from three-qubit and four-qubit formats for the quantum half-adder circuit, and five sequential Fredkin gates using five qubits [Moutinho et al., PRX Energy 2, 033002 (2023)] for the full-adder circuit, into a two-qubit, $4 \times 4$ Hilbert space. This scheme, presented here, is optimized for classical logic evaluated on quantum hardware, which due to unitary evolution can bypass classical CMOS energy limitations to a certain degree. Although we avoid superposition of input and output states in this manuscript, this remains feasible in principle. We see the best application for QHC in finding the minimal qubit and gate resources needed to evaluate any truth table, advancing FPGA capabilities using integrated quantum circuits or photonics.

Submitted 20 June, 2025; originally announced June 2025.

arXiv:2506.17124

When Can Model-Free Reinforcement Learning be Enough for Thinking?

Authors: Josiah P. Hanna, Nicholas E. Corrado

Abstract: Recent work on large language models has demonstrated the use of model-free reinforcement learning (RL) to train reasoning-like capabilities. The emergence of "thinking" through model-free RL is interesting as thinking actions neither produce reward nor change the external world state to one where the agent is more likely to get reward. This paper seeks to build a domain-independent understanding of when model-free RL will lead to "thinking" as a strategy for reward maximization. To build this understanding, we first introduce a theoretical model which we call a thought Markov decision process (MDP). Thought MDPs minimally extend the classical MDP model to include an abstract notion of thought state and thought action. Using the thought MDP model, we prove the importance of policy initialization in determining whether or not thinking emerges and show formally that thought actions are equivalent to the agent choosing to perform a step of policy improvement before continuing to act. We then show that open-source LLMs satisfy the conditions that our theory predicts are necessary for model-free RL to produce thinking-like behavior. Finally, we hypothesize sufficient conditions that would enable thinking to be learned outside of language generation and introduce a toy domain where a combination of multi-task pre-training and designated thought actions enable more data-efficient RL compared to non-thinking agents.

Submitted 20 June, 2025; originally announced June 2025.

Comments: 15 pages, 3 figures

arXiv:2506.17080

Tower+: Bridging Generality and Translation Specialization in Multilingual LLMs

Authors: Ricardo Rei, Nuno M. Guerreiro, José Pombal, João Alves, Pedro Teixeirinha, Amin Farajian, André F. T. Martins

Abstract: Fine-tuning pretrained LLMs has been shown to be an effective strategy for reaching state-of-the-art performance on specific tasks like machine translation. However, this process of adaptation often implies sacrificing general-purpose capabilities, such as conversational reasoning and instruction-following, hampering the utility of the system in real-world applications that require a mixture of skills. In this paper, we introduce Tower+, a suite of models designed to deliver strong performance across both translation and multilingual general-purpose text capabilities. We achieve a Pareto frontier between translation specialization and multilingual general-purpose capabilities by introducing a novel training recipe that builds on Tower (Alves et al., 2024), comprising continued pretraining, supervised fine-tuning, preference optimization, and reinforcement learning with verifiable rewards. At each stage of training, we carefully generate and curate data to strengthen performance on translation as well as general-purpose tasks involving code generation, mathematics problem solving, and general instruction-following. We develop models at multiple scales: 2B, 9B, and 72B. Our smaller models often outperform larger general-purpose open-weight and proprietary LLMs (e.g., Llama 3.3 70B, GPT-4o). Our largest model delivers best-in-class translation performance for high-resource languages and top results in multilingual Arena Hard evaluations and in IF-MT, a benchmark we introduce for evaluating both translation and instruction-following. Our findings highlight that it is possible to rival frontier models in general capabilities, while optimizing for specific business domains, such as translation and localization.

Submitted 20 June, 2025; originally announced June 2025.

arXiv:2506.17077

Simultaneous Translation with Offline Speech and LLM Models in CUNI Submission to IWSLT 2025

Authors: Dominik Macháček, Peter Polák

Abstract: This paper describes the Charles University submission to the Simultaneous Speech Translation Task of IWSLT 2025. We cover all four language pairs with a direct or cascade approach. The backbone of our systems is the offline Whisper speech model, which we use for both translation and transcription in simultaneous mode with the state-of-the-art simultaneous policy AlignAtt. We further improve the performance by prompting to inject in-domain terminology, and we accommodate context. Our cascaded systems further use EuroLLM for unbounded simultaneous translation. Compared to the Organizers' baseline, our systems improve by 2 BLEU points on Czech to English and 13-22 BLEU points on English to German, Chinese and Japanese on the development sets. Additionally, we propose a new enhanced measure of speech recognition latency.

Submitted 20 June, 2025; originally announced June 2025.

Comments: IWSLT 2025

arXiv:2506.17065

Flow-Based Non-stationary Temporal Regime Causal Structure Learning

Authors: Abdellah Rahmani, Pascal Frossard

Abstract: Understanding causal relationships in multivariate time series is crucial in many scenarios, such as those dealing with financial or neurological data. Many such time series exhibit multiple regimes, i.e., consecutive temporal segments with a priori unknown boundaries, with each regime having its own causal structure. Inferring causal dependencies and regime shifts is critical for analyzing the underlying processes. However, causal structure learning in this setting is challenging due to (1) non-stationarity, i.e., each regime can have its own causal graph and mixing function, and (2) complex noise distributions, which may be non-Gaussian or heteroscedastic. Existing causal discovery approaches cannot address these challenges, since they generally assume stationarity or Gaussian noise with constant variance. Hence, we introduce FANTOM, a unified framework for causal discovery that handles non-stationary processes along with non-Gaussian and heteroscedastic noises. FANTOM simultaneously infers the number of regimes and their corresponding indices and learns each regime's Directed Acyclic Graph. It uses a Bayesian Expectation Maximization algorithm that maximizes the evidence lower bound of the data log likelihood. On the theoretical side, we prove, under mild assumptions, that temporal heteroscedastic causal models, introduced in FANTOM's formulation, are identifiable in both stationary and non-stationary settings. In addition, extensive experiments on synthetic and real data show that FANTOM outperforms existing methods.

Submitted 20 June, 2025; originally announced June 2025.

arXiv:2506.17064

Generative Modeling of Full-Atom Protein Conformations using Latent Diffusion on Graph Embeddings

Authors: Aditya Sengar, Ali Hariri, Daniel Probst, Patrick Barth, Pierre Vandergheynst

Abstract: Generating diverse, all-atom conformational ensembles of dynamic proteins such as G-protein-coupled receptors (GPCRs) is critical for understanding their function, yet most generative models simplify atomic detail or ignore conformational diversity altogether. We present latent diffusion for full protein generation (LD-FPG), a framework that constructs complete all-atom protein structures, including every side-chain heavy atom, directly from molecular dynamics (MD) trajectories. LD-FPG employs a Chebyshev graph neural network (ChebNet) to obtain low-dimensional latent embeddings of protein conformations, which are processed using three pooling strategies: blind, sequential and residue-based. A diffusion model trained on these latent representations generates new samples that a decoder, optionally regularized by dihedral-angle losses, maps back to Cartesian coordinates. Using D2R-MD, a 2-microsecond MD trajectory (12 000 frames) of the human dopamine D2 receptor in a membrane environment, the sequential and residue-based pooling strategy reproduces the reference ensemble with high structural fidelity (all-atom lDDT of approximately 0.7; C-alpha-lDDT of approximately 0.8) and recovers backbone and side-chain dihedral-angle distributions with a Jensen-Shannon divergence of less than 0.03 compared to the MD data. LD-FPG thereby offers a practical route to system-specific, all-atom ensemble generation for large proteins, providing a promising tool for structure-based therapeutic design on complex, dynamic targets. The D2R-MD dataset and our implementation are freely available to facilitate further research.

Submitted 20 June, 2025; originally announced June 2025.

Comments: 10 pages (main text), 4 figures, 2 tables. Submitted to NeurIPS 2025. Code and data are publicly available
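
The abstract above compares dihedral-angle distributions using the Jensen-Shannon divergence. As a reminder of what that metric computes, here is a minimal sketch for two discrete histograms. It is not taken from the paper's code: the function names are hypothetical, and the base-2 logarithm (which bounds the value in [0, 1]) is an assumption, since the abstract does not state the log base or the binning.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions.

    Both inputs are assumed to be normalized histograms over the same bins.
    The mixture m is positive wherever p or q is, so the KL terms are finite.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Identical histograms give 0; fully disjoint histograms give the maximum, 1 bit.
same = js_divergence([0.25, 0.75], [0.25, 0.75])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

Unlike the KL divergence, this quantity is symmetric in its arguments and always finite, which makes it convenient for comparing a generated ensemble against reference data.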

arXiv:2506.17057

Behavior Driven Development for 3D Games

Authors: Fernando Pastor Ricós, Beatriz Marín, I. S. W. B. Prasetya, Tanja E. J. Vos, Joseph Davidson, Karel Hovorka

Abstract: Computer 3D games are complex software environments that require novel testing processes to ensure high-quality standards. The Intelligent Verification/Validation for Extended Reality Based Systems (iv4XR) framework addresses this need by enabling the implementation of autonomous agents to automate game testing scenarios. This framework facilitates the automation of regression test cases for complex 3D games like Space Engineers. Nevertheless, the technical expertise required to define test scripts using iv4XR can constrain seamless collaboration between developers and testers. This paper reports how integrating a Behavior-driven Development (BDD) approach with the iv4XR framework allows the industrial company behind Space Engineers to automate regression testing. The success of this industrial collaboration has inspired the iv4XR team to integrate the BDD approach to improve the automation of play-testing for the experimental 3D game LabRecruits. Furthermore, the iv4XR framework has been extended with tactical programming to enable the automation of long-play test scenarios in Space Engineers. These results underscore the versatility of the iv4XR framework in supporting diverse testing approaches while showcasing how BDD empowers users to create, manage, and execute automated game tests using comprehensive and human-readable statements.

Submitted 20 June, 2025; originally announced June 2025.

arXiv:2506.17047

Navigating the Deep: Signature Extraction on Deep Neural Networks

Authors: Haolin Liu, Adrien Siproudhis, Samuel Experton, Peter Lorenz, Christina Boura, Thomas Peyrin

Abstract: Neural network model extraction has emerged in recent years as an important security concern, as adversaries attempt to recover a network's parameters via black-box queries. A key step in this process is signature extraction, which aims to recover the absolute values of the network's weights layer by layer. Prior work, notably by Carlini et al. (2020), introduced a technique inspired by differential cryptanalysis to extract neural network parameters. However, their method suffers from several limitations that restrict its applicability to networks with a few layers only. Later works focused on improving sign extraction, but largely relied on the assumption that signature extraction itself was feasible. In this work, we revisit and refine the signature extraction process by systematically identifying and addressing for the first time critical limitations of Carlini et al.'s signature extraction method. These limitations include rank deficiency and noise propagation from deeper layers. To overcome these challenges, we propose efficient algorithmic solutions for each of the identified issues, greatly improving the efficiency of signature extraction. Our approach permits the extraction of much deeper networks than was previously possible. We validate our method through extensive experiments on ReLU-based neural networks, demonstrating significant improvements in extraction depth and accuracy. For instance, our extracted network matches the target network on at least 95% of the input space for each of the eight layers of a neural network trained on the CIFAR-10 dataset, while previous works could barely extract the first three layers. Our results represent a crucial step toward practical attacks on larger and more complex neural network architectures.

Submitted 20 June, 2025; originally announced June 2025.

Comments: 26 pages

arXiv:2506.17046

MUCAR: Benchmarking Multilingual Cross-Modal Ambiguity Resolution for Multimodal Large Language Models

Authors: Xiaolong Wang, Zhaolu Kang, Wangyuxuan Zhai, Xinyue Lou, Yunghwei Lai, Ziyue Wang, Yawen Wang, Kaiyu Huang, Yile Wang, Peng Li, Yang Liu

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant advances across numerous vision-language tasks. Due to their strong image-text alignment capability, MLLMs can effectively understand image-text pairs with clear meanings. However, effectively resolving the inherent ambiguities in natural language and visual contexts remains challenging. Existing multimodal benchmarks typically overlook linguistic and visual ambiguities, relying mainly on unimodal context for disambiguation and thus failing to exploit the mutual clarification potential between modalities. To bridge this gap, we introduce MUCAR, a novel and challenging benchmark designed explicitly for evaluating multimodal ambiguity resolution across multilingual and cross-modal scenarios. MUCAR includes: (1) a multilingual dataset where ambiguous textual expressions are uniquely resolved by corresponding visual contexts, and (2) a dual-ambiguity dataset that systematically pairs ambiguous images with ambiguous textual contexts, with each combination carefully constructed to yield a single, clear interpretation through mutual disambiguation. Extensive evaluations involving 19 state-of-the-art multimodal models--encompassing both open-source and proprietary architectures--reveal substantial gaps compared to human-level performance, highlighting the need for future research into more sophisticated cross-modal ambiguity comprehension methods, further pushing the boundaries of multimodal reasoning.

Submitted 20 June, 2025; originally announced June 2025.

arXiv:2506.17040 [pdf, ps, other]

Stretching Beyond the Obvious: A Gradient-Free Framework to Unveil the Hidden Landscape of Visual Invariance

Authors: Lorenzo Tausani, Paolo Muratore, Morgan B. Talbot, Giacomo Amerio, Gabriel Kreiman, Davide Zoccolan

Abstract: Uncovering which feature combinations high-level visual units encode is critical to understanding how images are transformed into representations that support recognition. While existing feature visualization approaches typically infer a unit's most exciting images, this is insufficient to reveal the manifold of transformations under which responses remain invariant, which is key to generalization in vision. Here we introduce Stretch-and-Squeeze (SnS), an unbiased, model-agnostic, and gradient-free framework to systematically characterize a unit's invariance landscape and its vulnerability to adversarial perturbations in both biological and artificial visual systems. SnS frames these transformations as bi-objective optimization problems. To probe invariance, SnS seeks image perturbations that maximally alter the representation of a reference stimulus in a given processing stage while preserving unit activation. To probe adversarial sensitivity, SnS seeks perturbations that minimally alter the stimulus while suppressing unit activation. Applied to convolutional neural networks (CNNs), SnS revealed image variations that were further from a reference image in pixel-space than those produced by affine transformations, while more strongly preserving the target unit's response. The discovered invariant images differed dramatically depending on the choice of image representation used for optimization: pixel-level changes primarily affected luminance and contrast, while stretching mid- and late-layer CNN representations altered texture and pose respectively. Notably, the invariant images from robust networks were more recognizable by human subjects than those from standard networks, supporting the higher fidelity of robust CNNs as models of the visual system.

Submitted 20 June, 2025; originally announced June 2025.

Comments: 21 pages, 9 figures

arXiv:2506.17035 [pdf]

Critical Appraisal of Fairness Metrics in Clinical Predictive AI

Authors: João Matos, Ben Van Calster, Leo Anthony Celi, Paula Dhiman, Judy Wawira Gichoya, Richard D. Riley, Chris Russell, Sara Khalid, Gary S. Collins

Abstract: Predictive artificial intelligence (AI) offers an opportunity to improve clinical practice and patient outcomes, but risks perpetuating biases if fairness is inadequately addressed. However, the definition of "fairness" remains unclear. We conducted a scoping review to identify and critically appraise fairness metrics for clinical predictive AI. We defined a "fairness metric" as a measure quantifying whether a model discriminates (societally) against individuals or groups defined by sensitive attributes. We searched five databases (2014-2024), screening 820 records, to include 41 studies, and extracted 62 fairness metrics. Metrics were classified by performance-dependency, model output level, and base performance metric, revealing a fragmented landscape with limited clinical validation and overreliance on threshold-dependent measures. Eighteen metrics were explicitly developed for healthcare, including only one clinical utility metric. Our findings highlight conceptual challenges in defining and quantifying fairness and identify gaps in uncertainty quantification, intersectionality, and real-world applicability. Future work should prioritise clinically meaningful metrics.

Submitted 20 June, 2025; originally announced June 2025.

Comments: 32 pages, 1 figure, 2 tables, 5 boxes, 4 linked supplementary materials

arXiv:2506.17029 [pdf, ps, other]

Scalable and Reliable Multi-agent Reinforcement Learning for Traffic Assignment

Authors: Leizhen Wang, Peibo Duan, Cheng Lyu, Zewen Wang, Zhiqiang He, Nan Zheng, Zhenliang Ma

Abstract: The evolution of metropolitan cities and the increase in travel demands impose stringent requirements on traffic assignment methods. Multi-agent reinforcement learning (MARL) approaches outperform traditional methods in modeling adaptive routing behavior without requiring explicit system dynamics, which is beneficial for real-world deployment. However, MARL frameworks face challenges in scalability and reliability when managing extensive networks with substantial travel demand, which limits their practical applicability in solving large-scale traffic assignment problems. To address these challenges, this study introduces MARL-OD-DA, a new MARL framework for the traffic assignment problem, which redefines agents as origin-destination (OD) pair routers rather than individual travelers, significantly enhancing scalability. Additionally, a Dirichlet-based action space with action pruning and a reward function based on the local relative gap are designed to enhance solution reliability and improve convergence efficiency. Experiments demonstrate that the proposed MARL framework effectively handles medium-sized networks with extensive and varied city-level OD demand, surpassing existing MARL methods. When implemented in the SiouxFalls network, MARL-OD-DA achieves better assignment solutions in 10 steps, with a relative gap that is 94.99% lower than that of conventional methods.

Submitted 20 June, 2025; originally announced June 2025.

arXiv:2506.17025 [pdf, ps, other]

Volumetric Parameterization for 3-Dimensional Simply-Connected Manifolds

Authors: Zhiyuan Lyu, Qiguang Chen, Gary P. T. Choi, Lok Ming Lui

Abstract: With advances in technology, there has been growing interest in developing effective mapping methods for 3-dimensional objects in recent years. Volumetric parameterization for 3D solid manifolds plays an important role in processing 3D data. However, the conventional approaches cannot control the bijectivity and local geometric distortions of the resulting mappings due to the complex structure of the solid manifolds. Moreover, prior methods mainly focus on one property instead of balancing different properties during the mapping process. In this paper, we propose several novel methods for computing volumetric parameterizations for 3D simply-connected manifolds. Analogous to surface parameterization, our framework incorporates several models designed to preserve geometric structure, achieve density equalization, and optimally balance geometric and density distortions. With these methods, various 3D manifold parameterizations with different desired properties can be achieved. These methods are tested on different examples and manifold remeshing applications, demonstrating their effectiveness and accuracy.

Submitted 20 June, 2025; originally announced June 2025.

arXiv:2506.17008 [pdf, ps, other]

When does FTP become FPT?

Authors: Matthias Bentert, Fedor V. Fomin, Petr A. Golovach, Laure Morelle

Abstract: In the problem Fault-Tolerant Path (FTP), we are given an edge-weighted directed graph $G = (V, E)$, a subset $U \subseteq E$ of vulnerable edges, two vertices $s, t \in V$, and integers $k$ and $\ell$. The task is to decide whether there exists a subgraph $H$ of $G$ with total cost at most $\ell$ such that, after the removal of any $k$ vulnerable edges, $H$ still contains an $s$-$t$-path. We study whether Fault-Tolerant Path is fixed-parameter tractable (FPT) and whether it admits a polynomial kernel under various parameterizations. Our choices of parameters include: the number of vulnerable edges in the input graph, the number of safe (i.e., invulnerable) edges in the input graph, the budget $\ell$, the minimum number of safe edges in any optimal solution, the minimum number of vulnerable edges in any optimal solution, the required redundancy $k$, and natural above- and below-guarantee parameterizations. We provide an almost complete description of the complexity landscape of FTP for these parameters.

Submitted 20 June, 2025; originally announced June 2025.

Comments: Appeared in WG 2025
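
The Fault-Tolerant Path definition in the abstract above can be made concrete with a tiny exhaustive checker. This is only an illustrative sketch, not the authors' method: it runs in exponential time, and the function names (`reachable`, `survives`, `ftp_brute_force`) and the dictionary-based graph encoding are assumptions made for this example. It also assumes at most one edge per ordered vertex pair.

```python
from itertools import combinations


def reachable(edges, s, t):
    """Depth-first search: is t reachable from s along the directed edges?"""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    stack, seen = [s], {s}
    while stack:
        u = stack.pop()
        if u == t:
            return True
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return False


def survives(h_edges, vulnerable, s, t, k):
    """Check that H keeps an s-t path after deleting any k vulnerable edges.

    Deleting more edges never helps connectivity, so it is enough to try
    deletions of exactly min(k, number of vulnerable edges in H) edges.
    """
    weak = [e for e in h_edges if e in vulnerable]
    r = min(k, len(weak))
    for removed in combinations(weak, r):
        rest = [e for e in h_edges if e not in removed]
        if not reachable(rest, s, t):
            return False
    return True


def ftp_brute_force(cost, vulnerable, s, t, k, budget):
    """Decide FTP by trying every edge subset H whose cost fits the budget.

    cost: dict mapping a directed edge (u, v) to its cost (hypothetical encoding).
    vulnerable: set of edges that may fail.
    """
    edges = list(cost)
    for size in range(len(edges) + 1):
        for h in combinations(edges, size):
            if sum(cost[e] for e in h) > budget:
                continue
            if survives(h, vulnerable, s, t, k):
                return True
    return False


# Two edge-disjoint s-t paths, every edge vulnerable and of unit cost.
cost = {("s", "a"): 1, ("a", "t"): 1, ("s", "b"): 1, ("b", "t"): 1}
vulnerable = set(cost)
needs_both_paths = ftp_brute_force(cost, vulnerable, "s", "t", k=1, budget=4)
too_cheap = ftp_brute_force(cost, vulnerable, "s", "t", k=1, budget=3)
```

With one possible failure (`k=1`), both paths must be bought, so a budget of 4 suffices and a budget of 3 does not. The exhaustive search over edge subsets is exactly the blow-up that the parameterized analysis in the paper is concerned with.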

arXiv:2506.17007 [pdf, ps, other]

Robust Reinforcement Learning for Discrete Compositional Generation via General Soft Operators

Authors: Marco Jiralerspong, Esther Derman, Danilo Vucetic, Nikolay Malkin, Bilun Sun, Tianyu Zhang, Pierre-Luc Bacon, Gauthier Gidel

Abstract: A major bottleneck in scientific discovery involves narrowing a large combinatorial set of objects, such as proteins or molecules, to a small set of promising candidates. While this process largely relies on expert knowledge, recent methods leverage reinforcement learning (RL) to enhance this filtering. They achieve this by estimating proxy reward functions from available datasets and using regularization to generate more diverse candidates. These reward functions are inherently uncertain, raising a particularly salient challenge for scientific discovery. In this work, we show that existing methods, often framed as sampling proportional to a reward function, are inadequate and yield suboptimal candidates, especially in large search spaces. To remedy this issue, we take a robust RL approach and introduce a unified operator that seeks robustness to the uncertainty of the proxy reward function. This general operator targets peakier sampling distributions while encompassing known soft RL operators. It also leads us to a novel algorithm that identifies higher-quality, diverse candidates in both synthetic and real-world tasks. Ultimately, our work offers a new, flexible perspective on discrete compositional generation tasks. Code: https://github.com/marcojira/tgm.

Submitted 20 June, 2025; originally announced June 2025.

arXiv:2506.17004 [pdf, ps, other]

A Synthetic Benchmark for Collaborative 3D Semantic Occupancy Prediction in V2X Autonomous Driving

Authors: Hanlin Wu, Pengfei Lin, Ehsan Javanmardi, Naren Bao, Bo Qian, Hao Si, Manabu Tsukada

Abstract: 3D semantic occupancy prediction is an emerging perception paradigm in autonomous driving, providing a voxel-level representation of both geometric details and semantic categories. However, the perception capability of a single vehicle is inherently constrained by occlusion, restricted sensor range, and narrow viewpoints. To address these limitations, collaborative perception enables the exchange of complementary information, thereby enhancing the completeness and accuracy. In the absence of a dedicated dataset for collaborative 3D semantic occupancy prediction, we augment an existing collaborative perception dataset by replaying it in CARLA with a high-resolution semantic voxel sensor to provide dense and comprehensive occupancy annotations. In addition, we establish benchmarks with varying prediction ranges designed to systematically assess the impact of spatial extent on collaborative prediction. We further develop a baseline model that performs inter-agent feature fusion via spatial alignment and attention aggregation. Experimental results demonstrate that our baseline model consistently outperforms single-agent models, with increasing gains observed as the prediction range expands.

Submitted 20 June, 2025; originally announced June 2025.

arXiv:2506.17001 [pdf, ps, other]

PersonalAI: Towards digital twins in the graph form

Authors: Mikhail Menschikov, Dmitry Evseev, Ruslan Kostoev, Ilya Perepechkin, Ilnaz Salimov, Victoria Dochkina, Petr Anokhin, Evgeny Burnaev, Nikita Semenov

Abstract: The challenge of personalizing language models, specifically the ability to account for a user's history during interactions, is of significant interest. Despite recent advancements in large language models (LLMs) and Retrieval Augmented Generation that have enhanced the factual base of LLMs, the task of retaining extensive personal information and using it to generate personalized responses remains pertinent. To address this, we propose utilizing external memory in the form of knowledge graphs, which are constructed and updated by the LLM itself. We have expanded upon the ideas of the AriGraph architecture and for the first time introduced a combined graph featuring both standard edges and two types of hyperedges. Experiments conducted on the TriviaQA, HotpotQA and DiaASQ benchmarks indicate that this approach aids in making the process of graph construction and knowledge extraction unified and robust. Furthermore, we augmented the DiaASQ benchmark by incorporating parameters such as time into dialogues and introducing contradictory statements made by the same speaker at different times. Despite these modifications, the performance of the question-answering system remained robust, demonstrating the proposed architecture's ability to maintain and utilize temporal dependencies.

Submitted 20 June, 2025; originally announced June 2025.

arXiv:2506.16975 [pdf, ps, other]

Latent Concept Disentanglement in Transformer-based Language Models

Authors: Guan Zhe Hong, Bhavya Vasudeva, Vatsal Sharan, Cyrus Rashtchian, Prabhakar Raghavan, Rina Panigrahy

Abstract: When large language models (LLMs) use in-context learning (ICL) to solve a new task, they seem to grasp not only the goal of the task but also core, latent concepts in the demonstration examples. This begs the question of whether transformers represent latent structures as part of their computation or whether they take shortcuts to solve the problem. Prior mechanistic work on ICL does not address this question because it does not sufficiently examine the relationship between the learned representation and the latent concept, and the considered problem settings often involve only single-step reasoning. In this work, we examine how transformers disentangle and use latent concepts. We show that in 2-hop reasoning tasks with a latent, discrete concept, the model successfully identifies the latent concept and does step-by-step concept composition. In tasks parameterized by a continuous latent concept, we find low-dimensional subspaces in the representation space where the geometry mimics the underlying parameterization. Together, these results refine our understanding of ICL and the representation of transformers, and they provide evidence for highly localized structures in the model that disentangle latent concepts in ICL tasks.

Submitted 20 June, 2025; originally announced June 2025.

arXiv:2506.16940 [pdf, ps, other]

LunarLoc: Segment-Based Global Localization on the Moon

Authors: Annika Thomas, Robaire Galliath, Aleksander Garbuz, Luke Anger, Cormac O'Neill, Trevor Johst, Dami Thomas, George Lordos, Jonathan P. How

Abstract: Global localization is necessary for autonomous operations on the lunar surface where traditional Earth-based navigation infrastructure, such as GPS, is unavailable. As NASA advances toward sustained lunar presence under the Artemis program, autonomous operations will be an essential component of tasks such as robotic exploration and infrastructure deployment. Tasks such as excavation and transport of regolith require precise pose estimation, but proposed approaches such as visual-inertial odometry (VIO) accumulate odometry drift over long traverses. Precise pose estimation is particularly important for upcoming missions such as the ISRU Pilot Excavator (IPEx) that rely on autonomous agents to operate over extended timescales and varied terrain. To help overcome odometry drift over long traverses, we propose LunarLoc, an approach to global localization that leverages instance segmentation for zero-shot extraction of boulder landmarks from onboard stereo imagery. Segment detections are used to construct a graph-based representation of the terrain, which is then aligned with a reference map of the environment captured during a previous session using graph-theoretic data association. This method enables accurate and drift-free global localization in visually ambiguous settings. LunarLoc achieves sub-cm level accuracy in multi-session global localization experiments, significantly outperforming the state of the art in lunar global localization. To encourage the development of further methods for global localization on the Moon, we release our datasets publicly with a playback module: https://github.com/mit-acl/lunarloc-data.

Submitted 20 June, 2025; originally announced June 2025.

arXiv:2506.16929 [pdf]

doi 10.11591/ijphs.v13i1.22577

A deep learning and machine learning approach to predict neonatal death in the context of São Paulo

Authors: Mohon Raihan, Plabon Kumar Saha, Rajan Das Gupta, A Z M Tahmidul Kabir, Afia Anjum Tamanna, Md. Harun-Ur-Rashid, Adnan Bin Abdus Salam, Md Tanvir Anjum, A Z M Ahteshamul Kabir

Abstract: Neonatal death is still a concerning reality for underdeveloped and even some developed countries. Worldwide data indicate that 26.693 babies out of 1,000 births die, according to Macro Trades. To reduce this number, early prediction of endangered babies is crucial. Such prediction enables the opportunity to take ample care of the child and mother so that early child death can be avoided. In this context, machine learning was used to determine whether a newborn baby is at risk. To train the predictive model, historical data of 1.4 million newborns was used. Machine learning and deep learning techniques such as logistic regression, K-nearest neighbor, random forest classifier, extreme gradient boosting (XGBoost), convolutional neural network, and long short-term memory (LSTM) were implemented using the dataset to identify the most accurate model for predicting neonatal mortality. Among the machine learning algorithms, XGBoost and random forest classifier achieved the best accuracy with 94%, while among the deep learning models, LSTM delivered the highest accuracy with 99%. Therefore, using LSTM appears to be the most suitable approach to predict whether precautionary measures for a child are necessary.

Submitted 20 June, 2025; originally announced June 2025.

Journal ref: Int J Public Health Sci, vol. 13, no. 1, pp. 179-190, 2024

arXiv:2506.16912 [pdf, ps, other]

From Data to Knowledge: Evaluating How Efficiently Language Models Learn Facts

Authors: Daniel Christoph, Max Ploner, Patrick Haller, Alan Akbik

Abstract: Sample efficiency is a crucial property of language models with practical implications for training efficiency. In real-world text, information follows a long-tailed distribution. Yet, we expect models to learn and recall frequent and infrequent facts. Sample-efficient models are better equipped to handle this challenge of learning and retaining rare information without requiring excessive exposure. This study analyzes multiple models of varying architectures and sizes, all trained on the same pre-training data. By annotating relational facts with their frequencies in the training corpus, we examine how model performance varies with fact frequency. Our findings show that most models perform similarly on high-frequency facts but differ notably on low-frequency facts. This analysis provides new insights into the relationship between model architecture, size, and factual learning efficiency.

Submitted 20 June, 2025; originally announced June 2025.

Comments: Accepted to the First Workshop on Large Language Model Memorization (L2M2), co-located with ACL 2025 in Vienna

arXiv:2506.16892 [pdf, ps, other]

Orbital Collision: An Indigenously Developed Web-based Space Situational Awareness Platform

Authors: Partha Chowdhury, Harsha M, Ayush Gupta, Sanat K Biswas

Abstract: This work presents an indigenous web-based platform, Orbital Collision (OrCo), created by the Space Systems Laboratory at IIIT Delhi, to enhance Space Situational Awareness (SSA) by predicting collision probabilities of space objects using Two-Line Element (TLE) data. The work highlights the growing challenges of congestion in the Earth's orbital environment, mainly due to space debris and defunct satellites, which increase collision risks. It employs several methods for propagating orbital uncertainty and calculating the collision probability. The performance of the platform is evaluated through accuracy assessments and efficiency metrics, in order to improve the tracking of space objects and ensure the safety of satellites in congested space.

Submitted 20 June, 2025; originally announced June 2025.

Comments: This work has already been submitted to the STEP-IPSC 2025 Conference Proceedings

arXiv:2506.16875 [pdf, ps, other]

Comparison of substructured non-overlapping domain decomposition and overlapping additive Schwarz methods for large-scale Helmholtz problems with multiple sources

Authors: Boris Martin, Pierre Jolivet, Christophe Geuzaine

Abstract: Solving large-scale Helmholtz problems discretized with high-order finite elements is notoriously difficult, especially in 3D where direct factorization of the system matrix is very expensive and memory demanding, and robust convergence of iterative methods is difficult to obtain. Domain decomposition methods (DDM) constitute one of the most promising strategies so far, by combining direct and iterative approaches: using direct solvers on overlapping or non-overlapping subdomains, as a preconditioner for a Krylov subspace method on the original Helmholtz system or as an iterative solver on a substructured problem involving field values or Lagrange multipliers on the interfaces between the subdomains. In this work we compare the computational performance of non-overlapping substructured DDM and Optimized Restricted Additive Schwarz (ORAS) preconditioners for solving large-scale Helmholtz problems with multiple sources, as encountered, e.g., in frequency-domain Full Waveform Inversion. We show on a realistic geophysical test-case that, when appropriately tuned, the non-overlapping methods can reduce the convergence gap sufficiently to significantly outperform the overlapping methods.

Submitted 20 June, 2025; originally announced June 2025.

Comments: 21 pages, 10 figures, 5 tables. Preprint for a submission to SIAM SISC

MSC Class: 35J05; 65N55; 68W10; 35-04; 86-08 ACM Class: J.2; G.1.3; G.1.8; G.4

arXiv:2506.16852 [pdf, ps, other]

Controllable and Expressive One-Shot Video Head Swapping

Authors: Chaonan Ji, Jinwei Qi, Peng Zhang, Bang Zhang, Liefeng Bo

Abstract: In this paper, we propose a novel diffusion-based multi-condition controllable framework for video head swapping, which seamlessly transplants a human head from a static image into a dynamic video while preserving the original body and background of the target video, and further allows head expressions and movements to be adjusted during swapping as needed. Existing face-swapping methods mainly focus on localized facial replacement and neglect holistic head morphology, while head-swapping approaches struggle with hairstyle diversity and complex backgrounds, and none of these methods allow users to modify the transplanted head expressions after swapping. To tackle these challenges, our method incorporates several innovative strategies through a unified latent diffusion paradigm. 1) Identity-preserving context fusion: We propose a shape-agnostic mask strategy to explicitly disentangle foreground head identity features from background/body contexts, combined with a hair enhancement strategy to achieve robust holistic head identity preservation across diverse hair types and complex backgrounds. 2) Expression-aware landmark retargeting and editing: We propose a disentangled 3DMM-driven retargeting module that decouples identity, expression, and head poses, minimizing the impact of original expressions in input images and supporting expression editing. A scale-aware retargeting strategy is further employed to minimize cross-identity expression distortion for higher transfer precision. Experimental results demonstrate that our method excels in seamless background integration while preserving the identity of the source portrait, as well as showcasing superior expression transfer capabilities applicable to both real and virtual characters.

Submitted 20 June, 2025; originally announced June 2025.

Comments: Project page: https://humanaigc.github.io/SwapAnyHead/

arXiv:2506.16844 [pdf, ps, other]

Bandwidth Selectors on Semiparametric Bayesian Networks

Authors: Victor Alejandre, Concha Bielza, Pedro Larrañaga

Abstract: Semiparametric Bayesian networks (SPBNs) integrate parametric and non-parametric probabilistic models, offering flexibility in learning complex data distributions from samples. In particular, kernel density estimators (KDEs) are employed for the non-parametric component. Under the assumption of data normality, the normal rule is used to learn the bandwidth matrix for the KDEs in SPBNs. This matrix is the key hyperparameter that controls the trade-off between bias and variance. However, real-world data often deviates from normality, potentially leading to suboptimal density estimation and reduced predictive performance. This paper first establishes the theoretical framework for the application of state-of-the-art bandwidth selectors and subsequently evaluates their impact on SPBN performance. We explore the approaches of cross-validation and plug-in selectors, assessing their effectiveness in enhancing the learning capability and applicability of SPBNs. To support this investigation, we have extended the open-source package PyBNesian for SPBNs with the additional bandwidth selection techniques and conducted extensive experimental analyses. Our results demonstrate that the proposed bandwidth selectors leverage increasing information more effectively than the normal rule, which, despite its robustness, stagnates with more data. In particular, unbiased cross-validation generally outperforms the normal rule, highlighting its advantage in high sample size scenarios.

Submitted 20 June, 2025; originally announced June 2025.

Comments: 37 pages, 15 figures. Submitted to Information Sciences

ACM Class: I.2.6; I.5.1; G.3
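
The normal rule mentioned in the abstract above can be illustrated with a short sketch. It assumes the standard normal reference rule for a full bandwidth matrix, H = (4 / (d + 2))^(2 / (d + 4)) · n^(-2 / (d + 4)) · Σ, where Σ is the sample covariance, n the sample size, and d the dimension. The function names are hypothetical and are not PyBNesian's API; the paper's exact formulation may differ.

```python
from typing import List

Matrix = List[List[float]]


def sample_covariance(data: Matrix) -> Matrix:
    """Unbiased sample covariance of `data` (n rows = samples, d columns)."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in data) / (n - 1)
            for j in range(d)
        ]
        for i in range(d)
    ]


def normal_rule_bandwidth(data: Matrix) -> Matrix:
    """Bandwidth matrix H from the normal reference rule (assumed form).

    H = (4 / (d + 2))^(2 / (d + 4)) * n^(-2 / (d + 4)) * Sigma
    """
    n, d = len(data), len(data[0])
    factor = (4.0 / (d + 2)) ** (2.0 / (d + 4)) * n ** (-2.0 / (d + 4))
    return [[factor * c for c in row] for row in sample_covariance(data)]


if __name__ == "__main__":
    samples = [[0.0, 0.0], [1.0, 2.0], [2.0, 1.0], [3.0, 3.0]]
    for row in normal_rule_bandwidth(samples):
        print(row)
```

For d = 1 this reduces to the familiar h ≈ 1.06 · σ · n^(-1/5), with h the square root of the single entry of H. Because the rule depends on the data only through Σ and n, it cannot adapt to non-normal shapes, which is the limitation that cross-validation and plug-in selectors are meant to address.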

arXiv:2506.16836 [pdf, ps, other]

Engineering Resilience: An Energy-Based Approach to Sustainable Behavioural Interventions

Authors: Arpitha Srivathsa Malavalli, Karthik Sama, Janvi Chhabra, Pooja Bassin, Srinath Srinivasa

Abstract: Addressing complex societal challenges, such as improving public health, fostering honesty in workplaces, or encouraging eco-friendly behaviour, requires effective nudges to influence human behaviour at scale. Intervention science seeks to design such nudges within complex societal systems. While interventions primarily aim to shift the system toward a desired state, less attention is given to the sustainability of that state, which we define in terms of resilience: the system's ability to retain the desired state even under perturbations. In this work, we offer a more holistic perspective on intervention design by incorporating a nature-inspired postulate, namely that lower energy states tend to exhibit greater resilience, as a regularization mechanism within intervention optimization to ensure that the resulting state is also sustainable. Using a simple agent-based simulation where commuters are nudged to choose eco-friendly options (e.g., cycles) over individually attractive but less eco-friendly ones (e.g., cars), we demonstrate how embedding the lower-energy postulate into intervention design induces resilience. The system energy is defined in terms of the motivators that drive its agents' behaviour. By inherently ensuring that agents are not pushed into actions that contradict their motivators, the energy-based approach helps design effective interventions that contribute to resilient behavioural states.

Submitted 20 June, 2025; originally announced June 2025.

arXiv:2506.16827 [pdf, ps, other]

Beyond Blur: A Fluid Perspective on Generative Diffusion Models

Authors: Grzegorz Gruszczynski, Michal Jan Wlodarczyk, Jakub J Meixner, Przemyslaw Musialski

Abstract: We propose a novel PDE-driven corruption process for generative image synthesis based on advection-diffusion processes, which generalizes existing PDE-based approaches. Our forward pass formulates image corruption via a physically motivated PDE that couples directional advection with isotropic diffusion and Gaussian noise, controlled by dimensionless numbers (Peclet, Fourier). We implement this PDE numerically through a GPU-accelerated custom Lattice Boltzmann solver for fast evaluation. To induce realistic turbulence, we generate stochastic velocity fields that introduce coherent motion and capture multi-scale mixing. In the generative process, a neural network learns to reverse the advection-diffusion operator, thus constituting a novel generative model. We discuss how previous methods emerge as specific cases of our operator, demonstrating that our framework generalizes prior PDE-based corruption techniques. We illustrate how advection improves the diversity and quality of the generated images while keeping the overall color palette unaffected. This work bridges fluid dynamics, dimensionless PDE theory, and deep generative modeling, offering a fresh perspective on physically informed image corruption processes for diffusion-based synthesis.

Submitted 20 June, 2025; originally announced June 2025.

Comments: 11 pages, 8 figures, pre-print, supplementary pseudocode in appendix

ACM Class: I.2.6; I.4.10; I.4.8

arXiv:2506.16826 [pdf, ps, other]

AnyTraverse: An off-road traversability framework with VLM and human operator in the loop

Authors: Sattwik Sahu, Agamdeep Singh, Karthik Nambiar, Srikanth Saripalli, P. B. Sujit

Abstract: Off-road traversability segmentation enables autonomous navigation with applications in search-and-rescue, military operations, wildlife exploration, and agriculture. Current frameworks struggle due to significant variations in unstructured environments and uncertain scene changes, and do not adapt to different robot types. We present AnyTraverse, a framework combining natural language-based prompts with human-operator assistance to determine navigable regions for diverse robotic vehicles. The system segments scenes for a given set of prompts and calls the operator only when encountering previously unexplored scenery or an unknown class not part of the prompt in its region-of-interest, thus reducing active supervision load while adapting to varying outdoor scenes. Our zero-shot learning approach eliminates the need for extensive data collection or retraining. Our experimental validation includes testing on the RELLIS-3D, Freiburg Forest, and RUGD datasets and demonstrates real-world deployment on multiple robot platforms. The results show that AnyTraverse performs better than GA-NAV and Off-seg while offering a vehicle-agnostic approach to off-road traversability that balances automation with targeted human supervision.

Submitted 20 June, 2025; originally announced June 2025.

arXiv:2506.16824 [pdf, ps, other]

Predicting New Research Directions in Materials Science using Large Language Models and Concept Graphs

Authors: Thomas Marwitz, Alexander Colsmann, Ben Breitung, Christoph Brabec, Christoph Kirchlechner, Eva Blasco, Gabriel Cadilha Marques, Horst Hahn, Michael Hirtz, Pavel A. Levkin, Yolita M. Eggeler, Tobias Schlöder, Pascal Friederich

Abstract: Due to an exponential increase in published research articles, it is impossible for individual scientists to read all publications, even within their own research field. In this work, we investigate the use of large language models (LLMs) for the purpose of extracting the main concepts and semantic information from scientific abstracts in the domain of materials science to find links that were not noticed by humans and thus to suggest inspiring near/mid-term future research directions. We show that LLMs can extract concepts more efficiently than automated keyword extraction methods to build a concept graph as an abstraction of the scientific literature. A machine learning model is trained to predict emerging combinations of concepts, i.e. new research ideas, based on historical data. We demonstrate that integrating semantic concept information leads to an increased prediction performance. The applicability of our model is demonstrated in qualitative interviews with domain experts based on individualized model suggestions. We show that the model can inspire materials scientists in their creative thinking process by predicting innovative combinations of topics that have not yet been investigated.

Submitted 20 June, 2025; originally announced June 2025.

arXiv:2506.16822 [pdf, ps, other]

Learning Dexterous Object Handover

Authors: Daniel Frau-Alfaro, Julio Castaño-Amoros, Santiago Puente, Pablo Gil, Roberto Calandra

Abstract: Object handover is an important skill that we use daily when interacting with other humans. To deploy robots in collaborative settings, such as homes, being able to receive and hand over objects safely and efficiently becomes a crucial skill. In this work, we demonstrate the use of Reinforcement Learning (RL) for dexterous object handover between two multi-finger hands. Key to this task is the use of a novel reward function based on dual quaternions to minimize the rotation distance, which outperforms other rotation representations such as Euler angles and rotation matrices. The robustness of the trained policy is experimentally evaluated by testing w.r.t. objects that are not included in the training distribution, and perturbations during the handover process. The results demonstrate that the trained policy successfully performs this task, achieving a total success rate of 94% in the best-case scenario after 100 experiments, thereby showing the robustness of our policy with novel objects. In addition, the best-case performance of the policy decreases by only 13.8% when the other robot moves during the handover, indicating that our policy is also robust to this type of perturbation, which is common in real-world object handovers.

Submitted 20 June, 2025; originally announced June 2025.

Comments: Paper accepted for presentation in RoMan 2025
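
The idea of rewarding a small rotation distance, as described in the abstract above, can be sketched as follows. This is a simplified illustration, not the paper's reward: it uses ordinary unit quaternions (the rotation part of a dual quaternion) and the geodesic angle 2 · arccos(|⟨q1, q2⟩|). The function names and the negative-angle reward shape are assumptions made for this example.

```python
import math
from typing import Sequence

Quaternion = Sequence[float]  # (w, x, y, z), assumed to be unit length


def rotation_distance(q1: Quaternion, q2: Quaternion) -> float:
    """Geodesic angle in radians between the rotations encoded by q1 and q2.

    The absolute value makes q and -q (the same rotation) have distance 0.
    """
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    dot = min(1.0, dot)  # guard against rounding slightly above 1
    return 2.0 * math.acos(dot)


def rotation_reward(q_object: Quaternion, q_target: Quaternion) -> float:
    """Hypothetical reward term: larger (closer to 0) when orientations agree."""
    return -rotation_distance(q_object, q_target)


if __name__ == "__main__":
    identity = (1.0, 0.0, 0.0, 0.0)
    quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
    print(rotation_distance(identity, quarter_turn_z))  # about pi / 2
    print(rotation_reward(identity, identity))
```

Unlike Euler angles, this distance has no singular configurations and treats the two quaternion signs of one rotation identically, which is one common argument for quaternion-based rotation rewards.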

arXiv:2506.16821 [pdf, ps, other]

Self-supervised Feature Extraction for Enhanced Ball Detection on Soccer Robots

Authors: Can Lin, Daniele Affinita, Marco E. P. Zimmatore, Daniele Nardi, Domenico D. Bloisi, Vincenzo Suriani

Abstract: Robust and accurate ball detection is a critical component for autonomous humanoid soccer robots, particularly in dynamic and challenging environments such as RoboCup outdoor fields. However, traditional supervised approaches require extensive manual annotation, which is costly and time-intensive. To overcome this problem, we present a self-supervised learning framework for domain-adaptive feature extraction to enhance ball detection performance. The proposed approach leverages a general-purpose pretrained model to generate pseudo-labels, which are then used in a suite of self-supervised pretext tasks -- including colorization, edge detection, and triplet loss -- to learn robust visual features without relying on manual annotations. Additionally, a model-agnostic meta-learning (MAML) strategy is incorporated to ensure rapid adaptation to new deployment scenarios with minimal supervision. A new dataset comprising 10,000 labeled images from outdoor RoboCup SPL matches is introduced, used to validate the method, and made available to the community. Experimental results demonstrate that the proposed pipeline outperforms baseline models in terms of accuracy, F1 score, and IoU, while also exhibiting faster convergence.

Submitted 20 June, 2025; originally announced June 2025.

arXiv:2506.16815 [pdf, ps, other]

doi 10.1109/TNSE.2022.3170364

Robust Group Anomaly Detection for Quasi-Periodic Network Time Series

Authors: Kai Yang, Shaoyu Dou, Pan Luo, Xin Wang, H. Vincent Poor

Abstract: Many real-world multivariate time series are collected from a network of physical objects embedded with software, electronics, and sensors. The quasi-periodic signals generated by these objects often follow a similar repetitive and periodic pattern, but have variations in the period, and come in different lengths caused by timing (synchronization) errors. Given a multitude of such quasi-periodic time series, can we build machine learning models to identify those time series that behave differently from the majority of the observations? In addition, can the models help human experts to understand how the decision was made? We propose a sequence to Gaussian Mixture Model (seq2GMM) framework. The overarching goal of this framework is to identify unusual and interesting time series within a network time series database. We further develop a surrogate-based optimization algorithm that can efficiently train the seq2GMM model. Seq2GMM exhibits strong empirical performance on a plurality of public benchmark datasets, outperforming state-of-the-art anomaly detection techniques by a significant margin. We also theoretically analyze the convergence property of the proposed training algorithm and provide numerical results to substantiate our theoretical claims.

Submitted 20 June, 2025; originally announced June 2025.

Comments: Published in IEEE Transactions on Network Science and Engineering

Journal ref: IEEE Transactions on Network Science and Engineering. Volume: 9, Issue: 4, 01 July-Aug. 2022

arXiv:2506.16812 [pdf, ps, other]

Zero-Knowledge Proof-of-Location Protocols for Vehicle Subsidies and Taxation Compliance

Authors: Dan Bogdanov, Eduardo Brito, Annika Jaakson, Peeter Laud, Raul-Martin Rebane

Abstract: This paper introduces a new set of privacy-preserving mechanisms for verifying compliance with location-based policies for vehicle taxation, or for (electric) vehicle (EV) subsidies, using Zero-Knowledge Proofs (ZKPs). We present the design and evaluation of a Zero-Knowledge Proof-of-Location (ZK-PoL) system that ensures a vehicle's adherence to territorial driving requirements without disclosing specific location data, hence maintaining user privacy. Our findings suggest a promising approach to apply ZK-PoL protocols in large-scale governmental subsidy or taxation programs.

Submitted 20 June, 2025; originally announced June 2025.

Comments: This is the extended version of the paper to appear in the Proceedings of the 5th International Workshop on Security and Privacy in Intelligent Infrastructures (SP2I 2025), held in conjunction with the 20th International Conference on Availability, Reliability and Security (ARES 2025)

arXiv:2506.16805 [pdf, ps, other]

Co-VisiON: Co-Visibility ReasONing on Sparse Image Sets of Indoor Scenes

Authors: Chao Chen, Nobel Dang, Juexiao Zhang, Wenkai Sun, Pengfei Zheng, Xuhang He, Yimeng Ye, Taarun Srinivas, Chen Feng

Abstract: Humans exhibit a remarkable ability to recognize co-visibility (the overlapping regions visible in multiple images) even when these images are sparsely distributed across a complex scene. This capability is foundational in 3D vision and robotic perception. Despite significant progress in vision learning, it remains unclear whether current vision models have reached human-level proficiency in co-visibility analysis. In this work, we introduce the Co-Visibility reasONing (Co-VisiON) benchmark, designed to directly evaluate co-visibility reasoning on sparse image sets across over 1000 indoor scenarios. Our experiments reveal that while co-visibility is typically treated as a low-level feature matching task, it poses a significant challenge for existing vision models under sparse conditions. Notably, a proprietary vision-language model outperforms all purely vision-based approaches, with all models lagging substantially behind human performance. This gap underscores the need for more than basic pairwise vision processing: it calls for a comprehensive spatial understanding through high-level reasoning across multiple views. Inspired by human visual cognition, we propose a novel multi-view baseline, Covis, which achieves top performance among pure vision models and narrows the gap to the proprietary VLM. We hope our benchmark and findings will spur further advancements in developing vision models capable of robust, high-level reasoning in challenging, sparse environments. Our dataset and source code can be found at: https://ai4ce.github.io/CoVISION

Submitted 20 June, 2025; originally announced June 2025.

arXiv:2506.16791 [pdf, ps, other]

TabArena: A Living Benchmark for Machine Learning on Tabular Data

Authors: Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Mutalik Desai, David Salinas, Frank Hutter

Abstract: With the growing popularity of deep learning and foundation models for tabular data, the need for standardized and reliable benchmarks is higher than ever. However, current benchmarks are static. Their design is not updated even if flaws are discovered, model versions are updated, or new models are released. To address this, we introduce TabArena, the first continuously maintained living tabular benchmarking system. To launch TabArena, we manually curate a representative collection of datasets and well-implemented models, conduct a large-scale benchmarking study to initialize a public leaderboard, and assemble a team of experienced maintainers. Our results highlight the influence of validation method and ensembling of hyperparameter configurations to benchmark models at their full potential. While gradient-boosted trees are still strong contenders on practical tabular datasets, we observe that deep learning methods have caught up under larger time budgets with ensembling. At the same time, foundation models excel on smaller datasets. Finally, we show that ensembles across models advance the state-of-the-art in tabular machine learning and investigate the contributions of individual models. We launch TabArena with a public leaderboard, reproducible code, and maintenance protocols to create a living benchmark available at https://tabarena.ai.

Submitted 20 June, 2025; originally announced June 2025.

Comments: 51 pages. Code available at https://tabarena.ai/code; examples at https://tabarena.ai/code-examples; dataset curation at https://tabarena.ai/data-tabular-ml-iid-study and https://tabarena.ai/dataset-curation

arXiv:2506.16777 [pdf, ps, other]

DistillNote: LLM-based clinical note summaries improve heart failure diagnosis

Authors: Heloisa Oss Boll, Antonio Oss Boll, Leticia Puttlitz Boll, Ameen Abu Hanna, Iacer Calixto

Abstract: Large language models (LLMs) offer unprecedented opportunities to generate concise summaries of patient information and alleviate the burden of clinical documentation that overwhelms healthcare providers. We present DistillNote, a framework for LLM-based clinical note summarization, and generate over 64,000 admission note summaries through three techniques: (1) One-step, direct summarization, and a divide-and-conquer approach involving (2) Structured summarization focused on independent clinical insights, and (3) Distilled summarization that further condenses the Structured summaries. We test how useful the summaries are by using them to predict heart failure, compared to a model trained on the original notes. Distilled summaries achieve 79% text compression and up to 18.2% improvement in AUPRC compared to an LLM trained on the full notes. We also evaluate the quality of the generated summaries in an LLM-as-judge evaluation as well as through blinded pairwise comparisons with clinicians. Evaluations indicate that one-step summaries are favoured by clinicians according to relevance and clinical actionability, while distilled summaries offer optimal efficiency (avg. 6.9x compression-to-performance ratio) and significantly reduce hallucinations. We release our summaries on PhysioNet to encourage future research.

Submitted 20 June, 2025; originally announced June 2025.

arXiv:2506.16759 [pdf, ps, other]

Adaptive Sketching Based Construction of H2 Matrices on GPUs

Authors: Wajih Halim Boukaram, Yang Liu, Pieter Ghysels, Xiaoye Sherry Li

Abstract: We develop a novel linear-complexity bottom-up sketching-based algorithm for constructing an $H^2$ matrix, and present its high-performance GPU implementation. The construction algorithm requires both a black-box sketching operator and an entry evaluation function. The novelty of our GPU approach centers around the design and implementation of the above two operations in batched mode on GPU with accommodation for variable-size data structures in a batch. The batch algorithms minimize the number of kernel launches and maximize the GPU throughput. When applied to covariance matrices, volume IE matrices and $H^2$ update operations, our proposed GPU implementation achieves up to $13\times$ speedup over our CPU implementation, and up to $1000\times$ speedup over an existing GPU implementation of the top-down sketching-based algorithm from the H2Opus library. It also achieves a $660\times$ speedup over an existing sketching-based $H$ construction algorithm from the ButterflyPACK library. Our work represents the first GPU implementation of the class of bottom-up sketching-based $H^2$ construction algorithms.

Submitted 20 June, 2025; originally announced June 2025.

arXiv:2506.16712 [pdf, ps, other]

ReasonGRM: Enhancing Generative Reward Models through Large Reasoning Models

Authors: Bin Chen, Xinzge Gao, Chuanrui Hu, Penghang Yu, Hua Zhang, Bing-Kun Bao

Abstract: Generative Reward Models (GRMs) provide greater flexibility than scalar reward models in capturing human preferences, but their effectiveness is limited by poor reasoning capabilities. This often results in incomplete or overly speculative reasoning paths, leading to hallucinations or missing key information in complex tasks. We address this challenge with ReasonGRM, a three-stage generative reward modeling framework. In the first stage, Zero-RL is used to generate concise, outcome-directed reasoning paths that reduce the likelihood of critical omissions. In the second stage, we introduce a novel evaluation metric, $R^\star$, which scores reasoning paths based on their generation likelihood. This favors paths that reach correct answers with minimal exploration, helping to reduce hallucination-prone data during training. In the final stage, the model is further refined through reinforcement learning on challenging examples to enhance its preference discrimination capabilities. Experiments on three public benchmarks show that ReasonGRM achieves competitive or state-of-the-art performance, outperforming previous best GRMs by 1.8% on average and surpassing proprietary models such as GPT-4o by up to 5.6%. These results demonstrate the effectiveness of reasoning-aware training and highlight the importance of high-quality rationale selection for reliable preference modeling.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.16710 [pdf, ps, other]

Experimental Setup and Software Pipeline to Evaluate Optimization based Autonomous Multi-Robot Search Algorithms

Authors: Aditya Bhatt, Mary Katherine Corra, Franklin Merlo, Prajit KrisshnaKumar, Souma Chowdhury

Abstract: Signal source localization has been a problem of interest in the multi-robot systems domain given its applications in search & rescue and hazard localization in various industrial and outdoor settings. A variety of multi-robot search algorithms exist that usually formulate and solve the associated autonomous motion planning problem as a heuristic model-free or belief model-based optimization process. Most of these algorithms, however, remain tested only in simulation, thereby losing the opportunity to generate knowledge about how such algorithms would compare in a real physical setting in terms of search performance and real-time computing performance. To address this gap, this paper presents a new lab-scale physical setup and associated open-source software pipeline to evaluate and benchmark multi-robot search algorithms. The presented physical setup innovatively uses an acoustic source (that is safe and inexpensive) and small ground robots (e-pucks) operating in a standard motion-capture environment. This setup can be easily recreated and used by most robotics researchers. The acoustic source also presents interesting uncertainty in terms of its noise-to-signal ratio, which is useful to assess sim-to-real gaps. The overall software pipeline is designed to readily interface with any multi-robot search algorithm with minimal effort and is executable in parallel asynchronous form. This pipeline includes a framework for distributed implementation of multi-robot or swarm search algorithms, integrated with a ROS (Robot Operating System)-based software stack for motion capture supported localization. The utility of this novel setup is demonstrated by using it to evaluate two state-of-the-art multi-robot search algorithms, based on swarm optimization and batch-Bayesian Optimization (called Bayes-Swarm), as well as a random walk baseline.

Submitted 19 June, 2025; originally announced June 2025.

Comments: to be published in IDETC 2025 conference proceedings

arXiv:2506.16683 [pdf, ps, other]

A Simple Contrastive Framework Of Item Tokenization For Generative Recommendation

Authors: Penglong Zhai, Yifang Yuan, Fanyi Di, Jie Li, Yue Liu, Chen Li, Jie Huang, Sicong Wang, Yao Xu, Xin Li

Abstract: Generative retrieval-based recommendation has emerged as a promising paradigm aiming at directly generating the identifiers of the target candidates. However, in large-scale recommendation systems, this approach becomes increasingly cumbersome due to the redundancy and sheer scale of the token space. To overcome these limitations, recent research has explored the use of semantic tokens as an alternative to ID tokens, typically leveraging reconstruction-based strategies, like RQ-VAE, to quantize content embeddings and significantly reduce the embedding size. However, reconstructive quantization aims for the precise reconstruction of each item embedding independently, which conflicts with the goal of generative retrieval tasks, which focus more on differentiating among items. Moreover, multi-modal side information of items, such as descriptive text and images or geographical knowledge in location-based recommendation services, has been shown to be effective in improving recommendations by providing richer contexts for interactions. Nevertheless, effectively integrating such complementary knowledge into existing generative recommendation frameworks remains challenging. To overcome these challenges, we propose a novel unsupervised deep quantization exclusively based on contrastive learning, named SimCIT (a Simple Contrastive Item Tokenization framework). Specifically, unlike existing reconstruction-based strategies, SimCIT uses a learnable residual quantization module to align with the signals from different modalities of the items, which combines multi-modal knowledge alignment and semantic tokenization in a mutually beneficial contrastive learning framework. Extensive experiments across public datasets and a large-scale industrial dataset from various domains demonstrate SimCIT's effectiveness in LLM-based generative recommendation.

Submitted 19 June, 2025; originally announced June 2025.

Comments: 12 pages, 7 figures
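
The residual quantization mentioned in the abstract above turns an item embedding into a short sequence of discrete tokens. The sketch below shows only the generic inference-time mechanism with fixed, hand-written codebooks. In SimCIT the module is learnable and trained contrastively, which this example does not attempt, and all names and values here are illustrative assumptions.

```python
from typing import List, Tuple

Vector = List[float]
Codebook = List[Vector]


def nearest_code(codebook: Codebook, vec: Vector) -> int:
    """Index of the codebook entry closest to `vec` in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((c - v) ** 2 for c, v in zip(codebook[k], vec)),
    )


def residual_quantize(vec: Vector, codebooks: List[Codebook]) -> Tuple[List[int], Vector]:
    """Quantize `vec` level by level; each level encodes what the previous left over."""
    residual = list(vec)
    tokens: List[int] = []
    for codebook in codebooks:
        k = nearest_code(codebook, residual)
        tokens.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return tokens, residual


def reconstruct(tokens: List[int], codebooks: List[Codebook]) -> Vector:
    """Sum the selected code vectors to approximate the original embedding."""
    dim = len(codebooks[0][0])
    out = [0.0] * dim
    for k, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out


if __name__ == "__main__":
    # Two levels with two codes each (toy values).
    books = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [0.25, -0.25]]]
    item_tokens, leftover = residual_quantize([1.2, 0.8], books)
    print(item_tokens, reconstruct(item_tokens, books))
```

The token sequence acts as the item's semantic identifier: items with similar embeddings share leading tokens, so a generative recommender can decode identifiers from a vocabulary far smaller than the item catalogue.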

arXiv:2506.16679 [pdf, ps, other]

How to Train your Text-to-Image Model: Evaluating Design Choices for Synthetic Training Captions

Authors: Manuel Brack, Sudeep Katakol, Felix Friedrich, Patrick Schramowski, Hareesh Ravi, Kristian Kersting, Ajinkya Kale

Abstract: Training data is at the core of any successful text-to-image model. The quality and descriptiveness of image text are crucial to a model's performance. Given the noisiness and inconsistency in web-scraped datasets, recent works have shifted towards synthetic training captions. While this setup is generally believed to produce more capable models, current literature does not provide any insights into its design choices. This study closes this gap by systematically investigating how different synthetic captioning strategies impact the downstream performance of text-to-image models. Our experiments demonstrate that dense, high-quality captions enhance text alignment but may introduce trade-offs in output aesthetics and diversity. Conversely, captions of randomized lengths yield balanced improvements across aesthetics and alignment without compromising sample diversity. We also demonstrate that varying caption distributions introduce significant shifts in the output bias of a trained model. Our findings underscore the importance of caption design in achieving optimal model performance and provide practical insights for more effective training data strategies in text-to-image generation.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.16654 [pdf, ps, other]

Relational Deep Learning: Challenges, Foundations and Next-Generation Architectures

Authors: Vijay Prakash Dwivedi, Charilaos Kanatsoulis, Shenyang Huang, Jure Leskovec

Abstract: Graph machine learning has led to a significant increase in the capabilities of models that learn on arbitrary graph-structured data and has been applied to molecules, social networks, recommendation systems, and transportation, among other domains. Data in multi-tabular relational databases can also be constructed as 'relational entity graphs' for Relational Deep Learning (RDL) - a new blueprint that enables end-to-end representation learning without traditional feature engineering. Compared to arbitrary graph-structured data, relational entity graphs have key properties: (i) their structure is defined by primary-foreign key relationships between entities in different tables, (ii) the structural connectivity is a function of the relational schema defining a database, and (iii) the graph connectivity is temporal and heterogeneous in nature. In this paper, we provide a comprehensive review of RDL by first introducing the representation of relational databases as relational entity graphs, and then reviewing public benchmark datasets that have been used to develop and evaluate recent GNN-based RDL models. We discuss key challenges including large-scale multi-table integration and the complexities of modeling temporal dynamics and heterogeneous data, while also surveying foundational neural network methods and recent architectural advances specialized for relational entity graphs. Finally, we explore opportunities to unify these distinct modeling challenges, highlighting how RDL converges multiple sub-fields in graph machine learning towards the design of foundation models that can transform the processing of relational data.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.16653 [pdf, ps, other]

LLMs in Coding and their Impact on the Commercial Software Engineering Landscape

Authors: Vladislav Belozerov, Peter J Barclay, Askhan Sami

Abstract: Large-language-model coding tools are now mainstream in software engineering. But as these same tools move human effort up the development stack, they present fresh dangers: 10% of real prompts leak private data, 42% of generated snippets hide security flaws, and the models can even "agree" with wrong ideas, a trait called sycophancy. We argue that firms must tag and review every AI-generated line of code, keep prompts and outputs inside private or on-premises deployments, obey emerging safety regulations, and add tests that catch sycophantic answers, so that they can gain speed without losing security and accuracy.

Submitted 19 June, 2025; originally announced June 2025.

Showing 1–50 of 109,571 results for author: P