Over-squashing in Spatiotemporal Graph Neural Networks

Authors: Ivan Marisca, Jacob Bamberger, Cesare Alippi, Michael M. Bronstein

Abstract: Graph Neural Networks (GNNs) have achieved remarkable success across various domains. However, recent theoretical advances have identified fundamental limitations in their information propagation capabilities, such as over-squashing, where distant nodes fail to effectively exchange information. While extensively studied in static contexts, this issue remains unexplored in Spatiotemporal GNNs (STGNNs), which process sequences associated with graph nodes. Nonetheless, the temporal dimension amplifies this challenge by increasing the information that must be propagated. In this work, we formalize the spatiotemporal over-squashing problem and demonstrate its distinct characteristics compared to the static case. Our analysis reveals that, counterintuitively, convolutional STGNNs favor information propagation from points temporally distant rather than close in time. Moreover, we prove that architectures that follow either time-and-space or time-then-space processing paradigms are equally affected by this phenomenon, providing theoretical justification for computationally efficient implementations. We validate our findings on synthetic and real-world datasets, providing deeper insights into their operational dynamics and principled guidance for more effective designs.

Submitted 18 June, 2025; originally announced June 2025.

arXiv:2506.15424 [pdf, ps, other]

doi 10.1007/978-3-319-89884-1_4

PSM: Policy Synchronised Deterministic Memory

Authors: Michael Mendler, Marc Pouzet

Abstract: Concurrency and determinacy do not go well with each other when resources must be shared. Haskell provides parallel programming abstractions such as IVar and LVar in the Par monad and concurrent abstractions such as MVar and TVar in the IO and STM monads, respectively. The former are determinate but have no destructive updates and the latter have destructive updates but do not guarantee determinacy. Programming patterns that are both concurrent and determinate, such as those provided by Kahn or Berry, require memory abstractions at a higher level than is currently available. In this paper we describe a new type context PSM for policy synchronised memory in Haskell. Like STM and IO, the computations in PSM can access persistent state and, as a side-effect, update the memory in imperative style. Like the Par and IO monads, PSM supports concurrent threads and shared state. However, in contrast to IO, our PSM contexts are race-free since concurrent accesses are policy coordinated, which guarantees determinacy. Well-typed transactions in the PSM context can accommodate abstract data structures that are imperative, concurrently shareable and still behave deterministically, by construction.

Submitted 18 June, 2025; originally announced June 2025.

Comments: This report summarises work on coding the theory of policy-synchronised memory (see https://rdcu.be/erBwl) in Haskell. This was developed for a graduate level course on Functional Reactive Programming taught at Bamberg University by the first author during 2020-2023. An early version of the PSM library had been presented at the SYNCHRON Workshop (Aussois, France), November 2019

arXiv:2506.15405 [pdf, ps, other]

Simulation of parametrized cardiac electrophysiology in three dimensions using physics-informed neural networks

Authors: Roshan Antony Gomez, Julien Stöcker, Barış Cansız, Michael Kaliske

Abstract: Physics-informed neural networks (PINNs) are extensively used to represent various physical systems across multiple scientific domains. The same can be said for cardiac electrophysiology, wherein fully-connected neural networks (FCNNs) have been employed to predict the evolution of an action potential in a 2D space following the two-parameter phenomenological Aliev-Panfilov (AP) model. In this paper, the training behaviour of PINNs is investigated to determine optimal hyperparameters to predict the electrophysiological activity of the myocardium in 3D according to the AP model, with the inclusion of boundary and material parameters. An FCNN architecture is employed with the governing partial differential equations in their strong form, which are scaled consistently with normalization of network inputs. The finite element (FE) method is used to generate training data for the network. Numerical examples with varying spatial dimensions and parameterizations are generated using the trained models. The network-predicted fields for both the action potential and the recovery variable are compared with the respective FE simulations. Network losses are weighted with individual scalar values. Their effect on training and prediction is studied to arrive at a method of controlling losses during training.

Submitted 18 June, 2025; originally announced June 2025.

arXiv:2506.15383 [pdf, ps, other]

Global Ground Metric Learning with Applications to scRNA data

Authors: Damin Kühn, Michael T. Schaub

Abstract: Optimal transport provides a robust framework for comparing probability distributions. Its effectiveness is significantly influenced by the choice of the underlying ground metric. Traditionally, the ground metric has either been (i) predefined, e.g., as the Euclidean distance, or (ii) learned in a supervised way, by utilizing labeled data to learn a suitable ground metric for enhanced task-specific performance. Yet, predefined metrics typically cannot account for the inherent structure and varying importance of different features in the data, and existing supervised approaches to ground metric learning often do not generalize across multiple classes or are restricted to distributions with shared supports. To address these limitations, we propose a novel approach for learning metrics for arbitrary distributions over a shared metric space. Our method provides a distance between individual points like a global metric, but requires only class labels on a distribution-level for training. The learned global ground metric enables more accurate optimal transport distances, leading to improved performance in embedding, clustering and classification tasks. We demonstrate the effectiveness and interpretability of our approach using patient-level scRNA-seq data spanning multiple diseases.

Submitted 18 June, 2025; originally announced June 2025.

Comments: This method is provided as a Python package on PyPI, see https://github.com/DaminK/ggml-ot

Journal ref: Proceedings of The 28th International Conference on Artificial Intelligence and Statistics (AISTATS), 2025, PMLR 258:3295-3303

arXiv:2506.15316 [pdf]

J3DAI: A tiny DNN-Based Edge AI Accelerator for 3D-Stacked CMOS Image Sensor

Authors: Benoit Tain, Raphael Millet, Romain Lemaire, Michal Szczepanski, Laurent Alacoque, Emmanuel Pluchart, Sylvain Choisnet, Rohit Prasad, Jerome Chossat, Pascal Pierunek, Pascal Vivet, Sebastien Thuries

Abstract: This paper presents J3DAI, a tiny deep neural network-based hardware accelerator for a 3-layer 3D-stacked CMOS image sensor featuring an artificial intelligence (AI) chip integrating a Deep Neural Network (DNN)-based accelerator. The DNN accelerator is designed to efficiently perform neural network tasks such as image classification and segmentation. This paper focuses on the digital system of J3DAI, highlighting its Performance-Power-Area (PPA) characteristics and showcasing advanced edge AI capabilities on a CMOS image sensor. To support hardware, we utilized the Aidge comprehensive software framework, which enables the programming of both the host processor and the DNN accelerator. Aidge supports post-training quantization, significantly reducing memory footprint and computational complexity, making it crucial for deploying models on resource-constrained hardware like J3DAI. Our experimental results demonstrate the versatility and efficiency of this innovative design in the field of edge AI, showcasing its potential to handle both simple and computationally intensive tasks. Future work will focus on further optimizing the architecture and exploring new applications to fully leverage the capabilities of J3DAI. As edge AI continues to grow in importance, innovations like J3DAI will play a crucial role in enabling real-time, low-latency, and energy-efficient AI processing at the edge.

Submitted 18 June, 2025; originally announced June 2025.

Comments: Preprint from ISLPED 2025. 979-8-3315-2710-5/25/$31.00 ©2025 IEEE

arXiv:2506.15041 [pdf, ps, other]

Identifying economic narratives in large text corpora -- An integrated approach using Large Language Models

Authors: Tobias Schmidt, Kai-Robin Lange, Matthias Reccius, Henrik Müller, Michael Roos, Carsten Jentsch

Abstract: As interest in economic narratives has grown in recent years, so has the number of pipelines dedicated to extracting such narratives from texts. Pipelines often employ a mix of state-of-the-art natural language processing techniques, such as BERT, to tackle this task. While effective on foundational linguistic operations essential for narrative extraction, such models lack the deeper semantic understanding required to distinguish extracting economic narratives from merely conducting classic tasks like Semantic Role Labeling. Instead of relying on complex model pipelines, we evaluate the benefits of Large Language Models (LLMs) by analyzing a corpus of Wall Street Journal and New York Times newspaper articles about inflation. We apply a rigorous narrative definition and compare GPT-4o outputs to gold-standard narratives produced by expert annotators. Our results suggest that GPT-4o is capable of extracting valid economic narratives in a structured format, but still falls short of expert-level performance when handling complex documents and narratives. Given the novelty of LLMs in economic research, we also provide guidance for future work in economics and the social sciences that employs LLMs to pursue similar objectives.

Submitted 17 June, 2025; originally announced June 2025.

Comments: 53 pages, 5 figures

arXiv:2506.14970 [pdf, ps, other]

NeuroMoE: A Transformer-Based Mixture-of-Experts Framework for Multi-Modal Neurological Disorder Classification

Authors: Wajih Hassan Raza, Aamir Bader Shah, Yu Wen, Yidan Shen, Juan Diego Martinez Lemus, Mya Caryn Schiess, Timothy Michael Ellmore, Renjie Hu, Xin Fu

Abstract: The integration of multi-modal Magnetic Resonance Imaging (MRI) and clinical data holds great promise for enhancing the diagnosis of neurological disorders (NDs) in real-world clinical settings. Deep Learning (DL) has recently emerged as a powerful tool for extracting meaningful patterns from medical data to aid in diagnosis. However, existing DL approaches struggle to effectively leverage multi-modal MRI and clinical data, leading to suboptimal performance. To address this challenge, we utilize a unique, proprietary multi-modal clinical dataset curated for ND research. Based on this dataset, we propose a novel transformer-based Mixture-of-Experts (MoE) framework for ND classification, leveraging multiple MRI modalities -- anatomical (aMRI), Diffusion Tensor Imaging (DTI), and functional (fMRI) -- alongside clinical assessments. Our framework employs transformer encoders to capture spatial relationships within volumetric MRI data while utilizing modality-specific experts for targeted feature extraction. A gating mechanism with adaptive fusion dynamically integrates expert outputs, ensuring optimal predictive performance. Comprehensive experiments and comparisons with multiple baselines demonstrate that our multi-modal approach significantly enhances diagnostic accuracy, particularly in distinguishing overlapping disease states. Our framework achieves a validation accuracy of 82.47%, outperforming baseline methods by over 10%, highlighting its potential to improve ND diagnosis by applying multi-modal learning to real-world clinical data.

Submitted 17 June, 2025; originally announced June 2025.

Comments: Accepted at the 47th Annual International Conference of the IEEE Engineering in Medicine and Biology Society

arXiv:2506.14923 [pdf, ps, other]

Forecasting the spatiotemporal evolution of fluid-induced microearthquakes with deep learning

Authors: Jaehong Chung, Michael Manga, Timothy Kneafsey, Tapan Mukerji, Mengsu Hu

Abstract: Microearthquakes (MEQs) generated by subsurface fluid injection record the evolving stress state and permeability of reservoirs. Forecasting their full spatiotemporal evolution is therefore critical for applications such as enhanced geothermal systems (EGS), CO$_2$ sequestration and other geo-engineering applications. We present a transformer-based deep learning model that ingests hydraulic stimulation history and prior MEQ observations to forecast four key quantities: cumulative MEQ count, cumulative logarithmic seismic moment, and the 50th- and 95th-percentile extents ($P_{50}, P_{95}$) of the MEQ cloud. Applied to the EGS Collab Experiment 1 dataset, the model achieves $R^2 >0.98$ for the 1-second forecast horizon and $R^2 >0.88$ for the 15-second forecast horizon across all targets, and supplies uncertainty estimates through a learned standard deviation term. These accurate, uncertainty-quantified forecasts enable real-time inference of fracture propagation and permeability evolution, demonstrating the strong potential of deep-learning approaches to improve seismic-risk assessment and guide mitigation strategies in future fluid-injection operations.

Submitted 17 June, 2025; originally announced June 2025.

arXiv:2506.14922 [pdf, ps, other]

FORTRESS: Frontier Risk Evaluation for National Security and Public Safety

Authors: Christina Q. Knight, Kaustubh Deshpande, Ved Sirdeshmukh, Meher Mankikar, Scale Red Team, SEAL Research Team, Julian Michael

Abstract: The rapid advancement of large language models (LLMs) introduces dual-use capabilities that could both threaten and bolster national security and public safety (NSPS). Models implement safeguards to protect against potential misuse relevant to NSPS and allow for benign users to receive helpful information. However, current benchmarks often fail to test safeguard robustness to potential NSPS risks in an objective, robust way. We introduce FORTRESS: 500 expert-crafted adversarial prompts with instance-based rubrics of 4-7 binary questions for automated evaluation across 3 domains (unclassified information only): Chemical, Biological, Radiological, Nuclear and Explosive (CBRNE), Political Violence & Terrorism, and Criminal & Financial Illicit Activities, with 10 total subcategories across these domains. Each prompt-rubric pair has a corresponding benign version to test for model over-refusals. This evaluation of frontier LLMs' safeguard robustness reveals varying trade-offs between potential risks and model usefulness: Claude-3.5-Sonnet demonstrates a low average risk score (ARS) (14.09 out of 100) but the highest over-refusal score (ORS) (21.8 out of 100), while Gemini 2.5 Pro shows low over-refusal (1.4) but a high average potential risk (66.29). Deepseek-R1 has the highest ARS at 78.05, but the lowest ORS at only 0.06. Models such as o1 display a more even trade-off between potential risks and over-refusals (with an ARS of 21.69 and ORS of 5.2).
To provide policymakers and researchers with a clear understanding of models' potential risks, we publicly release FORTRESS at https://huggingface.co/datasets/ScaleAI/fortress_public. We also maintain a private set for evaluation.

Submitted 17 June, 2025; originally announced June 2025.

Comments: 12 pages, 7 figures, submitted to NeurIPS

arXiv:2506.14861 [pdf, ps, other]

BMFM-RNA: An Open Framework for Building and Evaluating Transcriptomic Foundation Models

Authors: Bharath Dandala, Michael M. Danziger, Ella Barkan, Tanwi Biswas, Viatcheslav Gurev, Jianying Hu, Matthew Madgwick, Akira Koseki, Tal Kozlovski, Michal Rosen-Zvi, Yishai Shimoni, Ching-Huei Tsou

Abstract: Transcriptomic foundation models (TFMs) have recently emerged as powerful tools for analyzing gene expression in cells and tissues, supporting key tasks such as cell-type annotation, batch correction, and perturbation prediction. However, the diversity of model implementations and training strategies across recent TFMs, though promising, makes it challenging to isolate the contribution of individual design choices or evaluate their potential synergies. This hinders the field's ability to converge on best practices and limits the reproducibility of insights across studies. We present BMFM-RNA, an open-source, modular software package that unifies diverse TFM pretraining and fine-tuning objectives within a single framework. Leveraging this capability, we introduce a novel training objective, whole cell expression decoder (WCED), which captures global expression patterns using an autoencoder-like CLS bottleneck representation. In this paper, we describe the framework, supported input representations, and training objectives. We evaluated four model checkpoints pretrained on CELLxGENE using combinations of masked language modeling (MLM), WCED and multitask learning. Using the benchmarking capabilities of BMFM-RNA, we show that WCED-based models achieve performance that matches or exceeds state-of-the-art approaches like scGPT across more than a dozen datasets in both zero-shot and fine-tuning tasks.
BMFM-RNA, available as part of the biomed-multi-omics project (https://github.com/BiomedSciAI/biomed-multi-omic), offers a reproducible foundation for systematic benchmarking and community-driven exploration of optimal TFM training strategies, enabling the development of more effective tools to leverage the latest advances in AI for understanding cell biology.

Submitted 17 June, 2025; originally announced June 2025.

arXiv:2506.14855 [pdf, other]

Feedback-MPPI: Fast Sampling-Based MPC via Rollout Differentiation -- Adios low-level controllers

Authors: Tommaso Belvedere, Michael Ziegltrum, Giulio Turrisi, Valerio Modugno

Abstract: Model Predictive Path Integral control is a powerful sampling-based approach suitable for complex robotic tasks due to its flexibility in handling nonlinear dynamics and non-convex costs. However, its applicability in real-time, high-frequency robotic control scenarios is limited by computational demands. This paper introduces Feedback-MPPI (F-MPPI), a novel framework that augments standard MPPI by computing local linear feedback gains derived from sensitivity analysis inspired by Riccati-based feedback used in gradient-based MPC. These gains allow for rapid closed-loop corrections around the current state without requiring full re-optimization at each timestep. We demonstrate the effectiveness of F-MPPI through simulations and real-world experiments on two robotic platforms: a quadrupedal robot performing dynamic locomotion on uneven terrain and a quadrotor executing aggressive maneuvers with onboard computation. Results illustrate that incorporating local feedback significantly improves control performance and stability, enabling robust, high-frequency operation suitable for complex robotic systems.

Submitted 17 June, 2025; originally announced June 2025.

arXiv:2506.14852 [pdf, ps, other]

Cost-Efficient Serving of LLM Agents via Test-Time Plan Caching

Authors: Qizheng Zhang, Michael Wornow, Kunle Olukotun

Abstract: LLM-based agentic applications have shown increasingly remarkable capabilities in complex workflows but incur substantial costs due to extensive planning and reasoning requirements. Existing LLM caching techniques (like context caching and semantic caching), primarily designed for serving chatbots, are insufficient for agentic applications where outputs depend on external data or environmental contexts. We propose agentic plan caching, a novel approach that extracts, stores, adapts, and reuses structured plan templates from planning stages of agentic applications across semantically similar tasks to reduce the cost of serving. Unlike traditional semantic caching, our system extracts plan templates from completed agent executions at test-time, employs keyword extraction to match new requests against cached plans, and utilizes lightweight models to adapt these templates to task-specific plans with contexts. Evaluation across multiple real-world agentic applications shows that our system can reduce costs by 46.62% on average while maintaining performance, offering a more efficient solution for serving LLM-based agents that complements existing LLM serving infrastructures.

Submitted 17 June, 2025; originally announced June 2025.

Comments: 23 pages

arXiv:2506.14754 [pdf, ps, other]

Tactile Beyond Pixels: Multisensory Touch Representations for Robot Manipulation

Authors: Carolina Higuera, Akash Sharma, Taosha Fan, Chaithanya Krishna Bodduluri, Byron Boots, Michael Kaess, Mike Lambeta, Tingfan Wu, Zixi Liu, Francois Robert Hogan, Mustafa Mukadam

Abstract: We present Sparsh-X, the first multisensory touch representations across four tactile modalities: image, audio, motion, and pressure. Trained on ~1M contact-rich interactions collected with the Digit 360 sensor, Sparsh-X captures complementary touch signals at diverse temporal and spatial scales. By leveraging self-supervised learning, Sparsh-X fuses these modalities into a unified representation that captures physical properties useful for robot manipulation tasks. We study how to effectively integrate real-world touch representations for both imitation learning and tactile adaptation of sim-trained policies, showing that Sparsh-X boosts policy success rates by 63% over an end-to-end model using tactile images and improves robustness by 90% in recovering object states from touch. Finally, we benchmark Sparsh-X's ability to make inferences about physical properties, such as object-action identification, material-quantity estimation, and force estimation. Sparsh-X improves accuracy in characterizing physical properties by 48% compared to end-to-end approaches, demonstrating the advantages of multisensory pretraining for capturing features essential for dexterous manipulation.

Submitted 17 June, 2025; originally announced June 2025.

arXiv:2506.14652 [pdf, ps, other]

Rigor in AI: Doing Rigorous AI Work Requires a Broader, Responsible AI-Informed Conception of Rigor

Authors: Alexandra Olteanu, Su Lin Blodgett, Agathe Balayn, Angelina Wang, Fernando Diaz, Flavio du Pin Calmon, Margaret Mitchell, Michael Ekstrand, Reuben Binns, Solon Barocas

Abstract: In AI research and practice, rigor remains largely understood in terms of methodological rigor -- such as whether mathematical, statistical, or computational methods are correctly applied. We argue that this narrow conception of rigor has contributed to the concerns raised by the responsible AI community, including overblown claims about AI capabilities. Our position is that a broader conception of what rigorous AI research and practice should entail is needed. We believe such a conception -- in addition to a more expansive understanding of (1) methodological rigor -- should include aspects related to (2) what background knowledge informs what to work on (epistemic rigor); (3) how disciplinary, community, or personal norms, standards, or beliefs influence the work (normative rigor); (4) how clearly articulated the theoretical constructs under use are (conceptual rigor); (5) what is reported and how (reporting rigor); and (6) how well-supported the inferences from existing evidence are (interpretative rigor). In doing so, we also aim to provide useful language and a framework for much-needed dialogue about the AI community's work by researchers, policymakers, journalists, and other stakeholders.

Submitted 17 June, 2025; originally announced June 2025.

Comments: 20 pages, 1 figure, 1 table

arXiv:2506.14605 [pdf, ps, other]

Unsupervised Imaging Inverse Problems with Diffusion Distribution Matching

Authors: Giacomo Meanti, Thomas Ryckeboer, Michael Arbel, Julien Mairal

Abstract: This work addresses image restoration tasks through the lens of inverse problems using unpaired datasets. In contrast to traditional approaches -- which typically assume full knowledge of the forward model or access to paired degraded and ground-truth images -- the proposed method operates under minimal assumptions and relies only on small, unpaired datasets. This makes it particularly well-suited for real-world scenarios, where the forward model is often unknown or misspecified, and collecting paired data is costly or infeasible. The method leverages conditional flow matching to model the distribution of degraded observations, while simultaneously learning the forward model via a distribution-matching loss that arises naturally from the framework. Empirically, it outperforms both single-image blind and unsupervised approaches on deblurring and non-uniform point spread function (PSF) calibration tasks. It also matches state-of-the-art performance on blind super-resolution. We also showcase the effectiveness of our method with a proof of concept for lens calibration: a real-world application traditionally requiring time-consuming experiments and specialized equipment. In contrast, our approach achieves this with minimal data acquisition effort.

Submitted 17 June, 2025; originally announced June 2025.

Comments: Code available at https://github.com/inria-thoth/ddm4ip

arXiv:2506.14544 [pdf, ps, other]

Infinite lexicographic products of positional objectives

Authors: Antonio Casares, Pierre Ohlmann, Michał Skrzypczak, Igor Walukiewicz

Abstract: This paper contributes to the study of positional determinacy of infinite duration games played on potentially infinite graphs. Recently, [Ohlmann, TheoretiCS 2023] established that positionality of prefix-independent objectives is preserved by finite lexicographic products. We propose two different notions of infinite lexicographic products indexed by arbitrary ordinals, and extend Ohlmann's result by proving that they also preserve positionality. In the context of one-player positionality, this extends positional determinacy results of [Grädel and Walukiewicz, Logical Methods in Computer Science 2006] to edge-labelled games and arbitrarily many priorities for both Max-Parity and Min-Parity. Moreover, we show that the Max-Parity objectives over countable ordinals are complete for the infinite levels of the difference hierarchy over $\Sigma^0_2$ and that Min-Parity is complete for the class $\Sigma^0_3$. We obtain therefore positional languages that are complete for all those levels, as well as new insights about closure under unions and neutral letters.

Submitted 17 June, 2025; originally announced June 2025.

arXiv:2506.14467 [pdf, ps, other]

Automatic Cannulation of Femoral Vessels in a Porcine Shock Model

Authors: Nico Zevallos, Cecilia G. Morales, Andrew Orekhov, Tejas Rane, Hernando Gomez, Francis X. Guyette, Michael R. Pinsky, John Galeotti, Artur Dubrawski, Howie Choset

Abstract: Rapid and reliable vascular access is critical in trauma and critical care. Central vascular catheterization enables high-volume resuscitation, hemodynamic monitoring, and advanced interventions like ECMO and REBOA. While peripheral access is common, central access is often necessary but requires specialized ultrasound-guided skills, posing challenges in prehospital settings. The complexity arises from deep target vessels and the precision needed for needle placement. Traditional techniques, like the Seldinger method, demand expertise to avoid complications. Despite its importance, ultrasound-guided central access is underutilized due to limited field expertise. While autonomous needle insertion has been explored for peripheral vessels, only semi-autonomous methods exist for femoral access. This work advances toward full automation, integrating robotic ultrasound for minimally invasive emergency procedures. Our key contribution is the successful femoral vein and artery cannulation in a porcine hemorrhagic shock model.

Submitted 17 June, 2025; originally announced June 2025.

Comments: 2 pages, 2 figures, conference

Journal ref: Hamlyn Symposium on Medical Robotics 2025

arXiv:2506.14432 [pdf, ps, other]

A large-scale heterogeneous 3D magnetic resonance brain imaging dataset for self-supervised learning

Authors: Asbjørn Munk, Stefano Cerri, Jakob Ambsdorf, Julia Machnio, Sebastian Nørgaard Llambias, Vardan Nersesjan, Christian Hedeager Krag, Peirong Liu, Pablo Rocamora García, Mostafa Mehdipour Ghazi, Mikael Boesen, Michael Eriksen Benros, Juan Eugenio Iglesias, Mads Nielsen

Abstract: We present FOMO60K, a large-scale, heterogeneous dataset of 60,529 brain Magnetic Resonance Imaging (MRI) scans from 13,900 sessions and 11,187 subjects, aggregated from 16 publicly available sources. The dataset includes both clinical- and research-grade images, multiple MRI sequences, and a wide range of anatomical and pathological variability, including scans with large brain anomalies. Minimal preprocessing was applied to preserve the original image characteristics while reducing barriers to entry for new users. Accompanying code for self-supervised pretraining and finetuning is provided. FOMO60K is intended to support the development and benchmarking of self-supervised learning methods in medical imaging at scale.

Submitted 17 June, 2025; originally announced June 2025.

arXiv:2506.14400 [pdf, ps, other]

One Size Fits None: Rethinking Fairness in Medical AI

Authors: Roland Roller, Michael Hahn, Ajay Madhavan Ravichandran, Bilgin Osmanodja, Florian Oetke, Zeineb Sassi, Aljoscha Burchardt, Klaus Netter, Klemens Budde, Anne Herrmann, Tobias Strapatsas, Peter Dabrock, Sebastian Möller

Abstract: Machine learning (ML) models are increasingly used to support clinical decision-making. However, real-world medical datasets are often noisy, incomplete, and imbalanced, leading to performance disparities across patient subgroups. These differences raise fairness concerns, particularly when they reinforce existing disadvantages for marginalized groups. In this work, we analyze several medical prediction tasks and demonstrate how model performance varies with patient characteristics. While ML models may demonstrate good overall performance, we argue that subgroup-level evaluation is essential before integrating them into clinical workflows. By conducting a performance analysis at the subgroup level, differences can be clearly identified, allowing, on the one hand, for performance disparities to be considered in clinical practice, and on the other hand, for these insights to inform the responsible development of more effective models. Thereby, our work contributes to a practical discussion around the subgroup-sensitive development and deployment of medical ML models and the interconnectedness of fairness and transparency.

Submitted 17 June, 2025; originally announced June 2025.

Comments: Accepted at the 6th Workshop on Gender Bias in Natural Language Processing at ACL 2025

arXiv:2506.14386 [pdf, ps, other]

ResNets Are Deeper Than You Think

Authors: Christian H. X. Ali Mehmeti-Göpel, Michael Wand

Abstract: Residual connections remain ubiquitous in modern neural network architectures nearly a decade after their introduction. Their widespread adoption is often credited to their dramatically improved trainability: residual networks train faster, more stably, and achieve higher accuracy than their feedforward counterparts. While numerous techniques, ranging from improved initialization to advanced learning rate schedules, have been proposed to close the performance gap between residual and feedforward networks, this gap has persisted. In this work, we propose an alternative explanation: residual networks do not merely reparameterize feedforward networks, but instead inhabit a different function space. We design a controlled post-training comparison to isolate generalization performance from trainability; we find that variable-depth architectures, similar to ResNets, consistently outperform fixed-depth networks, even when optimization is unlikely to make a difference. These results suggest that residual connections confer performance advantages beyond optimization, pointing instead to a deeper inductive bias aligned with the structure of natural data.

Submitted 17 June, 2025; originally announced June 2025.

Comments: NeurIPS 2025 Submission

arXiv:2506.14374 [pdf, ps, other]

Excessive Reasoning Attack on Reasoning LLMs

Authors: Wai Man Si, Mingjie Li, Michael Backes, Yang Zhang

Abstract: Recent reasoning large language models (LLMs), such as OpenAI o1 and DeepSeek-R1, exhibit strong performance on complex tasks through test-time inference scaling. However, prior studies have shown that these models often incur significant computational costs due to excessive reasoning, such as frequent switching between reasoning trajectories (e.g., underthinking) or redundant reasoning on simple questions (e.g., overthinking). In this work, we expose a novel threat: adversarial inputs can be crafted to exploit excessive reasoning behaviors and substantially increase computational overhead without compromising model utility. Therefore, we propose a novel loss framework consisting of three components: (1) Priority Cross-Entropy Loss, a modification of the standard cross-entropy objective that emphasizes key tokens by leveraging the autoregressive nature of LMs; (2) Excessive Reasoning Loss, which encourages the model to initiate additional reasoning paths during inference; and (3) Delayed Termination Loss, which is designed to extend the reasoning process and defer the generation of final outputs. We optimize and evaluate our attack for the GSM8K and ORCA datasets on DeepSeek-R1-Distill-LLaMA and DeepSeek-R1-Distill-Qwen. Empirical results demonstrate a 3x to 9x increase in reasoning length with comparable utility performance. Furthermore, our crafted adversarial inputs exhibit transferability, inducing computational overhead in o3-mini, o1-mini, DeepSeek-R1, and QWQ models.

Submitted 17 June, 2025; originally announced June 2025.

arXiv:2506.14291 [pdf, ps, other]

Equivariance Everywhere All At Once: A Recipe for Graph Foundation Models

Authors: Ben Finkelshtein, İsmail İlkan Ceylan, Michael Bronstein, Ron Levie

Abstract: Graph machine learning architectures are typically tailored to specific tasks on specific datasets, which hinders their broader applicability. This has led to a new quest in graph machine learning: how to build graph foundation models capable of generalizing across arbitrary graphs and features? In this work, we present a recipe for designing graph foundation models for node-level tasks from first principles. The key ingredient underpinning our study is a systematic investigation of the symmetries that a graph foundation model must respect. In a nutshell, we argue that label permutation-equivariance alongside feature permutation-invariance are necessary in addition to the common node permutation-equivariance on each local neighborhood of the graph. To this end, we first characterize the space of linear transformations that are equivariant to permutations of nodes and labels, and invariant to permutations of features. We then prove that the resulting network is a universal approximator on multisets that respect the aforementioned symmetries. Our recipe uses such layers on the multiset of features induced by the local neighborhood of the graph to obtain a class of graph foundation models for node property prediction. We validate our approach through extensive experiments on 29 real-world node classification datasets, demonstrating both strong zero-shot empirical performance and consistent improvement as the number of training graphs increases.

Submitted 17 June, 2025; originally announced June 2025.

arXiv:2506.14223 [pdf, ps, other]

Fretting-Transformer: Encoder-Decoder Model for MIDI to Tablature Transcription

Authors: Anna Hamberger, Sebastian Murgul, Jochen Schmidt, Michael Heizmann

Abstract: Music transcription plays a pivotal role in Music Information Retrieval (MIR), particularly for stringed instruments like the guitar, where symbolic music notations such as MIDI lack crucial playability information. This contribution introduces the Fretting-Transformer, an encoder-decoder model that utilizes a T5 transformer architecture to automate the transcription of MIDI sequences into guitar tablature. By framing the task as a symbolic translation problem, the model addresses key challenges, including string-fret ambiguity and physical playability. The proposed system leverages diverse datasets, including DadaGP, GuitarToday, and Leduc, with novel data pre-processing and tokenization strategies. We have developed metrics for tablature accuracy and playability to quantitatively evaluate the performance. The experimental results demonstrate that the Fretting-Transformer surpasses baseline methods like A* and commercial applications like Guitar Pro. The integration of context-sensitive processing and tuning/capo conditioning further enhances the model's performance, laying a robust foundation for future developments in automated guitar transcription.

Submitted 17 June, 2025; originally announced June 2025.

Comments: Accepted to the 50th International Computer Music Conference (ICMC), 2025

arXiv:2506.14111 [pdf, ps, other]

Essential-Web v1.0: 24T tokens of organized web data

Authors: Essential AI: Andrew Hojel, Michael Pust, Tim Romanski, Yash Vanjani, Ritvik Kapila, Mohit Parmar, Adarsh Chaluvaraju, Alok Tripathy, Anil Thomas, Ashish Tanwer, Darsh J Shah, Ishaan Shah, Karl Stratos, Khoi Nguyen, Kurt Smith, Michael Callahan, Peter Rushton, Philip Monk, Platon Mazarakis, Saad Jamal, Saurabh Srivastava, Somanshu Singla, Ashish Vaswani

Abstract: Data plays the most prominent role in how language models acquire skills and knowledge. The lack of massive, well-organized pre-training datasets results in costly and inaccessible data pipelines. We present Essential-Web v1.0, a 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality. Taxonomy labels are produced by EAI-Distill-0.5b, a fine-tuned 0.5b-parameter model that achieves an annotator agreement within 3% of Qwen2.5-32B-Instruct. With nothing more than SQL-style filters, we obtain competitive web-curated datasets in math (-8.0% relative to SOTA), web code (+14.3%), STEM (+24.5%) and medical (+8.6%). Essential-Web v1.0 is available on HuggingFace: https://huggingface.co/datasets/EssentialAI/essential-web-v1.0

Submitted 16 June, 2025; originally announced June 2025.

arXiv:2506.13998 [pdf, ps, other]

DAGs for the Masses

Authors: Michael Anoprenko, Andrei Tonkikh, Alexander Spiegelman, Petr Kuznetsov

Abstract: A recent approach to building consensus protocols on top of Directed Acyclic Graphs (DAGs) shows much promise due to its simplicity and stable throughput. However, as each node in the DAG typically includes a linear number of references to the nodes in the previous round, prior DAG protocols only scale up to a certain point when the overhead of maintaining the graph becomes the bottleneck. To enable large-scale deployments of DAG-based protocols, we propose a sparse DAG architecture, where each node includes only a constant number of references to random nodes in the previous round. We present a sparse version of Bullshark -- one of the most prominent DAG-based consensus protocols -- and demonstrate its improved scalability. Remarkably, unlike other protocols that use random sampling to reduce communication complexity, we manage to avoid sacrificing resilience: the protocol can tolerate up to $f<n/3$ Byzantine faults (where $n$ is the number of participants), same as its less scalable deterministic counterpart. The proposed "sparse" methodology can be applied to any protocol that maintains disseminated system updates and causal relations between them in a graph-like structure. Our simulations show that the considerable reduction of transmitted metadata in sparse DAGs results in more efficient network utilization and better scalability.

Submitted 18 June, 2025; v1 submitted 16 June, 2025; originally announced June 2025.

arXiv:2506.13996 [pdf, ps, other]

Arctic Long Sequence Training: Scalable And Efficient Training For Multi-Million Token Sequences

Authors: Stas Bekman, Samyam Rajbhandari, Michael Wyatt, Jeff Rasley, Tunji Ruwase, Zhewei Yao, Aurick Qiao, Yuxiong He

Abstract: Long sequences are critical for applications like RAG, long document summarization, multi-modality, etc., and modern LLMs, like Llama 4 Scout, support max sequence length of up to 10 million tokens. However, outside of enterprise labs, long sequence training is challenging for the AI community with limited system support in the open-source space. Out-of-box, even on a modern NVIDIA H100 80GB GPU cluster, training Llama 8B model with sequence over 32K runs out of memory on a basic Hugging Face (HF) model due to two reasons: i) LLM training workloads are not optimized to fully leverage a single GPU memory, ii) existing solutions for leveraging multiple GPU memory are not easily available to HF models, making long sequence training inaccessible. We address this with Arctic Long Sequence Training (ALST). It offers a combination of attention-agnostic single GPU and multi-GPU memory optimizations, that enables it to support out-of-box training of multi-million sequence length for a wide variety of HF models. ALST supports training Meta's Llama 8B model with 500K sequence length on a single H100 GPU, 3.7M on a single 8xH100 GPU node, and over 15M on a 4 node cluster, an increase of over 400x compared to the 32K baseline for the latter. ALST is fully compatible with HF models and open-sourced via Deepspeed https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-pallellism/ and Arctic Training https://github.com/snowflakedb/ArcticTraining/blob/main/projects/sequence-parallelism/README.md.

Submitted 16 June, 2025; originally announced June 2025.

Comments: 19 pages, 13 figures

arXiv:2506.13974 [pdf, ps, other]

Constant Stepsize Local GD for Logistic Regression: Acceleration by Instability

Authors: Michael Crawshaw, Blake Woodworth, Mingrui Liu

Abstract: Existing analysis of Local (Stochastic) Gradient Descent for heterogeneous objectives requires stepsizes $η\leq 1/K$ where $K$ is the communication interval, which ensures monotonic decrease of the objective. In contrast, we analyze Local Gradient Descent for logistic regression with separable, heterogeneous data using any stepsize $η> 0$. With $R$ communication rounds and $M$ clients, we show convergence at a rate $\mathcal{O}(1/ηK R)$ after an initial unstable phase lasting for $\widetilde{\mathcal{O}}(ηK M)$ rounds. This improves upon the existing $\mathcal{O}(1/R)$ rate for general smooth, convex objectives. Our analysis parallels the single machine analysis of~\cite{wu2024large} in which instability is caused by extremely large stepsizes, but in our setting another source of instability is large local updates with heterogeneous objectives.

Submitted 16 June, 2025; originally announced June 2025.

Comments: ICML 2025

arXiv:2506.13905 [pdf, ps, other]

Spec2RTL-Agent: Automated Hardware Code Generation from Complex Specifications Using LLM Agent Systems

Authors: Zhongzhi Yu, Mingjie Liu, Michael Zimmer, Yingyan Celine Lin, Yong Liu, Haoxing Ren

Abstract: Despite recent progress in generating hardware RTL code with LLMs, existing solutions still suffer from a substantial gap between practical application scenarios and the requirements of real-world RTL code development. Prior approaches either focus on overly simplified hardware descriptions or depend on extensive human guidance to process complex specifications, limiting their scalability and automation potential. In this paper, we address this gap by proposing an LLM agent system, termed Spec2RTL-Agent, designed to directly process complex specification documentation and generate corresponding RTL code implementations, advancing LLM-based RTL code generation toward more realistic application settings. To achieve this goal, Spec2RTL-Agent introduces a novel multi-agent collaboration framework that integrates three key enablers: (1) a reasoning and understanding module that translates specifications into structured, step-by-step implementation plans; (2) a progressive coding and prompt optimization module that iteratively refines the code across multiple representations to enhance correctness and synthesisability for RTL conversion; and (3) an adaptive reflection module that identifies and traces the source of errors during generation, ensuring a more robust code generation flow. Instead of directly generating RTL from natural language, our system strategically generates synthesizable C++ code, which is then optimized for HLS. This agent-driven refinement ensures greater correctness and compatibility compared to naive direct RTL generation approaches.
We evaluate Spec2RTL-Agent on three specification documents, showing it generates accurate RTL code with up to 75% fewer human interventions than existing methods. This highlights its role as the first fully automated multi-agent system for RTL generation from unstructured specs, reducing reliance on human effort in hardware design.

Submitted 16 June, 2025; originally announced June 2025.

arXiv:2506.13820 [pdf, ps, other]

Structured Program Synthesis using LLMs: Results and Insights from the IPARC Challenge

Authors: Shraddha Surana, Ashwin Srinivasan, Michael Bain

Abstract: The IPARC Challenge, inspired by ARC, provides controlled program synthesis tasks over synthetic images to evaluate automatic program construction, focusing on sequence, selection, and iteration. This set of 600 tasks has resisted automated solutions. This paper presents a structured inductive programming approach with LLMs that successfully solves tasks across all IPARC categories. The controlled nature of IPARC reveals insights into LLM-based code generation, including the importance of prior structuring, LLMs' ability to aid structuring (requiring human refinement), the need to freeze correct code, the efficiency of code reuse, and how LLM-generated code can spark human creativity. These findings suggest valuable mechanisms for human-LLM collaboration in tackling complex program synthesis.

Submitted 15 June, 2025; originally announced June 2025.

arXiv:2506.13776 [pdf, ps, other]

Recommendations and Reporting Checklist for Rigorous & Transparent Human Baselines in Model Evaluations

Authors: Kevin L. Wei, Patricia Paskov, Sunishchal Dev, Michael J. Byun, Anka Reuel, Xavier Roberts-Gaal, Rachel Calcott, Evie Coxon, Chinmay Deshpande

Abstract: In this position paper, we argue that human baselines in foundation model evaluations must be more rigorous and more transparent to enable meaningful comparisons of human vs. AI performance, and we provide recommendations and a reporting checklist towards this end. Human performance baselines are vital for the machine learning community, downstream users, and policymakers to interpret AI evaluations. Models are often claimed to achieve "super-human" performance, but existing baselining methods are neither sufficiently rigorous nor sufficiently well-documented to robustly measure and assess performance differences. Based on a meta-review of the measurement theory and AI evaluation literatures, we derive a framework with recommendations for designing, executing, and reporting human baselines. We synthesize our recommendations into a checklist that we use to systematically review 115 human baselines (studies) in foundation model evaluations and thus identify shortcomings in existing baselining methods; our checklist can also assist researchers in conducting human baselines and reporting results. We hope our work can advance more rigorous AI evaluation practices that can better serve both the research community and policymakers. Data is available at: https://github.com/kevinlwei/human-baselines

Submitted 9 June, 2025; originally announced June 2025.

Comments: A version of this paper has been accepted to ICML 2025 as a position paper (spotlight), with the title: "Position: Human Baselines in Model Evaluations Need Rigor and Transparency (With Recommendations & Reporting Checklist)."

arXiv:2506.13650 [pdf, ps, other]

Deceptive Path Planning: A Bayesian Game Approach

Authors: Violetta Rostobaya, James Berneburg, Yue Guan, Michael Dorothy, Daigo Shishika

Abstract: This paper investigates how an autonomous agent can transmit information through its motion in an adversarial setting. We consider scenarios where an agent must reach its goal while deceiving an intelligent observer about its destination. We model this interaction as a dynamic Bayesian game between a mobile Attacker with a privately known goal and a Defender who infers the Attacker's intent to allocate defensive resources effectively. We use Perfect Bayesian Nash Equilibrium (PBNE) as our solution concept and propose a computationally efficient approach to find it. In the resulting equilibrium, the Defender employs a simple Markovian strategy, while the Attacker strategically balances deception and goal efficiency by stochastically mixing shortest and non-shortest paths to manipulate the Defender's beliefs. Numerical experiments demonstrate the advantages of our PBNE-based strategies over existing methods based on one-sided optimization.

Submitted 16 June, 2025; originally announced June 2025.

Comments: 8 pages, 9 figures. This work has been submitted to the IEEE for possible publication

arXiv:2506.13494 [pdf, ps, other]

Watermarking LLM-Generated Datasets in Downstream Tasks

Authors: Yugeng Liu, Tianshuo Cong, Michael Backes, Zheng Li, Yang Zhang

Abstract: Large Language Models (LLMs) have experienced rapid advancements, with applications spanning a wide range of fields, including sentiment classification, review generation, and question answering. Due to their efficiency and versatility, researchers and companies increasingly employ LLM-generated data to train their models. However, the inability to track content produced by LLMs poses a significant challenge, potentially leading to copyright infringement for the LLM owners. In this paper, we propose a method for injecting watermarks into LLM-generated datasets, enabling the tracking of downstream tasks to detect whether these datasets were produced using the original LLM. These downstream tasks can be divided into two categories. The first involves using the generated datasets at the input level, commonly for training classification tasks. The other is the output level, where model trainers use LLM-generated content as output for downstream tasks, such as question-answering tasks. We design a comprehensive set of experiments to evaluate both watermark methods. Our results indicate the high effectiveness of our watermark approach. Additionally, regarding model utility, we find that classifiers trained on the generated datasets achieve a test accuracy exceeding 0.900 in many cases, suggesting that the utility of such models remains robust. For the output-level watermark, we observe that the quality of the generated text is comparable to that produced using real-world datasets.
Through our research, we aim to advance the protection of LLM copyrights, taking a significant step forward in safeguarding intellectual property in this domain.

Submitted 16 June, 2025; originally announced June 2025.

arXiv:2506.13484 [pdf, ps, other]

Deep Diffusion Models and Unsupervised Hyperspectral Unmixing for Realistic Abundance Map Synthesis

Authors: Martina Pastorino, Michael Alibani, Nicola Acito, Gabriele Moser

Abstract: This paper presents a novel methodology for generating realistic abundance maps from hyperspectral imagery using an unsupervised, deep-learning-driven approach. Our framework integrates blind linear hyperspectral unmixing with state-of-the-art diffusion models to enhance the realism and diversity of synthetic abundance maps. First, we apply blind unmixing to extract endmembers and abundance maps directly from raw hyperspectral data. These abundance maps then serve as inputs to a diffusion model, which acts as a generative engine to synthesize highly realistic spatial distributions. Diffusion models have recently revolutionized image synthesis by offering superior performance, flexibility, and stability, making them well-suited for high-dimensional spectral data. By leveraging this combination of physically interpretable unmixing and deep generative modeling, our approach enables the simulation of hyperspectral sensor outputs under diverse imaging conditions--critical for data augmentation, algorithm benchmarking, and model evaluation in hyperspectral analysis. Notably, our method is entirely unsupervised, ensuring adaptability to different datasets without the need for labeled training data. We validate our approach using real hyperspectral imagery from the PRISMA space mission for Earth observation, demonstrating its effectiveness in producing realistic synthetic abundance maps that capture the spatial and spectral characteristics of natural scenes.

Submitted 16 June, 2025; originally announced June 2025.

Comments: CVPRw2025

arXiv:2506.13479 [pdf, ps, other]

Position: Pause Recycling LoRAs and Prioritize Mechanisms to Uncover Limits and Effectiveness

Authors: Mei-Yen Chen, Thi Thu Uyen Hoang, Michael Hahn, M. Saquib Sarfraz

Abstract: Merging or routing low-rank adapters (LoRAs) has emerged as a popular solution for enhancing large language models, particularly when data access is restricted by regulatory or domain-specific constraints. This position paper argues that the research community should shift its focus from developing new merging or routing algorithms to understanding the conditions under which reusing LoRAs is truly effective. Through theoretical analysis and synthetic two-hop reasoning and math word-problem tasks, we examine whether reusing LoRAs enables genuine compositional generalization or merely reflects shallow pattern matching. Evaluating two data-agnostic methods--parameter averaging and dynamic adapter selection--we found that reusing LoRAs often fails to logically integrate knowledge across disjoint fine-tuning datasets, especially when such knowledge is underrepresented during pretraining. Our empirical results, supported by theoretical insights into LoRA's limited expressiveness, highlight the preconditions and constraints of reusing them for unseen tasks and cast doubt on its feasibility as a truly data-free approach. We advocate for pausing the pursuit of novel methods for recycling LoRAs and emphasize the need for rigorous mechanisms to guide future academic research in adapter-based model merging and practical system designs for practitioners.

Submitted 16 June, 2025; originally announced June 2025.

arXiv:2506.13189 [pdf, ps, other]

Multimodal "Puppeteer": An Exploration of Robot Teleoperation Via Virtual Counterpart with LLM-Driven Voice and Gesture Interaction in Augmented Reality

Authors: Yuchong Zhang, Bastian Orthmann, Shichen Ji, Michael Welle, Jonne Van Haastregt, Danica Kragic

Abstract: The integration of robotics and augmented reality (AR) holds transformative potential for advancing human-robot interaction (HRI), offering enhancements in usability, intuitiveness, accessibility, and collaborative task performance. This paper introduces and evaluates a novel multimodal AR-based robot puppeteer framework that enables intuitive teleoperation via virtual counterpart through large language model (LLM)-driven voice commands and hand gesture interactions. Utilizing the Meta Quest 3, users interact with a virtual counterpart robot in real-time, effectively "puppeteering" its physical counterpart within an AR environment. We conducted a within-subject user study with 42 participants performing robotic cube pick-and-place with pattern matching tasks under two conditions: gesture-only interaction and combined voice-and-gesture interaction. Both objective performance metrics and subjective user experience (UX) measures were assessed, including an extended comparative analysis between roboticists and non-roboticists. The results provide key insights into how multimodal input influences contextual task efficiency, usability, and user satisfaction in AR-based HRI. Our findings offer practical design implications for designing effective AR-enhanced HRI systems.

Submitted 16 June, 2025; originally announced June 2025.

Comments: This work has been submitted to the IEEE TVCG for possible publication

arXiv:2506.13139 [pdf, ps, other]

Random Matrix Theory for Deep Learning: Beyond Eigenvalues of Linear Models

Authors: Zhenyu Liao, Michael W. Mahoney

Abstract: Modern Machine Learning (ML) and Deep Neural Networks (DNNs) often operate on high-dimensional data and rely on overparameterized models, where classical low-dimensional intuitions break down. In particular, the proportional regime where the data dimension, sample size, and number of model parameters are all large and comparable, gives rise to novel and sometimes counterintuitive behaviors. This paper extends traditional Random Matrix Theory (RMT) beyond eigenvalue-based analysis of linear models to address the challenges posed by nonlinear ML models such as DNNs in this regime. We introduce the concept of High-dimensional Equivalent, which unifies and generalizes both Deterministic Equivalent and Linear Equivalent, to systematically address three technical challenges: high dimensionality, nonlinearity, and the need to analyze generic eigenspectral functionals. Leveraging this framework, we provide precise characterizations of the training and generalization performance of linear models, nonlinear shallow networks, and deep networks. Our results capture rich phenomena, including scaling laws, double descent, and nonlinear learning dynamics, offering a unified perspective on the theoretical understanding of deep learning in high dimensions.

Submitted 16 June, 2025; originally announced June 2025.

Comments: 30 pages, 6 figures

arXiv:2506.13134 [pdf, ps, other]

Quantum AGI: Ontological Foundations

Authors: Elija Perrier, Michael Timothy Bennett

Abstract: We examine the implications of quantum foundations for AGI, focusing on how seminal results such as Bell's theorems (non-locality), the Kochen-Specker theorem (contextuality) and no-cloning theorem problematise practical implementation of AGI in quantum settings. We introduce a novel information-theoretic taxonomy distinguishing between classical AGI and quantum AGI and show how quantum mechanics affects fundamental features of agency. We show how quantum ontology may change AGI capabilities, both via affording computational advantages and via imposing novel constraints.

Submitted 16 June, 2025; originally announced June 2025.

Comments: Accepted into AGI-25. Technical appendices available via link

arXiv:2506.13059 [pdf, ps, other]

Multipole Attention for Efficient Long Context Reasoning

Authors: Coleman Hooper, Sebastian Zhao, Luca Manolache, Sehoon Kim, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami

Abstract: Large Reasoning Models (LRMs) have shown promising accuracy improvements on complex problem-solving tasks. While these models have attained high accuracy by leveraging additional computation at test time, they need to generate long chain-of-thought reasoning in order to think before answering, which requires generating thousands of tokens. While sparse attention methods can help reduce the KV cache pressure induced by this long autoregressive reasoning, these methods can introduce errors which disrupt the reasoning process. Additionally, prior methods often pre-process the input to make it easier to identify the important prompt tokens when computing attention during generation, and this pre-processing is challenging to perform online for newly generated reasoning tokens. Our work addresses these challenges by introducing Multipole Attention, which accelerates autoregressive reasoning by only computing exact attention for the most important tokens, while maintaining approximate representations for the remaining tokens. Our method first performs clustering to group together semantically similar key vectors, and then uses the cluster centroids both to identify important key vectors and to approximate the remaining key vectors in order to retain high accuracy. We design a fast cluster update process to quickly re-cluster the input and previously generated tokens, thereby allowing for accelerating attention to the previous output tokens.
We evaluate our method using emerging LRMs such as Qwen-8B, demonstrating that our approach can maintain accuracy on complex reasoning tasks even with aggressive attention sparsity settings. We also provide kernel implementations to demonstrate the practical efficiency gains from our method, achieving up to 4.5$\times$ speedup for attention in long-context reasoning applications. Our code is available at https://github.com/SqueezeAILab/MultipoleAttention.

Submitted 15 June, 2025; originally announced June 2025.

Comments: 15 pages

arXiv:2506.13040 [pdf, ps, other]

MAMMA: Markerless & Automatic Multi-Person Motion Action Capture

Authors: Hanz Cuevas-Velasquez, Anastasios Yiannakidis, Soyong Shin, Giorgio Becherini, Markus Höschle, Joachim Tesch, Taylor Obersat, Tsvetelina Alexiadis, Michael Black

Abstract: We present MAMMA, a markerless motion-capture pipeline that accurately recovers SMPL-X parameters from multi-view video of two-person interaction sequences. Traditional motion-capture systems rely on physical markers. Although they offer high accuracy, their requirements of specialized hardware, manual marker placement, and extensive post-processing make them costly and time-consuming. Recent learning-based methods attempt to overcome these limitations, but most are designed for single-person capture, rely on sparse keypoints, or struggle with occlusions and physical interactions. In this work, we introduce a method that predicts dense 2D surface landmarks conditioned on segmentation masks, enabling person-specific correspondence estimation even under heavy occlusion. We employ a novel architecture that exploits learnable queries for each landmark. We demonstrate that our approach can handle complex person--person interaction and offers greater accuracy than existing methods. To train our network, we construct a large, synthetic multi-view dataset combining human motions from diverse sources, including extreme poses, hand motions, and close interactions. Our dataset yields high-variability synthetic sequences with rich body contact and occlusion, and includes SMPL-X ground-truth annotations with dense 2D landmarks. The result is a system capable of capturing human motion without the need for markers. Our approach offers competitive reconstruction quality compared to commercial marker-based motion-capture solutions, without the extensive manual cleanup.
Finally, we address the absence of common benchmarks for dense-landmark prediction and markerless motion capture by introducing two evaluation settings built from real multi-view sequences. We will release our dataset, benchmark, method, training code, and pre-trained model weights for research purposes.

Submitted 15 June, 2025; originally announced June 2025.

arXiv:2506.12932 [pdf, ps, other]

Complexity Scaling Laws for Neural Models using Combinatorial Optimization

Authors: Lowell Weissman, Michael Krumdick, A. Lynn Abbott

Abstract: Recent work on neural scaling laws demonstrates that model performance scales predictably with compute budget, model size, and dataset size. In this work, we develop scaling laws based on problem complexity. We analyze two fundamental complexity measures: solution space size and representation space size. Using the Traveling Salesman Problem (TSP) as a case study, we show that combinatorial optimization promotes smooth cost trends, and therefore meaningful scaling laws can be obtained even in the absence of an interpretable loss. We then show that suboptimality grows predictably for fixed-size models when scaling the number of TSP nodes or spatial dimensions, independent of whether the model was trained with reinforcement learning or supervised fine-tuning on a static dataset. We conclude with an analogy to problem complexity scaling in local search, showing that a much simpler gradient descent of the cost landscape produces similar trends.

Submitted 15 June, 2025; originally announced June 2025.

Comments: 45 pages, 20 figures

arXiv:2506.12900 [pdf, ps, other]

Self-Stabilizing Replicated State Machine Coping with Byzantine and Recurring Transient Faults

Authors: Shlomi Dolev, Amit Hendin, Maurice Herlihy, Maria Potop Butucaru, Elad Michael Schiller

Abstract: The ability to perform repeated Byzantine agreement lies at the heart of important applications such as blockchain price oracles or replicated state machines. Any such protocol requires the following properties: (1) \textit{Byzantine fault-tolerance}, because not all participants can be assumed to be honest, (2) \textit{recurrent transient fault-tolerance}, because even honest participants may be subject to transient ``glitches'', (3) \textit{accuracy}, because the results of quantitative queries (such as price quotes) must lie within the interval of honest participants' inputs, and (4) \textit{self-stabilization}, because it is infeasible to reboot a distributed system following a fault. This paper presents the first protocol for repeated Byzantine agreement that satisfies the properties listed above. Specifically, starting in an arbitrary system configuration, our protocol establishes consistency. It preserves consistency in the face of up to $\lceil n/3 \rceil -1$ Byzantine participants {\em and} constant recurring (``noise'') transient faults, of up to $\lceil n/6 \rceil-1$ additional malicious transient faults, or even more than $\lceil n/6 \rceil-1$ (uniformly distributed) random transient faults, in each repeated Byzantine agreement.

Submitted 15 June, 2025; originally announced June 2025.

arXiv:2506.12724 [pdf, ps, other]

Dynamic Modality Scheduling for Multimodal Large Models via Confidence, Uncertainty, and Semantic Consistency

Authors: Hiroshi Tanaka, Anika Rao, Hana Satou, Michael Johnson, Sofia García

Abstract: Multimodal Large Models (MLLMs) have achieved remarkable progress in vision-language understanding and generation tasks. However, existing MLLMs typically rely on static modality fusion strategies, which treat all modalities equally regardless of their instance-level reliability or semantic contribution. This often leads to suboptimal performance, especially in scenarios with noisy, missing, or misaligned modalities. In this paper, we propose Dynamic Modality Scheduling (DMS), a novel framework that adaptively adjusts the contribution of each modality at a per-sample level. DMS evaluates each modality based on three key factors: (1) \textit{confidence}, estimated from predictive entropy; (2) \textit{uncertainty}, obtained via Monte Carlo dropout; and (3) \textit{semantic consistency}, computed through inter-modal similarity. These signals are combined through a learnable or rule-based scheduler to generate soft modality weights used in downstream fusion. To ensure stable training, we further introduce a \textit{Modality Weight Consistency Loss}, which regularizes the fused representation to stay close to unimodal embeddings proportionally to their assigned weights. Our method is model-agnostic and can be integrated into existing MLLMs such as BLIP-2 and LLaVA. Experimental results on VQA, image-text retrieval, and captioning tasks show that DMS significantly improves both clean and robust performance, especially under modality corruption or dropout conditions.
This work provides a general and effective mechanism to enable instance-aware and robustness-enhanced multimodal modeling.

Submitted 15 June, 2025; originally announced June 2025.

arXiv:2506.12611 [pdf, ps, other]

Accelerating Cloud-Based Transcriptomics: Performance Analysis and Optimization of the STAR Aligner Workflow

Authors: Piotr Kica, Sabina Lichołai, Michał Orzechowski, Maciej Malawski

Abstract: In this work, we explore the Transcriptomics Atlas pipeline adapted for cost-efficient and high-throughput computing in the cloud. We propose a scalable, cloud-native architecture designed for running a resource-intensive aligner -- STAR -- and processing tens or hundreds of terabytes of RNA-sequencing data. We implement multiple optimization techniques that give significant execution time and cost reduction. The impact of particular optimizations is measured in medium-scale experiments followed by a large-scale experiment that leverages all of them and validates the current design. Early stopping optimization allows a reduction in total alignment time by 23%. We analyze the scalability and efficiency of one of the most widely used sequence aligners. For the cloud environment, we identify one of the most suitable EC2 instance types and verify the applicability of spot instances usage.

Submitted 14 June, 2025; originally announced June 2025.

Comments: Accepted at ICCS2025

arXiv:2506.12563 [pdf, ps, other]

Benchmarking Image Similarity Metrics for Novel View Synthesis Applications

Authors: Charith Wickrema, Sara Leary, Shivangi Sarkar, Mark Giglio, Eric Bianchi, Eliza Mace, Michael Twardowski

Abstract: Traditional image similarity metrics are ineffective at evaluating the similarity between a real image of a scene and an artificially generated version of that viewpoint [6, 9, 13, 14]. Our research evaluates the effectiveness of a new, perceptual-based similarity metric, DreamSim [2], and three popular image similarity metrics: Structural Similarity (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Learned Perceptual Image Patch Similarity (LPIPS) [18, 19] in novel view synthesis (NVS) applications. We create a corpus of artificially corrupted images to quantify the sensitivity and discriminative power of each of the image similarity metrics. These tests reveal that traditional metrics are unable to effectively differentiate between images with minor pixel-level changes and those with substantial corruption, whereas DreamSim is more robust to minor defects and can effectively evaluate the high-level similarity of the image. Additionally, our results demonstrate that DreamSim provides a more effective and useful evaluation of render quality, especially for evaluating NVS renders in real-world use cases where slight rendering corruptions are common, but do not affect image utility for human tasks.

Submitted 14 June, 2025; originally announced June 2025.

arXiv:2506.12362 [pdf, ps, other]

HYPER: A Foundation Model for Inductive Link Prediction with Knowledge Hypergraphs

Authors: Xingyue Huang, Mikhail Galkin, Michael M. Bronstein, İsmail İlkan Ceylan

Abstract: Inductive link prediction with knowledge hypergraphs is the task of predicting missing hyperedges involving completely novel entities (i.e., nodes unseen during training). Existing methods for inductive link prediction with knowledge hypergraphs assume a fixed relational vocabulary and, as a result, cannot generalize to knowledge hypergraphs with novel relation types (i.e., relations unseen during training). Inspired by knowledge graph foundation models, we propose HYPER as a foundation model for link prediction, which can generalize to any knowledge hypergraph, including novel entities and novel relations. Importantly, HYPER can learn and transfer across different relation types of varying arities, by encoding the entities of each hyperedge along with their respective positions in the hyperedge. To evaluate HYPER, we construct 16 new inductive datasets from existing knowledge hypergraphs, covering a diverse range of relation types of varying arities. Empirically, HYPER consistently outperforms all existing methods in both node-only and node-and-relation inductive settings, showing strong generalization to unseen, higher-arity relational structures.

Submitted 14 June, 2025; originally announced June 2025.

arXiv:2506.12346 [pdf, ps, other]

Refract ICL: Rethinking Example Selection in the Era of Million-Token Models

Authors: Arjun R. Akula, Kazuma Hashimoto, Krishna Srinivasan, Aditi Chaudhary, Karthik Raman, Michael Bendersky

Abstract: The emergence of long-context large language models (LLMs) has enabled the use of hundreds, or even thousands, of demonstrations for in-context learning (ICL) - a previously impractical regime. This paper investigates whether traditional ICL selection strategies, which balance the similarity of ICL examples to the test input (using a text retriever) with diversity within the ICL set, remain effective when utilizing a large number of demonstrations. Our experiments demonstrate that, while longer contexts can accommodate more examples, simply increasing the number of demonstrations does not guarantee improved performance. Smart ICL selection remains crucial, even with thousands of demonstrations. To further enhance ICL in this setting, we introduce Refract ICL, a novel ICL selection algorithm specifically designed to focus LLM attention on challenging examples by strategically repeating them within the context and incorporating zero-shot predictions as error signals. Our results show that Refract ICL significantly improves the performance of extremely long-context models such as Gemini 1.5 Pro, particularly on tasks with a smaller number of output classes.

Submitted 14 June, 2025; originally announced June 2025.

arXiv:2506.12242 [pdf]

Large Language Models for History, Philosophy, and Sociology of Science: Interpretive Uses, Methodological Challenges, and Critical Perspectives

Authors: Arno Simons, Michael Zichert, Adrian Wüthrich

Abstract: This paper explores the use of large language models (LLMs) as research tools in the history, philosophy, and sociology of science (HPSS). LLMs are remarkably effective at processing unstructured text and inferring meaning from context, offering new affordances that challenge long-standing divides between computational and interpretive methods. This raises both opportunities and challenges for HPSS, which emphasizes interpretive methodologies and understands meaning as context-dependent, ambiguous, and historically situated. We argue that HPSS is uniquely positioned not only to benefit from LLMs' capabilities but also to interrogate their epistemic assumptions and infrastructural implications. To this end, we first offer a concise primer on LLM architectures and training paradigms tailored to non-technical readers. We frame LLMs not as neutral tools but as epistemic infrastructures that encode assumptions about meaning, context, and similarity, conditioned by their training data, architecture, and patterns of use. We then examine how computational techniques enhanced by LLMs, such as structuring data, detecting patterns, and modeling dynamic processes, can be applied to support interpretive research in HPSS. Our analysis compares full-context and generative models, outlines strategies for domain and task adaptation (e.g., continued pretraining, fine-tuning, and retrieval-augmented generation), and evaluates their respective strengths and limitations for interpretive inquiry in HPSS.
We conclude with four lessons for integrating LLMs into HPSS: (1) model selection involves interpretive trade-offs; (2) LLM literacy is foundational; (3) HPSS must define its own benchmarks and corpora; and (4) LLMs should enhance, not replace, interpretive methods.

Submitted 13 June, 2025; originally announced June 2025.

Comments: 27 pages, 2 tables

ACM Class: A.1; I.2.1; I.2.7; J.4; J.5

arXiv:2506.12128 [pdf, ps, other]

Improved Ground State Estimation in Quantum Field Theories via Normalising Flow-Assisted Neural Quantum States

Authors: Vishal S. Ngairangbam, Michael Spannowsky, Timur Sypchenko

Abstract: We propose a hybrid variational framework that enhances Neural Quantum States (NQS) with a Normalising Flow-based sampler to improve the expressivity and trainability of quantum many-body wavefunctions. Our approach decouples the sampling task from the variational ansatz by learning a continuous flow model that targets a discretised, amplitude-supported subspace of the Hilbert space. This overcomes limitations of Markov Chain Monte Carlo (MCMC) and autoregressive methods, especially in regimes with long-range correlations and volume-law entanglement. Applied to the transverse-field Ising model with both short- and long-range interactions, our method achieves comparable ground state energy errors with state-of-the-art matrix product states and lower energies than autoregressive NQS. For systems up to 50 spins, we demonstrate high accuracy and robust convergence across a wide range of coupling strengths, including regimes where competing methods fail. Our results showcase the utility of flow-assisted sampling as a scalable tool for quantum simulation and offer a new approach toward learning expressive quantum states in high-dimensional Hilbert spaces.

Submitted 13 June, 2025; originally announced June 2025.

Report number: IPPP/25/33

arXiv:2506.12103 [pdf, other]

The Amazon Nova Family of Models: Technical Report and Model Card

Authors: Amazon AGI, Aaron Langford, Aayush Shah, Abhanshu Gupta, Abhimanyu Bhatter, Abhinav Goyal, Abhinav Mathur, Abhinav Mohanty, Abhishek Kumar, Abhishek Sethi, Abi Komma, Abner Pena, Achin Jain, Adam Kunysz, Adam Opyrchal, Adarsh Singh, Aditya Rawal, Adok Achar Budihal Prasad, Adrià de Gispert, Agnika Kumar, Aishwarya Aryamane, Ajay Nair, Akilan M, Akshaya Iyengar, Akshaya Vishnu Kudlu Shanbhogue , et al. (761 additional authors not shown)

Abstract: We present Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. Amazon Nova Pro is a highly-capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Lite is a low-cost multimodal model that is lightning fast for processing images, video, documents and text. Amazon Nova Micro is a text-only model that delivers our lowest-latency responses at very low cost. Amazon Nova Canvas is an image generation model that creates professional grade images with rich customization controls. Amazon Nova Reel is a video generation model offering high-quality outputs, customization, and motion control. Our models were built responsibly and with a commitment to customer trust, security, and reliability. We report benchmarking results for core capabilities, agentic performance, long context, functional adaptation, runtime performance, and human evaluation.

Submitted 17 March, 2025; originally announced June 2025.

Comments: 48 pages, 10 figures

Report number: 20250317

arXiv:2506.12100 [pdf, ps, other]

LLM Embedding-based Attribution (LEA): Quantifying Source Contributions to Generative Model's Response for Vulnerability Analysis

Authors: Reza Fayyazi, Michael Zuzak, Shanchieh Jay Yang

Abstract: Security vulnerabilities are rapidly increasing in frequency and complexity, creating a shifting threat landscape that challenges cybersecurity defenses. Large Language Models (LLMs) have been widely adopted for cybersecurity threat analysis. When querying LLMs, dealing with new, unseen vulnerabilities is particularly challenging as it lies outside LLMs' pre-trained distribution. Retrieval-Augmented Generation (RAG) pipelines mitigate the problem by injecting up-to-date authoritative sources into the model context, thus reducing hallucinations and increasing the accuracy in responses. Meanwhile, the deployment of LLMs in security-sensitive environments introduces challenges around trust and safety. This raises a critical open question: How to quantify or attribute the generated response to the retrieved context versus the model's pre-trained knowledge? This work proposes LLM Embedding-based Attribution (LEA) -- a novel, explainable metric to paint a clear picture on the 'percentage of influence' the pre-trained knowledge vs. retrieved content has for each generated response. We apply LEA to assess responses to 100 critical CVEs from the past decade, verifying its effectiveness to quantify the insightfulness for vulnerability analysis. Our development of LEA reveals a progression of independency in hidden states of LLMs: heavy reliance on context in early layers, which enables the derivation of LEA; increased independency in later layers, which sheds light on why scale is essential for LLM's effectiveness. This work provides security analysts a means to audit LLM-assisted workflows, laying the groundwork for transparent, high-assurance deployments of RAG-enhanced LLMs in cybersecurity operations.

Submitted 12 June, 2025; originally announced June 2025.

Showing 1–50 of 21,312 results for author: Michael