Search | arXiv e-print repository

arXiv:2509.19296 [pdf, ps, other]

Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation

Authors: Sherwin Bahmani, Tianchang Shen, Jiawei Ren, Jiahui Huang, Yifeng Jiang, Haithem Turki, Andrea Tagliasacchi, David B. Lindell, Zan Gojcic, Sanja Fidler, Huan Ling, Jun Gao, Xuanchi Ren

Abstract: The ability to generate virtual environments is crucial for applications ranging from gaming to physical AI domains such as robotics, autonomous driving, and industrial AI. Current learning-based 3D reconstruction methods rely on the availability of captured real-world multi-view data, which is not always readily available. Recent advancements in video diffusion models have shown remarkable imagin… ▽ More The ability to generate virtual environments is crucial for applications ranging from gaming to physical AI domains such as robotics, autonomous driving, and industrial AI. Current learning-based 3D reconstruction methods rely on the availability of captured real-world multi-view data, which is not always readily available. Recent advancements in video diffusion models have shown remarkable imagination capabilities, yet their 2D nature limits the applications to simulation where a robot needs to navigate and interact with the environment. In this paper, we propose a self-distillation framework that aims to distill the implicit 3D knowledge in the video diffusion models into an explicit 3D Gaussian Splatting (3DGS) representation, eliminating the need for multi-view training data. Specifically, we augment the typical RGB decoder with a 3DGS decoder, which is supervised by the output of the RGB decoder. In this approach, the 3DGS decoder can be purely trained with synthetic data generated by video diffusion models. At inference time, our model can synthesize 3D scenes from either a text prompt or a single image for real-time rendering. Our framework further extends to dynamic 3D scene generation from a monocular input video. Experimental results show that our framework achieves state-of-the-art performance in static and dynamic 3D scene generation. △ Less

Submitted 23 September, 2025; originally announced September 2025.

Comments: Project Page: https://research.nvidia.com/labs/toronto-ai/lyra/

arXiv:2509.19231 [pdf, ps, other]

Finding My Voice: Generative Reconstruction of Disordered Speech for Automated Clinical Evaluation

Authors: Karen Rosero, Eunjung Yeo, David R. Mortensen, Cortney Van't Slot, Rami R. Hallac, Carlos Busso

Abstract: We present ChiReSSD, a speech reconstruction framework that preserves children speaker's identity while suppressing mispronunciations. Unlike prior approaches trained on healthy adult speech, ChiReSSD adapts to the voices of children with speech sound disorders (SSD), with particular emphasis on pitch and prosody. We evaluate our method on the STAR dataset and report substantial improvements in le… ▽ More We present ChiReSSD, a speech reconstruction framework that preserves children speaker's identity while suppressing mispronunciations. Unlike prior approaches trained on healthy adult speech, ChiReSSD adapts to the voices of children with speech sound disorders (SSD), with particular emphasis on pitch and prosody. We evaluate our method on the STAR dataset and report substantial improvements in lexical accuracy and speaker identity preservation. Furthermore, we automatically predict the phonetic content in the original and reconstructed pairs, where the proportion of corrected consonants is comparable to the percentage of correct consonants (PCC), a clinical speech assessment metric. Our experiments show Pearson correlation of 0.63 between automatic and human expert annotations, highlighting the potential to reduce the manual transcription burden. In addition, experiments on the TORGO dataset demonstrate effective generalization for reconstructing adult dysarthric speech. Our results indicate that disentangled, style-based TTS reconstruction can provide identity-preserving speech across diverse clinical populations. △ Less

Submitted 23 September, 2025; originally announced September 2025.

arXiv:2509.18768 [pdf, ps, other]

Purer than pure: how purity reshapes the upstream materiality of the semiconductor industry

Authors: Gauthier Roussilhe, Thibault Pirson, David Bol, Srinjoy Mitra

Abstract: Growing attention is given to the environmental impacts of the digital sector, exacerbated by the increase of digital products and services in our globalized societies. The materiality of the digital sector is often presented through the environmental impacts of mining activities to point out that digitization does not mean dematerialization. Despite its importance, such a narrative is often restr… ▽ More Growing attention is given to the environmental impacts of the digital sector, exacerbated by the increase of digital products and services in our globalized societies. The materiality of the digital sector is often presented through the environmental impacts of mining activities to point out that digitization does not mean dematerialization. Despite its importance, such a narrative is often restricted to a few minerals (e.g., cobalt, lithium) that have become the symbols of extractive industries. In this paper, we further explore the materiality of the digital sector with an approach based on the diversity of elements and their purity requirements in the semiconductor industry. Semiconductors are responsible for manufacturing the key building blocks of the digital sector, i.e., microchips. Given that the need for ultra-high purity materials is very specific to the semiconductor industry, a few companies around the world have been studied, revealing new critical actors in complex supply chains. This highlights strong dependencies towards other industrial sectors with mass production and the need for a deeper investigation of interactions with the chemical industry, complementary to the mining industry. △ Less

Submitted 23 September, 2025; originally announced September 2025.

Comments: 11 pages, 7 figures

arXiv:2509.18465 [pdf, ps, other]

Using Age of Information for Throughput Optimal Spectrum Sharing

Authors: Hongjae Nam, Vishrant Tripathi, David J. Love

Abstract: We consider a spectrum sharing problem where two users attempt to communicate over N channels. The Primary User (PU) has prioritized transmissions and its occupancy on each channel over time can be modeled as a Markov chain. The Secondary User (SU) needs to determine which channels are free at each time-slot and attempt opportunistic transmissions. The goal of the SU is to maximize its own through… ▽ More We consider a spectrum sharing problem where two users attempt to communicate over N channels. The Primary User (PU) has prioritized transmissions and its occupancy on each channel over time can be modeled as a Markov chain. The Secondary User (SU) needs to determine which channels are free at each time-slot and attempt opportunistic transmissions. The goal of the SU is to maximize its own throughput, while simultaneously minimizing collisions with the PU, and satisfying spectrum access constraints. To solve this problem, we first decouple the multiple-channel problem into N single-channel problems. For each decoupled problem, we prove that there exists an optimal threshold policy that depends on the last observed PU occupancy and the freshness of this occupancy information. Second, we establish the indexability of the decoupled problems by analyzing the structure of the optimal threshold policy. Using this structure, we derive a Whittle index-based scheduling policy that allocates SU transmissions using the Age of Information (AoI) of accessed channels. We also extend our insights to PU occupancy models that are correlated across channels and incorporate learning of unknown Markov transition matrices into our policies. Finally, we provide detailed numerical simulations that demonstrate the performance gains of our approach. △ Less

Submitted 22 September, 2025; originally announced September 2025.

Comments: 16 pages, 10 figures

arXiv:2509.18439 [pdf]

Developing an AI framework to automatically detect shared decision-making in patient-doctor conversations

Authors: Oscar J. Ponce-Ponte, David Toro-Tobon, Luis F. Figueroa, Michael Gionfriddo, Megan Branda, Victor M. Montori, Saturnino Luz, Juan P. Brito

Abstract: Shared decision-making (SDM) is necessary to achieve patient-centred care. Currently no methodology exists to automatically measure SDM at scale. This study aimed to develop an automated approach to measure SDM by using language modelling and the conversational alignment (CA) score. A total of 157 video-recorded patient-doctor conversations from a randomized multi-centre trial evaluating SDM decis… ▽ More Shared decision-making (SDM) is necessary to achieve patient-centred care. Currently no methodology exists to automatically measure SDM at scale. This study aimed to develop an automated approach to measure SDM by using language modelling and the conversational alignment (CA) score. A total of 157 video-recorded patient-doctor conversations from a randomized multi-centre trial evaluating SDM decision aids for anticoagulation in atrial fibrillations were transcribed and segmented into 42,559 sentences. Context-response pairs and negative sampling were employed to train deep learning (DL) models and fine-tuned BERT models via the next sentence prediction (NSP) task. Each top-performing model was used to calculate four types of CA scores. A random-effects analysis by clinician, adjusting for age, sex, race, and trial arm, assessed the association between CA scores and SDM outcomes: the Decisional Conflict Scale (DCS) and the Observing Patient Involvement in Decision-Making 12 (OPTION12) scores. p-values were corrected for multiple comparisons with the Benjamini-Hochberg method. Among 157 patients (34% female, mean age 70 SD 10.8), clinicians on average spoke more words than patients (1911 vs 773). The DL model without the stylebook strategy achieved a recall@1 of 0.227, while the fine-tuned BERTbase (110M) achieved the highest recall@1 with 0.640. The AbsMax (18.36 SE7.74 p=0.025) and Max CA (21.02 SE7.63 p=0.012) scores generated with the DL without stylebook were associated with OPTION12. The Max CA score generated with the fine-tuned BERTbase (110M) was associated with the DCS score (-27.61 SE12.63 p=0.037). BERT model sizes did not have an impact the association between CA scores and SDM. This study introduces an automated, scalable methodology to measure SDM in patient-doctor conversations through explainable CA scores, with potential to evaluate SDM strategies at scale. △ Less

Submitted 22 September, 2025; originally announced September 2025.

Comments: 53 pages, 1 figure, 4 tables, 5 supplementary figures, 13 supplementary tables

arXiv:2509.18412 [pdf, ps, other]

Identifying birdsong syllables without labelled data

Authors: Mélisande Teng, Julien Boussard, David Rolnick, Hugo Larochelle

Abstract: Identifying sequences of syllables within birdsongs is key to tackling a wide array of challenges, including bird individual identification and better understanding of animal communication and sensory-motor learning. Recently, machine learning approaches have demonstrated great potential to alleviate the need for experts to label long audio recordings by hand. However, they still typically rely on… ▽ More Identifying sequences of syllables within birdsongs is key to tackling a wide array of challenges, including bird individual identification and better understanding of animal communication and sensory-motor learning. Recently, machine learning approaches have demonstrated great potential to alleviate the need for experts to label long audio recordings by hand. However, they still typically rely on the availability of labelled data for model training, restricting applicability to a few species and datasets. In this work, we build the first fully unsupervised algorithm to decompose birdsong recordings into sequences of syllables. We first detect syllable events, then cluster them to extract templates --syllable representations-- before performing matching pursuit to decompose the recording as a sequence of syllables. We evaluate our automatic annotations against human labels on a dataset of Bengalese finch songs and find that our unsupervised method achieves high performance. We also demonstrate that our approach can distinguish individual birds within a species through their unique vocal signatures, for both Bengalese finches and another species, the great tit. △ Less

Submitted 22 September, 2025; originally announced September 2025.

arXiv:2509.18357

doi 10.4204/EPTCS.429

Proceedings Seventh International Conference on Applied Category Theory 2024

Authors: Michael Johnson, David Jaz Myers

Abstract: Proceedings of the Seventh International Conference on Applied Category Theory, held at the University of Oxford on 17 - 21 June 2024. The contributions to ACT 2024 ranged from pure to applied and included contributions in a wide range of disciplines in science and engineering. ACT 2024 included talks in classical mechanics, quantum physics, probability theory, linguistics, decision theory, machin… ▽ More Proceedings of the Seventh International Conference on Applied Category Theory, held at the University of Oxford on 17 - 21 June 2024. The contributions to ACT 2024 ranged from pure to applied and included contributions in a wide range of disciplines in science and engineering. ACT 2024 included talks in classical mechanics, quantum physics, probability theory, linguistics, decision theory, machine learning, epidemiology, thermodynamics, engineering, and logic. △ Less

Submitted 22 September, 2025; originally announced September 2025.

Journal ref: EPTCS 429, 2025

arXiv:2509.18030 [pdf, ps, other]

RadEval: A framework for radiology text evaluation

Authors: Justin Xu, Xi Zhang, Javid Abderezaei, Julie Bauml, Roger Boodoo, Fatemeh Haghighi, Ali Ganjizadeh, Eric Brattain, Dave Van Veen, Zaiqiao Meng, David Eyre, Jean-Benoit Delbrouck

Abstract: We introduce RadEval, a unified, open-source framework for evaluating radiology texts. RadEval consolidates a diverse range of metrics, from classic n-gram overlap (BLEU, ROUGE) and contextual measures (BERTScore) to clinical concept-based scores (F1CheXbert, F1RadGraph, RaTEScore, SRR-BERT, TemporalEntityF1) and advanced LLM-based evaluators (GREEN). We refine and standardize implementations, ext… ▽ More We introduce RadEval, a unified, open-source framework for evaluating radiology texts. RadEval consolidates a diverse range of metrics, from classic n-gram overlap (BLEU, ROUGE) and contextual measures (BERTScore) to clinical concept-based scores (F1CheXbert, F1RadGraph, RaTEScore, SRR-BERT, TemporalEntityF1) and advanced LLM-based evaluators (GREEN). We refine and standardize implementations, extend GREEN to support multiple imaging modalities with a more lightweight model, and pretrain a domain-specific radiology encoder, demonstrating strong zero-shot retrieval performance. We also release a richly annotated expert dataset with over 450 clinically significant error labels and show how different metrics correlate with radiologist judgment. Finally, RadEval provides statistical testing tools and baseline model evaluations across multiple publicly available datasets, facilitating reproducibility and robust benchmarking in radiology report generation. △ Less

Submitted 22 September, 2025; originally announced September 2025.

Comments: Accepted to EMNLP 2025 Demo track - Oral

arXiv:2509.17957 [pdf, ps, other]

On the Variational Costs of Changing Our Minds

Authors: David Hyland, Mahault Albarracin

Abstract: The human mind is capable of extraordinary achievements, yet it often appears to work against itself. It actively defends its cherished beliefs even in the face of contradictory evidence, conveniently interprets information to conform to desired narratives, and selectively searches for or avoids information to suit its various purposes. Despite these behaviours deviating from common normative stan… ▽ More The human mind is capable of extraordinary achievements, yet it often appears to work against itself. It actively defends its cherished beliefs even in the face of contradictory evidence, conveniently interprets information to conform to desired narratives, and selectively searches for or avoids information to suit its various purposes. Despite these behaviours deviating from common normative standards for belief updating, we argue that such 'biases' are not inherently cognitive flaws, but rather an adaptive response to the significant pragmatic and cognitive costs associated with revising one's beliefs. This paper introduces a formal framework that aims to model the influence of these costs on our belief updating mechanisms. We treat belief updating as a motivated variational decision, where agents weigh the perceived 'utility' of a belief against the informational cost required to adopt a new belief state, quantified by the Kullback-Leibler divergence from the prior to the variational posterior. We perform computational experiments to demonstrate that simple instantiations of this resource-rational model can be used to qualitatively emulate commonplace human behaviours, including confirmation bias and attitude polarisation. In doing so, we suggest that this framework makes steps toward a more holistic account of the motivated Bayesian mechanics of belief change and provides practical insights for predicting, compensating for, and correcting deviations from desired belief updating processes. △ Less

Submitted 22 September, 2025; originally announced September 2025.

Comments: Accepted as a full paper at the 6th International Workshop on Active Inference

arXiv:2509.17789 [pdf, ps, other]

From Restoration to Reconstruction: Rethinking 3D Gaussian Splatting for Underwater Scenes

Authors: Guoxi Huang, Haoran Wang, Zipeng Qi, Wenjun Lu, David Bull, Nantheera Anantrasirichai

Abstract: Underwater image degradation poses significant challenges for 3D reconstruction, where simplified physical models often fail in complex scenes. We propose \textbf{R-Splatting}, a unified framework that bridges underwater image restoration (UIR) with 3D Gaussian Splatting (3DGS) to improve both rendering quality and geometric fidelity. Our method integrates multiple enhanced views produced by diver… ▽ More Underwater image degradation poses significant challenges for 3D reconstruction, where simplified physical models often fail in complex scenes. We propose \textbf{R-Splatting}, a unified framework that bridges underwater image restoration (UIR) with 3D Gaussian Splatting (3DGS) to improve both rendering quality and geometric fidelity. Our method integrates multiple enhanced views produced by diverse UIR models into a single reconstruction pipeline. During inference, a lightweight illumination generator samples latent codes to support diverse yet coherent renderings, while a contrastive loss ensures disentangled and stable illumination representations. Furthermore, we propose \textit{Uncertainty-Aware Opacity Optimization (UAOO)}, which models opacity as a stochastic function to regularize training. This suppresses abrupt gradient responses triggered by illumination variation and mitigates overfitting to noisy or view-specific artifacts. Experiments on Seathru-NeRF and our new BlueCoral3D dataset demonstrate that R-Splatting outperforms strong baselines in both rendering quality and geometric accuracy. △ Less

Submitted 22 September, 2025; originally announced September 2025.

arXiv:2509.17768 [pdf, ps, other]

DIVERS-Bench: Evaluating Language Identification Across Domain Shifts and Code-Switching

Authors: Jessica Ojo, Zina Kamel, David Ifeoluwa Adelani

Abstract: Language Identification (LID) is a core task in multilingual NLP, yet current systems often overfit to clean, monolingual data. This work introduces DIVERS-BENCH, a comprehensive evaluation of state-of-the-art LID models across diverse domains, including speech transcripts, web text, social media texts, children's stories, and code-switched text. Our findings reveal that while models achieve high… ▽ More Language Identification (LID) is a core task in multilingual NLP, yet current systems often overfit to clean, monolingual data. This work introduces DIVERS-BENCH, a comprehensive evaluation of state-of-the-art LID models across diverse domains, including speech transcripts, web text, social media texts, children's stories, and code-switched text. Our findings reveal that while models achieve high accuracy on curated datasets, performance degrades sharply on noisy and informal inputs. We also introduce DIVERS-CS, a diverse code-switching benchmark dataset spanning 10 language pairs, and show that existing models struggle to detect multiple languages within the same sentence. These results highlight the need for more robust and inclusive LID systems in real-world settings. △ Less

Submitted 22 September, 2025; originally announced September 2025.

arXiv:2509.17726 [pdf, ps, other]

Automated Labeling of Intracranial Arteries with Uncertainty Quantification Using Deep Learning

Authors: Javier Bisbal, Patrick Winter, Sebastian Jofre, Aaron Ponce, Sameer A. Ansari, Ramez Abdalla, Michael Markl, Oliver Welin Odeback, Sergio Uribe, Cristian Tejos, Julio Sotelo, Susanne Schnell, David Marlevi

Abstract: Accurate anatomical labeling of intracranial arteries is essential for cerebrovascular diagnosis and hemodynamic analysis but remains time-consuming and subject to interoperator variability. We present a deep learning-based framework for automated artery labeling from 3D Time-of-Flight Magnetic Resonance Angiography (3D ToF-MRA) segmentations (n=35), incorporating uncertainty quantification to enh… ▽ More Accurate anatomical labeling of intracranial arteries is essential for cerebrovascular diagnosis and hemodynamic analysis but remains time-consuming and subject to interoperator variability. We present a deep learning-based framework for automated artery labeling from 3D Time-of-Flight Magnetic Resonance Angiography (3D ToF-MRA) segmentations (n=35), incorporating uncertainty quantification to enhance interpretability and reliability. We evaluated three convolutional neural network architectures: (1) a UNet with residual encoder blocks, reflecting commonly used baselines in vascular labeling; (2) CS-Net, an attention-augmented UNet incorporating channel and spatial attention mechanisms for enhanced curvilinear structure recognition; and (3) nnUNet, a self-configuring framework that automates preprocessing, training, and architectural adaptation based on dataset characteristics. Among these, nnUNet achieved the highest labeling performance (average Dice score: 0.922; average surface distance: 0.387 mm), with improved robustness in anatomically complex vessels. To assess predictive confidence, we implemented test-time augmentation (TTA) and introduced a novel coordinate-guided strategy to reduce interpolation errors during augmented inference. The resulting uncertainty maps reliably indicated regions of anatomical ambiguity, pathological variation, or manual labeling inconsistency. We further validated clinical utility by comparing flow velocities derived from automated and manual labels in co-registered 4D Flow MRI datasets, observing close agreement with no statistically significant differences. Our framework offers a scalable, accurate, and uncertainty-aware solution for automated cerebrovascular labeling, supporting downstream hemodynamic analysis and facilitating clinical integration. △ Less

Submitted 22 September, 2025; originally announced September 2025.

Comments: 16 pages, 6 figures

MSC Class: I.4.0

arXiv:2509.17661 [pdf, ps, other]

Comparator Loss: An Ordinal Contrastive Loss to Derive a Severity Score for Speech-based Health Monitoring

Authors: Jacob J Webber, Oliver Watts, Lovisa Wihlborg, David Wheatley, Johnny Tam, Christine Weaver, Suvankar Pal, Siddharthan Chandran, Cassia Valentini-Botinhao

Abstract: Monitoring the progression of neurodegenerative disease has important applications in the planning of treatment and the evaluation of future medications. Whereas much of the state-of-the-art in health monitoring from speech has been focused on classifying patients versus healthy controls, or predicting real-world health metrics, we propose here a novel measure of disease progression: the severity… ▽ More Monitoring the progression of neurodegenerative disease has important applications in the planning of treatment and the evaluation of future medications. Whereas much of the state-of-the-art in health monitoring from speech has been focused on classifying patients versus healthy controls, or predicting real-world health metrics, we propose here a novel measure of disease progression: the severity score. This score is derived from a model trained to minimize what we call the comparator loss. The comparator loss ensures scores follow an ordering relation, which can be based on diagnosis, clinically annotated scores, or simply the chronological order of the recordings. In addition to giving a more detailed picture than a simple discrete classification, the proposed comparator loss-based system has the potential to incorporate information from disparate health metrics, which is critical for making full use of small health-related datasets. We evaluated our proposed models based on their ability to affirmatively track the progression of patients with motor neuron disease (MND), the correlation of their output with clinical annotations such as ALSFRS-R, as well as their ability to distinguish between subjects with MND and healthy controls. △ Less

Submitted 22 September, 2025; originally announced September 2025.

Comments: Submitted to ICASSP 2026. This work is supported by NEURii, a collaborative partnership involving the University of Edinburgh, Gates Ventures, Eisai, LifeArc and Health Data Research UK (HDR UK)

arXiv:2509.17645 [pdf, ps, other]

RAVEN: RAnking and Validation of ExoplaNets

Authors: Andreas Hadjigeorghiou, David J. Armstrong, Kaiming Cui, Marina Lafarga Magro, Luis Agustín Nieto, Rodrigo F. Díaz, Lauren Doyle, Vedad Kunovac

Abstract: We present RAVEN, a newly developed vetting and validation pipeline for TESS exoplanet candidates. The pipeline employs a Bayesian framework to derive the posterior probability of a candidate being a planet against a set of False Positive (FP) scenarios, through the use of a Gradient Boosted Decision Tree and a Gaussian Process classifier, trained on comprehensive synthetic training sets of simula… ▽ More We present RAVEN, a newly developed vetting and validation pipeline for TESS exoplanet candidates. The pipeline employs a Bayesian framework to derive the posterior probability of a candidate being a planet against a set of False Positive (FP) scenarios, through the use of a Gradient Boosted Decision Tree and a Gaussian Process classifier, trained on comprehensive synthetic training sets of simulated planets and 8 astrophysical FP scenarios injected into TESS lightcurves. These training sets allow large scale candidate vetting and performance verification against individual FP scenarios. A Non-Simulated FP training set consisting of real TESS candidates caused primarily by stellar variability and systematic noise is also included. The machine learning derived probabilities are combined with scenario specific prior probabilities, including the candidates' positional probabilities, to compute the final posterior probabilities. Candidates with a planetary posterior probability greater than 99% against each FP scenario and whose implied planetary radius is less than 8$R_{\oplus}$ are considered to be statistically validated by the pipeline. In this first version, the pipeline has been developed for candidates with a lightcurve released from the TESS Science Processing Operations Centre, an orbital period between 0.5 and 16 days and a transit depth greater than 300ppm. The pipeline obtained area-under-curve (AUC) scores > 97% on all FP scenarios and > 99% on all but one. Testing on an independent external sample of 1361 pre-classified TOIs, the pipeline achieved an overall accuracy of 91%, demonstrating its effectiveness for automated ranking of TESS candidates. For a probability threshold of 0.9 the pipeline reached a precision of 97% with a recall score of 66% on these TOIs. The RAVEN pipeline is publicly released as a cloud-hosted app, making it easily accessible to the community. △ Less

Submitted 22 September, 2025; originally announced September 2025.

Comments: Submitted to MNRAS. Comments from the community are welcome

arXiv:2509.17601 [pdf, ps, other]

FastNet: Improving the physical consistency of machine-learning weather prediction models through loss function design

Authors: Tom Dunstan, Oliver Strickson, Thusal Bennett, Jack Bowyer, Matthew Burnand, James Chappell, Alejandro Coca-Castro, Kirstine Ida Dale, Eric G. Daub, Noushin Eftekhari, Manvendra Janmaijaya, Jon Lillis, David Salvador-Jasin, Nathan Simpson, Ryan Sze-Yin Chan, Mohamad Elmasri, Lydia Allegranza France, Sam Madge, Levan Bokeria, Hannah Brown, Tom Dodds, Anna-Louise Ellis, David Llewellyn-Jones, Theo McCaie, Sophia Moreton , et al. (9 additional authors not shown)

Abstract: Machine learning weather prediction (MLWP) models have demonstrated remarkable potential in delivering accurate forecasts at significantly reduced computational cost compared to traditional numerical weather prediction (NWP) systems. However, challenges remain in ensuring the physical consistency of MLWP outputs, particularly in deterministic settings. This study presents FastNet, a graph neural n… ▽ More Machine learning weather prediction (MLWP) models have demonstrated remarkable potential in delivering accurate forecasts at significantly reduced computational cost compared to traditional numerical weather prediction (NWP) systems. However, challenges remain in ensuring the physical consistency of MLWP outputs, particularly in deterministic settings. This study presents FastNet, a graph neural network (GNN)-based global prediction model, and investigates the impact of alternative loss function designs on improving the physical realism of its forecasts. We explore three key modifications to the standard mean squared error (MSE) loss: (1) a modified spherical harmonic (MSH) loss that penalises spectral amplitude errors to reduce blurring and enhance small-scale structure retention; (2) inclusion of horizontal gradient terms in the loss to suppress non-physical artefacts; and (3) an alternative wind representation that decouples speed and direction to better capture extreme wind events. Results show that while the MSH and gradient-based losses \textit{alone} may slightly degrade RMSE scores, when trained in combination the model exhibits very similar MSE performance to an MSE-trained model while at the same time significantly improving spectral fidelity and physical consistency. The alternative wind representation further improves wind speed accuracy and reduces directional bias. Collectively, these findings highlight the importance of loss function design as a mechanism for embedding domain knowledge into MLWP models and advancing their operational readiness. △ Less

Submitted 22 September, 2025; originally announced September 2025.

arXiv:2509.17469 [pdf, ps, other]

doi 10.1007/978-3-032-04354-2_20

LongEval at CLEF 2025: Longitudinal Evaluation of IR Systems on Web and Scientific Data

Authors: Matteo Cancellieri, Alaa El-Ebshihy, Tobias Fink, Maik Fröbe, Petra Galuščáková, Gabriela Gonzalez-Saez, Lorraine Goeuriot, David Iommi, Jüri Keller, Petr Knoth, Philippe Mulhem, Florina Piroi, David Pride, Philipp Schaer

Abstract: The LongEval lab focuses on the evaluation of information retrieval systems over time. Two datasets are provided that capture evolving search scenarios with changing documents, queries, and relevance assessments. Systems are assessed from a temporal perspective-that is, evaluating retrieval effectiveness as the data they operate on changes. In its third edition, LongEval featured two retrieval tas… ▽ More The LongEval lab focuses on the evaluation of information retrieval systems over time. Two datasets are provided that capture evolving search scenarios with changing documents, queries, and relevance assessments. Systems are assessed from a temporal perspective-that is, evaluating retrieval effectiveness as the data they operate on changes. In its third edition, LongEval featured two retrieval tasks: one in the area of ad-hoc web retrieval, and another focusing on scientific article retrieval. We present an overview of this year's tasks and datasets, as well as the participating systems. A total of 19 teams submitted their approaches, which we evaluated using nDCG and a variety of measures that quantify changes in retrieval effectiveness over time. △ Less

Submitted 22 September, 2025; originally announced September 2025.

arXiv:2509.17405 [pdf, ps, other]

Efficient Sliced Wasserstein Distance Computation via Adaptive Bayesian Optimization

Authors: Manish Acharya, David Hyde

Abstract: The sliced Wasserstein distance (SW) reduces optimal transport on $\mathbb{R}^d$ to a sum of one-dimensional projections, and thanks to this efficiency, it is widely used in geometry, generative modeling, and registration tasks. Recent work shows that quasi-Monte Carlo constructions for computing SW (QSW) yield direction sets with excellent approximation error. This paper presents an alternate, no… ▽ More The sliced Wasserstein distance (SW) reduces optimal transport on $\mathbb{R}^d$ to a sum of one-dimensional projections, and thanks to this efficiency, it is widely used in geometry, generative modeling, and registration tasks. Recent work shows that quasi-Monte Carlo constructions for computing SW (QSW) yield direction sets with excellent approximation error. This paper presents an alternate, novel approach: learning directions with Bayesian optimization (BO), particularly in settings where SW appears inside an optimization loop (e.g., gradient flows). We introduce a family of drop-in selectors for projection directions: BOSW, a one-shot BO scheme on the unit sphere; RBOSW, a periodic-refresh variant; ABOSW, an adaptive hybrid that seeds from competitive QSW sets and performs a few lightweight BO refinements; and ARBOSW, a restarted hybrid that periodically relearns directions during optimization. Our BO approaches can be composed with QSW and its variants (demonstrated by ABOSW/ARBOSW) and require no changes to downstream losses or gradients. We provide numerical experiments where our methods achieve state-of-the-art performance, and on the experimental suite of the original QSW paper, we find that ABOSW and ARBOSW can achieve convergence comparable to the best QSW variants with modest runtime overhead. △ Less

Submitted 23 September, 2025; v1 submitted 22 September, 2025; originally announced September 2025.

Comments: 19 pages, 11 figures

MSC Class: 49Q22 (Primary) 90C57; 68Txx (Secondary) ACM Class: G.3; I.2

arXiv:2509.17389 [pdf, ps, other]

3D Printable Soft Liquid Metal Sensors for Delicate Manipulation Tasks

Authors: Lois Liow, Jonty Milford, Emre Uygun, Andre Farinha, Vinoth Viswanathan, Josh Pinskier, David Howard

Abstract: Robotics and automation are key enablers to increase throughput in ongoing conservation efforts across various threatened ecosystems. Cataloguing, digitisation, husbandry, and similar activities require the ability to interact with delicate, fragile samples without damaging them. Additionally, learning-based solutions to these tasks require the ability to safely acquire data to train manipulation… ▽ More Robotics and automation are key enablers to increase throughput in ongoing conservation efforts across various threatened ecosystems. Cataloguing, digitisation, husbandry, and similar activities require the ability to interact with delicate, fragile samples without damaging them. Additionally, learning-based solutions to these tasks require the ability to safely acquire data to train manipulation policies through, e.g., reinforcement learning. To address these twin needs, we introduce a novel method to print free-form, highly sensorised soft 'physical twins'. We present an automated design workflow to create complex and customisable 3D soft sensing structures on demand from 3D scans or models. Compared to the state of the art, our soft liquid metal sensors faithfully recreate complex natural geometries and display excellent sensing properties suitable for validating performance in delicate manipulation tasks. We demonstrate the application of our physical twins as 'sensing corals': high-fidelity, 3D printed replicas of scanned corals that eliminate the need for live coral experimentation, whilst increasing data quality, offering an ethical and scalable pathway for advancing autonomous coral handling and soft manipulation broadly. Through extensive bench-top manipulation and underwater grasping experiments, we show that our sensing coral is able to detect grasps under 0.5 N, effectively capturing the delicate interactions and light contact forces required for coral handling. Finally, we showcase the value of our physical twins across two demonstrations: (i) automated coral labelling for lab identification and (ii) robotic coral aquaculture. Sensing physical twins such as ours can provide richer grasping feedback than conventional sensors providing experimental validation of prior to deployment in handling fragile and delicate items. △ Less

Submitted 22 September, 2025; originally announced September 2025.

Comments: 8 pages, 4 figures

arXiv:2509.17351 [pdf, ps, other]

Institutional Research Computing Capabilities in Australia: 2024

Authors: Slava Kitaeff, Luc Betbeder-Matibet, Jake Carroll, Stephen Giugni, David Abramson, John Zaitseff, Sarah Walters, David Powell, Chris Bording, Trung Nguyen, Angus Macoustra, Fabien Voisin, Bowen Chen, Jarrod Hurley

Abstract: Institutional research computing infrastructure plays a vital role in Australia's research ecosystem, complementing and extending national facilities. This paper analyses research computing capabilities across Australian universities and organisations, showing how institutional systems support research excellence through local compute resources, specialised hardware, and cluster solutions. Our stu… ▽ More Institutional research computing infrastructure plays a vital role in Australia's research ecosystem, complementing and extending national facilities. This paper analyses research computing capabilities across Australian universities and organisations, showing how institutional systems support research excellence through local compute resources, specialised hardware, and cluster solutions. Our study finds that nearly 112,258 CPU cores and 2,241 GPUs serve over 6,000 researchers as essential bridges between desktops and national facilities, enabling workflows from development to large-scale computations. The estimated replacement value of this infrastructure is $144M AUD. Drawing on detailed data from multiple institutions, we identify key patterns in deployment, utilisation, and strategic alignment with research priorities. Institutional resources provide critical support for data-intensive projects, facilitate training and higher-degree student research, enable prototyping and development, and ensure data sovereignty compliance when required. The analysis shows how these facilities leverage national investments while addressing institution-specific needs that national systems cannot meet. We present evidence that strategic investment in institutional capabilities yields significant returns through greater research productivity, enhanced graduate training, and improved outcomes. The study offers insights for organisations planning computing strategies and highlights the importance of maintaining robust institutional resources alongside national facilities. △ Less

Submitted 22 September, 2025; originally announced September 2025.

Comments: 9 pages in IEEE Proceedings format, International Conference on eScience 2025, Accepted

arXiv:2509.17286 [pdf, ps, other]

RADE for Land Mobile Radio: A Neural Codec for Transmission of Speech over Baseband FM Radio Channels

Authors: David Rowe, Tibor Bece

Abstract: In the 1990s Land Mobile Radio (LMR) systems evolved from analog frequency modulation (FM) to standardised digital systems. Both digital and analog FM systems now co-exist in various services and exhibit similar speech quality. The architecture of many digital radios retains the analog FM modulator and demodulator from legacy analog radios, but driven by a multi-level digital pulse train rather th… ▽ More In the 1990s Land Mobile Radio (LMR) systems evolved from analog frequency modulation (FM) to standardised digital systems. Both digital and analog FM systems now co-exist in various services and exhibit similar speech quality. The architecture of many digital radios retains the analog FM modulator and demodulator from legacy analog radios, but driven by a multi-level digital pulse train rather than an analog voice signal. We denote this architecture baseband FM (BBFM). In this paper we describe a modern machine learning approach that uses an autoencoder to send high quality, 8 kHz bandwidth speech over the BBFM channel. The speech quality is shown to be superior to analog FM over simulated LMR channels in the presence of fading, and a demonstration of the system running over commodity UHF radios is presented. △ Less

Submitted 21 September, 2025; originally announced September 2025.

Comments: 6 pages, 9 figures

arXiv:2509.17282 [pdf, ps, other]

Task-Oriented Communications for 3D Scene Representation: Balancing Timeliness and Fidelity

Authors: Xiangmin Xu, Zhen Meng, Kan Chen, Jiaming Yang, Emma Li, Philip G. Zhao, David Flynn

Abstract: Real-time Three-dimensional (3D) scene representation is a foundational element that supports a broad spectrum of cutting-edge applications, including digital manufacturing, Virtual, Augmented, and Mixed Reality (VR/AR/MR), and the emerging metaverse. Despite advancements in real-time communication and computing, achieving a balance between timeliness and fidelity in 3D scene representation remain… ▽ More Real-time Three-dimensional (3D) scene representation is a foundational element that supports a broad spectrum of cutting-edge applications, including digital manufacturing, Virtual, Augmented, and Mixed Reality (VR/AR/MR), and the emerging metaverse. Despite advancements in real-time communication and computing, achieving a balance between timeliness and fidelity in 3D scene representation remains a challenge. This work investigates a wireless network where multiple homogeneous mobile robots, equipped with cameras, capture an environment and transmit images to an edge server over channels for 3D representation. We propose a contextual-bandit Proximal Policy Optimization (PPO) framework incorporating both Age of Information (AoI) and semantic information to optimize image selection for representation, balancing data freshness and representation quality. Two policies -- the $ω$-threshold and $ω$-wait policies -- together with two benchmark methods are evaluated, timeliness embedding and weighted sum, on standard datasets and baseline 3D scene representation models. Experimental results demonstrate improved representation fidelity while maintaining low latency, offering insight into the model's decision-making process. This work advances real-time 3D scene representation by optimizing the trade-off between timeliness and fidelity in dynamic environments. △ Less

Submitted 21 September, 2025; originally announced September 2025.

Comments: Submitted to IEEE Transactions on Mobile Computing

arXiv:2509.17265 [pdf, ps, other]

Identifying and Upweighting Power-Niche Users to Mitigate Popularity Bias in Recommendations

Authors: David Liu, Erik Weis, Moritz Laber, Tina Eliassi-Rad, Brennan Klein

Abstract: Recommender systems have been shown to exhibit popularity bias by over-recommending popular items and under-recommending relevant niche items. We seek to understand interactions with niche items in benchmark recommendation datasets as a step toward mitigating popularity bias. We find that, compared to mainstream users, niche-preferring users exhibit a longer-tailed activity-level distribution, ind… ▽ More Recommender systems have been shown to exhibit popularity bias by over-recommending popular items and under-recommending relevant niche items. We seek to understand interactions with niche items in benchmark recommendation datasets as a step toward mitigating popularity bias. We find that, compared to mainstream users, niche-preferring users exhibit a longer-tailed activity-level distribution, indicating the existence of users who both prefer niche items and exhibit high activity levels. We partition users along two axes: (1) activity level ("power" vs. "light") and (2) item-popularity preference ("mainstream" vs. "niche"), and show that in several benchmark datasets, the number of power-niche users (high activity and niche preference) is statistically significantly larger than expected under a null configuration model. Motivated by this observation, we propose a framework for reweighting the Bayesian Personalized Ranking (BPR) loss that simultaneously reweights based on user activity level and item popularity. Our method introduces two interpretable parameters: one controlling the significance of user activity level, and the other of item popularity. Experiments on benchmark datasets show that upweighting power-niche users reduces popularity bias and can increase overall performance. In contrast to previous work that only considers user activity level or item popularity in isolation, our results suggest that considering their interaction leads to Pareto-dominant performance. △ Less

Submitted 21 September, 2025; originally announced September 2025.

arXiv:2509.17202 [pdf]

Fundamental Mechanisms of Human Learning

Authors: Scott E. Allen, A. David Redish, René F. Kizilcec

Abstract: Learning underlies nearly all human behavior and is central to education and education reform. Although recent advances in neuroscience have revealed the fundamental structure of learning processes, these insights have yet to be integrated into research and practice. Specifically, neuroscience has found that decision-making is governed by a structured process of perception, action-selection, and e… ▽ More Learning underlies nearly all human behavior and is central to education and education reform. Although recent advances in neuroscience have revealed the fundamental structure of learning processes, these insights have yet to be integrated into research and practice. Specifically, neuroscience has found that decision-making is governed by a structured process of perception, action-selection, and execution, supported by multiple neural systems with distinct memory stores and learning mechanisms. These systems extract different types of information (categorical, predictive, structural, and sequential) challenging canonical models of memory used in learning and behavioral science research by providing a mechanistic account of how humans acquire and use knowledge. Because each system learns differently, effective teaching requires alignment with system-specific processes. We propose a unified model that integrates these neuroscientific insights, bridging basic mechanisms with outcomes in education, identity, belonging, and wellbeing. By translating first principles of neural information processing into a generalizable framework, this work advances theories of skill acquisition and transfer while establishing a foundation for interdisciplinary research to refine how learning is understood and supported across domains of human behavior. △ Less

Submitted 21 September, 2025; originally announced September 2025.

arXiv:2509.17180 [pdf, ps, other]

Regularizing Extrapolation in Causal Inference

Authors: David Arbour, Harsh Parikh, Bijan Niknam, Elizabeth Stuart, Kara Rudolph, Avi Feller

Abstract: Many common estimators in machine learning and causal inference are linear smoothers, where the prediction is a weighted average of the training outcomes. Some estimators, such as ordinary least squares and kernel ridge regression, allow for arbitrarily negative weights, which improve feature imbalance but often at the cost of increased dependence on parametric modeling assumptions and higher vari… ▽ More Many common estimators in machine learning and causal inference are linear smoothers, where the prediction is a weighted average of the training outcomes. Some estimators, such as ordinary least squares and kernel ridge regression, allow for arbitrarily negative weights, which improve feature imbalance but often at the cost of increased dependence on parametric modeling assumptions and higher variance. By contrast, estimators like importance weighting and random forests (sometimes implicitly) restrict weights to be non-negative, reducing dependence on parametric modeling and variance at the cost of worse imbalance. In this paper, we propose a unified framework that directly penalizes the level of extrapolation, replacing the current practice of a hard non-negativity constraint with a soft constraint and corresponding hyperparameter. We derive a worst-case extrapolation error bound and introduce a novel "bias-bias-variance" tradeoff, encompassing biases due to feature imbalance, model misspecification, and estimator variance; this tradeoff is especially pronounced in high dimensions, particularly when positivity is poor. We then develop an optimization procedure that regularizes this bound while minimizing imbalance and outline how to use this approach as a sensitivity analysis for dependence on parametric modeling assumptions. We demonstrate the effectiveness of our approach through synthetic experiments and a real-world application, involving the generalization of randomized controlled trial estimates to a target population of interest. △ Less

Submitted 21 September, 2025; originally announced September 2025.

arXiv:2509.16801 [pdf, ps, other]

Sublinear Time Quantum Sensitivity Sampling

Authors: Zhao Song, David P. Woodruff, Lichen Zhang

Abstract: We present a unified framework for quantum sensitivity sampling, extending the advantages of quantum computing to a broad class of classical approximation problems. Our unified framework provides a streamlined approach for constructing coresets and offers significant runtime improvements in applications such as clustering, regression, and low-rank approximation. Our contributions include: * $k$-… ▽ More We present a unified framework for quantum sensitivity sampling, extending the advantages of quantum computing to a broad class of classical approximation problems. Our unified framework provides a streamlined approach for constructing coresets and offers significant runtime improvements in applications such as clustering, regression, and low-rank approximation. Our contributions include: * $k$-median and $k$-means clustering: For $n$ points in $d$-dimensional Euclidean space, we give an algorithm that constructs an $ε$-coreset in time $\widetilde O(n^{0.5}dk^{2.5}~\mathrm{poly}(ε^{-1}))$ for $k$-median and $k$-means clustering. Our approach achieves a better dependence on $d$ and constructs smaller coresets that only consist of points in the dataset, compared to recent results of [Xue, Chen, Li and Jiang, ICML'23]. * $\ell_p$ regression: For $\ell_p$ regression problems, we construct an $ε$-coreset of size $\widetilde O_p(d^{\max\{1, p/2\}}ε^{-2})$ in time $\widetilde O_p(n^{0.5}d^{\max\{0.5, p/4\}+1}(ε^{-3}+d^{0.5}))$, improving upon the prior best quantum sampling approach of [Apers and Gribling, QIP'24] for all $p\in (0, 2)\cup (2, 22]$, including the widely studied least absolute deviation regression ($\ell_1$ regression). * Low-rank approximation with Frobenius norm error: We introduce the first quantum sublinear-time algorithm for low-rank approximation that does not rely on data-dependent parameters, and runs in $\widetilde O(nd^{0.5}k^{0.5}ε^{-1})$ time. Additionally, we present quantum sublinear algorithms for kernel low-rank approximation and tensor low-rank approximation, broadening the range of achievable sublinear time algorithms in randomized numerical linear algebra. △ Less

Submitted 20 September, 2025; originally announced September 2025.

arXiv:2509.16617 [pdf, ps, other]

Detection and Simulation of Urban Heat Islands Using a Fine-Tuned Geospatial Foundation Model

Authors: David Kreismann

Abstract: As urbanization and climate change progress, urban heat island effects are becoming more frequent and severe. To formulate effective mitigation plans, cities require detailed air temperature data. However, predictive analytics methods based on conventional machine learning models and limited data infrastructure often provide inaccurate predictions, especially in underserved areas. In this context,… ▽ More As urbanization and climate change progress, urban heat island effects are becoming more frequent and severe. To formulate effective mitigation plans, cities require detailed air temperature data. However, predictive analytics methods based on conventional machine learning models and limited data infrastructure often provide inaccurate predictions, especially in underserved areas. In this context, geospatial foundation models trained on unstructured global data demonstrate strong generalization and require minimal fine-tuning, offering an alternative for predictions where traditional approaches are limited. This study fine-tunes a geospatial foundation model to predict urban land surface temperatures under future climate scenarios and explores its response to land cover changes using simulated vegetation strategies. The fine-tuned model achieved pixel-wise downscaling errors below 1.74 °C and aligned with ground truth patterns, demonstrating an extrapolation capacity up to 3.62 °C. △ Less

Submitted 20 September, 2025; originally announced September 2025.

Comments: 12 pages, 4 figures, to appear in GI LNI (SKILL 2025)

ACM Class: I.2.6; I.5.4; I.6.8

arXiv:2509.16531 [pdf, ps, other]

Leveraging Multilingual Training for Authorship Representation: Enhancing Generalization across Languages and Domains

Authors: Junghwan Kim, Haotian Zhang, David Jurgens

Abstract: Authorship representation (AR) learning, which models an author's unique writing style, has demonstrated strong performance in authorship attribution tasks. However, prior research has primarily focused on monolingual settings-mostly in English-leaving the potential benefits of multilingual AR models underexplored. We introduce a novel method for multilingual AR learning that incorporates two key… ▽ More Authorship representation (AR) learning, which models an author's unique writing style, has demonstrated strong performance in authorship attribution tasks. However, prior research has primarily focused on monolingual settings-mostly in English-leaving the potential benefits of multilingual AR models underexplored. We introduce a novel method for multilingual AR learning that incorporates two key innovations: probabilistic content masking, which encourages the model to focus on stylistically indicative words rather than content-specific words, and language-aware batching, which improves contrastive learning by reducing cross-lingual interference. Our model is trained on over 4.5 million authors across 36 languages and 13 domains. It consistently outperforms monolingual baselines in 21 out of 22 non-English languages, achieving an average Recall@8 improvement of 4.85%, with a maximum gain of 15.91% in a single language. Furthermore, it exhibits stronger cross-lingual and cross-domain generalization compared to a monolingual model trained solely on English. Our analysis confirms the effectiveness of both proposed techniques, highlighting their critical roles in the model's improved performance. △ Less

Submitted 20 September, 2025; originally announced September 2025.

Comments: Accepted to EMNLP 2025

arXiv:2509.16508 [pdf, ps, other]

Federated Learning with Ad-hoc Adapter Insertions: The Case of Soft-Embeddings for Training Classifier-as-Retriever

Authors: Marijan Fofonjka, Shahryar Zehtabi, Alireza Behtash, Tyler Mauer, David Stout

Abstract: When existing retrieval-augmented generation (RAG) solutions are intended to be used for new knowledge domains, it is necessary to update their encoders, which are taken to be pretrained large language models (LLMs). However, fully finetuning these large models is compute- and memory-intensive, and even infeasible when deployed on resource-constrained edge devices. We propose a novel encoder archi… ▽ More When existing retrieval-augmented generation (RAG) solutions are intended to be used for new knowledge domains, it is necessary to update their encoders, which are taken to be pretrained large language models (LLMs). However, fully finetuning these large models is compute- and memory-intensive, and even infeasible when deployed on resource-constrained edge devices. We propose a novel encoder architecture in this work that addresses this limitation by using a frozen small language model (SLM), which satisfies the memory constraints of edge devices, and inserting a small adapter network before the transformer blocks of the SLM. The trainable adapter takes the token embeddings of the new corpus and learns to produce enhanced soft embeddings for it, while requiring significantly less compute power to update than full fine-tuning. We further propose a novel retrieval mechanism by attaching a classifier head to the SLM encoder, which is trained to learn a similarity mapping of the input embeddings to their corresponding documents. Finally, to enable the online fine-tuning of both (i) the encoder soft embeddings and (ii) the classifier-as-retriever on edge devices, we adopt federated learning (FL) and differential privacy (DP) to achieve an efficient, privacy-preserving, and product-grade training solution. We conduct a theoretical analysis of our methodology, establishing convergence guarantees under mild assumptions on gradient variance when deployed for general smooth nonconvex loss functions. Through extensive numerical experiments, we demonstrate (i) the efficacy of obtaining soft embeddings to enhance the encoder, (ii) training a classifier to improve the retriever, and (iii) the role of FL in achieving speedup. △ Less

Submitted 19 September, 2025; originally announced September 2025.

Comments: 22 pages, 7 figures, 3 tables

arXiv:2509.16413 [pdf, ps, other]

Pico: A Modular Framework for Hypothesis-Driven Small Language Model Research

Authors: Richard Diehl Martinez, David Demitri Africa, Yuval Weiss, Suchir Salhan, Ryan Daniels, Paula Buttery

Abstract: Building language models (LMs), especially small and medium ones, remains more art than science. While large LMs often improve by sheer scale, it is still unclear why many design choices work. For small LMs, this uncertainty is more limiting: tight parameter budgets make each decision critical, yet researchers still lack systematic, scientific ways to test and refine new ideas. We introduce Pico… ▽ More Building language models (LMs), especially small and medium ones, remains more art than science. While large LMs often improve by sheer scale, it is still unclear why many design choices work. For small LMs, this uncertainty is more limiting: tight parameter budgets make each decision critical, yet researchers still lack systematic, scientific ways to test and refine new ideas. We introduce Pico, a lightweight, modular framework that enables systematic, hypothesis-driven research for small and medium-scale language model development. Pico consists of two libraries that together provide a practical sandbox where researchers can make targeted changes to a model's architecture or training procedures and directly observe their effects on the model's behavior. To support reproducible experimentation, we also release a suite of baseline models, pico-decoder, trained under standardized conditions and open-sourced for the community. Case studies highlight how Pico can support iterative small LM design and analysis. △ Less

Submitted 19 September, 2025; originally announced September 2025.

arXiv:2509.16378 [pdf]

Longitudinal and Multimodal Recording System to Capture Real-World Patient-Clinician Conversations for AI and Encounter Research: Protocol

Authors: Misk Al Zahidy, Kerly Guevara Maldonado, Luis Vilatuna Andrango, Ana Cristina Proano, Ana Gabriela Claros, Maria Lizarazo Jimenez, David Toro-Tobon, Oscar J. Ponce-Ponce, Juan P. Brito

Abstract: The promise of AI in medicine depends on learning from data that reflect what matters to patients and clinicians. Most existing models are trained on electronic health records (EHRs), which capture biological measures but rarely patient-clinician interactions. These relationships, central to care, unfold across voice, text, and video, yet remain absent from datasets. As a result, AI systems traine… ▽ More The promise of AI in medicine depends on learning from data that reflect what matters to patients and clinicians. Most existing models are trained on electronic health records (EHRs), which capture biological measures but rarely patient-clinician interactions. These relationships, central to care, unfold across voice, text, and video, yet remain absent from datasets. As a result, AI systems trained solely on EHRs risk perpetuating a narrow biomedical view of medicine and overlooking the lived exchanges that define clinical encounters. Our objective is to design, implement, and evaluate the feasibility of a longitudinal, multimodal system for capturing patient-clinician encounters, linking 360 degree video/audio recordings with surveys and EHR data to create a dataset for AI research. This single site study is in an academic outpatient endocrinology clinic at Mayo Clinic. Adult patients with in-person visits to participating clinicians are invited to enroll. Encounters are recorded with a 360 degree video camera. After each visit, patients complete a survey on empathy, satisfaction, pace, and treatment burden. Demographic and clinical data are extracted from the EHR. Feasibility is assessed using five endpoints: clinician consent, patient consent, recording success, survey completion, and data linkage across modalities. Recruitment began in January 2025. By August 2025, 35 of 36 eligible clinicians (97%) and 212 of 281 approached patients (75%) had consented. Of consented encounters, 162 (76%) had complete recordings and 204 (96%) completed the survey. This study aims to demonstrate the feasibility of a replicable framework for capturing the multimodal dynamics of patient-clinician encounters. By detailing workflows, endpoints, and ethical safeguards, it provides a template for longitudinal datasets and lays the foundation for AI models that incorporate the complexity of care. △ Less

Submitted 19 September, 2025; originally announced September 2025.

Comments: 23 pages, 2 figures, 2 tables

arXiv:2509.16266 [pdf, ps, other]

Vibrational Fingerprints of Strained Polymers: A Spectroscopic Pathway to Mechanical State Prediction

Authors: Julian Konrad, Janina Mittelhaus, David M. Wilkins, Bodo Fiedler, Robert Meißner

Abstract: The vibrational response of polymer networks under load provides a sensitive probe of molecular deformation and a route to non-destructive diagnostics. Here we show that machine-learned force fields reproduce these spectroscopic fingerprints with quantum-level fidelity in realistic epoxy thermosets. Using MACE-OFF23 molecular dynamics, we capture the experimentally observed redshifts of para-pheny… ▽ More The vibrational response of polymer networks under load provides a sensitive probe of molecular deformation and a route to non-destructive diagnostics. Here we show that machine-learned force fields reproduce these spectroscopic fingerprints with quantum-level fidelity in realistic epoxy thermosets. Using MACE-OFF23 molecular dynamics, we capture the experimentally observed redshifts of para-phenylene stretching modes under tensile load, in contrast to the harmonic OPLS-AA model. These shifts correlate with molecular elongation and alignment, consistent with Badger's rule, directly linking vibrational features to local stress. To capture IR intensities, we trained a symmetry-adapted dipole moment model on representative epoxy fragments, enabling validation of strain responses. Together, these approaches provide chemically accurate and computationally accessible predictions of strain-dependent vibrational spectra. Our results establish vibrational fingerprints as predictive markers of mechanical state in polymer networks, pointing to new strategies for stress mapping and structural-health diagnostics in advanced materials. △ Less

Submitted 18 September, 2025; originally announced September 2025.

arXiv:2509.16262 [pdf]

Socratic Mind: Impact of a Novel GenAI-Powered Assessment Tool on Student Learning and Higher-Order Thinking

Authors: Jeonghyun Lee, Jui-Tse Hung, Meryem Yilmaz Soylu, Diana Popescu, Christopher Zhang Cui, Gayane Grigoryan, David A Joyner, Stephen W Harmon

Abstract: This study examines the impact of Socratic Mind, a Generative Artificial Intelligence (GenAI) powered formative assessment tool that employs Socratic questioning to support student learning in a large, fully online undergraduate-level computing course. Employing a quasi-experimental, mixed-methods design, we investigated participants' engagement patterns, the influence of user experience on engage… ▽ More This study examines the impact of Socratic Mind, a Generative Artificial Intelligence (GenAI) powered formative assessment tool that employs Socratic questioning to support student learning in a large, fully online undergraduate-level computing course. Employing a quasi-experimental, mixed-methods design, we investigated participants' engagement patterns, the influence of user experience on engagement, and impacts on both perceived and actual learning outcomes. Data were collected from the system logs, surveys on user experience and perceived engagement and learning gains, student reflections, and course performance data. Results indicated that participants consistently reported high levels of affective, behavioral, and cognitive engagement, and these were strongly linked to positive user experiences and perceived learning outcomes. Quantitative analysis further revealed that students who engaged with the GenAI tool experienced significant gains in their quiz scores compared to those who did not, particularly benefiting students with lower baseline achievement. Additionally, thematic analysis of qualitative feedback revealed substantial perceived improvements in higher-order thinking skills, including problem solving, critical thinking, and self-reflection. Our findings highlight the promise of AI-mediated dialogue in fostering deeper engagement and higher-order cognitive skills. As higher education institutions expand GenAI integration in curriculum, this dialogic, GenAI powered assessment tool can offer a scalable strategy to promote students' meaningful learning outcomes. △ Less

Submitted 17 September, 2025; originally announced September 2025.

arXiv:2509.16203 [pdf, ps, other]

Inverting Trojans in LLMs

Authors: Zhengxing Li, Guangmingmei Yang, Jayaram Raghuram, David J. Miller, George Kesidis

Abstract: While effective backdoor detection and inversion schemes have been developed for AIs used e.g. for images, there are challenges in "porting" these methods to LLMs. First, the LLM input space is discrete, which precludes gradient-based search over this space, central to many backdoor inversion methods. Second, there are ~30,000^k k-tuples to consider, k the token-length of a putative trigger. Third… ▽ More While effective backdoor detection and inversion schemes have been developed for AIs used e.g. for images, there are challenges in "porting" these methods to LLMs. First, the LLM input space is discrete, which precludes gradient-based search over this space, central to many backdoor inversion methods. Second, there are ~30,000^k k-tuples to consider, k the token-length of a putative trigger. Third, for LLMs there is the need to blacklist tokens that have strong marginal associations with the putative target response (class) of an attack, as such tokens give false detection signals. However, good blacklists may not exist for some domains. We propose a LLM trigger inversion approach with three key components: i) discrete search, with putative triggers greedily accreted, starting from a select list of singletons; ii) implicit blacklisting, achieved by evaluating the average cosine similarity, in activation space, between a candidate trigger and a small clean set of samples from the putative target class; iii) detection when a candidate trigger elicits high misclassifications, and with unusually high decision confidence. Unlike many recent works, we demonstrate that our approach reliably detects and successfully inverts ground-truth backdoor trigger phrases. △ Less

Submitted 19 September, 2025; originally announced September 2025.

arXiv:2509.16180 [pdf, ps, other]

Query-Efficient Locally Private Hypothesis Selection via the Scheffe Graph

Authors: Gautam Kamath, Alireza F. Pour, Matthew Regehr, David P. Woodruff

Abstract: We propose an algorithm with improved query-complexity for the problem of hypothesis selection under local differential privacy constraints. Given a set of $k$ probability distributions $Q$, we describe an algorithm that satisfies local differential privacy, performs $\tilde{O}(k^{3/2})$ non-adaptive queries to individuals who each have samples from a probability distribution $p$, and outputs a pr… ▽ More We propose an algorithm with improved query-complexity for the problem of hypothesis selection under local differential privacy constraints. Given a set of $k$ probability distributions $Q$, we describe an algorithm that satisfies local differential privacy, performs $\tilde{O}(k^{3/2})$ non-adaptive queries to individuals who each have samples from a probability distribution $p$, and outputs a probability distribution from the set $Q$ which is nearly the closest to $p$. Previous algorithms required either $Ω(k^2)$ queries or many rounds of interactive queries. Technically, we introduce a new object we dub the Scheffé graph, which captures structure of the differences between distributions in $Q$, and may be of more broad interest for hypothesis selection tasks. △ Less

Submitted 19 September, 2025; originally announced September 2025.

arXiv:2509.16040 [pdf, ps, other]

Automated Constitutive Model Discovery by Pairing Sparse Regression Algorithms with Model Selection Criteria

Authors: Jorge-Humberto Urrea-Quintero, David Anton, Laura De Lorenzis, Henning Wessels

Abstract: The automated discovery of constitutive models from data has recently emerged as a promising alternative to the traditional model calibration paradigm. In this work, we present a fully automated framework for constitutive model discovery that systematically pairs three sparse regression algorithms (Least Absolute Shrinkage and Selection Operator (LASSO), Least Angle Regression (LARS), and Orthogon… ▽ More The automated discovery of constitutive models from data has recently emerged as a promising alternative to the traditional model calibration paradigm. In this work, we present a fully automated framework for constitutive model discovery that systematically pairs three sparse regression algorithms (Least Absolute Shrinkage and Selection Operator (LASSO), Least Angle Regression (LARS), and Orthogonal Matching Pursuit (OMP)) with three model selection criteria: $K$-fold cross-validation (CV), Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC). This pairing yields nine distinct algorithms for model discovery and enables a systematic exploration of the trade-off between sparsity, predictive performance, and computational cost. While LARS serves as an efficient path-based solver for the $\ell_1$-constrained problem, OMP is introduced as a tractable heuristic for $\ell_0$-regularized selection. The framework is applied to both isotropic and anisotropic hyperelasticity, utilizing both synthetic and experimental datasets. Results reveal that all nine algorithm-criterion combinations perform consistently well for the discovery of isotropic and anisotropic materials, yielding highly accurate constitutive models. These findings broaden the range of viable discovery algorithms beyond $\ell_1$-based approaches such as LASSO. △ Less

Submitted 19 September, 2025; originally announced September 2025.

arXiv:2509.16032 [pdf, ps, other]

A Matter of Height: The Impact of a Robotic Object on Human Compliance

Authors: Michael Faber, Andrey Grishko, Julian Waksberg, David Pardo, Tomer Leivy, Yuval Hazan, Emanuel Talmansky, Benny Megidish, Hadas Erel

Abstract: Robots come in various forms and have different characteristics that may shape the interaction with them. In human-human interactions, height is a characteristic that shapes human dynamics, with taller people typically perceived as more persuasive. In this work, we aspired to evaluate if the same impact replicates in a human-robot interaction and specifically with a highly non-humanoid robotic obj… ▽ More Robots come in various forms and have different characteristics that may shape the interaction with them. In human-human interactions, height is a characteristic that shapes human dynamics, with taller people typically perceived as more persuasive. In this work, we aspired to evaluate if the same impact replicates in a human-robot interaction and specifically with a highly non-humanoid robotic object. The robot was designed with modules that could be easily added or removed, allowing us to change its height without altering other design features. To test the impact of the robot's height, we evaluated participants' compliance with its request to volunteer to perform a tedious task. In the experiment, participants performed a cognitive task on a computer, which was framed as the main experiment. When done, they were informed that the experiment was completed. While waiting to receive their credits, the robotic object, designed as a mobile robotic service table, entered the room, carrying a tablet that invited participants to complete a 300-question questionnaire voluntarily. We compared participants' compliance in two conditions: A Short robot composed of two modules and 95cm in height and a Tall robot consisting of three modules and 132cm in height. Our findings revealed higher compliance with the Short robot's request, demonstrating an opposite pattern to human dynamics. We conclude that while height has a substantial social impact on human-robot interactions, it follows a unique pattern of influence. Our findings suggest that designers cannot simply adopt and implement elements from human social dynamics to robots without testing them first. △ Less

Submitted 19 September, 2025; originally announced September 2025.

Comments: 8 pages, 6 figures, 1 table, submitted to IEEE RO-MAN 2025

arXiv:2509.16020 [pdf, ps, other]

AI Methods for Permutation Circuit Synthesis Across Generic Topologies

Authors: Victor Villar, Juan Cruz-Benito, Ismael Faro, David Kremer

Abstract: This paper investigates artificial intelligence (AI) methodologies for the synthesis and transpilation of permutation circuits across generic topologies. Our approach uses Reinforcement Learning (RL) techniques to achieve near-optimal synthesis of permutation circuits up to 25 qubits. Rather than developing specialized models for individual topologies, we train a foundational model on a generic re… ▽ More This paper investigates artificial intelligence (AI) methodologies for the synthesis and transpilation of permutation circuits across generic topologies. Our approach uses Reinforcement Learning (RL) techniques to achieve near-optimal synthesis of permutation circuits up to 25 qubits. Rather than developing specialized models for individual topologies, we train a foundational model on a generic rectangular lattice, and employ masking mechanisms to dynamically select subsets of topologies during the synthesis. This enables the synthesis of permutation circuits on any topology that can be embedded within the rectangular lattice, without the need to re-train the model. In this paper we show results for 5x5 lattice and compare them to previous AI topology-oriented models and classical methods, showing that they outperform classical heuristics, and match previous specialized AI models, and performs synthesis even for topologies that were not seen during training. We further show that the model can be fine tuned to strengthen the performance for selected topologies of interest. This methodology allows a single trained model to efficiently synthesize circuits across diverse topologies, allowing its practical integration into transpilation workflows. △ Less

Submitted 19 September, 2025; originally announced September 2025.

Comments: This paper has been accepted by First AAAI Symposium on Quantum Information & Machine Learning (QIML): Bridging Quantum Computing and Artificial Intelligence at AAAI 2025 Fall Symposium

arXiv:2509.15933 [pdf, ps, other]

Bayesian Physics Informed Neural Networks for Reliable Transformer Prognostics

Authors: Ibai Ramirez, Jokin Alcibar, Joel Pino, Mikel Sanz, David Pardo, Jose I. Aizpurua

Abstract: Scientific Machine Learning (SciML) integrates physics and data into the learning process, offering improved generalization compared with purely data-driven models. Despite its potential, applications of SciML in prognostics remain limited, partly due to the complexity of incorporating partial differential equations (PDEs) for ageing physics and the scarcity of robust uncertainty quantification me… ▽ More Scientific Machine Learning (SciML) integrates physics and data into the learning process, offering improved generalization compared with purely data-driven models. Despite its potential, applications of SciML in prognostics remain limited, partly due to the complexity of incorporating partial differential equations (PDEs) for ageing physics and the scarcity of robust uncertainty quantification methods. This work introduces a Bayesian Physics-Informed Neural Network (B-PINN) framework for probabilistic prognostics estimation. By embedding Bayesian Neural Networks into the PINN architecture, the proposed approach produces principled, uncertainty-aware predictions. The method is applied to a transformer ageing case study, where insulation degradation is primarily driven by thermal stress. The heat diffusion PDE is used as the physical residual, and different prior distributions are investigated to examine their impact on predictive posterior distributions and their ability to encode a priori physical knowledge. The framework is validated against a finite element model developed and tested with real measurements from a solar power plant. Results, benchmarked against a dropout-PINN baseline, show that the proposed B-PINN delivers more reliable prognostic predictions by accurately quantifying predictive uncertainty. This capability is crucial for supporting robust and informed maintenance decision-making in critical power assets. △ Less

Submitted 19 September, 2025; originally announced September 2025.

Comments: Submitted to the Annual Prognostics and Health Management (PHM) Society Conference 2025

arXiv:2509.15909 [pdf, ps, other]

A CARLA-based Simulation of Electrically Driven Forklifts

Authors: David Claus, Christiane Thielemann, Hans-Georg Stark

Abstract: This paper presents the simulation of the operation of an electric forklift fleet within an intralogistics scenario. For this purpose, the open source simulation tool CARLA is used; according to our knowledge this is a novel approach in the context of logistics simulation. First, CARLA is used to generate and visualize a realistic 3D outdoor warehouse scenario, incorporating a number of randomly m… ▽ More This paper presents the simulation of the operation of an electric forklift fleet within an intralogistics scenario. For this purpose, the open source simulation tool CARLA is used; according to our knowledge this is a novel approach in the context of logistics simulation. First, CARLA is used to generate and visualize a realistic 3D outdoor warehouse scenario, incorporating a number of randomly moving forklifts. In a next step, intralogistics transport tasks, such as pick-and-place, are simulated for the forklift fleet, including shortest-path finding. Furthermore, the capability to play back localization data, previously recorded from a ''real'' forklift fleet, is demonstrated.This play back is done in the original recreated environment, thereby enabling the visualization of the forklifts movements. Finally, the energy consumption of the forklift trucks is simulated by integrating a physical battery model that generates the state of charge (SOC) of each truck as a function of load and activity. To demonstrate the wide range of possible applications for the CARLA simulation platform, we describe two use cases. The first deals with the problem of detecting regions with critically high traffic densities, the second with optimal placement of charging stations for the forklift trucks. Both use cases are calculated for an exemplary warehouse model. △ Less

Submitted 19 September, 2025; originally announced September 2025.

arXiv:2509.15905 [pdf, ps, other]

Deep Feedback Models

Authors: David Calhas, Arlindo L. Oliveira

Abstract: Deep Feedback Models (DFMs) are a new class of stateful neural networks that combine bottom up input with high level representations over time. This feedback mechanism introduces dynamics into otherwise static architectures, enabling DFMs to iteratively refine their internal state and mimic aspects of biological decision making. We model this process as a differential equation solved through a rec… ▽ More Deep Feedback Models (DFMs) are a new class of stateful neural networks that combine bottom up input with high level representations over time. This feedback mechanism introduces dynamics into otherwise static architectures, enabling DFMs to iteratively refine their internal state and mimic aspects of biological decision making. We model this process as a differential equation solved through a recurrent neural network, stabilized via exponential decay to ensure convergence. To evaluate their effectiveness, we measure DFMs under two key conditions: robustness to noise and generalization with limited data. In both object recognition and segmentation tasks, DFMs consistently outperform their feedforward counterparts, particularly in low data or high noise regimes. In addition, DFMs translate to medical imaging settings, while being robust against various types of noise corruption. These findings highlight the importance of feedback in achieving stable, robust, and generalizable learning. Code is available at https://github.com/DCalhas/deep_feedback_models. △ Less

Submitted 19 September, 2025; originally announced September 2025.

arXiv:2509.15895 [pdf]

From Data to Diagnosis: A Large, Comprehensive Bone Marrow Dataset and AI Methods for Childhood Leukemia Prediction

Authors: Henning Höfener, Farina Kock, Martina Pontones, Tabita Ghete, David Pfrang, Nicholas Dickel, Meik Kunz, Daniela P. Schacherer, David A. Clunie, Andrey Fedorov, Max Westphal, Markus Metzler

Abstract: Leukemia diagnosis primarily relies on manual microscopic analysis of bone marrow morphology supported by additional laboratory parameters, making it complex and time consuming. While artificial intelligence (AI) solutions have been proposed, most utilize private datasets and only cover parts of the diagnostic pipeline. Therefore, we present a large, high-quality, publicly available leukemia bone… ▽ More Leukemia diagnosis primarily relies on manual microscopic analysis of bone marrow morphology supported by additional laboratory parameters, making it complex and time consuming. While artificial intelligence (AI) solutions have been proposed, most utilize private datasets and only cover parts of the diagnostic pipeline. Therefore, we present a large, high-quality, publicly available leukemia bone marrow dataset spanning the entire diagnostic process, from cell detection to diagnosis. Using this dataset, we further propose methods for cell detection, cell classification, and diagnosis prediction. The dataset comprises 246 pediatric patients with diagnostic, clinical and laboratory information, over 40 000 cells with bounding box annotations and more than 28 000 of these with high-quality class labels, making it the most comprehensive dataset publicly available. Evaluation of the AI models yielded an average precision of 0.96 for the cell detection, an area under the curve of 0.98, and an F1-score of 0.61 for the 33-class cell classification, and a mean F1-score of 0.90 for the diagnosis prediction using predicted cell counts. While the proposed approaches demonstrate their usefulness for AI-assisted diagnostics, the dataset will foster further research and development in the field, ultimately contributing to more precise diagnoses and improved patient outcomes. △ Less

Submitted 19 September, 2025; originally announced September 2025.

arXiv:2509.15643 [pdf, ps, other]

Finite-blocklength Fluid Antenna Systems

Authors: Zhentian Zhang, Kai-Kit Wong, David Morales-Jimenez, Hao Jiang, Hao Xu, Christos Masouros, Zaichen Zhang, Chan-Byoung Chae

Abstract: This work introduces and investigates finite blocklength fluid antenna systems (FBL-FASs). To meet the stringent key performance indicators (KPIs) of 6G and beyond networks, including ultra-massive machine-type communications (mMTC), ultra-reliable low-latency communications (URLLC), and enhanced mobile broadband (eMBB), it is necessary to evaluate the performance of FAS under limited channel uses… ▽ More This work introduces and investigates finite blocklength fluid antenna systems (FBL-FASs). To meet the stringent key performance indicators (KPIs) of 6G and beyond networks, including ultra-massive machine-type communications (mMTC), ultra-reliable low-latency communications (URLLC), and enhanced mobile broadband (eMBB), it is necessary to evaluate the performance of FAS under limited channel uses across time, frequency, and other domains. By exploiting random matrix theory and extreme value theory (EVT), we characterize the effect of finite blocklength on key metrics such as the signal-to-noise ratio (SNR) and the signal-to-interference-plus-noise ratio (SINR), via accurate estimation of interference caused by codeword correlation. Closed-form expressions for block error rate (BLER) and outage probability are derived, covering both conditional BLER (with channel state information, CSI) and statistical BLER (without CSI). The proposed analysis leverages Chernoff bounds and introduces a Taylor-expansion-assisted mean value theorem for integrals (MVTI) to reduce computational complexity. Numerical results show that, compared with conventional multi-antenna systems, the proposed FBL-FAS framework achieves higher energy and spectral efficiency under finite blocklength, making it a promising enabler for next-generation wireless networks. △ Less

Submitted 19 September, 2025; originally announced September 2025.

arXiv:2509.15373 [pdf, ps, other]

Frustratingly Easy Data Augmentation for Low-Resource ASR

Authors: Katsumi Ibaraki, David Chiang

Abstract: This paper introduces three self-contained data augmentation methods for low-resource Automatic Speech Recognition (ASR). Our techniques first generate novel text--using gloss-based replacement, random replacement, or an LLM-based approach--and then apply Text-to-Speech (TTS) to produce synthetic audio. We apply these methods, which leverage only the original annotated data, to four languages with… ▽ More This paper introduces three self-contained data augmentation methods for low-resource Automatic Speech Recognition (ASR). Our techniques first generate novel text--using gloss-based replacement, random replacement, or an LLM-based approach--and then apply Text-to-Speech (TTS) to produce synthetic audio. We apply these methods, which leverage only the original annotated data, to four languages with extremely limited resources (Vatlongos, Nashta, Shinekhen Buryat, and Kakabe). Fine-tuning a pretrained Wav2Vec2-XLSR-53 model on a combination of the original audio and generated synthetic data yields significant performance gains, including a 14.3% absolute WER reduction for Nashta. The methods prove effective across all four low-resource languages and also show utility for high-resource languages like English, demonstrating their broad applicability. △ Less

Submitted 18 September, 2025; originally announced September 2025.

Comments: 5 pages, 2 figures, 2 tables, submitted to ICASSP 2026

arXiv:2509.15335 [pdf, ps, other]

PolBiX: Detecting LLMs' Political Bias in Fact-Checking through X-phemisms

Authors: Charlott Jakob, David Harbecke, Patrick Parschan, Pia Wenzel Neves, Vera Schmitt

Abstract: Large Language Models are increasingly used in applications requiring objective assessment, which could be compromised by political bias. Many studies found preferences for left-leaning positions in LLMs, but downstream effects on tasks like fact-checking remain underexplored. In this study, we systematically investigate political bias through exchanging words with euphemisms or dysphemisms in Ger… ▽ More Large Language Models are increasingly used in applications requiring objective assessment, which could be compromised by political bias. Many studies found preferences for left-leaning positions in LLMs, but downstream effects on tasks like fact-checking remain underexplored. In this study, we systematically investigate political bias through exchanging words with euphemisms or dysphemisms in German claims. We construct minimal pairs of factually equivalent claims that differ in political connotation, to assess the consistency of LLMs in classifying them as true or false. We evaluate six LLMs and find that, more than political leaning, the presence of judgmental words significantly influences truthfulness assessment. While a few models show tendencies of political bias, this is not mitigated by explicitly calling for objectivism in prompts. Warning: This paper contains content that may be offensive or upsetting. △ Less

Submitted 23 September, 2025; v1 submitted 18 September, 2025; originally announced September 2025.

Comments: Accepted at Findings of EMNLP 2025, camera-ready version

arXiv:2509.15325 [pdf, ps, other]

Measurement and Potential Field-Based Patient Modeling for Model-Mediated Tele-ultrasound

Authors: Ryan S. Yeung, David G. Black, Septimiu E. Salcudean

Abstract: Teleoperated ultrasound can improve diagnostic medical imaging access for remote communities. Having accurate force feedback is important for enabling sonographers to apply the appropriate probe contact force to optimize ultrasound image quality. However, large time delays in communication make direct force feedback impractical. Prior work investigated using point cloud-based model-mediated teleop… ▽ More Teleoperated ultrasound can improve diagnostic medical imaging access for remote communities. Having accurate force feedback is important for enabling sonographers to apply the appropriate probe contact force to optimize ultrasound image quality. However, large time delays in communication make direct force feedback impractical. Prior work investigated using point cloud-based model-mediated teleoperation and internal potential field models to estimate contact forces and torques. We expand on this by introducing a method to update the internal potential field model of the patient with measured positions and forces for more transparent model-mediated tele-ultrasound. We first generate a point cloud model of the patient's surface and transmit this to the sonographer in a compact data structure. This is converted to a static voxelized volume where each voxel contains a potential field value. These values determine the forces and torques, which are rendered based on overlap between the voxelized volume and a point shell model of the ultrasound transducer. We solve for the potential field using a convex quadratic that combines the spatial Laplace operator with measured forces. This was evaluated on volunteer patients ($n=3$) by computing the accuracy of rendered forces. Results showed the addition of measured forces to the model reduced the force magnitude error by an average of 7.23 N and force vector angle error by an average of 9.37$^{\circ}$ compared to using only Laplace's equation. △ Less

Submitted 18 September, 2025; originally announced September 2025.

arXiv:2509.15278 [pdf]

Assessing metadata privacy in neuroimaging

Authors: Emilie Kibsgaard, Anita Sue Jwa, Christopher J Markiewicz, David Rodriguez Gonzalez, Judith Sainz Pardo, Russell A. Poldrack, Cyril R. Pernet

Abstract: The ethical and legal imperative to share research data without causing harm requires careful attention to privacy risks. While mounting evidence demonstrates that data sharing benefits science, legitimate concerns persist regarding the potential leakage of personal information that could lead to reidentification and subsequent harm. We reviewed metadata accompanying neuroimaging datasets from six… ▽ More The ethical and legal imperative to share research data without causing harm requires careful attention to privacy risks. While mounting evidence demonstrates that data sharing benefits science, legitimate concerns persist regarding the potential leakage of personal information that could lead to reidentification and subsequent harm. We reviewed metadata accompanying neuroimaging datasets from six heterogeneous studies openly available on OpenNeuro, involving participants across the lifespan, from children to older adults, with and without clinical diagnoses, and including associated clinical score data. Using metaprivBIDS (https://github.com/CPernet/metaprivBIDS), a novel tool for the systematic assessment of privacy in tabular data, we found that privacy is generally well maintained, with serious vulnerabilities being rare. Nonetheless, minor issues were identified in nearly all datasets and warrant mitigation. Notably, clinical score data (e.g., neuropsychological results) posed minimal reidentification risk, whereas demographic variables (age, sex, race, income, and geolocation) represented the principal privacy vulnerabilities. We outline practical measures to address these risks, enabling safer data sharing practices. △ Less

Submitted 18 September, 2025; originally announced September 2025.

Comments: 19 pages, 7 tables, 2 figures, original analysis of 6 Open Datasets

arXiv:2509.15273 [pdf, ps, other]

Embodied Arena: A Comprehensive, Unified, and Evolving Evaluation Platform for Embodied AI

Authors: Fei Ni, Min Zhang, Pengyi Li, Yifu Yuan, Lingfeng Zhang, Yuecheng Liu, Peilong Han, Longxin Kou, Shaojin Ma, Jinbin Qiao, David Gamaliel Arcos Bravo, Yuening Wang, Xiao Hu, Zhanguang Zhang, Xianze Yao, Yutong Li, Zhao Zhang, Ying Wen, Ying-Cong Chen, Xiaodan Liang, Liang Lin, Bin He, Haitham Bou-Ammar, He Wang, Huazhe Xu , et al. (12 additional authors not shown)

Abstract: Embodied AI development significantly lags behind large foundation models due to three critical challenges: (1) lack of systematic understanding of core capabilities needed for Embodied AI, making research lack clear objectives; (2) absence of unified and standardized evaluation systems, rendering cross-benchmark evaluation infeasible; and (3) underdeveloped automated and scalable acquisition meth… ▽ More Embodied AI development significantly lags behind large foundation models due to three critical challenges: (1) lack of systematic understanding of core capabilities needed for Embodied AI, making research lack clear objectives; (2) absence of unified and standardized evaluation systems, rendering cross-benchmark evaluation infeasible; and (3) underdeveloped automated and scalable acquisition methods for embodied data, creating critical bottlenecks for model scaling. To address these obstacles, we present Embodied Arena, a comprehensive, unified, and evolving evaluation platform for Embodied AI. Our platform establishes a systematic embodied capability taxonomy spanning three levels (perception, reasoning, task execution), seven core capabilities, and 25 fine-grained dimensions, enabling unified evaluation with systematic research objectives. We introduce a standardized evaluation system built upon unified infrastructure supporting flexible integration of 22 diverse benchmarks across three domains (2D/3D Embodied Q&A, Navigation, Task Planning) and 30+ advanced models from 20+ worldwide institutes. Additionally, we develop a novel LLM-driven automated generation pipeline ensuring scalable embodied evaluation data with continuous evolution for diversity and comprehensiveness. Embodied Arena publishes three real-time leaderboards (Embodied Q&A, Navigation, Task Planning) with dual perspectives (benchmark view and capability view), providing comprehensive overviews of advanced model capabilities. Especially, we present nine findings summarized from the evaluation results on the leaderboards of Embodied Arena. This helps to establish clear research veins and pinpoint critical research problems, thereby driving forward progress in the field of Embodied AI. △ Less

Submitted 23 September, 2025; v1 submitted 18 September, 2025; originally announced September 2025.

Comments: 32 pages, 5 figures, Embodied Arena Technical Report

arXiv:2509.15263 [pdf, ps, other]

Subject Matter Expertise vs Professional Management in Collective Sequential Decision Making

Authors: David Shoresh, Yonatan Loewenstein

Abstract: Your company's CEO is retiring. You search for a successor. You can promote an employee from the company familiar with the company's operations, or recruit an external professional manager. Who should you prefer? It has not been clear how to address this question, the "subject matter expertise vs. professional manager debate", quantitatively and objectively. We note that a company's success depend… ▽ More Your company's CEO is retiring. You search for a successor. You can promote an employee from the company familiar with the company's operations, or recruit an external professional manager. Who should you prefer? It has not been clear how to address this question, the "subject matter expertise vs. professional manager debate", quantitatively and objectively. We note that a company's success depends on long sequences of interdependent decisions, with often-opposing recommendations of diverse board members. To model this task in a controlled environment, we utilize chess - a complex, sequential game with interdependent decisions which allows for quantitative analysis of performance and expertise (since the states, actions and game outcomes are well-defined). The availability of chess engines differing in style and expertise, allows scalable experimentation. We considered a team of (computer) chess players. At each turn, team members recommend a move and a manager chooses a recommendation. We compared the performance of two manager types. For manager as "subject matter expert", we used another (computer) chess player that assesses the recommendations of the team members based on its own chess expertise. We examined the performance of such managers at different strength levels. To model a "professional manager", we used Reinforcement Learning (RL) to train a network that identifies the board positions in which different team members have relative advantage, without any pretraining in chess. We further examined this network to see if any chess knowledge is acquired implicitly. We found that subject matter expertise beyond a minimal threshold does not significantly contribute to team synergy. Moreover, performance of a RL-trained "professional" manager significantly exceeds that of even the best "expert" managers, while acquiring only limited understanding of chess. △ Less

Submitted 18 September, 2025; originally announced September 2025.

Comments: Reinforcement Learning and Decision Making (RLDM) 2025. arXiv admin note: substantial text overlap with arXiv:2412.18593

ACM Class: K.4.3

arXiv:2509.15031 [pdf, ps, other]

AutoEdit: Automatic Hyperparameter Tuning for Image Editing

Authors: Chau Pham, Quan Dao, Mahesh Bhosale, Yunjie Tian, Dimitris Metaxas, David Doermann

Abstract: Recent advances in diffusion models have revolutionized text-guided image editing, yet existing editing methods face critical challenges in hyperparameter identification. To get the reasonable editing performance, these methods often require the user to brute-force tune multiple interdependent hyperparameters, such as inversion timesteps and attention modification, \textit{etc.} This process incur… ▽ More Recent advances in diffusion models have revolutionized text-guided image editing, yet existing editing methods face critical challenges in hyperparameter identification. To get the reasonable editing performance, these methods often require the user to brute-force tune multiple interdependent hyperparameters, such as inversion timesteps and attention modification, \textit{etc.} This process incurs high computational costs due to the huge hyperparameter search space. We consider searching optimal editing's hyperparameters as a sequential decision-making task within the diffusion denoising process. Specifically, we propose a reinforcement learning framework, which establishes a Markov Decision Process that dynamically adjusts hyperparameters across denoising steps, integrating editing objectives into a reward function. The method achieves time efficiency through proximal policy optimization while maintaining optimal hyperparameter configurations. Experiments demonstrate significant reduction in search time and computational overhead compared to existing brute-force approaches, advancing the practical deployment of a diffusion-based image editing framework in the real world. △ Less

Submitted 18 September, 2025; originally announced September 2025.

Comments: Accepted to NeurIPS 2025

arXiv:2509.14816 [pdf, ps, other]

Scalable Multi-Objective Robot Reinforcement Learning through Gradient Conflict Resolution

Authors: Humphrey Munn, Brendan Tidd, Peter Böhm, Marcus Gallagher, David Howard

Abstract: Reinforcement Learning (RL) robot controllers usually aggregate many task objectives into one scalar reward. While large-scale proximal policy optimisation (PPO) has enabled impressive results such as robust robot locomotion in the real world, many tasks still require careful reward tuning and are brittle to local optima. Tuning cost and sub-optimality grow with the number of objectives, limiting… ▽ More Reinforcement Learning (RL) robot controllers usually aggregate many task objectives into one scalar reward. While large-scale proximal policy optimisation (PPO) has enabled impressive results such as robust robot locomotion in the real world, many tasks still require careful reward tuning and are brittle to local optima. Tuning cost and sub-optimality grow with the number of objectives, limiting scalability. Modelling reward vectors and their trade-offs can address these issues; however, multi-objective methods remain underused in RL for robotics because of computational cost and optimisation difficulty. In this work, we investigate the conflict between gradient contributions for each objective that emerge from scalarising the task objectives. In particular, we explicitly address the conflict between task-based rewards and terms that regularise the policy towards realistic behaviour. We propose GCR-PPO, a modification to actor-critic optimisation that decomposes the actor update into objective-wise gradients using a multi-headed critic and resolves conflicts based on the objective priority. Our methodology, GCR-PPO, is evaluated on the well-known IsaacLab manipulation and locomotion benchmarks and additional multi-objective modifications on two related tasks. We show superior scalability compared to parallel PPO (p = 0.04), without significant computational overhead. We also show higher performance with more conflicting tasks. GCR-PPO improves on large-scale PPO with an average improvement of 9.5%, with high-conflict tasks observing a greater improvement. The code is available at https://github.com/humphreymunn/GCR-PPO. △ Less

Submitted 18 September, 2025; originally announced September 2025.

Showing 1–50 of 20,359 results for author: David