Search | arXiv e-print repository

ProDiF: Protecting Domain-Invariant Features to Secure Pre-Trained Models Against Extraction

Authors: Tong Zhou, Shijin Duan, Gaowen Liu, Charles Fleming, Ramana Rao Kompella, Shaolei Ren, Xiaolin Xu

Abstract: Pre-trained models are valuable intellectual property, capturing both domain-specific and domain-invariant features within their weight spaces. However, model extraction attacks threaten these assets by enabling unauthorized source-domain inference and facilitating cross-domain transfer via the exploitation of domain-invariant features. In this work, we introduce **ProDiF**, a novel framework that leverages targeted weight space manipulation to secure pre-trained models against extraction attacks. **ProDiF** quantifies the transferability of filters and perturbs the weights of critical filters in unsecured memory, while preserving actual critical weights in a Trusted Execution Environment (TEE) for authorized users. A bi-level optimization further ensures resilience against adaptive fine-tuning attacks. Experimental results show that **ProDiF** reduces source-domain accuracy to near-random levels and decreases cross-domain transferability by 74.65%, providing robust protection for pre-trained models. This work offers comprehensive protection for pre-trained DNN models and highlights the potential of weight space manipulation as a novel approach to model security.

Submitted 17 March, 2025; originally announced March 2025.

Comments: Accepted at the ICLR Workshop on Neural Network Weights as a New Data Modality 2025

arXiv:2503.13133 [pdf, other]

Regular black holes and their singular families

Authors: Hyat Huang, Xiao-Pin Rao

Abstract: Regular black holes without a curvature singularity can arise in Einstein gravity with an appropriate matter energy-momentum tensor. We show that these regular solutions represent only a special case of a much broader family of black holes with a free mass parameter. The regularity is achieved only at a specific mass value, and any deviation from the fine-tuned parameter inevitably results in a curvature singularity. As a concrete example, we consider nonlinear electrodynamics (NLED) as matter sources. A new NLED theory is proposed that is a generalization of the Bardeen class and the Hayward class. New regular black holes and their singular counterparts are obtained. Significant distinctions between regular black holes and their singular counterparts are analyzed. These findings provide new insights into regular black holes.

Submitted 17 March, 2025; originally announced March 2025.

Comments: 14 pages, 3 figures. Comments are welcome

arXiv:2503.12964 [pdf, other]

Training Video Foundation Models with NVIDIA NeMo

Authors: Zeeshan Patel, Ethan He, Parth Mannan, Xiaowei Ren, Ryan Wolf, Niket Agarwal, Jacob Huffman, Zhuoyao Wang, Carl Wang, Jack Chang, Yan Bai, Tommy Huang, Linnan Wang, Sahil Jain, Shanmugam Ramasamy, Joseph Jennings, Ekaterina Sirazitdinova, Oleg Sudakov, Mingyuan Ma, Bobby Chen, Forrest Lin, Hao Wang, Vasanth Rao Naik Sabavat, Sriharsha Niverty, Rong Ou, et al. (4 additional authors not shown)

Abstract: Video Foundation Models (VFMs) have recently been used to simulate the real world to train physical AI systems and develop creative visual experiences. However, there are significant challenges in training large-scale VFMs that can generate high-quality videos. We present a scalable, open-source VFM training pipeline with NVIDIA NeMo, providing accelerated video dataset curation, multimodal data loading, and parallelized video diffusion model training and inference. We also provide a comprehensive performance analysis highlighting best practices for efficient VFM training and inference.

Submitted 17 March, 2025; originally announced March 2025.

arXiv:2503.12863 [pdf, ps, other]

Parameter estimation for generalized mixed fractional stochastic heat equation

Authors: B. L. S. Prakasa Rao

Abstract: We study the properties of a stochastic heat equation with a generalized mixed fractional Brownian noise. We obtain the covariance structure, stationarity and obtain bounds for the asymptotic behaviour of the solution. We suggest estimators for the unknown parameters based on discrete time observations and study their asymptotic properties.

Submitted 17 March, 2025; originally announced March 2025.

MSC Class: 60G22

arXiv:2503.12826 [pdf, other]

Active and Passive Conformal Transformations in Scalar-Tensor Gravitational Theories

Authors: Israel Quiros, Amit Kumar Rao

Abstract: By considering conformal transformations as coordinate transformations in some abstract space of fields, where the different fields are treated as "generalized coordinates," we introduce the notion of active and passive conformal transformations. We then apply both complementary approaches to the conformal frames issue, arising in the context of scalar-tensor gravity theories, in order to get a better understanding of the problem. Special focus is on the coupling of matter fields to gravity. The recent result that the Lagrangian density of fundamental matter fields and perfect fluids, in its standard form, is not only conformal invariant but also conformal form-invariant, is taken into consideration in the discussion about the conformal frames issue.

Submitted 24 April, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

Comments: 19 pages, 1 figure. Improvements in the text for better understanding, bibliographic references added

arXiv:2503.12446 [pdf, other]

BREEN: Bridge Data-Efficient Encoder-Free Multimodal Learning with Learnable Queries

Authors: Tianle Li, Yongming Rao, Winston Hu, Yu Cheng

Abstract: Encoder-free multimodal large language models (MLLMs) eliminate the need for a well-trained vision encoder by directly processing image tokens before the language model. While this approach reduces computational overhead and model complexity, it often requires large amounts of training data to effectively capture the visual knowledge typically encoded by vision models like CLIP. The absence of a vision encoder implies that the model is likely to rely on substantial data to learn the necessary visual-semantic alignments. In this work, we present BREEN, a data-efficient encoder-free multimodal architecture that mitigates this issue. BREEN leverages a learnable query and image experts to achieve comparable performance with significantly less training data. The learnable query, positioned between image and text tokens, is supervised by the output of a pretrained CLIP model to distill visual knowledge, bridging the gap between visual and textual modalities. Additionally, the image expert processes image tokens and learnable queries independently, improving efficiency and reducing interference with the LLM's textual capabilities. BREEN achieves comparable performance to prior encoder-free state-of-the-art models like Mono-InternVL, using only 13 million text-image pairs in training, about one percent of the data required by existing methods. Our work highlights a promising direction for data-efficient encoder-free multimodal learning, offering an alternative to traditional encoder-based approaches.

Submitted 16 March, 2025; originally announced March 2025.

arXiv:2503.12386 [pdf, ps, other]

doi 10.1109/ICASSP49660.2025.10889620

A Comparative Study of Invariance-Aware Loss Functions for Deep Learning-based Gridless Direction-of-Arrival Estimation

Authors: Kuan-Lin Chen, Bhaskar D. Rao

Abstract: Covariance matrix reconstruction has been the most widely used guiding objective in gridless direction-of-arrival (DoA) estimation for sparse linear arrays. Many semidefinite programming (SDP)-based methods fall under this category. Although deep learning-based approaches enable the construction of more sophisticated objective functions, most methods still rely on covariance matrix reconstruction. In this paper, we propose new loss functions that are invariant to the scaling of the matrices and provide a comparative study of losses with varying degrees of invariance. The proposed loss functions are formulated based on the scale-invariant signal-to-distortion ratio between the target matrix and the Gram matrix of the prediction. Numerical results show that a scale-invariant loss outperforms its non-invariant counterpart but is inferior to the recently proposed subspace loss that is invariant to the change of basis. These results provide evidence that designing loss functions with greater degrees of invariance is advantageous in deep learning-based gridless DoA estimation.

Submitted 16 March, 2025; originally announced March 2025.

Comments: 5 pages. Accepted at ICASSP 2025

arXiv:2503.12317 [pdf]

A Transformer-based survival model for prediction of all-cause mortality in heart failure patients: a multi-cohort study

Authors: Shishir Rao, Nouman Ahmed, Gholamreza Salimi-Khorshidi, Christopher Yau, Huimin Su, Nathalie Conrad, Folkert W Asselbergs, Mark Woodward, Rod Jackson, John GF Cleland, Kazem Rahimi

Abstract: We developed and validated TRisk, a Transformer-based AI model predicting 36-month mortality in heart failure patients by analysing temporal patient journeys from UK electronic health records (EHR). Our study included 403,534 heart failure patients (ages 40-90) from 1,418 English general practices, with 1,063 practices for model derivation and 355 for external validation. TRisk was compared against the MAGGIC-EHR model across various patient subgroups. With median follow-up of 9 months, TRisk achieved a concordance index of 0.845 (95% confidence interval: [0.841, 0.849]), significantly outperforming MAGGIC-EHR's 0.728 (0.723, 0.733) for predicting 36-month all-cause mortality. TRisk showed more consistent performance across sex, age, and baseline characteristics, suggesting less bias. We successfully adapted TRisk to US hospital data through transfer learning, achieving a C-index of 0.802 (0.789, 0.816) with 21,767 patients. Explainability analyses revealed TRisk captured established risk factors while identifying underappreciated predictors like cancers and hepatic failure that were important across both cohorts. Notably, cancers maintained strong prognostic value even a decade after diagnosis. TRisk demonstrated well-calibrated mortality prediction across both healthcare systems. Our findings highlight the value of tracking longitudinal health profiles and revealed risk factors not included in previous expert-driven models.

Submitted 15 March, 2025; originally announced March 2025.

arXiv:2503.12295 [pdf, other]

Towards Learning High-Precision Least Squares Algorithms with Sequence Models

Authors: Jerry Liu, Jessica Grogan, Owen Dugan, Ashish Rao, Simran Arora, Atri Rudra, Christopher Ré

Abstract: This paper investigates whether sequence models can learn to perform numerical algorithms, e.g. gradient descent, on the fundamental problem of least squares. Our goal is to inherit two properties of standard algorithms from numerical analysis: (1) machine precision, i.e. we want to obtain solutions that are accurate to near floating point error, and (2) numerical generality, i.e. we want them to apply broadly across problem instances. We find that prior approaches using Transformers fail to meet these criteria, and identify limitations present in existing architectures and training procedures. First, we show that softmax Transformers struggle to perform high-precision multiplications, which prevents them from precisely learning numerical algorithms. Second, we identify an alternate class of architectures, comprised entirely of polynomials, that can efficiently represent high-precision gradient descent iterates. Finally, we investigate precision bottlenecks during training and address them via a high-precision training recipe that reduces stochastic gradient noise. Our recipe enables us to train two polynomial architectures, gated convolutions and linear attention, to perform gradient descent iterates on least squares problems. For the first time, we demonstrate the ability to train to near machine precision. Applied iteratively, our models obtain 100,000x lower MSE than standard Transformers trained end-to-end and they incur a 10,000x smaller generalization gap on out-of-distribution problems. We make progress towards end-to-end learning of numerical algorithms for least squares.

Submitted 15 March, 2025; originally announced March 2025.

Comments: 75 pages, 18 figures. ICLR 2025

arXiv:2503.10842 [pdf, other]

Monte Carlo model of distilled remote entanglement between superconducting qubits across optical channels

Authors: Nicolas Dirnegger, Moein Malekakhlagh, Vikesh Siddhu, Ashutosh Rao, Chi Xiong, Muir Kumph, Jason Orcutt, Abram Falk

Abstract: A promising quantum computing architecture comprises modules of superconducting quantum processors linked by optical channels via quantum transducers. To map transducer device performance to system-level channel performance, our model uses Monte Carlo simulations that incorporate 2-to-1 and 3-to-1 entanglement distillation protocols. We show that the Extreme Photon Loss distillation protocol is particularly high performing and that, even without distillation, present-day transducers are at the threshold of enabling Bell pair distribution with fidelities of 50%. If the next generation of transducers can improve by 3 orders of magnitude in both added noise and efficiency, and increase repetition rates by 50x, then they would allow for remote two-qubit gates achieving 99.7% fidelities at 100 kHz rates. These results set targets for transducers to be ready for deployment into scaled superconducting quantum computers.

Submitted 13 March, 2025; originally announced March 2025.

Comments: 13 pages, 8 figures, 1 table, APS March Meeting Conference

arXiv:2503.10621 [pdf, other]

DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding

Authors: Ayesha Ishaq, Jean Lahoud, Ketan More, Omkar Thawakar, Ritesh Thawkar, Dinura Dissanayake, Noor Ahsan, Yuhao Li, Fahad Shahbaz Khan, Hisham Cholakkal, Ivan Laptev, Rao Muhammad Anwer, Salman Khan

Abstract: While large multimodal models (LMMs) have demonstrated strong performance across various Visual Question Answering (VQA) tasks, certain challenges require complex multi-step reasoning to reach accurate answers. One particularly challenging task is autonomous driving, which demands thorough cognitive processing before decisions can be made. In this domain, a sequential and interpretive understanding of visual cues is essential for effective perception, prediction, and planning. Nevertheless, common VQA benchmarks often focus on the accuracy of the final answer while overlooking the reasoning process that enables the generation of accurate responses. Moreover, existing methods lack a comprehensive framework for evaluating step-by-step reasoning in realistic driving scenarios. To address this gap, we propose DriveLMM-o1, a new dataset and benchmark specifically designed to advance step-wise visual reasoning for autonomous driving. Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning, each enriched with step-by-step reasoning to ensure logical inference in autonomous driving scenarios. We further introduce a large multimodal model that is fine-tuned on our reasoning dataset, demonstrating robust performance in complex driving scenarios. In addition, we benchmark various open-source and closed-source methods on our proposed dataset, systematically comparing their reasoning capabilities for autonomous driving tasks. Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open-source model. Our framework, dataset, and model are available at https://github.com/ayesha-ishaq/DriveLMM-o1.

Submitted 13 March, 2025; originally announced March 2025.

Comments: 8 pages, 4 figures, 3 tables, github: https://github.com/ayesha-ishaq/DriveLMM-o1

arXiv:2503.10615 [pdf, other]

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Authors: Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, Bo Zhang, Wei Chen

Abstract: Large Language Models have demonstrated remarkable reasoning capability in complex textual tasks. However, multimodal reasoning, which requires integrating visual and textual information, remains a significant challenge. Existing visual-language models often struggle to effectively analyze and reason about visual content, resulting in suboptimal performance on complex reasoning tasks. Moreover, the absence of comprehensive benchmarks hinders the accurate assessment of multimodal reasoning capabilities. In this paper, we introduce R1-Onevision, a multimodal reasoning model designed to bridge the gap between visual perception and deep reasoning. To achieve this, we propose a cross-modal reasoning pipeline that transforms images into formal textual representations, enabling precise language-based reasoning. Leveraging this pipeline, we construct the R1-Onevision dataset which provides detailed, step-by-step multimodal reasoning annotations across diverse domains. We further develop the R1-Onevision model through supervised fine-tuning and reinforcement learning to cultivate advanced reasoning and robust generalization abilities. To comprehensively evaluate multimodal reasoning performance across different grades, we introduce R1-Onevision-Bench, a benchmark aligned with human educational stages, covering exams from junior high school to university and beyond. Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL on multiple challenging multimodal reasoning benchmarks.

Submitted 18 March, 2025; v1 submitted 13 March, 2025; originally announced March 2025.

Comments: Code and Model: https://github.com/Fancy-MLLM/R1-onevision

arXiv:2503.10512 [pdf, other]

Conformal Prediction Sets for Deep Generative Models via Reduction to Conformal Regression

Authors: Hooman Shahrokhi, Devjeet Raj Roy, Yan Yan, Venera Arnaoudova, Janaradhan Rao Doppa

Abstract: We consider the problem of generating valid and small prediction sets by sampling outputs (e.g., software code and natural language text) from a black-box deep generative model for a given input (e.g., textual prompt). The validity of a prediction set is determined by a user-defined binary admissibility function depending on the target application. For example, a code generation application may require at least one program in the set to pass all test cases. To address this problem, we develop a simple and effective conformal inference algorithm referred to as Generative Prediction Sets (GPS). Given a set of calibration examples and black-box access to a deep generative model, GPS can generate prediction sets with provable guarantees. The key insight behind GPS is to exploit the inherent structure within the distribution over the minimum number of samples needed to obtain an admissible output to develop a simple conformal regression approach over the minimum number of samples. Experiments on multiple datasets for code and math word problems using different large language models demonstrate the efficacy of GPS over state-of-the-art methods.

Submitted 13 March, 2025; originally announced March 2025.

arXiv:2503.10330 [pdf, other]

Dynamical response theory of interacting Majorana fermions and its application to generic Kitaev quantum spin liquids in a field

Authors: Peng Rao, Roderich Moessner, Johannes Knolle

Abstract: Motivated by the appearance of Majorana fermions in a broad range of correlated and topological electronic systems, we develop a general method to compute the dynamical response of interacting Majorana fermions in the random-phase approximation (RPA). This can be applied self-consistently on top of Majorana mean-field theory (MFT) backgrounds, thereby in particular providing a powerful tool to analyse $\textit{generic}$ behaviour in the vicinity of (various heavily studied) exactly soluble models. Prime examples are quantum spin liquids (QSL) with emergent Majorana excitations, with the celebrated exact solution of Kitaev. We employ the RPA to study in considerable detail phase structure and dynamics of the extended Kitaev honeycomb $KJΓ$-model, with and without an applied field. First, we benchmark our method with Kitaev's exactly soluble model, finding a remarkable agreement. The interactions between Majorana fermions even turn out to mimic the effect of local $\mathbb{Z}_2$ flux excitations, which we explain analytically. Second, we show how small non-Kitaev couplings $J$ and $Γ$ induce Majorana bound states, resulting in sharp features in the dynamical structure factor in the presence of fractionalisation: such 'spinon excitons' naturally appear, and can coexist and interact with the broad Majorana continuum. Third, for increasing couplings or field, our theory predicts instabilities of the KQSL triggered by the condensation of the sharp modes. From the high symmetry momenta of the condensation we can deduce which magnetically ordered phases surround the KQSL, in good agreement with previous finite-size numerics. We discuss implications for experiments and the broad range of applicability of our method to other QSL and Majorana systems.

Submitted 13 March, 2025; originally announced March 2025.

Comments: 19 pages, 14 figures

arXiv:2503.09873 [pdf, other]

doi 10.1109/TAES.2025.3550474

FDCT: Frequency-Aware Decomposition and Cross-Modal Token-Alignment for Multi-Sensor Target Classification

Authors: Shoaib Meraj Sami, Md Mahedi Hasan, Nasser M. Nasrabadi, Raghuveer Rao

Abstract: In automatic target recognition (ATR) systems, sensors may fail to capture discriminative, fine-grained detail features due to environmental conditions, noise created by CMOS chips, occlusion, parallaxes, and sensor misalignment. Therefore, multi-sensor image fusion is an effective choice to overcome these constraints. However, multi-modal image sensors are heterogeneous and have domain and granularity gaps. In addition, the multi-sensor images can be misaligned due to intricate background clutters, fluctuating illumination conditions, and uncontrolled sensor settings. In this paper, to overcome these issues, we decompose, align, and fuse multiple image sensor data for target classification. We extract the domain-specific and domain-invariant features from each sensor data. We propose to develop a shared unified discrete token (UDT) space between sensors to reduce the domain and granularity gaps. Additionally, we develop an alignment module to overcome the misalignment between multi-sensors and emphasize the discriminative representation of the UDT space. In the alignment module, we introduce sparsity constraints to provide a better cross-modal representation of the UDT space and robustness against various sensor settings. We achieve superior classification performance compared to single-modality classifiers and several state-of-the-art multi-modal fusion algorithms on four multi-sensor ATR datasets.

Submitted 12 March, 2025; originally announced March 2025.

Comments: 12 pages. Accepted in the IEEE Transactions on Aerospace and Electronic Systems

arXiv:2503.09290 [pdf, ps, other]

Adaptive and Self-Tuning SBL with Total Variation Priors for Block-Sparse Signal Recovery

Authors: Hamza Djelouat, Reijo Leinonen, Mikko J. Sillanpää, Bhaskar D. Rao, Markku Juntti

Abstract: This letter addresses the problem of estimating a block-sparse signal with unknown group partitions in a multiple measurement vector (MMV) setup. We propose a Bayesian framework by applying an adaptive total variation (TV) penalty on the hyper-parameter space of the sparse signal. The main contributions are two-fold. 1) We extend the TV penalty beyond the immediate neighbor, thus enabling better capture of the signal structure. 2) A dynamic framework is provided to learn the penalty parameter for regularization. It is based on the statistical dependencies between the entries of tentative blocks, thus eliminating the need for fine-tuning. The superior performance of the proposed method is empirically demonstrated by extensive computer simulations against state-of-the-art benchmarks. The proposed solution exhibits both excellent performance and robustness against sparsity model mismatch.

Submitted 12 March, 2025; originally announced March 2025.

arXiv:2503.08600 [pdf, other]

NSF-SciFy: Mining the NSF Awards Database for Scientific Claims

Authors: Delip Rao, Weiqiu You, Eric Wong, Chris Callison-Burch

Abstract: We present NSF-SciFy, a large-scale dataset for scientific claim extraction derived from the National Science Foundation (NSF) awards database, comprising over 400K grant abstracts spanning five decades. While previous datasets relied on published literature, we leverage grant abstracts which offer a unique advantage: they capture claims at an earlier stage in the research lifecycle before publication takes effect. We also introduce a new task to distinguish between existing scientific claims and aspirational research intentions in proposals. Using zero-shot prompting with frontier large language models, we jointly extract 114K scientific claims and 145K investigation proposals from 16K grant abstracts in the materials science domain to create a focused subset called NSF-SciFy-MatSci. We use this dataset to evaluate three key tasks: (1) technical to non-technical abstract generation, where models achieve high BERTScore (0.85+ F1); (2) scientific claim extraction, where fine-tuned models outperform base models by 100% relative improvement; and (3) investigation proposal extraction, showing 90%+ improvement with fine-tuning. We introduce novel LLM-based evaluation metrics for robust assessment of claim/proposal extraction quality. As the largest scientific claim dataset to date -- with an estimated 2.8 million claims across all STEM disciplines funded by the NSF -- NSF-SciFy enables new opportunities for claim verification and meta-scientific research. We publicly release all datasets, trained models, and evaluation code to facilitate further research.

Submitted 15 March, 2025; v1 submitted 11 March, 2025; originally announced March 2025.

Comments: 11 pages, 3 figures, 6 tables

arXiv:2503.07891 [pdf, other]

Gemini Embedding: Generalizable Embeddings from Gemini

Authors: Jinhyuk Lee, Feiyang Chen, Sahil Dua, Daniel Cer, Madhuri Shanbhogue, Iftekhar Naim, Gustavo Hernández Ábrego, Zhe Li, Kaifeng Chen, Henrique Schechter Vera, Xiaoqi Ren, Shanfeng Zhang, Daniel Salz, Michael Boratko, Jay Han, Blair Chen, Shuo Huang, Vikram Rao, Paul Suganthan, Feng Han, Andreas Doumanoglou, Nithi Gupta, Fedor Moiseev, Cathy Yip, Aashi Jain, et al. (22 additional authors not shown)

Abstract: In this report, we introduce Gemini Embedding, a state-of-the-art embedding model leveraging the power of Gemini, Google's most capable large language model. Capitalizing on Gemini's inherent multilingual and code understanding capabilities, Gemini Embedding produces highly generalizable embeddings for text spanning numerous languages and textual modalities. The representations generated by Gemini Embedding can be precomputed and applied to a variety of downstream tasks including classification, similarity, clustering, ranking, and retrieval. Evaluated on the Massive Multilingual Text Embedding Benchmark (MMTEB), which includes over one hundred tasks across 250+ languages, Gemini Embedding substantially outperforms prior state-of-the-art models, demonstrating considerable improvements in embedding quality. Achieving state-of-the-art performance across MMTEB's multilingual, English, and code benchmarks, our unified model demonstrates strong capabilities across a broad selection of tasks and surpasses specialized domain-specific models.

Submitted 10 March, 2025; originally announced March 2025.

Comments: 19 pages

arXiv:2503.07284 [pdf, other]

An asymptotic preserving scheme satisfying entropy stability for the barotropic Euler system

Authors: Megala Anandan, Mária Lukáčová-Medvid'ová, S. V. Raghurama Rao

Abstract: In this paper we study structure-preserving numerical methods for low Mach number barotropic Euler equations. Besides their asymptotic preserving properties, which are crucial for obtaining uniformly consistent and stable approximations of the Euler equations in their singular limit as the Mach number approaches zero, our aim is also to preserve discrete entropy stability. A suitable acoustic/advection splitting approach combined with implicit-explicit time approximations is used to achieve the asymptotic preserving property. The entropy stability of different space discretisation strategies is studied for different values of the Mach number and is validated by numerical experiments.

Submitted 14 May, 2025; v1 submitted 10 March, 2025; originally announced March 2025.

arXiv:2503.06486 [pdf, other]

PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training

Authors: Cong Chen, Mingyu Liu, Chenchen Jing, Yizhou Zhou, Fengyun Rao, Hao Chen, Bo Zhang, Chunhua Shen

Abstract: This paper aims to address the challenge of hallucinations in Multimodal Large Language Models (MLLMs), particularly for dense image captioning tasks. To tackle the challenge, we identify the current lack of a metric that finely measures caption quality at the concept level. We hereby introduce HalFscore, a novel metric built upon the language graph and designed to evaluate both the accuracy and completeness of dense captions at a granular level. Additionally, we identify the root cause of hallucination as the model's over-reliance on its language prior. To address this, we propose PerturboLLaVA, which reduces the model's reliance on the language prior by incorporating adversarially perturbed text during training. This method enhances the model's focus on visual inputs, effectively reducing hallucinations and producing accurate, image-grounded descriptions without incurring additional computational overhead. PerturboLLaVA significantly improves the fidelity of generated captions, outperforming existing approaches in handling multimodal hallucinations and achieving improved performance across general multimodal benchmarks.

Submitted 9 March, 2025; originally announced March 2025.

arXiv:2503.06426 [pdf, ps, other]

doi 10.1109/TCCN.2025.3550359

Federated Learning for Diffusion Models

Authors: Zihao Peng, Xijun Wang, Shengbo Chen, Hong Rao, Cong Shen

Abstract: Diffusion models are powerful generative models that can produce highly realistic samples for various tasks. Typically, these models are constructed using centralized, independently and identically distributed (IID) training data. However, in practical scenarios, data is often distributed across multiple clients and frequently manifests non-IID characteristics. Federated Learning (FL) can leverage this distributed data to train diffusion models, but the performance of existing FL methods is unsatisfactory in non-IID scenarios. To address this, we propose FedDDPM (Federated Learning with Denoising Diffusion Probabilistic Models), which leverages the data generative capability of diffusion models to facilitate model training. In particular, the server uses well-trained local diffusion models uploaded by each client before FL training to generate auxiliary data that can approximately represent the global data distribution. Following each round of model aggregation, the server further optimizes the global model using the auxiliary dataset to alleviate the impact of heterogeneous data on model performance. We provide a rigorous convergence analysis of FedDDPM and propose an enhanced algorithm, FedDDPM+, to reduce training overheads. FedDDPM+ detects instances of slow model learning and performs a one-shot correction using the auxiliary dataset. Experimental results validate that our proposed algorithms outperform the state-of-the-art FL algorithms on the MNIST, CIFAR10 and CIFAR100 datasets.

Submitted 8 March, 2025; originally announced March 2025.

arXiv:2503.05931 [pdf, other]

Training and Inference Efficiency of Encoder-Decoder Speech Models

Authors: Piotr Żelasko, Kunal Dhawan, Daniel Galvez, Krishna C. Puvvada, Ankita Pasad, Nithin Rao Koluguri, Ke Hu, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg

Abstract: The attention encoder-decoder model architecture is the backbone of several recent top-performing foundation speech models: Whisper, Seamless, OWSM, and Canary-1B. However, the reported data and compute requirements for their training are prohibitive for many in the research community. In this work, we focus on the efficiency angle and ask whether we are training these speech models efficiently and what we can do to improve. We argue that a major, if not the most severe, detrimental factor for training efficiency is related to the sampling strategy of sequential data. We show that negligence in mini-batch sampling leads to more than 50% of computation being spent on padding. To that end, we study, profile, and optimize Canary-1B training to show gradual improvement in GPU utilization leading up to a 5x increase in average batch sizes versus its original training settings. This in turn allows us to train an equivalent model using 4x fewer GPUs in the same wall time, or leverage the original resources and train it in 2x shorter wall time. Finally, we observe that the major inference bottleneck lies in the autoregressive decoder steps. We find that adjusting the model architecture to transfer model parameters from the decoder to the encoder results in a 3x inference speedup as measured by inverse real-time factor (RTFx) while preserving the accuracy and compute requirements for convergence. The training code and models will be available as open-source.

Submitted 19 March, 2025; v1 submitted 7 March, 2025; originally announced March 2025.

arXiv:2503.05701 [pdf]

OPTIC: Optimizing Patient-Provider Triaging & Improving Communications in Clinical Operations using GPT-4 Data Labeling and Model Distillation

Authors: Alberto Santamaria-Pang, Frank Tuan, Ross Campbell, Cindy Zhang, Ankush Jindal, Roopa Surapur, Brad Holloman, Deanna Hanisch, Rae Buckley, Carisa Cooney, Ivan Tarapov, Kimberly S. Peairs, Brian Hasselfeld, Peter Greene

Abstract: The COVID-19 pandemic has accelerated the adoption of telemedicine and patient messaging through electronic medical portals (patient medical advice requests, or PMARs). While these platforms enhance patient access to healthcare, they have also increased the burden on healthcare providers due to the surge in PMARs. This study seeks to develop an efficient tool for message triaging to reduce physician workload and improve patient-provider communication. We developed OPTIC (Optimizing Patient-Provider Triaging & Improving Communications in Clinical Operations), a powerful message triaging tool that utilizes GPT-4 for data labeling and BERT for model distillation. The study used a dataset of 405,487 patient messaging encounters from Johns Hopkins Medicine between January and June 2020. High-quality labeled data was generated through GPT-4-based prompt engineering, which was then used to train a BERT model to classify messages as "Admin" or "Clinical." The BERT model achieved 88.85% accuracy on the test set validated by GPT-4 labeling, with a sensitivity of 88.29%, specificity of 89.38%, and an F1 score of 0.8842. BERTopic analysis identified 81 distinct topics within the test data, with over 80% accuracy in classifying 58 topics. The system was successfully deployed through Epic's Nebula Cloud Platform, demonstrating its practical effectiveness in healthcare settings.

Submitted 5 February, 2025; originally announced March 2025.

Comments: 15 pages, 8 figures. submitted to Journal of the American Medical Informatics Association

arXiv:2503.05473 [pdf, other]

The Society of HiveMind: Multi-Agent Optimization of Foundation Model Swarms to Unlock the Potential of Collective Intelligence

Authors: Noah Mamie, Susie Xi Rao

Abstract: Multi-agent systems address issues of accessibility and scalability of artificial intelligence (AI) foundation models, which are often represented by large language models. We develop a framework - the "Society of HiveMind" (SOHM) - that orchestrates the interaction between multiple AI foundation models, imitating the observed behavior of animal swarms in nature by following modern evolutionary theories. On the one hand, we find that the SOHM provides a negligible benefit on tasks that mainly require real-world knowledge. On the other hand, we observe a significant improvement on tasks that require intensive logical reasoning, indicating that multi-agent systems are capable of increasing the reasoning capabilities of the collective compared to the individual agents. Our findings demonstrate the potential of combining a multitude of diverse AI foundation models to form an artificial swarm intelligence capable of self-improvement through interactions with a given environment.

Submitted 13 March, 2025; v1 submitted 7 March, 2025; originally announced March 2025.

Comments: 11 pages (excl. appendix)

arXiv:2503.05122 [pdf, other]

EDM: Efficient Deep Feature Matching

Authors: Xi Li, Tong Rao, Cihui Pan

Abstract: Recent feature matching methods have achieved remarkable performance but give little consideration to efficiency. In this paper, we revisit the mainstream detector-free matching pipeline and improve all its stages considering both accuracy and efficiency. We propose an Efficient Deep feature Matching network, EDM. We first adopt a deeper CNN with fewer dimensions to extract multi-level features. Then we present a Correlation Injection Module that conducts feature transformation on high-level deep features, and progressively injects feature correlations from global to local for efficient multi-scale feature aggregation, improving both speed and performance. In the refinement stage, a novel lightweight bidirectional axis-based regression head is designed to directly predict subpixel-level correspondences from latent features, avoiding the significant computational cost of explicitly locating keypoints on high-resolution local feature heatmaps. Moreover, effective selection strategies are introduced to enhance matching accuracy. Extensive experiments show that our EDM achieves competitive matching accuracy on various benchmarks and exhibits excellent efficiency, offering valuable best practices for real-world applications. The code is available at https://github.com/chicleee/EDM.

Submitted 22 May, 2025; v1 submitted 6 March, 2025; originally announced March 2025.

arXiv:2503.04998 [pdf, other]

Multi-Agent Ergodic Exploration under Smoke-Based, Time-Varying Sensor Visibility Constraints

Authors: Elena Wittemyer, Ananya Rao, Ian Abraham, Howie Choset

Abstract: In this work, we consider the problem of multi-agent informative path planning (IPP) for robots whose sensor visibility continuously changes as a consequence of a time-varying natural phenomenon. We leverage ergodic trajectory optimization (ETO), which generates paths such that the amount of time an agent spends in an area is proportional to the expected information in that area. We focus specifically on the problem of multi-agent drone search of a wildfire, where we use the time-varying environmental process of smoke diffusion to construct a sensor visibility model. This sensor visibility model is used to repeatedly calculate an expected information distribution (EID) to be used in the ETO algorithm. Our experiments show that our exploration method achieves improved information gathering over both baseline search methods and naive ergodic search formulations.