-
ProDiF: Protecting Domain-Invariant Features to Secure Pre-Trained Models Against Extraction
Authors:
Tong Zhou,
Shijin Duan,
Gaowen Liu,
Charles Fleming,
Ramana Rao Kompella,
Shaolei Ren,
Xiaolin Xu
Abstract:
Pre-trained models are valuable intellectual property, capturing both domain-specific and domain-invariant features within their weight spaces. However, model extraction attacks threaten these assets by enabling unauthorized source-domain inference and facilitating cross-domain transfer via the exploitation of domain-invariant features. In this work, we introduce **ProDiF**, a novel framework that leverages targeted weight space manipulation to secure pre-trained models against extraction attacks. **ProDiF** quantifies the transferability of filters and perturbs the weights of critical filters in unsecured memory, while preserving actual critical weights in a Trusted Execution Environment (TEE) for authorized users. A bi-level optimization further ensures resilience against adaptive fine-tuning attacks. Experimental results show that **ProDiF** reduces source-domain accuracy to near-random levels and decreases cross-domain transferability by 74.65%, providing robust protection for pre-trained models. This work offers comprehensive protection for pre-trained DNN models and highlights the potential of weight space manipulation as a novel approach to model security.
Submitted 17 March, 2025;
originally announced March 2025.
-
Regular black holes and their singular families
Authors:
Hyat Huang,
Xiao-Pin Rao
Abstract:
Regular black holes without curvature singularity can arise in Einstein gravity with appropriate matter energy-momentum tensor. We show that these regular solutions represent only a special case of a much broader family of black holes with a free mass parameter. The regularity is achieved only at a specific mass value, and any deviation from the fine-tuned parameter inevitably results in curvature singularity. As a concrete example, we consider nonlinear electrodynamics (NLED) as matter sources. A new NLED theory is proposed that is a generalization of the Bardeen class and the Hayward class. New regular black holes and their singular counterparts are obtained. Significant distinctions between regular black holes and their singular counterparts are analyzed. These findings provide new insights into regular black holes.
Submitted 17 March, 2025;
originally announced March 2025.
-
Training Video Foundation Models with NVIDIA NeMo
Authors:
Zeeshan Patel,
Ethan He,
Parth Mannan,
Xiaowei Ren,
Ryan Wolf,
Niket Agarwal,
Jacob Huffman,
Zhuoyao Wang,
Carl Wang,
Jack Chang,
Yan Bai,
Tommy Huang,
Linnan Wang,
Sahil Jain,
Shanmugam Ramasamy,
Joseph Jennings,
Ekaterina Sirazitdinova,
Oleg Sudakov,
Mingyuan Ma,
Bobby Chen,
Forrest Lin,
Hao Wang,
Vasanth Rao Naik Sabavat,
Sriharsha Niverty,
Rong Ou
, et al. (4 additional authors not shown)
Abstract:
Video Foundation Models (VFMs) have recently been used to simulate the real world to train physical AI systems and develop creative visual experiences. However, there are significant challenges in training large-scale, high-quality VFMs that can generate high-quality videos. We present a scalable, open-source VFM training pipeline with NVIDIA NeMo, providing accelerated video dataset curation, multimodal data loading, and parallelized video diffusion model training and inference. We also provide a comprehensive performance analysis highlighting best practices for efficient VFM training and inference.
Submitted 17 March, 2025;
originally announced March 2025.
-
Parameter estimation for generalized mixed fractional stochastic heat equation
Authors:
B. L. S. Prakasa Rao
Abstract:
We study the properties of a stochastic heat equation with a generalized mixed fractional Brownian noise. We obtain the covariance structure, stationarity and obtain bounds for the asymptotic behaviour of the solution. We suggest estimators for the unknown parameters based on discrete time observations and study their asymptotic properties.
Submitted 17 March, 2025;
originally announced March 2025.
-
Active and Passive Conformal Transformations in Scalar-Tensor Gravitational Theories
Authors:
Israel Quiros,
Amit Kumar Rao
Abstract:
By treating conformal transformations as coordinate transformations in an abstract space of fields, where the different fields play the role of "generalized coordinates," we introduce the notion of active and passive conformal transformations. We then apply both complementary approaches to the conformal frames issue, which arises in the context of scalar-tensor gravity theories, in order to get a better understanding of the problem. Special focus is on the coupling of matter fields to gravity. The recent result that the Lagrangian density of fundamental matter fields and perfect fluids, in its standard form, is not only conformal invariant but also conformal form-invariant, is taken into consideration in the discussion about the conformal frames issue.
Submitted 24 April, 2025; v1 submitted 17 March, 2025;
originally announced March 2025.
-
BREEN: Bridge Data-Efficient Encoder-Free Multimodal Learning with Learnable Queries
Authors:
Tianle Li,
Yongming Rao,
Winston Hu,
Yu Cheng
Abstract:
Encoder-free multimodal large language models (MLLMs) eliminate the need for a well-trained vision encoder by directly processing image tokens before the language model. While this approach reduces computational overhead and model complexity, it often requires large amounts of training data to effectively capture the visual knowledge typically encoded by vision models like CLIP. The absence of a vision encoder implies that the model is likely to rely on substantial data to learn the necessary visual-semantic alignments. In this work, we present BREEN, a data-efficient encoder-free multimodal architecture that mitigates this issue. BREEN leverages a learnable query and image experts to achieve comparable performance with significantly less training data. The learnable query, positioned between image and text tokens, is supervised by the output of a pretrained CLIP model to distill visual knowledge, bridging the gap between visual and textual modalities. Additionally, the image expert processes image tokens and learnable queries independently, improving efficiency and reducing interference with the LLM's textual capabilities. BREEN achieves comparable performance to prior encoder-free state-of-the-art models like Mono-InternVL, using only 13 million text-image pairs in training, about one percent of the data required by existing methods. Our work highlights a promising direction for data-efficient encoder-free multimodal learning, offering an alternative to traditional encoder-based approaches.
Submitted 16 March, 2025;
originally announced March 2025.
-
A Comparative Study of Invariance-Aware Loss Functions for Deep Learning-based Gridless Direction-of-Arrival Estimation
Authors:
Kuan-Lin Chen,
Bhaskar D. Rao
Abstract:
Covariance matrix reconstruction has been the most widely used guiding objective in gridless direction-of-arrival (DoA) estimation for sparse linear arrays. Many semidefinite programming (SDP)-based methods fall under this category. Although deep learning-based approaches enable the construction of more sophisticated objective functions, most methods still rely on covariance matrix reconstruction. In this paper, we propose new loss functions that are invariant to the scaling of the matrices and provide a comparative study of losses with varying degrees of invariance. The proposed loss functions are formulated based on the scale-invariant signal-to-distortion ratio between the target matrix and the Gram matrix of the prediction. Numerical results show that a scale-invariant loss outperforms its non-invariant counterpart but is inferior to the recently proposed subspace loss that is invariant to the change of basis. These results provide evidence that designing loss functions with greater degrees of invariance is advantageous in deep learning-based gridless DoA estimation.
Submitted 16 March, 2025;
originally announced March 2025.
-
A Transformer-based survival model for prediction of all-cause mortality in heart failure patients: a multi-cohort study
Authors:
Shishir Rao,
Nouman Ahmed,
Gholamreza Salimi-Khorshidi,
Christopher Yau,
Huimin Su,
Nathalie Conrad,
Folkert W Asselbergs,
Mark Woodward,
Rod Jackson,
John GF Cleland,
Kazem Rahimi
Abstract:
We developed and validated TRisk, a Transformer-based AI model predicting 36-month mortality in heart failure patients by analysing temporal patient journeys from UK electronic health records (EHR). Our study included 403,534 heart failure patients (ages 40-90) from 1,418 English general practices, with 1,063 practices for model derivation and 355 for external validation. TRisk was compared against the MAGGIC-EHR model across various patient subgroups. With median follow-up of 9 months, TRisk achieved a concordance index of 0.845 (95% confidence interval: [0.841, 0.849]), significantly outperforming MAGGIC-EHR's 0.728 (0.723, 0.733) for predicting 36-month all-cause mortality. TRisk showed more consistent performance across sex, age, and baseline characteristics, suggesting less bias. We successfully adapted TRisk to US hospital data through transfer learning, achieving a C-index of 0.802 (0.789, 0.816) with 21,767 patients. Explainability analyses revealed TRisk captured established risk factors while identifying underappreciated predictors like cancers and hepatic failure that were important across both cohorts. Notably, cancers maintained strong prognostic value even a decade after diagnosis. TRisk demonstrated well-calibrated mortality prediction across both healthcare systems. Our findings highlight the value of tracking longitudinal health profiles and revealed risk factors not included in previous expert-driven models.
Submitted 15 March, 2025;
originally announced March 2025.
-
Towards Learning High-Precision Least Squares Algorithms with Sequence Models
Authors:
Jerry Liu,
Jessica Grogan,
Owen Dugan,
Ashish Rao,
Simran Arora,
Atri Rudra,
Christopher Ré
Abstract:
This paper investigates whether sequence models can learn to perform numerical algorithms, e.g. gradient descent, on the fundamental problem of least squares. Our goal is to inherit two properties of standard algorithms from numerical analysis: (1) machine precision, i.e. we want to obtain solutions that are accurate to near floating point error, and (2) numerical generality, i.e. we want them to apply broadly across problem instances. We find that prior approaches using Transformers fail to meet these criteria, and identify limitations present in existing architectures and training procedures. First, we show that softmax Transformers struggle to perform high-precision multiplications, which prevents them from precisely learning numerical algorithms. Second, we identify an alternate class of architectures, comprised entirely of polynomials, that can efficiently represent high-precision gradient descent iterates. Finally, we investigate precision bottlenecks during training and address them via a high-precision training recipe that reduces stochastic gradient noise. Our recipe enables us to train two polynomial architectures, gated convolutions and linear attention, to perform gradient descent iterates on least squares problems. For the first time, we demonstrate the ability to train to near machine precision. Applied iteratively, our models obtain 100,000x lower MSE than standard Transformers trained end-to-end and they incur a 10,000x smaller generalization gap on out-of-distribution problems. We make progress towards end-to-end learning of numerical algorithms for least squares.
Submitted 15 March, 2025;
originally announced March 2025.
-
Monte Carlo model of distilled remote entanglement between superconducting qubits across optical channels
Authors:
Nicolas Dirnegger,
Moein Malekakhlagh,
Vikesh Siddhu,
Ashutosh Rao,
Chi Xiong,
Muir Kumph,
Jason Orcutt,
Abram Falk
Abstract:
A promising quantum computing architecture comprises modules of superconducting quantum processors linked by optical channels via quantum transducers. To map transducer device performance to system-level channel performance, our model uses Monte Carlo simulations that incorporate 2-to-1 and 3-to-1 entanglement distillation protocols. We show that the Extreme Photon Loss distillation protocol is particularly high performing and that, even without distillation, present-day transducers are at the threshold of enabling Bell pair distribution with fidelities of 50%. If the next generation of transducers can improve by 3 orders of magnitude in both added noise and efficiency, and increase repetition rates by 50x, then they would allow for remote two-qubit gates achieving 99.7% fidelities at 100 kHz rates. These results set targets for transducers to be ready for deployment into scaled superconducting quantum computers.
Submitted 13 March, 2025;
originally announced March 2025.
-
DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding
Authors:
Ayesha Ishaq,
Jean Lahoud,
Ketan More,
Omkar Thawakar,
Ritesh Thawkar,
Dinura Dissanayake,
Noor Ahsan,
Yuhao Li,
Fahad Shahbaz Khan,
Hisham Cholakkal,
Ivan Laptev,
Rao Muhammad Anwer,
Salman Khan
Abstract:
While large multimodal models (LMMs) have demonstrated strong performance across various Visual Question Answering (VQA) tasks, certain challenges require complex multi-step reasoning to reach accurate answers. One particularly challenging task is autonomous driving, which demands thorough cognitive processing before decisions can be made. In this domain, a sequential and interpretive understanding of visual cues is essential for effective perception, prediction, and planning. Nevertheless, common VQA benchmarks often focus on the accuracy of the final answer while overlooking the reasoning process that enables the generation of accurate responses. Moreover, existing methods lack a comprehensive framework for evaluating step-by-step reasoning in realistic driving scenarios. To address this gap, we propose DriveLMM-o1, a new dataset and benchmark specifically designed to advance step-wise visual reasoning for autonomous driving. Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning, each enriched with step-by-step reasoning to ensure logical inference in autonomous driving scenarios. We further introduce a large multimodal model that is fine-tuned on our reasoning dataset, demonstrating robust performance in complex driving scenarios. In addition, we benchmark various open-source and closed-source methods on our proposed dataset, systematically comparing their reasoning capabilities for autonomous driving tasks. Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open-source model. Our framework, dataset, and model are available at https://github.com/ayesha-ishaq/DriveLMM-o1.
Submitted 13 March, 2025;
originally announced March 2025.
-
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
Authors:
Yi Yang,
Xiaoxuan He,
Hongkun Pan,
Xiyan Jiang,
Yan Deng,
Xingtao Yang,
Haoyu Lu,
Dacheng Yin,
Fengyun Rao,
Minfeng Zhu,
Bo Zhang,
Wei Chen
Abstract:
Large Language Models have demonstrated remarkable reasoning capability in complex textual tasks. However, multimodal reasoning, which requires integrating visual and textual information, remains a significant challenge. Existing visual-language models often struggle to effectively analyze and reason about visual content, resulting in suboptimal performance on complex reasoning tasks. Moreover, the absence of comprehensive benchmarks hinders the accurate assessment of multimodal reasoning capabilities. In this paper, we introduce R1-Onevision, a multimodal reasoning model designed to bridge the gap between visual perception and deep reasoning. To achieve this, we propose a cross-modal reasoning pipeline that transforms images into formal textual representations, enabling precise language-based reasoning. Leveraging this pipeline, we construct the R1-Onevision dataset, which provides detailed, step-by-step multimodal reasoning annotations across diverse domains. We further develop the R1-Onevision model through supervised fine-tuning and reinforcement learning to cultivate advanced reasoning and robust generalization abilities. To comprehensively evaluate multimodal reasoning performance across different grades, we introduce R1-Onevision-Bench, a benchmark aligned with human educational stages, covering exams from junior high school to university and beyond. Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL on multiple challenging multimodal reasoning benchmarks.
Submitted 18 March, 2025; v1 submitted 13 March, 2025;
originally announced March 2025.
-
Conformal Prediction Sets for Deep Generative Models via Reduction to Conformal Regression
Authors:
Hooman Shahrokhi,
Devjeet Raj Roy,
Yan Yan,
Venera Arnaoudova,
Janaradhan Rao Doppa
Abstract:
We consider the problem of generating valid and small prediction sets by sampling outputs (e.g., software code and natural language text) from a black-box deep generative model for a given input (e.g., textual prompt). The validity of a prediction set is determined by a user-defined binary admissibility function depending on the target application. For example, in a code generation application, the function may require at least one program in the set to pass all test cases. To address this problem, we develop a simple and effective conformal inference algorithm referred to as Generative Prediction Sets (GPS). Given a set of calibration examples and black-box access to a deep generative model, GPS can generate prediction sets with provable guarantees. The key insight behind GPS is to exploit the inherent structure within the distribution over the minimum number of samples needed to obtain an admissible output to develop a simple conformal regression approach over the minimum number of samples. Experiments on multiple datasets for code and math word problems using different large language models demonstrate the efficacy of GPS over state-of-the-art methods.
Submitted 13 March, 2025;
originally announced March 2025.
-
Dynamical response theory of interacting Majorana fermions and its application to generic Kitaev quantum spin liquids in a field
Authors:
Peng Rao,
Roderich Moessner,
Johannes Knolle
Abstract:
Motivated by the appearance of Majorana fermions in a broad range of correlated and topological electronic systems, we develop a general method to compute the dynamical response of interacting Majorana fermions in the random-phase approximation (RPA). This can be applied self-consistently on top of Majorana mean-field theory (MFT) backgrounds, thereby in particular providing a powerful tool to analyse $\textit{generic}$ behaviour in the vicinity of (various heavily studied) exactly soluble models. Prime examples are quantum spin liquids (QSL) with emergent Majorana excitations, with the celebrated exact solution of Kitaev. We employ the RPA to study in considerable detail phase structure and dynamics of the extended Kitaev honeycomb $KJΓ$-model, with and without an applied field. First, we benchmark our method with Kitaev's exactly soluble model, finding a remarkable agreement. The interactions between Majorana fermions even turn out to mimic the effect of local $\mathbb{Z}_2$ flux excitations, which we explain analytically. Second, we show how small non-Kitaev couplings $J$ and $Γ$ induce Majorana bound states, resulting in sharp features in the dynamical structure factor in the presence of fractionalisation: such 'spinon excitons' naturally appear, and can coexist and interact with the broad Majorana continuum. Third, for increasing couplings or field, our theory predicts instabilities of the KQSL triggered by the condensation of the sharp modes. From the high symmetry momenta of the condensation we can deduce which magnetically ordered phases surround the KQSL, in good agreement with previous finite-size numerics. We discuss implications for experiments and the broad range of applicability of our method to other QSL and Majorana systems.
Submitted 13 March, 2025;
originally announced March 2025.
-
FDCT: Frequency-Aware Decomposition and Cross-Modal Token-Alignment for Multi-Sensor Target Classification
Authors:
Shoaib Meraj Sami,
Md Mahedi Hasan,
Nasser M. Nasrabadi,
Raghuveer Rao
Abstract:
In automatic target recognition (ATR) systems, sensors may fail to capture discriminative, fine-grained detail features due to environmental conditions, noise created by CMOS chips, occlusion, parallaxes, and sensor misalignment. Therefore, multi-sensor image fusion is an effective choice to overcome these constraints. However, multi-modal image sensors are heterogeneous and have domain and granularity gaps. In addition, the multi-sensor images can be misaligned due to intricate background clutters, fluctuating illumination conditions, and uncontrolled sensor settings. In this paper, to overcome these issues, we decompose, align, and fuse multiple image sensor data for target classification. We extract the domain-specific and domain-invariant features from each sensor data. We propose to develop a shared unified discrete token (UDT) space between sensors to reduce the domain and granularity gaps. Additionally, we develop an alignment module to overcome the misalignment between multi-sensors and emphasize the discriminative representation of the UDT space. In the alignment module, we introduce sparsity constraints to provide a better cross-modal representation of the UDT space and robustness against various sensor settings. We achieve superior classification performance compared to single-modality classifiers and several state-of-the-art multi-modal fusion algorithms on four multi-sensor ATR datasets.
Submitted 12 March, 2025;
originally announced March 2025.
-
Adaptive and Self-Tuning SBL with Total Variation Priors for Block-Sparse Signal Recovery
Authors:
Hamza Djelouat,
Reijo Leinonen,
Mikko J. Sillanpää,
Bhaskar D. Rao,
Markku Juntti
Abstract:
This letter addresses the problem of estimating a block-sparse signal with unknown group partitions in a multiple measurement vector (MMV) setup. We propose a Bayesian framework by applying an adaptive total variation (TV) penalty on the hyper-parameter space of the sparse signal. The main contributions are two-fold. 1) We extend the TV penalty beyond the immediate neighbor, thus enabling better capture of the signal structure. 2) A dynamic framework is provided to learn the penalty parameter for regularization. It is based on the statistical dependencies between the entries of tentative blocks, thus eliminating the need for fine-tuning. The superior performance of the proposed method is empirically demonstrated by extensive computer simulations against state-of-the-art benchmarks. The proposed solution exhibits both excellent performance and robustness against sparsity model mismatch.
Submitted 12 March, 2025;
originally announced March 2025.
-
NSF-SciFy: Mining the NSF Awards Database for Scientific Claims
Authors:
Delip Rao,
Weiqiu You,
Eric Wong,
Chris Callison-Burch
Abstract:
We present NSF-SciFy, a large-scale dataset for scientific claim extraction derived from the National Science Foundation (NSF) awards database, comprising over 400K grant abstracts spanning five decades. While previous datasets relied on published literature, we leverage grant abstracts, which offer a unique advantage: they capture claims at an earlier stage in the research lifecycle before publication takes effect. We also introduce a new task to distinguish between existing scientific claims and aspirational research intentions in proposals. Using zero-shot prompting with frontier large language models, we jointly extract 114K scientific claims and 145K investigation proposals from 16K grant abstracts in the materials science domain to create a focused subset called NSF-SciFy-MatSci. We use this dataset to evaluate three key tasks: (1) technical to non-technical abstract generation, where models achieve high BERTScore (0.85+ F1); (2) scientific claim extraction, where fine-tuned models outperform base models by 100% relative improvement; and (3) investigation proposal extraction, showing 90%+ improvement with fine-tuning. We introduce novel LLM-based evaluation metrics for robust assessment of claim/proposal extraction quality. As the largest scientific claim dataset to date -- with an estimated 2.8 million claims across all STEM disciplines funded by the NSF -- NSF-SciFy enables new opportunities for claim verification and meta-scientific research. We publicly release all datasets, trained models, and evaluation code to facilitate further research.
Submitted 15 March, 2025; v1 submitted 11 March, 2025;
originally announced March 2025.
-
Gemini Embedding: Generalizable Embeddings from Gemini
Authors:
Jinhyuk Lee,
Feiyang Chen,
Sahil Dua,
Daniel Cer,
Madhuri Shanbhogue,
Iftekhar Naim,
Gustavo Hernández Ábrego,
Zhe Li,
Kaifeng Chen,
Henrique Schechter Vera,
Xiaoqi Ren,
Shanfeng Zhang,
Daniel Salz,
Michael Boratko,
Jay Han,
Blair Chen,
Shuo Huang,
Vikram Rao,
Paul Suganthan,
Feng Han,
Andreas Doumanoglou,
Nithi Gupta,
Fedor Moiseev,
Cathy Yip,
Aashi Jain, et al. (22 additional authors not shown)
Abstract:
In this report, we introduce Gemini Embedding, a state-of-the-art embedding model leveraging the power of Gemini, Google's most capable large language model. Capitalizing on Gemini's inherent multilingual and code understanding capabilities, Gemini Embedding produces highly generalizable embeddings for text spanning numerous languages and textual modalities. The representations generated by Gemini Embedding can be precomputed and applied to a variety of downstream tasks including classification, similarity, clustering, ranking, and retrieval. Evaluated on the Massive Multilingual Text Embedding Benchmark (MMTEB), which includes over one hundred tasks across 250+ languages, Gemini Embedding substantially outperforms prior state-of-the-art models, demonstrating considerable improvements in embedding quality. Achieving state-of-the-art performance across MMTEB's multilingual, English, and code benchmarks, our unified model demonstrates strong capabilities across a broad selection of tasks and surpasses specialized domain-specific models.
Submitted 10 March, 2025;
originally announced March 2025.
-
An asymptotic preserving scheme satisfying entropy stability for the barotropic Euler system
Authors:
Megala Anandan,
Mária Lukáčová-Medvid'ová,
S. V. Raghurama Rao
Abstract:
In this paper we study structure-preserving numerical methods for low Mach number barotropic Euler equations. Besides their asymptotic preserving properties, which are crucial for obtaining uniformly consistent and stable approximations of the Euler equations in their singular limit as the Mach number approaches zero, our aim is also to preserve discrete entropy stability. A suitable acoustic/advection splitting approach combined with time implicit-explicit approximations is used to achieve the asymptotic preserving property. The entropy stability of different space discretisation strategies is studied for different values of the Mach number and is validated by numerical experiments.
Submitted 14 May, 2025; v1 submitted 10 March, 2025;
originally announced March 2025.
-
PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training
Authors:
Cong Chen,
Mingyu Liu,
Chenchen Jing,
Yizhou Zhou,
Fengyun Rao,
Hao Chen,
Bo Zhang,
Chunhua Shen
Abstract:
This paper aims to address the challenge of hallucinations in Multimodal Large Language Models (MLLMs), particularly for dense image captioning tasks. To tackle the challenge, we identify the current lack of a metric that finely measures caption quality at the concept level. We therefore introduce HalFscore, a novel metric that is built upon the language graph and designed to evaluate both the accuracy and completeness of dense captions at a granular level. Additionally, we identify the root cause of hallucination as the model's over-reliance on its language prior. To address this, we propose PerturboLLaVA, which reduces the model's reliance on the language prior by incorporating adversarially perturbed text during training. This method enhances the model's focus on visual inputs, effectively reducing hallucinations and producing accurate, image-grounded descriptions without incurring additional computational overhead. PerturboLLaVA significantly improves the fidelity of generated captions, outperforming existing approaches in handling multimodal hallucinations and achieving improved performance across general multimodal benchmarks.
Submitted 9 March, 2025;
originally announced March 2025.
-
Federated Learning for Diffusion Models
Authors:
Zihao Peng,
Xijun Wang,
Shengbo Chen,
Hong Rao,
Cong Shen
Abstract:
Diffusion models are powerful generative models that can produce highly realistic samples for various tasks. Typically, these models are constructed using centralized, independently and identically distributed (IID) training data. However, in practical scenarios, data is often distributed across multiple clients and frequently manifests non-IID characteristics. Federated Learning (FL) can leverage this distributed data to train diffusion models, but the performance of existing FL methods is unsatisfactory in non-IID scenarios. To address this, we propose FedDDPM (Federated Learning with Denoising Diffusion Probabilistic Models), which leverages the data generative capability of diffusion models to facilitate model training. In particular, the server uses well-trained local diffusion models uploaded by each client before FL training to generate auxiliary data that can approximately represent the global data distribution. Following each round of model aggregation, the server further optimizes the global model using the auxiliary dataset to alleviate the impact of heterogeneous data on model performance. We provide a rigorous convergence analysis of FedDDPM and propose an enhanced algorithm, FedDDPM+, to reduce training overheads. FedDDPM+ detects instances of slow model learning and performs a one-shot correction using the auxiliary dataset. Experimental results validate that our proposed algorithms outperform the state-of-the-art FL algorithms on the MNIST, CIFAR10 and CIFAR100 datasets.
Submitted 8 March, 2025;
originally announced March 2025.
-
Training and Inference Efficiency of Encoder-Decoder Speech Models
Authors:
Piotr Żelasko,
Kunal Dhawan,
Daniel Galvez,
Krishna C. Puvvada,
Ankita Pasad,
Nithin Rao Koluguri,
Ke Hu,
Vitaly Lavrukhin,
Jagadeesh Balam,
Boris Ginsburg
Abstract:
The attention encoder-decoder architecture is the backbone of several recent top-performing foundation speech models: Whisper, Seamless, OWSM, and Canary-1B. However, the reported data and compute requirements for their training are prohibitive for many in the research community. In this work, we focus on the efficiency angle and ask whether we are training these speech models efficiently, and what we can do to improve. We argue that a major, if not the most severe, detrimental factor for training efficiency is related to the sampling strategy of sequential data. We show that negligence in mini-batch sampling leads to more than 50% of computation being spent on padding. To that end, we study, profile, and optimize Canary-1B training to show gradual improvement in GPU utilization, leading to an up to 5x increase in average batch sizes versus its original training settings. This in turn allows us to train an equivalent model using 4x fewer GPUs in the same wall time, or to leverage the original resources and train it in 2x shorter wall time. Finally, we observe that the major inference bottleneck lies in the autoregressive decoder steps. We find that adjusting the model architecture to transfer model parameters from the decoder to the encoder results in a 3x inference speedup as measured by inverse real-time factor (RTFx) while preserving the accuracy and compute requirements for convergence. The training code and models will be available as open-source.
Submitted 19 March, 2025; v1 submitted 7 March, 2025;
originally announced March 2025.
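The padding claim in the abstract above can be made concrete with a small, self-contained sketch. It compares the share of padded elements when variable-length sequences are batched in arbitrary order versus after sorting by length. The synthetic length distribution, the batch size, and the helper name `padding_fraction` are assumptions made for illustration; length-sorted batching is just one common remedy and is not necessarily the sampling strategy used by the authors.

```python
import random


def padding_fraction(lengths, batch_size):
    """Return the fraction of elements that are padding when consecutive
    groups of `batch_size` sequences are padded to the longest in the group."""
    padded_total = 0
    real_total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        padded_total += max(batch) * len(batch)  # every item padded to the batch max
        real_total += sum(batch)                 # elements that carry real data
    return 1.0 - real_total / padded_total


# Hypothetical sequence lengths (e.g. audio frames), skewed toward short clips.
rng = random.Random(0)
lengths = [int(rng.expovariate(1 / 400)) + 50 for _ in range(4096)]

# Batching in arrival order versus batching after sorting by length.
naive = padding_fraction(lengths, batch_size=32)
bucketed = padding_fraction(sorted(lengths), batch_size=32)

print(f"padding, arbitrary order: {naive:.1%}")
print(f"padding, length-sorted:   {bucketed:.1%}")
```

Grouping sequences of similar length shrinks the gap between each sequence and its batch maximum, which is why the sorted variant wastes less computation on padding in this toy setting.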
-
OPTIC: Optimizing Patient-Provider Triaging & Improving Communications in Clinical Operations using GPT-4 Data Labeling and Model Distillation
Authors:
Alberto Santamaria-Pang,
Frank Tuan,
Ross Campbell,
Cindy Zhang,
Ankush Jindal,
Roopa Surapur,
Brad Holloman,
Deanna Hanisch,
Rae Buckley,
Carisa Cooney,
Ivan Tarapov,
Kimberly S. Peairs,
Brian Hasselfeld,
Peter Greene
Abstract:
The COVID-19 pandemic has accelerated the adoption of telemedicine and patient messaging through electronic medical portals (patient medical advice requests, or PMARs). While these platforms enhance patient access to healthcare, they have also increased the burden on healthcare providers due to the surge in PMARs. This study seeks to develop an efficient tool for message triaging to reduce physician workload and improve patient-provider communication. We developed OPTIC (Optimizing Patient-Provider Triaging & Improving Communications in Clinical Operations), a powerful message triaging tool that utilizes GPT-4 for data labeling and BERT for model distillation. The study used a dataset of 405,487 patient messaging encounters from Johns Hopkins Medicine between January and June 2020. High-quality labeled data was generated through GPT-4-based prompt engineering, which was then used to train a BERT model to classify messages as "Admin" or "Clinical." The BERT model achieved 88.85% accuracy on the test set validated by GPT-4 labeling, with a sensitivity of 88.29%, specificity of 89.38%, and an F1 score of 0.8842. BERTopic analysis identified 81 distinct topics within the test data, with over 80% accuracy in classifying 58 topics. The system was successfully deployed through Epic's Nebula Cloud Platform, demonstrating its practical effectiveness in healthcare settings.
Submitted 5 February, 2025;
originally announced March 2025.
-
The Society of HiveMind: Multi-Agent Optimization of Foundation Model Swarms to Unlock the Potential of Collective Intelligence
Authors:
Noah Mamie,
Susie Xi Rao
Abstract:
Multi-agent systems address issues of accessibility and scalability of artificial intelligence (AI) foundation models, which are often represented by large language models. We develop a framework - the "Society of HiveMind" (SOHM) - that orchestrates the interaction between multiple AI foundation models, imitating the observed behavior of animal swarms in nature by following modern evolutionary theories. On the one hand, we find that the SOHM provides a negligible benefit on tasks that mainly require real-world knowledge. On the other hand, we observe a significant improvement on tasks that require intensive logical reasoning, indicating that multi-agent systems are capable of increasing the reasoning capabilities of the collective compared to the individual agents. Our findings demonstrate the potential of combining a multitude of diverse AI foundation models to form an artificial swarm intelligence capable of self-improvement through interactions with a given environment.
Submitted 13 March, 2025; v1 submitted 7 March, 2025;
originally announced March 2025.
-
EDM: Efficient Deep Feature Matching
Authors:
Xi Li,
Tong Rao,
Cihui Pan
Abstract:
Recent feature matching methods have achieved remarkable performance but lack efficiency consideration. In this paper, we revisit the mainstream detector-free matching pipeline and improve all its stages considering both accuracy and efficiency. We propose an Efficient Deep feature Matching network, EDM. We first adopt a deeper CNN with fewer dimensions to extract multi-level features. Then we present a Correlation Injection Module that conducts feature transformation on high-level deep features, and progressively injects feature correlations from global to local for efficient multi-scale feature aggregation, improving both speed and performance. In the refinement stage, a novel lightweight bidirectional axis-based regression head is designed to directly predict subpixel-level correspondences from latent features, avoiding the significant computational cost of explicitly locating keypoints on high-resolution local feature heatmaps. Moreover, effective selection strategies are introduced to enhance matching accuracy. Extensive experiments show that our EDM achieves competitive matching accuracy on various benchmarks and exhibits excellent efficiency, offering valuable best practices for real-world applications. The code is available at https://github.com/chicleee/EDM.
Submitted 22 May, 2025; v1 submitted 6 March, 2025;
originally announced March 2025.
-
Multi-Agent Ergodic Exploration under Smoke-Based, Time-Varying Sensor Visibility Constraints
Authors:
Elena Wittemyer,
Ananya Rao,
Ian Abraham,
Howie Choset
Abstract:
In this work, we consider the problem of multi-agent informative path planning (IPP) for robots whose sensor visibility continuously changes as a consequence of a time-varying natural phenomenon. We leverage ergodic trajectory optimization (ETO), which generates paths such that the amount of time an agent spends in an area is proportional to the expected information in that area. We focus specifically on the problem of multi-agent drone search of a wildfire, where we use the time-varying environmental process of smoke diffusion to construct a sensor visibility model. This sensor visibility model is used to repeatedly calculate an expected information distribution (EID) to be used in the ETO algorithm. Our experiments show that our exploration method achieves improved information gathering over both baseline search methods and naive ergodic search formulations.
Submitted 6 March, 2025;
originally announced March 2025.
-
A kinetic-based regularization method for data science applications
Authors:
Abhisek Ganguly,
Alessandro Gabbana,
Vybhav Rao,
Sauro Succi,
Santosh Ansumali
Abstract:
We propose a physics-based regularization technique for function learning, inspired by statistical mechanics. By drawing an analogy between optimizing the parameters of an interpolator and minimizing the energy of a system, we introduce corrections that impose constraints on the lower-order moments of the data distribution. This minimizes the discrepancy between the discrete and continuum representations of the data, in turn allowing access to more favorable energy landscapes and thus improving the accuracy of the interpolator. Our approach improves performance in both interpolation and regression tasks, even in high-dimensional spaces. Unlike traditional methods, it does not require empirical parameter tuning, making it particularly effective for handling noisy data. We also show that, thanks to its local nature, the method offers computational and memory efficiency advantages over Radial Basis Function interpolators, especially for large datasets.
Submitted 6 March, 2025;
originally announced March 2025.
-
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
Authors:
Sambal Shikhar,
Mohammed Irfan Kurpath,
Sahal Shaji Mullappilly,
Jean Lahoud,
Fahad Khan,
Rao Muhammad Anwer,
Salman Khan,
Hisham Cholakkal
Abstract:
Recent advancements in speech-to-speech dialogue systems leverage LLMs for multimodal interactions, yet they remain hindered by fine-tuning requirements, high computational overhead, and text-speech misalignment. Existing speech-enabled LLMs often degrade conversational quality by modifying the LLM, thereby compromising its linguistic capabilities. In contrast, we propose LLMVoX, a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming TTS system that generates high-quality speech with low latency, while fully preserving the capabilities of the base LLM. Our approach achieves a significantly lower Word Error Rate compared to speech-enabled LLMs, while operating at comparable latency and UTMOS score. By decoupling speech synthesis from LLM processing via a multi-queue token streaming system, LLMVoX supports seamless, infinite-length dialogues. Its plug-and-play design also facilitates extension to various tasks with different backbones. Furthermore, LLMVoX generalizes to new languages with only dataset adaptation, attaining a low Character Error Rate on an Arabic speech task. Additionally, we have integrated LLMVoX with a Vision-Language Model to create an omni-model with speech, text, and vision capabilities, without requiring additional multimodal training. Our code base and project page are available at https://mbzuai-oryx.github.io/LLMVoX.
Submitted 6 March, 2025;
originally announced March 2025.
-
Simulating the Real World: A Unified Survey of Multimodal Generative Models
Authors:
Yuqi Hu,
Longguang Wang,
Xian Liu,
Ling-Hao Chen,
Yuwei Guo,
Yukai Shi,
Ce Liu,
Anyi Rao,
Zeyu Wang,
Hui Xiong
Abstract:
Understanding and replicating the real world is a critical challenge in Artificial General Intelligence (AGI) research. To achieve this, many existing approaches, such as world models, aim to capture the fundamental principles governing the physical world, enabling more accurate simulations and meaningful interactions. However, current methods often treat different modalities, including 2D (images), videos, 3D, and 4D representations, as independent domains, overlooking their interdependencies. Additionally, these methods typically focus on isolated dimensions of reality without systematically integrating their connections. In this survey, we present a unified review of multimodal generative models that investigates the progression of data dimensionality in real-world simulation. Specifically, this survey starts from 2D generation (appearance), then moves to video (appearance+dynamics) and 3D generation (appearance+geometry), and finally culminates in 4D generation, which integrates all dimensions. To the best of our knowledge, this is the first attempt to systematically unify the study of 2D, video, 3D and 4D generation within a single framework. To guide future research, we provide a comprehensive review of datasets, evaluation metrics, and future directions, fostering insights for newcomers. This survey serves as a bridge to advance the study of multimodal generative models and real-world simulation within a unified framework.
Submitted 6 March, 2025;
originally announced March 2025.
-
Exact first passage time distribution for nonlinear chemical reaction networks II: monomolecular reactions and a A + B - C type of second-order reaction with arbitrary initial conditions
Authors:
Changqian Rao,
David Waxman,
Wei Lin,
Zhuoyi Song
Abstract:
In biochemical reaction networks, the first passage time (FPT) of a reaction quantifies the time it takes for the reaction to first occur, from the initial state. While the mean FPT historically served as a summary metric, a far more comprehensive characterization of the dynamics of the network is contained within the complete FPT distribution. The relatively uncommon theoretical treatments of the FPT distribution that have been given in the past have been confined to linear systems, with zero and first-order processes. Recently, we presented theoretically exact solutions for the FPT distribution, within nonlinear systems involving two-particle collisions, such as A+B - C. Although this research yielded invaluable results, it was based upon the assumption of initial conditions in the form of a Poisson distribution. This somewhat restricts its relevance to real-world biochemical systems, which frequently display intricate behaviour and initial conditions that are non-Poisson in nature. Our current study extends prior analyses to accommodate arbitrary initial conditions, thereby expanding the applicability of our theoretical framework and providing a more adaptable tool for capturing the dynamics of biochemical reaction networks.
Submitted 6 March, 2025;
originally announced March 2025.
-
Mapping strain and structural heterogeneities around bubbles in amorphous ionically conductive Bi$_2$O$_3$
Authors:
Ellis Rae Kennedy,
Stephanie M. Ribet,
Ian S. Winter,
Caitlin A. Kohnert,
Yongqiang Wang,
Karen C. Bustillo,
Colin Ophus,
Benjamin K. Derby
Abstract:
While amorphous materials are often approximated to have a statistically homogeneous atomic structure, they frequently exhibit localized structural heterogeneity that challenges simplified models. This study uses 4D scanning transmission electron microscopy to investigate the strain and structural modifications around gas bubbles in amorphous Bi$_2$O$_3$ induced by argon irradiation. We present a method for determining strain fields surrounding bubbles that can be used to measure the internal pressure of the gas. Compressive strain is observed around the cavities, with higher-order crystalline symmetries emerging near the cavity interfaces, suggesting paracrystalline ordering as a result of bubble coarsening. This ordering, along with a compressive strain gradient, indicates that gas bubbles induce significant localized changes in atomic packing. By analyzing strain fields with maximum compressive strains of 3\%, we estimate a lower bound on the internal pressure of the bubbles at 2.5 GPa. These findings provide insight into the complex structural behavior of amorphous materials under stress, particularly in systems with gas inclusions, and offer new methods for probing the local atomic structure in disordered materials. Although considering structural heterogeneity in amorphous systems is non-trivial, these features have crucial impacts on material functionalities, such as mechanical strength, ionic conductivity, and electronic mobility.
Submitted 5 March, 2025;
originally announced March 2025.
-
Effect of Ag nano-additivation on microstructure formation in Nd-Fe-B magnets built by laser powder bed fusion
Authors:
Varatharaja Nallathambi,
Philipp Gabriel,
Xinren Chen,
Ziyuan Rao,
Konstantin Skokov,
Oliver Gutfleisch,
Stephan Barcikowski,
Anna Rosa Ziefuss,
Baptiste Gault
Abstract:
Laser powder bed fusion (PBF-LB/M) enables the near-net shape production of permanent magnets with complex geometry while reducing material waste. However, controlling the microstructure and optimizing magnetic properties remain challenging due to rapid solidification and intrinsic heat treatment effects occurring during both inter-layer and intra-layer processing. Surface additivation of the feedstock powder with Ag nanoparticles (NPs) is a concept that has been shown to increase the coercivity of PBF-LB/M-produced Nd-Fe-B magnets. Using atom probe tomography (APT) and transmission electron microscopy (TEM), we reveal that Ag nano-additivation promotes heterogeneous nucleation of the Nd2Fe14B phase, leading to refined, equiaxed grains and increased stability of the Ti-Zr-B-rich intergranular phase. The intrinsic heat treatment, influenced by layer-wise processing, further affects the distribution of Ag-rich regions, impacting grain growth and intergranular phase composition across different regions of the melt pool. Compared to the unadditivated sample, the Ag-additivated sample exhibits a significantly finer grain structure and a changed intergranular phase, which contribute to enhanced domain wall pinning and coercivity. These microstructural changes directly modify the magnetic domain structure, as evidenced by Lorentz transmission electron microscopy (TEM). Our results highlight that the interplay between nano-additivation and in-process heat treatment provides a novel pathway for tailoring the microstructure and enhancing the magnetic performance of permanent magnets.
Submitted 5 March, 2025;
originally announced March 2025.
-
Lattice dynamics of hexagonal ZnMgS
Authors:
Abdelmahjid Elmahjoubi,
Mala Rao,
Alexandre Ivanov,
Andrei Postnikov,
Alain Polian,
Toni Alhaddad,
Samrath Chaplot,
Andrea Piovano,
Sebastien Diliberto,
Stephanie Michel,
Alain Maillard,
Karol Strzalkowski,
O. Pages
Abstract:
Inelastic neutron scattering measurements on the hexagonal Zn67Mg33S semiconductor alloy reveal a bimodal pattern of the optical modes across the Brillouin zone, confirmed by first-principles simulations. Such modes are sensitive to the local fluctuations in the composition inherent to random Zn/Mg alloying, distinguishing homo from hetero environments of a given bond (1-bond/2-mode), as is formalized for cubic alloys by the percolation model. The latter model thus emerges as a generic framework for systematizing the optical modes of semiconductor alloys in various crystal structures.
Submitted 5 March, 2025;
originally announced March 2025.
-
Computational Analysis of Degradation Modeling in Blind Panoramic Image Quality Assessment
Authors:
Jiebin Yan,
Ziwen Tan,
Jiale Rao,
Lei Wu,
Yifan Zuo,
Yuming Fang
Abstract:
Blind panoramic image quality assessment (BPIQA) has recently brought new challenges to the visual quality community, due to the complex interaction between immersive content and human behavior. Although many efforts have been made to advance BPIQA, both by conducting psychophysical experiments and by designing performance-driven objective algorithms, the \textit{limited content} and \textit{few samples} in those closed sets inevitably result in shaky conclusions, thereby hindering the development of BPIQA; we refer to this as the \textit{easy-database} issue. In this paper, we present a thorough computational analysis of degradation modeling in BPIQA to explore the \textit{easy-database} issue, designing three types of experiments that investigate the gap between BPIQA and blind image quality assessment (BIQA), the necessity of specific design in BPIQA models, and the generalization ability of BPIQA models. From extensive experiments, we find that easy databases narrow the gap between the performance of BPIQA and BIQA models, which is not conducive to the development of BPIQA. Easy databases also bring BPIQA models close to saturation, so the effectiveness of the associated specific designs cannot be well verified. In addition, BPIQA models trained on our recently proposed databases with complicated degradation show better generalization ability. Thus, we believe that much more effort should be put into BPIQA from both the subjective and the objective viewpoint.
△ Less
Submitted 5 March, 2025;
originally announced March 2025.
-
Weakly-Constrained 4D Var for Downscaling with Uncertainty using Data-Driven Surrogate Models
Authors:
Philip Dinenis,
Vishwas Rao,
Mihai Anitescu
Abstract:
Dynamic downscaling typically involves using numerical weather prediction (NWP) solvers to refine coarse data to higher spatial resolutions. Data-driven models such as FourCastNet have emerged as a promising alternative to the traditional NWP models for forecasting. Once these models are trained, they are capable of delivering forecasts in a few seconds, thousands of times faster compared to classical NWP models. However, as the lead times, and, therefore, their forecast window, increase, these models show instability in that they tend to diverge from reality. In this paper, we propose to use data assimilation approaches to stabilize them when used for downscaling tasks. Data assimilation uses information from three different sources, namely an imperfect computational model based on partial differential equations (PDE), from noisy observations, and from an uncertainty-reflecting prior. In this work, when carrying out dynamic downscaling, we replace the computationally expensive PDE-based NWP models with FourCastNet in a ``weak-constrained 4DVar framework" that accounts for the implied model errors. We demonstrate the efficacy of this approach for a hurricane-tracking problem; moreover, the 4DVar framework naturally allows the expression and quantification of uncertainty. We demonstrate, using ERA5 data, that our approach performs better than the ensemble Kalman filter (EnKF) and the unstabilized FourCastNet model, both in terms of forecast accuracy and forecast uncertainty.
Submitted 4 March, 2025;
originally announced March 2025.
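For readers unfamiliar with the term, the textbook weak-constraint 4D-Var cost function is sketched below. This is the standard form, not necessarily the exact formulation used in the paper. The symbols are assumptions introduced here: $x_b$ and $B$ are the prior (background) state and its covariance, $y_k$ and $R_k$ the observations and their error covariance, $\mathcal{H}_k$ the observation operator, $Q_k$ the model-error covariance, and $\mathcal{M}$ the forecast model, which in the setting of the abstract would be the FourCastNet surrogate.

```latex
J(x_0,\dots,x_N) = \tfrac{1}{2}\,\|x_0 - x_b\|^2_{B^{-1}}
  + \tfrac{1}{2}\sum_{k=0}^{N} \|\mathcal{H}_k(x_k) - y_k\|^2_{R_k^{-1}}
  + \tfrac{1}{2}\sum_{k=1}^{N} \|x_k - \mathcal{M}(x_{k-1})\|^2_{Q_k^{-1}},
\qquad \|v\|^2_{A^{-1}} = v^\top A^{-1} v
```

The three terms correspond to the three information sources named in the abstract: the prior, the noisy observations, and the imperfect model. The last term is what makes the constraint "weak": the trajectory may deviate from the model at a cost set by $Q_k$.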
-
Learning Surrogates for Offline Black-Box Optimization via Gradient Matching
Authors:
Minh Hoang,
Azza Fadhel,
Aryan Deshwal,
Janardhan Rao Doppa,
Trong Nghia Hoang
Abstract:
The offline design optimization problem arises in numerous science and engineering applications, including material and chemical design, where expensive online experimentation necessitates the use of in silico surrogate functions to predict and maximize the target objective over candidate designs. Although these surrogates can be learned from offline data, their predictions are often inaccurate outside the offline data regime. This challenge raises a fundamental question about the impact of an imperfect surrogate model on the performance gap between its optima and the true optima, and to what extent the performance loss can be mitigated. Although prior work developed methods to improve the robustness of surrogate models and their associated optimization processes, a provably quantifiable relationship between an imperfect surrogate and the corresponding performance gap, as well as whether prior methods directly address it, remains elusive. To shed light on this important question, we present a theoretical framework to understand offline black-box optimization, by explicitly bounding the optimization quality based on how well the surrogate matches the latent gradient field that underlies the offline data. Inspired by our theoretical analysis, we propose a principled black-box gradient matching algorithm to create effective surrogate models for offline optimization, improving over prior approaches on various real-world benchmarks.
Submitted 26 February, 2025;
originally announced March 2025.
-
HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization
Authors:
Zitang Zhou,
Ke Mei,
Yu Lu,
Tianyi Wang,
Fengyun Rao
Abstract:
This paper introduces HarmonySet, a comprehensive dataset designed to advance video-music understanding. HarmonySet consists of 48,328 diverse video-music pairs, annotated with detailed information on rhythmic synchronization, emotional alignment, thematic coherence, and cultural relevance. We propose a multi-step human-machine collaborative framework for efficient annotation, combining human insights with machine-generated descriptions to identify key transitions and assess alignment across multiple dimensions. Additionally, we introduce a novel evaluation framework with tasks and metrics to assess the multi-dimensional alignment of video and music, including rhythm, emotion, theme, and cultural context. Our extensive experiments demonstrate that HarmonySet, along with the proposed evaluation framework, significantly improves the ability of multimodal models to capture and analyze the intricate relationships between video and music.
Submitted 4 March, 2025; v1 submitted 3 March, 2025;
originally announced March 2025.
-
Towards net-zero manufacturing: carbon-aware scheduling for GHG emissions reduction
Authors:
Andrea Mencaroni,
Pieter Leyman,
Birger Raa,
Stijn De Vuyst,
Dieter Claeys
Abstract:
Detailed scheduling has traditionally been optimized for the reduction of makespan and manufacturing costs. However, growing awareness of environmental concerns and increasingly stringent regulations are pushing manufacturing towards reducing the carbon footprint of its operations. Scope 2 emissions, which are the indirect emissions related to the production and consumption of grid electricity, are in fact estimated to be responsible for more than one-third of the global GHG emissions. In this context, carbon-aware scheduling can serve as a powerful way to reduce manufacturing's carbon footprint by considering the time-dependent carbon intensity of the grid and the availability of on-site renewable electricity.
This study introduces a carbon-aware permutation flow-shop scheduling model designed to reduce scope 2 emissions. The model is formulated as a mixed-integer linear problem, taking into account the forecasted grid generation mix and available on-site renewable electricity, along with the set of jobs to be scheduled and their corresponding power requirements. The objective is to find an optimal day-ahead schedule that minimizes scope 2 emissions. The problem is addressed using a dedicated memetic algorithm, combining evolutionary strategy and local search.
Results from computational experiments confirm that by considering the dynamic carbon intensity of the grid and on-site renewable electricity availability, substantial reductions in carbon emissions can be achieved.
Submitted 3 March, 2025;
originally announced March 2025.
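The objective described in the abstract above can be illustrated with a small sketch. It assumes hourly time slots, jobs that start as early as possible, and exhaustive search over job orders in place of the paper's memetic algorithm (which is not reproduced here). All function names, parameter names, and numbers are hypothetical.

```python
from itertools import permutations


def schedule_load(order, proc, power, horizon):
    """Power drawn per time slot when jobs run through the shop in `order`.

    proc[j][m]  : processing time (in slots) of job j on machine m
    power[j][m] : power (kW) drawn while job j runs on machine m
    Jobs start as early as possible (assumption: no deliberate idling).
    """
    n_machines = len(proc[0])
    load = [0.0] * horizon
    machine_free = [0] * n_machines  # slot at which each machine becomes free
    for j in order:
        ready = 0  # slot at which job j leaves the previous machine
        for m in range(n_machines):
            start = max(ready, machine_free[m])
            end = start + proc[j][m]
            if end > horizon:
                raise ValueError("schedule exceeds the planning horizon")
            for t in range(start, end):
                load[t] += power[j][m]
            machine_free[m] = ready = end
    return load


def scope2_emissions(load, intensity, renewable):
    """Grid draw (load minus on-site renewables, floored at 0) times carbon intensity."""
    return sum(max(l - r, 0.0) * ci for l, ci, r in zip(load, intensity, renewable))


def best_order(proc, power, intensity, renewable):
    """Brute-force search over all job permutations (only viable for tiny instances)."""
    horizon = len(intensity)
    return min(
        permutations(range(len(proc))),
        key=lambda o: scope2_emissions(
            schedule_load(o, proc, power, horizon), intensity, renewable
        ),
    )


# Hypothetical instance: one machine, two jobs, three hourly slots.
proc = [[1], [2]]              # job 0 takes 1 slot, job 1 takes 2 slots
power = [[10.0], [1.0]]        # job 0 is power-hungry, job 1 is light
intensity = [1.0, 1.0, 5.0]    # kgCO2 per kWh; the grid is dirtiest in slot 2
renewable = [0.0, 0.0, 1.0]    # kW of on-site renewable supply per slot

best = best_order(proc, power, intensity, renewable)
print(best, scope2_emissions(schedule_load(best, proc, power, 3), intensity, renewable))
```

In this toy instance the search runs the power-hungry job first, while the grid is clean, and leaves the light job for the carbon-intensive slot, where on-site renewables cover its demand.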
-
Quantum nonlocal double slit interference with partially coherent qubits
Authors:
Sakshi Rao,
Bhaskar Kanseri
Abstract:
Partially coherent quantum-entangled beams combine quantum entanglement with partial coherence, allowing them to maintain quantum characteristics while being more resistant to distortions caused by random media during propagation. In this study, we investigate the effect of coherence variation of such beams on non-local double-slit quantum interference. The spatial coherence variation is achieved by controlling the spot size and transverse coherence length of the Gaussian Schell model pump in the spontaneous parametric down conversion process. For a fixed beam size, the momentum correlation width of partially coherent biphotons increases as the transverse coherence length decreases. This results in a biphoton beam exhibiting multiple spatial modes, making it more suitable for studying the non-local features of quantum states in imaging, interference, and diffraction experiments. Our findings indicate both high-quality and near-unity visibility of nonlocal interference using the partially coherent twin beams, even with a substantial decrease in the coherence of the pump. We believe these results can enhance robustness against the deleterious effects of the medium during propagation and can have potential applications in optical image cryptography, biomedical imaging, quantum lithography, and quantum holography.
Submitted 2 March, 2025;
originally announced March 2025.
-
Social Welfare Maximization in Approval-Based Committee Voting under Uncertainty
Authors:
Haris Aziz,
Yuhang Guo,
Venkateswara Rao Kagita,
Baharak Rastegari,
Mashbat Suzuki
Abstract:
Approval voting is widely used for making multi-winner voting decisions. The canonical rule (also called Approval Voting) used in the setting aims to maximize social welfare by selecting candidates with the highest number of approvals. We revisit approval-based multi-winner voting in scenarios where the information regarding the voters' preferences is uncertain. We present several algorithmic results for problems related to social welfare maximization under uncertainty, including computing an outcome that is social welfare maximizing with the highest probability, computing the social welfare probability distribution of a given outcome, computing the probability that a given outcome is social welfare maximizing, and understanding how robust an outcome is with respect to social welfare maximizing.
Submitted 2 March, 2025;
originally announced March 2025.
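As a point of reference, the canonical Approval Voting rule mentioned in the abstract above can be sketched for the deterministic case, where every ballot is known. The uncertainty models studied in the paper are not reproduced here. The function names and the alphabetical tie-breaking are assumptions made for illustration.

```python
from collections import Counter


def approval_voting(ballots, k):
    """Return the k candidates with the most approvals.

    ballots: list of sets, each holding the candidates one voter approves.
    Ties are broken alphabetically (an assumption, for determinism).
    """
    counts = Counter(c for ballot in ballots for c in ballot)
    ranked = sorted(counts, key=lambda c: (-counts[c], c))
    return ranked[:k]


def social_welfare(ballots, committee):
    """Number of (voter, approved committee member) pairs."""
    chosen = set(committee)
    return sum(len(ballot & chosen) for ballot in ballots)


# Hypothetical profile with four voters and four candidates.
ballots = [{"a", "b"}, {"a", "c"}, {"a", "b", "d"}, {"c"}]
committee = approval_voting(ballots, 2)
print(committee, social_welfare(ballots, committee))
```

Because welfare here is the sum of approvals over committee members, picking the k most-approved candidates maximizes it when ballots are certain. Under uncertain ballots the welfare of a fixed committee becomes a random variable, which is the setting the paper addresses.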
-
Re-Imagining Multimodal Instruction Tuning: A Representation View
Authors:
Yiyang Liu,
James Chenhao Liang,
Ruixiang Tang,
Yugyung Lee,
Majid Rabbani,
Sohail Dianat,
Raghuveer Rao,
Lifu Huang,
Dongfang Liu,
Qifan Wang,
Cheng Han
Abstract:
Multimodal instruction tuning has proven to be an effective strategy for achieving zero-shot generalization by fine-tuning pre-trained Large Multimodal Models (LMMs) with instruction-following data. However, as the scale of LMMs continues to grow, fully fine-tuning these models has become highly parameter-intensive. Although Parameter-Efficient Fine-Tuning (PEFT) methods have been introduced to reduce the number of tunable parameters, a significant performance gap remains compared to full fine-tuning. Furthermore, existing PEFT approaches are often highly parameterized, making them difficult to interpret and control. In light of this, we introduce Multimodal Representation Tuning (MRT), a novel approach that focuses on directly editing semantically rich multimodal representations to achieve strong performance and provide intuitive control over LMMs. Empirical results show that our method surpasses current state-of-the-art baselines with significant performance gains (e.g., 1580.40 MME score) while requiring substantially fewer tunable parameters (e.g., 0.03% parameters). Additionally, we conduct experiments on editing instrumental tokens within multimodal representations, demonstrating that direct manipulation of these representations enables simple yet effective control over network behavior.
Submitted 20 March, 2025; v1 submitted 1 March, 2025;
originally announced March 2025.
-
The Simons Observatory: Science Goals and Forecasts for the Enhanced Large Aperture Telescope
Authors:
The Simons Observatory Collaboration,
M. Abitbol,
I. Abril-Cabezas,
S. Adachi,
P. Ade,
A. E. Adler,
P. Agrawal,
J. Aguirre,
Z. Ahmed,
S. Aiola,
T. Alford,
A. Ali,
D. Alonso,
M. A. Alvarez,
R. An,
K. Arnold,
P. Ashton,
Z. Atkins,
J. Austermann,
S. Azzoni,
C. Baccigalupi,
A. Baleato Lizancos,
D. Barron,
P. Barry,
J. Bartlett
, et al. (397 additional authors not shown)
Abstract:
We describe updated scientific goals for the wide-field, millimeter-wave survey that will be produced by the Simons Observatory (SO). Significant upgrades to the 6-meter SO Large Aperture Telescope (LAT) are expected to be complete by 2028, and will include a doubled mapping speed with 30,000 new detectors and an automated data reduction pipeline. In addition, a new photovoltaic array will supply most of the observatory's power. The LAT survey will cover about 60% of the sky at a regular observing cadence, with five times the angular resolution and ten times the map depth of Planck. The science goals are to: (1) determine the physical conditions in the early universe and constrain the existence of new light particles; (2) measure the integrated distribution of mass, electron pressure, and electron momentum in the late-time universe, and, in combination with optical surveys, determine the neutrino mass and the effects of dark energy via tomographic measurements of the growth of structure at $z < 3$; (3) measure the distribution of electron density and pressure around galaxy groups and clusters, and calibrate the effects of energy input from galaxy formation on the surrounding environment; (4) produce a sample of more than 30,000 galaxy clusters, and more than 100,000 extragalactic millimeter sources, including regularly sampled AGN light-curves, to study these sources and their emission physics; (5) measure the polarized emission from magnetically aligned dust grains in our Galaxy, to study the properties of dust and the role of magnetic fields in star formation; (6) constrain asteroid regoliths, search for Trans-Neptunian Objects, and either detect or eliminate large portions of the phase space in the search for Planet 9; and (7) provide a powerful new window into the transient universe on time scales of minutes to years, concurrent with observations from Rubin of overlapping sky.
Submitted 15 March, 2025; v1 submitted 1 March, 2025;
originally announced March 2025.
-
PinLanding: Content-First Keyword Landing Page Generation via Multi-Modal AI for Web-Scale Discovery
Authors:
Faye Zhang,
Jasmine Wan,
Qianyu Cheng,
Jinfeng Rao
Abstract:
Online platforms like Pinterest hosting vast content collections traditionally rely on manual curation or user-generated search logs to create keyword landing pages (KLPs) -- topic-centered collection pages that serve as entry points for content discovery. While manual curation ensures quality, it doesn't scale to millions of collections, and search log approaches result in limited topic coverage and imprecise content matching. In this paper, we present PinLanding, a novel content-first architecture that transforms the way platforms create topical collections. Instead of deriving topics from user behavior, our system employs a multi-stage pipeline combining a vision-language model (VLM) for attribute extraction, a large language model (LLM) for topic generation, and a CLIP-based dual-encoder architecture for precise content matching. Our model achieves 99.7% Recall@10 on the Fashion200K benchmark, demonstrating strong attribute understanding capabilities. In production deployment for search engine optimization with 4.2 million shopping landing pages, the system achieves a 4X increase in topic coverage and a 14.29% improvement in collection attribute precision over the traditional search log-based approach, as measured by human evaluation. The architecture can be generalized beyond search traffic to power various user experiences, including content discovery and recommendations, providing a scalable solution to transform unstructured content into curated topical collections across any content domain.
Submitted 1 March, 2025;
originally announced March 2025.
-
LLM Post-Training: A Deep Dive into Reasoning Large Language Models
Authors:
Komal Kumar,
Tajamul Ashraf,
Omkar Thawakar,
Rao Muhammad Anwer,
Hisham Cholakkal,
Mubarak Shah,
Ming-Hsuan Yang,
Phillip H. S. Torr,
Fahad Shahbaz Khan,
Salman Khan
Abstract:
Large Language Models (LLMs) have transformed the natural language processing landscape and brought to life diverse applications. Pretraining on vast web-scale data has laid the foundation for these models, yet the research community is now increasingly shifting focus toward post-training techniques to achieve further breakthroughs. While pretraining provides a broad linguistic foundation, post-training methods enable LLMs to refine their knowledge, improve reasoning, enhance factual accuracy, and align more effectively with user intents and ethical considerations. Fine-tuning, reinforcement learning, and test-time scaling have emerged as critical strategies for optimizing LLM performance, ensuring robustness, and improving adaptability across various real-world tasks. This survey provides a systematic exploration of post-training methodologies, analyzing their role in refining LLMs beyond pretraining and addressing key challenges such as catastrophic forgetting, reward hacking, and inference-time trade-offs. We highlight emerging directions in model alignment, scalable adaptation, and inference-time reasoning, and outline future research directions. We also provide a public repository to continually track developments in this fast-evolving field: https://github.com/mbzuai-oryx/Awesome-LLM-Post-training.
Submitted 24 March, 2025; v1 submitted 28 February, 2025;
originally announced February 2025.
-
Enhanced Electromechanical Properties of Solution-Processed K$_{0.5}$Na$_{0.5}$NbO$_{3}$ Thin Films
Authors:
Nagamalleswara Rao Alluri,
Longfei Song,
Stephanie Girod,
Barnik Mandal,
Juliette Cardoletti,
Vid Bobnar,
Torsten Granzow,
Veronika Kovacova,
Adrian-Marie Philippe,
Emmanuel Defay,
Sebastjan Glinsek
Abstract:
K$_{0.5}$Na$_{0.5}$NbO$_{3}$ is among the most promising lead-free piezoelectrics. While its sputtered films match the performance of the champion piezoelectric Pb(Zr,Ti)O$_{3}$, processing of high-quality, reproducible, and time-stable solution-processed K$_{0.5}$Na$_{0.5}$NbO$_{3}$ films remains challenging. Here, we report 1 $μ$m-thick Mn-doped K$_{0.5}$Na$_{0.5}$NbO$_{3}$ films prepared through a chemical solution deposition process, which have a perfectly dense microstructure and uniform composition across their thickness. The films exhibit a high transverse piezoelectric coefficient (e$_{31,f}$ = -14.8 C/m$^{2}$), high dielectric permittivity ($ε_{r}$ = 920), and low dielectric losses (tan$δ$ = 0.05), and can withstand electric fields up to at least 1 MV/cm. The functional properties show excellent stability over time, and the synthesis process is reproducible. The results demonstrate the high potential of Mn-doped K$_{0.5}$Na$_{0.5}$NbO$_{3}$ films to become a replacement for lead-based Pb(Zr,Ti)O$_{3}$ films in piezoelectric applications.
Submitted 28 February, 2025;
originally announced February 2025.
-
Multiple Linked Tensor Factorization
Authors:
Zhiyu Kang,
Raghavendra B. Rao,
Eric F. Lock
Abstract:
In biomedical research and other fields, it is now common to generate high content data that are both multi-source and multi-way. Multi-source data are collected from different high-throughput technologies, while multi-way data are collected over multiple dimensions, yielding multiple tensor arrays. Integrative analysis of these data sets is needed, e.g., to capture and synthesize different facets of complex biological systems. However, despite growing interest in multi-source and multi-way factorization techniques, methods that can handle data that are both multi-source and multi-way are limited. In this work, we propose a Multiple Linked Tensors Factorization (MULTIFAC) method extending the CANDECOMP/PARAFAC (CP) decomposition to simultaneously reduce the dimension of multiple multi-way arrays and approximate the underlying signal. We first introduce a version of the CP factorization with L2 penalties on the latent factors, leading to rank sparsity. When extended to multiple linked tensors, the method automatically reveals latent components that are shared across data sources or individual to each data source. We also extend the decomposition algorithm to its expectation-maximization (EM) version to handle incomplete data with imputation. Extensive simulation studies are conducted to demonstrate MULTIFAC's ability to (i) approximate the underlying signal, (ii) identify shared and unshared structures, and (iii) impute missing data. The approach yields an interpretable decomposition on multi-way multi-omics data for a study on early-life iron deficiency.
Submitted 27 February, 2025;
originally announced February 2025.
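To make the phrase "CP factorization with L2 penalties on the latent factors" concrete, a generic penalized CP objective for a single three-way array is sketched below. This is an illustrative form only; the paper's exact penalty and its extension to linked tensors may differ. The symbols are assumptions introduced here: $\mathcal{X}$ is the data tensor, $R$ an upper bound on the rank, $a_r, b_r, c_r$ the factor vectors for each mode, $\circ$ the outer product, and $\lambda \ge 0$ the penalty weight.

```latex
\min_{\{a_r, b_r, c_r\}_{r=1}^{R}}
  \Big\| \mathcal{X} - \sum_{r=1}^{R} a_r \circ b_r \circ c_r \Big\|_F^2
  + \lambda \sum_{r=1}^{R} \big( \|a_r\|_2^2 + \|b_r\|_2^2 + \|c_r\|_2^2 \big)
```

The link to rank sparsity is easiest to see in the two-way (matrix) case, where penalizing the squared Frobenius norms of both factors is known to be equivalent to a nuclear-norm penalty on their product, which shrinks some components exactly to zero.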
-
Light-Emitting Microfibers from Lotus Root for Eco-friendly Optical Waveguides and Biosensing
Authors:
X. Yang,
L. Xu,
S. Xiong,
H. Rao,
F. Tan,
J. Yan,
Y. Bao,
A. Albanese,
A. Camposeo,
D. Pisignano,
B. Li
Abstract:
Optical biosensors based on micro-/nano-fibers are highly valuable for probing and monitoring liquid environments and bioactivity. Most current optical biosensors, however, are still based on glass, semiconductors, or metallic materials, which might not be fully suited for biologically-relevant environments. Here, we introduce biocompatible and flexible microfibers from Lotus silk as micro-environmental monitors that exhibit waveguiding of intrinsic fluorescence as well as of coupled light. These features make single-filament monitors excellent building blocks for a variety of sensing functions, including pH-probing and detection of bacterial activity. These results pave the way for the development of new and entirely eco-friendly, potentially multiplexed biosensing platforms.
Submitted 26 February, 2025;
originally announced February 2025.
-
Max360IQ: Blind Omnidirectional Image Quality Assessment with Multi-axis Attention
Authors:
Jiebin Yan,
Ziwen Tan,
Yuming Fang,
Jiale Rao,
Yifan Zuo
Abstract:
An omnidirectional image, also called a 360-degree image, captures the entire 360-degree scene, thereby providing a more realistic immersive experience for users than general 2D and stereoscopic images. Meanwhile, this feature brings great challenges to measuring the perceptual quality of omnidirectional images, which is closely related to users' quality of experience, especially when the omnidirectional images suffer from non-uniform distortion. In this paper, we propose a novel and effective blind omnidirectional image quality assessment (BOIQA) model with multi-axis attention (Max360IQ), which can proficiently measure not only the quality of uniformly distorted omnidirectional images but also the quality of non-uniformly distorted omnidirectional images. Specifically, the proposed Max360IQ is mainly composed of a backbone with stacked multi-axis attention modules for capturing both global and local spatial interactions of extracted viewports, a multi-scale feature integration (MSFI) module to fuse multi-scale features, and a quality regression module with deep semantic guidance for predicting the quality of omnidirectional images. Experimental results demonstrate that the proposed Max360IQ outperforms the state-of-the-art Assessor360 by 3.6\% in terms of SRCC on the JUFE database with non-uniform distortion, and achieves improvements of 0.4\% and 0.8\% in terms of SRCC on the OIQA and CVIQ databases, respectively. The source code is available at https://github.com/WenJuing/Max360IQ.
Submitted 26 February, 2025;
originally announced February 2025.
-
First frequency phase transfer from the 3 mm to the 1 mm band on an Earth-sized baseline
Authors:
Sara Issaoun,
Dominic W. Pesce,
María J. Rioja,
Richard Dodson,
Lindy Blackburn,
Garrett K. Keating,
Sheperd S. Doeleman,
Bong Won Sohn,
Wu Jiang,
Dan Hoak,
Wei Yu,
Pablo Torne,
Ramprasad Rao,
Remo P. J. Tilanus,
Iván Martí-Vidal,
Taehyun Jung,
Garret Fitzpatrick,
Miguel Sánchez-Portal,
Salvador Sánchez,
Jonathan Weintroub,
Mark Gurwell,
Carsten Kramer,
Carlos Durán,
David John,
Juan L. Santaren
, et al. (11 additional authors not shown)
Abstract:
Frequency Phase Transfer (FPT) is a technique designed to increase coherence and sensitivity in radio interferometry by making use of the non-dispersive nature of the troposphere to calibrate high-frequency data using solutions derived at a lower frequency. While the Korean VLBI Network has pioneered the use of simultaneous multi-band systems for routine FPT up to an observing frequency of 130 GHz, this technique remains largely untested in the (sub)millimeter regime. A recent effort has been made to outfit dual-band systems at (sub)millimeter observatories participating in the Event Horizon Telescope (EHT) and to test the feasibility and performance of FPT up to the observing frequencies of the EHT. We present the results of simultaneous dual-frequency observations conducted in January 2024 on an Earth-sized baseline between the IRAM 30-m in Spain and the JCMT and SMA in Hawai`i. We performed simultaneous observations at 86 and 215 GHz on the bright sources J0958+6533 and OJ287, with strong detections obtained at both frequencies. We observe a strong correlation between the interferometric phases at the two frequencies, matching the trend expected for atmospheric fluctuations and demonstrating for the first time the viability of FPT for VLBI at a wavelength of $\sim$1 millimeter. We show that the application of FPT systematically increases the 215 GHz coherence on all averaging timescales. In addition, the use of the co-located JCMT and SMA as a single dual-frequency station demonstrates the feasibility of paired-antenna FPT for VLBI for the first time, with implications for future array capabilities (e.g., ALMA sub-arraying and ngVLA calibration strategies).
Submitted 25 February, 2025;
originally announced February 2025.
-
LDGen: Enhancing Text-to-Image Synthesis via Large Language Model-Driven Language Representation
Authors:
Pengzhi Li,
Pengfei Yu,
Zide Liu,
Wei He,
Xuhao Pan,
Xudong Rao,
Tao Wei,
Wei Chen
Abstract:
In this paper, we introduce LDGen, a novel method for integrating large language models (LLMs) into existing text-to-image diffusion models while minimizing computational demands. Traditional text encoders, such as CLIP and T5, exhibit limitations in multilingual processing, hindering image generation across diverse languages. We address these challenges by leveraging the advanced capabilities of LLMs. Our approach employs a language representation strategy that applies hierarchical caption optimization and human instruction techniques to derive precise semantic information. Subsequently, we incorporate a lightweight adapter and a cross-modal refiner to facilitate efficient feature alignment and interaction between LLMs and image features. LDGen reduces training time and enables zero-shot multilingual image generation. Experimental results indicate that our method surpasses baseline models in both prompt adherence and image aesthetic quality, while seamlessly supporting multiple languages. Project page: https://zrealli.github.io/LDGen.
Submitted 25 February, 2025;
originally announced February 2025.