-
TabArena: A Living Benchmark for Machine Learning on Tabular Data
Authors:
Nick Erickson,
Lennart Purucker,
Andrej Tschalzev,
David Holzmüller,
Prateek Mutalik Desai,
David Salinas,
Frank Hutter
Abstract:
With the growing popularity of deep learning and foundation models for tabular data, the need for standardized and reliable benchmarks is higher than ever. However, current benchmarks are static. Their design is not updated even if flaws are discovered, model versions are updated, or new models are released. To address this, we introduce TabArena, the first continuously maintained living tabular benchmarking system. To launch TabArena, we manually curate a representative collection of datasets and well-implemented models, conduct a large-scale benchmarking study to initialize a public leaderboard, and assemble a team of experienced maintainers. Our results highlight the influence of validation method and ensembling of hyperparameter configurations to benchmark models at their full potential. While gradient-boosted trees are still strong contenders on practical tabular datasets, we observe that deep learning methods have caught up under larger time budgets with ensembling. At the same time, foundation models excel on smaller datasets. Finally, we show that ensembles across models advance the state-of-the-art in tabular machine learning and investigate the contributions of individual models. We launch TabArena with a public leaderboard, reproducible code, and maintenance protocols to create a living benchmark available at https://tabarena.ai.
Submitted 20 June, 2025;
originally announced June 2025.
-
An accurate and revised version of optical character recognition-based speech synthesis using LabVIEW
Authors:
Prateek Mehta,
Anasuya Patil
Abstract:
Knowledge extraction through sound is a distinctive property. Visually impaired individuals often rely solely on Braille books and audio recordings provided by NGOs. Due to limitations in these approaches, blind individuals often cannot access books of their choice. Speech is a more effective mode of communication than text for blind and visually impaired persons, as they can easily respond to sounds. This paper presents the development of an accurate, reliable, cost-effective, and user-friendly optical character recognition (OCR)-based speech synthesis system. The OCR-based system has been implemented using Laboratory Virtual Instrument Engineering Workbench (LabVIEW).
Submitted 17 June, 2025;
originally announced June 2025.
-
Fading in the Flow: Suppression of cold gas growth in expanding galactic outflows
Authors:
Alankar Dutta,
Prateek Sharma,
Max Gronke
Abstract:
Multiwavelength observations reveal multiphase outflows that play a crucial role in redistributing gas and metals in and around galaxies. Theoretical modelling of such multiphase outflows often employs wind tunnel simulations of a spherical cold ($\sim 10^4 \ \rm K$) cloud facing a uniform hot ($\sim 10^6\ \rm K$) wind. However, outflows are naturally expanding and wind conditions change downstream -- a crucial aspect overlooked in most idealized simulations. To address this, we examine how an expanding wind influences the survival, morphology, and dynamics of a cloud. We perform idealized hydrodynamic simulations of radiative cloud-crushing in an expanding wind, where the steady background wind is modelled using the adiabatic Chevalier & Clegg 1985 (CC85) analytic solution. Moving downstream, we find that the clouds remain locally isobaric with the wind, leading to a steep decline in their density contrast with respect to the ambient medium, and they eventually dissipate into the wind. This also suppresses the growth of cold gas mass in comparison to a plane-parallel wind since entrained clouds move into a less radiative background. Using analytic scaling arguments, we present a physical picture of cloud evolution in a CC85 wind. Cloud expansion and local pressure equilibrium are the key regulators of cold mass growth. Unlike traditional homogeneous wind tunnel simulations, our simulations account for the differential expansion experienced by the long cometary tails of clouds moving in an outflow. Consequently, a strong head-to-tail emission gradient develops in the filamentary cold gas tails -- a feature closer to observations. In addition, we demonstrate that the dynamics of individual clouds may substantially alter the radial properties of their host multiphase outflows.
Submitted 12 June, 2025; v1 submitted 10 June, 2025;
originally announced June 2025.
-
Quanta Diffusion
Authors:
Prateek Chennuri,
Dongdong Fu,
Stanley H. Chan
Abstract:
We present Quanta Diffusion (QuDi), a powerful generative video reconstruction method for single-photon imaging. QuDi is an algorithm supporting the latest Quanta Image Sensors (QIS) and Single Photon Avalanche Diodes (SPADs) for extremely low-light imaging conditions. Compared to existing methods, QuDi overcomes the difficulties of simultaneously managing the motion and the strong shot noise. The core innovation of QuDi is to inject a physics-based forward model into the diffusion algorithm, while keeping the motion estimation in the loop. QuDi demonstrates an average of 2.4 dB PSNR improvement over the best existing methods.
Submitted 7 June, 2025;
originally announced June 2025.
-
Spark Transformer: Reactivating Sparsity in FFN and Attention
Authors:
Chong You,
Kan Wu,
Zhipeng Jia,
Lin Chen,
Srinadh Bhojanapalli,
Jiaxian Guo,
Utku Evci,
Jan Wassenberg,
Praneeth Netrapalli,
Jeremiah J. Willcock,
Suvinay Subramanian,
Felix Chern,
Alek Andreev,
Shreya Pathak,
Felix Yu,
Prateek Jain,
David E. Culler,
Henry M. Levy,
Sanjiv Kumar
Abstract:
The discovery of the lazy neuron phenomenon in trained Transformers, where the vast majority of neurons in their feed-forward networks (FFN) are inactive for each token, has spurred tremendous interest in activation sparsity for enhancing large model efficiency. While notable progress has been made in translating such sparsity to wall-time benefits, modern Transformers have moved away from the ReLU activation function crucial to this phenomenon. Existing efforts on re-introducing activation sparsity often degrade model quality, increase parameter count, or complicate or slow down training. Sparse attention, the application of sparse activation to the attention mechanism, often faces similar challenges.
This paper introduces the Spark Transformer, a novel architecture that achieves a high level of activation sparsity in both FFN and the attention mechanism while maintaining model quality, parameter count, and standard training procedures. Our method realizes sparsity via top-$k$ masking for explicit control over sparsity level. Crucially, we introduce statistical top-$k$, a hardware-accelerator-friendly, linear-time approximate algorithm that avoids costly sorting and mitigates significant training slowdown from standard top-$k$ operators. Furthermore, Spark Transformer reallocates existing FFN parameters and attention key embeddings to form a low-cost predictor for identifying activated entries. This design not only mitigates quality loss from enforced sparsity, but also enhances wall-time benefit. Pretrained with the Gemma-2 recipe, Spark Transformer demonstrates competitive performance on standard benchmarks while exhibiting significant sparsity: only 8% of FFN neurons are activated, and each token attends to a maximum of 256 tokens. This sparsity translates to a 2.5x reduction in FLOPs, leading to decoding wall-time speedups of up to 1.79x on CPU and 1.40x on GPU.
Submitted 6 June, 2025;
originally announced June 2025.
-
PixCell: A generative foundation model for digital histopathology images
Authors:
Srikar Yellapragada,
Alexandros Graikos,
Zilinghan Li,
Kostas Triaridis,
Varun Belagali,
Saarthak Kapse,
Tarak Nath Nandi,
Ravi K Madduri,
Prateek Prasanna,
Tahsin Kurc,
Rajarsi R. Gupta,
Joel Saltz,
Dimitris Samaras
Abstract:
The digitization of histology slides has revolutionized pathology, providing massive datasets for cancer diagnosis and research. Contrastive self-supervised and vision-language models have been shown to effectively mine large pathology datasets to learn discriminative representations. On the other hand, generative models, capable of synthesizing realistic and diverse images, present a compelling solution to address unique problems in pathology that involve synthesizing images: overcoming annotated data scarcity, enabling privacy-preserving data sharing, and performing inherently generative tasks, such as virtual staining. We introduce PixCell, the first diffusion-based generative foundation model for histopathology. We train PixCell on PanCan-30M, a vast, diverse dataset derived from 69,184 H&E-stained whole slide images covering various cancer types. We employ a progressive training strategy and a self-supervision-based conditioning that allows us to scale up training without any annotated data. PixCell generates diverse and high-quality images across multiple cancer types, which we find can be used in place of real data to train a self-supervised discriminative model. Synthetic images shared between institutions are subject to fewer regulatory barriers than would be the case with real clinical images. Furthermore, we showcase the ability to precisely control image generation using a small set of annotated images, which can be used for both data augmentation and educational purposes. Testing on a cell segmentation task, a mask-guided PixCell enables targeted data augmentation, improving downstream performance. Finally, we demonstrate PixCell's ability to use H&E structural staining to infer results from molecular marker studies; we use this capability to infer IHC staining from H&E images. Our trained models are publicly released to accelerate research in computational pathology.
Submitted 5 June, 2025;
originally announced June 2025.
-
Faster Approx. Top-K: Harnessing the Full Power of Two Stages
Authors:
Yashas Samaga,
Varun Yerram,
Spandana Raj Babbula,
Prateek Jain,
Praneeth Netrapalli
Abstract:
We consider the Top-$K$ selection problem, which aims to identify the largest-$K$ elements from an array. Top-$K$ selection arises in many machine learning algorithms and often becomes a bottleneck on accelerators, which are optimized for dense matrix multiplications. To address this problem, \citet{chern2022tpuknnknearestneighbor} proposed a fast two-stage \textit{approximate} Top-$K$ algorithm: (i) partition the input array and select the top-$1$ element from each partition, (ii) sort this \textit{smaller subset} and return the top $K$ elements. In this paper, we consider a generalized version of this algorithm, where the first stage selects top-$K'$ elements, for some $1 \leq K' \leq K$, from each partition. Our contributions are as follows: (i) we derive an expression for the expected recall of this generalized algorithm and show that choosing $K' > 1$ with fewer partitions in the first stage reduces the input size to the second stage more effectively while maintaining the same expected recall as the original algorithm, (ii) we derive a bound on the expected recall for the original algorithm in \citet{chern2022tpuknnknearestneighbor} that is provably tighter by a factor of $2$ than the one in that paper, and (iii) we implement our algorithm on Cloud TPUv5e and achieve around an order of magnitude speedups over the original algorithm without sacrificing recall on real-world tasks.
Submitted 5 June, 2025; v1 submitted 4 June, 2025;
originally announced June 2025.
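The generalized two-stage procedure described in the abstract above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the authors' TPU implementation: the function names, the strided partitioning scheme, and the use of `heapq.nlargest` for both stages are assumptions made here for clarity.

```python
import heapq


def two_stage_approx_top_k(values, k, num_partitions, k_prime=1):
    """Approximate top-k: keep the top k_prime of each partition, then take the top k of the survivors.

    Hypothetical helper for illustration; k_prime=1 corresponds to the original
    two-stage scheme, k_prime>1 to the generalized one described in the abstract.
    """
    if not 1 <= k_prime <= k:
        raise ValueError("k_prime must satisfy 1 <= k_prime <= k")
    if num_partitions * k_prime < k:
        raise ValueError("need num_partitions * k_prime >= k to return k elements")

    # Stage 1: strided partitioning (an assumption); element i goes to partition i % num_partitions.
    survivors = []
    for p in range(num_partitions):
        bucket = values[p::num_partitions]
        survivors.extend(heapq.nlargest(k_prime, bucket))

    # Stage 2: exact top-k over the much smaller survivor set.
    return heapq.nlargest(k, survivors)


def recall(approx, exact):
    """Fraction of the exact top-k values recovered by the approximate result."""
    return len(set(approx) & set(exact)) / len(exact)
```

In this sketch the second stage sees `num_partitions * k_prime` elements, so the two parameters together set the trade-off between second-stage cost and recall. An element of the true top-K is lost only when more than `k_prime` of them fall into the same partition.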
-
PatchDEMUX: A Certifiably Robust Framework for Multi-label Classifiers Against Adversarial Patches
Authors:
Dennis Jacob,
Chong Xiang,
Prateek Mittal
Abstract:
Deep learning techniques have enabled vast improvements in computer vision technologies. Nevertheless, these models are vulnerable to adversarial patch attacks which catastrophically impair performance. The physically realizable nature of these attacks calls for certifiable defenses, which feature provable guarantees on robustness. While certifiable defenses have been successfully applied to single-label classification, limited work has been done for multi-label classification. In this work, we present PatchDEMUX, a certifiably robust framework for multi-label classifiers against adversarial patches. Our approach is a generalizable method which can extend any existing certifiable defense for single-label classification; this is done by considering the multi-label classification task as a series of isolated binary classification problems to provably guarantee robustness. Furthermore, in the scenario where an attacker is limited to a single patch, we propose an additional certification procedure that can provide tighter robustness bounds. Using the current state-of-the-art (SOTA) single-label certifiable defense PatchCleanser as a backbone, we find that PatchDEMUX can achieve non-trivial robustness on the MS-COCO and PASCAL VOC datasets while maintaining high clean performance.
Submitted 30 May, 2025;
originally announced May 2025.
-
ImmunoDiff: A Diffusion Model for Immunotherapy Response Prediction in Lung Cancer
Authors:
Moinak Bhattacharya,
Judy Huang,
Amna F. Sher,
Gagandeep Singh,
Chao Chen,
Prateek Prasanna
Abstract:
Accurately predicting immunotherapy response in Non-Small Cell Lung Cancer (NSCLC) remains a critical unmet need. Existing radiomics and deep learning-based predictive models rely primarily on pre-treatment imaging to predict categorical response outcomes, limiting their ability to capture the complex morphological and textural transformations induced by immunotherapy. This study introduces ImmunoDiff, an anatomy-aware diffusion model designed to synthesize post-treatment CT scans from baseline imaging while incorporating clinically relevant constraints. The proposed framework integrates anatomical priors, specifically lobar and vascular structures, to enhance fidelity in CT synthesis. We also introduce a novel cbi-Adapter, a conditioning module that ensures pairwise-consistent multimodal integration of imaging and clinical data embeddings, to refine the generative process. Additionally, a clinical variable conditioning mechanism is introduced, leveraging demographic data, blood-based biomarkers, and PD-L1 expression to refine the generative process. Evaluations on an in-house NSCLC cohort treated with immune checkpoint inhibitors demonstrate a 21.24% improvement in balanced accuracy for response prediction and a 0.03 increase in c-index for survival prediction. Code will be released soon.
Submitted 29 May, 2025;
originally announced May 2025.
-
Matryoshka Model Learning for Improved Elastic Student Models
Authors:
Chetan Verma,
Aditya Srinivas Timmaraju,
Cho-Jui Hsieh,
Suyash Damle,
Ngot Bui,
Yang Zhang,
Wen Chen,
Xin Liu,
Prateek Jain,
Inderjit S Dhillon
Abstract:
Industry-grade ML models are carefully designed to meet rapidly evolving serving constraints, which requires significant resources for model development. In this paper, we propose MatTA, a framework for training multiple accurate Student models using a novel Teacher-TA-Student recipe. TA models are larger versions of the Student models with higher capacity, and thus allow Student models to better relate to the Teacher model and also bring in more domain-specific expertise. Furthermore, multiple accurate Student models can be extracted from the TA model. Therefore, despite only one training run, our methodology provides multiple servable options to trade off accuracy for lower serving cost. We demonstrate the proposed method, MatTA, on proprietary datasets and models. Its practical efficacy is underscored by live A/B tests within a production ML system, demonstrating 20% improvement on a key metric. We also demonstrate our method on GPT-2 Medium, a public model, and achieve relative improvements of over 24% on SAT Math and over 10% on the LAMBADA benchmark.
Submitted 2 June, 2025; v1 submitted 29 May, 2025;
originally announced May 2025.
-
Monotone Bounded-Depth Complexity of Homomorphism Polynomials
Authors:
C. S. Bhargav,
Shiteng Chen,
Radu Curticapean,
Prateek Dwivedi
Abstract:
For every fixed graph $H$, it is known that homomorphism counts from $H$ and colorful $H$-subgraph counts can be determined in $O(n^{t+1})$ time on $n$-vertex input graphs $G$, where $t$ is the treewidth of $H$. On the other hand, a running time of $n^{o(t / \log t)}$ would refute the exponential-time hypothesis. Komarath, Pandey and Rahul (Algorithmica, 2023) studied algebraic variants of these counting problems, i.e., homomorphism and subgraph $\textit{polynomials}$ for fixed graphs $H$. These polynomials are weighted sums over the objects counted above, where each object is weighted by the product of variables corresponding to edges contained in the object. As shown by Komarath et al., the $\textit{monotone}$ circuit complexity of the homomorphism polynomial for $H$ is $Θ(n^{\mathrm{tw}(H)+1})$.
In this paper, we characterize the power of monotone $\textit{bounded-depth}$ circuits for homomorphism and colorful subgraph polynomials. This leads us to discover a natural hierarchy of graph parameters $\mathrm{tw}_Δ(H)$, for fixed $Δ\in \mathbb N$, which capture the width of tree-decompositions for $H$ when the underlying tree is required to have depth at most $Δ$. We prove that monotone circuits of product-depth $Δ$ computing the homomorphism polynomial for $H$ require size $Θ(n^{\mathrm{tw}_Δ(H^{\dagger})+1})$, where $H^{\dagger}$ is the graph obtained from $H$ by removing all degree-$1$ vertices. This allows us to derive an optimal depth hierarchy theorem for monotone bounded-depth circuits through graph-theoretic arguments.
Submitted 28 May, 2025;
originally announced May 2025.
-
Deconfounded Warm-Start Thompson Sampling with Applications to Precision Medicine
Authors:
Prateek Jaiswal,
Esmaeil Keyvanshokooh,
Junyu Cao
Abstract:
Randomized clinical trials often require large patient cohorts before drawing definitive conclusions, yet abundant observational data from parallel studies remains underutilized due to confounding and hidden biases. To bridge this gap, we propose Deconfounded Warm-Start Thompson Sampling (DWTS), a practical approach that leverages a Doubly Debiased LASSO (DDL) procedure to identify a sparse set of reliable measured covariates and combines them with key hidden covariates to form a reduced context. By initializing Thompson Sampling (LinTS) priors with DDL-estimated means and variances on these measured features -- while keeping uninformative priors on hidden features -- DWTS effectively harnesses confounded observational data to kick-start adaptive clinical trials. Evaluated on both a purely synthetic environment and a virtual environment created using a real cardiovascular risk dataset, DWTS consistently achieves lower cumulative regret than standard LinTS, showing how offline causal insights from observational data can improve trial efficiency and support more personalized treatment decisions.
Submitted 22 May, 2025;
originally announced May 2025.
-
Large Language Models Implicitly Learn to See and Hear Just By Reading
Authors:
Prateek Verma,
Mert Pilanci
Abstract:
This paper presents a fascinating find: by training an auto-regressive LLM on text tokens, the text model inherently develops an internal ability to understand images and audio, thereby developing the ability to see and hear just by reading. Popular audio and visual LLMs fine-tune text LLMs to give text output conditioned on image and audio embeddings. On the other hand, our architecture takes in patches of images, audio waveforms or tokens as input. It gives us the embeddings or category labels typical of a classification pipeline. We show the generality of text weights in aiding audio classification for the datasets FSD-50K and GTZAN. Further, we show this working for image classification on CIFAR-10 and Fashion-MNIST, as well as on image patches. This pushes the notion of text-LLMs learning powerful internal circuits that can be utilized by activating necessary connections for various applications rather than training models from scratch every single time.
Submitted 20 May, 2025;
originally announced May 2025.
-
Joint Depth and Reflectivity Estimation using Single-Photon LiDAR
Authors:
Hashan K. Weerasooriya,
Prateek Chennuri,
Weijian Zhang,
Istvan Gyongy,
Stanley H. Chan
Abstract:
Single-Photon Light Detection and Ranging (SP-LiDAR) is emerging as a leading technology for long-range, high-precision 3D vision tasks. In SP-LiDAR, timestamps encode two complementary pieces of information: pulse travel time (depth) and the number of photons reflected by the object (reflectivity). Existing SP-LiDAR reconstruction methods typically recover depth and reflectivity separately or sequentially use one modality to estimate the other. Moreover, the conventional 3D histogram construction is effective mainly for slow-moving or stationary scenes. In dynamic scenes, however, it is more efficient and effective to directly process the timestamps. In this paper, we introduce an estimation method to simultaneously recover both depth and reflectivity in fast-moving scenes. We offer two contributions: (1) A theoretical analysis demonstrating the mutual correlation between depth and reflectivity and the conditions under which joint estimation becomes beneficial. (2) A novel reconstruction method, "SPLiDER", which exploits the shared information to enhance signal recovery. On both synthetic and real SP-LiDAR data, our method outperforms existing approaches, achieving superior joint reconstruction quality.
Submitted 19 May, 2025;
originally announced May 2025.
-
Leveraging Offline Data from Similar Systems for Online Linear Quadratic Control
Authors:
Shivam Bajaj,
Prateek Jaiswal,
Vijay Gupta
Abstract:
The "sim2real gap", in which the system learned in simulation is not an exact representation of the real system, can lead to loss of stability and performance when controllers learned using data from the simulated system are used on the real system. In this work, we address this challenge in the linear quadratic regulator (LQR) setting. Specifically, we consider an LQR problem for a system with unknown system matrices. Along with the state-action pairs from the system to be controlled, a trajectory of length $S$ of state-action pairs from a different unknown system is available. Our proposed algorithm is constructed upon Thompson sampling and utilizes the mean as well as the uncertainty of the dynamics of the system from which the trajectory of length $S$ is obtained. We establish that the algorithm achieves $\tilde{\mathcal{O}}({f(S,M_δ)\sqrt{T/S}})$ Bayes regret after $T$ time steps, where $M_δ$ characterizes the \emph{dissimilarity} between the two systems and $f(S,M_δ)$ is a function of $S$ and $M_δ$. When $M_δ$ is sufficiently small, the proposed algorithm achieves $\tilde{\mathcal{O}}({\sqrt{T/S}})$ Bayes regret and outperforms a naive strategy which does not utilize the available trajectory.
Submitted 13 May, 2025;
originally announced May 2025.
-
Semantic De-boosting in e-commerce Query Autocomplete
Authors:
Adithya Rajan,
Weiqi Tong,
Greg Sharp,
Prateek Verma,
Kevin Li
Abstract:
In e-commerce search, query autocomplete plays a critical role in helping users on their shopping journey. Often, query autocomplete presents users with semantically similar queries, which can impede the user's ability to find diverse and relevant results. This paper proposes a novel strategy to enhance this service by refining the presentation of typeahead suggestions based on their semantic similarity.
Our solution uniquely demotes semantically equivalent queries using an embedding similarity of query suggestions at runtime. This strategy ensures only distinct and varied queries are prioritized, thereby promoting more diverse suggestions for users. To maintain comprehensive query coverage, we incorporate this deduplication process within the query suggestion reranking step. This approach ensures that the broad spectrum of possible queries remains available to users, while eliminating the redundancy and repetitiveness in the suggestion list.
In extending this work, we propose using the distance between query embeddings to offer even more diverse suggestions to users using an algorithm similar to maximal marginal relevance. This approach will further ensure the delivery of non-redundant, unique, and pertinent suggestions to users, thus enriching their search experience.
We evaluated our method through rigorous A/B testing, demonstrating substantial improvements in key metrics. Notably, we observed a statistically significant rise in the search Add-to-Cart (ATC) rate, signifying enhanced user engagement and conversion rate. Furthermore, we observed a statistically significant decrease in clicks to ATC, implying that the feature improved the efficiency of the customer's product search journey. Finally, we also noticed a marked reduction in the null page view rate, indicating the increased pertinence and efficiency of user search sessions.
Submitted 12 May, 2025;
originally announced May 2025.
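The abstract above mentions reranking suggestions with an algorithm similar to maximal marginal relevance (MMR). The following is a minimal sketch of standard MMR over query embeddings, not the authors' production system: the function names, the `(query, relevance, embedding)` tuple layout, the cosine similarity measure, and the `trade_off` parameter are all assumptions introduced here for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_rerank(candidates, top_n, trade_off=0.7):
    """Greedy MMR: repeatedly pick the suggestion that is relevant yet dissimilar to those already picked.

    candidates: list of (query, relevance, embedding) tuples (hypothetical layout).
    trade_off: weight on relevance; (1 - trade_off) weights the redundancy penalty.
    """
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < top_n:

        def mmr_score(item):
            _, relevance, embedding = item
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max((cosine(embedding, s[2]) for s in selected), default=0.0)
            return trade_off * relevance - (1 - trade_off) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [query for query, _, _ in selected]
```

With `trade_off=1.0` the ranking reduces to plain relevance order. Lowering it demotes near-duplicate suggestions rather than dropping them, so they stay available further down the list.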
-
From Search To Sampling: Generative Models For Robust Algorithmic Recourse
Authors:
Prateek Garg,
Lokesh Nagalapatti,
Sunita Sarawagi
Abstract:
Algorithmic Recourse provides recommendations to individuals who are adversely impacted by automated model decisions, on how to alter their profiles to achieve a favorable outcome. Effective recourse methods must balance three conflicting goals: proximity to the original profile to minimize cost, plausibility for realistic recourse, and validity to ensure the desired outcome. We show that existing methods train for these objectives separately and then search for recourse through a joint optimization over the recourse goals during inference, leading to poor recourse recommendations. We introduce GenRe, a generative recourse model designed to train the three recourse objectives jointly. Training such generative models is non-trivial due to the lack of direct recourse supervision. We propose efficient ways to synthesize such supervision and further show that GenRe's training leads to a consistent estimator. Unlike most prior methods, which employ non-robust gradient-descent-based search during inference, GenRe simply performs forward sampling over the generative model to produce minimum-cost recourse, leading to superior performance across multiple metrics. We also demonstrate that GenRe provides the best trade-off between cost, plausibility, and validity compared to state-of-the-art baselines. Our code is available at: https://github.com/prateekgargx/genre.
Submitted 12 May, 2025;
originally announced May 2025.
-
A Large Language Model for Feasible and Diverse Population Synthesis
Authors:
Sung Yoo Lim,
Hyunsoo Yun,
Prateek Bansal,
Dong-Kyu Kim,
Eui-Jin Kim
Abstract:
Generating a synthetic population that is both feasible and diverse is crucial for ensuring the validity of downstream activity schedule simulation in activity-based models (ABMs). While deep generative models (DGMs), such as variational autoencoders and generative adversarial networks, have been applied to this task, they often struggle to balance the inclusion of rare but plausible combinations (i.e., sampling zeros) with the exclusion of implausible ones (i.e., structural zeros). To improve feasibility while maintaining diversity, we propose a fine-tuning method for large language models (LLMs) that explicitly controls the autoregressive generation process through topological orderings derived from a Bayesian Network (BN). Experimental results show that our hybrid LLM-BN approach outperforms both traditional DGMs and proprietary LLMs (e.g., ChatGPT-4o) with few-shot learning. Specifically, our approach achieves approximately 95% feasibility, significantly higher than the ~80% observed in DGMs, while maintaining comparable diversity, making it well-suited for practical applications. Importantly, the method is based on a lightweight open-source LLM, enabling fine-tuning and inference on standard personal computing environments. This makes the approach cost-effective and scalable for large-scale applications, such as synthesizing populations in megacities, without relying on expensive infrastructure. By initiating the ABM pipeline with high-quality synthetic populations, our method improves overall simulation reliability and reduces downstream error propagation. The source code for these methods is available for research and practical application.
Submitted 7 May, 2025;
originally announced May 2025.
-
FairPO: Robust Preference Optimization for Fair Multi-Label Learning
Authors:
Soumen Kumar Mondal,
Akshit Varmora,
Prateek Chanda,
Ganesh Ramakrishnan
Abstract:
We propose FairPO, a novel framework designed to promote fairness in multi-label classification by directly optimizing preference signals with a group robustness perspective. In our framework, the set of labels is partitioned into privileged and non-privileged groups, and a preference-based loss inspired by Direct Preference Optimization (DPO) is employed to more effectively differentiate true positive labels from confusing negatives within the privileged group, while preserving baseline classification performance for non-privileged labels. By framing the learning problem as a robust optimization over groups, our approach dynamically adjusts the training emphasis toward groups with poorer performance, thereby mitigating bias and ensuring a fairer treatment across diverse label categories. In addition, we outline plans to extend this approach by investigating alternative loss formulations such as Simple Preference Optimisation (SimPO) and Contrastive Preference Optimization (CPO) to exploit reference-free reward formulations and contrastive training signals. Furthermore, we plan to extend FairPO with multilabel generation capabilities, enabling the model to dynamically generate diverse and coherent label sets for ambiguous inputs.
Submitted 16 May, 2025; v1 submitted 5 May, 2025;
originally announced May 2025.
-
Early Exit and Multi Stage Knowledge Distillation in VLMs for Video Summarization
Authors:
Anas Anwarul Haq Khan,
Utkarsh Verma,
Prateek Chanda,
Ganesh Ramakrishnan
Abstract:
We introduce DEEVISum (Distilled Early Exit Vision language model for Summarization), a lightweight, efficient, and scalable vision language model designed for segment-wise video summarization. Leveraging multimodal prompts that combine textual and audio-derived signals, DEEVISum incorporates Multi Stage Knowledge Distillation (MSKD) and Early Exit (EE) to strike a balance between performance and efficiency. MSKD offers a 1.33% absolute F1 improvement over baseline distillation (0.5%), while EE reduces inference time by approximately 21% with a 1.3 point drop in F1. Evaluated on the TVSum dataset, our best model PaLI Gemma2 3B + MSKD achieves an F1 score of 61.1, competitive with significantly larger models while maintaining a lower computational footprint. We publicly release our code and processed dataset to support further research.
Submitted 30 April, 2025;
originally announced April 2025.
-
Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
Authors:
Prateek Chhikara,
Dev Khant,
Saket Aryan,
Taranjeet Singh,
Deshraj Yadav
Abstract:
Large Language Models (LLMs) have demonstrated remarkable prowess in generating contextually coherent responses, yet their fixed context windows pose fundamental challenges for maintaining consistency over prolonged multi-session dialogues. We introduce Mem0, a scalable memory-centric architecture that addresses this issue by dynamically extracting, consolidating, and retrieving salient information from ongoing conversations. Building on this foundation, we further propose an enhanced variant that leverages graph-based memory representations to capture complex relational structures among conversational elements. Through comprehensive evaluations on the LOCOMO benchmark, we systematically compare our approaches against six baseline categories: (i) established memory-augmented systems, (ii) retrieval-augmented generation (RAG) with varying chunk sizes and k-values, (iii) a full-context approach that processes the entire conversation history, (iv) an open-source memory solution, (v) a proprietary model system, and (vi) a dedicated memory management platform. Empirical results show that our methods consistently outperform all existing memory systems across four question categories: single-hop, temporal, multi-hop, and open-domain. Notably, Mem0 achieves 26% relative improvements in the LLM-as-a-Judge metric over OpenAI, while Mem0 with graph memory achieves around 2% higher overall score than the base configuration. Beyond accuracy gains, we also markedly reduce computational overhead compared to the full-context method. In particular, Mem0 attains a 91% lower p95 latency and saves more than 90% token cost, offering a compelling balance between advanced reasoning capabilities and practical deployment constraints. Our findings highlight the critical role of structured, persistent memory mechanisms for long-term conversational coherence, paving the way for more reliable and efficient LLM-driven AI agents.
Submitted 27 April, 2025;
originally announced April 2025.
-
Platonic Grounding for Efficient Multimodal Language Models
Authors:
Moulik Choraria,
Xinbo Wu,
Akhil Bhimaraju,
Nitesh Sekhar,
Yue Wu,
Xu Zhang,
Prateek Singhal,
Lav R. Varshney
Abstract:
The hyperscaling of data and parameter count in Transformer-based models is yielding diminishing performance improvement, especially when weighed against training costs. Such plateauing indicates the importance of methods for more efficient finetuning and inference, while retaining similar performance. This is especially relevant for multimodal learning paradigms, where inference costs of processing multimodal tokens can determine the model's practical viability. At the same time, research on representations and mechanistic interpretability has improved our understanding of the inner workings of Transformer-based models; one such line of work reveals an implicit alignment in the deeper layers of pretrained models, across modalities. Taking inspiration from this, we motivate and propose a simple modification to existing multimodal frameworks that rely on aligning pretrained models. We demonstrate that our approach maintains and, in some cases, even improves performance of baseline methods while achieving significant gains in both training and inference-time compute. Our work also has implications for combining pretrained models into larger systems efficiently.
Submitted 27 April, 2025;
originally announced April 2025.
-
Efficient Single-Pass Training for Multi-Turn Reasoning
Authors:
Ritesh Goru,
Shanay Mehta,
Prateek Jain
Abstract:
Training Large Language Models (LLMs) to generate explicit reasoning before they produce an answer has been shown to improve their performance across various tasks such as mathematics and coding. However, fine-tuning LLMs on multi-turn reasoning datasets presents a unique challenge: LLMs must generate reasoning tokens that are excluded from subsequent inputs to the LLM. This discrepancy prevents us from processing an entire conversation in a single forward pass, an optimization readily available when we fine-tune on a multi-turn non-reasoning dataset. This paper proposes a novel approach that overcomes this limitation through response token duplication and a custom attention mask that enforces appropriate visibility constraints. Our approach significantly reduces the training time and allows efficient fine-tuning on multi-turn reasoning datasets.
Submitted 25 April, 2025;
originally announced April 2025.
-
Program Skeletons for Automated Program Translation
Authors:
Bo Wang,
Tianyu Li,
Ruishi Li,
Umang Mathur,
Prateek Saxena
Abstract:
Translating software between programming languages is a challenging task, for which automated techniques have been elusive and hard to scale up to larger programs. A key difficulty in cross-language translation is that one has to re-express the intended behavior of the source program into idiomatic constructs of a different target language. This task needs abstracting away from the source language-specific details, while keeping the overall functionality the same. In this work, we propose a novel and systematic approach for making such translation amenable to automation based on a framework we call program skeletons. A program skeleton retains the high-level structure of the source program by abstracting away and effectively summarizing lower-level concrete code fragments, which can be mechanically translated to the target programming language. A skeleton, by design, permits many different ways of filling in the concrete implementation for fragments, which can work in conjunction with existing data-driven code synthesizers. Most importantly, skeletons can conceptually enable sound decomposition, i.e., if each individual fragment is correctly translated, taken together with the mechanically translated skeleton, the final translated program is deemed to be correct as a whole. We present a prototype system called Skel embodying the idea of skeleton-based translation from Python to JavaScript. Our results show promising scalability compared to prior works. For 9 real-world Python programs, some with more than about 1k lines of code, 95% of their code fragments can be automatically translated, while about 5% require manual effort. All the final translations are correct with respect to whole-program test suites.
Submitted 22 April, 2025; v1 submitted 10 April, 2025;
originally announced April 2025.
-
Sculpting Subspaces: Constrained Full Fine-Tuning in LLMs for Continual Learning
Authors:
Nikhil Shivakumar Nayak,
Krishnateja Killamsetty,
Ligong Han,
Abhishek Bhandwaldar,
Prateek Chanda,
Kai Xu,
Hao Wang,
Aldo Pareja,
Oleg Silkin,
Mustafa Eyceoz,
Akash Srivastava
Abstract:
Continual learning in large language models (LLMs) is prone to catastrophic forgetting, where adapting to new tasks significantly degrades performance on previously learned ones. Existing methods typically rely on low-rank, parameter-efficient updates that limit the model's expressivity and introduce additional parameters per task, leading to scalability issues. To address these limitations, we propose a novel continual full fine-tuning approach leveraging adaptive singular value decomposition (SVD). Our method dynamically identifies task-specific low-rank parameter subspaces and constrains updates to be orthogonal to critical directions associated with prior tasks, thus effectively minimizing interference without additional parameter overhead or storing previous task gradients. We evaluate our approach extensively on standard continual learning benchmarks using both encoder-decoder (T5-Large) and decoder-only (LLaMA-2 7B) models, spanning diverse tasks including classification, generation, and reasoning. Empirically, our method achieves state-of-the-art results, up to 7% higher average accuracy than recent baselines like O-LoRA, and notably maintains the model's general linguistic capabilities, instruction-following accuracy, and safety throughout the continual learning process by reducing forgetting to near-negligible levels. Our adaptive SVD framework effectively balances model plasticity and knowledge retention, providing a practical, theoretically grounded, and computationally scalable solution for continual learning scenarios in large language models.
Submitted 9 April, 2025;
originally announced April 2025.
-
BrainMRDiff: A Diffusion Model for Anatomically Consistent Brain MRI Synthesis
Authors:
Moinak Bhattacharya,
Saumya Gupta,
Annie Singh,
Chao Chen,
Gagandeep Singh,
Prateek Prasanna
Abstract:
Accurate brain tumor diagnosis relies on the assessment of multiple Magnetic Resonance Imaging (MRI) sequences. However, in clinical practice, the acquisition of certain sequences may be affected by factors like motion artifacts or contrast agent contraindications, leading to suboptimal outcomes, such as poor image quality. This can then affect image interpretation by radiologists. Synthesizing high-quality MRI sequences has thus become a critical research focus. Though recent advancements in controllable generative AI have facilitated the synthesis of diagnostic-quality MRI, ensuring anatomical accuracy remains a significant challenge. Preserving critical structural relationships between different anatomical regions is essential, as even minor structural or topological inconsistencies can compromise diagnostic validity. In this work, we propose BrainMRDiff, a novel topology-preserving, anatomy-guided diffusion model for synthesizing brain MRI, leveraging brain and tumor anatomies as conditioning inputs. To achieve this, we introduce two key modules: Tumor+Structure Aggregation (TSA) and Topology-Guided Anatomy Preservation (TGAP). TSA integrates diverse anatomical structures with tumor information, forming a comprehensive conditioning mechanism for the diffusion process. TGAP enforces topological consistency during the reverse denoising diffusion process. Together, these modules ensure that the generated image respects anatomical integrity. Experimental results demonstrate that BrainMRDiff surpasses existing baselines, achieving performance improvements of 23.33% on the BraTS-AG dataset and 33.33% on the BraTS-Met dataset. Code will be made publicly available soon.
Submitted 29 May, 2025; v1 submitted 6 April, 2025;
originally announced April 2025.
-
GECKO: Gigapixel Vision-Concept Contrastive Pretraining in Histopathology
Authors:
Saarthak Kapse,
Pushpak Pati,
Srikar Yellapragada,
Srijan Das,
Rajarsi R. Gupta,
Joel Saltz,
Dimitris Samaras,
Prateek Prasanna
Abstract:
Pretraining a Multiple Instance Learning (MIL) aggregator enables the derivation of Whole Slide Image (WSI)-level embeddings from patch-level representations without supervision. While recent multimodal MIL pretraining approaches leveraging auxiliary modalities have demonstrated performance gains over unimodal WSI pretraining, the acquisition of these additional modalities necessitates extensive clinical profiling. This requirement increases costs and limits scalability in existing WSI datasets lacking such paired modalities. To address this, we propose Gigapixel Vision-Concept Knowledge Contrastive pretraining (GECKO), which aligns WSIs with a Concept Prior derived from the available WSIs. First, we derive an inherently interpretable concept prior by computing the similarity between each WSI patch and textual descriptions of predefined pathology concepts. GECKO then employs a dual-branch MIL network: one branch aggregates patch embeddings into a WSI-level deep embedding, while the other aggregates the concept prior into a corresponding WSI-level concept embedding. Both aggregated embeddings are aligned using a contrastive objective, thereby pretraining the entire dual-branch MIL model. Moreover, when auxiliary modalities such as transcriptomics data are available, GECKO seamlessly integrates them. Across five diverse tasks, GECKO consistently outperforms prior unimodal and multimodal pretraining approaches while also delivering clinically meaningful interpretability that bridges the gap between computational models and pathology expertise. Code is made available at https://github.com/bmi-imaginelab/GECKO
Submitted 1 April, 2025;
originally announced April 2025.
-
Supercooled Confinement
Authors:
Prateek Agrawal,
Gaurang Ramakant Kane,
Vazha Loladze,
Mario Reig
Abstract:
We study general properties of confinement phase transitions in the early universe. An observable gravitational wave signal from such transitions requires significant supercooling. However, in almost all understood examples of confining gauge theories, the degree of supercooling is too small to give interesting gravitational wave signals. We review and highlight the evidence for why supercooling is not generic in confining gauge theories. The exceptions are Randall-Sundrum models, which define a strongly coupled gauge theory holographically by a 5D gravitational theory. We construct a simple illustrative model of a 4D gauge theory inspired by features of the Randall-Sundrum model. It is a large-$N$ gauge theory in the conformal window coupled to a weakly coupled scalar field, which undergoes a supercooled phase transition that breaks the conformal symmetry and triggers confinement. We show that there are interesting features in the gravitational wave spectra that can carry the imprint of the confining gauge theory.
Submitted 31 March, 2025;
originally announced April 2025.
-
Effectively Controlling Reasoning Models through Thinking Intervention
Authors:
Tong Wu,
Chong Xiang,
Jiachen T. Wang,
G. Edward Suh,
Prateek Mittal
Abstract:
Reasoning-enhanced large language models (LLMs) explicitly generate intermediate reasoning steps prior to generating final answers, helping the model excel in complex problem-solving. In this paper, we demonstrate that this emerging generation framework offers a unique opportunity for more fine-grained control over model behavior. We propose Thinking Intervention, a novel paradigm designed to explicitly guide the internal reasoning processes of LLMs by strategically inserting or revising specific thinking tokens. We find that the Thinking Intervention paradigm enhances the capabilities of reasoning models across a wide range of tasks, including instruction following on IFEval and Overthinking, instruction hierarchy on SEP, and safety alignment on XSTest and SorryBench. Our results demonstrate that Thinking Intervention significantly outperforms baseline prompting approaches, achieving up to 6.7% accuracy gains in instruction-following scenarios, 15.4% improvements in reasoning about instruction hierarchies, and a 40.0% increase in refusal rates for unsafe prompts using open-source DeepSeek R1 models. Overall, our work opens a promising new research avenue for controlling reasoning LLMs.
Submitted 21 May, 2025; v1 submitted 31 March, 2025;
originally announced March 2025.
-
How to set up a psychedelic study: Unique considerations for research involving human participants
Authors:
Marcus J. Glennon,
Catherine I. V. Bird,
Prateek Yadav,
Patrick Kleine,
Shayam Suseelan,
Christina Boman-Markaki,
Vasileia Kotoula,
Matt Butler,
Robert Leech,
Leor Roseman,
David Erritzoe,
Deepak P. Srivastava,
Celia Morgan,
Christopher Timmermann,
Greg Cooper,
Jeremy I. Skipper,
James Rucker,
Sunjeev K. Kamboj,
Mitul A. Mehta,
Ravi K. Das,
Anjali Bhat
Abstract:
Setting up a psychedelic study can be a long, arduous, and kafkaesque process. This rapidly-developing field poses several unique challenges for researchers, necessitating a range of considerations that have not yet been standardised. Many of the complexities inherent to psychedelic research also challenge existing assumptions around, for example, approaches to psychiatric prescribing, the conceptual framing of the placebo effect, and definitions of selfhood. This review paper brings together several of the major psychedelic research teams across the United Kingdom to formalise these unique considerations, identify continuing areas of debate, and provide a practical, experience-based guide, with recommendations for policymakers and future researchers intending to set up a psychedelic research study or clinical trial. We approach this such that the paper can either be read end to end, or treated as a manual: readers can dip into relevant sections as needed.
Submitted 18 April, 2025; v1 submitted 28 March, 2025;
originally announced March 2025.
-
Anvil: A General-Purpose Timing-Safe Hardware Description Language
Authors:
Jason Zhijingcheng Yu,
Aditya Ranjan Jha,
Umang Mathur,
Trevor E. Carlson,
Prateek Saxena
Abstract:
Hardware designs routinely use stateless signals which change with their underlying registers. Unintended behaviours arise when a register is mutated even when its dependent signals are expected to remain stable (unchanged). Such timing hazards are common because, with a few exceptions, existing HDLs lack the abstraction for stable values and delegate this responsibility to hardware designers, who then have to carefully decide whether a value remains unchanged, sometimes even across hardware modules. This paper proposes Anvil, an HDL which statically prevents timing hazards with a novel type system. Anvil is the only HDL we know of that guarantees timing safety without sacrificing expressiveness for cycle-level timing control or dynamic timing behaviours. Instead of abstracting away differences between registers and signals, Anvil's type system exposes them fully but captures timing relationships between register mutations and signal usages for enforcing timing safety. This, in turn, enables safe composition of communicating hardware modules by static enforcement of timing contracts that encode timing constraints on shared signals. Such timing contracts can be specified parametric on abstract time points that can vary during run-time, allowing the type system to statically express dynamic timing behaviour. We have implemented Anvil and successfully used it for implementing key timing-sensitive modules in an open-source RISC-V CPU, which demonstrates its expressiveness and practicality.
Submitted 25 March, 2025;
originally announced March 2025.
-
Variational inference for hierarchical models with conditional scale and skewness corrections
Authors:
Lucas Kock,
Linda S. L. Tan,
Prateek Bansal,
David J. Nott
Abstract:
Gaussian variational approximations are widely used for summarizing posterior distributions in Bayesian models, especially in high-dimensional settings. However, a drawback of such approximations is the inability to capture skewness or more complex features of the posterior. Recent work suggests applying skewness corrections to existing Gaussian or other symmetric approximations to address this limitation. We propose to incorporate the skewness correction into the definition of an approximating variational family. We consider approximating the posterior for hierarchical models, in which there are ``global'' and ``local'' parameters. A baseline variational approximation is defined as the product of a Gaussian marginal posterior for global parameters and a Gaussian conditional posterior for local parameters given the global ones. Skewness corrections are then considered. The adjustment of the conditional posterior term for local variables is adaptive to the global parameter value. Optimization of baseline variational parameters is performed jointly with the skewness correction. Our approach allows the location, scale and skewness to be captured separately, without using additional parameters for skewness adjustments. The proposed method substantially improves accuracy for only a modest increase in computational cost compared to state-of-the-art Gaussian approximations. Good performance is demonstrated in generalized linear mixed models and multinomial logit discrete choice models.
Submitted 25 March, 2025; v1 submitted 23 March, 2025;
originally announced March 2025.
-
OmniLearn: A Framework for Distributed Deep Learning over Heterogeneous Clusters
Authors:
Sahil Tyagi,
Prateek Sharma
Abstract:
Deep learning systems are optimized for clusters with homogeneous resources. However, heterogeneity is prevalent in computing infrastructure across edge, cloud and HPC. When training neural networks using stochastic gradient descent techniques on heterogeneous resources, performance degrades due to stragglers and stale updates. In this work, we develop an adaptive batch-scaling framework called OmniLearn to mitigate the effects of heterogeneity in distributed training. Our approach is inspired by proportional controllers to balance computation across heterogeneous servers, and works under varying resource availability. By dynamically adjusting worker mini-batches at runtime, OmniLearn reduces training time by 14-85%. We also investigate asynchronous training, where our techniques improve accuracy by up to 6.9%.
Submitted 21 March, 2025;
originally announced March 2025.
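The OmniLearn abstract above describes balancing computation across heterogeneous workers by adjusting mini-batch sizes with a proportional controller. The following is only a minimal sketch of that general idea, not the authors' implementation: the function name `rebalance_batches`, the `gain` parameter, and the choice to keep the global batch size fixed are all assumptions made for illustration.

```python
def rebalance_batches(batch_sizes, iter_times, gain=0.5, min_batch=1):
    """One proportional-control step over per-worker mini-batch sizes.

    Hypothetical sketch: workers that finished faster than the average get a
    larger batch, stragglers get a smaller one, and the global batch size
    (the sum over workers) is kept constant.
    """
    if not batch_sizes or len(batch_sizes) != len(iter_times):
        raise ValueError("need one iteration time per worker")

    total = sum(batch_sizes)
    mean_time = sum(iter_times) / len(iter_times)

    # Relative error is positive for fast workers and negative for stragglers.
    proposed = [
        max(min_batch, b * (1.0 + gain * (mean_time - t) / mean_time))
        for b, t in zip(batch_sizes, iter_times)
    ]

    # Rescale so the global batch size is preserved, then round to integers.
    scale = total / sum(proposed)
    new_sizes = [max(min_batch, round(p * scale)) for p in proposed]

    # Absorb any rounding drift on the worker with the largest batch.
    largest = max(range(len(new_sizes)), key=new_sizes.__getitem__)
    new_sizes[largest] += total - sum(new_sizes)
    return new_sizes
```

For example, with two workers at 64 samples each and measured iteration times of 1.0 s and 3.0 s, one step with `gain=0.5` shifts the split to 80 and 48 while the global batch stays at 128. In practice such a step would be repeated as new timings arrive, and a smaller gain trades slower convergence for less oscillation.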
-
Dynamics of Transition Metal Ion Transport in High-Gradient Magnetic Fields
Authors:
Prateek Benhal,
Muhammad Garba,
Jamel Ali,
Theo Siegrist,
Munir Humayun,
Hadi Mohammadigoushki
Abstract:
Magnetic separation has emerged as an eco-friendly and sustainable technique with applications in water purification, chemical separation, biochemistry, medicine, and mining. In this study, we present a combined experimental and theoretical investigation of the transport of transition metal ions using high-gradient magnetic fields. Experiments were conducted on aqueous solutions containing either paramagnetic manganese chloride (MnCl$_2$) or diamagnetic zinc chloride (ZnCl$_2$) ions, with concentrations ranging from 1 mM to 100 mM under a non-uniform magnetic field of an electromagnet. Our results demonstrate that while paramagnetic MnCl$_2$ is captured by the mesh wool in the magnetic field, diamagnetic ZnCl$_2$ remains unaffected by the presence of the magnetic field. The capture efficiency of paramagnetic MnCl$_2$ increases with both the initial ion concentration and the applied magnetic field strength. Furthermore, in binary mixtures, the capture rate of MnCl$_2$ is reduced compared to single-ion solutions, highlighting the role of ion interactions in magnetic separation. Our theoretical modeling indicates that magnetic capture is governed by a balance between magnetic forces and viscous forces. Additionally, the magnetic separation process is enhanced by the field-induced cluster formation of paramagnetic metal ions, which are predicted to be two orders of magnitude larger than individual hydrated ion units. These findings provide insights into the mechanisms of magnetic transport of metal ions and offer potential pathways for improving separation efficiency in complex ion mixtures that contain critical materials.
Submitted 21 March, 2025;
originally announced March 2025.
-
The Deployment of End-to-End Audio Language Models Should Take into Account the Principle of Least Privilege
Authors:
Luxi He,
Xiangyu Qi,
Michel Liao,
Inyoung Cheong,
Prateek Mittal,
Danqi Chen,
Peter Henderson
Abstract:
We are at a turning point for language models that accept audio input. The latest end-to-end audio language models (Audio LMs) process speech directly instead of relying on a separate transcription step. This shift preserves detailed information, such as intonation or the presence of multiple speakers, that would otherwise be lost in transcription. However, it also introduces new safety risks, including the potential misuse of speaker identity cues and other sensitive vocal attributes, which could have legal implications. In this position paper, we urge a closer examination of how these models are built and deployed. We argue that the principle of least privilege should guide decisions on whether to deploy cascaded or end-to-end models. Specifically, evaluations should assess (1) whether end-to-end modeling is necessary for a given application; and (2) the appropriate scope of information access. Finally, we highlight related gaps in current audio LM benchmarks and identify key open research questions, both technical and policy-related, that must be addressed to enable the responsible deployment of end-to-end Audio LMs.
Submitted 21 March, 2025;
originally announced March 2025.
-
Real AI Agents with Fake Memories: Fatal Context Manipulation Attacks on Web3 Agents
Authors:
Atharv Singh Patlan,
Peiyao Sheng,
S. Ashwin Hebbar,
Prateek Mittal,
Pramod Viswanath
Abstract:
The integration of AI agents with Web3 ecosystems harnesses their complementary potential for autonomy and openness yet also introduces underexplored security risks, as these agents dynamically interact with financial protocols and immutable smart contracts. This paper investigates the vulnerabilities of AI agents within blockchain-based financial ecosystems when exposed to adversarial threats in real-world scenarios. We introduce the concept of context manipulation, a comprehensive attack vector that exploits unprotected context surfaces, including input channels, memory modules, and external data feeds.
Through empirical analysis of ElizaOS, a decentralized AI agent framework for automated Web3 operations, we demonstrate how adversaries can manipulate context by injecting malicious instructions into prompts or historical interaction records, leading to unintended asset transfers and protocol violations which could be financially devastating.
To quantify these vulnerabilities, we design CrAIBench, a Web3 domain-specific benchmark that evaluates the robustness of AI agents against context manipulation attacks across 150+ realistic blockchain tasks, including token transfers, trading, bridges, and cross-chain interactions, and 500+ attack test cases using context manipulation. We systematically assess attack and defense strategies, analyzing factors like the influence of security prompts, reasoning models, and the effectiveness of alignment techniques.
Our findings show that prompt-based defenses are insufficient when adversaries corrupt stored context, achieving significant attack success rates despite these defenses. Fine-tuning-based defenses offer a more robust alternative, substantially reducing attack success rates while preserving utility on single-step tasks. This research highlights the urgent need to develop AI agents that are both secure and fiduciarily responsible.
Submitted 30 April, 2025; v1 submitted 20 March, 2025;
originally announced March 2025.
-
Probabilistic Delay Forecasting in 5G Using Recurrent and Attention-Based Architectures
Authors:
Samie Mostafavi,
Gourav Prateek Sharma,
Ahmad Traboulsi,
James Gross
Abstract:
With the emergence of new application areas such as cyber-physical systems and human-in-the-loop applications, ensuring a specific level of end-to-end network latency with high reliability (e.g., 99.9%) is becoming increasingly critical. To align wireless links with these reliability requirements, it is essential to analyze and control network latency in terms of its full probability distribution. However, in a wireless link, the distribution may vary over time, making this task particularly challenging. We propose predicting the latency distribution using state-of-the-art data-driven techniques that leverage historical network information. Our approach tokenizes network state information and processes it using temporal deep-learning architectures, namely LSTM and Transformer models, to capture both short- and long-term delay dependencies. These models output parameters for a chosen parametric density via a mixture density network with Gaussian mixtures, yielding multi-step probabilistic forecasts of future delays. To validate our proposed approach, we implemented and tested these methods using a time-synchronized, SDR-based OpenAirInterface 5G testbed to collect and preprocess network-delay data. Our experiments show that the Transformer model achieves lower negative log-likelihood and mean absolute error than both LSTM and feed-forward baselines in challenging scenarios, while also providing insights into model complexity and training/inference overhead. This framework enables more informed decision-making for adaptive scheduling and resource allocation, paving the way toward enhanced QoS in evolving 5G and 6G networks.
Submitted 19 March, 2025;
originally announced March 2025.
-
Pathology Image Compression with Pre-trained Autoencoders
Authors:
Srikar Yellapragada,
Alexandros Graikos,
Kostas Triaridis,
Zilinghan Li,
Tarak Nath Nandi,
Ravi K Madduri,
Prateek Prasanna,
Joel Saltz,
Dimitris Samaras
Abstract:
The growing volume of high-resolution Whole Slide Images in digital histopathology poses significant storage, transmission, and computational efficiency challenges. Standard compression methods, such as JPEG, reduce file sizes but often fail to preserve fine-grained phenotypic details critical for downstream tasks. In this work, we repurpose autoencoders (AEs) designed for Latent Diffusion Models as an efficient learned compression framework for pathology images. We systematically benchmark three AE models with varying compression levels and evaluate their reconstruction ability using pathology foundation models. We introduce a fine-tuning strategy to further enhance reconstruction fidelity that optimizes a pathology-specific learned perceptual metric. We validate our approach on downstream tasks, including segmentation, patch classification, and multiple instance learning, showing that replacing images with AE-compressed reconstructions leads to minimal performance degradation. Additionally, we propose a K-means clustering-based quantization method for AE latents, improving storage efficiency while maintaining reconstruction quality. We provide the weights of the fine-tuned autoencoders at https://huggingface.co/collections/StonyBrook-CVLab/pathology-fine-tuned-aes-67d45f223a659ff2e3402dd0.
Submitted 14 March, 2025;
originally announced March 2025.
-
Drums of high width
Authors:
Alex Davies,
Prateek Gupta,
Sebastien Racaniere,
Grzegorz Swirszcz,
Adam Zsolt Wagner,
Theophane Weber,
Geordie Williamson
Abstract:
We provide a family of $5$-dimensional prismatoids whose width grows linearly in the number of vertices. This provides a new infinite family of counter-examples to the Hirsch conjecture whose excess width grows linearly in the number of vertices, and answers a question of Matschke, Santos and Weibel.
Submitted 12 March, 2025;
originally announced March 2025.
-
Design and Implementation of FourCropNet: A CNN-Based System for Efficient Multi-Crop Disease Detection and Management
Authors:
H. P. Khandagale,
Sangram Patil,
V. S. Gavali,
S. V. Chavan,
P. P. Halkarnikar,
Prateek A. Meshram
Abstract:
Plant disease detection is a critical task in agriculture, directly impacting crop yield, food security, and sustainable farming practices. This study proposes FourCropNet, a novel deep learning model designed to detect diseases in multiple crops, including CottonLeaf, Grape, Soybean, and Corn. The model leverages an advanced architecture comprising residual blocks for efficient feature extraction, attention mechanisms to enhance focus on disease-relevant regions, and lightweight layers for computational efficiency. These components collectively enable FourCropNet to achieve superior performance across varying datasets and class complexities, from single-crop datasets to combined datasets with 15 classes. The proposed model was evaluated on diverse datasets, demonstrating high accuracy, specificity, sensitivity, and F1 scores. Notably, FourCropNet achieved the highest accuracy of 99.7% for Grape, 99.5% for Corn, and 95.3% for the combined dataset. Its scalability and ability to generalize across datasets underscore its robustness. Comparative analysis shows that FourCropNet consistently outperforms state-of-the-art models such as MobileNet, VGG16, and EfficientNet across various metrics. FourCropNet's innovative design and consistent performance make it a reliable solution for real-time disease detection in agriculture. This model has the potential to assist farmers in timely disease diagnosis, reducing economic losses and promoting sustainable agricultural practices.
Submitted 11 March, 2025;
originally announced March 2025.
-
Privacy Auditing of Large Language Models
Authors:
Ashwinee Panda,
Xinyu Tang,
Milad Nasr,
Christopher A. Choquette-Choo,
Prateek Mittal
Abstract:
Current techniques for privacy auditing of large language models (LLMs) have limited efficacy -- they rely on basic approaches to generate canaries which leads to weak membership inference attacks that in turn give loose lower bounds on the empirical privacy leakage. We develop canaries that are far more effective than those used in prior work under threat models that cover a range of realistic settings. We demonstrate through extensive experiments on multiple families of fine-tuned LLMs that our approach sets a new standard for detection of privacy leakage. For measuring the memorization rate of non-privately trained LLMs, our designed canaries surpass prior approaches. For example, on the Qwen2.5-0.5B model, our designed canaries achieve $49.6\%$ TPR at $1\%$ FPR, vastly surpassing the prior approach's $4.2\%$ TPR at $1\%$ FPR. Our method can be used to provide a privacy audit of $\varepsilon \approx 1$ for a model trained with theoretical $\varepsilon$ of 4. To the best of our knowledge, this is the first time that a privacy audit of LLM training has achieved nontrivial auditing success in the setting where the attacker cannot train shadow models, insert gradient canaries, or access the model at every iteration.
Submitted 9 March, 2025;
originally announced March 2025.
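The privacy-auditing abstract above reports results as a true positive rate (TPR) at a fixed false positive rate (FPR). As a rough illustration of how that metric is computed for a membership inference attack, here is a minimal sketch in plain Python. The function name `tpr_at_fpr` and the convention that a higher score means "more likely a training member" are assumptions for this example, not details taken from the paper.

```python
def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.01):
    """TPR at the most permissive threshold whose FPR stays within target_fpr.

    Hypothetical sketch: a sample is predicted "member" when its attack score
    is strictly greater than the threshold.
    """
    if not member_scores or not nonmember_scores:
        raise ValueError("both score lists must be non-empty")

    # How many non-members we are allowed to flag by mistake.
    allowed_fp = int(target_fpr * len(nonmember_scores))

    # Take the (allowed_fp + 1)-th largest non-member score as the threshold,
    # so at most allowed_fp non-members lie strictly above it.
    ranked = sorted(nonmember_scores, reverse=True)
    threshold = ranked[allowed_fp] if allowed_fp < len(ranked) else float("-inf")

    true_pos = sum(score > threshold for score in member_scores)
    return true_pos / len(member_scores)
```

With 100 non-member scores and `target_fpr=0.01`, exactly one non-member may exceed the threshold, and the returned value is the fraction of canaries (members) scoring above it. Small non-member sets make low-FPR estimates coarse, so the number of non-members bounds the FPR resolution.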
-
Enhancing SAM with Efficient Prompting and Preference Optimization for Semi-supervised Medical Image Segmentation
Authors:
Aishik Konwer,
Zhijian Yang,
Erhan Bas,
Cao Xiao,
Prateek Prasanna,
Parminder Bhatia,
Taha Kass-Hout
Abstract:
Foundational models such as the Segment Anything Model (SAM) are gaining traction in medical imaging segmentation, supporting multiple downstream tasks. However, such models are supervised in nature, still relying on large annotated datasets or prompts supplied by experts. Conventional techniques such as active learning to alleviate such limitations are limited in scope and still necessitate continuous human involvement and complex domain knowledge for label refinement or establishing reward ground truth. To address these challenges, we propose an enhanced Segment Anything Model (SAM) framework that utilizes annotation-efficient prompts generated in a fully unsupervised fashion, while still capturing essential semantic, location, and shape information through contrastive language-image pretraining and visual question answering. We adopt the direct preference optimization technique to design an optimal policy that enables the model to generate high-fidelity segmentations with simple ratings or rankings provided by a virtual annotator simulating the human annotation process. State-of-the-art performance of our framework in tasks such as lung segmentation, breast tumor segmentation, and organ segmentation across various modalities, including X-ray, ultrasound, and abdominal CT, justifies its effectiveness in low-annotation data scenarios.
Submitted 6 March, 2025;
originally announced March 2025.
-
RSQ: Learning from Important Tokens Leads to Better Quantized LLMs
Authors:
Yi-Lin Sung,
Prateek Yadav,
Jialu Li,
Jaehong Yoon,
Mohit Bansal
Abstract:
Layer-wise quantization is a key technique for efficiently compressing large models without expensive retraining. Previous methods typically quantize the weights of each layer by "uniformly" optimizing the layer reconstruction loss across all output tokens. However, in this paper, we demonstrate that better-quantized models can be obtained by prioritizing learning from important tokens (e.g., those with large attention scores). Building on this finding, we propose RSQ (Rotate, Scale, then Quantize), which (1) applies rotations (orthogonal transformation) to the model to mitigate outliers (those with exceptionally large magnitude), (2) scales the token feature based on its importance, and (3) quantizes the model using the GPTQ framework with the second-order statistics computed by scaled tokens. To compute token importance, we explore both heuristic and dynamic strategies. Based on a thorough analysis of all approaches, we adopt attention concentration, which uses attention scores of each token as its importance, as the best approach. We demonstrate that RSQ consistently outperforms baseline methods across multiple downstream tasks and three model families: LLaMA3, Mistral, and Qwen2.5. Additionally, models quantized with RSQ achieve superior performance on long-context tasks, further highlighting its effectiveness. Lastly, RSQ demonstrates generalizability across various setups, including different model sizes, calibration datasets, bit precisions, and quantization methods.
Submitted 3 March, 2025;
originally announced March 2025.
-
Chevalley operations on TNN Grassmannians
Authors:
Prateek Kumar Vishwakarma
Abstract:
Lusztig showed that invertible totally nonnegative (TNN) matrices form a semigroup generated by positive diagonal matrices and Chevalley generators. From its Grassmann analogue, we introduce Chevalley operations on index sets, which we show have a rich variety of applications. We first completely classify all inequalities that are quadratic in Plucker coordinates over the TNN part of the Grassmannian: \[\sum_{I,J}c_{I,J}\Delta_I\Delta_J\ge 0\quad \text{over}\quad \mathrm{Gr}^{\ge 0}(m,m+n)\] where each $c_{I,J}$ is real, and $\Delta_I,\Delta_J$ are Plucker coordinates with a homogeneity condition. Using an idea of Gekhtman-Shapiro-Vainshtein, we also explain how our Chevalley operations can be motivated from cluster mutations, and lead to working in Grassmannians of smaller dimension, akin to cluster algebras.
We then present several applications of Chevalley operations. First, we obtain certificates for the above inequalities via sums of coefficients $c_{I,J}$ over 321-avoiding permutations and involutions; we believe this refines results of Rhoades-Skandera for TNN-matrix inequalities via their Temperley-Lieb immanant idea.
Second, we provide a novel proof via Chevalley operations of Lam's log-supermodularity of Plucker coordinates. This has several consequences: (a) Each positroid, corresponding to the positroid cells in Postnikov's decomposition of the TNN Grassmannian, is a distributive lattice. (b) It also yields numerical positivity in the main result of Lam-Postnikov-Pylyavskyy. (c) We show the coordinatewise monotonicity of ratios of Schur polynomials, first proved by Khare-Tao and which is the key result they use to obtain quantitative estimates for positivity preservers.
Third, we employ Chevalley operations to show that the majorization order over partitions implies a partial order for induced character immanants over TNN matrices, proved originally by Skandera-Soskin.
Submitted 13 March, 2025; v1 submitted 25 February, 2025;
originally announced February 2025.
-
MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
Authors:
Jiarui Zhang,
Mahyar Khayatkhoei,
Prateek Chhikara,
Filip Ilievski
Abstract:
Multimodal Large Language Models (MLLMs) have experienced rapid progress in visual recognition tasks in recent years. Given their potential integration into many critical applications, it is important to understand the limitations of their visual perception. In this work, we study whether MLLMs can perceive small visual details as effectively as large ones when answering questions about images. We observe that their performance is very sensitive to the size of the visual subject of the question, and further show that this effect is in fact causal by conducting an intervention study. Next, we study the attention patterns of MLLMs when answering visual questions, and intriguingly find that they consistently know where to look, even when they provide the wrong answer. Based on these findings, we then propose training-free visual intervention methods that leverage the internal knowledge of any MLLM itself, in the form of attention and gradient maps, to enhance its perception of small visual details. We evaluate our proposed methods on two widely-used MLLMs and seven visual question answering benchmarks and show that they can significantly improve MLLMs' accuracy without requiring any training. Our results elucidate the risk of applying MLLMs to visual recognition tasks concerning small details and indicate that visual intervention using the model's internal state is a promising direction to mitigate this risk.
Submitted 24 February, 2025;
originally announced February 2025.
-
Interleaved Gibbs Diffusion for Constrained Generation
Authors:
Gautham Govind Anil,
Sachin Yadav,
Dheeraj Nagaraj,
Karthikeyan Shanmugam,
Prateek Jain
Abstract:
We introduce Interleaved Gibbs Diffusion (IGD), a novel generative modeling framework for mixed continuous-discrete data, focusing on constrained generation problems. Prior works on discrete and continuous-discrete diffusion models assume factorized denoising distribution for fast generation, which can hinder the modeling of strong dependencies between random variables encountered in constrained generation. IGD moves beyond this by interleaving continuous and discrete denoising algorithms via a discrete time Gibbs sampling type Markov chain. IGD provides flexibility in the choice of denoisers, allows conditional generation via state-space doubling and inference time scaling via the ReDeNoise method. Empirical evaluations on three challenging tasks (solving 3-SAT, generating molecule structures, and generating layouts) demonstrate state-of-the-art performance. Notably, IGD achieves a 7% improvement on 3-SAT out of the box and achieves state-of-the-art results in molecule generation without relying on equivariant diffusion or domain-specific architectures. We explore a wide range of modeling and interleaving strategies along with hyperparameters in each of these problems.
Submitted 19 February, 2025;
originally announced February 2025.
-
Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models
Authors:
Prateek Chhikara
Abstract:
Large Language Models (LLMs) show remarkable proficiency in natural language tasks, yet their frequent overconfidence (misalignment between predicted confidence and true correctness) poses significant risks in critical decision-making applications. We present a comprehensive analysis on calibration in LLMs across nine LLMs and three factual Question-Answering (QA) datasets, systematically comparing standard free-generation settings against structured distractor-augmented prompts. Our evaluation reveals that explicitly incorporating distractors can substantially mitigate miscalibration, achieving relative accuracy improvements up to 460% and ECE reductions up to 90%. Despite general trends, we uncover nuanced findings: large RLHF-tuned models display inherent calibration strengths but can paradoxically suffer increased miscalibration on easier queries, whereas smaller models benefit disproportionately from distractor prompts but remain significantly miscalibrated. Through detailed analyses across question types, we identify persistent calibration failures, particularly in person-based queries. We conclude with concrete recommendations (targeted fine-tuning, structured prompting, and strategic model choice) to ensure reliable, trustworthy LLM deployments.
Submitted 5 June, 2025; v1 submitted 16 February, 2025;
originally announced February 2025.
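The calibration abstract above reports reductions in ECE (expected calibration error). For readers unfamiliar with the metric, the following is a minimal sketch of the common equal-width-bin estimator in plain Python. The function name, the default of 10 bins, and the left-closed binning convention are assumptions for illustration; the paper's exact binning scheme is not stated in the abstract.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin.

    Hypothetical sketch: `confidences` are values in [0, 1] and `correct` are
    booleans (or 0/1) saying whether each prediction was right.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need one correctness flag per confidence")

    # Each bin tracks [count, sum of confidences, number correct].
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Bin k covers [k/n_bins, (k+1)/n_bins); confidence 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += int(ok)

    n = len(confidences)
    ece = 0.0
    for count, conf_sum, hits in bins:
        if count:
            ece += (count / n) * abs(hits / count - conf_sum / count)
    return ece
```

As a quick check, four predictions all made with confidence 0.9 of which only two are correct fall into a single bin with accuracy 0.5, giving an ECE of about 0.4. An overconfident model in the abstract's sense is one where per-bin confidence systematically exceeds per-bin accuracy.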
-
Role of phase distortion in nonlinear saturation of the unstable acoustic modes in hypersonic parallel flow boundary layer
Authors:
Altaf Ahmed,
Joaquim P. Jossy,
Prateek Gupta
Abstract:
We analyze the role of the relative phasing in the nonlinear saturation of the unstable Mack modes in a hypersonic parallel flow boundary layer in two dimensions (2D). As the linearly unstable Mack modes extract energy from the mean flow, the perturbation energy cascades into higher harmonics as well as the mean flow. The higher harmonics are generated with < 0.5% of total perturbation energy at steady state, indicating a very small role of higher harmonics in 2D. Additionally, the higher harmonics propagate with the same phase speed as the unstable mode, indicating wave steepening and a coherent energy cascade. The mean flow gets decelerated and heated due to the continuous extraction of the perturbation energy into traveling modes and the viscous dissipation of these modes. Unlike unstable modes in classical hydrodynamics, we show that the distortion in relative phasing between the streamwise velocity and wall-normal velocity due to nonlinear distortion of the mean flow is dominant. Using asymptotic reconstruction of the unstable eigenmodes, we compute the perturbation energy budgets in the linear and nonlinear regimes. Through energy budgets, we show that the viscous effects in the wall layer and the viscous effects in the critical layer sufficiently capture the distortion in phase due to the mean-flow distortion. We then combine this in a numerical model for calculating the steady-state perturbation energy and mean-flow distortion through the nonlinear saturation of unstable Mack modes in a hypersonic parallel flow boundary layer in 2D. Throughout, we compare the results of approximate theoretical analysis with 2D direct numerical simulations (DNS).
Submitted 11 February, 2025;
originally announced February 2025.
-
Emergence of Order in Chemically Active Droplets: Temporal Dynamics and Collective Behavior
Authors:
Sobiya Ashraf,
Pawan Kumar,
Prateek Dwivedi,
Frédéric Blanc,
Dipin Pillai,
Rahul Mangal
Abstract:
Collective behaviors such as swarming, chemical signaling, and clustering are fundamental to biological microorganisms, enabling hierarchical colony formation, coordinated motion, and enhanced nutrient accessibility crucial for their survival. Over the past few decades, extensive research has been dedicated to unraveling the mechanisms underlying these diverse collective patterns through experimental model systems. Among these, active droplets have emerged as valuable synthetic analogs, effectively replicating key biological attributes and serving as ideal platforms for investigating collective phenomena. This research explores the collective behavior of 4-Cyano-4-pentyl-biphenyl (5CB) oil droplets across varying Péclet ($Pe$) numbers. At high $Pe$, droplets exhibit a pusher mode of propulsion and form dynamic chain-like patterns. Decreasing $Pe$ enhances repulsive interactions among droplets, which inhibits clustering. In the low $Pe$ regime, repulsive interactions dominated by the chemical field lead to the emergence of an ordered structure. Furthermore, we illustrate how active droplets efficiently navigate within a soft structured environment. These findings contribute to our understanding of self-organized phenomena in active matter systems and provide insights for designing strategies for controlled locomotion in intricate fluidic environments.
Submitted 10 February, 2025;
originally announced February 2025.
-
Matryoshka Quantization
Authors:
Pranav Nair,
Puranjay Datta,
Jeff Dean,
Prateek Jain,
Aditya Kusupati
Abstract:
Quantizing model weights is critical for reducing the communication and inference costs of large models. However, quantizing models -- especially to low precisions like int4 or int2 -- requires a trade-off in model quality; int2, in particular, is known to severely degrade model quality. Consequently, practitioners are often forced to maintain multiple models with different quantization levels or serve a single model that best satisfies the quality-latency trade-off. On the other hand, integer data types, such as int8, inherently possess a nested (Matryoshka) structure where smaller bit-width integers, like int4 or int2, are nested within the most significant bits. Leveraging this insight, in this paper, we propose Matryoshka Quantization (MatQuant), a novel multi-scale quantization technique that alleviates the aforementioned challenge. This technique allows us to train and maintain a single quantized model but serve it with the precision demanded by the deployment. Furthermore, leveraging MatQuant's co-training and co-distillation regularization, int2 precision models extracted by MatQuant outperform standard int2 quantization by up to 4% and 7% with OmniQuant and QAT as base algorithms, respectively. Finally, we demonstrate that by using an extra bit to represent outliers, a model with an effective precision of 2.05-bit gives an additional 6% improvement with OmniQuant as the base algorithm.
Submitted 3 March, 2025; v1 submitted 10 February, 2025;
originally announced February 2025.
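The nested structure mentioned in the Matryoshka Quantization abstract can be illustrated with a minimal sketch. It assumes unsigned 8-bit codes and plain truncation to the most significant bits; the function names `slice_msb` and `expand_to_8bit` are hypothetical, and this is not claimed to be the exact slicing operator used by MatQuant.

```python
def slice_msb(q8: int, bits: int) -> int:
    """Keep the `bits` most significant bits of an unsigned 8-bit code.

    Assumption: plain truncation (right shift), chosen only to illustrate
    the nesting; a real scheme may round or clamp differently.
    """
    if not 0 <= q8 <= 255:
        raise ValueError("q8 must be an unsigned 8-bit value")
    if not 1 <= bits <= 8:
        raise ValueError("bits must be between 1 and 8")
    return q8 >> (8 - bits)


def expand_to_8bit(code: int, bits: int) -> int:
    """Place a low-bit code back on the 8-bit grid so widths are comparable."""
    return code << (8 - bits)


q8 = 0b10110110                     # 182 as an 8-bit code
q4 = slice_msb(q8, 4)               # top four bits: 0b1011
q2 = slice_msb(q8, 2)               # top two bits:  0b10

# Nesting: the 2-bit code is itself the top two bits of the 4-bit code.
nested_ok = all(slice_msb(q, 2) == slice_msb(q, 4) >> 2 for q in range(256))
```

Because each lower-precision code is a prefix of the higher-precision one, a single stored int8 value can in principle be served at several bit widths without storing separate copies.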