-
Farm-Level, In-Season Crop Identification for India
Authors:
Ishan Deshpande,
Amandeep Kaur Reehal,
Chandan Nath,
Renu Singh,
Aayush Patel,
Aishwarya Jayagopal,
Gaurav Singh,
Gaurav Aggarwal,
Amit Agarwal,
Prathmesh Bele,
Sridhar Reddy,
Tanya Warrier,
Kinjal Singh,
Ashish Tendulkar,
Luis Pazos Outon,
Nikita Saxena,
Agata Dondzik,
Dinesh Tewari,
Shruti Garg,
Avneet Singh,
Harsh Dhand,
Vaibhav Rajan,
Alok Talekar
Abstract:
Accurate, timely, and farm-level crop type information is paramount for national food security, agricultural policy formulation, and economic planning, particularly in agriculturally significant nations like India. While remote sensing and machine learning have become vital tools for crop monitoring, existing approaches often grapple with challenges such as limited geographical scalability, restri…
▽ More
Accurate, timely, and farm-level crop type information is paramount for national food security, agricultural policy formulation, and economic planning, particularly in agriculturally significant nations like India. While remote sensing and machine learning have become vital tools for crop monitoring, existing approaches often grapple with challenges such as limited geographical scalability, restricted crop type coverage, the complexities of mixed-pixel and heterogeneous landscapes, and crucially, the robust in-season identification essential for proactive decision-making.
We present a framework designed to address the critical data gaps for targeted data driven decision making which generates farm-level, in-season, multi-crop identification at national scale (India) using deep learning. Our methodology leverages the strengths of Sentinel-1 and Sentinel-2 satellite imagery, integrated with national-scale farm boundary data. The model successfully identifies 12 major crops (which collectively account for nearly 90% of India's total cultivated area showing an agreement with national crop census 2023-24 of 94% in winter, and 75% in monsoon season). Our approach incorporates an automated season detection algorithm, which estimates crop sowing and harvest periods. This allows for reliable crop identification as early as two months into the growing season and facilitates rigorous in-season performance evaluation. Furthermore, we have engineered a highly scalable inference pipeline, culminating in what is, to our knowledge, the first pan-India, in-season, farm-level crop type data product. The system's effectiveness and scalability are demonstrated through robust validation against national agricultural statistics, showcasing its potential to deliver actionable, data-driven insights for transformative agricultural monitoring and management across India.
△ Less
Submitted 30 June, 2025;
originally announced July 2025.
-
Mechanisms vs. Outcomes: Probing for Syntax Fails to Explain Performance on Targeted Syntactic Evaluations
Authors:
Ananth Agarwal,
Jasper Jian,
Christopher D. Manning,
Shikhar Murty
Abstract:
Large Language Models (LLMs) exhibit a robust mastery of syntax when processing and generating text. While this suggests internalized understanding of hierarchical syntax and dependency relations, the precise mechanism by which they represent syntactic structure is an open area within interpretability research. Probing provides one way to identify the mechanism of syntax being linearly encoded in…
▽ More
Large Language Models (LLMs) exhibit a robust mastery of syntax when processing and generating text. While this suggests internalized understanding of hierarchical syntax and dependency relations, the precise mechanism by which they represent syntactic structure is an open area within interpretability research. Probing provides one way to identify the mechanism of syntax being linearly encoded in activations, however, no comprehensive study has yet established whether a model's probing accuracy reliably predicts its downstream syntactic performance. Adopting a "mechanisms vs. outcomes" framework, we evaluate 32 open-weight transformer models and find that syntactic features extracted via probing fail to predict outcomes of targeted syntax evaluations across English linguistic phenomena. Our results highlight a substantial disconnect between latent syntactic representations found via probing and observable syntactic behaviors in downstream tasks.
△ Less
Submitted 19 June, 2025;
originally announced June 2025.
-
Magistral
Authors:
Mistral-AI,
:,
Abhinav Rastogi,
Albert Q. Jiang,
Andy Lo,
Gabrielle Berrada,
Guillaume Lample,
Jason Rute,
Joep Barmentlo,
Karmesh Yadav,
Kartik Khandelwal,
Khyathi Raghavi Chandu,
Léonard Blier,
Lucile Saulnier,
Matthieu Dinot,
Maxime Darrin,
Neha Gupta,
Roman Soletskyi,
Sagar Vaze,
Teven Le Scao,
Yihan Wang,
Adam Yang,
Alexander H. Liu,
Alexandre Sablayrolles,
Amélie Héliou
, et al. (76 additional authors not shown)
Abstract:
We introduce Magistral, Mistral's first reasoning model and our own scalable reinforcement learning (RL) pipeline. Instead of relying on existing implementations and RL traces distilled from prior models, we follow a ground up approach, relying solely on our own models and infrastructure. Notably, we demonstrate a stack that enabled us to explore the limits of pure RL training of LLMs, present a s…
▽ More
We introduce Magistral, Mistral's first reasoning model and our own scalable reinforcement learning (RL) pipeline. Instead of relying on existing implementations and RL traces distilled from prior models, we follow a ground up approach, relying solely on our own models and infrastructure. Notably, we demonstrate a stack that enabled us to explore the limits of pure RL training of LLMs, present a simple method to force the reasoning language of the model, and show that RL on text data alone maintains most of the initial checkpoint's capabilities. We find that RL on text maintains or improves multimodal understanding, instruction following and function calling. We present Magistral Medium, trained for reasoning on top of Mistral Medium 3 with RL alone, and we open-source Magistral Small (Apache 2.0) which further includes cold-start data from Magistral Medium.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
Local MDI+: Local Feature Importances for Tree-Based Models
Authors:
Zhongyuan Liang,
Zachary T. Rewolinski,
Abhineet Agarwal,
Tiffany M. Tang,
Bin Yu
Abstract:
Tree-based ensembles such as random forests remain the go-to for tabular data over deep learning models due to their prediction performance and computational efficiency. These advantages have led to their widespread deployment in high-stakes domains, where interpretability is essential for ensuring trustworthy predictions. This has motivated the development of popular local (i.e. sample-specific)…
▽ More
Tree-based ensembles such as random forests remain the go-to for tabular data over deep learning models due to their prediction performance and computational efficiency. These advantages have led to their widespread deployment in high-stakes domains, where interpretability is essential for ensuring trustworthy predictions. This has motivated the development of popular local (i.e. sample-specific) feature importance (LFI) methods such as LIME and TreeSHAP. However, these approaches rely on approximations that ignore the model's internal structure and instead depend on potentially unstable perturbations. These issues are addressed in the global setting by MDI+, a feature importance method which exploits an equivalence between decision trees and linear models on a transformed node basis. However, the global MDI+ scores are not able to explain predictions when faced with heterogeneous individual characteristics. To address this gap, we propose Local MDI+ (LMDI+), a novel extension of the MDI+ framework to the sample specific setting. LMDI+ outperforms existing baselines LIME and TreeSHAP in identifying instance-specific signal features, averaging a 10% improvement in downstream task performance across twelve real-world benchmark datasets. It further demonstrates greater stability by consistently producing similar instance-level feature importance rankings across multiple random forest fits. Finally, LMDI+ enables local interpretability use cases, including the identification of closer counterfactuals and the discovery of homogeneous subgroups.
△ Less
Submitted 10 June, 2025;
originally announced June 2025.
-
Cost-Optimal Active AI Model Evaluation
Authors:
Anastasios N. Angelopoulos,
Jacob Eisenstein,
Jonathan Berant,
Alekh Agarwal,
Adam Fisch
Abstract:
The development lifecycle of generative AI systems requires continual evaluation, data acquisition, and annotation, which is costly in both resources and time. In practice, rapid iteration often makes it necessary to rely on synthetic annotation data because of the low cost, despite the potential for substantial bias. In this paper, we develop novel, cost-aware methods for actively balancing the u…
▽ More
The development lifecycle of generative AI systems requires continual evaluation, data acquisition, and annotation, which is costly in both resources and time. In practice, rapid iteration often makes it necessary to rely on synthetic annotation data because of the low cost, despite the potential for substantial bias. In this paper, we develop novel, cost-aware methods for actively balancing the use of a cheap, but often inaccurate, weak rater -- such as a model-based autorater that is designed to automatically assess the quality of generated content -- with a more expensive, but also more accurate, strong rater alternative such as a human. More specifically, the goal of our approach is to produce a low variance, unbiased estimate of the mean of the target "strong" rating, subject to some total annotation budget. Building on recent work in active and prediction-powered statistical inference, we derive a family of cost-optimal policies for allocating a given annotation budget between weak and strong raters so as to maximize statistical efficiency. Using synthetic and real-world data, we empirically characterize the conditions under which these policies yield improvements over prior methods. We find that, especially in tasks where there is high variability in the difficulty of examples, our policies can achieve the same estimation precision at a far lower total annotation budget than standard evaluation methods.
△ Less
Submitted 9 June, 2025;
originally announced June 2025.
-
N$^2$: A Unified Python Package and Test Bench for Nearest Neighbor-Based Matrix Completion
Authors:
Caleb Chin,
Aashish Khubchandani,
Harshvardhan Maskara,
Kyuseong Choi,
Jacob Feitelberg,
Albert Gong,
Manit Paul,
Tathagata Sadhukhan,
Anish Agarwal,
Raaz Dwivedi
Abstract:
Nearest neighbor (NN) methods have re-emerged as competitive tools for matrix completion, offering strong empirical performance and recent theoretical guarantees, including entry-wise error bounds, confidence intervals, and minimax optimality. Despite their simplicity, recent work has shown that NN approaches are robust to a range of missingness patterns and effective across diverse applications.…
▽ More
Nearest neighbor (NN) methods have re-emerged as competitive tools for matrix completion, offering strong empirical performance and recent theoretical guarantees, including entry-wise error bounds, confidence intervals, and minimax optimality. Despite their simplicity, recent work has shown that NN approaches are robust to a range of missingness patterns and effective across diverse applications. This paper introduces N$^2$, a unified Python package and testbed that consolidates a broad class of NN-based methods through a modular, extensible interface. Built for both researchers and practitioners, N$^2$ supports rapid experimentation and benchmarking. Using this framework, we introduce a new NN variant that achieves state-of-the-art results in several settings. We also release a benchmark suite of real-world datasets, from healthcare and recommender systems to causal inference and LLM evaluation, designed to stress-test matrix completion methods beyond synthetic scenarios. Our experiments demonstrate that while classical methods excel on idealized data, NN-based techniques consistently outperform them in real-world settings.
△ Less
Submitted 4 June, 2025;
originally announced June 2025.
-
Hybrid AI for Responsive Multi-Turn Online Conversations with Novel Dynamic Routing and Feedback Adaptation
Authors:
Priyaranjan Pattnayak,
Amit Agarwal,
Hansa Meghwani,
Hitesh Laxmichand Patel,
Srikant Panda
Abstract:
Retrieval-Augmented Generation (RAG) systems and large language model (LLM)-powered chatbots have significantly advanced conversational AI by combining generative capabilities with external knowledge retrieval. Despite their success, enterprise-scale deployments face critical challenges, including diverse user queries, high latency, hallucinations, and difficulty integrating frequently updated dom…
▽ More
Retrieval-Augmented Generation (RAG) systems and large language model (LLM)-powered chatbots have significantly advanced conversational AI by combining generative capabilities with external knowledge retrieval. Despite their success, enterprise-scale deployments face critical challenges, including diverse user queries, high latency, hallucinations, and difficulty integrating frequently updated domain-specific knowledge. This paper introduces a novel hybrid framework that integrates RAG with intent-based canned responses, leveraging predefined high-confidence responses for efficiency while dynamically routing complex or ambiguous queries to the RAG pipeline. Our framework employs a dialogue context manager to ensure coherence in multi-turn interactions and incorporates a feedback loop to refine intents, dynamically adjust confidence thresholds, and expand response coverage over time. Experimental results demonstrate that the proposed framework achieves a balance of high accuracy (95\%) and low latency (180ms), outperforming RAG and intent-based systems across diverse query types, positioning it as a scalable and adaptive solution for enterprise conversational AI applications.
△ Less
Submitted 25 June, 2025; v1 submitted 2 June, 2025;
originally announced June 2025.
-
BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation
Authors:
Eunsu Kim,
Haneul Yoo,
Guijin Son,
Hitesh Patel,
Amit Agarwal,
Alice Oh
Abstract:
As large language models (LLMs) continue to advance, the need for up-to-date and well-organized benchmarks becomes increasingly critical. However, many existing datasets are scattered, difficult to manage, and make it challenging to perform evaluations tailored to specific needs or domains, despite the growing importance of domain-specific models in areas such as math or code. In this paper, we in…
▽ More
As large language models (LLMs) continue to advance, the need for up-to-date and well-organized benchmarks becomes increasingly critical. However, many existing datasets are scattered, difficult to manage, and make it challenging to perform evaluations tailored to specific needs or domains, despite the growing importance of domain-specific models in areas such as math or code. In this paper, we introduce BenchHub, a dynamic benchmark repository that empowers researchers and developers to evaluate LLMs more effectively. BenchHub aggregates and automatically classifies benchmark datasets from diverse domains, integrating 303K questions across 38 benchmarks. It is designed to support continuous updates and scalable data management, enabling flexible and customizable evaluation tailored to various domains or use cases. Through extensive experiments with various LLM families, we demonstrate that model performance varies significantly across domain-specific subsets, emphasizing the importance of domain-aware benchmarking. We believe BenchHub can encourage better dataset reuse, more transparent model comparisons, and easier identification of underrepresented areas in existing benchmarks, offering a critical infrastructure for advancing LLM evaluation research.
△ Less
Submitted 31 May, 2025;
originally announced June 2025.
-
Hard Negative Mining for Domain-Specific Retrieval in Enterprise Systems
Authors:
Hansa Meghwani,
Amit Agarwal,
Priyaranjan Pattnayak,
Hitesh Laxmichand Patel,
Srikant Panda
Abstract:
Enterprise search systems often struggle to retrieve accurate, domain-specific information due to semantic mismatches and overlapping terminologies. These issues can degrade the performance of downstream applications such as knowledge management, customer support, and retrieval-augmented generation agents. To address this challenge, we propose a scalable hard-negative mining framework tailored spe…
▽ More
Enterprise search systems often struggle to retrieve accurate, domain-specific information due to semantic mismatches and overlapping terminologies. These issues can degrade the performance of downstream applications such as knowledge management, customer support, and retrieval-augmented generation agents. To address this challenge, we propose a scalable hard-negative mining framework tailored specifically for domain-specific enterprise data. Our approach dynamically selects semantically challenging but contextually irrelevant documents to enhance deployed re-ranking models.
Our method integrates diverse embedding models, performs dimensionality reduction, and uniquely selects hard negatives, ensuring computational efficiency and semantic precision. Evaluation on our proprietary enterprise corpus (cloud services domain) demonstrates substantial improvements of 15\% in MRR@3 and 19\% in MRR@10 compared to state-of-the-art baselines and other negative sampling techniques. Further validation on public domain-specific datasets (FiQA, Climate Fever, TechQA) confirms our method's generalizability and readiness for real-world applications.
△ Less
Submitted 23 May, 2025;
originally announced May 2025.
-
First Finish Search: Efficient Test-Time Scaling in Large Language Models
Authors:
Aradhye Agarwal,
Ayan Sengupta,
Tanmoy Chakraborty
Abstract:
Test-time scaling (TTS), which involves dynamic allocation of compute during inference, offers a promising way to improve reasoning in large language models. While existing TTS methods work well, they often rely on long decoding paths or require a large number of samples to be generated, increasing the token usage and inference latency. We observe the surprising fact that for reasoning tasks, shor…
▽ More
Test-time scaling (TTS), which involves dynamic allocation of compute during inference, offers a promising way to improve reasoning in large language models. While existing TTS methods work well, they often rely on long decoding paths or require a large number of samples to be generated, increasing the token usage and inference latency. We observe the surprising fact that for reasoning tasks, shorter traces are much more likely to be correct than longer ones. Motivated by this, we introduce First Finish Search (FFS), a training-free parallel decoding strategy that launches $n$ independent samples and returns as soon as any one completes. We evaluate FFS alongside simple decoding, beam search, majority voting, and budget forcing on four reasoning models (DeepSeek-R1, R1-Distill-Qwen-32B, QwQ-32B and Phi-4-Reasoning-Plus) and across four datasets (AIME24, AIME25-I, AIME25-II and GPQA Diamond). With DeepSeek-R1, FFS achieves $82.23\%$ accuracy on the AIME datasets, a $15\%$ improvement over DeepSeek-R1's standalone accuracy, nearly matching OpenAI's o4-mini performance. Our theoretical analysis explains why stopping at the shortest trace is likely to yield a correct answer and identifies the conditions under which early stopping may be suboptimal. The elegance and simplicity of FFS demonstrate that straightforward TTS strategies can perform remarkably well, revealing the untapped potential of simple approaches at inference time.
△ Less
Submitted 23 May, 2025;
originally announced May 2025.
-
ProxySPEX: Inference-Efficient Interpretability via Sparse Feature Interactions in LLMs
Authors:
Landon Butler,
Abhineet Agarwal,
Justin Singh Kang,
Yigit Efe Erginbas,
Bin Yu,
Kannan Ramchandran
Abstract:
Large Language Models (LLMs) have achieved remarkable performance by capturing complex interactions between input features. To identify these interactions, most existing approaches require enumerating all possible combinations of features up to a given order, causing them to scale poorly with the number of inputs $n$. Recently, Kang et al. (2025) proposed SPEX, an information-theoretic approach th…
▽ More
Large Language Models (LLMs) have achieved remarkable performance by capturing complex interactions between input features. To identify these interactions, most existing approaches require enumerating all possible combinations of features up to a given order, causing them to scale poorly with the number of inputs $n$. Recently, Kang et al. (2025) proposed SPEX, an information-theoretic approach that uses interaction sparsity to scale to $n \approx 10^3$ features. SPEX greatly improves upon prior methods but requires tens of thousands of model inferences, which can be prohibitive for large models. In this paper, we observe that LLM feature interactions are often hierarchical -- higher-order interactions are accompanied by their lower-order subsets -- which enables more efficient discovery. To exploit this hierarchy, we propose ProxySPEX, an interaction attribution algorithm that first fits gradient boosted trees to masked LLM outputs and then extracts the important interactions. Experiments across four challenging high-dimensional datasets show that ProxySPEX more faithfully reconstructs LLM outputs by 20% over marginal attribution approaches while using $10\times$ fewer inferences than SPEX. By accounting for interactions, ProxySPEX identifies features that influence model output over 20% more than those selected by marginal approaches. Further, we apply ProxySPEX to two interpretability tasks. Data attribution, where we identify interactions among CIFAR-10 training samples that influence test predictions, and mechanistic interpretability, where we uncover interactions between attention heads, both within and across layers, on a question-answering task. ProxySPEX identifies interactions that enable more aggressive pruning of heads than marginal approaches.
△ Less
Submitted 23 May, 2025;
originally announced May 2025.
-
SweEval: Do LLMs Really Swear? A Safety Benchmark for Testing Limits for Enterprise Use
Authors:
Hitesh Laxmichand Patel,
Amit Agarwal,
Arion Das,
Bhargava Kumar,
Srikant Panda,
Priyaranjan Pattnayak,
Taki Hasan Rafi,
Tejaswini Kumar,
Dong-Kyu Chae
Abstract:
Enterprise customers are increasingly adopting Large Language Models (LLMs) for critical communication tasks, such as drafting emails, crafting sales pitches, and composing casual messages. Deploying such models across different regions requires them to understand diverse cultural and linguistic contexts and generate safe and respectful responses. For enterprise applications, it is crucial to miti…
▽ More
Enterprise customers are increasingly adopting Large Language Models (LLMs) for critical communication tasks, such as drafting emails, crafting sales pitches, and composing casual messages. Deploying such models across different regions requires them to understand diverse cultural and linguistic contexts and generate safe and respectful responses. For enterprise applications, it is crucial to mitigate reputational risks, maintain trust, and ensure compliance by effectively identifying and handling unsafe or offensive language. To address this, we introduce SweEval, a benchmark simulating real-world scenarios with variations in tone (positive or negative) and context (formal or informal). The prompts explicitly instruct the model to include specific swear words while completing the task. This benchmark evaluates whether LLMs comply with or resist such inappropriate instructions and assesses their alignment with ethical frameworks, cultural nuances, and language comprehension capabilities. In order to advance research in building ethically aligned AI systems for enterprise use and beyond, we release the dataset and code: https://github.com/amitbcp/multilingual_profanity.
△ Less
Submitted 22 May, 2025;
originally announced May 2025.
-
FS-DAG: Few Shot Domain Adapting Graph Networks for Visually Rich Document Understanding
Authors:
Amit Agarwal,
Srikant Panda,
Kulbhushan Pachauri
Abstract:
In this work, we propose Few Shot Domain Adapting Graph (FS-DAG), a scalable and efficient model architecture for visually rich document understanding (VRDU) in few-shot settings. FS-DAG leverages domain-specific and language/vision specific backbones within a modular framework to adapt to diverse document types with minimal data. The model is robust to practical challenges such as handling OCR er…
▽ More
In this work, we propose Few Shot Domain Adapting Graph (FS-DAG), a scalable and efficient model architecture for visually rich document understanding (VRDU) in few-shot settings. FS-DAG leverages domain-specific and language/vision specific backbones within a modular framework to adapt to diverse document types with minimal data. The model is robust to practical challenges such as handling OCR errors, misspellings, and domain shifts, which are critical in real-world deployments. FS-DAG is highly performant with less than 90M parameters, making it well-suited for complex real-world applications for Information Extraction (IE) tasks where computational resources are limited. We demonstrate FS-DAG's capability through extensive experiments for information extraction task, showing significant improvements in convergence speed and performance compared to state-of-the-art methods. Additionally, this work highlights the ongoing progress in developing smaller, more efficient models that do not compromise on performance. Code : https://github.com/oracle-samples/fs-dag
△ Less
Submitted 22 May, 2025;
originally announced May 2025.
-
Advanced Integration of Discrete Line Segments in Digitized P&ID for Continuous Instrument Connectivity
Authors:
Soumya Swarup Prusty,
Astha Agarwal,
Srinivasan Iyenger
Abstract:
Piping and Instrumentation Diagrams (P&IDs) constitute the foundational blueprint of a plant, depicting the interconnections among process equipment, instrumentation for process control, and the flow of fluids and control signals. In their existing setup, the manual mapping of information from P&ID sheets holds a significant challenge. This is a time-consuming process, taking around 3-6 months, an…
▽ More
Piping and Instrumentation Diagrams (P&IDs) constitute the foundational blueprint of a plant, depicting the interconnections among process equipment, instrumentation for process control, and the flow of fluids and control signals. In their existing setup, the manual mapping of information from P&ID sheets holds a significant challenge. This is a time-consuming process, taking around 3-6 months, and is susceptible to errors. It also depends on the expertise of the domain experts and often requires multiple rounds of review. The digitization of P&IDs entails merging detected line segments, which is essential for linking various detected instruments, thereby creating a comprehensive digitized P&ID. This paper focuses on explaining how line segments which are detected using a computer vision model are merged and eventually building the connection between equipment and merged lines. Hence presenting a digitized form of information stating the interconnection between process equipment, instrumentation, flow of fluids and control signals. Eventually, which can be stored in a knowledge graph and that information along with the help of advanced algorithms can be leveraged for tasks like finding optimal routes, detecting system cycles, computing transitive closures, and more.
△ Less
Submitted 17 May, 2025;
originally announced May 2025.
-
PCS-UQ: Uncertainty Quantification via the Predictability-Computability-Stability Framework
Authors:
Abhineet Agarwal,
Michael Xiao,
Rebecca Barter,
Omer Ronen,
Boyu Fan,
Bin Yu
Abstract:
As machine learning (ML) models are increasingly deployed in high-stakes domains, trustworthy uncertainty quantification (UQ) is critical for ensuring the safety and reliability of these models. Traditional UQ methods rely on specifying a true generative model and are not robust to misspecification. On the other hand, conformal inference allows for arbitrary ML models but does not consider model s…
▽ More
As machine learning (ML) models are increasingly deployed in high-stakes domains, trustworthy uncertainty quantification (UQ) is critical for ensuring the safety and reliability of these models. Traditional UQ methods rely on specifying a true generative model and are not robust to misspecification. On the other hand, conformal inference allows for arbitrary ML models but does not consider model selection, which leads to large interval sizes. We tackle these drawbacks by proposing a UQ method based on the predictability, computability, and stability (PCS) framework for veridical data science proposed by Yu and Kumbier. Specifically, PCS-UQ addresses model selection by using a prediction check to screen out unsuitable models. PCS-UQ then fits these screened algorithms across multiple bootstraps to assess inter-sample variability and algorithmic instability, enabling more reliable uncertainty estimates. Further, we propose a novel calibration scheme that improves local adaptivity of our prediction sets. Experiments across $17$ regression and $6$ classification datasets show that PCS-UQ achieves the desired coverage and reduces width over conformal approaches by $\approx 20\%$. Further, our local analysis shows PCS-UQ often achieves target coverage across subgroups while conformal methods fail to do so. For large deep-learning models, we propose computationally efficient approximation schemes that avoid the expensive multiple bootstrap trainings of PCS-UQ. Across three computer vision benchmarks, PCS-UQ reduces prediction set size over conformal methods by $20\%$. Theoretically, we show a modified PCS-UQ algorithm is a form of split conformal inference and achieves the desired coverage with exchangeable data.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Rethinking the Global Convergence of Softmax Policy Gradient with Linear Function Approximation
Authors:
Max Qiushi Lin,
Jincheng Mei,
Matin Aghaei,
Michael Lu,
Bo Dai,
Alekh Agarwal,
Dale Schuurmans,
Csaba Szepesvari,
Sharan Vaswani
Abstract:
Policy gradient (PG) methods have played an essential role in the empirical successes of reinforcement learning. In order to handle large state-action spaces, PG methods are typically used with function approximation. In this setting, the approximation error in modeling problem-dependent quantities is a key notion for characterizing the global convergence of PG methods. We focus on Softmax PG with…
▽ More
Policy gradient (PG) methods have played an essential role in the empirical successes of reinforcement learning. In order to handle large state-action spaces, PG methods are typically used with function approximation. In this setting, the approximation error in modeling problem-dependent quantities is a key notion for characterizing the global convergence of PG methods. We focus on Softmax PG with linear function approximation (referred to as $\texttt{Lin-SPG}$) and demonstrate that the approximation error is irrelevant to the algorithm's global convergence even for the stochastic bandit setting. Consequently, we first identify the necessary and sufficient conditions on the feature representation that can guarantee the asymptotic global convergence of $\texttt{Lin-SPG}$. Under these feature conditions, we prove that $T$ iterations of $\texttt{Lin-SPG}$ with a problem-specific learning rate result in an $O(1/T)$ convergence to the optimal policy. Furthermore, we prove that $\texttt{Lin-SPG}$ with any arbitrary constant learning rate can ensure asymptotic global convergence to the optimal policy.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
GenSync: A Generalized Talking Head Framework for Audio-driven Multi-Subject Lip-Sync using 3D Gaussian Splatting
Authors:
Anushka Agarwal,
Muhammad Yusuf Hassan,
Talha Chafekar
Abstract:
We introduce GenSync, a novel framework for multi-identity lip-synced video synthesis using 3D Gaussian Splatting. Unlike most existing 3D methods that require training a new model for each identity , GenSync learns a unified network that synthesizes lip-synced videos for multiple speakers. By incorporating a Disentanglement Module, our approach separates identity-specific features from audio repr…
▽ More
We introduce GenSync, a novel framework for multi-identity lip-synced video synthesis using 3D Gaussian Splatting. Unlike most existing 3D methods that require training a new model for each identity , GenSync learns a unified network that synthesizes lip-synced videos for multiple speakers. By incorporating a Disentanglement Module, our approach separates identity-specific features from audio representations, enabling efficient multi-identity video synthesis. This design reduces computational overhead and achieves 6.8x faster training compared to state-of-the-art models, while maintaining high lip-sync accuracy and visual quality.
△ Less
Submitted 3 May, 2025;
originally announced May 2025.
-
Conversations with AI Chatbots Increase Short-Term Vaccine Intentions But Do Not Outperform Standard Public Health Messaging
Authors:
Neil K. R. Sehgal,
Sunny Rai,
Manuel Tonneau,
Anish K. Agarwal,
Joseph Cappella,
Melanie Kornides,
Lyle Ungar,
Alison Buttenheim,
Sharath Chandra Guntuku
Abstract:
Large language model (LLM) based chatbots show promise in persuasive communication, but existing studies often rely on weak controls or focus on belief change rather than behavioral intentions or outcomes. This pre-registered multi-country (US, Canada, UK) randomized controlled trial involving 930 vaccine-hesitant parents evaluated brief (three-minute) multi-turn conversations with LLM-based chatb…
▽ More
Large language model (LLM) based chatbots show promise in persuasive communication, but existing studies often rely on weak controls or focus on belief change rather than behavioral intentions or outcomes. This pre-registered multi-country (US, Canada, UK) randomized controlled trial involving 930 vaccine-hesitant parents evaluated brief (three-minute) multi-turn conversations with LLM-based chatbots against standard public health messaging approaches for increasing human papillomavirus (HPV) vaccine intentions for their children. Participants were randomly assigned to: (1) a weak control (no message), (2) a strong control reflecting the standard of care (reading official public health materials), or (3 and 4) one of two chatbot conditions. One chatbot was prompted to deliver short, conversational responses, while the other used the model's default output style (longer with bullet points). While chatbot interactions significantly increased self-reported vaccination intent (by 7.1-10.3 points on a 100-point scale) compared to no message, they did not outperform standard public health materials, with the conversational chatbot performing significantly worse. Additionally, while the short-term effects of chatbot interactions faded during a 15-day follow-up, the effects of public health material persisted through a 45-day follow-up relative to no message. These findings suggest that while LLMs can effectively shift vaccination intentions in the short-term, their incremental value over existing public health communications is questionable, offering a more tempered view of their persuasive capabilities and highlighting the importance of integrating AI-driven tools alongside, rather than replacing, current public health strategies.
△ Less
Submitted 26 June, 2025; v1 submitted 29 April, 2025;
originally announced April 2025.
-
A Cautionary Note on Quantum Oracles
Authors:
Avantika Agarwal,
Srijita Kundu
Abstract:
In recent years, the quantum oracle model introduced by Aaronson and Kuperberg (2007) has found a lot of use in showing oracle separations between complexity classes and cryptographic primitives. It is generally assumed that proof techniques that do not relativize with respect to quantum oracles will also not relativize with respect to classical oracles. In this note, we show that this is not the…
▽ More
In recent years, the quantum oracle model introduced by Aaronson and Kuperberg (2007) has found a lot of use in showing oracle separations between complexity classes and cryptographic primitives. It is generally assumed that proof techniques that do not relativize with respect to quantum oracles will also not relativize with respect to classical oracles. In this note, we show that this is not the case: specifically, we show that there is a quantum oracle problem that is contained in the class QMA, but not in a class we call polyQCPH. The class polyQCPH is equal to PSPACE with respect to classical oracles, and it is a well-known result that QMA is contained in PSPACE (also with respect to classical oracles).
We also show that the same separation holds relative to a distributional oracle, which is a model introduced by Natarajan and Nirkhe (2024). We believe our findings show the need for some caution when using these non-standard oracle models, particularly when showing separations between quantum and classical resources.
△ Less
Submitted 28 April, 2025;
originally announced April 2025.
-
Contracts: A unified lens on congestion control robustness, fairness, congestion, and generality
Authors:
Anup Agarwal,
Venkat Arun,
Srinivasan Seshan
Abstract:
Congestion control algorithms (CCAs) operate in partially observable environments, lacking direct visibility into link capacities, or competing flows. To ensure fair sharing of network resources, CCAs communicate their fair share through observable signals. For instance, Reno's fair share is encoded as $\propto 1/\sqrt{\texttt{loss rate}}$. We call such communication mechanisms \emph{contracts}. W…
▽ More
Congestion control algorithms (CCAs) operate in partially observable environments, lacking direct visibility into link capacities, or competing flows. To ensure fair sharing of network resources, CCAs communicate their fair share through observable signals. For instance, Reno's fair share is encoded as $\propto 1/\sqrt{\texttt{loss rate}}$. We call such communication mechanisms \emph{contracts}. We show that the design choice of contracts fixes key steady-state performance metrics, including robustness to errors in congestion signals, fairness, amount of congestion (e.g., delay, loss), and generality (e.g., range of supported link rates). This results in fundamental tradeoffs between these metrics. Using properties of contracts we also identify design pitfalls that lead to starvation (extreme unfairness). We argue that CCA design and analysis should start with contracts to conscientiously pick tradeoffs and avoid pitfalls. We empirically validate our findings and discuss their implications on CCA design and network measurement.
△ Less
Submitted 6 June, 2025; v1 submitted 25 April, 2025;
originally announced April 2025.
-
Scalable Permutation-Aware Modeling for Temporal Set Prediction
Authors:
Ashish Ranjan,
Ayush Agarwal,
Shalin Barot,
Sushant Kumar
Abstract:
Temporal set prediction involves forecasting the elements that will appear in the next set, given a sequence of prior sets, each containing a variable number of elements. Existing methods often rely on intricate architectures with substantial computational overhead, which hampers their scalability. In this work, we introduce a novel and scalable framework that leverages permutation-equivariant and…
▽ More
Temporal set prediction involves forecasting the elements that will appear in the next set, given a sequence of prior sets, each containing a variable number of elements. Existing methods often rely on intricate architectures with substantial computational overhead, which hampers their scalability. In this work, we introduce a novel and scalable framework that leverages permutation-equivariant and permutation-invariant transformations to efficiently model set dynamics. Our approach significantly reduces both training and inference time while maintaining competitive performance. Extensive experiments on multiple public benchmarks show that our method achieves results on par with or superior to state-of-the-art models across several evaluation metrics. These results underscore the effectiveness of our model in enabling efficient and scalable temporal set prediction.
△ Less
Submitted 23 April, 2025;
originally announced April 2025.
-
Tokenization Matters: Improving Zero-Shot NER for Indic Languages
Authors:
Priyaranjan Pattnayak,
Hitesh Laxmichand Patel,
Amit Agarwal
Abstract:
Tokenization is a critical component of Natural Language Processing (NLP), especially for low resource languages, where subword segmentation influences vocabulary structure and downstream task accuracy. Although Byte Pair Encoding (BPE) is a standard tokenization method in multilingual language models, its suitability for Named Entity Recognition (NER) in low resource Indic languages remains under…
▽ More
Tokenization is a critical component of Natural Language Processing (NLP), especially for low resource languages, where subword segmentation influences vocabulary structure and downstream task accuracy. Although Byte Pair Encoding (BPE) is a standard tokenization method in multilingual language models, its suitability for Named Entity Recognition (NER) in low resource Indic languages remains underexplored due to its limitations in handling morphological complexity. In this work, we systematically compare BPE, SentencePiece, and Character Level tokenization strategies using IndicBERT for NER tasks in low resource Indic languages like Assamese, Bengali, Marathi, and Odia, as well as extremely low resource Indic languages like Santali, Manipuri, and Sindhi. We assess both intrinsic linguistic properties tokenization efficiency, out of vocabulary (OOV) rates, and morphological preservation as well as extrinsic downstream performance, including fine tuning and zero shot cross lingual transfer.
Our experiments show that SentencePiece is a consistently better performing approach than BPE for NER in low resource Indic Languages, particularly in zero shot cross lingual settings, as it better preserves entity consistency. While BPE provides the most compact tokenization form, it is not capable of generalization because it misclassifies or even fails to recognize entity labels when tested on unseen languages. In contrast, SentencePiece constitutes a better linguistic structural preservation model, benefiting extremely low resource and morphologically rich Indic languages, such as Santali and Manipuri, for superior entity recognition, as well as high generalization across scripts, such as Sindhi, written in Arabic. The results point to SentencePiece as the more effective tokenization strategy for NER within multilingual and low resource Indic NLP applications.
△ Less
Submitted 23 April, 2025;
originally announced April 2025.
-
A Modularized Design Approach for GelSight Family of Vision-based Tactile Sensors
Authors:
Arpit Agarwal,
Mohammad Amin Mirzaee,
Xiping Sun,
Wenzhen Yuan
Abstract:
GelSight family of vision-based tactile sensors has proven to be effective for multiple robot perception and manipulation tasks. These sensors are based on an internal optical system and an embedded camera to capture the deformation of the soft sensor surface, inferring the high-resolution geometry of the objects in contact. However, customizing the sensors for different robot hands requires a ted…
▽ More
GelSight family of vision-based tactile sensors has proven to be effective for multiple robot perception and manipulation tasks. These sensors are based on an internal optical system and an embedded camera to capture the deformation of the soft sensor surface, inferring the high-resolution geometry of the objects in contact. However, customizing the sensors for different robot hands requires a tedious trial-and-error process to re-design the optical system. In this paper, we formulate the GelSight sensor design process as a systematic and objective-driven design problem and perform the design optimization with a physically accurate optical simulation. The method is based on modularizing and parameterizing the sensor's optical components and designing four generalizable objective functions to evaluate the sensor. We implement the method with an interactive and easy-to-use toolbox called OptiSense Studio. With the toolbox, non-sensor experts can quickly optimize their sensor design in both forward and inverse ways following our predefined modules and steps. We demonstrate our system with four different GelSight sensors by quickly optimizing their initial design in simulation and transferring it to the real sensors.
△ Less
Submitted 20 April, 2025;
originally announced April 2025.
-
Fighting Fires from Space: Leveraging Vision Transformers for Enhanced Wildfire Detection and Characterization
Authors:
Aman Agarwal,
James Gearon,
Raksha Rank,
Etienne Chenevert
Abstract:
Wildfires are increasing in intensity, frequency, and duration across large parts of the world as a result of anthropogenic climate change. Modern hazard detection and response systems that deal with wildfires are under-equipped for sustained wildfire seasons. Recent work has proved automated wildfire detection using Convolutional Neural Networks (CNNs) trained on satellite imagery are capable of…
▽ More
Wildfires are increasing in intensity, frequency, and duration across large parts of the world as a result of anthropogenic climate change. Modern hazard detection and response systems that deal with wildfires are under-equipped for sustained wildfire seasons. Recent work has proved automated wildfire detection using Convolutional Neural Networks (CNNs) trained on satellite imagery are capable of high-accuracy results. However, CNNs are computationally expensive to train and only incorporate local image context. Recently, Vision Transformers (ViTs) have gained popularity for their efficient training and their ability to include both local and global contextual information. In this work, we show that ViT can outperform well-trained and specialized CNNs to detect wildfires on a previously published dataset of LandSat-8 imagery. One of our ViTs outperforms the baseline CNN comparison by 0.92%. However, we find our own implementation of CNN-based UNet to perform best in every category, showing their sustained utility in image tasks. Overall, ViTs are comparably capable in detecting wildfires as CNNs, though well-tuned CNNs are still the best technique for detecting wildfire with our UNet providing an IoU of 93.58%, better than the baseline UNet by some 4.58%.
△ Less
Submitted 18 April, 2025;
originally announced April 2025.
-
Framing Perception: Exploring Camera Induced Objectification in Cinema
Authors:
Parth Maradia,
Ayushi Agarwal,
Srija Bhupathiraju,
Kavita Vemuri
Abstract:
This study investigates how cinematographic techniques influence viewer perception and contribute to the objectification of women, utilizing eye-tracking data from 91 participants. They watched a sexualized music video (SV) known for objectifying portrayals and a non-sexualized music video (TV). Using dynamic Areas of Interests (AOIs) (head, torso, and lower body), gaze metrics such as fixation du…
▽ More
This study investigates how cinematographic techniques influence viewer perception and contribute to the objectification of women, utilizing eye-tracking data from 91 participants. They watched a sexualized music video (SV) known for objectifying portrayals and a non-sexualized music video (TV). Using dynamic Areas of Interests (AOIs) (head, torso, and lower body), gaze metrics such as fixation duration, visit count, and scan paths were recorded to assess visual attention patterns. Participants were grouped according to their average fixations on sexualized AOIs. Statistical analyses revealed significant differences in gaze behavior between the videos and among the groups, with increased attention to sexualized AOIs in SV. Additionally, data-driven group differences in fixations identified specific segments with heightened objectification that are further analyzed using scan path visualization techniques. These findings provide strong empirical evidence of camera-driven gaze objectification, demonstrating how cinematic framing implicitly shapes objectifying gaze patterns, highlighting the critical need for mindful media representation.
△ Less
Submitted 14 April, 2025;
originally announced April 2025.
-
Enhancements for Developing a Comprehensive AI Fairness Assessment Standard
Authors:
Avinash Agarwal,
Mayashankar Kumar,
Manisha J. Nene
Abstract:
As AI systems increasingly influence critical sectors like telecommunications, finance, healthcare, and public services, ensuring fairness in decision-making is essential to prevent biased or unjust outcomes that disproportionately affect vulnerable entities or result in adverse impacts. This need is particularly pressing as the industry approaches the 6G era, where AI will drive complex functions…
▽ More
As AI systems increasingly influence critical sectors like telecommunications, finance, healthcare, and public services, ensuring fairness in decision-making is essential to prevent biased or unjust outcomes that disproportionately affect vulnerable entities or result in adverse impacts. This need is particularly pressing as the industry approaches the 6G era, where AI will drive complex functions like autonomous network management and hyper-personalized services. The TEC Standard for Fairness Assessment and Rating of AI Systems provides guidelines for evaluating fairness in AI, focusing primarily on tabular data and supervised learning models. However, as AI applications diversify, this standard requires enhancement to strengthen its impact and broaden its applicability. This paper proposes an expansion of the TEC Standard to include fairness assessments for images, unstructured text, and generative AI, including large language models, ensuring a more comprehensive approach that keeps pace with evolving AI technologies. By incorporating these dimensions, the enhanced framework will promote responsible and trustworthy AI deployment across various sectors.
△ Less
Submitted 10 April, 2025;
originally announced April 2025.
-
Right Prediction, Wrong Reasoning: Uncovering LLM Misalignment in RA Disease Diagnosis
Authors:
Umakanta Maharana,
Sarthak Verma,
Avarna Agarwal,
Prakashini Mruthyunjaya,
Dwarikanath Mahapatra,
Sakir Ahmed,
Murari Mandal
Abstract:
Large language models (LLMs) offer a promising pre-screening tool, improving early disease detection and providing enhanced healthcare access for underprivileged communities. The early diagnosis of various diseases continues to be a significant challenge in healthcare, primarily due to the nonspecific nature of early symptoms, the shortage of expert medical practitioners, and the need for prolonge…
▽ More
Large language models (LLMs) offer a promising pre-screening tool, improving early disease detection and providing enhanced healthcare access for underprivileged communities. The early diagnosis of various diseases continues to be a significant challenge in healthcare, primarily due to the nonspecific nature of early symptoms, the shortage of expert medical practitioners, and the need for prolonged clinical evaluations, all of which can delay treatment and adversely affect patient outcomes. With impressive accuracy in prediction across a range of diseases, LLMs have the potential to revolutionize clinical pre-screening and decision-making for various medical conditions. In this work, we study the diagnostic capability of LLMs for Rheumatoid Arthritis (RA) with real world patients data. Patient data was collected alongside diagnoses from medical experts, and the performance of LLMs was evaluated in comparison to expert diagnoses for RA disease prediction. We notice an interesting pattern in disease diagnosis and find an unexpected \textit{misalignment between prediction and explanation}. We conduct a series of multi-round analyses using different LLM agents. The best-performing model accurately predicts rheumatoid arthritis (RA) diseases approximately 95\% of the time. However, when medical experts evaluated the reasoning generated by the model, they found that nearly 68\% of the reasoning was incorrect. This study highlights a clear misalignment between LLMs high prediction accuracy and its flawed reasoning, raising important questions about relying on LLM explanations in clinical settings. \textbf{LLMs provide incorrect reasoning to arrive at the correct answer for RA disease diagnosis.}
△ Less
Submitted 9 April, 2025;
originally announced April 2025.
-
Ordering-based Conditions for Global Convergence of Policy Gradient Methods
Authors:
Jincheng Mei,
Bo Dai,
Alekh Agarwal,
Mohammad Ghavamzadeh,
Csaba Szepesvari,
Dale Schuurmans
Abstract:
We prove that, for finite-arm bandits with linear function approximation, the global convergence of policy gradient (PG) methods depends on inter-related properties between the policy update and the representation. textcolor{blue}{First}, we establish a few key observations that frame the study: \textbf{(i)} Global convergence can be achieved under linear function approximation without policy or r…
▽ More
We prove that, for finite-arm bandits with linear function approximation, the global convergence of policy gradient (PG) methods depends on inter-related properties between the policy update and the representation. textcolor{blue}{First}, we establish a few key observations that frame the study: \textbf{(i)} Global convergence can be achieved under linear function approximation without policy or reward realizability, both for the standard Softmax PG and natural policy gradient (NPG). \textbf{(ii)} Approximation error is not a key quantity for characterizing global convergence in either algorithm. \textbf{(iii)} The conditions on the representation that imply global convergence are different between these two algorithms. Overall, these observations call into question approximation error as an appropriate quantity for characterizing the global convergence of PG methods under linear function approximation. \textcolor{blue}{Second}, motivated by these observations, we establish new general results: \textbf{(i)} NPG with linear function approximation achieves global convergence \emph{if and only if} the projection of the reward onto the representable space preserves the optimal action's rank, a quantity that is not strongly related to approximation error. \textbf{(ii)} The global convergence of Softmax PG occurs if the representation satisfies a non-domination condition and can preserve the ranking of rewards, which goes well beyond policy or reward realizability. We provide experimental results to support these theoretical findings.
△ Less
Submitted 2 April, 2025;
originally announced April 2025.
-
A Causal Inference Framework for Data Rich Environments
Authors:
Alberto Abadie,
Anish Agarwal,
Devavrat Shah
Abstract:
We propose a formal model for counterfactual estimation with unobserved confounding in "data-rich" settings, i.e., where there are a large number of units and a large number of measurements per unit. Our model provides a bridge between the structural causal model view of causal inference common in the graphical models literature with that of the latent factor model view common in the potential out…
▽ More
We propose a formal model for counterfactual estimation with unobserved confounding in "data-rich" settings, i.e., where there are a large number of units and a large number of measurements per unit. Our model provides a bridge between the structural causal model view of causal inference common in the graphical models literature with that of the latent factor model view common in the potential outcomes literature. We show how classic models for potential outcomes and treatment assignments fit within our framework. We provide an identification argument for the average treatment effect, the average treatment effect on the treated, and the average treatment effect on the untreated. For any estimator that has a fast enough estimation error rate for a certain nuisance parameter, we establish it is consistent for these various causal parameters. We then show principal component regression is one such estimator that leads to consistent estimation, and we analyze the minimal smoothness required of the potential outcomes function for consistency.
△ Less
Submitted 2 April, 2025;
originally announced April 2025.
-
Empirical Analysis of Sim-and-Real Cotraining Of Diffusion Policies For Planar Pushing from Pixels
Authors:
Adam Wei,
Abhinav Agarwal,
Boyuan Chen,
Rohan Bosworth,
Nicholas Pfaff,
Russ Tedrake
Abstract:
In imitation learning for robotics, cotraining with demonstration data generated both in simulation and on real hardware has emerged as a powerful recipe to overcome the sim2real gap. This work seeks to elucidate basic principles of this sim-and-real cotraining to help inform simulation design, sim-and-real dataset creation, and policy training. Focusing narrowly on the canonical task of planar pu…
▽ More
In imitation learning for robotics, cotraining with demonstration data generated both in simulation and on real hardware has emerged as a powerful recipe to overcome the sim2real gap. This work seeks to elucidate basic principles of this sim-and-real cotraining to help inform simulation design, sim-and-real dataset creation, and policy training. Focusing narrowly on the canonical task of planar pushing from camera inputs enabled us to be thorough in our study. These experiments confirm that cotraining with simulated data \emph{can} dramatically improve performance in real, especially when real data is limited. Performance gains scale with simulated data, but eventually plateau; real-world data increases this performance ceiling. The results also suggest that reducing the domain gap in physics may be more important than visual fidelity for non-prehensile manipulation tasks. Perhaps surprisingly, having some visual domain gap actually helps the cotrained policy -- binary probes reveal that high-performing policies learn to distinguish simulated domains from real. We conclude by investigating this nuance and mechanisms that facilitate positive transfer between sim-and-real. In total, our experiments span over 40 real-world policies (evaluated on 800+ trials) and 200 simulated policies (evaluated on 40,000+ trials).
△ Less
Submitted 28 March, 2025;
originally announced March 2025.
-
States of Disarray: Cleaning Data for Gerrymandering Analysis
Authors:
Ananya Agarwal,
Fnu Alusi,
Arbie Hsu,
Arif Syraj,
Ellen Veomett
Abstract:
The mathematics of redistricting is an area of study that has exploded in recent years. In particular, many different research groups and expert witnesses in court cases have used outlier analysis to argue that a proposed map is a gerrymander. This outlier analysis relies on having an ensemble of potential redistricting maps against which the proposed map is compared. Arguably the most widely-acce…
▽ More
The mathematics of redistricting is an area of study that has exploded in recent years. In particular, many different research groups and expert witnesses in court cases have used outlier analysis to argue that a proposed map is a gerrymander. This outlier analysis relies on having an ensemble of potential redistricting maps against which the proposed map is compared. Arguably the most widely-accepted method of creating such an ensemble is to use a Markov Chain Monte Carlo (MCMC) process. This process requires that various pieces of data be gathered, cleaned, and coalesced into a single file that can be used as the seed of the MCMC process.
In this article, we describe how we have begun this cleaning process for each state, and made the resulting data available for the public at https://github.com/eveomett-states . At the time of submission, we have data for 22 states available for researchers, students, and the general public to easily access and analyze. We will continue the data cleaning process for each state, and we hope that the availability of these datasets will both further research in this area, and increase the public's interest in and understanding of modern techniques to detect gerrymandering.
△ Less
Submitted 14 March, 2025;
originally announced March 2025.
-
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
Authors:
Samuel Cahyawijaya,
Holy Lovenia,
Joel Ruben Antony Moniz,
Tack Hwa Wong,
Mohammad Rifqi Farhansyah,
Thant Thiri Maung,
Frederikus Hudi,
David Anugraha,
Muhammad Ravi Shulthan Habibi,
Muhammad Reza Qorib,
Amit Agarwal,
Joseph Marvin Imperial,
Hitesh Laxmichand Patel,
Vicky Feliren,
Bahrul Ilmi Nasution,
Manuel Antonio Rufino,
Genta Indra Winata,
Rian Adam Rajagede,
Carlos Rafael Catalan,
Mohamed Fazli Imam,
Priyaranjan Pattnayak,
Salsabila Zahirah Pranida,
Kevin Pratama,
Yeshil Bangera,
Adisai Na-Thalang
, et al. (67 additional authors not shown)
Abstract:
Southeast Asia (SEA) is a region of extraordinary linguistic and cultural diversity, yet it remains significantly underrepresented in vision-language (VL) research. This often results in artificial intelligence (AI) models that fail to capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing high-quality, culturally relevant data for SEA…
▽ More
Southeast Asia (SEA) is a region of extraordinary linguistic and cultural diversity, yet it remains significantly underrepresented in vision-language (VL) research. This often results in artificial intelligence (AI) models that fail to capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing high-quality, culturally relevant data for SEA languages. By involving contributors from SEA countries, SEA-VL aims to ensure better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages in VL research. Beyond crowdsourcing, our initiative goes one step further in the exploration of the automatic collection of culturally relevant images through crawling and image generation. First, we find that image crawling achieves approximately ~85% cultural relevance while being more cost- and time-efficient than crowdsourcing. Second, despite the substantial progress in generative vision models, synthetic images remain unreliable in accurately reflecting SEA cultures. The generated images often fail to reflect the nuanced traditions and cultural contexts of the region. Collectively, we gather 1.28M SEA culturally-relevant images, more than 50 times larger than other existing datasets. Through SEA-VL, we aim to bridge the representation gap in SEA, fostering the development of more inclusive AI systems that authentically represent diverse cultures across SEA.
△ Less
Submitted 18 March, 2025; v1 submitted 10 March, 2025;
originally announced March 2025.
-
Mitigating Preference Hacking in Policy Optimization with Pessimism
Authors:
Dhawal Gupta,
Adam Fisch,
Christoph Dann,
Alekh Agarwal
Abstract:
This work tackles the problem of overoptimization in reinforcement learning from human feedback (RLHF), a prevalent technique for aligning models with human preferences. RLHF relies on reward or preference models trained on \emph{fixed preference datasets}, and these models are unreliable when evaluated outside the support of this preference data, leading to the common reward or preference hacking…
▽ More
This work tackles the problem of overoptimization in reinforcement learning from human feedback (RLHF), a prevalent technique for aligning models with human preferences. RLHF relies on reward or preference models trained on \emph{fixed preference datasets}, and these models are unreliable when evaluated outside the support of this preference data, leading to the common reward or preference hacking phenomenon. We propose novel, pessimistic objectives for RLHF which are provably robust to overoptimization through the use of pessimism in the face of uncertainty, and design practical algorithms, P3O and PRPO, to optimize these objectives. Our approach is derived for the general preference optimization setting, but can be used with reward models as well. We evaluate P3O and PRPO on the tasks of fine-tuning language models for document summarization and creating helpful assistants, demonstrating remarkable resilience to overoptimization.
△ Less
Submitted 9 March, 2025;
originally announced March 2025.
-
Adaptive Test-Time Intervention for Concept Bottleneck Models
Authors:
Matthew Shen,
Aliyah Hsu,
Abhineet Agarwal,
Bin Yu
Abstract:
Concept bottleneck models (CBM) aim to improve model interpretability by predicting human level "concepts" in a bottleneck within a deep learning model architecture. However, how the predicted concepts are used in predicting the target still either remains black-box or is simplified to maintain interpretability at the cost of prediction performance. We propose to use Fast Interpretable Greedy Sum-…
▽ More
Concept bottleneck models (CBM) aim to improve model interpretability by predicting human level "concepts" in a bottleneck within a deep learning model architecture. However, how the predicted concepts are used in predicting the target still either remains black-box or is simplified to maintain interpretability at the cost of prediction performance. We propose to use Fast Interpretable Greedy Sum-Trees (FIGS) to obtain Binary Distillation (BD). This new method, called FIGS-BD, distills a binary-augmented concept-to-target portion of the CBM into an interpretable tree-based model, while maintaining the competitive prediction performance of the CBM teacher. FIGS-BD can be used in downstream tasks to explain and decompose CBM predictions into interpretable binary-concept-interaction attributions and guide adaptive test-time intervention. Across 4 datasets, we demonstrate that our adaptive test-time intervention identifies key concepts that significantly improve performance for realistic human-in-the-loop settings that only allow for limited concept interventions.
△ Less
Submitted 14 April, 2025; v1 submitted 9 March, 2025;
originally announced March 2025.
-
Vector Quantized Feature Fields for Fast 3D Semantic Lifting
Authors:
George Tang,
Aditya Agarwal,
Weiqiao Han,
Trevor Darrell,
Yutong Bai
Abstract:
We generalize lifting to semantic lifting by incorporating per-view masks that indicate relevant pixels for lifting tasks. These masks are determined by querying corresponding multiscale pixel-aligned feature maps, which are derived from scene representations such as distilled feature fields and feature point clouds. However, storing per-view feature maps rendered from distilled feature fields is…
▽ More
We generalize lifting to semantic lifting by incorporating per-view masks that indicate relevant pixels for lifting tasks. These masks are determined by querying corresponding multiscale pixel-aligned feature maps, which are derived from scene representations such as distilled feature fields and feature point clouds. However, storing per-view feature maps rendered from distilled feature fields is impractical, and feature point clouds are expensive to store and query. To enable lightweight on-demand retrieval of pixel-aligned relevance masks, we introduce the Vector-Quantized Feature Field. We demonstrate the effectiveness of the Vector-Quantized Feature Field on complex indoor and outdoor scenes. Semantic lifting, when paired with a Vector-Quantized Feature Field, can unlock a myriad of applications in scene representation and embodied intelligence. Specifically, we showcase how our method enables text-driven localized scene editing and significantly improves the efficiency of embodied question answering.
△ Less
Submitted 9 March, 2025;
originally announced March 2025.
-
The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems
Authors:
Richard Ren,
Arunim Agarwal,
Mantas Mazeika,
Cristina Menghini,
Robert Vacareanu,
Brad Kenstler,
Mick Yang,
Isabelle Barrass,
Alice Gatti,
Xuwang Yin,
Eduardo Trevino,
Matias Geralnik,
Adam Khoja,
Dean Lee,
Summer Yue,
Dan Hendrycks
Abstract:
As large language models (LLMs) become more capable and agentic, the requirement for trust in their outputs grows significantly, yet at the same time concerns have been mounting that models may learn to lie in pursuit of their goals. To address these concerns, a body of work has emerged around the notion of "honesty" in LLMs, along with interventions aimed at mitigating deceptive behaviors. Howeve…
▽ More
As large language models (LLMs) become more capable and agentic, the requirement for trust in their outputs grows significantly, yet at the same time concerns have been mounting that models may learn to lie in pursuit of their goals. To address these concerns, a body of work has emerged around the notion of "honesty" in LLMs, along with interventions aimed at mitigating deceptive behaviors. However, evaluations of honesty are currently highly limited, with no benchmark combining large scale and applicability to all models. Moreover, many benchmarks claiming to measure honesty in fact simply measure accuracy--the correctness of a model's beliefs--in disguise. In this work, we introduce a large-scale human-collected dataset for measuring honesty directly, allowing us to disentangle accuracy from honesty for the first time. Across a diverse set of LLMs, we find that while larger models obtain higher accuracy on our benchmark, they do not become more honest. Surprisingly, while most frontier LLMs obtain high scores on truthfulness benchmarks, we find a substantial propensity in frontier LLMs to lie when pressured to do so, resulting in low honesty scores on our benchmark. We find that simple methods, such as representation engineering interventions, can improve honesty. These results underscore the growing need for robust evaluations and effective interventions to ensure LLMs remain trustworthy.
△ Less
Submitted 20 March, 2025; v1 submitted 5 March, 2025;
originally announced March 2025.
-
Optimizing Pre-Training Data Mixtures with Mixtures of Data Expert Models
Authors:
Lior Belenki,
Alekh Agarwal,
Tianze Shi,
Kristina Toutanova
Abstract:
We propose a method to optimize language model pre-training data mixtures through efficient approximation of the cross-entropy loss corresponding to each candidate mixture via a Mixture of Data Experts (MDE). We use this approximation as a source of additional features in a regression model, trained from observations of model loss for a small number of mixtures.
Experiments with Transformer deco…
▽ More
We propose a method to optimize language model pre-training data mixtures through efficient approximation of the cross-entropy loss corresponding to each candidate mixture via a Mixture of Data Experts (MDE). We use this approximation as a source of additional features in a regression model, trained from observations of model loss for a small number of mixtures.
Experiments with Transformer decoder-only language models in the range of 70M to 1B parameters on the SlimPajama dataset show that our method achieves significantly better performance than approaches that train regression models using only the mixture rates as input features. Combining this improved optimization method with an objective that takes into account cross-entropy on end task data leads to superior performance on few-shot downstream evaluations.
We also provide theoretical insights on why aggregation of data expert predictions can provide good approximations to model losses for data mixtures.
△ Less
Submitted 21 February, 2025;
originally announced February 2025.
-
SPEX: Scaling Feature Interaction Explanations for LLMs
Authors:
Justin Singh Kang,
Landon Butler,
Abhineet Agarwal,
Yigit Efe Erginbas,
Ramtin Pedarsani,
Kannan Ramchandran,
Bin Yu
Abstract:
Large language models (LLMs) have revolutionized machine learning due to their ability to capture complex interactions between input features. Popular post-hoc explanation methods like SHAP provide marginal feature attributions, while their extensions to interaction importances only scale to small input lengths ($\approx 20$). We propose Spectral Explainer (SPEX), a model-agnostic interaction attr…
▽ More
Large language models (LLMs) have revolutionized machine learning due to their ability to capture complex interactions between input features. Popular post-hoc explanation methods like SHAP provide marginal feature attributions, while their extensions to interaction importances only scale to small input lengths ($\approx 20$). We propose Spectral Explainer (SPEX), a model-agnostic interaction attribution algorithm that efficiently scales to large input lengths ($\approx 1000)$. SPEX exploits underlying natural sparsity among interactions -- common in real-world data -- and applies a sparse Fourier transform using a channel decoding algorithm to efficiently identify important interactions. We perform experiments across three difficult long-context datasets that require LLMs to utilize interactions between inputs to complete the task. For large inputs, SPEX outperforms marginal attribution methods by up to 20% in terms of faithfully reconstructing LLM outputs. Further, SPEX successfully identifies key features and interactions that strongly influence model output. For one of our datasets, HotpotQA, SPEX provides interactions that align with human annotations. Finally, we use our model-agnostic approach to generate explanations to demonstrate abstract reasoning in closed-source LLMs (GPT-4o mini) and compositional reasoning in vision-language models.
△ Less
Submitted 19 February, 2025;
originally announced February 2025.
-
Clinical QA 2.0: Multi-Task Learning for Answer Extraction and Categorization
Authors:
Priyaranjan Pattnayak,
Hitesh Laxmichand Patel,
Amit Agarwal,
Bhargava Kumar,
Srikant Panda,
Tejaswini Kumar
Abstract:
Clinical Question Answering (CQA) plays a crucial role in medical decision-making, enabling physicians to extract relevant information from Electronic Medical Records (EMRs). While transformer-based models such as BERT, BioBERT, and ClinicalBERT have demonstrated state-of-the-art performance in CQA, existing models lack the ability to categorize extracted answers, which is critical for structured…
▽ More
Clinical Question Answering (CQA) plays a crucial role in medical decision-making, enabling physicians to extract relevant information from Electronic Medical Records (EMRs). While transformer-based models such as BERT, BioBERT, and ClinicalBERT have demonstrated state-of-the-art performance in CQA, existing models lack the ability to categorize extracted answers, which is critical for structured retrieval, content filtering, and medical decision support.
To address this limitation, we introduce a Multi-Task Learning (MTL) framework that jointly trains CQA models for both answer extraction and medical categorization. In addition to predicting answer spans, our model classifies responses into five standardized medical categories: Diagnosis, Medication, Symptoms, Procedure, and Lab Reports. This categorization enables more structured and interpretable outputs, making clinical QA models more useful in real-world healthcare settings.
We evaluate our approach on emrQA, a large-scale dataset for medical question answering. Results show that MTL improves F1-score by 2.2% compared to standard fine-tuning, while achieving 90.7% accuracy in answer categorization. These findings suggest that MTL not only enhances CQA performance but also introduces an effective mechanism for categorization and structured medical information retrieval.
△ Less
Submitted 23 April, 2025; v1 submitted 18 February, 2025;
originally announced February 2025.
-
SycEval: Evaluating LLM Sycophancy
Authors:
Aaron Fanous,
Jacob Goldberg,
Ank A. Agarwal,
Joanna Lin,
Anson Zhou,
Roxana Daneshjou,
Sanmi Koyejo
Abstract:
Large language models (LLMs) are increasingly applied in educational, clinical, and professional settings, but their tendency for sycophancy -- prioritizing user agreement over independent reasoning -- poses risks to reliability. This study introduces a framework to evaluate sycophantic behavior in ChatGPT-4o, Claude-Sonnet, and Gemini-1.5-Pro across AMPS (mathematics) and MedQuad (medical advice)…
▽ More
Large language models (LLMs) are increasingly applied in educational, clinical, and professional settings, but their tendency for sycophancy -- prioritizing user agreement over independent reasoning -- poses risks to reliability. This study introduces a framework to evaluate sycophantic behavior in ChatGPT-4o, Claude-Sonnet, and Gemini-1.5-Pro across AMPS (mathematics) and MedQuad (medical advice) datasets. Sycophantic behavior was observed in 58.19% of cases, with Gemini exhibiting the highest rate (62.47%) and ChatGPT the lowest (56.71%). Progressive sycophancy, leading to correct answers, occurred in 43.52% of cases, while regressive sycophancy, leading to incorrect answers, was observed in 14.66%. Preemptive rebuttals demonstrated significantly higher sycophancy rates than in-context rebuttals (61.75% vs. 56.52%, $Z=5.87$, $p<0.001$), particularly in computational tasks, where regressive sycophancy increased significantly (preemptive: 8.13%, in-context: 3.54%, $p<0.001$). Simple rebuttals maximized progressive sycophancy ($Z=6.59$, $p<0.001$), while citation-based rebuttals exhibited the highest regressive rates ($Z=6.59$, $p<0.001$). Sycophantic behavior showed high persistence (78.5%, 95% CI: [77.2%, 79.8%]) regardless of context or model. These findings emphasize the risks and opportunities of deploying LLMs in structured and dynamic domains, offering insights into prompt programming and model optimization for safer AI applications.
△ Less
Submitted 5 March, 2025; v1 submitted 12 February, 2025;
originally announced February 2025.
-
Small steps no more: Global convergence of stochastic gradient bandits for arbitrary learning rates
Authors:
Jincheng Mei,
Bo Dai,
Alekh Agarwal,
Sharan Vaswani,
Anant Raj,
Csaba Szepesvari,
Dale Schuurmans
Abstract:
We provide a new understanding of the stochastic gradient bandit algorithm by showing that it converges to a globally optimal policy almost surely using \emph{any} constant learning rate. This result demonstrates that the stochastic gradient algorithm continues to balance exploration and exploitation appropriately even in scenarios where standard smoothness and noise control assumptions break down…
▽ More
We provide a new understanding of the stochastic gradient bandit algorithm by showing that it converges to a globally optimal policy almost surely using \emph{any} constant learning rate. This result demonstrates that the stochastic gradient algorithm continues to balance exploration and exploitation appropriately even in scenarios where standard smoothness and noise control assumptions break down. The proofs are based on novel findings about action sampling rates and the relationship between cumulative progress and noise, and extend the current understanding of how simple stochastic gradient methods behave in bandit settings.
△ Less
Submitted 10 February, 2025;
originally announced February 2025.
-
Design Considerations in Offline Preference-based RL
Authors:
Alekh Agarwal,
Christoph Dann,
Teodor V. Marinov
Abstract:
Offline algorithms for Reinforcement Learning from Human Preferences (RLHF), which use only a fixed dataset of sampled responses given an input, and preference feedback among these responses, have gained increasing prominence in the literature on aligning language models. In this paper, we study how the different design choices made in methods such as DPO, IPO, SLiC and many variants influence the…
▽ More
Offline algorithms for Reinforcement Learning from Human Preferences (RLHF), which use only a fixed dataset of sampled responses given an input, and preference feedback among these responses, have gained increasing prominence in the literature on aligning language models. In this paper, we study how the different design choices made in methods such as DPO, IPO, SLiC and many variants influence the quality of the learned policy, from a theoretical perspective. Our treatment yields insights into the choices of loss function, the policy which is used to normalize log-likelihoods, and also the role of the data sampling policy. Notably, our results do not rely on the standard reparameterization-style arguments used to motivate some of the algorithms in this family, which allows us to give a unified treatment to a broad class of methods. We also conduct a small empirical study to verify some of the theoretical findings on a standard summarization benchmark.
△ Less
Submitted 7 February, 2025;
originally announced February 2025.
-
Catoni Contextual Bandits are Robust to Heavy-tailed Rewards
Authors:
Chenlu Ye,
Yujia Jin,
Alekh Agarwal,
Tong Zhang
Abstract:
Typical contextual bandit algorithms assume that the rewards at each round lie in some fixed range $[0, R]$, and their regret scales polynomially with this reward range $R$. However, many practical scenarios naturally involve heavy-tailed rewards or rewards where the worst-case range can be substantially larger than the variance. In this paper, we develop an algorithmic approach building on Catoni…
▽ More
Typical contextual bandit algorithms assume that the rewards at each round lie in some fixed range $[0, R]$, and their regret scales polynomially with this reward range $R$. However, many practical scenarios naturally involve heavy-tailed rewards or rewards where the worst-case range can be substantially larger than the variance. In this paper, we develop an algorithmic approach building on Catoni's estimator from robust statistics, and apply it to contextual bandits with general function approximation. When the variance of the reward at each round is known, we use a variance-weighted regression approach and establish a regret bound that depends only on the cumulative reward variance and logarithmically on the reward range $R$ as well as the number of rounds $T$. For the unknown-variance case, we further propose a careful peeling-based algorithm and remove the need for cumbersome variance estimation. With additional dependence on the fourth moment, our algorithm also enjoys a variance-based bound with logarithmic reward-range dependence. Moreover, we demonstrate the optimality of the leading-order term in our regret bound through a matching lower bound.
△ Less
Submitted 4 February, 2025;
originally announced February 2025.
-
Hybrid Graphs for Table-and-Text based Question Answering using LLMs
Authors:
Ankush Agarwal,
Ganesh S,
Chaitanya Devaguptapu
Abstract:
Answering questions that require reasoning and aggregation across both structured (tables) and unstructured (raw text) data sources presents significant challenges. Current methods rely on fine-tuning and high-quality, human-curated data, which is difficult to obtain. Recent advances in Large Language Models (LLMs) have shown promising results for multi-hop question answering (QA) over single-sour…
▽ More
Answering questions that require reasoning and aggregation across both structured (tables) and unstructured (raw text) data sources presents significant challenges. Current methods rely on fine-tuning and high-quality, human-curated data, which is difficult to obtain. Recent advances in Large Language Models (LLMs) have shown promising results for multi-hop question answering (QA) over single-source text data in a zero-shot setting, yet exploration into multi-source Table-Text QA remains limited. In this paper, we present a novel Hybrid Graph-based approach for Table-Text QA that leverages LLMs without fine-tuning. Our method constructs a unified Hybrid Graph from textual and tabular data, pruning information based on the input question to provide the LLM with relevant context concisely. We evaluate our approach on the challenging Hybrid-QA and OTT-QA datasets using state-of-the-art LLMs, including GPT-3.5, GPT-4, and LLaMA-3. Our method achieves the best zero-shot performance on both datasets, improving Exact Match scores by up to 10% on Hybrid-QA and 5.4% on OTT-QA. Moreover, our approach reduces token usage by up to 53% compared to the original context.
△ Less
Submitted 29 January, 2025;
originally announced January 2025.
-
Standardised schema and taxonomy for AI incident databases in critical digital infrastructure
Authors:
Avinash Agarwal,
Manisha J. Nene
Abstract:
The rapid deployment of Artificial Intelligence (AI) in critical digital infrastructure introduces significant risks, necessitating a robust framework for systematically collecting AI incident data to prevent future incidents. Existing databases lack the granularity as well as the standardized structure required for consistent data collection and analysis, impeding effective incident management. T…
▽ More
The rapid deployment of Artificial Intelligence (AI) in critical digital infrastructure introduces significant risks, necessitating a robust framework for systematically collecting AI incident data to prevent future incidents. Existing databases lack the granularity as well as the standardized structure required for consistent data collection and analysis, impeding effective incident management. This work proposes a standardized schema and taxonomy for AI incident databases, addressing these challenges by enabling detailed and structured documentation of AI incidents across sectors. Key contributions include developing a unified schema, introducing new fields such as incident severity, causes, and harms caused, and proposing a taxonomy for classifying AI incidents in critical digital infrastructure. The proposed solution facilitates more effective incident data collection and analysis, thus supporting evidence-based policymaking, enhancing industry safety measures, and promoting transparency. This work lays the foundation for a coordinated global response to AI incidents, ensuring trust, safety, and accountability in using AI across regions.
△ Less
Submitted 28 January, 2025;
originally announced January 2025.
-
Advancing Trustworthy AI for Sustainable Development: Recommendations for Standardising AI Incident Reporting
Authors:
Avinash Agarwal,
Manisha J Nene
Abstract:
The increasing use of AI technologies has led to increasing AI incidents, posing risks and causing harm to individuals, organizations, and society. This study recognizes and addresses the lack of standardized protocols for reliably and comprehensively gathering such incident data crucial for preventing future incidents and developing mitigating strategies. Specifically, this study analyses existin…
▽ More
The increasing use of AI technologies has led to increasing AI incidents, posing risks and causing harm to individuals, organizations, and society. This study recognizes and addresses the lack of standardized protocols for reliably and comprehensively gathering such incident data crucial for preventing future incidents and developing mitigating strategies. Specifically, this study analyses existing open-access AI-incident databases through a systematic methodology and identifies nine gaps in current AI incident reporting practices. Further, it proposes nine actionable recommendations to enhance standardization efforts to address these gaps. Ensuring the trustworthiness of enabling technologies such as AI is necessary for sustainable digital transformation. Our research promotes the development of standards to prevent future AI incidents and promote trustworthy AI, thus facilitating achieving the UN sustainable development goals. Through international cooperation, stakeholders can unlock the transformative potential of AI, enabling a sustainable and inclusive future for all.
△ Less
Submitted 1 January, 2025;
originally announced January 2025.
-
Humanity's Last Exam
Authors:
Long Phan,
Alice Gatti,
Ziwen Han,
Nathaniel Li,
Josephina Hu,
Hugh Zhang,
Chen Bo Calvin Zhang,
Mohamed Shaaban,
John Ling,
Sean Shi,
Michael Choi,
Anish Agrawal,
Arnav Chopra,
Adam Khoja,
Ryan Kim,
Richard Ren,
Jason Hausenloy,
Oliver Zhang,
Mantas Mazeika,
Dmitry Dodonov,
Tung Nguyen,
Jaeho Lee,
Daron Anderson,
Mikhail Doroshenko,
Alun Cennyth Stokes
, et al. (1084 additional authors not shown)
Abstract:
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of…
▽ More
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
△ Less
Submitted 19 April, 2025; v1 submitted 24 January, 2025;
originally announced January 2025.
-
AI Technicians: Developing Rapid Occupational Training Methods for a Competitive AI Workforce
Authors:
Jaromir Savelka,
Can Kultur,
Arav Agarwal,
Christopher Bogart,
Heather Burte,
Adam Zhang,
Majd Sakr
Abstract:
The accelerating pace of developments in Artificial Intelligence~(AI) and the increasing role that technology plays in society necessitates substantial changes in the structure of the workforce. Besides scientists and engineers, there is a need for a very large workforce of competent AI technicians (i.e., maintainers, integrators) and users~(i.e., operators). As traditional 4-year and 2-year degre…
▽ More
The accelerating pace of developments in Artificial Intelligence~(AI) and the increasing role that technology plays in society necessitates substantial changes in the structure of the workforce. Besides scientists and engineers, there is a need for a very large workforce of competent AI technicians (i.e., maintainers, integrators) and users~(i.e., operators). As traditional 4-year and 2-year degree-based education cannot fill this quickly opening gap, alternative training methods have to be developed. We present the results of the first four years of the AI Technicians program which is a unique collaboration between the U.S. Army's Artificial Intelligence Integration Center (AI2C) and Carnegie Mellon University to design, implement and evaluate novel rapid occupational training methods to create a competitive AI workforce at the technicians level. Through this multi-year effort we have already trained 59 AI Technicians. A key observation is that ongoing frequent updates to the training are necessary as the adoption of AI in the U.S. Army and within the society at large is evolving rapidly. A tight collaboration among the stakeholders from the army and the university is essential for successful development and maintenance of the training for the evolving role. Our findings can be leveraged by large organizations that face the challenge of developing a competent AI workforce as well as educators and researchers engaged in solving the challenge.
△ Less
Submitted 17 January, 2025;
originally announced January 2025.
-
MVTamperBench: Evaluating Robustness of Vision-Language Models
Authors:
Amit Agarwal,
Srikant Panda,
Angeline Charles,
Bhargava Kumar,
Hitesh Patel,
Priyaranjan Pattnayak,
Taki Hasan Rafi,
Tejaswini Kumar,
Hansa Meghwani,
Karan Gupta,
Dong-Kyu Chae
Abstract:
Multimodal Large Language Models (MLLMs), are recent advancement of Vision-Language Models (VLMs) that have driven major advances in video understanding. However, their vulnerability to adversarial tampering and manipulations remains underexplored. To address this gap, we introduce \textbf{MVTamperBench}, a benchmark that systematically evaluates MLLM robustness against five prevalent tampering te…
▽ More
Multimodal Large Language Models (MLLMs), are recent advancement of Vision-Language Models (VLMs) that have driven major advances in video understanding. However, their vulnerability to adversarial tampering and manipulations remains underexplored. To address this gap, we introduce \textbf{MVTamperBench}, a benchmark that systematically evaluates MLLM robustness against five prevalent tampering techniques: rotation, masking, substitution, repetition, and dropping; based on real-world visual tampering scenarios such as surveillance interference, social media content edits, and misinformation injection. MVTamperBench comprises ~3.4K original videos, expanded into over ~17K tampered clips covering 19 distinct video manipulation tasks. This benchmark challenges models to detect manipulations in spatial and temporal coherence. We evaluate 45 recent MLLMs from 15+ model families. We reveal substantial variability in resilience across tampering types and show that larger parameter counts do not necessarily guarantee robustness. MVTamperBench sets a new benchmark for developing tamper-resilient MLLM in safety-critical applications, including detecting clickbait, preventing harmful content distribution, and enforcing policies on media platforms. We release all code, data, and benchmark to foster open research in trustworthy video understanding.
Code: https://amitbcp.github.io/MVTamperBench/ Data: https://huggingface.co/datasets/Srikant86/MVTamperBench
△ Less
Submitted 11 June, 2025; v1 submitted 27 December, 2024;
originally announced December 2024.
-
Survey of Large Multimodal Model Datasets, Application Categories and Taxonomy
Authors:
Priyaranjan Pattnayak,
Hitesh Laxmichand Patel,
Bhargava Kumar,
Amit Agarwal,
Ishan Banerjee,
Srikant Panda,
Tejaswini Kumar
Abstract:
Multimodal learning, a rapidly evolving field in artificial intelligence, seeks to construct more versatile and robust systems by integrating and analyzing diverse types of data, including text, images, audio, and video. Inspired by the human ability to assimilate information through many senses, this method enables applications such as text-to-video conversion, visual question answering, and imag…
▽ More
Multimodal learning, a rapidly evolving field in artificial intelligence, seeks to construct more versatile and robust systems by integrating and analyzing diverse types of data, including text, images, audio, and video. Inspired by the human ability to assimilate information through many senses, this method enables applications such as text-to-video conversion, visual question answering, and image captioning. Recent developments in datasets that support multimodal language models (MLLMs) are highlighted in this overview. Large-scale multimodal datasets are essential because they allow for thorough testing and training of these models. With an emphasis on their contributions to the discipline, the study examines a variety of datasets, including those for training, domain-specific tasks, and real-world applications. It also emphasizes how crucial benchmark datasets are for assessing models' performance in a range of scenarios, scalability, and applicability. Since multimodal learning is always changing, overcoming these obstacles will help AI research and applications reach new heights.
△ Less
Submitted 23 December, 2024;
originally announced December 2024.