-
A U-Net and Transformer Pipeline for Multilingual Image Translation
Authors:
Siddharth Sahay,
Radhika Agarwal
Abstract:
This paper presents an end-to-end multilingual translation pipeline that integrates a custom U-Net for text detection, the Tesseract engine for text recognition, and a from-scratch sequence-to-sequence (Seq2Seq) Transformer for Neural Machine Translation (NMT). Our approach first utilizes a U-Net model, trained on a synthetic dataset, to accurately segment and detect text regions from an image. These detected regions are then processed by Tesseract to extract the source text. This extracted text is fed into a custom Transformer model trained from scratch on a multilingual parallel corpus spanning 5 languages. Unlike systems reliant on monolithic pre-trained models, our architecture emphasizes full customization and adaptability. The system is evaluated on its text detection accuracy, text recognition quality, and translation performance via BLEU scores. The complete pipeline demonstrates promising results, validating the viability of a custom-built system for translating text directly from images.
Submitted 27 October, 2025;
originally announced October 2025.
-
How to build a sovereign network? -- A proposal to measure network sovereignty
Authors:
Shakthivelu Janardhanan,
Ritanshi Agarwal,
Wolfgang Kellerer,
Carmen Mas-Machuca
Abstract:
Network sovereignty is a network operator's ability to reduce the dependency on component manufacturers to minimize the impact of manufacturer failures. Network operators now face new design challenges to increase network sovereignty and avoid vendor lock-in problems because a high dependency on a manufacturer corresponds to low survivability if that manufacturer is unavailable. The main contribution of this work is the proposal of a novel metric to measure network sovereignty, the Cut Set Coloring (CSC) score. Based on the CSC score, we present CSC-ILP, an Integer Linear Program formulation that maximizes network sovereignty. We compare CSC-ILP's performance with state-of-the-art manufacturer assignment strategies.
Submitted 27 October, 2025;
originally announced October 2025.
-
Confabulations from ACL Publications (CAP): A Dataset for Scientific Hallucination Detection
Authors:
Federica Gamba,
Aman Sinha,
Timothee Mickus,
Raul Vazquez,
Patanjali Bhamidipati,
Claudio Savelli,
Ahana Chattopadhyay,
Laura A. Zanella,
Yash Kankanampati,
Binesh Arakkal Remesh,
Aryan Ashok Chandramania,
Rohit Agarwal,
Chuyuan Li,
Ioana Buhnila,
Radhika Mamidi
Abstract:
We introduce the CAP (Confabulations from ACL Publications) dataset, a multilingual resource for studying hallucinations in large language models (LLMs) within scientific text generation. CAP focuses on the scientific domain, where hallucinations can distort factual knowledge, as they frequently do. In this domain, however, the presence of specialized terminology, statistical reasoning, and context-dependent interpretations further exacerbates these distortions, particularly given LLMs' lack of true comprehension, limited contextual understanding, and bias toward surface-level generalization. CAP operates in a cross-lingual setting covering five high-resource languages (English, French, Hindi, Italian, and Spanish) and four low-resource languages (Bengali, Gujarati, Malayalam, and Telugu). The dataset comprises 900 curated scientific questions and over 7000 LLM-generated answers from 16 publicly available models, provided as question-answer pairs along with token sequences and corresponding logits. Each instance is annotated with a binary label indicating the presence of a scientific hallucination, denoted as a factuality error, and a fluency label, capturing issues in the linguistic quality or naturalness of the text. CAP is publicly released to facilitate advanced research on hallucination detection, multilingual evaluation of LLMs, and the development of more reliable scientific NLP systems.
Submitted 25 October, 2025;
originally announced October 2025.
-
The Art of Scaling Reinforcement Learning Compute for LLMs
Authors:
Devvrit Khatri,
Lovish Madaan,
Rishabh Tiwari,
Rachit Bansal,
Sai Surya Duvvuri,
Manzil Zaheer,
Inderjit S. Dhillon,
David Brandfonbrener,
Rishabh Agarwal
Abstract:
Reinforcement learning (RL) has become central to training large language models (LLMs), yet the field lacks predictive scaling methodologies comparable to those established for pre-training. Despite rapidly rising compute budgets, there is no principled understanding of how to evaluate algorithmic improvements for scaling RL compute. We present the first large-scale systematic study, amounting to more than 400,000 GPU-hours, that defines a principled framework for analyzing and predicting RL scaling in LLMs. We fit sigmoidal compute-performance curves for RL training and ablate a wide range of common design choices to analyze their effects on asymptotic performance and compute efficiency. We observe: (1) Not all recipes yield similar asymptotic performance, (2) Details such as loss aggregation, normalization, curriculum, and off-policy algorithm primarily modulate compute efficiency without materially shifting the asymptote, and (3) Stable, scalable recipes follow predictable scaling trajectories, enabling extrapolation from smaller-scale runs. Combining these insights, we propose a best-practice recipe, ScaleRL, and demonstrate its effectiveness by successfully scaling and predicting validation performance on a single RL run scaled up to 100,000 GPU-hours. Our work provides both a scientific framework for analyzing scaling in RL and a practical recipe that brings RL training closer to the predictability long achieved in pre-training.
Submitted 15 October, 2025;
originally announced October 2025.
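The sigmoidal compute-performance curve mentioned in the abstract above can be illustrated with a small sketch. The abstract does not give the functional form, so the parameterization below (an asymptote, an efficiency exponent, and a midpoint compute `c_mid`) is an assumption chosen for illustration, and both "recipes" are made-up numbers rather than results from the paper.

```python
def sigmoid_scaling(compute, asymptote, efficiency, c_mid, r0=0.0):
    """Saturating compute-performance curve (assumed form, not from the abstract).

    performance = r0 + (asymptote - r0) / (1 + (c_mid / compute) ** efficiency)
    """
    if compute <= 0:
        raise ValueError("compute must be positive")
    return r0 + (asymptote - r0) / (1.0 + (c_mid / compute) ** efficiency)


# Two hypothetical recipes: same asymptote, different compute efficiency.
recipe_a = dict(asymptote=0.60, efficiency=1.0, c_mid=2_000.0)
recipe_b = dict(asymptote=0.60, efficiency=2.0, c_mid=2_000.0)

for gpu_hours in (500, 2_000, 8_000, 100_000):
    a = sigmoid_scaling(gpu_hours, **recipe_a)
    b = sigmoid_scaling(gpu_hours, **recipe_b)
    print(f"{gpu_hours:>7} GPU-h  A={a:.3f}  B={b:.3f}")
```

Under this assumed form, changing `efficiency` or `c_mid` alters how quickly a run approaches its ceiling, while only `asymptote` moves the ceiling itself, which mirrors the distinction the abstract draws between compute efficiency and asymptotic performance.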
-
Mapping Patient-Perceived Physician Traits from Nationwide Online Reviews with LLMs
Authors:
Junjie Luo,
Rui Han,
Arshana Welivita,
Zeleikun Di,
Jingfu Wu,
Xuzhe Zhi,
Ritu Agarwal,
Gordon Gao
Abstract:
Understanding how patients perceive their physicians is essential to improving trust, communication, and satisfaction. We present a large language model (LLM)-based pipeline that infers Big Five personality traits and five patient-oriented subjective judgments. The analysis encompasses 4.1 million patient reviews of 226,999 U.S. physicians from an initial pool of one million. We validate the method through multi-model comparison and human expert benchmarking, achieving strong agreement between human and LLM assessments (correlation coefficients 0.72-0.89) and external validity through correlations with patient satisfaction (r = 0.41-0.81, all p<0.001). National-scale analysis reveals systematic patterns: male physicians receive higher ratings across all traits, with largest disparities in clinical competence perceptions; empathy-related traits predominate in pediatrics and psychiatry; and all traits positively predict overall satisfaction. Cluster analysis identifies four distinct physician archetypes, from "Well-Rounded Excellent" (33.8%, uniformly high traits) to "Underperforming" (22.6%, consistently low). These findings demonstrate that automated trait extraction from patient narratives can provide interpretable, validated metrics for understanding physician-patient relationships at scale, with implications for quality measurement, bias detection, and workforce development in healthcare.
Submitted 4 October, 2025;
originally announced October 2025.
-
PAME-AI: Patient Messaging Creation and Optimization using Agentic AI
Authors:
Junjie Luo,
Yihong Guo,
Anqi Liu,
Ritu Agarwal,
Gordon Gao
Abstract:
Messaging patients is a critical part of healthcare communication, helping to improve outcomes such as medication adherence and healthy behaviors. However, traditional mobile message design has significant limitations due to its inability to explore the high-dimensional design space. We develop PAME-AI, a novel approach for Patient Messaging Creation and Optimization using Agentic AI. Built on the Data-Information-Knowledge-Wisdom (DIKW) hierarchy, PAME-AI offers a structured framework to move from raw data to actionable insights for high-performance messaging design. PAME-AI is composed of a system of specialized computational agents that progressively transform raw experimental data into actionable message design strategies. We demonstrate our approach's effectiveness through a two-stage experiment, comprising 444,691 patient encounters in Stage 1 and 74,908 in Stage 2. The best-performing generated message achieved 68.76% engagement compared to the 61.27% baseline, representing a 12.2% relative improvement in click-through rates. This agentic architecture enables parallel processing, hypothesis validation, and continuous learning, making it particularly suitable for large-scale healthcare communication optimization.
Submitted 30 September, 2025; v1 submitted 29 September, 2025;
originally announced September 2025.
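The 12.2% figure in the abstract above is a relative improvement, not a difference in percentage points. A short check of the arithmetic, using only the two engagement rates quoted in the abstract (the function name is illustrative):

```python
def relative_improvement(new_rate, baseline_rate):
    """Relative change of new_rate over baseline_rate, as a fraction."""
    return (new_rate - baseline_rate) / baseline_rate


best, baseline = 68.76, 61.27    # engagement percentages quoted in the abstract
absolute_gain = best - baseline  # difference in percentage points
relative_gain = relative_improvement(best, baseline)
print(f"absolute: {absolute_gain:.2f} points, relative: {relative_gain:.1%}")
# prints "absolute: 7.49 points, relative: 12.2%"
```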
-
General Framework for Twisted Bilayer Photonic Crystal with Interlayer Coupling and Far-Field Response
Authors:
Shupeng Xu,
Dun Wang,
Ritesh Agarwal
Abstract:
We develop a general theory for twisted bilayer photonic crystals that takes into account both far-field response and near-field coupling. The theory is based on the framework of a generalized Rayleigh-Schrödinger perturbation theory for non-Hermitian Hamiltonians. A universal form for interlayer coupling is derived, which relates the hopping strength to the Fourier transforms of the Wannier functions in the single-layer photonic crystal. For low-energy states at the K point in hexagonal lattices, the interlayer coupling reduces to that in the Bistritzer-MacDonald model for graphene. As an example, we study a twisted bilayer photonic crystal slab with air holes arranged in a honeycomb lattice in each layer. The first-order solution of our model predicts a four-fold band splitting in the far-field spectrum compared to the single-layer case, which is confirmed by numerical simulations. Moreover, our theory reveals that for low-energy states at K points, scattering towards the Γ point via the moiré potential is suppressed. Based on our theory, we propose a wide-angle, high-Q tunable flat band cavity by combining the bilayer at a large twist angle with a Brillouin-zone-folding perturbation within each layer. The cavity behaves like a collection of quasi-bound states in the continuum with a divergent density of states, with potential applications in nonlinear optics, lasing and quantum optics.
Submitted 28 September, 2025;
originally announced September 2025.
-
Match Chat: Real Time Generative AI and Generative Computing for Tennis
Authors:
Aaron Baughman,
Gozde Akay,
Eduardo Morales,
Rahul Agarwal,
Preetika Srivastava
Abstract:
We present Match Chat, a real-time, agent-driven assistant designed to enhance the tennis fan experience by delivering instant, accurate responses to match-related queries. Match Chat integrates Generative Artificial Intelligence (GenAI) with Generative Computing (GenComp) techniques to synthesize key insights during live tennis singles matches. The system debuted at the 2025 Wimbledon Championships and the 2025 US Open, where it provided about 1 million users with seamless access to streaming and static data through natural language queries. The architecture is grounded in an Agent-Oriented Architecture (AOA) combining rule engines, predictive models, and agents to pre-process and optimize user queries before passing them to GenAI components. The Match Chat system had an answer accuracy of 92.83% with an average response time of 6.25 seconds under loads of up to 120 requests per second (RPS). Over 96.08% of all queries were guided using interactive prompt design, contributing to a user experience that prioritized clarity, responsiveness, and minimal effort. The system was designed to mask architectural complexity, offering a frictionless and intuitive interface that required no onboarding or technical familiarity. Across both Grand Slam deployments, Match Chat maintained 100% uptime and supported nearly 1 million unique users, underscoring the scalability and reliability of the platform. This work introduces key design patterns for real-time, consumer-facing AI systems that emphasize speed, precision, and usability, highlighting a practical path for deploying performant agentic systems in dynamic environments.
Submitted 15 September, 2025;
originally announced September 2025.
-
Quantum-Enhanced Analysis and Grading of Vocal Performance
Authors:
Rohan Agarwal
Abstract:
We present QuantumMelody, a hybrid quantum-classical method for objective singing assessment. Grouped vocal features (pitch stability, dynamics, timbre) are encoded into a small simulated quantum circuit; all nine qubits are initialized with a Hadamard on each qubit and then receive Rx, Ry, and Rz rotations, with intra- and cross-group entanglement. The circuit measurement probabilities are fused with spectrogram transformer embeddings to estimate a grade on labels 2-5 and to surface technique-level feedback. On 168 labeled 20-second excerpts, the hybrid reaches 74.29% agreement with expert graders, a +12.86 point gain over a classical-features baseline. Processing is sub-minute per recording on a laptop-class Qiskit simulator; we do not claim hardware speedups. This is a feasibility step toward interpretable, objective singing assessment in applied audio signal processing.
Submitted 27 August, 2025;
originally announced September 2025.
-
Exact expressions for nonperturbative guiding center theory in symmetric fields
Authors:
I. Hollas,
R. Agarwal,
J. W. Burby,
A. J. Brizard
Abstract:
We apply a recently developed nonperturbative guiding center formalism to charged particle dynamics in fields with two-parameter continuous symmetry groups. This entails finding exact constants of motion, valid in the nonperturbative regime, that agree with Kruskal's adiabatic invariant series to all orders in the perturbative regime, when the field scale length is large compared with a typical gyroradius. We demonstrate that the nonperturbative guiding center model makes exact predictions in these cases, even though it eliminates the cyclotron timescale, thereby establishing a theoretical baseline for performance of the nonperturbative formalism.
Submitted 13 August, 2025;
originally announced August 2025.
-
Extreme Event Precursor Prediction in Turbulent Dynamical Systems via CNN-Augmented Recurrence Analysis
Authors:
Rahul Agarwal,
Mustafa A. Mohamad
Abstract:
We present a general framework to predict precursors to extreme events in turbulent dynamical systems. The approach combines phase-space reconstruction techniques with recurrence matrices and convolutional neural networks to identify precursors to extreme events. We evaluate the framework across three distinct testbed systems: a triad turbulent interaction model, a prototype stochastic anisotropic turbulent flow, and the Kolmogorov flow. This method offers three key advantages: (1) a threshold-free classification strategy that eliminates subjective parameter tuning, (2) efficient training using only $\mathcal{O}(100)$ recurrence matrices, and (3) the ability to generalize to unseen systems. The results demonstrate robust predictive performance across all test systems: 96\% detection rate for the triad model with a mean lead time of 1.8 time units, 96\% for the anisotropic turbulent flow with a mean lead time of 6.1 time units, and 93\% for the Kolmogorov flow with a mean lead time of 22.7 time units.
Submitted 6 August, 2025;
originally announced August 2025.
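As a rough illustration of the two building blocks named in the abstract above, phase-space reconstruction and recurrence matrices, the sketch below applies a time-delay embedding to a scalar series and computes pairwise distances between the embedded states. The abstract does not say how the paper constructs its matrices or what exactly its threshold-free strategy covers, so the embedding dimension, lag, test signal, and the optional `threshold` argument are all illustrative assumptions.

```python
import math


def delay_embed(series, dim, lag):
    """Time-delay embedding: row i is (x[i], x[i+lag], ..., x[i+(dim-1)*lag])."""
    n_vectors = len(series) - (dim - 1) * lag
    if n_vectors <= 0:
        raise ValueError("series too short for this dim/lag")
    return [[series[i + k * lag] for k in range(dim)] for i in range(n_vectors)]


def recurrence_matrix(vectors, threshold=None):
    """Pairwise Euclidean distances between embedded states.

    With threshold=None the raw distances are returned (an unthresholded
    recurrence plot); otherwise entries are 1 where distance <= threshold.
    """
    dists = [[math.dist(u, v) for v in vectors] for u in vectors]
    if threshold is None:
        return dists
    return [[1 if d <= threshold else 0 for d in row] for row in dists]


signal = [math.sin(0.3 * t) for t in range(60)]  # toy periodic signal
states = delay_embed(signal, dim=3, lag=5)       # 50 embedded state vectors
R = recurrence_matrix(states)                    # 50 x 50 distance matrix
```

A matrix like `R` can be treated as a single-channel image, which is the kind of input a convolutional network consumes.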
-
Towards Compute-Optimal Many-Shot In-Context Learning
Authors:
Shahriar Golchin,
Yanfei Chen,
Rujun Han,
Manan Gandhi,
Tianli Yu,
Swaroop Mishra,
Mihai Surdeanu,
Rishabh Agarwal,
Chen-Yu Lee,
Tomas Pfister
Abstract:
Long-context large language models (LLMs) are able to process inputs containing up to several million tokens. In the scope of in-context learning (ICL), this translates into using hundreds/thousands of demonstrations in the input prompt, enabling many-shot ICL. In practice, a fixed set of demonstrations is often selected at random in many-shot settings due to (1) high inference costs, (2) the benefits of caching and reusing computations, and (3) the similar performance offered by this strategy compared to others when scaled. In this work, we propose two straightforward strategies for demonstration selection in many-shot ICL that improve performance with minimal computational overhead. Our first method combines a small number of demonstrations, selected based on their similarity to each test sample, with a disproportionately larger set of random demonstrations that are cached. The second strategy improves the first by replacing random demonstrations with those selected using centroids derived from test sample representations via k-means clustering. Our experiments with Gemini Pro and Flash across several datasets indicate that our strategies consistently outperform random selection and surpass or match the most performant selection approach while supporting caching and reducing inference cost by up to an order of magnitude. We also show that adjusting the proportion of demonstrations selected based on different criteria can balance performance and inference cost in many-shot ICL.
Submitted 29 August, 2025; v1 submitted 22 July, 2025;
originally announced July 2025.
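The first strategy described in the abstract above, a few similarity-selected demonstrations combined with a much larger cached random set, can be sketched as follows. This is a minimal sketch under stated assumptions: the embeddings are random stand-ins, cosine similarity is an assumed similarity measure, and placing the cached block first (so the shared prefix is identical across test samples) is a design choice of this example rather than something the abstract specifies.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_prompt_demos(test_vec, pool, cached_ids, n_similar):
    """Return cached random demos followed by the n_similar nearest uncached demos."""
    cached = set(cached_ids)
    candidates = [i for i in range(len(pool)) if i not in cached]
    candidates.sort(key=lambda i: cosine(test_vec, pool[i]), reverse=True)
    return list(cached_ids) + candidates[:n_similar]


rng = random.Random(0)
# Hypothetical demonstration embeddings (200 demos, 8 dimensions each).
pool = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
# Random block chosen once and reused for every test sample.
cached_ids = rng.sample(range(len(pool)), 50)
test_vec = [rng.gauss(0, 1) for _ in range(8)]
demo_ids = build_prompt_demos(test_vec, pool, cached_ids, n_similar=5)
```

Only the last five indices change from one test sample to the next, so the computation for the first fifty demonstrations can in principle be cached and reused.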
-
SynthTextEval: Synthetic Text Data Generation and Evaluation for High-Stakes Domains
Authors:
Krithika Ramesh,
Daniel Smolyak,
Zihao Zhao,
Nupoor Gandhi,
Ritu Agarwal,
Margrét Bjarnadóttir,
Anjalie Field
Abstract:
We present SynthTextEval, a toolkit for conducting comprehensive evaluations of synthetic text. The fluency of large language model (LLM) outputs has made synthetic text potentially viable for numerous applications, such as reducing the risks of privacy violations in the development and deployment of AI systems in high-stakes domains. Realizing this potential, however, requires principled, consistent evaluations of synthetic data across multiple dimensions: its utility in downstream systems, the fairness of these systems, the risk of privacy leakage, general distributional differences from the source text, and qualitative feedback from domain experts. SynthTextEval allows users to conduct evaluations along all of these dimensions over synthetic data that they upload or generate using the toolkit's generation module. While our toolkit can be run over any data, we highlight its functionality and effectiveness over datasets from two high-stakes domains: healthcare and law. By consolidating and standardizing evaluation metrics, we aim to improve the viability of synthetic text and, in turn, privacy preservation in AI development.
Submitted 9 July, 2025;
originally announced July 2025.
-
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Authors:
Gheorghe Comanici,
Eric Bieber,
Mike Schaekermann,
Ice Pasupat,
Noveen Sachdeva,
Inderjit Dhillon,
Marcel Blistein,
Ori Ram,
Dan Zhang,
Evan Rosen,
Luke Marris,
Sam Petulla,
Colin Gaffney,
Asaf Aharoni,
Nathan Lintz,
Tiago Cardal Pais,
Henrik Jacobsson,
Idan Szpektor,
Nan-Jiang Jiang,
Krishna Haridasan,
Ahmed Omran,
Nikunj Saunshi,
Dara Bahri,
Gaurav Mishra,
Eric Chu
, et al. (3410 additional authors not shown)
Abstract:
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
Submitted 16 October, 2025; v1 submitted 7 July, 2025;
originally announced July 2025.
-
Online Planning for Cooperative Air-Ground Robot Systems with Unknown Fuel Requirements
Authors:
Ritvik Agarwal,
Behnoushsadat Hatami,
Alvika Gautam,
Parikshit Maini
Abstract:
We consider an online variant of the fuel-constrained UAV routing problem with a ground-based mobile refueling station (FCURP-MRS), where targets incur unknown fuel costs. We develop a two-phase solution: an offline heuristic-based planner computes initial UAV and UGV paths, and a novel online planning algorithm dynamically adjusts rendezvous points based on real-time fuel consumption during target processing. Preliminary Gazebo simulations demonstrate the feasibility of our approach in maintaining UAV-UGV path validity, ensuring mission completion. Link to video: https://youtu.be/EmpVj-fjqNY
Submitted 25 June, 2025;
originally announced June 2025.
-
FEWSim: A Visual Analytic Framework for Exploring the Nexus of Food-Energy-Water Simulations
Authors:
Fan Lei,
David A. Sampson,
Jiayi Hong,
Yuxin Ma,
Giuseppe Mascaro,
Dave White,
Rimjhim Agarwal,
Ross Maciejewski
Abstract:
The interdependencies of food, energy, and water (FEW) systems create a nexus opportunity to explore the strengths and vulnerabilities of individual and cross-sector interactions within FEW systems. However, the variables quantifying nexus interactions are hard to observe, which hinders the cross-sector analysis. To overcome such challenges, we present FEWSim, a visual analytics framework designed to support domain experts in exploring and interpreting simulation results from a coupled FEW model. FEWSim employs a three-layer asynchronous architecture: the model layer integrates food, energy, and water models to simulate the FEW nexus; the middleware layer manages scenario configuration and execution; and the visualization layer provides interactive visual exploration of simulated time-series results across FEW sectors. The visualization layer further facilitates the exploration across multiple scenarios and evaluates scenario differences in performance using sustainability indices of the FEW nexus. We demonstrate the utility of FEWSim through a case study for the Phoenix Active Management Area (AMA) in Arizona.
Submitted 16 June, 2025;
originally announced June 2025.
-
SARAL-Bot: Autonomous Robot for Strawberry Plant Care
Authors:
Arif Ahmed,
Ritvik Agarwal,
Gaurav Srikar,
Nathaniel Rose,
Parikshit Maini
Abstract:
Strawberry farming demands intensive labor for monitoring and maintaining plant health. To address this, Team SARAL develops an autonomous robot for the 2024 ASABE Student Robotics Challenge, capable of navigation, unhealthy leaf detection, and removal. The system addresses labor shortages, reduces costs, and supports sustainable farming through vision-based plant assessment. This work demonstrates the potential of robotics to modernize strawberry cultivation and enable scalable, intelligent agricultural solutions.
Submitted 7 June, 2025;
originally announced June 2025.
-
Overcoming Challenges of Partial Client Participation in Federated Learning: A Comprehensive Review
Authors:
Mrinmay Sen,
Shruti Aparna,
Rohit Agarwal,
Chalavadi Krishna Mohan
Abstract:
Federated Learning (FL) is a learning mechanism that falls under the distributed training umbrella, which collaboratively trains a shared global model without disclosing the raw data from different clients. This paper presents an extensive survey on the impact of partial client participation in federated learning. While much of the existing research focuses on addressing issues such as generalization, robustness, and fairness caused by data heterogeneity under the assumption of full client participation, limited attention has been given to the practical and theoretical challenges arising from partial client participation, which is common in real-world scenarios. This survey provides an in-depth review of existing FL methods designed to cope with partial client participation. We offer a comprehensive analysis supported by theoretical insights and empirical findings, along with a structured categorization of these methods, highlighting their respective advantages and disadvantages.
Submitted 6 June, 2025; v1 submitted 3 June, 2025;
originally announced June 2025.
-
REDDIX-NET: A Novel Dataset and Benchmark for Moderating Online Explicit Services
Authors:
MSVPJ Sathvik,
Manan Roy Choudhury,
Rishita Agarwal,
Sathwik Narkedimilli,
Vivek Gupta
Abstract:
The rise of online platforms has enabled covert illicit activities, including online prostitution, to pose challenges for detection and regulation. In this study, we introduce REDDIX-NET, a novel benchmark dataset specifically designed for moderating online sexual services and going beyond traditional NSFW filters. The dataset is derived from thousands of web-scraped NSFW posts on Reddit and categorizes users into six behavioral classes reflecting different service offerings and user intentions. We evaluate the classification performance of state-of-the-art large language models (GPT-4, Llama 3.3-70B-Instruct, Gemini 1.5 Flash, Mistral 8x7B, Qwen 2.5 Turbo, Claude 3.5 Haiku) using advanced quantitative metrics, finding promising results with models like GPT-4 and Gemini 1.5 Flash. Beyond classification, we conduct sentiment and comment analysis, leveraging LLM- and PLM-based approaches and metadata extraction to uncover behavioral and temporal patterns. These analyses reveal peak engagement times and distinct user interaction styles across categories. Our findings provide critical insights into AI-driven moderation and enforcement, offering a scalable framework for platforms to combat online prostitution and associated harms.
Submitted 29 May, 2025;
originally announced May 2025.
-
Automated Meta Prompt Engineering for Alignment with the Theory of Mind
Authors:
Aaron Baughman,
Rahul Agarwal,
Eduardo Morales,
Gozde Akay
Abstract:
We introduce a method of meta-prompting that jointly produces fluent text for complex tasks while optimizing the similarity of neural states between a human's mental expectation and a Large Language Model's (LLM) neural processing. A technique of agentic reinforcement learning is applied, in which an LLM as a Judge (LLMaaJ) teaches another LLM, through in-context learning, how to produce content by interpreting the intended and unintended generated text traits. To measure human mental beliefs around content production, users modify long-form AI-generated text articles before publication at the US Open 2024 tennis Grand Slam. Now, an LLMaaJ can solve the Theory of Mind (ToM) alignment problem by anticipating and including human edits within the creation of text from an LLM. Throughout experimentation and by interpreting the results of a live production system, the expectations of human content reviewers were 100% aligned with the AI 53.8% of the time, with an average iteration count of 4.38. The geometric interpretation of content traits such as factualness, novelty, repetitiveness, and relevancy over a Hilbert vector space, which combines spatial volume (all trait importance) with vertices alignment (individual trait relevance), enabled the LLMaaJ to optimize on Human ToM. This resulted in an increase in content quality by extending the coverage of tennis action. Our work, which was deployed at the US Open 2024, has been used across other live events within sports and entertainment.
Submitted 13 May, 2025;
originally announced May 2025.
-
Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers
Authors:
Kusha Sareen,
Morgane M Moss,
Alessandro Sordoni,
Rishabh Agarwal,
Arian Hosseini
Abstract:
Prevalent reinforcement learning (RL) methods for fine-tuning LLM reasoners, such as GRPO or Leave-one-out PPO, abandon the learned value function in favor of empirically estimated returns. This hinders test-time compute scaling that relies on using the value function for verification. In this work, we propose RL$^V$ that augments any "value-free" RL method by jointly training the LLM as both a reasoner and a generative verifier using RL-generated data, adding verification capabilities without significant overhead. Empirically, RL$^V$ boosts MATH accuracy by over 20% with parallel sampling and enables $8-32\times$ efficient test-time compute scaling compared to the base RL method. RL$^V$ also exhibits strong generalization capabilities for both easy-to-hard and out-of-domain tasks. Furthermore, RL$^V$ achieves $1.2-1.6\times$ higher performance when jointly scaling parallel and sequential test-time compute with a long reasoning R1 model.
Submitted 7 May, 2025;
originally announced May 2025.
-
Linguistic Complexity and Socio-cultural Patterns in Hip-Hop Lyrics
Authors:
Aayam Bansal,
Raghav Agarwal,
Kaashvi Jain
Abstract:
This paper presents a comprehensive computational framework for analyzing linguistic complexity and socio-cultural trends in hip-hop lyrics. Using a dataset of 3,814 songs from 146 influential artists spanning four decades (1980-2020), we employ natural language processing techniques to quantify multiple dimensions of lyrical complexity. Our analysis reveals a 23.7% increase in vocabulary diversity over the study period, with East Coast artists demonstrating 17.3% higher lexical variation than other regions. Rhyme density increased by 34.2% across all regions, with Midwest artists exhibiting the highest technical complexity (3.04 rhymes per line). Topic modeling identified significant shifts in thematic content, with social justice themes decreasing from 28.5% to 13.8% of content while introspective themes increased from 7.6% to 26.3%. Sentiment analysis demonstrated that lyrics became significantly more negative during sociopolitical crises, with polarity decreasing by 0.31 following major social unrest. Multi-dimensional analysis revealed four distinct stylistic approaches that correlate strongly with geographic origin (r=0.68, p<0.001) and time period (r=0.59, p<0.001). These findings establish quantitative evidence for the evolution of hip-hop as both an art form and a reflection of societal dynamics, providing insights into the interplay between linguistic innovation and cultural context in popular music.
Submitted 29 April, 2025;
originally announced May 2025.
-
The Muon Collider
Authors:
Carlotta Accettura,
Simon Adrian,
Rohit Agarwal,
Claudia Ahdida,
Chiara Aime',
Avni Aksoy,
Gian Luigi Alberghi,
Siobhan Alden,
Luca Alfonso,
Muhammad Ali,
Anna Rita Altamura,
Nicola Amapane,
Kathleen Amm,
David Amorim,
Paolo Andreetto,
Fabio Anulli,
Ludovica Aperio Bella,
Rob Appleby,
Artur Apresyan,
Pouya Asadi,
Mohammed Attia Mahmoud,
Bernhard Auchmann,
John Back,
Anthony Badea,
Kyu Jung Bae
, et al. (433 additional authors not shown)
Abstract:
Muons offer a unique opportunity to build a compact high-energy electroweak collider at the 10 TeV scale. A Muon Collider enables direct access to the underlying simplicity of the Standard Model and unparalleled reach beyond it. It will be a paradigm-shifting tool for particle physics representing the first collider to combine the high-energy reach of a proton collider and the high precision of an electron-positron collider, yielding a physics potential significantly greater than the sum of its individual parts. A high-energy muon collider is the natural next step in the exploration of fundamental physics after the HL-LHC and a natural complement to a future low-energy Higgs factory. Such a facility would significantly broaden the scope of particle colliders, engaging the many frontiers of the high energy community.
The last European Strategy for Particle Physics Update and later the Particle Physics Project Prioritisation Panel in the US requested a study of the muon collider, which is being carried out by the International Muon Collider Collaboration. In this comprehensive document we present the physics case and the state of the work on accelerator design and technology, and propose an R&D project that can make the muon collider a reality.
Submitted 30 April, 2025;
originally announced April 2025.
-
Process Reward Models That Think
Authors:
Muhammad Khalifa,
Rishabh Agarwal,
Lajanugen Logeswaran,
Jaekyeom Kim,
Hao Peng,
Moontae Lee,
Honglak Lee,
Lu Wang
Abstract:
Step-by-step verifiers -- also known as process reward models (PRMs) -- are a key ingredient for test-time scaling. PRMs require step-level supervision, making them expensive to train. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and discriminative verifiers -- using only 1% of the process labels in PRM800K -- across several challenging benchmarks. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME '24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation on a subset of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained on the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively compared to LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. Our work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training. Our code, data, and models are released at https://github.com/mukhal/thinkprm.
Submitted 25 September, 2025; v1 submitted 23 April, 2025;
originally announced April 2025.
-
Haphazard Inputs as Images in Online Learning
Authors:
Rohit Agarwal,
Aryan Dessai,
Arif Ahmed Sekh,
Krishna Agarwal,
Alexander Horsch,
Dilip K. Prasad
Abstract:
The field of varying feature space in online learning settings, also known as haphazard inputs, is very prominent nowadays due to its applicability in various fields. However, the current solutions to haphazard inputs are model-dependent and cannot benefit from the existing advanced deep-learning methods, which necessitate inputs of fixed dimensions. Therefore, we propose to transform the varying feature space in an online learning setting to a fixed-dimension image representation on the fly. This simple yet novel approach is model-agnostic, allowing any vision-based model to be applied to haphazard inputs, as demonstrated using ResNet and ViT. The image representation handles the inconsistent input data seamlessly, making our proposed approach scalable and robust. We show the efficacy of our method on four publicly available datasets. The code is available at https://github.com/Rohit102497/HaphazardInputsAsImages.
Submitted 3 April, 2025;
originally announced April 2025.
-
Finding Interest Needle in Popularity Haystack: Improving Retrieval by Modeling Item Exposure
Authors:
Rahul Agarwal,
Amit Jaspal,
Saurabh Gupta,
Omkar Vichare
Abstract:
Recommender systems operate in closed feedback loops, where user interactions reinforce popularity bias, leading to over-recommendation of already popular items while under-exposing niche or novel content. Existing bias mitigation methods, such as Inverse Propensity Scoring (IPS) and Off-Policy Correction (OPC), primarily operate at the ranking stage or during training, lacking explicit real-time control over exposure dynamics. In this work, we introduce an exposure-aware retrieval scoring approach, which explicitly models item exposure probability and adjusts retrieval-stage ranking at inference time. Unlike prior work, this method decouples exposure effects from engagement likelihood, enabling controlled trade-offs between fairness and engagement in large-scale recommendation platforms. We validate our approach through online A/B experiments in a real-world video recommendation system, demonstrating a 25% increase in uniquely retrieved items and a 40% reduction in the dominance of over-popular content, all while maintaining overall user engagement levels. Our results establish a scalable, deployable solution for mitigating popularity bias at the retrieval stage, offering a new paradigm for bias-aware personalization.
Submitted 8 June, 2025; v1 submitted 30 March, 2025;
originally announced March 2025.
-
Symmetry Enhanced Unconventional Spin Current Anisotropy in a Collinear Antiferromagnet
Authors:
Pankhuri Gupta,
Kacho Imtiyaz Ali Khan,
Akash Kumar,
Rekha Agarwal,
Nidhi Kandwal,
Ram Singh Yadav,
Johan Åkerman,
Pranaba Kishor Muduli
Abstract:
Spin-orbit torque (SOT) presents a promising avenue for energy-efficient spintronics devices, surpassing the limitations of spin transfer torque. While extensively studied in heavy metals, SOT in antiferromagnetic quantum materials remains largely unexplored. Here, we investigate SOT in epitaxial FeSn, a collinear antiferromagnet with a kagome lattice. FeSn exhibits intriguing topological quantum features, including two-dimensional flat bands and Dirac-like surface states, making it an ideal platform for investigating emergent SOT properties. Using spin-torque ferromagnetic resonance, we uncover a six-fold symmetric damping-like SOT in epitaxial-FeSn/Py heterostructures, reflecting the six-fold symmetry of the epitaxial [0001]-oriented FeSn films. Additionally, we observe a substantial unconventional field-like torque, originating from spin currents with out-of-plane spin polarization. This torque exhibits a unique angular dependence: a superposition of six-fold crystalline symmetry and uniaxial symmetry associated with the antiferromagnetic spin Hall effect. Notably, the unconventional field-like torque is enhanced when the RF current flows along the Néel vector in FeSn. Our findings reveal an unconventional spin current anisotropy tunable by crystalline and magnetic symmetry, offering a novel approach for controlling SOT in antiferromagnetic spintronics.
Submitted 31 March, 2025; v1 submitted 26 March, 2025;
originally announced March 2025.
-
Gemma 3 Technical Report
Authors:
Gemma Team,
Aishwarya Kamath,
Johan Ferret,
Shreya Pathak,
Nino Vieillard,
Ramona Merhej,
Sarah Perrin,
Tatiana Matejovicova,
Alexandre Ramé,
Morgane Rivière,
Louis Rouillard,
Thomas Mesnard,
Geoffrey Cideron,
Jean-bastien Grill,
Sabela Ramos,
Edouard Yvinec,
Michelle Casbon,
Etienne Pot,
Ivo Penchev,
Gaël Liu,
Francesco Visin,
Kathleen Kenealy,
Lucas Beyer,
Xiaohai Zhai,
Anton Tsitsulin
, et al. (191 additional authors not shown)
Abstract:
We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context - at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span on local attention short. The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2 for both pre-trained and instruction finetuned versions. In particular, our novel post-training recipe significantly improves the math, chat, instruction-following and multilingual abilities, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all our models to the community.
Submitted 25 March, 2025;
originally announced March 2025.
-
RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care
Authors:
Ziqi Yang,
Yuxuan Lu,
Jennifer Bagdasarian,
Vedant Das Swain,
Ritu Agarwal,
Collin Campbell,
Waddah Al-Refaire,
Jehan El-Bayoumi,
Guodong Gao,
Dakuo Wang,
Bingsheng Yao,
Nawar Shara
Abstract:
Cancer surgery is a key treatment for gastrointestinal (GI) cancers, a group of cancers that account for more than 35% of cancer-related deaths worldwide, but postoperative complications are unpredictable and can be life-threatening. In this paper, we investigate how recent advancements in large language models (LLMs) can benefit remote patient monitoring (RPM) systems through clinical integration by designing RECOVER, an LLM-powered RPM system for postoperative GI cancer care. To closely engage stakeholders in the design process, we first conducted seven participatory design sessions with five clinical staff and interviewed five cancer patients to derive six major design strategies for integrating clinical guidelines and information needs into LLM-based RPM systems. We then designed and implemented RECOVER, which features an LLM-powered conversational agent for cancer patients and an interactive dashboard for clinical staff to enable efficient postoperative RPM. Finally, we used RECOVER as a pilot system to assess the implementation of our design strategies with four clinical staff and five patients, providing design implications by identifying crucial design elements, offering insights on responsible AI, and outlining opportunities for future LLM-powered RPM systems.
Submitted 8 February, 2025;
originally announced February 2025.
-
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Authors:
Mrinank Sharma,
Meg Tong,
Jesse Mu,
Jerry Wei,
Jorrit Kruthoff,
Scott Goodfriend,
Euan Ong,
Alwin Peng,
Raj Agarwal,
Cem Anil,
Amanda Askell,
Nathan Bailey,
Joe Benton,
Emma Bluemke,
Samuel R. Bowman,
Eric Christiansen,
Hoagy Cunningham,
Andy Dau,
Anjali Gopal,
Rob Gilson,
Logan Graham,
Logan Howard,
Nimit Kalra,
Taesung Lee,
Kevin Lin
, et al. (18 additional authors not shown)
Abstract:
Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.
Submitted 30 January, 2025;
originally announced January 2025.
-
Flat panel laser displays enabled by large-scale visible photonic integrated circuits
Authors:
Zhujun Shi,
Risheng Cheng,
Guohua Wei,
Steven A. Hickman,
Min Chul Shin,
Peter Topalian,
Lei Wang,
Dusan Coso,
Brian Le,
Lizzy Lee,
Sean Braxton,
Alexander Koshelev,
Maxwell F. Parsons,
Rahul Agarwal,
Barry Silverstein,
Yun Wang,
Giuseppe Calafiore
Abstract:
Laser-based displays are highly sought after for their superior brightness and color performance, especially in advanced applications like augmented reality (AR). However, their broader adoption has been hindered by bulky projector designs and complex optical module assemblies. Here, we introduce a new laser display architecture enabled by large-scale visible photonic integrated circuits (PICs) to address these challenges. Unlike previous projector-style laser displays, this architecture features an ultra-thin, flat-panel form factor, replacing bulky free-space illumination modules with a single, high-performance photonic chip. Centimeter-scale PIC devices, which integrate thousands of distinct optical components on-chip, are carefully tailored to achieve high display uniformity, contrast, and efficiency. We demonstrate a 2 mm-thick flat-panel laser display combining the PIC with a liquid-crystal-on-silicon (LCoS) panel, achieving 211% of the color gamut and more than 80% volume reduction compared to traditional LCoS displays. We further showcase its application in a see-through AR system. Our work represents a major advancement in the integration of nanophotonics with display technology, enabling a range of new display concepts, from high-performance immersive displays to slim-panel 3D holography.
Submitted 26 December, 2024;
originally announced December 2024.
-
Generalizable Articulated Object Perception with Superpoints
Authors:
Qiaojun Yu,
Ce Hao,
Xibin Yuan,
Li Zhang,
Liu Liu,
Yukang Huo,
Rohit Agarwal,
Cewu Lu
Abstract:
Manipulating articulated objects with robotic arms is challenging due to the complex kinematic structure, which requires precise part segmentation for efficient manipulation. In this work, we introduce a novel superpoint-based perception method designed to improve part segmentation in 3D point clouds of articulated objects. We propose a learnable, part-aware superpoint generation technique that efficiently groups points based on their geometric and semantic similarities, resulting in clearer part boundaries. Furthermore, by leveraging the segmentation capabilities of the 2D foundation model SAM, we identify the centers of pixel regions and select corresponding superpoints as candidate query points. Integrating a query-based transformer decoder further enhances our method's ability to achieve precise part segmentation. Experimental results on the GAPartNet dataset show that our method outperforms existing state-of-the-art approaches in cross-category part segmentation, achieving AP50 scores of 77.9% for seen categories (4.4% improvement) and 39.3% for unseen categories (11.6% improvement), with superior results in 5 out of 9 part categories for seen objects and outperforming all previous methods across all part categories for unseen objects.
Submitted 21 December, 2024;
originally announced December 2024.
-
Improving Equity in Health Modeling with GPT4-Turbo Generated Synthetic Data: A Comparative Study
Authors:
Daniel Smolyak,
Arshana Welivita,
Margrét V. Bjarnadóttir,
Ritu Agarwal
Abstract:
Objective. Demographic groups are often represented at different rates in medical datasets. These differences can create bias in machine learning algorithms, with higher levels of performance for better-represented groups. One promising solution to this problem is to generate synthetic data to mitigate potential adverse effects of non-representative data sets.
Methods. We build on recent advances in LLM-based synthetic data generation to create a pipeline where the synthetic data is generated separately for each demographic group. We conduct our study using MIMIC-IV and Framingham "Offspring and OMNI-1 Cohorts" datasets. We prompt GPT4-Turbo to create group-specific data, providing training examples and the dataset context. An exploratory analysis is conducted to ascertain the quality of the generated data. We then evaluate the utility of the synthetic data for augmentation of a training dataset in a downstream machine learning task, focusing specifically on model performance metrics across groups.
Results. GPT4-Turbo augmentation generally, but not always, yields superior performance. In the majority of experiments our method outperforms standard modeling baselines; however, prompting GPT4-Turbo to produce data specific to a group provides little to no additional benefit over a prompt that does not specify the group.
Conclusion. We developed a method for using LLMs out-of-the-box to synthesize group-specific data to address imbalances in demographic representation in medical datasets. As another "tool in the toolbox", this method can improve model fairness and thus health equity. More research is needed to understand the conditions under which LLM-generated synthetic data is useful for non-representative medical data sets.
Submitted 20 December, 2024;
originally announced December 2024.
-
Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models
Authors:
Yinlam Chow,
Guy Tennenholtz,
Izzeddin Gur,
Vincent Zhuang,
Bo Dai,
Sridhar Thiagarajan,
Craig Boutilier,
Rishabh Agarwal,
Aviral Kumar,
Aleksandra Faust
Abstract:
Recent studies have indicated that effectively utilizing inference-time compute is crucial for attaining better performance from large language models (LLMs). In this work, we propose a novel inference-aware fine-tuning paradigm, in which the model is fine-tuned in a manner that directly optimizes the performance of the inference-time strategy. We study this paradigm using the simple yet effective Best-of-N (BoN) inference strategy, in which a verifier selects the best out of a set of LLM-generated responses. We devise the first imitation learning and reinforcement learning (RL) methods for BoN-aware fine-tuning, overcoming the challenging, non-differentiable argmax operator within BoN. We empirically demonstrate that our BoN-aware models implicitly learn a meta-strategy that interleaves best responses with more diverse responses that might be better suited to a test-time input -- a process reminiscent of the exploration-exploitation trade-off in RL. Our experiments demonstrate the effectiveness of BoN-aware fine-tuning in terms of improved performance and inference-time compute. In particular, we show that our methods improve the Bo32 performance of Gemma 2B on Hendrycks MATH from 26.8% to 30.8%, and pass@32 from 60.0% to 67.0%, as well as the pass@16 on HumanEval from 61.6% to 67.1%.
Submitted 18 December, 2024;
originally announced December 2024.
-
A Large Sensor Foundation Model Pretrained on Continuous Glucose Monitor Data for Diabetes Management
Authors:
Junjie Luo,
Abhimanyu Kumbara,
Mansur Shomali,
Rui Han,
Anand Iyer,
Ritu Agarwal,
Gordon Gao
Abstract:
Continuous glucose monitoring (CGM) combined with AI offers new opportunities for proactive diabetes management through real-time glucose forecasting. However, most existing models are task-specific and lack generalization across patient populations. Inspired by the autoregressive paradigm of large language models, we introduce CGM-LSM, a Transformer decoder-based Large Sensor Model (LSM) pretrained on 1.6 million CGM records from patients with different diabetes types, ages, and genders. We model patients as sequences of glucose time steps to learn latent knowledge embedded in CGM data and apply it to the prediction of glucose readings for a 2-hour horizon. Compared with prior methods, CGM-LSM significantly improves prediction accuracy and robustness: a 48.51% reduction in root mean square error in one-hour horizon forecasting and consistent zero-shot prediction performance across held-out patient groups. We analyze model performance variations across patient subgroups and prediction scenarios and outline key opportunities and challenges for advancing CGM foundation models.
Submitted 1 August, 2025; v1 submitted 12 December, 2024;
originally announced December 2024.
-
QuTiP 5: The Quantum Toolbox in Python
Authors:
Neill Lambert,
Eric Giguère,
Paul Menczel,
Boxi Li,
Patrick Hopf,
Gerardo Suárez,
Marc Gali,
Jake Lishman,
Rushiraj Gadhvi,
Rochisha Agarwal,
Asier Galicia,
Nathan Shammah,
Paul Nation,
J. R. Johansson,
Shahnawaz Ahmed,
Simon Cross,
Alexander Pitchford,
Franco Nori
Abstract:
QuTiP, the Quantum Toolbox in Python, has been at the forefront of open-source quantum software for the past 13 years. It is used as a research, teaching, and industrial tool, and has been downloaded millions of times by users around the world. Here we introduce the latest developments in QuTiP v5, which are set to have a large impact on the future of QuTiP and enable it to be a modern, continuously developed and popular tool for another decade and more. We summarize the code design and fundamental data layer changes as well as efficiency improvements, new solvers, applications to quantum circuits with QuTiP-QIP, and new quantum control tools with QuTiP-QOC. Additional flexibility in the data layer underlying all "quantum objects" in QuTiP allows us to harness the power of state-of-the-art data formats and packages like JAX, CuPy, and more. We explain these new features with a series of both well-known and new examples. The code for these examples is available in a static form on GitHub and as continuously updated and documented notebooks in the qutip-tutorials package.
Submitted 1 October, 2025; v1 submitted 5 December, 2024;
originally announced December 2024.
-
ENCLIP: Ensembling and Clustering-Based Contrastive Language-Image Pretraining for Fashion Multimodal Search with Limited Data and Low-Quality Images
Authors:
Prithviraj Purushottam Naik,
Rohit Agarwal
Abstract:
Multimodal search has revolutionized the fashion industry, providing a seamless and intuitive way for users to discover and explore fashion items. Based on their preferences, style, or specific attributes, users can search for products by combining text and image information. Text-to-image searches enable users to find visually similar items or describe products using natural language. This paper presents an innovative approach called ENCLIP, for enhancing the performance of the Contrastive Language-Image Pretraining (CLIP) model, specifically in Multimodal Search targeted towards the domain of fashion intelligence. This method focuses on addressing the challenges posed by limited data availability and low-quality images. This paper proposes an algorithm that involves training and ensembling multiple instances of the CLIP model, and leveraging clustering techniques to group similar images together. The experimental findings presented in this study provide evidence of the effectiveness of the methodology. This approach unlocks the potential of CLIP in the domain of fashion intelligence, where data scarcity and image quality issues are prevalent. Overall, the ENCLIP method represents a valuable contribution to the field of fashion intelligence and provides a practical solution for optimizing the CLIP model in scenarios with limited data and low-quality images.
Submitted 25 November, 2024;
originally announced November 2024.
-
Injection Attacks Against End-to-End Encrypted Applications
Authors:
Andrés Fábrega,
Carolina Ortega Pérez,
Armin Namavari,
Ben Nassi,
Rachit Agarwal,
Thomas Ristenpart
Abstract:
We explore an emerging threat model for end-to-end (E2E) encrypted applications: an adversary sends chosen messages to a target client, thereby "injecting" adversarial content into the application state. Such state is subsequently encrypted and synchronized to an adversarially-visible storage. By observing the lengths of the resulting cloud-stored ciphertexts, the attacker backs out confidential information. We investigate this injection threat model in the context of state-of-the-art encrypted messaging applications that support E2E encrypted backups. We show proof-of-concept attacks that can recover information about E2E encrypted messages or attachments sent via WhatsApp, assuming the ability to compromise the target user's Google or Apple account (which gives access to encrypted backups). We also show weaknesses in Signal's encrypted backup design that would allow injection attacks to infer metadata including a target user's number of contacts and conversations, should the adversary somehow obtain access to the user's encrypted Signal backup. While we do not believe our results should be of immediate concern for users of these messaging applications, our results do suggest that more work is needed to build tools that enjoy strong E2E security guarantees.
Submitted 14 November, 2024;
originally announced November 2024.
-
MuCol Milestone Report No. 5: Preliminary Parameters
Authors:
Carlotta Accettura,
Simon Adrian,
Rohit Agarwal,
Claudia Ahdida,
Chiara Aimé,
Avni Aksoy,
Gian Luigi Alberghi,
Siobhan Alden,
Luca Alfonso,
Nicola Amapane,
David Amorim,
Paolo Andreetto,
Fabio Anulli,
Rob Appleby,
Artur Apresyan,
Pouya Asadi,
Mohammed Attia Mahmoud,
Bernhard Auchmann,
John Back,
Anthony Badea,
Kyu Jung Bae,
E. J. Bahng,
Lorenzo Balconi,
Fabrice Balli,
Laura Bandiera
, et al. (369 additional authors not shown)
Abstract:
This document comprises a collection of updated preliminary parameters for the key parts of the muon collider. The updated preliminary parameters follow on from the October 2023 Tentative Parameters Report. Particular attention has been given to regions of the facility that are believed to hold greater technical uncertainty in their design and that have a strong impact on the cost and power consumption of the facility. The data is collected from a collaborative spreadsheet and transferred to Overleaf.
Submitted 5 November, 2024;
originally announced November 2024.
-
Scalable Reinforcement Post-Training Beyond Static Human Prompts: Evolving Alignment via Asymmetric Self-Play
Authors:
Ziyu Ye,
Rishabh Agarwal,
Tianqi Liu,
Rishabh Joshi,
Sarmishta Velury,
Quoc V. Le,
Qijun Tan,
Yuan Liu
Abstract:
Current reinforcement learning (RL) frameworks for large language model (LLM) post-training typically assume a fixed prompt distribution, which is sub-optimal and bottlenecks scalability. Prior works have explored prompt evolving, but are often limited to the supervised fine-tuning stage, and prompts are sampled and evolved uniformly without signals. This empirical work presents a paradigm shift: Evolving Alignment via Asymmetric Self-Play (eva), which casts post-training as an infinite game with regret-based signals for 2 players: (i) a creator, who strategically samples and creates new informative prompts and (ii) a solver, who learns to produce preferred responses. eva is the first method that allows language models to adaptively create training prompts in both offline and online RL post-training. The design is simple and easy to use, yet remarkably effective: eva sets a new SOTA on challenging benchmarks, without any extra human prompts, e.g. it boosts the win-rate of gemma-2-9b-it on Arena-Hard from 51.6% to 60.1% for DPO and from 52.6% to 62.4% for RLOO, surpassing claude-3-opus and catching up to gemini-1.5-pro, both of which are orders of magnitude larger. Extensive experiments show eva can create effective RL curricula and is robust across ablations. We believe adaptively evolving prompts are key to designing the next-generation RL post-training scheme.
Submitted 9 April, 2025; v1 submitted 31 October, 2024;
originally announced November 2024.
-
Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
Authors:
Michael Noukhovitch,
Shengyi Huang,
Sophie Xhonneux,
Arian Hosseini,
Rishabh Agarwal,
Aaron Courville
Abstract:
The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling with a reward model, and learning using feedback on the LLM's own outputs. While performant, this paradigm is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling. However, asynchronous training relies on an underexplored regime, online but off-policy RLHF: learning on samples from previous iterations of our model which give a worse training signal. We tackle the fundamental challenge in this regime: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? Among several RLHF algorithms we test, online DPO is found to be most robust to off-policy data, and robustness increases with the scale of the policy model. We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost, giving rise to a trade-off. We verify the scalability of asynchronous RLHF by training a general-purpose chatbot from LLaMA 3.1 8B on an instruction-following task ~40% faster than a synchronous run while matching final performance. Finally, we extend our results to math and reasoning to demonstrate asynchronous RL can finetune Rho 1B on GSM8k ~70% faster while matching synchronous accuracy.
Submitted 26 April, 2025; v1 submitted 23 October, 2024;
originally announced October 2024.
-
packetLSTM: Dynamic LSTM Framework for Streaming Data with Varying Feature Space
Authors:
Rohit Agarwal,
Karaka Prasanth Naidu,
Alexander Horsch,
Krishna Agarwal,
Dilip K. Prasad
Abstract:
We study the online learning problem characterized by the varying input feature space of streaming data. Although LSTMs have been employed to effectively capture the temporal nature of streaming data, they cannot handle the dimension-varying streams in an online learning setting. Therefore, we propose a dynamic LSTM-based novel method, called packetLSTM, to model the dimension-varying streams. The packetLSTM's dynamic framework consists of an evolving packet of LSTMs, each dedicated to processing one input feature. Each LSTM retains the local information of its corresponding feature, while a shared common memory consolidates global information. This configuration facilitates continuous learning and mitigates the issue of forgetting, even when certain features are absent for extended time periods. The idea of utilizing one LSTM per feature coupled with a dimension-invariant operator for information aggregation enhances the dynamic nature of packetLSTM. This dynamic nature is evidenced by the model's ability to activate, deactivate, and add new LSTMs as required, thus seamlessly accommodating varying input dimensions. The packetLSTM achieves state-of-the-art results on five datasets, and its underlying principle is extended to other RNN types, like GRU and vanilla RNN.
Submitted 22 October, 2024;
originally announced October 2024.
-
Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling
Authors:
Wenda Xu,
Rujun Han,
Zifeng Wang,
Long T. Le,
Dhruv Madeka,
Lei Li,
William Yang Wang,
Rishabh Agarwal,
Chen-Yu Lee,
Tomas Pfister
Abstract:
Recent advances in knowledge distillation (KD) have enabled smaller student models to approach the performance of larger teacher models. However, popular methods, such as supervised KD and on-policy KD, are adversely impacted by the knowledge gap between teacher and student in practical scenarios. Supervised KD suffers from a distribution mismatch between training with a static dataset and inference over final student-generated outputs. Conversely, on-policy KD, which uses student-generated samples for training, can suffer from low-quality training examples with which teacher models are not familiar, resulting in inaccurate teacher feedback. To address these limitations, we introduce Speculative Knowledge Distillation (SKD), a novel approach that leverages cooperation between student and teacher models to generate high-quality training data on-the-fly while aligning with the student's inference-time distribution. In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution, transferring high-quality knowledge adaptively. We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following, and show that SKD consistently outperforms existing KD methods across different domains, data sizes, and model initialization strategies.
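The propose-and-replace step in the abstract can be sketched for a single token as follows. This is a rough illustration under stated assumptions, not the authors' implementation: "poorly ranked" is interpreted here as "outside the teacher's top-k tokens", and the replacement is the teacher's most likely token rather than a sample from its distribution. The function names and the toy distribution are hypothetical.

```python
from typing import Dict, List


def teacher_top_k(teacher_dist: Dict[str, float], k: int) -> List[str]:
    """Return the k tokens to which the teacher assigns the highest probability."""
    return sorted(teacher_dist, key=teacher_dist.get, reverse=True)[:k]


def skd_step(student_token: str, teacher_dist: Dict[str, float], k: int) -> str:
    """Keep the student's proposal if the teacher ranks it in its top k.

    Otherwise replace it with the teacher's most likely token
    (an assumed simplification of sampling from the teacher).
    """
    top = teacher_top_k(teacher_dist, k)
    return student_token if student_token in top else top[0]


# Hypothetical teacher distribution over the next token.
teacher_dist = {"cat": 0.6, "dog": 0.3, "car": 0.1}
kept = skd_step("dog", teacher_dist, k=2)      # "dog" is in the teacher's top 2
replaced = skd_step("car", teacher_dist, k=2)  # "car" is not, so it is replaced
```

With a large k almost every student token is accepted, which approaches on-policy KD. With k = 1 the sequence is effectively teacher-generated, which is closer to supervised KD.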
Submitted 27 April, 2025; v1 submitted 15 October, 2024;
originally announced October 2024.
-
Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning
Authors:
Amrith Setlur,
Chirag Nagpal,
Adam Fisch,
Xinyang Geng,
Jacob Eisenstein,
Rishabh Agarwal,
Alekh Agarwal,
Jonathan Berant,
Aviral Kumar
Abstract:
A promising approach for improving reasoning in large language models is to use process reward models (PRMs). PRMs provide feedback at each step of a multi-step reasoning trace, potentially improving credit assignment over outcome reward models (ORMs) that only provide feedback at the final step. However, collecting dense, per-step human labels is not scalable, and training PRMs from automatically-labeled data has thus far led to limited gains. To improve a base policy by running search against a PRM or using it as dense rewards for reinforcement learning (RL), we ask: "How should we design process rewards?". Our key insight is that, to be effective, the process reward for a step should measure progress: a change in the likelihood of producing a correct response in the future, before and after taking the step, corresponding to the notion of step-level advantages in RL. Crucially, this progress should be measured under a prover policy distinct from the base policy. We theoretically characterize the set of good provers and our results show that optimizing process rewards from such provers improves exploration during test-time search and online RL. In fact, our characterization shows that weak prover policies can substantially improve a stronger base policy, which we also observe empirically. We validate our claims by training process advantage verifiers (PAVs) to predict progress under such provers, and show that compared to ORMs, test-time search against PAVs is $>8\%$ more accurate, and $1.5-5\times$ more compute-efficient. Online RL with dense rewards from PAVs enables one of the first results with $5-6\times$ gain in sample efficiency, and $>6\%$ gain in accuracy, over ORMs.
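The notion of progress described in the abstract, a change in the likelihood of eventually producing a correct response before and after a step, corresponds to a step-level advantage under the prover policy. It can be written out as below. The notation is assumed for illustration and does not come from the abstract: s_h is the partial solution before step h, a_h is the step taken, and \mu is the prover policy.

```latex
% Assumed notation: s_h = partial solution before step h, a_h = step h,
% \mu = prover policy (distinct from the base policy being improved).
\[
  A^{\mu}(s_h, a_h) \;=\; Q^{\mu}(s_h, a_h) \;-\; V^{\mu}(s_h)
\]
\[
  V^{\mu}(s_h) = \Pr_{\mu}\!\left[\text{final answer correct} \mid s_h\right],
  \qquad
  Q^{\mu}(s_h, a_h) = \Pr_{\mu}\!\left[\text{final answer correct} \mid s_h, a_h\right]
\]
```

A positive value means the step raised the prover's chance of finishing correctly. A negative value means the step made a correct completion less likely.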
Submitted 10 October, 2024;
originally announced October 2024.
-
Not All LLM Reasoners Are Created Equal
Authors:
Arian Hosseini,
Alessandro Sordoni,
Daniel Toyama,
Aaron Courville,
Rishabh Agarwal
Abstract:
We study the depth of grade-school math (GSM) problem-solving capabilities of LLMs. To this end, we evaluate their performance on pairs of existing math word problems together so that the answer to the second problem depends on correctly answering the first problem. Our findings reveal a significant reasoning gap in most LLMs, that is, a performance difference between solving the compositional pairs and solving each question independently. This gap is more pronounced in smaller, more cost-efficient, and math-specialized models. Moreover, instruction-tuning recipes and code generation have varying effects across LLM sizes, while finetuning on GSM can lead to task overfitting. Our analysis indicates that large reasoning gaps are due not to test-set leakage but to distraction from additional context and poor second-hop reasoning. Overall, LLMs exhibit systematic differences in their reasoning abilities, despite what their performance on standard benchmarks indicates.
Submitted 2 October, 2024;
originally announced October 2024.
-
Beyond Following: Mixing Active Initiative into Computational Creativity
Authors:
Zhiyu Lin,
Upol Ehsan,
Rohan Agarwal,
Samihan Dani,
Vidushi Vashishth,
Mark Riedl
Abstract:
Generative Artificial Intelligence (AI) encounters limitations in efficiency and fairness within the realm of Procedural Content Generation (PCG) when human creators solely drive and bear responsibility for the generative process. Alternative setups, such as Mixed-Initiative Co-Creative (MI-CC) systems, have shown promise. Still, the potential of an active mixed initiative, where AI takes a role beyond following, is understudied. This work investigates the influence of the adaptive ability of an active and learning AI agent on creators' expectancy of creative responsibilities in an MI-CC setting. We built and studied a system that employs reinforcement learning (RL) methods to learn the creative responsibility preferences of a human user during online interactions. Situated in story co-creation, we develop a multi-armed bandit agent that learns from the human creator, updates its collaborative decision-making belief, and switches between its capabilities during an MI-CC experience. With 39 participants joining a human subject study, our developed system's learning capabilities are well recognized compared to the non-learning ablation, corresponding to a significant increase in overall satisfaction with the MI-CC experience. These findings indicate a robust association between effective MI-CC collaborative interactions, particularly the implementation of proactive AI initiatives, and deepened understanding among all participants.
Submitted 6 September, 2024;
originally announced September 2024.
-
Training Language Models to Self-Correct via Reinforcement Learning
Authors:
Aviral Kumar,
Vincent Zhuang,
Rishabh Agarwal,
Yi Su,
John D Co-Reyes,
Avi Singh,
Kate Baumli,
Shariq Iqbal,
Colton Bishop,
Rebecca Roelofs,
Lei M Zhang,
Kay McKinney,
Disha Shrivastava,
Cosmin Paduraru,
George Tucker,
Doina Precup,
Feryal Behbahani,
Aleksandra Faust
Abstract:
Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are often insufficient for instilling self-correction behavior. In particular, we observe that training via SFT falls prey to either a distribution mismatch between mistakes made by the data-collection policy and the model's own responses, or to behavior collapse, where learning implicitly prefers only a certain mode of correction behavior that is often not effective at self-correction on test problems. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time as opposed to fitting high-reward responses for a given prompt. This regularization process includes an initial phase of multi-turn RL on a base model to generate a policy initialization that is less susceptible to collapse, followed by using a reward bonus to amplify self-correction. With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
Submitted 4 October, 2024; v1 submitted 19 September, 2024;
originally announced September 2024.
-
Hedging Is Not All You Need: A Simple Baseline for Online Learning Under Haphazard Inputs
Authors:
Himanshu Buckchash,
Momojit Biswas,
Rohit Agarwal,
Dilip K. Prasad
Abstract:
Handling haphazard streaming data, such as data from edge devices, presents a challenging problem. Over time, the incoming data becomes inconsistent, with missing, faulty, or new inputs reappearing. Therefore, it requires models that are reliable. Recent methods to solve this problem depend on a hedging-based solution and require specialized elements like auxiliary dropouts, forked architectures, and intricate network design. We observed that hedging can be reduced to a special case of weighted residual connection; this motivated us to approximate it with plain self-attention. In this work, we propose HapNet, a simple baseline that is scalable, does not require online backpropagation, and is adaptable to varying input types. All present methods are restricted to scaling with a fixed window; however, we introduce a more complex problem of scaling with a variable window where the data becomes positionally uncorrelated, and cannot be addressed by present methods. We demonstrate that a variant of the proposed approach can work even for this complex scenario. We extensively evaluated the proposed approach on five benchmarks and found competitive performance.
Submitted 30 December, 2024; v1 submitted 16 September, 2024;
originally announced September 2024.
-
Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling
Authors:
Hritik Bansal,
Arian Hosseini,
Rishabh Agarwal,
Vinh Q. Tran,
Mehran Kazemi
Abstract:
Training on high-quality synthetic data from strong language models (LMs) is a common strategy to improve the reasoning performance of LMs. In this work, we revisit whether this strategy is compute-optimal under a fixed inference budget (e.g., FLOPs). To do so, we investigate the trade-offs between generating synthetic data using a stronger but more expensive (SE) model versus a weaker but cheaper (WC) model. We evaluate the generated data across three key metrics: coverage, diversity, and false positive rate, and show that the data from WC models may have higher coverage and diversity, but also exhibit higher false positive rates. We then finetune LMs on data from SE and WC models in different settings: knowledge distillation, self-improvement, and a novel weak-to-strong improvement setup where a weaker LM teaches reasoning to a stronger LM. Our findings reveal that models finetuned on WC-generated data consistently outperform those trained on SE-generated data across multiple benchmarks and multiple choices of WC and SE models. These results challenge the prevailing practice of relying on SE models for synthetic data generation, suggesting that WC may be the compute-optimal approach for training advanced LM reasoners.
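The fixed-budget comparison in the abstract can be made concrete with a small arithmetic sketch. It assumes, as a simplification not stated in the abstract, that the sampling cost per token is proportional to parameter count and that both models emit the same number of tokens per sample. The function name and the parameter counts are hypothetical.

```python
def compute_matched_samples(samples_se: int, params_se: int, params_wc: int) -> int:
    """Number of WC-model samples affordable for the cost of `samples_se` SE samples.

    Assumes cost per sample is proportional to parameter count (illustrative only).
    """
    if params_se <= 0 or params_wc <= 0:
        raise ValueError("parameter counts must be positive")
    return samples_se * params_se // params_wc


# Hypothetical sizes: a 27B-parameter SE model and a 9B-parameter WC model.
wc_samples = compute_matched_samples(samples_se=1, params_se=27_000_000_000,
                                     params_wc=9_000_000_000)
```

Under these assumptions, each sample from the stronger model costs as much as three from the weaker one. That ratio is what allows the WC data to reach higher coverage and diversity at equal compute.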
Submitted 7 October, 2024; v1 submitted 29 August, 2024;
originally announced August 2024.
-
Lyrically Speaking: Exploring the Link Between Lyrical Emotions, Themes and Depression Risk
Authors:
Pavani Chowdary,
Bhavyajeet Singh,
Rajat Agarwal,
Vinoo Alluri
Abstract:
Lyrics play a crucial role in affecting and reinforcing emotional states by providing meaning and emotional connotations that interact with the acoustic properties of the music. Specific lyrical themes and emotions may intensify existing negative states in listeners and may lead to undesirable outcomes, especially in listeners with mood disorders such as depression. Hence, it is important for such individuals to be mindful of their listening strategies. In this study, we examine online music consumption of individuals at risk of depression in light of lyrical themes and emotions. Lyrics obtained from the listening histories of 541 Last.fm users, divided into At-Risk and No-Risk based on their mental well-being scores, were analyzed using natural language processing techniques. Statistical analyses of the results revealed that individuals at risk for depression prefer songs with lyrics associated with low valence and low arousal. Additionally, lyrics associated with themes of denial, self-reference, and ambivalence were preferred. In contrast, themes such as liberation, familiarity, and activity are not as favored. This study opens up the possibility of an approach to assessing depression risk from the digital footprint of individuals and potentially developing personalized recommendation systems.
Submitted 28 August, 2024;
originally announced August 2024.