-
EmojiVoice: Towards long-term controllable expressivity in robot speech
Authors:
Paige Tuttösí,
Shivam Mehta,
Zachary Syvenky,
Bermet Burkanova,
Gustav Eje Henter,
Angelica Lim
Abstract:
Humans vary their expressivity when speaking for extended periods to maintain engagement with their listener. Although social robots tend to be deployed with ``expressive'' joyful voices, they lack this long-term variation found in human speech. Foundation model text-to-speech systems are beginning to mimic the expressivity in human speech, but they are difficult to deploy offline on robots. We pr…
▽ More
Humans vary their expressivity when speaking for extended periods to maintain engagement with their listener. Although social robots tend to be deployed with ``expressive'' joyful voices, they lack this long-term variation found in human speech. Foundation model text-to-speech systems are beginning to mimic the expressivity in human speech, but they are difficult to deploy offline on robots. We present EmojiVoice, a free, customizable text-to-speech (TTS) toolkit that allows social roboticists to build temporally variable, expressive speech on social robots. We introduce emoji-prompting to allow fine-grained control of expressivity on a phase level and use the lightweight Matcha-TTS backbone to generate speech in real-time. We explore three case studies: (1) a scripted conversation with a robot assistant, (2) a storytelling robot, and (3) an autonomous speech-to-speech interactive agent. We found that using varied emoji prompting improved the perception and expressivity of speech over a long period in a storytelling task, but expressive voice was not preferred in the assistant use case.
△ Less
Submitted 17 June, 2025;
originally announced June 2025.
-
The Amazon Nova Family of Models: Technical Report and Model Card
Authors:
Amazon AGI,
Aaron Langford,
Aayush Shah,
Abhanshu Gupta,
Abhimanyu Bhatter,
Abhinav Goyal,
Abhinav Mathur,
Abhinav Mohanty,
Abhishek Kumar,
Abhishek Sethi,
Abi Komma,
Abner Pena,
Achin Jain,
Adam Kunysz,
Adam Opyrchal,
Adarsh Singh,
Aditya Rawal,
Adok Achar Budihal Prasad,
Adrià de Gispert,
Agnika Kumar,
Aishwarya Aryamane,
Ajay Nair,
Akilan M,
Akshaya Iyengar,
Akshaya Vishnu Kudlu Shanbhogue
, et al. (761 additional authors not shown)
Abstract:
We present Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. Amazon Nova Pro is a highly-capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Lite is a low-cost multimodal model that is lightning fast for processing images, video, documents…
▽ More
We present Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. Amazon Nova Pro is a highly-capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Lite is a low-cost multimodal model that is lightning fast for processing images, video, documents and text. Amazon Nova Micro is a text-only model that delivers our lowest-latency responses at very low cost. Amazon Nova Canvas is an image generation model that creates professional grade images with rich customization controls. Amazon Nova Reel is a video generation model offering high-quality outputs, customization, and motion control. Our models were built responsibly and with a commitment to customer trust, security, and reliability. We report benchmarking results for core capabilities, agentic performance, long context, functional adaptation, runtime performance, and human evaluation.
△ Less
Submitted 17 March, 2025;
originally announced June 2025.
-
FIGNN: Feature-Specific Interpretability for Graph Neural Network Surrogate Models
Authors:
Riddhiman Raut,
Romit Maulik,
Shivam Barwey
Abstract:
This work presents a novel graph neural network (GNN) architecture, the Feature-specific Interpretable Graph Neural Network (FIGNN), designed to enhance the interpretability of deep learning surrogate models defined on unstructured grids in scientific applications. Traditional GNNs often obscure the distinct spatial influences of different features in multivariate prediction tasks. FIGNN addresses…
▽ More
This work presents a novel graph neural network (GNN) architecture, the Feature-specific Interpretable Graph Neural Network (FIGNN), designed to enhance the interpretability of deep learning surrogate models defined on unstructured grids in scientific applications. Traditional GNNs often obscure the distinct spatial influences of different features in multivariate prediction tasks. FIGNN addresses this limitation by introducing a feature-specific pooling strategy, which enables independent attribution of spatial importance for each predicted variable. Additionally, a mask-based regularization term is incorporated into the training objective to explicitly encourage alignment between interpretability and predictive error, promoting localized attribution of model performance. The method is evaluated for surrogate modeling of two physically distinct systems: the SPEEDY atmospheric circulation model and the backward-facing step (BFS) fluid dynamics benchmark. Results demonstrate that FIGNN achieves competitive predictive performance while revealing physically meaningful spatial patterns unique to each feature. Analysis of rollout stability, feature-wise error budgets, and spatial mask overlays confirm the utility of FIGNN as a general-purpose framework for interpretable surrogate modeling in complex physical domains.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
Beyond the Buzz: A Pragmatic Take on Inference Disaggregation
Authors:
Tiyasa Mitra,
Ritika Borkar,
Nidhi Bhatia,
Ramon Matas,
Shivam Raj,
Dheevatsa Mudigere,
Ritchie Zhao,
Maximilian Golub,
Arpan Dutta,
Sailaja Madduri,
Dharmesh Jani,
Brian Pharris,
Bita Darvish Rouhani
Abstract:
As inference scales to multi-node deployments, disaggregation - splitting inference into distinct phases - offers a promising path to improving the throughput-interactivity Pareto frontier. Despite growing enthusiasm and a surge of open-source efforts, practical deployment of disaggregated serving remains limited due to the complexity of the optimization search space and system-level coordination.…
▽ More
As inference scales to multi-node deployments, disaggregation - splitting inference into distinct phases - offers a promising path to improving the throughput-interactivity Pareto frontier. Despite growing enthusiasm and a surge of open-source efforts, practical deployment of disaggregated serving remains limited due to the complexity of the optimization search space and system-level coordination. In this paper, we present the first systematic study of disaggregated inference at scale, evaluating hundreds of thousands of design points across diverse workloads and hardware configurations. We find that disaggregation is most effective for prefill-heavy traffic patterns and larger models. Our results highlight the critical role of dynamic rate matching and elastic scaling in achieving Pareto-optimal performance. Our findings offer actionable insights for efficient disaggregated deployments to navigate the trade-off between system throughput and interactivity.
△ Less
Submitted 5 June, 2025;
originally announced June 2025.
-
A Risk-Aware Reinforcement Learning Reward for Financial Trading
Authors:
Uditansh Srivastava,
Shivam Aryan,
Shaurya Singh
Abstract:
We propose a novel composite reward function for reinforcement learning in financial trading that balances return and risk using four differentiable terms: annualized return downside risk differential return and the Treynor ratio
Unlike single metric objectives for example the Sharpe ratio our formulation is modular and parameterized by weights w1 w2 w3 and w4 enabling practitioners to encode di…
▽ More
We propose a novel composite reward function for reinforcement learning in financial trading that balances return and risk using four differentiable terms: annualized return downside risk differential return and the Treynor ratio
Unlike single metric objectives for example the Sharpe ratio our formulation is modular and parameterized by weights w1 w2 w3 and w4 enabling practitioners to encode diverse investor preferences
We tune these weights via grid search to target specific risk return profiles
We derive closed form gradients for each term to facilitate gradient based training and analyze key theoretical properties including monotonicity boundedness and modularity
This framework offers a general blueprint for building robust multi objective reward functions in complex trading environments and can be extended with additional risk measures or adaptive weighting
△ Less
Submitted 4 June, 2025;
originally announced June 2025.
-
RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions
Authors:
Bimsara Pathiraja,
Maitreya Patel,
Shivam Singh,
Yezhou Yang,
Chitta Baral
Abstract:
Despite recent advances in inversion and instruction-based image editing, existing approaches primarily excel at editing single, prominent objects but significantly struggle when applied to complex scenes containing multiple entities. To quantify this gap, we first introduce RefEdit-Bench, a rigorous real-world benchmark rooted in RefCOCO, where even baselines trained on millions of samples perfor…
▽ More
Despite recent advances in inversion and instruction-based image editing, existing approaches primarily excel at editing single, prominent objects but significantly struggle when applied to complex scenes containing multiple entities. To quantify this gap, we first introduce RefEdit-Bench, a rigorous real-world benchmark rooted in RefCOCO, where even baselines trained on millions of samples perform poorly. To overcome this limitation, we introduce RefEdit -- an instruction-based editing model trained on our scalable synthetic data generation pipeline. Our RefEdit, trained on only 20,000 editing triplets, outperforms the Flux/SD3 model-based baselines trained on millions of data. Extensive evaluations across various benchmarks demonstrate that our model not only excels in referring expression tasks but also enhances performance on traditional benchmarks, achieving state-of-the-art results comparable to closed-source methods. We release data \& checkpoint for reproducibility.
△ Less
Submitted 3 June, 2025;
originally announced June 2025.
-
Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection
Authors:
Shivam Chandhok,
Qian Yang,
Oscar Manas,
Kanishk Jain,
Leonid Sigal,
Aishwarya Agrawal
Abstract:
Instruction tuning has been central to the success of recent vision-language models (VLMs), but it remains expensive-requiring large-scale datasets, high-quality annotations, and large compute budgets. We propose PRioritized cOncept learninG via Relative Error-driven Sample Selection (PROGRESS), a data- and compute-efficient framework that enables VLMs to dynamically select what to learn next base…
▽ More
Instruction tuning has been central to the success of recent vision-language models (VLMs), but it remains expensive-requiring large-scale datasets, high-quality annotations, and large compute budgets. We propose PRioritized cOncept learninG via Relative Error-driven Sample Selection (PROGRESS), a data- and compute-efficient framework that enables VLMs to dynamically select what to learn next based on their evolving needs during training. At each stage, the model tracks its learning progress across skills and selects the most informative samples-those it has not already mastered and that are not too difficult to learn at the current stage of training. This strategy effectively controls skill acquisition and the order in which skills are learned. Specifically, we sample from skills showing the highest learning progress, prioritizing those with the most rapid improvement. Unlike prior methods, PROGRESS requires no upfront answer annotations, queries answers only on a need basis, avoids reliance on additional supervision from auxiliary VLMs, and does not require compute-heavy gradient computations for data selection. Experiments across multiple instruction-tuning datasets of varying scales demonstrate that PROGRESS consistently outperforms state-of-the-art baselines with much less data and supervision. Additionally, we show strong cross-architecture generalization and transferability to larger models, validating PROGRESS as a scalable solution for efficient learning.
△ Less
Submitted 1 June, 2025;
originally announced June 2025.
-
Beyond Winning: Margin of Victory Relative to Expectation Unlocks Accurate Skill Ratings
Authors:
Shivam Shorewala,
Zihao Yang
Abstract:
Knowledge of accurate relative skills in any competitive system is essential, but foundational approaches such as ELO discard extremely relevant performance data by concentrating exclusively on binary outcomes. While margin of victory (MOV) extensions exist, they often lack a definitive method for incorporating this information. We introduce Margin of Victory Differential Analysis (MOVDA), a frame…
▽ More
Knowledge of accurate relative skills in any competitive system is essential, but foundational approaches such as ELO discard extremely relevant performance data by concentrating exclusively on binary outcomes. While margin of victory (MOV) extensions exist, they often lack a definitive method for incorporating this information. We introduce Margin of Victory Differential Analysis (MOVDA), a framework that enhances traditional rating systems by using the deviation between the true MOV and a $\textit{modeled expectation}$. MOVDA learns a domain-specific, non-linear function (a scaled hyperbolic tangent that captures saturation effects and home advantage) to predict expected MOV based on rating differentials. Crucially, the $\textit{difference}$ between the true and expected MOV provides a subtle and weighted signal for rating updates, highlighting informative deviations in all levels of contests. Extensive experiments on professional NBA basketball data (from 2013 to 2023, with 13,619 games) show that MOVDA significantly outperforms standard ELO and Bayesian baselines. MOVDA reduces Brier score prediction error by $1.54\%$ compared to TrueSkill, increases outcome accuracy by $0.58\%$, and most importantly accelerates rating convergence by $13.5\%$, while maintaining the computational efficiency of the original ELO updates. MOVDA offers a theoretically motivated, empirically superior, and computationally lean approach to integrating performance magnitude into skill rating for competitive environments like the NBA.
△ Less
Submitted 30 May, 2025;
originally announced June 2025.
-
AutoChemSchematic AI: A Closed-Loop, Physics-Aware Agentic Framework for Auto-Generating Chemical Process and Instrumentation Diagrams
Authors:
Sakhinana Sagar Srinivas,
Shivam Gupta,
Venkataramana Runkana
Abstract:
Recent advancements in generative AI have accelerated the discovery of novel chemicals and materials; however, transitioning these discoveries to industrial-scale production remains a critical bottleneck, as it requires the development of entirely new chemical manufacturing processes. Current AI methods cannot auto-generate PFDs or PIDs, despite their critical role in scaling chemical processes, w…
▽ More
Recent advancements in generative AI have accelerated the discovery of novel chemicals and materials; however, transitioning these discoveries to industrial-scale production remains a critical bottleneck, as it requires the development of entirely new chemical manufacturing processes. Current AI methods cannot auto-generate PFDs or PIDs, despite their critical role in scaling chemical processes, while adhering to engineering constraints. We present a closed loop, physics aware framework for the automated generation of industrially viable PFDs and PIDs. The framework integrates domain specialized small scale language models (SLMs) (trained for chemical process QA tasks) with first principles simulation, leveraging three key components: (1) a hierarchical knowledge graph of process flow and instrumentation descriptions for 1,020+ chemicals, (2) a multi-stage training pipeline that fine tunes domain specialized SLMs on synthetic datasets via Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Retrieval-Augmented Instruction Tuning (RAIT), and (3) DWSIM based simulator in the loop validation to ensure feasibility. To improve both runtime efficiency and model compactness, the framework incorporates advanced inference time optimizations including FlashAttention, Lookahead Decoding, PagedAttention with KV-cache quantization, and Test Time Inference Scaling and independently applies structural pruning techniques (width and depth) guided by importance heuristics to reduce model size with minimal accuracy loss. Experiments demonstrate that the framework generates simulator-validated process descriptions with high fidelity, outperforms baseline methods in correctness, and generalizes to unseen chemicals. By bridging AI-driven design with industrial-scale feasibility, this work significantly reduces R&D timelines from lab discovery to plant deployment.
△ Less
Submitted 1 June, 2025; v1 submitted 30 May, 2025;
originally announced May 2025.
-
Anomaly Detection and Improvement of Clusters using Enhanced K-Means Algorithm
Authors:
Vardhan Shorewala,
Shivam Shorewala
Abstract:
This paper introduces a unified approach to cluster refinement and anomaly detection in datasets. We propose a novel algorithm that iteratively reduces the intra-cluster variance of N clusters until a global minimum is reached, yielding tighter clusters than the standard k-means algorithm. We evaluate the method using intrinsic measures for unsupervised learning, including the silhouette coefficie…
▽ More
This paper introduces a unified approach to cluster refinement and anomaly detection in datasets. We propose a novel algorithm that iteratively reduces the intra-cluster variance of N clusters until a global minimum is reached, yielding tighter clusters than the standard k-means algorithm. We evaluate the method using intrinsic measures for unsupervised learning, including the silhouette coefficient, Calinski-Harabasz index, and Davies-Bouldin index, and extend it to anomaly detection by identifying points whose assignment causes a significant variance increase. External validation on synthetic data and the UCI Breast Cancer and UCI Wine Quality datasets employs the Jaccard similarity score, V-measure, and F1 score. Results show variance reductions of 18.7% and 88.1% on the synthetic and Wine Quality datasets, respectively, along with accuracy and F1 score improvements of 22.5% and 20.8% on the Wine Quality dataset.
△ Less
Submitted 30 May, 2025;
originally announced May 2025.
-
MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks
Authors:
Suhana Bedi,
Hejie Cui,
Miguel Fuentes,
Alyssa Unell,
Michael Wornow,
Juan M. Banda,
Nikesh Kotecha,
Timothy Keyes,
Yifan Mai,
Mert Oez,
Hao Qiu,
Shrey Jain,
Leonardo Schettini,
Mehr Kashyap,
Jason Alan Fries,
Akshay Swaminathan,
Philip Chung,
Fateme Nateghi,
Asad Aali,
Ashwin Nayak,
Shivam Vedak,
Sneha S. Jain,
Birju Patel,
Oluseyi Fayanju,
Shreya Shah
, et al. (56 additional authors not shown)
Abstract:
While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. We introduce MedHELM, an extensible evaluation framework for assessing LLM performance for medical tasks with three key contributions. First, a clinician-validated taxonomy spanning 5 categories, 22 subcatego…
▽ More
While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. We introduce MedHELM, an extensible evaluation framework for assessing LLM performance for medical tasks with three key contributions. First, a clinician-validated taxonomy spanning 5 categories, 22 subcategories, and 121 tasks developed with 29 clinicians. Second, a comprehensive benchmark suite comprising 35 benchmarks (17 existing, 18 newly formulated) providing complete coverage of all categories and subcategories in the taxonomy. Third, a systematic comparison of LLMs with improved evaluation methods (using an LLM-jury) and a cost-performance analysis. Evaluation of 9 frontier LLMs, using the 35 benchmarks, revealed significant performance variation. Advanced reasoning models (DeepSeek R1: 66% win-rate; o3-mini: 64% win-rate) demonstrated superior performance, though Claude 3.5 Sonnet achieved comparable results at 40% lower estimated computational cost. On a normalized accuracy scale (0-1), most models performed strongly in Clinical Note Generation (0.73-0.85) and Patient Communication & Education (0.78-0.83), moderately in Medical Research Assistance (0.65-0.75), and generally lower in Clinical Decision Support (0.56-0.72) and Administration & Workflow (0.53-0.63). Our LLM-jury evaluation method achieved good agreement with clinician ratings (ICC = 0.47), surpassing both average clinician-clinician agreement (ICC = 0.43) and automated baselines including ROUGE-L (0.36) and BERTScore-F1 (0.44). Claude 3.5 Sonnet achieved comparable performance to top models at lower estimated cost. These findings highlight the importance of real-world, task-specific evaluation for medical use of LLMs and provides an open source framework to enable this.
△ Less
Submitted 2 June, 2025; v1 submitted 26 May, 2025;
originally announced May 2025.
-
PartInstruct: Part-level Instruction Following for Fine-grained Robot Manipulation
Authors:
Yifan Yin,
Zhengtao Han,
Shivam Aarya,
Jianxin Wang,
Shuhang Xu,
Jiawei Peng,
Angtian Wang,
Alan Yuille,
Tianmin Shu
Abstract:
Fine-grained robot manipulation, such as lifting and rotating a bottle to display the label on the cap, requires robust reasoning about object parts and their relationships with intended tasks. Despite recent advances in training general-purpose robot manipulation policies guided by language instructions, there is a notable lack of large-scale datasets for fine-grained manipulation tasks with part…
▽ More
Fine-grained robot manipulation, such as lifting and rotating a bottle to display the label on the cap, requires robust reasoning about object parts and their relationships with intended tasks. Despite recent advances in training general-purpose robot manipulation policies guided by language instructions, there is a notable lack of large-scale datasets for fine-grained manipulation tasks with part-level instructions and diverse 3D object instances annotated with part-level labels. In this work, we introduce PartInstruct, the first large-scale benchmark for training and evaluating fine-grained robot manipulation models using part-level instructions. PartInstruct comprises 513 object instances across 14 categories, each annotated with part-level information, and 1302 fine-grained manipulation tasks organized into 16 task classes. Our training set consists of over 10,000 expert demonstrations synthesized in a 3D simulator, where each demonstration is paired with a high-level task instruction, a chain of base part-based skill instructions, and ground-truth 3D information about the object and its parts. Additionally, we designed a comprehensive test suite to evaluate the generalizability of learned policies across new states, objects, and tasks. We evaluated several state-of-the-art robot manipulation approaches, including end-to-end vision-language policy learning and bi-level planning models for robot manipulation on our benchmark. The experimental results reveal that current models struggle to robustly ground part concepts and predict actions in 3D space, and face challenges when manipulating object parts in long-horizon tasks.
△ Less
Submitted 16 June, 2025; v1 submitted 27 May, 2025;
originally announced May 2025.
-
DNF Learning via Locally Mixing Random Walks
Authors:
Josh Alman,
Shivam Nadimpalli,
Shyamal Patel,
Rocco A. Servedio
Abstract:
We give two results on PAC learning DNF formulas using membership queries in the challenging "distribution-free" learning framework, where learning algorithms must succeed for an arbitrary and unknown distribution over $\{0,1\}^n$.
(1) We first give a quasi-polynomial time "list-decoding" algorithm for learning a single term of an unknown DNF formula. More precisely, for any target $s$-term DNF…
▽ More
We give two results on PAC learning DNF formulas using membership queries in the challenging "distribution-free" learning framework, where learning algorithms must succeed for an arbitrary and unknown distribution over $\{0,1\}^n$.
(1) We first give a quasi-polynomial time "list-decoding" algorithm for learning a single term of an unknown DNF formula. More precisely, for any target $s$-term DNF formula $f = T_1 \vee \cdots \vee T_s$ over $\{0,1\}^n$ and any unknown distribution $D$ over $\{0,1\}^n$, our algorithm, which uses membership queries and random examples from $D$, runs in $\textsf{quasipoly}(n,s)$ time and outputs a list $L$ of candidate terms such that with high probability some term $T_i$ of $f$ belongs to $L$.
(2) We then use result (1) to give a $\textsf{quasipoly}(n,s)$-time algorithm, in the distribution-free PAC learning model with membership queries, for learning the class of size-$s$ DNFs in which all terms have the same size. Our algorithm learns using a DNF hypothesis.
The key tool used to establish result (1) is a new result on "locally mixing random walks," which, roughly speaking, shows that a random walk on a graph that is covered by a small number of expanders has a non-negligible probability of mixing quickly in a subset of these expanders.
△ Less
Submitted 24 May, 2025;
originally announced May 2025.
-
Tropical Geometry Based Edge Detection Using Min-Plus and Max-Plus Algebra
Authors:
Shivam Kumar Jha S,
Jaya NN Iyer
Abstract:
This paper proposes a tropical geometry-based edge detection framework that reformulates convolution and gradient computations using min-plus and max-plus algebra. The tropical formulation emphasizes dominant intensity variations, contributing to sharper and more continuous edge representations. Three variants are explored: an adaptive threshold-based method, a multi-kernel min-plus method, and a…
▽ More
This paper proposes a tropical geometry-based edge detection framework that reformulates convolution and gradient computations using min-plus and max-plus algebra. The tropical formulation emphasizes dominant intensity variations, contributing to sharper and more continuous edge representations. Three variants are explored: an adaptive threshold-based method, a multi-kernel min-plus method, and a max-plus method emphasizing structural continuity. The framework integrates multi-scale processing, Hessian filtering, and wavelet shrinkage to enhance edge transitions while maintaining computational efficiency. Experiments on MATLAB built-in grayscale and color images suggest that tropical formulations integrated with classical operators, such as Canny and LoG, can improve boundary detection in low-contrast and textured regions. Quantitative evaluation using standard edge metrics indicates favorable edge clarity and structural coherence. These results highlight the potential of tropical algebra as a scalable and noise-aware formulation for edge detection in practical image analysis tasks.
△ Less
Submitted 24 May, 2025;
originally announced May 2025.
-
MetaGen Blended RAG: Unlocking Zero-Shot Precision for Specialized Domain Question-Answering
Authors:
Kunal Sawarkar,
Shivam R. Solanki,
Abhilasha Mangal
Abstract:
Retrieval-Augmented Generation (RAG) struggles with domain-specific enterprise datasets, often isolated behind firewalls and rich in complex, specialized terminology unseen by LLMs during pre-training. Semantic variability across domains like medicine, networking, or law hampers RAG's context precision, while fine-tuning solutions are costly, slow, and lack generalization as new data emerges. Achi…
▽ More
Retrieval-Augmented Generation (RAG) struggles with domain-specific enterprise datasets, often isolated behind firewalls and rich in complex, specialized terminology unseen by LLMs during pre-training. Semantic variability across domains like medicine, networking, or law hampers RAG's context precision, while fine-tuning solutions are costly, slow, and lack generalization as new data emerges. Achieving zero-shot precision with retrievers without fine-tuning still remains a key challenge. We introduce 'MetaGen Blended RAG', a novel enterprise search approach that enhances semantic retrievers through a metadata generation pipeline and hybrid query indexes using dense and sparse vectors. By leveraging key concepts, topics, and acronyms, our method creates metadata-enriched semantic indexes and boosted hybrid queries, delivering robust, scalable performance without fine-tuning. On the biomedical PubMedQA dataset, MetaGen Blended RAG achieves 82% retrieval accuracy and 77% RAG accuracy, surpassing all prior zero-shot RAG benchmarks and even rivaling fine-tuned models on that dataset, while also excelling on datasets like SQuAD and NQ. This approach redefines enterprise search using a new approach to building semantic retrievers with unmatched generalization across specialized domains.
△ Less
Submitted 4 June, 2025; v1 submitted 23 May, 2025;
originally announced May 2025.
-
The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning
Authors:
Shivam Agarwal,
Zimin Zhang,
Lifan Yuan,
Jiawei Han,
Hao Peng
Abstract:
Entropy minimization (EM) trains the model to concentrate even more probability mass on its most confident outputs. We show that this simple objective alone, without any labeled data, can substantially improve large language models' (LLMs) performance on challenging math, physics, and coding tasks. We explore three approaches: (1) EM-FT minimizes token-level entropy similarly to instruction finetu…
▽ More
Entropy minimization (EM) trains the model to concentrate even more probability mass on its most confident outputs. We show that this simple objective alone, without any labeled data, can substantially improve large language models' (LLMs) performance on challenging math, physics, and coding tasks. We explore three approaches: (1) EM-FT minimizes token-level entropy similarly to instruction finetuning, but on unlabeled outputs drawn from the model; (2) EM-RL: reinforcement learning with negative entropy as the only reward to maximize; (3) EM-INF: inference-time logit adjustment to reduce entropy without any training data or parameter updates. On Qwen-7B, EM-RL, without any labeled data, achieves comparable or better performance than strong RL baselines such as GRPO and RLOO that are trained on 60K labeled examples. Furthermore, EM-INF enables Qwen-32B to match or exceed the performance of proprietary models like GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro on the challenging SciCode benchmark, while being 3x more efficient than self-consistency and sequential refinement. Our findings reveal that many pretrained LLMs possess previously underappreciated reasoning capabilities that can be effectively elicited through entropy minimization alone, without any labeled data or even any parameter updates.
△ Less
Submitted 21 May, 2025;
originally announced May 2025.
-
APEX: Action Priors Enable Efficient Exploration for Skill Imitation on Articulated Robots
Authors:
Shivam Sood,
Laukik B Nakhwa,
Yuhong Cao,
Sun Ge,
Guillaume Sartoretti
Abstract:
Learning by imitation provides an effective way for robots to develop well-regulated complex behaviors and directly benefit from natural demonstrations. State-of-the-art imitation learning (IL) approaches typically leverage Adversarial Motion Priors (AMP), which, despite their impressive results, suffer from two key limitations. They are prone to mode collapse, which often leads to overfitting to…
▽ More
Learning by imitation provides an effective way for robots to develop well-regulated complex behaviors and directly benefit from natural demonstrations. State-of-the-art imitation learning (IL) approaches typically leverage Adversarial Motion Priors (AMP), which, despite their impressive results, suffer from two key limitations. They are prone to mode collapse, which often leads to overfitting to the simulation environment and thus increased sim-to-real gap, and they struggle to learn diverse behaviors effectively. To overcome these limitations, we introduce APEX (Action Priors enable Efficient eXploration): a simple yet versatile IL framework that integrates demonstrations directly into reinforcement learning (RL), maintaining high exploration while grounding behavior with expert-informed priors. We achieve this through a combination of decaying action priors, which initially bias exploration toward expert demonstrations but gradually allow the policy to explore independently. This is complemented by a multi-critic RL framework that effectively balances stylistic consistency with task performance. Our approach achieves sample-efficient IL and enables the acquisition of diverse skills within a single policy. APEX generalizes to varying velocities and preserves reference-like styles across complex tasks such as navigating rough terrain and climbing stairs, utilizing only flat-terrain kinematic motion data as a prior. We validate our framework through extensive hardware experiments on the Unitree Go2 quadruped. There, APEX yields diverse and agile locomotion gaits, inherent gait transitions, and the highest reported speed for the platform to the best of our knowledge (peak velocity of ~3.3 m/s on hardware). Our results establish APEX as a compelling alternative to existing IL methods, offering better efficiency, adaptability, and real-world performance. https://marmotlab.github.io/APEX/
△ Less
Submitted 12 June, 2025; v1 submitted 15 May, 2025;
originally announced May 2025.
-
Cost-Effective, Low Latency Vector Search with Azure Cosmos DB
Authors:
Nitish Upreti,
Krishnan Sundaram,
Hari Sudan Sundar,
Samer Boshra,
Balachandar Perumalswamy,
Shivam Atri,
Martin Chisholm,
Revti Raman Singh,
Greg Yang,
Subramanyam Pattipaka,
Tamara Hass,
Nitesh Dudhey,
James Codella,
Mark Hildebrand,
Magdalen Manohar,
Jack Moffitt,
Haiyang Xu,
Naren Datha,
Suryansh Gupta,
Ravishankar Krishnaswamy,
Prashant Gupta,
Abhishek Sahu,
Ritika Mor,
Santosh Kulkarni,
Hemeswari Varada
, et al. (11 additional authors not shown)
Abstract:
Vector indexing enables semantic search over diverse corpora and has become an important interface to databases for both users and AI agents. Efficient vector search requires deep optimizations in database systems. This has motivated a new class of specialized vector databases that optimize for vector search quality and cost. Instead, we argue that a scalable, high-performance, and cost-efficient…
▽ More
Vector indexing enables semantic search over diverse corpora and has become an important interface to databases for both users and AI agents. Efficient vector search requires deep optimizations in database systems. This has motivated a new class of specialized vector databases that optimize for vector search quality and cost. Instead, we argue that a scalable, high-performance, and cost-efficient vector search system can be built inside a cloud-native operational database like Azure Cosmos DB while leveraging the benefits of a distributed database such as high availability, durability, and scale. We do this by deeply integrating DiskANN, a state-of-the-art vector indexing library, inside Azure Cosmos DB NoSQL. This system uses a single vector index per partition stored in existing index trees, and kept in sync with underlying data. It supports < 20ms query latency over an index spanning 10 million of vectors, has stable recall over updates, and offers nearly 15x and 41x lower query cost compared to Zilliz and Pinecone serverless enterprise products. It also scales out to billions of vectors via automatic partitioning. This convergent design presents a point in favor of integrating vector indices into operational databases in the context of recent debates on specialized vector databases, and offers a template for vector indexing in other databases.
△ Less
Submitted 9 May, 2025;
originally announced May 2025.
-
Hardware-Enabled Mechanisms for Verifying Responsible AI Development
Authors:
Aidan O'Gara,
Gabriel Kulp,
Will Hodgkins,
James Petrie,
Vincent Immler,
Aydin Aysu,
Kanad Basu,
Shivam Bhasin,
Stjepan Picek,
Ankur Srivastava
Abstract:
Advancements in AI capabilities, driven in large part by scaling up computing resources used for AI training, have created opportunities to address major global challenges but also pose risks of misuse. Hardware-enabled mechanisms (HEMs) can support responsible AI development by enabling verifiable reporting of key properties of AI training activities such as quantity of compute used, training clu…
▽ More
Advancements in AI capabilities, driven in large part by scaling up computing resources used for AI training, have created opportunities to address major global challenges but also pose risks of misuse. Hardware-enabled mechanisms (HEMs) can support responsible AI development by enabling verifiable reporting of key properties of AI training activities such as quantity of compute used, training cluster configuration or location, as well as policy enforcement. Such tools can promote transparency and improve security, while addressing privacy and intellectual property concerns. Based on insights from an interdisciplinary workshop, we identify open questions regarding potential implementation approaches, emphasizing the need for further research to ensure robust, scalable solutions.
△ Less
Submitted 2 April, 2025;
originally announced May 2025.
-
Optimal Interactive Learning on the Job via Facility Location Planning
Authors:
Shivam Vats,
Michelle Zhao,
Patrick Callaghan,
Mingxi Jia,
Maxim Likhachev,
Oliver Kroemer,
George Konidaris
Abstract:
Collaborative robots must continually adapt to novel tasks and user preferences without overburdening the user. While prior interactive robot learning methods aim to reduce human effort, they are typically limited to single-task scenarios and are not well-suited for sustained, multi-task collaboration. We propose COIL (Cost-Optimal Interactive Learning) -- a multi-task interaction planner that min…
▽ More
Collaborative robots must continually adapt to novel tasks and user preferences without overburdening the user. While prior interactive robot learning methods aim to reduce human effort, they are typically limited to single-task scenarios and are not well-suited for sustained, multi-task collaboration. We propose COIL (Cost-Optimal Interactive Learning) -- a multi-task interaction planner that minimizes human effort across a sequence of tasks by strategically selecting among three query types (skill, preference, and help). When user preferences are known, we formulate COIL as an uncapacitated facility location (UFL) problem, which enables bounded-suboptimal planning in polynomial time using off-the-shelf approximation algorithms. We extend our formulation to handle uncertainty in user preferences by incorporating one-step belief space planning, which uses these approximation algorithms as subroutines to maintain polynomial-time performance. Simulated and physical experiments on manipulation tasks show that our framework significantly reduces the amount of work allocated to the human while maintaining successful task completion.
△ Less
Submitted 1 May, 2025;
originally announced May 2025.
-
Scientific Workflow Scheduling in Cloud Considering Cold Start and Variable Pricing Model
Authors:
Suvarthi Sarkar,
Sparsh Mittal,
Shivam Garg,
Aryabartta Sahu
Abstract:
Cloud computing has become a pivotal platform for executing scientific workflows due to its scalable and cost-effective infrastructure. Scientific Cloud Service Providers (SCSPs) act as intermediaries that rent virtual machines (VMs) from Infrastructure-as-a-Service (IaaS) providers to meet users' workflow execution demands. The SCSP earns profit from the execution of scientific workflows if it co…
▽ More
Cloud computing has become a pivotal platform for executing scientific workflows due to its scalable and cost-effective infrastructure. Scientific Cloud Service Providers (SCSPs) act as intermediaries that rent virtual machines (VMs) from Infrastructure-as-a-Service (IaaS) providers to meet users' workflow execution demands. The SCSP earns profit from the execution of scientific workflows if it completes the execution of the workflow before the specified deadline of the workflow. This paper addresses two key challenges that impact the profitability of SCSPs: the cold start problem and the efficient management of diverse VM pricing models, namely reserved, on-demand, and spot instances.
We propose a hybrid scheduling framework that integrates initial planning based on historical data with real-time adaptations informed by actual workload variations. In the initial phase, VMs are provisioned using reserved pricing based on predicted workloads and spot instances. During execution, the system dynamically adjusts by provisioning additional VMs through on-demand or spot instances to accommodate unexpected bursts in task arrivals. Our framework also incorporates a dependency-aware task scheduling strategy that accounts for cold start delays and spot pricing volatility. Experimental results on real-world benchmark datasets demonstrate that our approach outperforms state-of-the-art methods, achieving up to 20% improvement over cold-start-focused techniques and 15% over pricing-model-based VM provisioning strategies.
△ Less
Submitted 30 April, 2025;
originally announced April 2025.
-
Eval3D: Interpretable and Fine-grained Evaluation for 3D Generation
Authors:
Shivam Duggal,
Yushi Hu,
Oscar Michel,
Aniruddha Kembhavi,
William T. Freeman,
Noah A. Smith,
Ranjay Krishna,
Antonio Torralba,
Ali Farhadi,
Wei-Chiu Ma
Abstract:
Despite the unprecedented progress in the field of 3D generation, current systems still often fail to produce high-quality 3D assets that are visually appealing and geometrically and semantically consistent across multiple viewpoints. To effectively assess the quality of the generated 3D data, there is a need for a reliable 3D evaluation tool. Unfortunately, existing 3D evaluation metrics often ov…
▽ More
Despite the unprecedented progress in the field of 3D generation, current systems still often fail to produce high-quality 3D assets that are visually appealing and geometrically and semantically consistent across multiple viewpoints. To effectively assess the quality of the generated 3D data, there is a need for a reliable 3D evaluation tool. Unfortunately, existing 3D evaluation metrics often overlook the geometric quality of generated assets or merely rely on black-box multimodal large language models for coarse assessment. In this paper, we introduce Eval3D, a fine-grained, interpretable evaluation tool that can faithfully evaluate the quality of generated 3D assets based on various distinct yet complementary criteria. Our key observation is that many desired properties of 3D generation, such as semantic and geometric consistency, can be effectively captured by measuring the consistency among various foundation models and tools. We thus leverage a diverse set of models and tools as probes to evaluate the inconsistency of generated 3D assets across different aspects. Compared to prior work, Eval3D provides pixel-wise measurement, enables accurate 3D spatial feedback, and aligns more closely with human judgments. We comprehensively evaluate existing 3D generation models using Eval3D and highlight the limitations and challenges of current models.
△ Less
Submitted 25 April, 2025;
originally announced April 2025.
-
CUBETESTERAI: Automated JUnit Test Generation using the LLaMA Model
Authors:
Daniele Gorla,
Shivam Kumar,
Pietro Nicolaus Roselli Lorenzini,
Alireza Alipourfaz
Abstract:
This paper presents an approach to automating JUnit test generation for Java applications using the Spring Boot framework, leveraging the LLaMA (Large Language Model Architecture) model to enhance the efficiency and accuracy of the testing process. The resulting tool, called CUBETESTERAI, includes a user-friendly web interface and the integration of a CI/CD pipeline using GitLab and Docker. These…
▽ More
This paper presents an approach to automating JUnit test generation for Java applications using the Spring Boot framework, leveraging the LLaMA (Large Language Model Architecture) model to enhance the efficiency and accuracy of the testing process. The resulting tool, called CUBETESTERAI, includes a user-friendly web interface and the integration of a CI/CD pipeline using GitLab and Docker. These components streamline the automated test generation process, allowing developers to generate JUnit tests directly from their code snippets with minimal manual intervention. The final implementation executes the LLaMA models through RunPod, an online GPU service, which also enhances the privacy of our tool. Using the advanced natural language processing capabilities of the LLaMA model, CUBETESTERAI is able to generate test cases that provide high code coverage and accurate validation of software functionalities in Java-based Spring Boot applications. Furthermore, it efficiently manages resource-intensive operations and refines the generated tests to address common issues like missing imports and handling of private methods. By comparing CUBETESTERAI with some state-of-the-art tools, we show that our proposal consistently demonstrates competitive and, in many cases, better performance in terms of code coverage in different real-life Java programs.
△ Less
Submitted 13 March, 2025;
originally announced April 2025.
-
Causal-Copilot: An Autonomous Causal Analysis Agent
Authors:
Xinyue Wang,
Kun Zhou,
Wenyi Wu,
Har Simrat Singh,
Fang Nan,
Songyao Jin,
Aryan Philip,
Saloni Patnaik,
Hou Zhu,
Shivam Singh,
Parjanya Prashant,
Qian Shen,
Biwei Huang
Abstract:
Causal analysis plays a foundational role in scientific discovery and reliable decision-making, yet it remains largely inaccessible to domain experts due to its conceptual and algorithmic complexity. This disconnect between causal methodology and practical usability presents a dual challenge: domain experts are unable to leverage recent advances in causal learning, while causal researchers lack br…
▽ More
Causal analysis plays a foundational role in scientific discovery and reliable decision-making, yet it remains largely inaccessible to domain experts due to its conceptual and algorithmic complexity. This disconnect between causal methodology and practical usability presents a dual challenge: domain experts are unable to leverage recent advances in causal learning, while causal researchers lack broad, real-world deployment to test and refine their methods. To address this, we introduce Causal-Copilot, an autonomous agent that operationalizes expert-level causal analysis within a large language model framework. Causal-Copilot automates the full pipeline of causal analysis for both tabular and time-series data -- including causal discovery, causal inference, algorithm selection, hyperparameter optimization, result interpretation, and generation of actionable insights. It supports interactive refinement through natural language, lowering the barrier for non-specialists while preserving methodological rigor. By integrating over 20 state-of-the-art causal analysis techniques, our system fosters a virtuous cycle -- expanding access to advanced causal methods for domain experts while generating rich, real-world applications that inform and advance causal theory. Empirical evaluations demonstrate that Causal-Copilot achieves superior performance compared to existing baselines, offering a reliable, scalable, and extensible solution that bridges the gap between theoretical sophistication and real-world applicability in causal analysis. A live interactive demo of Causal-Copilot is available at https://causalcopilot.com/.
△ Less
Submitted 21 April, 2025; v1 submitted 17 April, 2025;
originally announced April 2025.
-
Cluster weighted models with multivariate skewed distributions for functional data
Authors:
Cristina Anton,
Roy Shivam Ram Shreshtth
Abstract:
We propose a clustering method, funWeightClustSkew, based on mixtures of functional linear regression models and three skewed multivariate distributions: the variance-gamma distribution, the skew-t distribution, and the normal-inverse Gaussian distribution. Our approach follows the framework of the functional high dimensional data clustering (funHDDC) method, and we extend to functional data the c…
▽ More
We propose a clustering method, funWeightClustSkew, based on mixtures of functional linear regression models and three skewed multivariate distributions: the variance-gamma distribution, the skew-t distribution, and the normal-inverse Gaussian distribution. Our approach follows the framework of the functional high dimensional data clustering (funHDDC) method, and we extend to functional data the cluster weighted models based on skewed distributions used for finite dimensional multivariate data. We consider several parsimonious models, and to estimate the parameters we construct an expectation maximization (EM) algorithm. We illustrate the performance of funWeightClustSkew for simulated data and for the Air Quality dataset.
△ Less
Submitted 17 April, 2025;
originally announced April 2025.
-
Llama-3-Nanda-10B-Chat: An Open Generative Large Language Model for Hindi
Authors:
Monojit Choudhury,
Shivam Chauhan,
Rocktim Jyoti Das,
Dhruv Sahnan,
Xudong Han,
Haonan Li,
Aaryamonvikram Singh,
Alok Anil Jadhav,
Utkarsh Agarwal,
Mukund Choudhary,
Debopriyo Banerjee,
Fajri Koto,
Junaid Bhat,
Awantika Shukla,
Samujjwal Ghosh,
Samta Kamboj,
Onkar Pandit,
Lalit Pradhan,
Rahul Pal,
Sunil Sahu,
Soundar Doraiswamy,
Parvez Mullah,
Ali El Filali,
Neha Sengupta,
Gokul Ramakrishnan
, et al. (5 additional authors not shown)
Abstract:
Developing high-quality large language models (LLMs) for moderately resourced languages presents unique challenges in data availability, model adaptation, and evaluation. We introduce Llama-3-Nanda-10B-Chat, or Nanda for short, a state-of-the-art Hindi-centric instruction-tuned generative LLM, designed to push the boundaries of open-source Hindi language models. Built upon Llama-3-8B, Nanda incorp…
▽ More
Developing high-quality large language models (LLMs) for moderately resourced languages presents unique challenges in data availability, model adaptation, and evaluation. We introduce Llama-3-Nanda-10B-Chat, or Nanda for short, a state-of-the-art Hindi-centric instruction-tuned generative LLM, designed to push the boundaries of open-source Hindi language models. Built upon Llama-3-8B, Nanda incorporates continuous pre-training with expanded transformer blocks, leveraging the Llama Pro methodology. A key challenge was the limited availability of high-quality Hindi text data; we addressed this through rigorous data curation, augmentation, and strategic bilingual training, balancing Hindi and English corpora to optimize cross-linguistic knowledge transfer. With 10 billion parameters, Nanda stands among the top-performing open-source Hindi and multilingual models of similar scale, demonstrating significant advantages over many existing models. We provide an in-depth discussion of training strategies, fine-tuning techniques, safety alignment, and evaluation metrics, demonstrating how these approaches enabled Nanda to achieve state-of-the-art results. By open-sourcing Nanda, we aim to advance research in Hindi LLMs and support a wide range of real-world applications across academia, industry, and public services.
△ Less
Submitted 8 April, 2025;
originally announced April 2025.
-
TathyaNyaya and FactLegalLlama: Advancing Factual Judgment Prediction and Explanation in the Indian Legal Context
Authors:
Shubham Kumar Nigam,
Balaramamahanthi Deepak Patnaik,
Shivam Mishra,
Noel Shallum,
Kripabandhu Ghosh,
Arnab Bhattacharya
Abstract:
In the landscape of Fact-based Judgment Prediction and Explanation (FJPE), reliance on factual data is essential for developing robust and realistic AI-driven decision-making tools. This paper introduces TathyaNyaya, the largest annotated dataset for FJPE tailored to the Indian legal context, encompassing judgments from the Supreme Court of India and various High Courts. Derived from the Hindi ter…
▽ More
In the landscape of Fact-based Judgment Prediction and Explanation (FJPE), reliance on factual data is essential for developing robust and realistic AI-driven decision-making tools. This paper introduces TathyaNyaya, the largest annotated dataset for FJPE tailored to the Indian legal context, encompassing judgments from the Supreme Court of India and various High Courts. Derived from the Hindi terms "Tathya" (fact) and "Nyaya" (justice), the TathyaNyaya dataset is uniquely designed to focus on factual statements rather than complete legal texts, reflecting real-world judicial processes where factual data drives outcomes. Complementing this dataset, we present FactLegalLlama, an instruction-tuned variant of the LLaMa-3-8B Large Language Model (LLM), optimized for generating high-quality explanations in FJPE tasks. Finetuned on the factual data in TathyaNyaya, FactLegalLlama integrates predictive accuracy with coherent, contextually relevant explanations, addressing the critical need for transparency and interpretability in AI-assisted legal systems. Our methodology combines transformers for binary judgment prediction with FactLegalLlama for explanation generation, creating a robust framework for advancing FJPE in the Indian legal domain. TathyaNyaya not only surpasses existing datasets in scale and diversity but also establishes a benchmark for building explainable AI systems in legal analysis. The findings underscore the importance of factual precision and domain-specific tuning in enhancing predictive performance and interpretability, positioning TathyaNyaya and FactLegalLlama as foundational resources for AI-assisted legal decision-making.
△ Less
Submitted 7 April, 2025;
originally announced April 2025.
-
Multi-lingual Multi-turn Automated Red Teaming for LLMs
Authors:
Abhishek Singhania,
Christophe Dupuy,
Shivam Mangale,
Amani Namboori
Abstract:
Language Model Models (LLMs) have improved dramatically in the past few years, increasing their adoption and the scope of their capabilities over time. A significant amount of work is dedicated to ``model alignment'', i.e., preventing LLMs to generate unsafe responses when deployed into customer-facing applications. One popular method to evaluate safety risks is \textit{red-teaming}, where agents…
▽ More
Language Model Models (LLMs) have improved dramatically in the past few years, increasing their adoption and the scope of their capabilities over time. A significant amount of work is dedicated to ``model alignment'', i.e., preventing LLMs to generate unsafe responses when deployed into customer-facing applications. One popular method to evaluate safety risks is \textit{red-teaming}, where agents attempt to bypass alignment by crafting elaborate prompts that trigger unsafe responses from a model. Standard human-driven red-teaming is costly, time-consuming and rarely covers all the recent features (e.g., multi-lingual, multi-modal aspects), while proposed automation methods only cover a small subset of LLMs capabilities (i.e., English or single-turn). We present Multi-lingual Multi-turn Automated Red Teaming (\textbf{MM-ART}), a method to fully automate conversational, multi-lingual red-teaming operations and quickly identify prompts leading to unsafe responses. Through extensive experiments on different languages, we show the studied LLMs are on average 71\% more vulnerable after a 5-turn conversation in English than after the initial turn. For conversations in non-English languages, models display up to 195\% more safety vulnerabilities than the standard single-turn English approach, confirming the need for automated red-teaming methods matching LLMs capabilities.
△ Less
Submitted 4 April, 2025;
originally announced April 2025.
-
Scaling Test-Time Inference with Policy-Optimized, Dynamic Retrieval-Augmented Generation via KV Caching and Decoding
Authors:
Sakhinana Sagar Srinivas,
Akash Das,
Shivam Gupta,
Venkataramana Runkana
Abstract:
We present a comprehensive framework for enhancing Retrieval-Augmented Generation (RAG) systems through dynamic retrieval strategies and reinforcement fine-tuning. This approach significantly improves large language models on knowledge-intensive tasks, including opendomain question answering and complex reasoning. Our framework integrates two complementary techniques: Policy-Optimized RetrievalAug…
▽ More
We present a comprehensive framework for enhancing Retrieval-Augmented Generation (RAG) systems through dynamic retrieval strategies and reinforcement fine-tuning. This approach significantly improves large language models on knowledge-intensive tasks, including opendomain question answering and complex reasoning. Our framework integrates two complementary techniques: Policy-Optimized RetrievalAugmented Generation (PORAG), which optimizes the use of retrieved information, and Adaptive Token-Layer Attention Scoring (ATLAS), which dynamically determines retrieval timing and content based on contextual needs. Together, these techniques enhance both the utilization and relevance of retrieved content, improving factual accuracy and response quality. Designed as a lightweight solution compatible with any Transformer-based LLM without requiring additional training, our framework excels in knowledge-intensive tasks, boosting output accuracy in RAG settings. We further propose CRITIC, a novel method to selectively compress key-value caches by token importance, mitigating memory bottlenecks in long-context applications. The framework also incorporates test-time scaling techniques to dynamically balance reasoning depth and computational resources, alongside optimized decoding strategies for faster inference. Experiments on benchmark datasets show that our framework reduces hallucinations, strengthens domain-specific reasoning, and achieves significant efficiency and scalability gains over traditional RAG systems. This integrated approach advances the development of robust, efficient, and scalable RAG systems across diverse applications.
△ Less
Submitted 20 May, 2025; v1 submitted 1 April, 2025;
originally announced April 2025.
-
Agentic Multimodal AI for Hyperpersonalized B2B and B2C Advertising in Competitive Markets: An AI-Driven Competitive Advertising Framework
Authors:
Sakhinana Sagar Srinivas,
Akash Das,
Shivam Gupta,
Venkataramana Runkana
Abstract:
The growing use of foundation models (FMs) in real-world applications demands adaptive, reliable, and efficient strategies for dynamic markets. In the chemical industry, AI-discovered materials drive innovation, but commercial success hinges on market adoption, requiring FM-driven advertising frameworks that operate in-the-wild. We present a multilingual, multimodal AI framework for autonomous, hy…
▽ More
The growing use of foundation models (FMs) in real-world applications demands adaptive, reliable, and efficient strategies for dynamic markets. In the chemical industry, AI-discovered materials drive innovation, but commercial success hinges on market adoption, requiring FM-driven advertising frameworks that operate in-the-wild. We present a multilingual, multimodal AI framework for autonomous, hyper-personalized advertising in B2B and B2C markets. By integrating retrieval-augmented generation (RAG), multimodal reasoning, and adaptive persona-based targeting, our system generates culturally relevant, market-aware ads tailored to shifting consumer behaviors and competition. Validation combines real-world product experiments with a Simulated Humanistic Colony of Agents to model consumer personas, optimize strategies at scale, and ensure privacy compliance. Synthetic experiments mirror real-world scenarios, enabling cost-effective testing of ad strategies without risky A/B tests. Combining structured retrieval-augmented reasoning with in-context learning (ICL), the framework boosts engagement, prevents market cannibalization, and maximizes ROAS. This work bridges AI-driven innovation and market adoption, advancing multimodal FM deployment for high-stakes decision-making in commercial marketing.
△ Less
Submitted 31 March, 2025;
originally announced April 2025.
-
Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead
Authors:
Vidhisha Balachandran,
Jingya Chen,
Lingjiao Chen,
Shivam Garg,
Neel Joshi,
Yash Lara,
John Langford,
Besmira Nushi,
Vibhav Vineet,
Yue Wu,
Safoora Yousefi
Abstract:
Inference-time scaling can enhance the reasoning capabilities of large language models (LLMs) on complex problems that benefit from step-by-step problem solving. Although lengthening generated scratchpads has proven effective for mathematical tasks, the broader impact of this approach on other tasks remains less clear. In this work, we investigate the benefits and limitations of scaling methods ac…
▽ More
Inference-time scaling can enhance the reasoning capabilities of large language models (LLMs) on complex problems that benefit from step-by-step problem solving. Although lengthening generated scratchpads has proven effective for mathematical tasks, the broader impact of this approach on other tasks remains less clear. In this work, we investigate the benefits and limitations of scaling methods across nine state-of-the-art models and eight challenging tasks, including math and STEM reasoning, calendar planning, NP-hard problems, navigation, and spatial reasoning. We compare conventional models (e.g., GPT-4o) with models fine-tuned for inference-time scaling (e.g., o1) through evaluation protocols that involve repeated model calls, either independently or sequentially with feedback. These evaluations approximate lower and upper performance bounds and potential for future performance improvements for each model, whether through enhanced training or multi-model inference systems. Our extensive empirical analysis reveals that the advantages of inference-time scaling vary across tasks and diminish as problem complexity increases. In addition, simply using more tokens does not necessarily translate to higher accuracy in these challenging regimes. Results from multiple independent runs with conventional models using perfect verifiers show that, for some tasks, these models can achieve performance close to the average performance of today's most advanced reasoning models. However, for other tasks, a significant performance gap remains, even in very high scaling regimes. Encouragingly, all models demonstrate significant gains when inference is further scaled with perfect verifiers or strong feedback, suggesting ample potential for future improvements.
△ Less
Submitted 31 March, 2025;
originally announced April 2025.
-
Make Some Noise: Towards LLM audio reasoning and generation using sound tokens
Authors:
Shivam Mehta,
Nebojsa Jojic,
Hannes Gamper
Abstract:
Integrating audio comprehension and generation into large language models (LLMs) remains challenging due to the continuous nature of audio and the resulting high sampling rates. Here, we introduce a novel approach that combines Variational Quantization with Conditional Flow Matching to convert audio into ultra-low bitrate discrete tokens of 0.23kpbs, allowing for seamless integration with text tok…
▽ More
Integrating audio comprehension and generation into large language models (LLMs) remains challenging due to the continuous nature of audio and the resulting high sampling rates. Here, we introduce a novel approach that combines Variational Quantization with Conditional Flow Matching to convert audio into ultra-low bitrate discrete tokens of 0.23kpbs, allowing for seamless integration with text tokens in LLMs. We fine-tuned a pretrained text-based LLM using Low-Rank Adaptation (LoRA) to assess its effectiveness in achieving true multimodal capabilities, i.e., audio comprehension and generation. Our tokenizer outperforms a traditional VQ-VAE across various datasets with diverse acoustic events. Despite the substantial loss of fine-grained details through audio tokenization, our multimodal LLM trained with discrete tokens achieves competitive results in audio comprehension with state-of-the-art methods, though audio generation is poor. Our results highlight the need for larger, more diverse datasets and improved evaluation metrics to advance multimodal LLM performance.
△ Less
Submitted 28 March, 2025;
originally announced March 2025.
-
Maya: Optimizing Deep Learning Training Workloads using Emulated Virtual Accelerators
Authors:
Srihas Yarlagadda,
Amey Agrawal,
Elton Pinto,
Hakesh Darapaneni,
Mitali Meratwal,
Shivam Mittal,
Pranavi Bajjuri,
Srinivas Sridharan,
Alexey Tumanov
Abstract:
Training large foundation models costs hundreds of millions of dollars, making deployment optimization critical. Current approaches require machine learning engineers to manually craft training recipes through error-prone trial-and-error on expensive compute clusters. To enable efficient exploration of training configurations, researchers have developed performance modeling systems. However, these…
▽ More
Training large foundation models costs hundreds of millions of dollars, making deployment optimization critical. Current approaches require machine learning engineers to manually craft training recipes through error-prone trial-and-error on expensive compute clusters. To enable efficient exploration of training configurations, researchers have developed performance modeling systems. However, these systems force users to translate their workloads into custom specification languages, introducing a fundamental semantic gap between the actual workload and its representation. This gap creates an inherent tradeoff: systems must either support a narrow set of workloads to maintain usability, require complex specifications that limit practical adoption, or compromise prediction accuracy with simplified models.
We present Maya, a performance modeling system that eliminates these tradeoffs through transparent device emulation. By operating at the narrow interface between training frameworks and accelerator devices, Maya can capture complete workload behavior without requiring code modifications or translations. Maya intercepts device API calls from unmodified training code to directly observe low-level operations, enabling accurate performance prediction while maintaining both ease of use and generality. Our evaluation shows Maya achieves less than 5% prediction error across diverse models and optimization strategies, identifying configurations that reduce training costs by up to 56% compared to existing approaches.
△ Less
Submitted 25 March, 2025;
originally announced March 2025.
-
Efficient Knowledge Distillation via Curriculum Extraction
Authors:
Shivam Gupta,
Sushrut Karmalkar
Abstract:
Knowledge distillation is a technique used to train a small student network using the output generated by a large teacher network, and has many empirical advantages~\citep{Hinton2015DistillingTK}. While the standard one-shot approach to distillation only uses the output of the final teacher network, recent work~\citep{panigrahi2024progressive} has shown that using intermediate checkpoints from the…
▽ More
Knowledge distillation is a technique used to train a small student network using the output generated by a large teacher network, and has many empirical advantages~\citep{Hinton2015DistillingTK}. While the standard one-shot approach to distillation only uses the output of the final teacher network, recent work~\citep{panigrahi2024progressive} has shown that using intermediate checkpoints from the teacher's training process as an implicit ``curriculum'' for progressive distillation can significantly speed up training. However, such schemes require storing these checkpoints, and often require careful selection of the intermediate checkpoints to train on, which can be impractical for large-scale training.
In this paper, we show that a curriculum can be \emph{extracted} from just the fully trained teacher network, and that this extracted curriculum can give similar efficiency benefits to those of progressive distillation. Our extraction scheme is natural; we use a random projection of the hidden representations of the teacher network to progressively train the student network, before training using the output of the full network. We show that our scheme significantly outperforms one-shot distillation and achieves a performance similar to that of progressive distillation for learning sparse parities with two-layer networks, and provide theoretical guarantees for this setting. Additionally, we show that our method outperforms one-shot distillation even when using transformer-based architectures, both for sparse-parity learning, and language modeling tasks.
△ Less
Submitted 21 March, 2025;
originally announced March 2025.
-
FLEX: A Framework for Learning Robot-Agnostic Force-based Skills Involving Sustained Contact Object Manipulation
Authors:
Shijie Fang,
Wenchang Gao,
Shivam Goel,
Christopher Thierauf,
Matthias Scheutz,
Jivko Sinapov
Abstract:
Learning to manipulate objects efficiently, particularly those involving sustained contact (e.g., pushing, sliding) and articulated parts (e.g., drawers, doors), presents significant challenges. Traditional methods, such as robot-centric reinforcement learning (RL), imitation learning, and hybrid techniques, require massive training and often struggle to generalize across different objects and rob…
▽ More
Learning to manipulate objects efficiently, particularly those involving sustained contact (e.g., pushing, sliding) and articulated parts (e.g., drawers, doors), presents significant challenges. Traditional methods, such as robot-centric reinforcement learning (RL), imitation learning, and hybrid techniques, require massive training and often struggle to generalize across different objects and robot platforms. We propose a novel framework for learning object-centric manipulation policies in force space, decoupling the robot from the object. By directly applying forces to selected regions of the object, our method simplifies the action space, reduces unnecessary exploration, and decreases simulation overhead. This approach, trained in simulation on a small set of representative objects, captures object dynamics -- such as joint configurations -- allowing policies to generalize effectively to new, unseen objects. Decoupling these policies from robot-specific dynamics enables direct transfer to different robotic platforms (e.g., Kinova, Panda, UR5) without retraining. Our evaluations demonstrate that the method significantly outperforms baselines, achieving over an order of magnitude improvement in training efficiency compared to other state-of-the-art methods. Additionally, operating in force space enhances policy transferability across diverse robot platforms and object types. We further showcase the applicability of our method in a real-world robotic setting. For supplementary materials and videos, please visit: https://tufts-ai-robotics-group.github.io/FLEX/
△ Less
Submitted 17 March, 2025;
originally announced March 2025.
-
Neural reservoir control of a soft bio-hybrid arm
Authors:
Noel Naughton,
Arman Tekinalp,
Keshav Shivam,
Seung Hung Kim,
Volodymyr Kindratenko,
Mattia Gazzola
Abstract:
A long-standing engineering problem, the control of soft robots is difficult because of their highly non-linear, heterogeneous, anisotropic, and distributed nature. Here, bridging engineering and biology, a neural reservoir is employed for the dynamic control of a bio-hybrid model arm made of multiple muscle-tendon groups enveloping an elastic spine. We show how the use of reservoirs facilitates s…
▽ More
A long-standing engineering problem, the control of soft robots is difficult because of their highly non-linear, heterogeneous, anisotropic, and distributed nature. Here, bridging engineering and biology, a neural reservoir is employed for the dynamic control of a bio-hybrid model arm made of multiple muscle-tendon groups enveloping an elastic spine. We show how the use of reservoirs facilitates simultaneous control and self-modeling across a set of challenging tasks, outperforming classic neural network approaches. Further, by implementing a spiking reservoir on neuromorphic hardware, energy efficiency is achieved, with nearly two-orders of magnitude improvement relative to standard CPUs, with implications for the on-board control of untethered, small-scale soft robots.
△ Less
Submitted 12 March, 2025;
originally announced March 2025.
-
Can KAN CANs? Input-convex Kolmogorov-Arnold Networks (KANs) as hyperelastic constitutive artificial neural networks (CANs)
Authors:
Prakash Thakolkaran,
Yaqi Guo,
Shivam Saini,
Mathias Peirlinck,
Benjamin Alheit,
Siddhant Kumar
Abstract:
Traditional constitutive models rely on hand-crafted parametric forms with limited expressivity and generalizability, while neural network-based models can capture complex material behavior but often lack interpretability. To balance these trade-offs, we present monotonic Input-Convex Kolmogorov-Arnold Networks (ICKANs) for learning polyconvex hyperelastic constitutive laws. ICKANs leverage the Ko…
▽ More
Traditional constitutive models rely on hand-crafted parametric forms with limited expressivity and generalizability, while neural network-based models can capture complex material behavior but often lack interpretability. To balance these trade-offs, we present monotonic Input-Convex Kolmogorov-Arnold Networks (ICKANs) for learning polyconvex hyperelastic constitutive laws. ICKANs leverage the Kolmogorov-Arnold representation, decomposing the model into compositions of trainable univariate spline-based activation functions for rich expressivity. We introduce trainable monotonic input-convex splines within the KAN architecture, ensuring physically admissible polyconvex models for isotropic compressible hyperelasticity. The resulting models are both compact and interpretable, enabling explicit extraction of analytical constitutive relationships through a monotonic input-convex symbolic regression technique. Through unsupervised training on full-field strain data and limited global force measurements, ICKANs accurately capture nonlinear stress-strain behavior across diverse strain states. Finite element simulations of unseen geometries with trained ICKAN hyperelastic constitutive models confirm the framework's robustness and generalization capability.
△ Less
Submitted 4 June, 2025; v1 submitted 7 March, 2025;
originally announced March 2025.
-
Multi Agent based Medical Assistant for Edge Devices
Authors:
Sakharam Gawade,
Shivam Akhouri,
Chinmay Kulkarni,
Jagdish Samant,
Pragya Sahu,
Aastik,
Jai Pahal,
Saswat Meher
Abstract:
Large Action Models (LAMs) have revolutionized intelligent automation, but their application in healthcare faces challenges due to privacy concerns, latency, and dependency on internet access. This report introduces an ondevice, multi-agent healthcare assistant that overcomes these limitations. The system utilizes smaller, task-specific agents to optimize resources, ensure scalability and high per…
▽ More
Large Action Models (LAMs) have revolutionized intelligent automation, but their application in healthcare faces challenges due to privacy concerns, latency, and dependency on internet access. This report introduces an ondevice, multi-agent healthcare assistant that overcomes these limitations. The system utilizes smaller, task-specific agents to optimize resources, ensure scalability and high performance. Our proposed system acts as a one-stop solution for health care needs with features like appointment booking, health monitoring, medication reminders, and daily health reporting. Powered by the Qwen Code Instruct 2.5 7B model, the Planner and Caller Agents achieve an average RougeL score of 85.5 for planning and 96.5 for calling for our tasks while being lightweight for on-device deployment. This innovative approach combines the benefits of ondevice systems with multi-agent architectures, paving the way for user-centric healthcare solutions.
△ Less
Submitted 7 March, 2025;
originally announced March 2025.
-
Beyond QA Pairs: Assessing Parameter-Efficient Fine-Tuning for Fact Embedding in LLMs
Authors:
Shivam Ratnakar,
Abhiroop Talasila,
Raghav Chamadiya,
Nikhil Agarwal,
Vinayak K Doifode
Abstract:
This paper presents an extensive examination of Parameter-Efficient Fine-Tuning (PEFT) for embedding domain specific facts into Large Language Models (LLMs), focusing on improving the fine-tuning process by categorizing question-answer (QA) pairs into Factual and Conceptual classes using a BERT-based classifier. Two distinct Llama-2 models are fine-tuned based on these classifications and evaluate…
▽ More
This paper presents an extensive examination of Parameter-Efficient Fine-Tuning (PEFT) for embedding domain specific facts into Large Language Models (LLMs), focusing on improving the fine-tuning process by categorizing question-answer (QA) pairs into Factual and Conceptual classes using a BERT-based classifier. Two distinct Llama-2 models are fine-tuned based on these classifications and evaluated using larger models like GPT-3.5 Turbo and Gemini. Our results indicate that models trained on conceptual datasets outperform those trained on factual datasets. Additionally, we compare the efficiency of two synthetic fine-tuning dataset generation techniques, D-RAG and D-Naive, with D-Naive demonstrating superior performance. Although PEFT has shown effectiveness, our research indicates that it may not be the most optimal method for embedding facts into LLMs. However, it has demonstrated exceptional performance in instruction-based tasks. Our findings are reinforced by a 1000-sample dataset in the data center domain, where the fine-tuned Llama-2 7B model significantly outperforms the baseline model in generating product recommendations. Our study highlights the importance of QA pair categorization and synthetic dataset generation techniques in enhancing the performance of LLMs in specific domains.
△ Less
Submitted 2 March, 2025;
originally announced March 2025.
-
HALO: Hardware-aware quantization with low critical-path-delay weights for LLM acceleration
Authors:
Rohan Juneja,
Shivam Aggarwal,
Safeen Huda,
Tulika Mitra,
Li-Shiuan Peh
Abstract:
Quantization is critical for efficiently deploying large language models (LLMs). Yet conventional methods remain hardware-agnostic, limited to bit-width constraints, and do not account for intrinsic circuit characteristics such as the timing behaviors and energy profiles of Multiply-Accumulate (MAC) units. This disconnect from circuit-level behavior limits the ability to exploit available timing m…
▽ More
Quantization is critical for efficiently deploying large language models (LLMs). Yet conventional methods remain hardware-agnostic, limited to bit-width constraints, and do not account for intrinsic circuit characteristics such as the timing behaviors and energy profiles of Multiply-Accumulate (MAC) units. This disconnect from circuit-level behavior limits the ability to exploit available timing margins and energy-saving opportunities, reducing the overall efficiency of deployment on modern accelerators.
To address these limitations, we propose HALO, a versatile framework for Hardware-Aware Post-Training Quantization (PTQ). Unlike traditional methods, HALO explicitly incorporates detailed hardware characteristics, including critical-path timing and power consumption, into its quantization approach. HALO strategically selects weights with low critical-path-delays enabling higher operational frequencies and dynamic frequency scaling without disrupting the architecture's dataflow. Remarkably, HALO achieves these improvements with only a few dynamic voltage and frequency scaling (DVFS) adjustments, ensuring simplicity and practicality in deployment. Additionally, by reducing switching activity within the MAC units, HALO effectively lowers energy consumption. Evaluations on accelerators such as Tensor Processing Units (TPUs) and Graphics Processing Units (GPUs) demonstrate that HALO significantly enhances inference efficiency, achieving average performance improvements of 270% and energy savings of 51% over baseline quantization methods, all with minimal impact on accuracy.
△ Less
Submitted 25 April, 2025; v1 submitted 26 February, 2025;
originally announced February 2025.
-
TerEffic: Highly Efficient Ternary LLM Inference on FPGA
Authors:
Chenyang Yin,
Zhenyu Bai,
Pranav Venkatram,
Shivam Aggarwal,
Zhaoying Li,
Tulika Mitra
Abstract:
Deploying Large Language Models (LLMs) efficiently on edge devices is often constrained by limited memory capacity and high power consumption. Low-bit quantization methods, particularly ternary quantization, have demonstrated significant potential in preserving model accuracy while substantially decreasing memory footprint and computational costs. However, existing general-purpose architectures an…
▽ More
Deploying Large Language Models (LLMs) efficiently on edge devices is often constrained by limited memory capacity and high power consumption. Low-bit quantization methods, particularly ternary quantization, have demonstrated significant potential in preserving model accuracy while substantially decreasing memory footprint and computational costs. However, existing general-purpose architectures and accelerators have not fully exploited the advantages of low-bit quantization due to insufficient specialized hardware support. We introduce TerEffic, an FPGA-based architecture tailored for ternary-quantized LLM inference. The proposed system offers flexibility through reconfigurable hardware to meet various system requirements. We evaluated two representative configurations: a fully on-chip design that stores all weights within on-chip memories, scaling out using multiple FPGAs, and an HBM-assisted design capable of accommodating larger models on a single FPGA board. Experimental results demonstrate significant performance and energy efficiency improvements. For single-batch inference on a 370 M-parameter model, our fully on-chip architecture achieves 16,300 tokens/second, delivering a throughput 192 times higher than NVIDIA Jetson Orin Nano with a power efficiency of 455 tokens/second/W, marking a 19-fold improvement. The HBM-assisted architecture processes 727 tokens/second for a larger 2.7B-parameter model, which is 3 times of the throughput of NVIDIA A100, while consuming only 46W, resulting in a power efficiency of 16 tokens/second/W, an 8-fold improvement over the A100.
△ Less
Submitted 1 May, 2025; v1 submitted 23 February, 2025;
originally announced February 2025.
-
FLIGHT: Facility Location Integrating Generalized, Holistic Theory of Welfare
Authors:
Avyukta Manjunatha Vummintala,
Shivam Gupta,
Shweta Jain,
Sujit Gujar
Abstract:
The Facility Location Problem (FLP) is a well-studied optimization problem with applications in many real-world scenarios. Past literature has explored the solutions from different perspectives to tackle FLPs. These include investigating FLPs under objective functions such as utilitarian, egalitarian, Nash welfare, etc. Also, there is no treatment for asymmetric welfare functions around the facili…
▽ More
The Facility Location Problem (FLP) is a well-studied optimization problem with applications in many real-world scenarios. Past literature has explored the solutions from different perspectives to tackle FLPs. These include investigating FLPs under objective functions such as utilitarian, egalitarian, Nash welfare, etc. Also, there is no treatment for asymmetric welfare functions around the facility. We propose a unified framework, FLIGHT, to accommodate a broad class of welfare notions. The framework undergoes rigorous theoretical analysis, and we prove some structural properties of the solution to FLP. Additionally, we provide approximation bounds, which provide insight into an interesting fact: as the number of agents arbitrarily increases, the choice of welfare notion is irrelevant. Furthermore, the paper also includes results around concentration bounds under certain distributional assumptions over the preferred locations of agents.
△ Less
Submitted 20 February, 2025;
originally announced February 2025.
-
Entity Framing and Role Portrayal in the News
Authors:
Tarek Mahmoud,
Zhuohan Xie,
Dimitar Dimitrov,
Nikolaos Nikolaidis,
Purificação Silvano,
Roman Yangarber,
Shivam Sharma,
Elisa Sartori,
Nicolas Stefanovitch,
Giovanni Da San Martino,
Jakub Piskorski,
Preslav Nakov
Abstract:
We introduce a novel multilingual hierarchical corpus annotated for entity framing and role portrayal in news articles. The dataset uses a unique taxonomy inspired by storytelling elements, comprising 22 fine-grained roles, or archetypes, nested within three main categories: protagonist, antagonist, and innocent. Each archetype is carefully defined, capturing nuanced portrayals of entities such as…
▽ More
We introduce a novel multilingual hierarchical corpus annotated for entity framing and role portrayal in news articles. The dataset uses a unique taxonomy inspired by storytelling elements, comprising 22 fine-grained roles, or archetypes, nested within three main categories: protagonist, antagonist, and innocent. Each archetype is carefully defined, capturing nuanced portrayals of entities such as guardian, martyr, and underdog for protagonists; tyrant, deceiver, and bigot for antagonists; and victim, scapegoat, and exploited for innocents. The dataset includes 1,378 recent news articles in five languages (Bulgarian, English, Hindi, European Portuguese, and Russian) focusing on two critical domains of global significance: the Ukraine-Russia War and Climate Change. Over 5,800 entity mentions have been annotated with role labels. This dataset serves as a valuable resource for research into role portrayal and has broader implications for news analysis. We describe the characteristics of the dataset and the annotation process, and we report evaluation results on fine-tuned state-of-the-art multilingual transformers and hierarchical zero-shot learning using LLMs at the level of a document, a paragraph, and a sentence.
△ Less
Submitted 15 June, 2025; v1 submitted 20 February, 2025;
originally announced February 2025.
-
Time Series Treatment Effects Analysis with Always-Missing Controls
Authors:
Juan Shu,
Qiyu Han,
George Chen,
Xihao Cao,
Kangming Luo,
Dan Pallotta,
Shivam Agrawal,
Yuping Lu,
Xiaoyu Zhang,
Jawad Mansoor,
Jyoti Anand
Abstract:
Estimating treatment effects in time series data presents a significant challenge, especially when the control group is always unobservable. For example, in analyzing the effects of Christmas on retail sales, we lack direct observation of what would have occurred in late December without the Christmas impact. To address this, we try to recover the control group in the event period while accounting…
▽ More
Estimating treatment effects in time series data presents a significant challenge, especially when the control group is always unobservable. For example, in analyzing the effects of Christmas on retail sales, we lack direct observation of what would have occurred in late December without the Christmas impact. To address this, we try to recover the control group in the event period while accounting for confounders and temporal dependencies. Experimental results on the M5 Walmart retail sales data demonstrate robust estimation of the potential outcome of the control group as well as accurate predicted holiday effect. Furthermore, we provided theoretical guarantees for the estimated treatment effect, proving its consistency and asymptotic normality. The proposed methodology is applicable not only to this always-missing control scenario but also in other conventional time series causal inference settings.
△ Less
Submitted 17 February, 2025;
originally announced February 2025.
-
Music for All: Representational Bias and Cross-Cultural Adaptability of Music Generation Models
Authors:
Atharva Mehta,
Shivam Chauhan,
Amirbek Djanibekov,
Atharva Kulkarni,
Gus Xia,
Monojit Choudhury
Abstract:
The advent of Music-Language Models has greatly enhanced the automatic music generation capability of AI systems, but they are also limited in their coverage of the musical genres and cultures of the world. We present a study of the datasets and research papers for music generation and quantify the bias and under-representation of genres. We find that only 5.7% of the total hours of existing music…
▽ More
The advent of Music-Language Models has greatly enhanced the automatic music generation capability of AI systems, but they are also limited in their coverage of the musical genres and cultures of the world. We present a study of the datasets and research papers for music generation and quantify the bias and under-representation of genres. We find that only 5.7% of the total hours of existing music datasets come from non-Western genres, which naturally leads to disparate performance of the models across genres. We then investigate the efficacy of Parameter-Efficient Fine-Tuning (PEFT) techniques in mitigating this bias. Our experiments with two popular models -- MusicGen and Mustango, for two underrepresented non-Western music traditions -- Hindustani Classical and Turkish Makam music, highlight the promises as well as the non-triviality of cross-genre adaptation of music through small datasets, implying the need for more equitable baseline music-language models that are designed for cross-cultural transfer learning.
△ Less
Submitted 6 May, 2025; v1 submitted 11 February, 2025;
originally announced February 2025.
-
AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement
Authors:
Shivam Singh,
Karthik Swaminathan,
Nabanita Dash,
Ramandeep Singh,
Snehasis Banerjee,
Mohan Sridharan,
Madhava Krishna
Abstract:
An embodied agent assisting humans is often asked to complete new tasks, and there may not be sufficient time or labeled examples to train the agent to perform these new tasks. Large Language Models (LLMs) trained on considerable knowledge across many domains can be used to predict a sequence of abstract actions for completing such tasks, although the agent may not be able to execute this sequence…
▽ More
An embodied agent assisting humans is often asked to complete new tasks, and there may not be sufficient time or labeled examples to train the agent to perform these new tasks. Large Language Models (LLMs) trained on considerable knowledge across many domains can be used to predict a sequence of abstract actions for completing such tasks, although the agent may not be able to execute this sequence due to task-, agent-, or domain-specific constraints. Our framework addresses these challenges by leveraging the generic predictions provided by LLM and the prior domain knowledge encoded in a Knowledge Graph (KG), enabling an agent to quickly adapt to new tasks. The robot also solicits and uses human input as needed to refine its existing knowledge. Based on experimental evaluation in the context of cooking and cleaning tasks in simulation domains, we demonstrate that the interplay between LLM, KG, and human input leads to substantial performance gains compared with just using the LLM. Project website§: https://sssshivvvv.github.io/adaptbot/
△ Less
Submitted 6 March, 2025; v1 submitted 4 February, 2025;
originally announced February 2025.
-
Anticipate & Act : Integrating LLMs and Classical Planning for Efficient Task Execution in Household Environments
Authors:
Raghav Arora,
Shivam Singh,
Karthik Swaminathan,
Ahana Datta,
Snehasis Banerjee,
Brojeshwar Bhowmick,
Krishna Murthy Jatavallabhula,
Mohan Sridharan,
Madhava Krishna
Abstract:
Assistive agents performing household tasks such as making the bed or cooking breakfast often compute and execute actions that accomplish one task at a time. However, efficiency can be improved by anticipating upcoming tasks and computing an action sequence that jointly achieves these tasks. State-of-the-art methods for task anticipation use data-driven deep networks and Large Language Models (LLM…
▽ More
Assistive agents performing household tasks such as making the bed or cooking breakfast often compute and execute actions that accomplish one task at a time. However, efficiency can be improved by anticipating upcoming tasks and computing an action sequence that jointly achieves these tasks. State-of-the-art methods for task anticipation use data-driven deep networks and Large Language Models (LLMs), but they do so at the level of high-level tasks and/or require many training examples. Our framework leverages the generic knowledge of LLMs through a small number of prompts to perform high-level task anticipation, using the anticipated tasks as goals in a classical planning system to compute a sequence of finer-granularity actions that jointly achieve these goals. We ground and evaluate our framework's abilities in realistic scenarios in the VirtualHome environment and demonstrate a 31% reduction in execution time compared with a system that does not consider upcoming tasks.
△ Less
Submitted 4 February, 2025;
originally announced February 2025.
-
Dancing With Chains: Ideating Under Constraints With UIDEC in UI/UX Design
Authors:
Atefeh Shokrizadeh,
Boniface Bahati Tadjuidje,
Shivam Kumar,
Sohan Kamble,
Jinghui Cheng
Abstract:
UI/UX designers often work under constraints like brand identity, design norms, and industry guidelines. How these constraints impact designers' ideation and exploration processes should be addressed in creativity-support tools for design. Through an exploratory interview study, we identified three designer personas with varying views on having constraints in the ideation process, which guided the…
▽ More
UI/UX designers often work under constraints like brand identity, design norms, and industry guidelines. How these constraints impact designers' ideation and exploration processes should be addressed in creativity-support tools for design. Through an exploratory interview study, we identified three designer personas with varying views on having constraints in the ideation process, which guided the creation of UIDEC, a GenAI-powered tool for supporting creativity under constraints. UIDEC allows designers to specify project details, such as purpose, target audience, industry, and design styles, based on which it generates diverse design examples that adhere to these constraints, with minimal need to write prompts. In a user evaluation involving designers representing the identified personas, participants found UIDEC compatible with their existing ideation process and useful for creative inspiration, especially when starting new projects. Our work provides design implications to AI-powered tools that integrate constraints during UI/UX design ideation to support creativity.
△ Less
Submitted 30 January, 2025;
originally announced January 2025.
-
DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference
Authors:
Yujie Zhang,
Shivam Aggarwal,
Tulika Mitra
Abstract:
Mixture-of-Experts (MoE) models, though highly effective for various machine learning tasks, face significant deployment challenges on memory-constrained devices. While GPUs offer fast inference, their limited memory compared to CPUs means not all experts can be stored on the GPU simultaneously, necessitating frequent, costly data transfers from CPU memory, often negating GPU speed advantages. To…
▽ More
Mixture-of-Experts (MoE) models, though highly effective for various machine learning tasks, face significant deployment challenges on memory-constrained devices. While GPUs offer fast inference, their limited memory compared to CPUs means not all experts can be stored on the GPU simultaneously, necessitating frequent, costly data transfers from CPU memory, often negating GPU speed advantages. To address this, we present DAOP, an on-device MoE inference engine to optimize parallel GPU-CPU execution. DAOP dynamically allocates experts between CPU and GPU based on per-sequence activation patterns, and selectively pre-calculates predicted experts on CPUs to minimize transfer latency. This approach enables efficient resource utilization across various expert cache ratios while maintaining model accuracy through a novel graceful degradation mechanism. Comprehensive evaluations across various datasets show that DAOP outperforms traditional expert caching and prefetching methods by up to 8.20x and offloading techniques by 1.35x while maintaining accuracy.
△ Less
Submitted 4 May, 2025; v1 submitted 16 December, 2024;
originally announced January 2025.
-
Model-Free and Real-Time Bioinspired Unicycle-Based Source Seeking: Differential Wheeled Robotic Experiments
Authors:
Ahmed A. Elgohary,
Sameh A. Eisa,
Shivam Bajpai
Abstract:
Bioinspred robots aimed at source-seeking are often studied, and their controls designed, using unicycle modeling and formulation. This is true not only for model-based controllers, but also for model-free, real-time control methods such as extremum seeking control (ESC). In this paper, we propose a unicycle-based ESC design applicable to differential wheeled robots that: (1) is very simple design…
▽ More
Bioinspred robots aimed at source-seeking are often studied, and their controls designed, using unicycle modeling and formulation. This is true not only for model-based controllers, but also for model-free, real-time control methods such as extremum seeking control (ESC). In this paper, we propose a unicycle-based ESC design applicable to differential wheeled robots that: (1) is very simple design, based on one simple control-affine law, and without state integrators; (2) attenuates oscillations known to persist in ESC designs (i.e., fully stop at the source); and (3) operates in a model-free, real-time setting, tolerating environmental/sensor noise. We provide simulation and real-world robotic experimental results for fixed and moving light source seeking by a differential wheeled robot using our proposed design. Results indicate clear advantages of our proposed design when compared to the literature, including attenuation of undesired oscillations, improved convergence speed, and better handling of noise.
△ Less
Submitted 11 January, 2025; v1 submitted 3 January, 2025;
originally announced January 2025.