-
GESA: Graph-Enhanced Semantic Allocation for Generalized, Fair, and Explainable Candidate-Role Matching
Authors:
Rishi Ashish Shah,
Shivaay Dhondiyal,
Kartik Sharma,
Sukriti Talwar,
Saksham Jain,
Sparsh Jain
Abstract:
Accurate, fair, and explainable allocation of candidates to roles represents a fundamental challenge across multiple domains including corporate hiring, academic admissions, fellowship awards, and volunteer placement systems. Current state-of-the-art approaches suffer from semantic inflexibility, persistent demographic bias, opacity in decision-making processes, and poor scalability under dynamic policy constraints. We present GESA (Graph-Enhanced Semantic Allocation), a comprehensive framework that addresses these limitations through the integration of domain-adaptive transformer embeddings, heterogeneous self-supervised graph neural networks, adversarial debiasing mechanisms, multi-objective genetic optimization, and explainable AI components. Our experimental evaluation on large-scale international benchmarks comprising 20,000 candidate profiles and 3,000 role specifications demonstrates superior performance with 94.5% top-3 allocation accuracy, 37% improvement in diversity representation, 0.98 fairness score across demographic categories, and sub-second end-to-end latency. Additionally, GESA incorporates hybrid recommendation capabilities and glass-box explainability, making it suitable for deployment across diverse international contexts in industry, academia, and non-profit sectors.
Submitted 29 September, 2025;
originally announced September 2025.
-
LLM Output Homogenization is Task Dependent
Authors:
Shomik Jain,
Jack Lanchantin,
Maximilian Nickel,
Karen Ullrich,
Ashia Wilson,
Jamelle Watson-Daniels
Abstract:
A large language model can be less helpful if it exhibits output response homogenization. But whether two responses are considered homogeneous, and whether such homogenization is problematic, both depend on the task category. For instance, in objective math tasks, we often expect no variation in the final answer but anticipate variation in the problem-solving strategy, whereas for creative writing tasks, we may expect variation in key narrative components (e.g., plot, genre, setting), beyond the vocabulary or embedding diversity produced by temperature sampling. Previous work addressing output homogenization often fails to conceptualize diversity in a task-dependent way. We address this gap in the literature directly by making the following contributions. (1) We present a task taxonomy comprising eight task categories that each have distinct conceptualizations of output homogenization. (2) We introduce task-anchored functional diversity to better evaluate output homogenization. (3) We propose a task-anchored sampling technique that increases functional diversity for task categories where homogenization is undesired, while preserving homogenization where it is desired. (4) We challenge the perceived existence of a diversity-quality trade-off by increasing functional diversity while maintaining response quality. Overall, we demonstrate how task dependence improves the evaluation and mitigation of output homogenization.
Submitted 25 September, 2025;
originally announced September 2025.
-
Intuition to Evidence: Measuring AI's True Impact on Developer Productivity
Authors:
Anand Kumar,
Vishal Khare,
Deepak Sharma,
Satyam Kumar,
Vijay Saini,
Anshul Yadav,
Sachendra Jain,
Ankit Rana,
Pratham Verma,
Vaibhav Meena,
Avinash Edubilli
Abstract:
We present a comprehensive real-world evaluation of AI-assisted software development tools deployed at enterprise scale. Over one year, 300 engineers across multiple teams integrated an in-house AI platform (DeputyDev) that combines code generation and automated review capabilities into their daily workflows. Through rigorous cohort analysis, our study demonstrates statistically significant productivity improvements, including an overall 31.8% reduction in PR review cycle time.
Developer adoption was strong, with 85% satisfaction for code review features and 93% expressing a desire to continue using the platform. Adoption patterns showed systematic scaling from 4% engagement in month 1 to 83% peak usage by month 6, stabilizing at 60% active engagement. Top adopters achieved a 61% increase in code volume pushed to production, contributing to approximately 30 to 40% of code shipped to production through this tool, accounting for an overall 28% increase in code shipment volume.
Unlike controlled benchmark evaluations, our longitudinal analysis provides empirical evidence from production environments, revealing both the transformative potential and practical deployment challenges of integrating AI into enterprise software development workflows.
Submitted 23 September, 2025;
originally announced September 2025.
-
Extended AI Interactions Shape Sycophancy and Perspective Mimesis
Authors:
Shomik Jain,
Charlotte Park,
Matheus Mesquita Viana,
Ashia Wilson,
Dana Calacci
Abstract:
We investigate whether long-context interactions between users and LLMs lead to AI mirroring behaviors. We focus on two forms of mirroring: (1) sycophancy -- the tendency of models to be overly agreeable with users, and (2) perspective mimesis -- the extent to which models reflect a user's perspective. Using two weeks of interaction context collected from 38 users, we compare model responses with and without long-context for two tasks: political explanations and personal advice. Our results demonstrate how and when real-world interaction contexts can amplify AI mirroring behaviors. We find that sycophancy increases in long-context, irrespective of the interaction topics. Perspective mimesis increases only in contexts where models can accurately infer user perspectives.
Submitted 15 September, 2025;
originally announced September 2025.
-
InJecteD: Analyzing Trajectories and Drift Dynamics in Denoising Diffusion Probabilistic Models for 2D Point Cloud Generation
Authors:
Sanyam Jain,
Khuram Naveed,
Illia Oleksiienko,
Alexandros Iosifidis,
Ruben Pauwels
Abstract:
This work introduces InJecteD, a framework for interpreting Denoising Diffusion Probabilistic Models (DDPMs) by analyzing sample trajectories during the denoising process of 2D point cloud generation. We apply this framework to three datasets from the Datasaurus Dozen (bullseye, dino, and circle) using a simplified DDPM architecture with customizable input and time embeddings. Our approach quantifies trajectory properties, including displacement, velocity, clustering, and drift field dynamics, using statistical metrics such as Wasserstein distance and cosine similarity. By enhancing model transparency, InJecteD supports human-AI collaboration by enabling practitioners to debug and refine generative models. Experiments reveal distinct denoising phases: initial noise exploration, rapid shape formation, and final refinement, with dataset-specific behaviors, for example, bullseye's concentric convergence vs. dino's complex contour formation. We evaluate four model configurations, varying embeddings and noise schedules, demonstrating that Fourier-based embeddings improve trajectory stability and reconstruction quality.
Submitted 9 September, 2025;
originally announced September 2025.
-
LearnLens: An AI-Enhanced Dashboard to Support Teachers in Open-Ended Classrooms
Authors:
Namrata Srivastava,
Shruti Jain,
Clayton Cohn,
Naveeduddin Mohammed,
Umesh Timalsina,
Gautam Biswas
Abstract:
Exploratory learning environments (ELEs), such as simulation-based platforms and open-ended science curricula, promote hands-on exploration and problem-solving but make it difficult for teachers to gain timely insights into students' conceptual understanding. This paper presents LearnLens, a generative AI (GenAI)-enhanced teacher-facing dashboard designed to support problem-based instruction in middle school science. LearnLens processes students' open-ended responses from digital assessments to provide various insights, including sample responses, word clouds, bar charts, and AI-generated summaries. These features elucidate students' thinking, enabling teachers to adjust their instruction based on emerging patterns of understanding. The dashboard was informed by teacher input during professional development sessions and implemented within a middle school Earth science curriculum. We report insights from teacher interviews that highlight the dashboard's usability and potential to guide teachers' instruction in the classroom.
Submitted 11 September, 2025;
originally announced September 2025.
-
CancerGUIDE: Cancer Guideline Understanding via Internal Disagreement Estimation
Authors:
Alyssa Unell,
Noel C. F. Codella,
Sam Preston,
Peniel Argaw,
Wen-wai Yim,
Zelalem Gero,
Cliff Wong,
Rajesh Jena,
Eric Horvitz,
Amanda K. Hall,
Ruican Rachel Zhong,
Jiachen Li,
Shrey Jain,
Mu Wei,
Matthew Lungren,
Hoifung Poon
Abstract:
The National Comprehensive Cancer Network (NCCN) provides evidence-based guidelines for cancer treatment. Translating complex patient presentations into guideline-compliant treatment recommendations is time-intensive, requires specialized expertise, and is prone to error. Advances in large language model (LLM) capabilities promise to reduce the time required to generate treatment recommendations and improve accuracy. We present an LLM agent-based approach to automatically generate guideline-concordant treatment trajectories for patients with non-small cell lung cancer (NSCLC). Our contributions are threefold. First, we construct a novel longitudinal dataset of 121 cases of NSCLC patients that includes clinical encounters, diagnostic results, and medical histories, each expertly annotated with the corresponding NCCN guideline trajectories by board-certified oncologists. Second, we demonstrate that existing LLMs possess domain-specific knowledge that enables high-quality proxy benchmark generation for both model development and evaluation, achieving strong correlation (Spearman coefficient r=0.88, RMSE = 0.08) with expert-annotated benchmarks. Third, we develop a hybrid approach combining expensive human annotations with model consistency information to create both the agent framework that predicts the relevant guidelines for a patient, as well as a meta-classifier that verifies prediction accuracy with calibrated confidence scores for treatment recommendations (AUROC=0.800), a critical capability for communicating the accuracy of outputs, custom-tailoring tradeoffs in performance, and supporting regulatory compliance. This work establishes a framework for clinically viable LLM-based guideline adherence systems that balance accuracy, interpretability, and regulatory requirements while reducing annotation costs, providing a scalable pathway toward automated clinical decision support.
Submitted 8 September, 2025;
originally announced September 2025.
-
Demo: Healthcare Agent Orchestrator (HAO) for Patient Summarization in Molecular Tumor Boards
Authors:
Matthias Blondeel,
Noel Codella,
Sam Preston,
Hao Qiu,
Leonardo Schettini,
Frank Tuan,
Wen-wai Yim,
Smitha Saligrama,
Mert Öz,
Shrey Jain,
Matthew P. Lungren,
Thomas Osborne
Abstract:
Molecular Tumor Boards (MTBs) are multidisciplinary forums where oncology specialists collaboratively assess complex patient cases to determine optimal treatment strategies. A central element of this process is the patient summary, typically compiled by a medical oncologist, radiation oncologist, or surgeon, or their trained medical assistant, who distills heterogeneous medical records into a concise narrative to facilitate discussion. This manual approach is often labor-intensive, subjective, and prone to omissions of critical information. To address these limitations, we introduce the Healthcare Agent Orchestrator (HAO), a Large Language Model (LLM)-driven AI agent that coordinates a multi-agent clinical workflow to generate accurate and comprehensive patient summaries for MTBs. Evaluating predicted patient summaries against ground truth presents additional challenges due to stylistic variation, ordering, synonym usage, and phrasing differences, which complicate the measurement of both succinctness and completeness. To overcome these evaluation hurdles, we propose TBFact, a "model-as-a-judge" framework designed to assess the comprehensiveness and succinctness of generated summaries. Using a benchmark dataset derived from de-identified tumor board discussions, we applied TBFact to evaluate our Patient History agent. Results show that the agent captured 94% of high-importance information (including partial entailments) and achieved a TBFact recall of 0.84 under strict entailment criteria. We further demonstrate that TBFact enables a data-free evaluation framework that institutions can deploy locally without sharing sensitive clinical data. Together, HAO and TBFact establish a robust foundation for delivering reliable and scalable support to MTBs.
Submitted 11 September, 2025; v1 submitted 8 September, 2025;
originally announced September 2025.
-
Impact of Labeling Inaccuracy and Image Noise on Tooth Segmentation in Panoramic Radiographs using Federated, Centralized and Local Learning
Authors:
Johan Andreas Balle Rubak,
Khuram Naveed,
Sanyam Jain,
Lukas Esterle,
Alexandros Iosifidis,
Ruben Pauwels
Abstract:
Objectives: Federated learning (FL) may mitigate privacy constraints, heterogeneous data quality, and inconsistent labeling in dental diagnostic AI. We compared FL with centralized (CL) and local learning (LL) for tooth segmentation in panoramic radiographs across multiple data corruption scenarios. Methods: An Attention U-Net was trained on 2066 radiographs from six institutions across four settings: baseline (unaltered data); label manipulation (dilated/missing annotations); image-quality manipulation (additive Gaussian noise); and exclusion of a faulty client with corrupted data. FL was implemented via the Flower AI framework. Per-client training- and validation-loss trajectories were monitored for anomaly detection, and a set of metrics (Dice, IoU, HD, HD95 and ASSD) was evaluated on a hold-out test set. Statistical significance of differences in these metrics was assessed with the Wilcoxon signed-rank test. CL and LL served as comparators. Results: Baseline: FL achieved a median Dice of 0.94889 (ASSD: 1.33229), slightly better than CL at 0.94706 (ASSD: 1.37074) and LL at 0.93557-0.94026 (ASSD: 1.51910-1.69777). Label manipulation: FL maintained the best median Dice score at 0.94884 (ASSD: 1.46487) versus CL's 0.94183 (ASSD: 1.75738) and LL's 0.93003-0.94026 (ASSD: 1.51910-2.11462). Image noise: FL led with a Dice of 0.94853 (ASSD: 1.31088); CL scored 0.94787 (ASSD: 1.36131); LL ranged from 0.93179 to 0.94026 (ASSD: 1.51910-1.77350). Faulty-client exclusion: FL reached a Dice of 0.94790 (ASSD: 1.33113), better than CL's 0.94550 (ASSD: 1.39318). Loss-curve monitoring reliably flagged the corrupted site. Conclusions: FL matches or exceeds CL and outperforms LL across corruption scenarios while preserving privacy. Per-client loss trajectories provide an effective anomaly-detection mechanism and support FL as a practical, privacy-preserving approach for scalable clinical AI deployment.
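For readers unfamiliar with the overlap metrics reported in this abstract, the following is a minimal sketch of the standard Dice and IoU definitions on flattened binary masks. The function names and the convention for two empty masks are assumptions made for illustration; this is not the authors' evaluation code.

```python
def dice(pred, target):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for flat 0/1 masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2 * intersection / total


def iou(pred, target):
    """Intersection over union |A∩B| / |A∪B| for flat 0/1 masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Same assumed convention for two empty masks.
    return 1.0 if union == 0 else intersection / union


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 0, 0]
    print(dice(predicted, reference), iou(predicted, reference))
```

Dice weights the intersection twice, so it is never lower than IoU on the same pair of masks.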
Submitted 8 September, 2025;
originally announced September 2025.
-
Designing Gaze Analytics for ELA Instruction: A User-Centered Dashboard with Conversational AI Support
Authors:
Eduardo Davalos,
Yike Zhang,
Shruti Jain,
Namrata Srivastava,
Trieu Truong,
Nafees-ul Haque,
Tristan Van,
Jorge Salas,
Sara McFadden,
Sun-Joo Cho,
Gautam Biswas,
Amanda Goodwin
Abstract:
Eye-tracking offers rich insights into student cognition and engagement, but remains underutilized in classroom-facing educational technology due to challenges in data interpretation and accessibility. In this paper, we present the iterative design and evaluation of a gaze-based learning analytics dashboard for English Language Arts (ELA), developed through five studies involving teachers and students. Guided by user-centered design and data storytelling principles, we explored how gaze data can support reflection, formative assessment, and instructional decision-making. Our findings demonstrate that gaze analytics can be approachable and pedagogically valuable when supported by familiar visualizations, layered explanations, and narrative scaffolds. We further show how a conversational agent, powered by a large language model (LLM), can lower cognitive barriers to interpreting gaze data by enabling natural language interactions with multimodal learning analytics. We conclude with design implications for future EdTech systems that aim to integrate novel data modalities in classroom contexts.
Submitted 3 September, 2025;
originally announced September 2025.
-
ART: Adaptive Resampling-based Training for Imbalanced Classification
Authors:
Arjun Basandrai,
Shourya Jain,
K. Ilanthenral
Abstract:
Traditional resampling methods for handling class imbalance typically use fixed distributions, undersampling the majority or oversampling the minority. These static strategies ignore changes in class-wise learning difficulty, which can limit the overall performance of the model.
This paper proposes an Adaptive Resampling-based Training (ART) method that periodically updates the distribution of the training data based on the class-wise performance of the model. Specifically, ART uses class-wise macro F1 scores, computed at fixed intervals, to determine the degree of resampling to be performed.
Unlike instance-level difficulty modeling, which is noisy and outlier-sensitive, ART adapts at the class level. This allows the model to incrementally shift its attention towards underperforming classes in a way that better aligns with the optimization objective.
Results on diverse benchmarks, including the Pima Indians Diabetes and Yeast datasets, demonstrate that ART consistently outperforms both resampling-based and algorithm-level methods, including Synthetic Minority Oversampling Technique (SMOTE), NearMiss Undersampling, and Cost-sensitive Learning, on binary as well as multi-class classification tasks with varying degrees of imbalance.
In most settings, these improvements are statistically significant. On tabular datasets, gains are significant under paired t-tests and Wilcoxon tests (p < 0.05), while results on text and image tasks remain favorable. Compared to training on the original imbalanced data, ART improves macro F1 by an average of 2.64 percentage points across all tested tabular datasets. Unlike existing methods, whose performance varies by task, ART consistently delivers the strongest macro F1, making it a reliable choice for imbalanced classification.
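The abstract describes updating the training distribution from class-wise F1 scores at fixed intervals but does not state the update rule. The sketch below shows one way such a step could look; the weighting rule (sampling probability proportional to the F1 shortfall, with a floor) and all function names are hypothetical, not the authors' method.

```python
def class_f1(y_true, y_pred, classes):
    """Per-class F1 from two equal-length label lists."""
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def resampling_distribution(f1_scores, floor=0.05):
    """Hypothetical rule: weight each class by its shortfall (1 - F1).

    The floor keeps well-learned classes from vanishing from the sample.
    """
    raw = {c: max(1.0 - f, floor) for c, f in f1_scores.items()}
    total = sum(raw.values())
    return {c: w / total for c, w in raw.items()}


if __name__ == "__main__":
    # Validation labels at one checkpoint (illustrative values only).
    y_true = [0, 0, 0, 0, 1, 1]
    y_pred = [0, 0, 0, 0, 1, 0]
    f1 = class_f1(y_true, y_pred, classes=[0, 1])
    print(resampling_distribution(f1))
```

In a training loop this step would be repeated every fixed number of epochs, with the returned probabilities used to draw the next resampled training set.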
Submitted 31 August, 2025;
originally announced September 2025.
-
Sycophancy as compositions of Atomic Psychometric Traits
Authors:
Shreyans Jain,
Alexandra Yost,
Amirali Abdullah
Abstract:
Sycophancy is a key behavioral risk in LLMs, yet is often treated as an isolated failure mode that occurs via a single causal mechanism. We instead propose modeling it as geometric and causal compositions of psychometric traits such as emotionality, openness, and agreeableness, similar to factor decomposition in psychometrics. Using Contrastive Activation Addition (CAA), we map activation directions to these factors and study how different combinations may give rise to sycophancy (e.g., high extraversion combined with low conscientiousness). This perspective allows for interpretable and compositional vector-based interventions such as addition, subtraction, and projection, which may be used to mitigate safety-critical behaviors in LLMs.
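The vector-based interventions named in the abstract (addition, subtraction, projection) can be illustrated on plain lists of floats. The sketch below assumes a trait direction is the mean difference between activations on contrastive prompt sets, which is how CAA is usually described; the function names and the toy two-dimensional vectors are hypothetical and stand in for real model activations.

```python
def mean_difference(pos_acts, neg_acts):
    """Trait direction: mean(positive activations) - mean(negative activations)."""
    dim = len(pos_acts[0])
    pos_mean = [sum(a[i] for a in pos_acts) / len(pos_acts) for i in range(dim)]
    neg_mean = [sum(a[i] for a in neg_acts) / len(neg_acts) for i in range(dim)]
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def steer(hidden, direction, alpha):
    """Add (alpha > 0) or subtract (alpha < 0) a scaled trait direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def project_out(hidden, direction):
    """Remove the component of `hidden` that lies along `direction`."""
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        return list(hidden)
    coeff = sum(h * d for h, d in zip(hidden, direction)) / norm_sq
    return [h - coeff * d for h, d in zip(hidden, direction)]


if __name__ == "__main__":
    direction = mean_difference([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
    print(direction)
    print(steer([1.0, 1.0], direction, alpha=-1.0))
    print(project_out([3.0, 4.0], [1.0, 0.0]))
```

Composing traits then amounts to summing several such directions with different coefficients before applying `steer`.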
Submitted 26 August, 2025;
originally announced August 2025.
-
Cooperative SGD with Dynamic Mixing Matrices
Authors:
Soumya Sarkar,
Shweta Jain
Abstract:
One of the most common methods to train machine learning algorithms today is stochastic gradient descent (SGD). In a distributed setting, SGD-based algorithms have been shown to converge theoretically under specific circumstances. A substantial number of works in the distributed SGD setting assume a fixed topology for the edge devices. These papers also assume that the contribution of nodes to the global model is uniform. However, experiments have shown that such assumptions are suboptimal and that a non-uniform aggregation strategy coupled with a dynamically shifting topology and client selection can significantly improve the performance of such models. This paper details a unified framework that covers several Local-Update SGD-based distributed algorithms with dynamic topologies and provides improved or matching theoretical guarantees on convergence compared to existing work.
Submitted 21 August, 2025; v1 submitted 20 August, 2025;
originally announced August 2025.
-
NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model
Authors:
NVIDIA,
Aarti Basant,
Abhijit Khairnar,
Abhijit Paithankar,
Abhinav Khattar,
Adithya Renduchintala,
Aditya Malte,
Akhiad Bercovich,
Akshay Hazare,
Alejandra Rico,
Aleksander Ficek,
Alex Kondratenko,
Alex Shaposhnikov,
Alexander Bukharin,
Ali Taghibakhshi,
Amelia Barton,
Ameya Sunil Mahabaleshwarkar,
Amy Shen,
Andrew Tao,
Ann Guan,
Anna Shors,
Anubhav Mandarwal,
Arham Mehta,
Arun Venkatesan
, et al. (192 additional authors not shown)
Abstract:
We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achieve improved inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model with the goal of enabling inference on up to 128k tokens on a single NVIDIA A10G GPU (22GiB of memory, bfloat16 precision). Compared to existing similarly-sized models (e.g., Qwen3-8B), we show that Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks while achieving up to 6x higher inference throughput in reasoning settings like 8k input and 16k output tokens. We are releasing Nemotron-Nano-9B-v2, Nemotron-Nano-12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints along with the majority of our pre- and post-training datasets on Hugging Face.
Submitted 2 September, 2025; v1 submitted 20 August, 2025;
originally announced August 2025.
-
The Multi-Stage Assignment Problem: A Fairness Perspective
Authors:
Vibulan J,
Swapnil Dhamal,
Shweta Jain
Abstract:
This paper explores the problem of fair assignment on Multi-Stage graphs. A multi-stage graph consists of nodes partitioned into $K$ disjoint sets (stages) structured as a sequence of weighted bipartite graphs formed across adjacent stages. The goal is to assign node-disjoint paths to $n$ agents starting from the first stage and ending in the last stage. We show that an efficient assignment that minimizes the overall sum of costs of all the agents' paths may be highly unfair and lead to significant cost disparities (envy) among the agents. We further show that finding an envy-minimizing assignment on a multi-stage graph is NP-hard. We propose the C-Balance algorithm, which guarantees envy that is bounded by $2M$ in the case of two agents, where $M$ is the maximum edge weight. We demonstrate the algorithm's tightness by presenting an instance where the envy is $2M$. We further show that the cost of fairness ($CoF$), defined as the ratio of the cost of the assignment given by the fair algorithm to that of the minimum cost assignment, is bounded by $2$ for C-Balance. We then extend this approach to $n$ agents by proposing the DC-Balance algorithm that makes iterative calls to C-Balance. We show the convergence of DC-Balance, resulting in envy that is arbitrarily close to $2M$. We derive $CoF$ bounds for DC-Balance and provide insights about its dependency on the instance-specific parameters and the desired degree of envy. We experimentally show that our algorithm runs several orders of magnitude faster than a suitably formulated ILP.
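The two quantities the abstract bounds, envy and the cost of fairness ($CoF$), can be computed directly from per-agent path costs. The sketch below follows the abstract's definition of $CoF$ as a ratio of total costs; taking envy for $n$ agents as the largest pairwise cost gap is an assumption here, and the per-agent costs in the example are made-up numbers rather than results from the paper.

```python
def envy(agent_costs):
    """Largest gap between any two agents' path costs (assumed definition)."""
    return max(agent_costs) - min(agent_costs)


def cost_of_fairness(fair_costs, min_cost_costs):
    """CoF: total cost of the fair assignment over the minimum total cost."""
    return sum(fair_costs) / sum(min_cost_costs)


if __name__ == "__main__":
    # Hypothetical per-agent path costs for two agents.
    min_cost_assignment = [2, 12]   # cheapest in total, but lopsided
    fair_assignment = [8, 9]        # slightly costlier, nearly balanced

    print(envy(min_cost_assignment))   # gap under the efficient assignment
    print(envy(fair_assignment))       # gap under the fair assignment
    print(cost_of_fairness(fair_assignment, min_cost_assignment))
```

In this toy instance the fair assignment cuts the cost gap sharply while the total cost rises only modestly, which is the trade-off the $CoF$ bound is meant to control.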
Submitted 19 August, 2025;
originally announced August 2025.
-
gpt-oss-120b & gpt-oss-20b Model Card
Authors:
OpenAI,
Sandhini Agarwal,
Lama Ahmad,
Jason Ai,
Sam Altman,
Andy Applebaum,
Edwin Arbus,
Rahul K. Arora,
Yu Bai,
Bowen Baker,
Haiming Bao,
Boaz Barak,
Ally Bennett,
Tyler Bertao,
Nivedita Brett,
Eugene Brevdo,
Greg Brockman,
Sebastien Bubeck,
Che Chang,
Kai Chen,
Mark Chen,
Enoch Cheung,
Aidan Clark,
Dan Cook
, et al. (102 additional authors not shown)
Abstract:
We present gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models that push the frontier of accuracy and inference cost. The models use an efficient mixture-of-experts transformer architecture and are trained using large-scale distillation and reinforcement learning. We optimize the models to have strong agentic capabilities (deep research browsing, Python tool use, and support for developer-provided functions), all while using a rendered chat format that enables clear instruction following and role delineation. Both models achieve strong results on benchmarks spanning mathematics, coding, and safety. We release the model weights, inference implementations, tool environments, and tokenizers under an Apache 2.0 license to enable broad use and further research.
Submitted 8 August, 2025;
originally announced August 2025.
-
GazeLT: Visual attention-guided long-tailed disease classification in chest radiographs
Authors:
Moinak Bhattacharya,
Gagandeep Singh,
Shubham Jain,
Prateek Prasanna
Abstract:
In this work, we present GazeLT, a human visual attention integration-disintegration approach for long-tailed disease classification. A radiologist's eye gaze has distinct patterns that capture both fine-grained and coarser level disease related information. While interpreting an image, a radiologist's attention varies throughout the duration; it is critical to incorporate this into a deep learning framework to improve automated image interpretation. Another important aspect of visual attention is that apart from looking at major/obvious disease patterns, experts also look at minor/incidental findings (few of these constituting long-tailed classes) during the course of image interpretation. GazeLT harnesses the temporal aspect of the visual search process, via an integration and disintegration mechanism, to improve long-tailed disease classification. We show the efficacy of GazeLT on two publicly available datasets for long-tailed disease classification, namely the NIH-CXR-LT (n=89237) and the MIMIC-CXR-LT (n=111898) datasets. GazeLT outperforms the best long-tailed loss by 4.1% and the visual attention-based baseline by 21.7% in average accuracy metrics for these datasets. Our code is available at https://github.com/lordmoinak1/gazelt.
Submitted 13 August, 2025;
originally announced August 2025.
-
From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training
Authors:
Yuan Yuan,
Tina Sriskandarajah,
Anna-Luisa Brakman,
Alec Helyar,
Alex Beutel,
Andrea Vallone,
Saachi Jain
Abstract:
Large Language Models used in ChatGPT have traditionally been trained to learn a refusal boundary: depending on the user's intent, the model is taught to either fully comply or outright refuse. While this is a strong mitigation for explicitly malicious prompts, focusing safety training on refusals can lead to brittleness for prompts with obscured user intent. Binary refusal boundaries are especially ill-suited for dual-use cases (such as biology or cybersecurity), where a user request can be answered safely at a high level, but in some cases can lead to malicious uplift if sufficiently detailed or actionable. As an alternative, we propose safe-completions: a safety-training approach that centers on the safety of the assistant's output, rather than a binary classification of the user's intent. Safe-completions seek to maximize helpfulness within the safety policy's constraints. We incorporated this approach into GPT-5 and find that across both production comparisons and internally controlled experiments, safe-completion training improves safety (especially on dual-use prompts), reduces the severity of residual safety failures, and substantially increases model helpfulness.
Submitted 11 August, 2025;
originally announced August 2025.
-
A Theory of Adaptive Scaffolding for LLM-Based Pedagogical Agents
Authors:
Clayton Cohn,
Surya Rayala,
Namrata Srivastava,
Joyce Horn Fonteles,
Shruti Jain,
Xinying Luo,
Divya Mereddy,
Naveeduddin Mohammed,
Gautam Biswas
Abstract:
Large language models (LLMs) present new opportunities for creating pedagogical agents that engage in meaningful dialogue to support student learning. However, the current use of LLM systems like ChatGPT in classrooms often lacks the solid theoretical foundation found in earlier intelligent tutoring systems. To bridge this gap, we propose a framework that combines Evidence-Centered Design with Social Cognitive Theory for adaptive scaffolding in LLM-based agents focused on STEM+C learning. We illustrate this framework with Inquizzitor, an LLM-based formative assessment agent that integrates human-AI hybrid intelligence and provides feedback grounded in cognitive science principles. Our findings show that Inquizzitor delivers high-quality assessment and interaction aligned with core learning theories, offering teachers effective guidance that students value. This research underscores the potential for theory-driven LLM integration in education, highlighting the ability of these systems to provide adaptive and principled instruction.
Submitted 2 August, 2025;
originally announced August 2025.
-
PanoGAN A Deep Generative Model for Panoramic Dental Radiographs
Authors:
Soren Pedersen,
Sanyam Jain,
Mikkel Chavez,
Viktor Ladehoff,
Bruna Neves de Freitas,
Ruben Pauwels
Abstract:
This paper presents the development of a generative adversarial network (GAN) for synthesizing dental panoramic radiographs. Although exploratory in nature, the study aims to address the scarcity of data in dental research and education. We trained a deep convolutional GAN (DCGAN) using a Wasserstein loss with gradient penalty (WGAN-GP) on a dataset of 2322 radiographs of varying quality. The focus was on the dentoalveolar regions; other anatomical structures were cropped out. Extensive preprocessing and data cleaning were performed to standardize the inputs while preserving anatomical variability. We explored four candidate models by varying critic iterations, feature depth, and the use of denoising prior to training. A clinical expert evaluated the generated radiographs based on anatomical visibility and realism, using a 5-point scale (1: very poor, 5: excellent). Most images showed moderate anatomical depiction, although some were degraded by artifacts. A trade-off was observed: the model trained on non-denoised data yielded finer details, especially in structures like the mandibular canal and trabecular bone, while a model trained on denoised data offered superior overall image clarity and sharpness. These findings provide a foundation for future work on GAN-based methods in dental imaging.
Submitted 28 July, 2025;
originally announced July 2025.
-
Advancing Vision-based Human Action Recognition: Exploring Vision-Language CLIP Model for Generalisation in Domain-Independent Tasks
Authors:
Utkarsh Shandilya,
Marsha Mariya Kappan,
Sanyam Jain,
Vijeta Sharma
Abstract:
Human action recognition plays a critical role in healthcare and medicine, supporting applications such as patient behavior monitoring, fall detection, surgical robot supervision, and procedural skill assessment. While traditional models like CNNs and RNNs have achieved moderate success, they often struggle to generalize across diverse and complex actions. Recent advancements in vision-language models, especially the transformer-based CLIP model, offer promising capabilities for generalizing action recognition from video data. In this work, we evaluate CLIP on the UCF-101 dataset and systematically analyze its performance under three masking strategies: (1) percentage-based and shape-based black masking at 10%, 30%, and 50%, (2) feature-specific masking to suppress bias-inducing elements, and (3) isolation masking that retains only class-specific regions. Our results reveal that CLIP exhibits inconsistent behavior and frequent misclassifications, particularly when essential visual cues are obscured. To overcome these limitations, we propose incorporating class-specific noise, learned via a custom loss function, to reinforce attention to class-defining features. This enhancement improves classification accuracy and model confidence while reducing bias. We conclude with a discussion on the challenges of applying such models in clinical domains and outline directions for future work to improve generalizability across domain-independent healthcare scenarios.
Submitted 30 July, 2025; v1 submitted 24 July, 2025;
originally announced July 2025.
-
Frequency-Histogram Coarse Graining in Elementary Cellular Automata and 2D CA
Authors:
Sanyam Jain,
Stefano Nichele
Abstract:
Cellular automata and other discrete dynamical systems have long been studied as models of emergent complexity. Recently, neural cellular automata have been proposed as models to investigate the emergence of a more general artificial intelligence, thanks to their propensity to support properties such as self-organization, emergence, and open-endedness. However, understanding emergent complexity in large-scale systems is an open challenge. How can the important computations leading to emergent complex structures and behaviors be identified? In this work, we systematically investigate a form of dimensionality reduction for 1-dimensional and 2-dimensional cellular automata based on coarse-graining of macrostates into smaller blocks. We discuss selected examples and provide the entire exploration of coarse graining with different filtering levels in the appendix (available also digitally at this link: https://s4nyam.github.io/eca88/). We argue that being able to capture emergent complexity in AI systems may pave the way to open-ended evolution, a plausible path to reach artificial general intelligence.
Submitted 24 July, 2025;
originally announced July 2025.
-
Clo-HDnn: A 4.66 TFLOPS/W and 3.78 TOPS/W Continual On-Device Learning Accelerator with Energy-efficient Hyperdimensional Computing via Progressive Search
Authors:
Chang Eun Song,
Weihong Xu,
Keming Fan,
Soumil Jain,
Gopabandhu Hota,
Haichao Yang,
Leo Liu,
Kerem Akarvardar,
Meng-Fan Chang,
Carlos H. Diaz,
Gert Cauwenberghs,
Tajana Rosing,
Mingu Kang
Abstract:
Clo-HDnn is an on-device learning (ODL) accelerator designed for emerging continual learning (CL) tasks. Clo-HDnn integrates hyperdimensional computing (HDC) along with low-cost Kronecker HD Encoder and weight clustering feature extraction (WCFE) to optimize accuracy and efficiency. Clo-HDnn adopts gradient-free CL to efficiently update and store the learned knowledge in the form of class hypervectors. Its dual-mode operation enables bypassing costly feature extraction for simpler datasets, while progressive search reduces complexity by up to 61% by encoding and comparing only partial query hypervectors. Achieving 4.66 TFLOPS/W (FE) and 3.78 TOPS/W (classifier), Clo-HDnn delivers 7.77x and 4.85x higher energy efficiency compared to SOTA ODL accelerators.
Submitted 23 July, 2025;
originally announced July 2025.
-
Text-to-SQL for Enterprise Data Analytics
Authors:
Albert Chen,
Manas Bundele,
Gaurav Ahlawat,
Patrick Stetz,
Zhitao Wang,
Qiang Fei,
Donghoon Jung,
Audrey Chu,
Bharadwaj Jayaraman,
Ayushi Panth,
Yatin Arora,
Sourav Jain,
Renjith Varma,
Alexey Ilin,
Iuliia Melnychuk,
Chelsea Chueh,
Joyan Sil,
Xiaofeng Wang
Abstract:
The introduction of large language models has brought rapid progress on Text-to-SQL benchmarks, but it is not yet easy to build a working enterprise solution. In this paper, we present insights from building an internal chatbot that enables LinkedIn's product managers, engineers, and operations teams to self-serve data insights from a large, dynamic data lake. Our approach features three components. First, we construct a knowledge graph that captures up-to-date semantics by indexing database metadata, historical query logs, wikis, and code. We apply clustering to identify relevant tables for each team or product area. Second, we build a Text-to-SQL agent that retrieves and ranks context from the knowledge graph, writes a query, and automatically corrects hallucinations and syntax errors. Third, we build an interactive chatbot that supports various user intents, from data discovery to query writing to debugging, and displays responses in rich UI elements to encourage follow-up chats. Our chatbot has over 300 weekly users. Expert review shows that 53% of its responses are correct or close to correct on an internal benchmark set. Through ablation studies, we identify the most important knowledge graph and modeling components, offering a practical path for developing enterprise Text-to-SQL solutions.
Submitted 18 July, 2025;
originally announced July 2025.
-
Federated Learning for Commercial Image Sources
Authors:
Shreyansh Jain,
Koteswar Rao Jerripothula
Abstract:
Federated Learning is a collaborative machine learning paradigm that enables multiple clients to learn a global model without exposing their data to each other. Consequently, it provides a secure learning platform with privacy-preserving capabilities. This paper introduces a new dataset containing 23,326 images collected from eight different commercial sources and classified into 31 categories, similar to the Office-31 dataset. To the best of our knowledge, this is the first image classification dataset specifically designed for Federated Learning. We also propose two new Federated Learning algorithms, namely Fed-Cyclic and Fed-Star. In Fed-Cyclic, a client receives weights from its previous client, updates them through local training, and passes them to the next client, thus forming a cyclic topology. In Fed-Star, a client receives weights from all other clients, updates its local weights through pre-aggregation (to address statistical heterogeneity) and local training, and sends its updated local weights to all other clients, thus forming a star-like topology. Our experiments reveal that both algorithms perform better than existing baselines on our newly introduced dataset.
Submitted 17 July, 2025;
originally announced July 2025.
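The cyclic topology described for Fed-Cyclic can be sketched in a few lines. This is a minimal illustration under strong assumptions, not the authors' implementation: the "model" is a single scalar weight, each client's local training is a few gradient steps on a squared-error loss toward its own data, and the names `local_train` and `fed_cyclic` are hypothetical. Fed-Star's pre-aggregation step is not shown.

```python
def local_train(w, data, lr=0.1, steps=5):
    """Hypothetical local training: gradient descent on mean((w - x)^2)."""
    for _ in range(steps):
        grad = sum(2 * (w - x) for x in data) / len(data)
        w -= lr * grad
    return w


def fed_cyclic(client_data, w0=0.0, rounds=3):
    """Pass the weight around a ring of clients.

    Each client receives the weight from the previous client, updates it
    with local training, and hands it to the next client. Raw data never
    leaves a client; only the weight is exchanged.
    """
    w = w0
    for _ in range(rounds):
        for data in client_data:
            w = local_train(w, data)
    return w


# Two toy clients whose local data disagree (values are made up).
clients = [[1.0, 1.0], [3.0, 3.0]]
final_w = fed_cyclic(clients)
print(final_w)  # ends between the two client means
```

Because the last client in the ring trains last, the final weight leans toward that client's data. A sketch like this makes it easy to see why the ordering of clients matters in a cyclic scheme.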
-
PanoDiff-SR: Synthesizing Dental Panoramic Radiographs using Diffusion and Super-resolution
Authors:
Sanyam Jain,
Bruna Neves de Freitas,
Andreas Basse-OConnor,
Alexandros Iosifidis,
Ruben Pauwels
Abstract:
There has been increasing interest in the generation of high-quality, realistic synthetic medical images in recent years. Such synthetic datasets can mitigate the scarcity of public datasets for artificial intelligence research, and can also be used for educational purposes. In this paper, we propose a combination of diffusion-based generation (PanoDiff) and Super-Resolution (SR) for generating synthetic dental panoramic radiographs (PRs). The former generates a low-resolution (LR) seed of a PR (256 X 128) which is then processed by the SR model to yield a high-resolution (HR) PR of size 1024 X 512. For SR, we propose a state-of-the-art transformer that learns local-global relationships, resulting in sharper edges and textures. Experimental results demonstrate a Frechet inception distance score of 40.69 between 7243 real and synthetic images (in HR). Inception scores were 2.55, 2.30, 2.90 and 2.98 for real HR, synthetic HR, real LR and synthetic LR images, respectively. Among a diverse group of six clinical experts, all evaluating a mixture of 100 synthetic and 100 real PRs in a time-limited observation, the average accuracy in distinguishing real from synthetic images was 68.5% (with 50% corresponding to random guessing).
Submitted 12 July, 2025;
originally announced July 2025.
-
OpenCodeReasoning-II: A Simple Test Time Scaling Approach via Self-Critique
Authors:
Wasi Uddin Ahmad,
Somshubra Majumdar,
Aleksander Ficek,
Sean Narenthiran,
Mehrzad Samadi,
Jocelyn Huang,
Siddhartha Jain,
Vahid Noroozi,
Boris Ginsburg
Abstract:
Recent advancements in reasoning-based Large Language Models (LLMs), particularly their potential through test-time scaling, have created significant opportunities for distillation in code generation and critique. However, progress in both areas fundamentally depends on large-scale, high-quality datasets. In this work, we introduce OpenCodeReasoning-II, a dataset consisting of 2.5M question-solution-critique triples (approx. 35K unique programming questions), making it nearly twice the size of the previous largest publicly available code reasoning dataset. We employ a two-stage supervised fine-tuning strategy. The first stage focuses on fine-tuning for code generation, while the second stage involves the joint training of models for both code generation and critique. Our resulting fine-tuned Qwen2.5-Instruct models achieve performance in code generation that either exceeds or equals the best prior open-weight distilled models. Notably, the integration of our code generation and critique models leads to significant improvements in competitive coding performance. Furthermore, we present an extension of the LiveCodeBench benchmark to specifically support the C++ programming language, thereby facilitating more comprehensive LLM evaluation using this benchmark.
Submitted 11 July, 2025;
originally announced July 2025.
-
Analysis of Propaganda in Tweets From Politically Biased Sources
Authors:
Vivek Sharma,
Mohammad Mahdi Shokri,
Sarah Ita Levitan,
Elena Filatova,
Shweta Jain
Abstract:
News outlets are well known to have political associations, and many national outlets cultivate political biases to cater to different audiences. Journalists working for these news outlets have a big impact on the stories they cover. In this work, we present a methodology to analyze the role of journalists, affiliated with popular news outlets, in propagating their bias using some form of propaganda-like language. We introduce JMBX (Journalist Media Bias on X), a systematically collected and annotated dataset of 1874 tweets from Twitter (now known as X). These tweets are authored by popular journalists from 10 news outlets whose political biases range from extreme left to extreme right. We extract several insights from the data and conclude that journalists who are affiliated with outlets with extreme biases are more likely to use propaganda-like language in their writings compared to those who are affiliated with outlets with mild political leans. We compare eight different Large Language Models (LLMs) by OpenAI and Google. We find that LLMs generally perform better at detecting propaganda in social media and news articles than a BERT-based model that is fine-tuned for propaganda detection. While the performance improvements of using LLMs are significant, they come at a notable monetary and environmental cost. This study provides an analysis of both the financial costs, based on token usage, and the environmental impact, utilizing tools that estimate carbon emissions associated with LLM operations.
Submitted 10 July, 2025;
originally announced July 2025.
-
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Authors:
Gheorghe Comanici,
Eric Bieber,
Mike Schaekermann,
Ice Pasupat,
Noveen Sachdeva,
Inderjit Dhillon,
Marcel Blistein,
Ori Ram,
Dan Zhang,
Evan Rosen,
Luke Marris,
Sam Petulla,
Colin Gaffney,
Asaf Aharoni,
Nathan Lintz,
Tiago Cardal Pais,
Henrik Jacobsson,
Idan Szpektor,
Nan-Jiang Jiang,
Krishna Haridasan,
Ahmed Omran,
Nikunj Saunshi,
Dara Bahri,
Gaurav Mishra,
Eric Chu
, et al. (3284 additional authors not shown)
Abstract:
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
Submitted 22 July, 2025; v1 submitted 7 July, 2025;
originally announced July 2025.
-
Can Argus Judge Them All? Comparing VLMs Across Domains
Authors:
Harsh Joshi,
Gautam Siddharth Kashyap,
Rafiq Ali,
Ebad Shabbir,
Niharika Jain,
Sarthak Jain,
Jiechao Gao,
Usman Naseem
Abstract:
Vision-Language Models (VLMs) are advancing multimodal AI, yet their performance consistency across tasks is underexamined. We benchmark CLIP, BLIP, and LXMERT across diverse datasets spanning retrieval, captioning, and reasoning. Our evaluation includes task accuracy, generation quality, efficiency, and a novel Cross-Dataset Consistency (CDC) metric. CLIP shows strongest generalization (CDC: 0.92), BLIP excels on curated data, and LXMERT leads in structured reasoning. These results expose trade-offs between generalization and specialization, informing industrial deployment of VLMs and guiding development toward robust, task-flexible architectures.
Submitted 23 June, 2025;
originally announced July 2025.
-
Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset
Authors:
Vasu Agrawal,
Akinniyi Akinyemi,
Kathryn Alvero,
Morteza Behrooz,
Julia Buffalini,
Fabio Maria Carlucci,
Joy Chen,
Junming Chen,
Zhang Chen,
Shiyang Cheng,
Praveen Chowdary,
Joe Chuang,
Antony D'Avirro,
Jon Daly,
Ning Dong,
Mark Duppenthaler,
Cynthia Gao,
Jeff Girard,
Martin Gleize,
Sahir Gomez,
Hongyu Gong,
Srivathsan Govindarajan,
Brandon Han,
Sen He,
Denise Hernandez
, et al. (59 additional authors not shown)
Abstract:
Human communication involves a complex interplay of verbal and nonverbal signals, essential for conveying meaning and achieving interpersonal goals. To develop socially intelligent AI technologies, it is crucial to develop models that can both comprehend and generate dyadic behavioral dynamics. To this end, we introduce the Seamless Interaction Dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage from over 4,000 participants in diverse contexts. This dataset enables the development of AI technologies that understand dyadic embodied dynamics, unlocking breakthroughs in virtual agents, telepresence experiences, and multimodal content analysis tools. We also develop a suite of models that utilize the dataset to generate dyadic motion gestures and facial expressions aligned with human speech. These models can take as input both the speech and visual behavior of their interlocutors. We present a variant with speech from an LLM model and integrations with 2D and 3D rendering methods, bringing us closer to interactive virtual agents. Additionally, we describe controllable variants of our motion models that can adapt emotional responses and expressivity levels, as well as generating more semantically-relevant gestures. Finally, we discuss methods for assessing the quality of these dyadic motion models, which are demonstrating the potential for more intuitive and responsive human-AI interactions.
Submitted 30 June, 2025; v1 submitted 27 June, 2025;
originally announced June 2025.
-
Generative Blocks World: Moving Things Around in Pictures
Authors:
Vaibhav Vavilala,
Seemandhar Jain,
Rahul Vasanth,
D. A. Forsyth,
Anand Bhattad
Abstract:
We describe Generative Blocks World to interact with the scene of a generated image by manipulating simple geometric abstractions. Our method represents scenes as assemblies of convex 3D primitives, and the same scene can be represented by different numbers of primitives, allowing an editor to move either whole structures or small details. Once the scene geometry has been edited, the image is generated by a flow-based method which is conditioned on depth and a texture hint. Our texture hint takes into account the modified 3D primitives, exceeding texture-consistency provided by existing key-value caching techniques. These texture hints (a) allow accurate object and camera moves and (b) largely preserve the identity of objects depicted. Quantitative and qualitative experiments demonstrate that our approach outperforms prior works in visual fidelity, editability, and compositional generalization.
Submitted 25 June, 2025;
originally announced June 2025.
-
Efficient Computation of Closed Substrings
Authors:
Samkith K Jain,
Neerja Mhaskar
Abstract:
A closed string $u$ is either of length one or contains a border that occurs only as a prefix and as a suffix in $u$ and nowhere else within $u$. In this paper, we present a fast and practical $O(n\log n)$ time algorithm to compute all $Θ(n^2)$ closed substrings by introducing a compact representation for all closed substrings of a string $ w[1..n]$, using only $O(n \log n)$ space. We also present a simple and space-efficient solution to compute all maximal closed substrings (MCSs) using the suffix array ($\mathsf{SA}$) and the longest common prefix ($\mathsf{LCP}$) array of $w[1..n]$. Finally, we show that the exact number of MCSs ($M(f_n)$) in a Fibonacci word $ f_n $, for $n \geq 5$, is $\approx \left(1 + \frac{1}{φ^2}\right) F_n \approx 1.382 F_n$, where $ φ$ is the golden ratio.
Submitted 22 September, 2025; v1 submitted 6 June, 2025;
originally announced June 2025.
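The definition of a closed string in this abstract can be checked directly with a brute-force test. The sketch below is only a reference check that follows the stated definition. It is far slower than the $O(n\log n)$ algorithm the paper describes and uses none of its compact representation. The function names are hypothetical.

```python
import math


def is_closed(u):
    """Return True if u has length one, or has a border that occurs in u
    only as a prefix and as a suffix (exactly two occurrences)."""
    n = len(u)
    if n == 1:
        return True
    for k in range(1, n):  # candidate border length
        border = u[:k]
        if u.endswith(border):
            # Count all (possibly overlapping) occurrences of the border.
            count = sum(1 for i in range(n - k + 1) if u.startswith(border, i))
            if count == 2:
                return True
    return False


def closed_substrings(w):
    """Enumerate (start, end) index pairs, end exclusive, of closed substrings."""
    return [
        (i, j)
        for i in range(len(w))
        for j in range(i + 1, len(w) + 1)
        if is_closed(w[i:j])
    ]


print(is_closed("abab"))          # True: border "ab" occurs only at both ends
print(is_closed("abaa"))          # False: only border "a" also occurs inside
print(len(closed_substrings("aba")))  # 4: "a", "b", "a", "aba"

# The constant in the Fibonacci-word result, using the golden ratio.
phi = (1 + math.sqrt(5)) / 2
print(round(1 + 1 / phi**2, 3))   # 1.382
```

Enumerating every substring this way takes at least quadratic time, which is the cost the compact representation in the abstract is meant to avoid.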
-
StatsMerging: Statistics-Guided Model Merging via Task-Specific Teacher Distillation
Authors:
Ranjith Merugu,
Bryan Bo Cao,
Shubham Jain
Abstract:
Model merging has emerged as a promising solution to accommodate multiple large models within constrained memory budgets. We present StatsMerging, a novel lightweight learning-based model merging method guided by weight distribution statistics without requiring ground truth labels or test samples. StatsMerging offers three key advantages: (1) It uniquely leverages singular values from singular value decomposition (SVD) to capture task-specific weight distributions, serving as a proxy for task importance to guide task coefficient prediction; (2) It employs a lightweight learner, StatsMergeLearner, to model the weight distributions of task-specific pre-trained models, improving generalization and enhancing adaptation to unseen samples; (3) It introduces Task-Specific Teacher Distillation for merging vision models with heterogeneous architectures, a merging learning paradigm that avoids costly ground-truth labels by task-specific teacher distillation. Notably, we present two types of knowledge distillation: (a) distilling knowledge from task-specific models to StatsMergeLearner; and (b) distilling knowledge from models with heterogeneous architectures prior to merging. Extensive experiments across eight tasks demonstrate the effectiveness of StatsMerging. Our results show that StatsMerging outperforms state-of-the-art techniques in terms of overall accuracy, generalization to unseen tasks, and robustness to image quality variations.
Submitted 4 June, 2025;
originally announced June 2025.
-
Mark My Words: A Robust Multilingual Model for Punctuation in Text and Speech Transcripts
Authors:
Sidharth Pulipaka,
Sparsh Jain,
Ashwin Sankar,
Raj Dabre
Abstract:
Punctuation plays a vital role in structuring meaning, yet current models often struggle to restore it accurately in transcripts of spontaneous speech, especially in the presence of disfluencies such as false starts and backtracking. These limitations hinder the performance of downstream tasks like translation, text-to-speech, summarization, etc., where sentence boundaries are critical for preserving quality. In this work, we introduce Cadence, a generalist punctuation restoration model adapted from a pretrained large language model. Cadence is designed to handle both clean written text and highly spontaneous spoken transcripts. It surpasses the previous state of the art in performance while expanding support from 14 to all 22 Indian languages and English. We conduct a comprehensive analysis of model behavior across punctuation types and language families, identifying persistent challenges under domain shift and with rare punctuation marks. Our findings demonstrate the efficacy of utilizing pretrained language models for multilingual punctuation restoration and highlight Cadence's practical value for low-resource NLP pipelines at scale.
Submitted 4 June, 2025;
originally announced June 2025.
-
SNIFR: Boosting Fine-Grained Child Harmful Content Detection Through Audio-Visual Alignment with Cascaded Cross-Transformer
Authors:
Orchid Chetia Phukan,
Mohd Mujtaba Akhtar,
Girish,
Swarup Ranjan Behera,
Abu Osama Siddiqui,
Sarthak Jain,
Priyabrata Mallick,
Jaya Sai Kiran Patibandla,
Pailla Balakrishna Reddy,
Arun Balaji Buduru,
Rajesh Sharma
Abstract:
As video-sharing platforms have grown over the past decade, child viewership has surged, increasing the need for precise detection of harmful content like violence or explicit scenes. Malicious users exploit moderation systems by embedding unsafe content in minimal frames to evade detection. While prior research has focused on visual cues and advanced such fine-grained detection, audio features remain underexplored. In this study, we combine audio cues with visual cues for fine-grained child harmful content detection and introduce SNIFR, a novel framework for effective alignment. SNIFR employs a transformer encoder for intra-modality interaction, followed by a cascaded cross-transformer for inter-modality alignment. Our approach achieves superior performance over unimodal and baseline fusion methods, setting a new state of the art.
Submitted 3 June, 2025;
originally announced June 2025.
-
DV365: Extremely Long User History Modeling at Instagram
Authors:
Wenhan Lyu,
Devashish Tyagi,
Yihang Yang,
Ziwei Li,
Ajay Somani,
Karthikeyan Shanmugasundaram,
Nikola Andrejevic,
Ferdi Adeputra,
Curtis Zeng,
Arun K. Singh,
Maxime Ransan,
Sagar Jain
Abstract:
Long user history is a highly valuable signal for recommendation systems, but effectively incorporating it often comes at a high cost in terms of data center power consumption and GPU resources. In this work, we chose offline embedding over end-to-end sequence length optimization methods to enable extremely long user sequence modeling as a cost-effective solution, and propose a new user embedding learning strategy, multi-slicing and summarization, that generates a highly generalizable representation of a user's long-term stable interests. The history length encoded in this embedding is up to 70,000 and on average 40,000. This embedding, named DV365, has proven highly incremental on top of the advanced attentive user sequence models deployed at Instagram. Produced by a single upstream foundational model, it has been launched in 15 different models across Instagram and Threads with significant impact, and has been battle-tested in production for >1 year since our first launch.
Submitted 31 May, 2025;
originally announced June 2025.
-
MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks
Authors:
Suhana Bedi,
Hejie Cui,
Miguel Fuentes,
Alyssa Unell,
Michael Wornow,
Juan M. Banda,
Nikesh Kotecha,
Timothy Keyes,
Yifan Mai,
Mert Oez,
Hao Qiu,
Shrey Jain,
Leonardo Schettini,
Mehr Kashyap,
Jason Alan Fries,
Akshay Swaminathan,
Philip Chung,
Fateme Nateghi,
Asad Aali,
Ashwin Nayak,
Shivam Vedak,
Sneha S. Jain,
Birju Patel,
Oluseyi Fayanju,
Shreya Shah
, et al. (56 additional authors not shown)
Abstract:
While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. We introduce MedHELM, an extensible evaluation framework for assessing LLM performance for medical tasks with three key contributions. First, a clinician-validated taxonomy spanning 5 categories, 22 subcategories, and 121 tasks developed with 29 clinicians. Second, a comprehensive benchmark suite comprising 35 benchmarks (17 existing, 18 newly formulated) providing complete coverage of all categories and subcategories in the taxonomy. Third, a systematic comparison of LLMs with improved evaluation methods (using an LLM-jury) and a cost-performance analysis. Evaluation of 9 frontier LLMs, using the 35 benchmarks, revealed significant performance variation. Advanced reasoning models (DeepSeek R1: 66% win-rate; o3-mini: 64% win-rate) demonstrated superior performance, though Claude 3.5 Sonnet achieved comparable results at 40% lower estimated computational cost. On a normalized accuracy scale (0-1), most models performed strongly in Clinical Note Generation (0.73-0.85) and Patient Communication & Education (0.78-0.83), moderately in Medical Research Assistance (0.65-0.75), and generally lower in Clinical Decision Support (0.56-0.72) and Administration & Workflow (0.53-0.63). Our LLM-jury evaluation method achieved good agreement with clinician ratings (ICC = 0.47), surpassing both average clinician-clinician agreement (ICC = 0.43) and automated baselines including ROUGE-L (0.36) and BERTScore-F1 (0.44). Claude 3.5 Sonnet achieved comparable performance to top models at lower estimated cost. These findings highlight the importance of real-world, task-specific evaluation for medical use of LLMs and provide an open-source framework to enable this.
Submitted 2 June, 2025; v1 submitted 26 May, 2025;
originally announced May 2025.
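The abstract above reports win-rates without defining them. The sketch below assumes a pairwise definition (for each benchmark, the fraction of other models that a model strictly outscores, averaged over benchmarks); whether MedHELM computes it exactly this way is an assumption. The function name `mean_win_rate` and the toy scores are hypothetical and are not results from the paper.

```python
def mean_win_rate(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """scores[model][benchmark] -> score, higher is better.
    Returns, per model, the average over benchmarks of the fraction of
    other models it strictly beats (ties count as no win)."""
    models = list(scores)
    # Only benchmarks on which every model was scored are compared.
    benchmarks = set.intersection(*(set(per_model) for per_model in scores.values()))
    rates = {}
    for model in models:
        others = [other for other in models if other != model]
        per_benchmark = [
            sum(scores[model][b] > scores[other][b] for other in others) / len(others)
            for b in benchmarks
        ]
        rates[model] = sum(per_benchmark) / len(per_benchmark)
    return rates


# Toy, made-up scores for three hypothetical models on two hypothetical benchmarks.
toy_scores = {
    "model_a": {"bench_1": 0.8, "bench_2": 0.6},
    "model_b": {"bench_1": 0.7, "bench_2": 0.7},
    "model_c": {"bench_1": 0.5, "bench_2": 0.5},
}
toy_rates = mean_win_rate(toy_scores)
```

In the toy data, `model_a` and `model_b` each win one benchmark outright and beat only `model_c` on the other, so they tie on win-rate even though their raw scores differ.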
-
Reality Check: A New Evaluation Ecosystem Is Necessary to Understand AI's Real World Effects
Authors:
Reva Schwartz,
Rumman Chowdhury,
Akash Kundu,
Heather Frase,
Marzieh Fadaee,
Tom David,
Gabriella Waters,
Afaf Taik,
Morgan Briggs,
Patrick Hall,
Shomik Jain,
Kyra Yee,
Spencer Thomas,
Sundeep Bhandari,
Paul Duncan,
Andrew Thompson,
Maya Carlyle,
Qinghua Lu,
Matthew Holmes,
Theodora Skeadas
Abstract:
Conventional AI evaluation approaches concentrated within the AI stack exhibit systemic limitations for exploring, navigating and resolving the human and societal factors that play out in real world deployment such as in education, finance, healthcare, and employment sectors. AI capability evaluations can capture detail about first-order effects, such as whether immediate system outputs are accurate, or contain toxic, biased or stereotypical content, but AI's second-order effects, i.e. any long-term outcomes and consequences that may result from AI use in the real world, have become a significant area of interest as the technology becomes embedded in our daily lives. These secondary effects can include shifts in user behavior, societal, cultural and economic ramifications, workforce transformations, and long-term downstream impacts that may result from a broad and growing set of risks. This position paper argues that measuring the indirect and secondary effects of AI will require expansion beyond static, single-turn approaches conducted in silico to include testing paradigms that can capture what actually materializes when people use AI technology in context. Specifically, we describe the need for data and methods that can facilitate contextual awareness and enable downstream interpretation and decision making about AI's secondary effects, and recommend requirements for a new ecosystem.
Submitted 30 May, 2025; v1 submitted 24 May, 2025;
originally announced May 2025.
-
Personalizing Student-Agent Interactions Using Log-Contextualized Retrieval Augmented Generation (RAG)
Authors:
Clayton Cohn,
Surya Rayala,
Caitlin Snyder,
Joyce Fonteles,
Shruti Jain,
Naveeduddin Mohammed,
Umesh Timalsina,
Sarah K. Burriss,
Ashwin T S,
Namrata Srivastava,
Menton Deweese,
Angela Eeds,
Gautam Biswas
Abstract:
Collaborative dialogue offers rich insights into students' learning and critical thinking, which is essential for personalizing pedagogical agent interactions in STEM+C settings. While large language models (LLMs) facilitate dynamic pedagogical interactions, hallucinations undermine confidence, trust, and instructional value. Retrieval-augmented generation (RAG) grounds LLM outputs in curated knowledge but requires a clear semantic link between user input and a knowledge base, which is often weak in student dialogue. We propose log-contextualized RAG (LC-RAG), which enhances RAG retrieval by using environment logs to contextualize collaborative discourse. Our findings show that LC-RAG improves retrieval over a discourse-only baseline and allows our collaborative peer agent, Copa, to deliver relevant, personalized guidance that supports students' critical thinking and epistemic decision-making in a collaborative computational modeling environment, C2STEM.
Submitted 16 June, 2025; v1 submitted 22 May, 2025;
originally announced May 2025.
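The idea of contextualizing a weakly informative student utterance with environment logs before retrieval can be sketched as follows. This is a toy illustration, not the LC-RAG implementation: it swaps in bag-of-words cosine similarity for whatever retriever the authors use, and the knowledge-base entries, utterance, and log line are all invented for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased token counts; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def score_all(query: str, docs: dict[str, str]) -> dict[str, float]:
    q = bag_of_words(query)
    return {name: cosine(q, bag_of_words(text)) for name, text in docs.items()}


def retrieve(query: str, docs: dict[str, str]) -> str:
    scores = score_all(query, docs)
    return max(scores, key=scores.get)


# Hypothetical knowledge base, utterance, and environment log.
knowledge_base = {
    "velocity_update": "set velocity using acceleration times delta time in the simulation step",
    "conditionals": "use an if block to check a condition before changing a variable",
}
utterance = "why is it not working"
log_context = "student edited if block condition on variable x_position"

discourse_only_scores = score_all(utterance, knowledge_base)
contextualized_hit = retrieve(utterance + " " + log_context, knowledge_base)
```

Here the utterance alone shares no terms with either entry, which mirrors the weak semantic link described in the abstract; appending the log line is what lets the retriever surface the entry on conditionals.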
-
JARVIS: A Multi-Agent Code Assistant for High-Quality EDA Script Generation
Authors:
Ghasem Pasandi,
Kishor Kunal,
Varun Tej,
Kunjal Shah,
Hanfei Sun,
Sumit Jain,
Chunhui Li,
Chenhui Deng,
Teodor-Dumitru Ene,
Haoxing Ren,
Sreedhar Pratty
Abstract:
This paper presents JARVIS, a novel multi-agent framework that leverages Large Language Models (LLMs) and domain expertise to generate high-quality scripts for specialized Electronic Design Automation (EDA) tasks. By combining a domain-specific LLM trained with synthetically generated data, a custom compiler for structural verification, rule enforcement, code fixing capabilities, and advanced retrieval mechanisms, our approach achieves significant improvements over state-of-the-art domain-specific models. Our framework addresses the challenges of data scarcity and hallucination errors in LLMs, demonstrating the potential of LLMs in specialized engineering domains. We evaluate our framework on multiple benchmarks and show that it outperforms existing models in terms of accuracy and reliability. Our work sets a new precedent for the application of LLMs in EDA and paves the way for future innovations in this field.
Submitted 15 August, 2025; v1 submitted 20 May, 2025;
originally announced May 2025.
-
JaxRobotarium: Training and Deploying Multi-Robot Policies in 10 Minutes
Authors:
Shalin Anand Jain,
Jiazhen Liu,
Siva Kailas,
Harish Ravichandar
Abstract:
Multi-agent reinforcement learning (MARL) has emerged as a promising solution for learning complex and scalable coordination behaviors in multi-robot systems. However, established MARL platforms (e.g., SMAC and MPE) lack robotics relevance and hardware deployment, leaving multi-robot learning researchers to develop bespoke environments and hardware testbeds dedicated to the development and evaluation of their individual contributions. The Multi-Agent RL Benchmark and Learning Environment for the Robotarium (MARBLER) is an exciting recent step in providing a standardized robotics-relevant platform for MARL, by bridging the Robotarium testbed with existing MARL software infrastructure. However, MARBLER lacks support for parallelization and GPU/TPU execution, making the platform prohibitively slow compared to modern MARL environments and hindering adoption. We contribute JaxRobotarium, a Jax-powered end-to-end simulation, learning, deployment, and benchmarking platform for the Robotarium. JaxRobotarium enables rapid training and deployment of multi-robot RL (MRRL) policies with realistic robot dynamics and safety constraints, supporting parallelization and hardware acceleration. Our generalizable learning interface integrates easily with SOTA MARL libraries (e.g., JaxMARL). In addition, JaxRobotarium includes eight standardized coordination scenarios, including four novel scenarios that bring established MARL benchmark tasks (e.g., RWARE and Level-Based Foraging) to a robotics setting. We demonstrate that JaxRobotarium retains high simulation fidelity while achieving dramatic speedups over baseline (20x in training and 150x in simulation), and provides an open-access sim-to-real evaluation pipeline through the Robotarium testbed, accelerating and democratizing access to multi-robot learning research and evaluation. Our code is available at https://github.com/GT-STAR-Lab/JaxRobotarium.
Submitted 26 May, 2025; v1 submitted 10 May, 2025;
originally announced May 2025.
-
Llama-Nemotron: Efficient Reasoning Models
Authors:
Akhiad Bercovich,
Itay Levy,
Izik Golan,
Mohammad Dabbah,
Ran El-Yaniv,
Omri Puny,
Ido Galil,
Zach Moshe,
Tomer Ronen,
Najeeb Nabwani,
Ido Shahaf,
Oren Tropp,
Ehud Karpas,
Ran Zilberstein,
Jiaqi Zeng,
Soumye Singhal,
Alexander Bukharin,
Yian Zhang,
Tugrul Konuk,
Gerald Shen,
Ameya Sunil Mahabaleshwarkar,
Bilal Kartal,
Yoshi Suhara,
Olivier Delalleau,
Zijia Chen
, et al. (111 additional authors not shown)
Abstract:
We introduce the Llama-Nemotron series of models, an open family of heterogeneous reasoning models that deliver exceptional reasoning capabilities, inference efficiency, and an open license for enterprise use. The family comes in three sizes -- Nano (8B), Super (49B), and Ultra (253B) -- and performs competitively with state-of-the-art reasoning models such as DeepSeek-R1 while offering superior inference throughput and memory efficiency. In this report, we discuss the training procedure for these models, which entails using neural architecture search from Llama 3 models for accelerated inference, knowledge distillation, and continued pretraining, followed by a reasoning-focused post-training stage consisting of two main parts: supervised fine-tuning and large scale reinforcement learning. Llama-Nemotron models are the first open-source models to support a dynamic reasoning toggle, allowing users to switch between standard chat and reasoning modes during inference. To further support open research and facilitate model development, we provide the following resources: 1. We release the Llama-Nemotron reasoning models -- LN-Nano, LN-Super, and LN-Ultra -- under the commercially permissive NVIDIA Open Model License Agreement. 2. We release the complete post-training dataset: Llama-Nemotron-Post-Training-Dataset. 3. We also release our training codebases: NeMo, NeMo-Aligner, and Megatron-LM.
Submitted 9 September, 2025; v1 submitted 1 May, 2025;
originally announced May 2025.
-
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding
Authors:
Jang Hyun Cho,
Andrea Madotto,
Effrosyni Mavroudi,
Triantafyllos Afouras,
Tushar Nagarajan,
Muhammad Maaz,
Yale Song,
Tengyu Ma,
Shuming Hu,
Suyog Jain,
Miguel Martin,
Huiyu Wang,
Hanoona Rasheed,
Peize Sun,
Po-Yao Huang,
Daniel Bolya,
Nikhila Ravi,
Shashank Jain,
Tammy Stark,
Shane Moon,
Babak Damavandi,
Vivian Lee,
Andrew Westbury,
Salman Khan,
Philipp Krähenbühl
, et al. (4 additional authors not shown)
Abstract:
Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design and training recipe. The research community has responded by using distillation from black-box models to label training data, achieving strong benchmark results, at the cost of measurable scientific progress. However, without knowing the details of the teacher model and its data sources, scientific progress remains difficult to measure. In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM-VideoBench, a suite for evaluating challenging video understanding tasks focusing on the ability to reason about "what", "where", "when", and "how" of a video. We make our work fully reproducible by providing data, training recipes, code & models. https://github.com/facebookresearch/perception_models
Submitted 23 July, 2025; v1 submitted 17 April, 2025;
originally announced April 2025.
-
CiMBA: Accelerating Genome Sequencing through On-Device Basecalling via Compute-in-Memory
Authors:
William Andrew Simon,
Irem Boybat,
Riselda Kodra,
Elena Ferro,
Gagandeep Singh,
Mohammed Alser,
Shubham Jain,
Hsinyu Tsai,
Geoffrey W. Burr,
Onur Mutlu,
Abu Sebastian
Abstract:
As genome sequencing is finding utility in a wide variety of domains beyond the confines of traditional medical settings, its computational pipeline faces two significant challenges. First, the creation of up to 0.5 GB of data per minute imposes substantial communication and storage overheads. Second, the sequencing pipeline is bottlenecked at the basecalling step, consuming >40% of genome analysis time. A range of proposals have attempted to address these challenges, with limited success. We propose to address these challenges with a Compute-in-Memory Basecalling Accelerator (CiMBA), the first embedded ($\sim25$mm$^2$) accelerator capable of real-time, on-device basecalling, coupled with AnaLog (AL)-Dorado, a new family of analog focused basecalling DNNs. Our resulting hardware/software co-design greatly reduces data communication overhead, is capable of a throughput of 4.77 million bases per second, 24x that required for real-time operation, and achieves 17x/27x power/area efficiency over the best prior basecalling embedded accelerator while maintaining a high accuracy comparable to state-of-the-art software basecallers.
Submitted 9 April, 2025;
originally announced April 2025.
-
Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models
Authors:
NVIDIA,
Aaron Blakeman,
Aarti Basant,
Abhinav Khattar,
Adithya Renduchintala,
Akhiad Bercovich,
Aleksander Ficek,
Alexis Bjorlin,
Ali Taghibakhshi,
Amala Sanjay Deshmukh,
Ameya Sunil Mahabaleshwarkar,
Andrew Tao,
Anna Shors,
Ashwath Aithal,
Ashwin Poojary,
Ayush Dattagupta,
Balaram Buddharaju,
Bobby Chen,
Boris Ginsburg,
Boxin Wang,
Brandon Norick,
Brian Butterfield,
Bryan Catanzaro,
Carlo del Mundo
, et al. (176 additional authors not shown)
Abstract:
As inference-time scaling becomes critical for enhanced reasoning capabilities, it is becoming increasingly important to build models that are efficient to infer. We introduce Nemotron-H, a family of 8B and 56B/47B hybrid Mamba-Transformer models designed to reduce inference cost for a given accuracy level. To achieve this goal, we replace the majority of self-attention layers in the common Transformer model architecture with Mamba layers that perform constant computation and require constant memory per generated token. We show that Nemotron-H models offer either better or on-par accuracy compared to other similarly-sized state-of-the-art open-sourced Transformer models (e.g., Qwen-2.5-7B/72B and Llama-3.1-8B/70B), while being up to 3$\times$ faster at inference. To further increase inference speed and reduce the memory required at inference time, we created Nemotron-H-47B-Base from the 56B model using a new compression via pruning and distillation technique called MiniPuzzle. Nemotron-H-47B-Base achieves similar accuracy to the 56B model, but is 20% faster to infer. In addition, we introduce an FP8-based training recipe and show that it can achieve on-par results with BF16-based training. This recipe is used to train the 56B model. We are releasing Nemotron-H base model checkpoints with support in Hugging Face and NeMo.
Submitted 5 September, 2025; v1 submitted 4 April, 2025;
originally announced April 2025.
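The memory argument in the abstract above (self-attention caches grow with the number of tokens, while a recurrent state-space layer keeps a fixed-size state) can be made concrete with back-of-the-envelope arithmetic. The layer counts and dimensions below are illustrative placeholders, not Nemotron-H's actual configuration, and both function names are hypothetical.

```python
def attention_cache_bytes(
    tokens: int,
    layers: int,
    kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,
) -> int:
    """Key/value cache size for self-attention layers: one key and one
    value vector per token, per KV head, per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def recurrent_state_bytes(
    layers: int,
    state_values_per_layer: int,
    bytes_per_value: int = 2,
) -> int:
    """State size for recurrent (state-space) layers: fixed per layer,
    with no dependence on how many tokens have been generated."""
    return layers * state_values_per_layer * bytes_per_value


# Illustrative numbers only (assumed, not taken from the paper).
short_context = attention_cache_bytes(tokens=1_000, layers=32, kv_heads=8, head_dim=128)
long_context = attention_cache_bytes(tokens=100_000, layers=32, kv_heads=8, head_dim=128)
fixed_state = recurrent_state_bytes(layers=32, state_values_per_layer=1_000_000)
```

With these placeholder values the attention cache grows by a factor of 100 between the two context lengths, while `fixed_state` is the same at any length, which is the property the abstract refers to as constant memory per generated token.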
-
OpenCodeReasoning: Advancing Data Distillation for Competitive Coding
Authors:
Wasi Uddin Ahmad,
Sean Narenthiran,
Somshubra Majumdar,
Aleksander Ficek,
Siddhartha Jain,
Jocelyn Huang,
Vahid Noroozi,
Boris Ginsburg
Abstract:
Since the advent of reasoning-based large language models, many have found great success from distilling reasoning capabilities into student models. Such techniques have significantly bridged the gap between reasoning and standard LLMs on coding tasks. Despite this, much of the progress on distilling reasoning models remains locked behind proprietary datasets or lacks details on data curation, filtering and subsequent training. To address this, we construct a superior supervised fine-tuning (SFT) dataset that we use to achieve state-of-the-art coding capability results in models of various sizes. Our distilled models use only SFT to achieve 61.8% on LiveCodeBench and 24.6% on CodeContests, surpassing alternatives trained with reinforcement learning. We then perform analysis on the data sources used to construct our dataset, the impact of code execution filtering, and the importance of instruction/solution diversity. We observe that execution filtering negatively affected benchmark accuracy, leading us to prioritize instruction diversity over solution correctness. Finally, we also analyze the token efficiency and reasoning patterns utilized by these models. We will open-source these datasets and distilled models to the community.
Submitted 7 August, 2025; v1 submitted 2 April, 2025;
originally announced April 2025.
-
Adjoint Sensitivities for the Optimization of Nonlinear Structural Dynamics via Spectral Submanifolds
Authors:
Matteo Pozzi,
Jacopo Marconi,
Shobhit Jain,
Mingwu Li,
Francesco Braghin
Abstract:
This work presents an optimization framework for tailoring the nonlinear dynamic response of lightly damped mechanical systems using Spectral Submanifold (SSM) reduction. We derive the SSM-based backbone curve and its sensitivity with respect to parameters up to arbitrary polynomial orders, enabling efficient and accurate optimization of the nonlinear frequency-amplitude relation. We use the adjoint method to derive sensitivity expressions, which drastically reduces the computational cost compared to direct differentiation as the number of parameters increases. An important feature of this framework is the automatic adjustment of the expansion order of SSM-based ROMs using user-defined error tolerances during the optimization process. We demonstrate the effectiveness of the approach in optimizing the nonlinear response over several numerical examples of mechanical systems. Hence, the proposed framework extends the applicability of SSM-based optimization methods to practical engineering problems, offering a robust tool for the design and optimization of nonlinear mechanical structures.
Submitted 21 March, 2025;
originally announced March 2025.
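The cost claim for the adjoint method in the abstract above can be illustrated with the textbook derivation for a generic constrained problem. This is not the paper's SSM-specific sensitivity expression; the symbols $J$ (objective), $R$ (state equation), $u$ (state), $p \in \mathbb{R}^m$ (parameters), and $\lambda$ (adjoint variable) are assumptions introduced here for illustration.

```latex
% Objective J(u, p) subject to the state equation R(u, p) = 0.
% Direct differentiation: one linear solve per parameter (m solves).
\frac{dJ}{dp} = \frac{\partial J}{\partial p} + \frac{\partial J}{\partial u}\,\frac{du}{dp},
\qquad
\frac{\partial R}{\partial u}\,\frac{du}{dp} = -\frac{\partial R}{\partial p}

% Adjoint method: a single linear solve, independent of m.
\left(\frac{\partial R}{\partial u}\right)^{\top} \lambda
  = \left(\frac{\partial J}{\partial u}\right)^{\top},
\qquad
\frac{dJ}{dp} = \frac{\partial J}{\partial p} - \lambda^{\top}\,\frac{\partial R}{\partial p}
```

Both routes give the same gradient, since $\lambda^{\top} = \frac{\partial J}{\partial u}\left(\frac{\partial R}{\partial u}\right)^{-1}$; the difference is that the adjoint route solves one linear system regardless of how many parameters there are.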
-
Allocation Multiplicity: Evaluating the Promises of the Rashomon Set
Authors:
Shomik Jain,
Margaret Wang,
Kathleen Creel,
Ashia Wilson
Abstract:
The Rashomon set of equally-good models promises less discriminatory algorithms, reduced outcome homogenization, and fairer decisions through model ensembles or reconciliation. However, we argue from the perspective of allocation multiplicity that these promises may remain unfulfilled. When there are more qualified candidates than resources available, many different allocations of scarce resources can achieve the same utility. This space of equal-utility allocations may not be faithfully reflected by the Rashomon set, as we show in a case study of healthcare allocations. We attribute these unfulfilled promises to several factors: limitations in empirical methods for sampling from the Rashomon set, the standard practice of deterministically selecting individuals with the lowest risk, and structural biases that cause all equally-good models to view some qualified individuals as inherently risky.
Submitted 1 September, 2025; v1 submitted 20 March, 2025;
originally announced March 2025.
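The observation in the abstract above, that many allocations reach the same utility when qualified candidates outnumber resources, can be seen in a small enumeration. This is a toy sketch: the candidate names, the integer utilities, and the function `best_allocations` are hypothetical and are not taken from the paper's healthcare case study.

```python
import math
from itertools import combinations


def best_allocations(utilities: dict[str, int], k: int) -> list[tuple[str, ...]]:
    """Enumerate every way to give k resources to candidates and return
    all allocations that reach the maximum total utility."""
    best_total = None
    winners: list[tuple[str, ...]] = []
    for chosen in combinations(sorted(utilities), k):
        total = sum(utilities[c] for c in chosen)
        if best_total is None or total > best_total:
            best_total, winners = total, [chosen]
        elif total == best_total:
            winners.append(chosen)
    return winners


# Four equally qualified candidates (utility 1), one unqualified (utility 0),
# and only two resources to allocate.
toy_utilities = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 0}
equal_utility_allocations = best_allocations(toy_utilities, k=2)
```

With four equally qualified candidates and two resources there are `math.comb(4, 2)` optimal allocations, so a rule that always picks the same two individuals explores only one point in that space.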
-
Training Video Foundation Models with NVIDIA NeMo
Authors:
Zeeshan Patel,
Ethan He,
Parth Mannan,
Xiaowei Ren,
Ryan Wolf,
Niket Agarwal,
Jacob Huffman,
Zhuoyao Wang,
Carl Wang,
Jack Chang,
Yan Bai,
Tommy Huang,
Linnan Wang,
Sahil Jain,
Shanmugam Ramasamy,
Joseph Jennings,
Ekaterina Sirazitdinova,
Oleg Sudakov,
Mingyuan Ma,
Bobby Chen,
Forrest Lin,
Hao Wang,
Vasanth Rao Naik Sabavat,
Sriharsha Niverty,
Rong Ou
, et al. (4 additional authors not shown)
Abstract:
Video Foundation Models (VFMs) have recently been used to simulate the real world to train physical AI systems and develop creative visual experiences. However, there are significant challenges in training large-scale VFMs that can generate high-quality videos. We present a scalable, open-source VFM training pipeline with NVIDIA NeMo, providing accelerated video dataset curation, multimodal data loading, and parallelized video diffusion model training and inference. We also provide a comprehensive performance analysis highlighting best practices for efficient VFM training and inference.
Submitted 17 March, 2025;
originally announced March 2025.