-
Self-Regulating Cars: Automating Traffic Control in Free Flow Road Networks
Authors:
Ankit Bhardwaj,
Rohail Asim,
Sachin Chauhan,
Yasir Zaki,
Lakshminarayanan Subramanian
Abstract:
Free-flow road networks, such as suburban highways, are increasingly experiencing traffic congestion due to growing commuter inflow and limited infrastructure. Traditional control mechanisms, such as traffic signals or local heuristics, are ineffective or infeasible in these high-speed, signal-free environments. We introduce self-regulating cars, a reinforcement learning-based traffic control prot…
▽ More
Free-flow road networks, such as suburban highways, are increasingly experiencing traffic congestion due to growing commuter inflow and limited infrastructure. Traditional control mechanisms, such as traffic signals or local heuristics, are ineffective or infeasible in these high-speed, signal-free environments. We introduce self-regulating cars, a reinforcement learning-based traffic control protocol that dynamically modulates vehicle speeds to optimize throughput and prevent congestion, without requiring new physical infrastructure. Our approach integrates classical traffic flow theory, gap acceptance models, and microscopic simulation into a physics-informed RL framework. By abstracting roads into super-segments, the agent captures emergent flow dynamics and learns robust speed modulation policies from instantaneous traffic observations. Evaluated in the high-fidelity PTV Vissim simulator on a real-world highway network, our method improves total throughput by 5%, reduces average delay by 13%, and decreases total stops by 3% compared to the no-control setting. It also achieves smoother, congestion-resistant flow while generalizing across varied traffic patterns, demonstrating its potential for scalable, ML-driven traffic management.
△ Less
Submitted 13 June, 2025;
originally announced June 2025.
-
Automated Validation of COBOL to Java Transformation
Authors:
Atul Kumar,
Diptikalyan Saha,
Toshikai Yasue,
Kohichi Ono,
Saravanan Krishnan,
Sandeep Hans,
Fumiko Satoh,
Gerald Mitchell,
Sachin Kumar
Abstract:
Recent advances in Large Language Model (LLM) based Generative AI techniques have made it feasible to translate enterpriselevel code from legacy languages such as COBOL to modern languages such as Java or Python. While the results of LLM-based automatic transformation are encouraging, the resulting code cannot be trusted to correctly translate the original code. We propose a framework and a tool t…
▽ More
Recent advances in Large Language Model (LLM) based Generative AI techniques have made it feasible to translate enterpriselevel code from legacy languages such as COBOL to modern languages such as Java or Python. While the results of LLM-based automatic transformation are encouraging, the resulting code cannot be trusted to correctly translate the original code. We propose a framework and a tool to help validate the equivalence of COBOL and translated Java. The results can also help repair the code if there are some issues and provide feedback to the AI model to improve. We have developed a symbolic-execution-based test generation to automatically generate unit tests for the source COBOL programs which also mocks the external resource calls. We generate equivalent JUnit test cases with equivalent mocking as COBOL and run them to check semantic equivalence between original and translated programs.
△ Less
Submitted 14 April, 2025;
originally announced June 2025.
-
Fabrica: Dual-Arm Assembly of General Multi-Part Objects via Integrated Planning and Learning
Authors:
Yunsheng Tian,
Joshua Jacob,
Yijiang Huang,
Jialiang Zhao,
Edward Gu,
Pingchuan Ma,
Annan Zhang,
Farhad Javid,
Branden Romero,
Sachin Chitta,
Shinjiro Sueda,
Hui Li,
Wojciech Matusik
Abstract:
Multi-part assembly poses significant challenges for robots to execute long-horizon, contact-rich manipulation with generalization across complex geometries. We present Fabrica, a dual-arm robotic system capable of end-to-end planning and control for autonomous assembly of general multi-part objects. For planning over long horizons, we develop hierarchies of precedence, sequence, grasp, and motion…
▽ More
Multi-part assembly poses significant challenges for robots to execute long-horizon, contact-rich manipulation with generalization across complex geometries. We present Fabrica, a dual-arm robotic system capable of end-to-end planning and control for autonomous assembly of general multi-part objects. For planning over long horizons, we develop hierarchies of precedence, sequence, grasp, and motion planning with automated fixture generation, enabling general multi-step assembly on any dual-arm robots. The planner is made efficient through a parallelizable design and is optimized for downstream control stability. For contact-rich assembly steps, we propose a lightweight reinforcement learning framework that trains generalist policies across object geometries, assembly directions, and grasp poses, guided by equivariance and residual actions obtained from the plan. These policies transfer zero-shot to the real world and achieve 80% successful steps. For systematic evaluation, we propose a benchmark suite of multi-part assemblies resembling industrial and daily objects across diverse categories and geometries. By integrating efficient global planning and robust local control, we showcase the first system to achieve complete and generalizable real-world multi-part assembly without domain knowledge or human demonstrations. Project website: http://fabrica.csail.mit.edu/
△ Less
Submitted 5 June, 2025;
originally announced June 2025.
-
OpenThoughts: Data Recipes for Reasoning Models
Authors:
Etash Guha,
Ryan Marten,
Sedrick Keh,
Negin Raoof,
Georgios Smyrnis,
Hritik Bansal,
Marianna Nezhurina,
Jean Mercat,
Trung Vu,
Zayne Sprague,
Ashima Suvarna,
Benjamin Feuer,
Liangyu Chen,
Zaid Khan,
Eric Frankel,
Sachin Grover,
Caroline Choi,
Niklas Muennighoff,
Shiye Su,
Wanjia Zhao,
John Yang,
Shreyas Pimpalgaonkar,
Kartik Sharma,
Charlie Cheng-Jie Ji,
Yichuan Deng
, et al. (25 additional authors not shown)
Abstract:
Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training rea…
▽ More
Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. After initial explorations, our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improve our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as teacher yields our OpenThoughts3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond - improvements of 15.3, 17.2, and 20.5 percentage points compared to the DeepSeek-R1-Distill-Qwen-7B. All of our datasets and models are available on https://openthoughts.ai.
△ Less
Submitted 4 June, 2025; v1 submitted 4 June, 2025;
originally announced June 2025.
-
Evaluating Query Efficiency and Accuracy of Transfer Learning-based Model Extraction Attack in Federated Learning
Authors:
Sayyed Farid Ahamed,
Sandip Roy,
Soumya Banerjee,
Marc Vucovich,
Kevin Choi,
Abdul Rahman,
Alison Hu,
Edward Bowen,
Sachin Shetty
Abstract:
Federated Learning (FL) is a collaborative learning framework designed to protect client data, yet it remains highly vulnerable to Intellectual Property (IP) threats. Model extraction (ME) attacks pose a significant risk to Machine Learning as a Service (MLaaS) platforms, enabling attackers to replicate confidential models by querying black-box (without internal insight) APIs. Despite FL's privacy…
▽ More
Federated Learning (FL) is a collaborative learning framework designed to protect client data, yet it remains highly vulnerable to Intellectual Property (IP) threats. Model extraction (ME) attacks pose a significant risk to Machine Learning as a Service (MLaaS) platforms, enabling attackers to replicate confidential models by querying black-box (without internal insight) APIs. Despite FL's privacy-preserving goals, its distributed nature makes it particularly susceptible to such attacks. This paper examines the vulnerability of FL-based victim models to two types of model extraction attacks. For various federated clients built under the NVFlare platform, we implemented ME attacks across two deep learning architectures and three image datasets. We evaluate the proposed ME attack performance using various metrics, including accuracy, fidelity, and KL divergence. The experiments show that for different FL clients, the accuracy and fidelity of the extracted model are closely related to the size of the attack query set. Additionally, we explore a transfer learning based approach where pretrained models serve as the starting point for the extraction process. The results indicate that the accuracy and fidelity of the fine-tuned pretrained extraction models are notably higher, particularly with smaller query sets, highlighting potential advantages for attackers.
△ Less
Submitted 25 May, 2025;
originally announced May 2025.
-
RADEP: A Resilient Adaptive Defense Framework Against Model Extraction Attacks
Authors:
Amit Chakraborty,
Sayyed Farid Ahamed,
Sandip Roy,
Soumya Banerjee,
Kevin Choi,
Abdul Rahman,
Alison Hu,
Edward Bowen,
Sachin Shetty
Abstract:
Machine Learning as a Service (MLaaS) enables users to leverage powerful machine learning models through cloud-based APIs, offering scalability and ease of deployment. However, these services are vulnerable to model extraction attacks, where adversaries repeatedly query the application programming interface (API) to reconstruct a functionally similar model, compromising intellectual property and s…
▽ More
Machine Learning as a Service (MLaaS) enables users to leverage powerful machine learning models through cloud-based APIs, offering scalability and ease of deployment. However, these services are vulnerable to model extraction attacks, where adversaries repeatedly query the application programming interface (API) to reconstruct a functionally similar model, compromising intellectual property and security. Despite various defense strategies being proposed, many suffer from high computational costs, limited adaptability to evolving attack techniques, and a reduction in performance for legitimate users. In this paper, we introduce a Resilient Adaptive Defense Framework for Model Extraction Attack Protection (RADEP), a multifaceted defense framework designed to counteract model extraction attacks through a multi-layered security approach. RADEP employs progressive adversarial training to enhance model resilience against extraction attempts. Malicious query detection is achieved through a combination of uncertainty quantification and behavioral pattern analysis, effectively identifying adversarial queries. Furthermore, we develop an adaptive response mechanism that dynamically modifies query outputs based on their suspicion scores, reducing the utility of stolen models. Finally, ownership verification is enforced through embedded watermarking and backdoor triggers, enabling reliable identification of unauthorized model use. Experimental evaluations demonstrate that RADEP significantly reduces extraction success rates while maintaining high detection accuracy with minimal impact on legitimate queries. Extensive experiments show that RADEP effectively defends against model extraction attacks and remains resilient even against adaptive adversaries, making it a reliable security framework for MLaaS models.
△ Less
Submitted 25 May, 2025;
originally announced May 2025.
-
AI-Driven Robotics for Free-Space Optics
Authors:
Shiekh Zia Uddin,
Sachin Vaidya,
Shrish Choudhary,
Zhuo Chen,
Raafat K. Salib,
Luke Huang,
Dirk R. Englund,
Marin Soljačić
Abstract:
Tabletop optical experiments are foundational to research in many areas of science, including photonics, quantum optics, materials science, metrology, and biomedical imaging. However these experiments remain fundamentally reliant on manual design, assembly, and alignment, limiting throughput and reproducibility. Optics currently lacks generalizable robotic systems capable of operating across a div…
▽ More
Tabletop optical experiments are foundational to research in many areas of science, including photonics, quantum optics, materials science, metrology, and biomedical imaging. However these experiments remain fundamentally reliant on manual design, assembly, and alignment, limiting throughput and reproducibility. Optics currently lacks generalizable robotic systems capable of operating across a diverse range of setups in realistic laboratory environments. Here we present OptoMate, an autonomous platform that integrates generative AI, computer vision, and precision robotics to enable automation of free-space optics experiments. Our platform interprets user-defined goals to generate valid optical setups using a fine-tuned large language model (LLM), assembles the setup via robotic pick-and-place with sub-millimeter accuracy, and performs fine alignment using a robot-deployable tool. The system then executes a range of automated measurements, including laser beam characterization, polarization mapping, and spectroscopy tasks. This work demonstrates the first flexible, AI-driven automation platform for optics, offering a path toward remote operation, cloud labs, and high-throughput discovery in the optical sciences.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
Turbocharging Gaussian Process Inference with Approximate Sketch-and-Project
Authors:
Pratik Rathore,
Zachary Frangella,
Sachin Garg,
Shaghayegh Fazliani,
Michał Dereziński,
Madeleine Udell
Abstract:
Gaussian processes (GPs) play an essential role in biostatistics, scientific machine learning, and Bayesian optimization for their ability to provide probabilistic predictions and model uncertainty. However, GP inference struggles to scale to large datasets (which are common in modern applications), since it requires the solution of a linear system whose size scales quadratically with the number o…
▽ More
Gaussian processes (GPs) play an essential role in biostatistics, scientific machine learning, and Bayesian optimization for their ability to provide probabilistic predictions and model uncertainty. However, GP inference struggles to scale to large datasets (which are common in modern applications), since it requires the solution of a linear system whose size scales quadratically with the number of samples in the dataset. We propose an approximate, distributed, accelerated sketch-and-project algorithm ($\texttt{ADASAP}$) for solving these linear systems, which improves scalability. We use the theory of determinantal point processes to show that the posterior mean induced by sketch-and-project rapidly converges to the true posterior mean. In particular, this yields the first efficient, condition number-free algorithm for estimating the posterior mean along the top spectral basis functions, showing that our approach is principled for GP inference. $\texttt{ADASAP}$ outperforms state-of-the-art solvers based on conjugate gradient and coordinate descent across several benchmark datasets and a large-scale Bayesian optimization task. Moreover, $\texttt{ADASAP}$ scales to a dataset with $> 3 \cdot 10^8$ samples, a feat which has not been accomplished in the literature.
△ Less
Submitted 19 May, 2025;
originally announced May 2025.
-
The Effects of Demographic Instructions on LLM Personas
Authors:
Angel Felipe Magnossão de Paula,
J. Shane Culpepper,
Alistair Moffat,
Sachin Pathiyan Cherumanal,
Falk Scholer,
Johanne Trippas
Abstract:
Social media platforms must filter sexist content in compliance with governmental regulations. Current machine learning approaches can reliably detect sexism based on standardized definitions, but often neglect the subjective nature of sexist language and fail to consider individual users' perspectives. To address this gap, we adopt a perspectivist approach, retaining diverse annotations rather th…
▽ More
Social media platforms must filter sexist content in compliance with governmental regulations. Current machine learning approaches can reliably detect sexism based on standardized definitions, but often neglect the subjective nature of sexist language and fail to consider individual users' perspectives. To address this gap, we adopt a perspectivist approach, retaining diverse annotations rather than enforcing gold-standard labels or their aggregations, allowing models to account for personal or group-specific views of sexism. Using demographic data from Twitter, we employ large language models (LLMs) to personalize the identification of sexism.
△ Less
Submitted 16 May, 2025;
originally announced May 2025.
-
Predicting Risk of Pulmonary Fibrosis Formation in PASC Patients
Authors:
Wanying Dou,
Gorkem Durak,
Koushik Biswas,
Ziliang Hong,
Andrea Mia Bejar,
Elif Keles,
Kaan Akin,
Sukru Mehmet Erturk,
Alpay Medetalibeyoglu,
Marc Sala,
Alexander Misharin,
Hatice Savas,
Mary Salvatore,
Sachin Jambawalikar,
Drew Torigian,
Jayaram K. Udupa,
Ulas Bagci
Abstract:
While the acute phase of the COVID-19 pandemic has subsided, its long-term effects persist through Post-Acute Sequelae of COVID-19 (PASC), commonly known as Long COVID. There remains substantial uncertainty regarding both its duration and optimal management strategies. PASC manifests as a diverse array of persistent or newly emerging symptoms--ranging from fatigue, dyspnea, and neurologic impairme…
▽ More
While the acute phase of the COVID-19 pandemic has subsided, its long-term effects persist through Post-Acute Sequelae of COVID-19 (PASC), commonly known as Long COVID. There remains substantial uncertainty regarding both its duration and optimal management strategies. PASC manifests as a diverse array of persistent or newly emerging symptoms--ranging from fatigue, dyspnea, and neurologic impairments (e.g., brain fog), to cardiovascular, pulmonary, and musculoskeletal abnormalities--that extend beyond the acute infection phase. This heterogeneous presentation poses substantial challenges for clinical assessment, diagnosis, and treatment planning. In this paper, we focus on imaging findings that may suggest fibrotic damage in the lungs, a critical manifestation characterized by scarring of lung tissue, which can potentially affect long-term respiratory function in patients with PASC. This study introduces a novel multi-center chest CT analysis framework that combines deep learning and radiomics for fibrosis prediction. Our approach leverages convolutional neural networks (CNNs) and interpretable feature extraction, achieving 82.2% accuracy and 85.5% AUC in classification tasks. We demonstrate the effectiveness of Grad-CAM visualization and radiomics-based feature analysis in providing clinically relevant insights for PASC-related lung fibrosis prediction. Our findings highlight the potential of deep learning-driven computational methods for early detection and risk assessment of PASC-related lung fibrosis--presented for the first time in the literature.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
Matching Tasks with Industry Groups for Augmenting Commonsense Knowledge
Authors:
Rituraj Singh,
Sachin Pawar,
Girish Palshikar
Abstract:
Commonsense knowledge bases (KB) are a source of specialized knowledge that is widely used to improve machine learning applications. However, even for a large KB such as ConceptNet, capturing explicit knowledge from each industry domain is challenging. For example, only a few samples of general {\em tasks} performed by various industries are available in ConceptNet. Here, a task is a well-defined…
▽ More
Commonsense knowledge bases (KB) are a source of specialized knowledge that is widely used to improve machine learning applications. However, even for a large KB such as ConceptNet, capturing explicit knowledge from each industry domain is challenging. For example, only a few samples of general {\em tasks} performed by various industries are available in ConceptNet. Here, a task is a well-defined knowledge-based volitional action to achieve a particular goal. In this paper, we aim to fill this gap and present a weakly-supervised framework to augment commonsense KB with tasks carried out by various industry groups (IG). We attempt to {\em match} each task with one or more suitable IGs by training a neural model to learn task-IG affinity and apply clustering to select the top-k tasks per IG. We extract a total of 2339 triples of the form $\langle IG, is~capable~of, task \rangle$ from two publicly available news datasets for 24 IGs with the precision of 0.86. This validates the reliability of the extracted task-IG pairs that can be directly added to existing KBs.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
BLAB: Brutally Long Audio Bench
Authors:
Orevaoghene Ahia,
Martijn Bartelds,
Kabir Ahuja,
Hila Gonen,
Valentin Hofmann,
Siddhant Arora,
Shuyue Stella Li,
Vishal Puttagunta,
Mofetoluwa Adeyemi,
Charishma Buchireddy,
Ben Walls,
Noah Bennett,
Shinji Watanabe,
Noah A. Smith,
Yulia Tsvetkov,
Sachin Kumar
Abstract:
Developing large audio language models (LMs) capable of understanding diverse spoken interactions is essential for accommodating the multimodal nature of human communication and can increase the accessibility of language technologies across different user populations. Recent work on audio LMs has primarily evaluated their performance on short audio segments, typically under 30 seconds, with limite…
▽ More
Developing large audio language models (LMs) capable of understanding diverse spoken interactions is essential for accommodating the multimodal nature of human communication and can increase the accessibility of language technologies across different user populations. Recent work on audio LMs has primarily evaluated their performance on short audio segments, typically under 30 seconds, with limited exploration of long-form conversational speech segments that more closely reflect natural user interactions with these models. We introduce Brutally Long Audio Bench (BLAB), a challenging long-form audio benchmark that evaluates audio LMs on localization, duration estimation, emotion, and counting tasks using audio segments averaging 51 minutes in length. BLAB consists of 833+ hours of diverse, full-length audio clips, each paired with human-annotated, text-based natural language questions and answers. Our audio data were collected from permissively licensed sources and underwent a human-assisted filtering process to ensure task compliance. We evaluate six open-source and proprietary audio LMs on BLAB and find that all of them, including advanced models such as Gemini 2.0 Pro and GPT-4o, struggle with the tasks in BLAB. Our comprehensive analysis reveals key insights into the trade-offs between task difficulty and audio duration. In general, we find that audio LMs struggle with long-form speech, with performance declining as duration increases. They perform poorly on localization, temporal reasoning, counting, and struggle to understand non-phonemic information, relying more on prompts than audio content. BLAB serves as a challenging evaluation framework to develop audio LMs with robust long-form audio understanding capabilities.
△ Less
Submitted 12 May, 2025; v1 submitted 5 May, 2025;
originally announced May 2025.
-
Regulating Algorithmic Management: A Multi-Stakeholder Study of Challenges in Aligning Software and the Law for Workplace Scheduling
Authors:
Jonathan Lynn,
Rachel Y. Kim,
Sicun Gao,
Daniel Schneider,
Sachin S. Pandya,
Min Kyung Lee
Abstract:
Algorithmic management (AM)'s impact on worker well-being has led to calls for regulation. However, little is known about the effectiveness and challenges in real-world AM regulation across the regulatory process -- rule operationalization, software use, and enforcement. Our multi-stakeholder study addresses this gap within workplace scheduling, one of the few AM domains with implemented regulatio…
▽ More
Algorithmic management (AM)'s impact on worker well-being has led to calls for regulation. However, little is known about the effectiveness and challenges in real-world AM regulation across the regulatory process -- rule operationalization, software use, and enforcement. Our multi-stakeholder study addresses this gap within workplace scheduling, one of the few AM domains with implemented regulations. We interviewed 38 stakeholders across the regulatory process: regulators, defense attorneys, worker advocates, managers, and workers. Our findings suggest that the efficacy of AM regulation is influenced by: (i) institutional constraints that challenge efforts to encode law into AM software, (ii) on-the-ground use of AM software that shapes its ability to facilitate compliance, (iii) mismatches between software and regulatory contexts that hinder enforcement, and (iv) unique concerns that software introduces when used to regulate AM. These findings underscore the importance of a sociotechnical approach to AM regulation, which considers organizational and collaborative contexts alongside the inherent attributes of software. We offer future research directions and implications for technology policy and design.
△ Less
Submitted 25 May, 2025; v1 submitted 4 May, 2025;
originally announced May 2025.
-
DBSCAN-based Vehicle Clustering and UAV Placement for NOMA-based Resource Management in Cellular V2X Communications
Authors:
Hossein Davoudi,
Behrouz Shahgholi Ghahfarokhi,
Neda Moghim,
Sachin Shetty
Abstract:
In the future wireless networks, terrestrial, aerial, space, and maritime wireless networks are integrated into a unified network to meet the needs of a fully connected global network. Nowadays, vehicular communication has become one of the challenging applications of wireless networks. In this article, we aim to address the radio resource management in Cellular V2X (C-V2X) networks using Unmanned…
▽ More
In the future wireless networks, terrestrial, aerial, space, and maritime wireless networks are integrated into a unified network to meet the needs of a fully connected global network. Nowadays, vehicular communication has become one of the challenging applications of wireless networks. In this article, we aim to address the radio resource management in Cellular V2X (C-V2X) networks using Unmanned Aerial Vehicles (UAV) and Non-orthogonal multiple access (NOMA). The goal of this problem is to maximize the spectral efficiency of vehicular users in Cellular Vehicle-to-Everything (C-V2X) networks under a fronthaul constraint. To solve this problem, a two-stage approach is utilized. In the first stage, vehicles in dense area are clustered based on their geographical locations, predicted location of vehicles, and speeds. Then UAVs are deployed to serve the clusters. In the second stage, NOMA groups are formed within each cluster and radio resources are allocated to vehicles based on NOMA groups. An optimization problem is formulated and a suboptimal method is used to solve it. The performance of the proposed method is evaluated through simulations where results demonstrate superiority of proposed method in spectral efficiency, min point, and distance.
△ Less
Submitted 30 April, 2025;
originally announced April 2025.
-
Flexible Semantic-Aware Resource Allocation: Serving More Users Through Similarity Range Constraints
Authors:
Nasrin Gholami,
Neda Moghim,
Behrouz Shahgholi Ghahfarokhi,
Pouyan Salavati,
Christo Kurisummoottil Thomas,
Sachin Shetty,
Tahereh Rahmati
Abstract:
Semantic communication (SemCom) aims to enhance the resource efficiency of next-generation networks by transmitting the underlying meaning of messages, focusing on information relevant to the end user. Existing literature on SemCom primarily emphasizes learning the encoder and decoder through end-to-end deep learning frameworks, with the objective of minimizing a task-specific semantic loss functi…
▽ More
Semantic communication (SemCom) aims to enhance the resource efficiency of next-generation networks by transmitting the underlying meaning of messages, focusing on information relevant to the end user. Existing literature on SemCom primarily emphasizes learning the encoder and decoder through end-to-end deep learning frameworks, with the objective of minimizing a task-specific semantic loss function. Beyond its influence on the physical and application layer design, semantic variability across users in multi-user systems enables the design of resource allocation schemes that incorporate user-specific semantic requirements. To this end, \emph{a semantic-aware resource allocation} scheme is proposed with the objective of maximizing transmission and semantic reliability, ultimately increasing the number of users whose semantic requirements are met. The resulting resource allocation problem is a non-convex mixed-integer nonlinear program (MINLP), which is known to be NP-hard. To make the problem tractable, it is decomposed into a set of sub-problems, each of which is efficiently solved via geometric programming techniques. Finally, simulations demonstrate that the proposed method improves user satisfaction by up to $17.1\%$ compared to state of the art methods based on quality of experience-aware SemCom methods.
△ Less
Submitted 29 April, 2025;
originally announced April 2025.
-
When Testing AI Tests Us: Safeguarding Mental Health on the Digital Frontlines
Authors:
Sachin R. Pendse,
Darren Gergle,
Rachel Kornfield,
Jonah Meyerhoff,
David Mohr,
Jina Suh,
Annie Wescott,
Casey Williams,
Jessica Schleider
Abstract:
Red-teaming is a core part of the infrastructure that ensures that AI models do not produce harmful content. Unlike past technologies, the black box nature of generative AI systems necessitates a uniquely interactional mode of testing, one in which individuals on red teams actively interact with the system, leveraging natural language to simulate malicious actors and solicit harmful outputs. This…
▽ More
Red-teaming is a core part of the infrastructure that ensures that AI models do not produce harmful content. Unlike past technologies, the black box nature of generative AI systems necessitates a uniquely interactional mode of testing, one in which individuals on red teams actively interact with the system, leveraging natural language to simulate malicious actors and solicit harmful outputs. This interactional labor done by red teams can result in mental health harms that are uniquely tied to the adversarial engagement strategies necessary to effectively red team. The importance of ensuring that generative AI models do not propagate societal or individual harm is widely recognized -- one less visible foundation of end-to-end AI safety is also the protection of the mental health and wellbeing of those who work to keep model outputs safe. In this paper, we argue that the unmet mental health needs of AI red-teamers is a critical workplace safety concern. Through analyzing the unique mental health impacts associated with the labor done by red teams, we propose potential individual and organizational strategies that could be used to meet these needs, and safeguard the mental health of red-teamers. We develop our proposed strategies through drawing parallels between common red-teaming practices and interactional labor common to other professions (including actors, mental health professionals, conflict photographers, and content moderators), describing how individuals and organizations within these professional spaces safeguard their mental health given similar psychological demands. Drawing on these protective practices, we describe how safeguards could be adapted for the distinct mental health challenges experienced by red teaming organizations as they mitigate emerging technological risks on the new digital frontlines.
△ Less
Submitted 29 April, 2025;
originally announced April 2025.
-
LawFlow : Collecting and Simulating Lawyers' Thought Processes
Authors:
Debarati Das,
Khanh Chi Le,
Ritik Sachin Parkar,
Karin De Langis,
Brendan Madson,
Chad M. Berryman,
Robin M. Willis,
Daniel H. Moses,
Brett McDonnell,
Daniel Schwarcz,
Dongyeop Kang
Abstract:
Legal practitioners, particularly those early in their careers, face complex, high-stakes tasks that require adaptive, context-sensitive reasoning. While AI holds promise in supporting legal work, current datasets and models are narrowly focused on isolated subtasks and fail to capture the end-to-end decision-making required in real-world practice. To address this gap, we introduce LawFlow, a data…
▽ More
Legal practitioners, particularly those early in their careers, face complex, high-stakes tasks that require adaptive, context-sensitive reasoning. While AI holds promise in supporting legal work, current datasets and models are narrowly focused on isolated subtasks and fail to capture the end-to-end decision-making required in real-world practice. To address this gap, we introduce LawFlow, a dataset of complete end-to-end legal workflows collected from trained law students, grounded in real-world business entity formation scenarios. Unlike prior datasets focused on input-output pairs or linear chains of thought, LawFlow captures dynamic, modular, and iterative reasoning processes that reflect the ambiguity, revision, and client-adaptive strategies of legal practice. Using LawFlow, we compare human and LLM-generated workflows, revealing systematic differences in structure, reasoning flexibility, and plan execution. Human workflows tend to be modular and adaptive, while LLM workflows are more sequential, exhaustive, and less sensitive to downstream implications. Our findings also suggest that legal professionals prefer AI to carry out supportive roles, such as brainstorming, identifying blind spots, and surfacing alternatives, rather than executing complex workflows end-to-end. Building on these findings, we propose a set of design suggestions, rooted in empirical observations, that align AI assistance with human goals of clarity, completeness, creativity, and efficiency, through hybrid planning, adaptive execution, and decision-point support. Our results highlight both the current limitations of LLMs in supporting complex legal workflows and opportunities for developing more collaborative, reasoning-aware legal AI systems. All data and code are available on our project page (https://minnesotanlp.github.io/LawFlow-website/).
△ Less
Submitted 26 April, 2025;
originally announced April 2025.
-
Proof-of-TBI -- Fine-Tuned Vision Language Model Consortium and OpenAI-o3 Reasoning LLM-Based Medical Diagnosis Support System for Mild Traumatic Brain Injury (TBI) Prediction
Authors:
Ross Gore,
Eranga Bandara,
Sachin Shetty,
Alberto E. Musto,
Pratip Rana,
Ambrosio Valencia-Romero,
Christopher Rhea,
Lobat Tayebi,
Heather Richter,
Atmaram Yarlagadda,
Donna Edmonds,
Steven Wallace,
Donna Broshek
Abstract:
Mild Traumatic Brain Injury (TBI) detection presents significant challenges due to the subtle and often ambiguous presentation of symptoms in medical imaging, making accurate diagnosis a complex task. To address these challenges, we propose Proof-of-TBI, a medical diagnosis support system that integrates multiple fine-tuned vision-language models with the OpenAI-o3 reasoning large language model (…
▽ More
Mild Traumatic Brain Injury (TBI) detection presents significant challenges due to the subtle and often ambiguous presentation of symptoms in medical imaging, making accurate diagnosis a complex task. To address these challenges, we propose Proof-of-TBI, a medical diagnosis support system that integrates multiple fine-tuned vision-language models with the OpenAI-o3 reasoning large language model (LLM). Our approach fine-tunes multiple vision-language models using a labeled dataset of TBI MRI scans, training them to diagnose TBI symptoms effectively. The predictions from these models are aggregated through a consensus-based decision-making process. The system evaluates the predictions from all fine-tuned vision language models using the OpenAI-o3 reasoning LLM, a model that has demonstrated remarkable reasoning performance, to produce the most accurate final diagnosis. The LLM Agents orchestrates interactions between the vision-language models and the reasoning LLM, managing the final decision-making process with transparency, reliability, and automation. This end-to-end decision-making workflow combines the vision-language model consortium with the OpenAI-o3 reasoning LLM, enabled by custom prompt engineering by the LLM agents. The prototype for the proposed platform was developed in collaboration with the U.S. Army Medical Research team in Newport News, Virginia, incorporating five fine-tuned vision-language models. The results demonstrate the transformative potential of combining fine-tuned vision-language model inputs with the OpenAI-o3 reasoning LLM to create a robust, secure, and highly accurate diagnostic system for mild TBI prediction. To the best of our knowledge, this research represents the first application of fine-tuned vision-language models integrated with a reasoning LLM for TBI prediction tasks.
△ Less
Submitted 25 April, 2025;
originally announced April 2025.
-
Safety Pretraining: Toward the Next Generation of Safe AI
Authors:
Pratyush Maini,
Sachin Goyal,
Dylan Sam,
Alex Robey,
Yash Savani,
Yiding Jiang,
Andy Zou,
Zacharcy C. Lipton,
J. Zico Kolter
Abstract:
As large language models (LLMs) are increasingly deployed in high-stakes settings, the risk of generating harmful or toxic content remains a central challenge. Post-hoc alignment methods are brittle: once unsafe patterns are learned during pretraining, they are hard to remove. We present a data-centric pretraining framework that builds safety into the model from the start. Our contributions includ…
▽ More
As large language models (LLMs) are increasingly deployed in high-stakes settings, the risk of generating harmful or toxic content remains a central challenge. Post-hoc alignment methods are brittle: once unsafe patterns are learned during pretraining, they are hard to remove. We present a data-centric pretraining framework that builds safety into the model from the start. Our contributions include: (i) a safety classifier trained on 10,000 GPT-4 labeled examples, used to filter 600B tokens; (ii) the largest synthetic safety dataset to date (100B tokens) generated via recontextualization of harmful web data; (iii) RefuseWeb and Moral Education datasets that convert harmful prompts into refusal dialogues and web-style educational material; (iv) Harmfulness-Tag annotations injected during pretraining to flag unsafe content and steer away inference from harmful generations; and (v) safety evaluations measuring base model behavior before instruction tuning. Our safety-pretrained models reduce attack success rates from 38.8% to 8.4% with no performance degradation on standard LLM safety benchmarks.
△ Less
Submitted 23 April, 2025;
originally announced April 2025.
-
Markov Kernels, Distances and Optimal Control: A Parable of Linear Quadratic Non-Gaussian Distribution Steering
Authors:
Alexis M. H. Teter,
Wenqing Wang,
Sachin Shivakumar,
Abhishek Halder
Abstract:
For a controllable linear time-varying (LTV) pair $(\boldsymbol{A}_t,\boldsymbol{B}_t)$ and $\boldsymbol{Q}_{t}$ positive semidefinite, we derive the Markov kernel for the Itô diffusion ${\mathrm{d}}\boldsymbol{x}_{t}=\boldsymbol{A}_{t}\boldsymbol{x}_t {\mathrm{d}} t + \sqrt{2}\boldsymbol{B}_{t}{\mathrm{d}}\boldsymbol{w}_{t}$ with an accompanying killing of probability mass at rate…
▽ More
For a controllable linear time-varying (LTV) pair $(\boldsymbol{A}_t,\boldsymbol{B}_t)$ and $\boldsymbol{Q}_{t}$ positive semidefinite, we derive the Markov kernel for the Itô diffusion ${\mathrm{d}}\boldsymbol{x}_{t}=\boldsymbol{A}_{t}\boldsymbol{x}_t {\mathrm{d}} t + \sqrt{2}\boldsymbol{B}_{t}{\mathrm{d}}\boldsymbol{w}_{t}$ with an accompanying killing of probability mass at rate $\frac{1}{2}\boldsymbol{x}^{\top}\boldsymbol{Q}_{t}\boldsymbol{x}$. This Markov kernel is the Green's function for an associated linear reaction-advection-diffusion partial differential equation. Our result generalizes the recently derived kernel for the special case $\left(\boldsymbol{A}_t,\boldsymbol{B}_t\right)=\left(\boldsymbol{0},\boldsymbol{I}\right)$, and depends on the solution of an associated Riccati matrix ODE. A consequence of this result is that the linear quadratic non-Gaussian Schrödinger bridge is exactly solvable. This means that the problem of steering a controlled LTV diffusion from a given non-Gaussian distribution to another over a fixed deadline while minimizing an expected quadratic cost can be solved using dynamic Sinkhorn recursions performed with the derived kernel. Our derivation for the $\left(\boldsymbol{A}_t,\boldsymbol{B}_t,\boldsymbol{Q}_t\right)$-parametrized kernel pursues a new idea that relies on finding a state-time dependent distance-like functional given by the solution of a deterministic optimal control problem. This technique breaks away from existing methods, such as generalizing Hermite polynomials or Weyl calculus, which have seen limited success in the reaction-diffusion context. Our technique uncovers a new connection between Markov kernels, distances, and optimal control. This connection is of interest beyond its immediate application in solving the linear quadratic Schrödinger bridge problem.
△ Less
Submitted 22 April, 2025;
originally announced April 2025.
-
NTIRE 2025 Challenge on Image Super-Resolution ($\times$4): Methods and Results
Authors:
Zheng Chen,
Kai Liu,
Jue Gong,
Jingkai Wang,
Lei Sun,
Zongwei Wu,
Radu Timofte,
Yulun Zhang,
Xiangyu Kong,
Xiaoxuan Yu,
Hyunhee Park,
Suejin Han,
Hakjae Jeon,
Dafeng Zhang,
Hyung-Ju Chun,
Donghun Ryou,
Inju Ha,
Bohyung Han,
Lu Zhao,
Yuyi Zhang,
Pengyu Yan,
Jiawei Hu,
Pengwei Liu,
Fengjun Guo,
Hongyuan Yu
, et al. (86 additional authors not shown)
Abstract:
This paper presents the NTIRE 2025 image super-resolution ($\times$4) challenge, one of the associated competitions of the 10th NTIRE Workshop at CVPR 2025. The challenge aims to recover high-resolution (HR) images from low-resolution (LR) counterparts generated through bicubic downsampling with a $\times$4 scaling factor. The objective is to develop effective network designs or solutions that ach…
▽ More
This paper presents the NTIRE 2025 image super-resolution ($\times$4) challenge, one of the associated competitions of the 10th NTIRE Workshop at CVPR 2025. The challenge aims to recover high-resolution (HR) images from low-resolution (LR) counterparts generated through bicubic downsampling with a $\times$4 scaling factor. The objective is to develop effective network designs or solutions that achieve state-of-the-art SR performance. To reflect the dual objectives of image SR research, the challenge includes two sub-tracks: (1) a restoration track, emphasizes pixel-wise accuracy and ranks submissions based on PSNR; (2) a perceptual track, focuses on visual realism and ranks results by a perceptual score. A total of 286 participants registered for the competition, with 25 teams submitting valid entries. This report summarizes the challenge design, datasets, evaluation protocol, the main results, and methods of each team. The challenge serves as a benchmark to advance the state of the art and foster progress in image SR.
△ Less
Submitted 28 April, 2025; v1 submitted 20 April, 2025;
originally announced April 2025.
-
The Tenth NTIRE 2025 Image Denoising Challenge Report
Authors:
Lei Sun,
Hang Guo,
Bin Ren,
Luc Van Gool,
Radu Timofte,
Yawei Li,
Xiangyu Kong,
Hyunhee Park,
Xiaoxuan Yu,
Suejin Han,
Hakjae Jeon,
Jia Li,
Hyung-Ju Chun,
Donghun Ryou,
Inju Ha,
Bohyung Han,
Jingyu Ma,
Zhijuan Huang,
Huiyuan Fu,
Hongyuan Yu,
Boqi Zhang,
Jiawei Shi,
Heng Zhang,
Huadong Ma,
Deepak Kumar Tyagi
, et al. (69 additional authors not shown)
Abstract:
This paper presents an overview of the NTIRE 2025 Image Denoising Challenge (σ = 50), highlighting the proposed methodologies and corresponding results. The primary objective is to develop a network architecture capable of achieving high-quality denoising performance, quantitatively evaluated using PSNR, without constraints on computational complexity or model size. The task assumes independent ad…
▽ More
This paper presents an overview of the NTIRE 2025 Image Denoising Challenge (σ = 50), highlighting the proposed methodologies and corresponding results. The primary objective is to develop a network architecture capable of achieving high-quality denoising performance, quantitatively evaluated using PSNR, without constraints on computational complexity or model size. The task assumes independent additive white Gaussian noise (AWGN) with a fixed noise level of 50. A total of 290 participants registered for the challenge, with 20 teams successfully submitting valid results, providing insights into the current state-of-the-art in image denoising.
△ Less
Submitted 16 April, 2025;
originally announced April 2025.
-
The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report
Authors:
Bin Ren,
Hang Guo,
Lei Sun,
Zongwei Wu,
Radu Timofte,
Yawei Li,
Yao Zhang,
Xinning Chai,
Zhengxue Cheng,
Yingsheng Qin,
Yucai Yang,
Li Song,
Hongyuan Yu,
Pufan Xu,
Cheng Wan,
Zhijuan Huang,
Peng Guo,
Shuyuan Cui,
Chenjun Li,
Xuehai Hu,
Pan Pan,
Xin Zhang,
Heng Zhang,
Qing Luo,
Linyan Jiang
, et al. (122 additional authors not shown)
Abstract:
This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the $\operatorname{DIV2K\_LSDIR\_valid}$ dataset and 26.99 dB on the…
▽ More
This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the $\operatorname{DIV2K\_LSDIR\_valid}$ dataset and 26.99 dB on the $\operatorname{DIV2K\_LSDIR\_test}$ dataset. A robust participation saw \textbf{244} registered entrants, with \textbf{43} teams submitting valid entries. This report meticulously analyzes these methods and results, emphasizing groundbreaking advancements in state-of-the-art single-image ESR techniques. The analysis highlights innovative approaches and establishes benchmarks for future research in the field.
△ Less
Submitted 14 April, 2025;
originally announced April 2025.
-
Automated Testing of COBOL to Java Transformation
Authors:
Sandeep Hans,
Atul Kumar,
Toshikai Yasue,
Kouichi Ono,
Saravanan Krishnan,
Devika Sondhi,
Fumiko Satoh,
Gerald Mitchell,
Sachin Kumar,
Diptikalyan Saha
Abstract:
Recent advances in Large Language Model (LLM) based Generative AI techniques have made it feasible to translate enterprise-level code from legacy languages such as COBOL to modern languages such as Java or Python. While the results of LLM-based automatic transformation are encouraging, the resulting code cannot be trusted to correctly translate the original code, making manual validation of transl…
▽ More
Recent advances in Large Language Model (LLM) based Generative AI techniques have made it feasible to translate enterprise-level code from legacy languages such as COBOL to modern languages such as Java or Python. While the results of LLM-based automatic transformation are encouraging, the resulting code cannot be trusted to correctly translate the original code, making manual validation of translated Java code from COBOL a necessary but time-consuming and labor-intensive process. In this paper, we share our experience of developing a testing framework for IBM Watsonx Code Assistant for Z (WCA4Z) [5], an industrial tool designed for COBOL to Java translation. The framework automates the process of testing the functional equivalence of the translated Java code against the original COBOL programs in an industry context. Our framework uses symbolic execution to generate unit tests for COBOL, mocking external calls and transforming them into JUnit tests to validate semantic equivalence with translated Java. The results not only help identify and repair any detected discrepancies but also provide feedback to improve the AI model.
△ Less
Submitted 14 April, 2025;
originally announced April 2025.
-
Empirically Measuring Data Localization in the EU
Authors:
Alexander Gamero-Garrido,
Kicho Yu,
Sumukh Vasisht Shankar,
Sachin Kumar Singh,
Sindhya Balasubramanian,
Alexander Wilcox,
David Choffnes
Abstract:
EU data localization regulations limit data transfers to non-EU countries with the GDPR. However, BGP, DNS and other Internet protocols were not designed to enforce jurisdictional constraints, so implementing data localization is challenging. Despite initial research on the topic, little is known about if or how companies currently operate their server infrastructure to comply with the regulations…
▽ More
EU data localization regulations limit data transfers to non-EU countries with the GDPR. However, BGP, DNS and other Internet protocols were not designed to enforce jurisdictional constraints, so implementing data localization is challenging. Despite initial research on the topic, little is known about if or how companies currently operate their server infrastructure to comply with the regulations. We close this knowledge gap by empirically measuring the extent to which servers and routers that process EU requests are located outside of the EU (and a handful of "adequate" non-EU countries). The key challenge is that both browser measurements (to infer relevant endpoints) and data-plane measurements (to infer relevant IP addresses) are needed, but no large-scale public infrastructure allows both. We build a novel methodology that combines BrightData (browser) and RIPE Atlas (data-plane) probes, with joint measurements from over 1,000 networks in 20 EU countries. We find that, on average, 2.2% of servers serving users in each EU country are located in non-adequate destination countries (1.4% of known trackers). Our findings suggest that data localization policies are largely being followed by content providers, though there are exceptions.
△ Less
Submitted 11 April, 2025;
originally announced April 2025.
-
A Survey on Personalized and Pluralistic Preference Alignment in Large Language Models
Authors:
Zhouhang Xie,
Junda Wu,
Yiran Shen,
Yu Xia,
Xintong Li,
Aaron Chang,
Ryan Rossi,
Sachin Kumar,
Bodhisattwa Prasad Majumder,
Jingbo Shang,
Prithviraj Ammanabrolu,
Julian McAuley
Abstract:
Personalized preference alignment for large language models (LLMs), the process of tailoring LLMs to individual users' preferences, is an emerging research direction spanning the area of NLP and personalization. In this survey, we present an analysis of works on personalized alignment and modeling for LLMs. We introduce a taxonomy of preference alignment techniques, including training time, infere…
▽ More
Personalized preference alignment for large language models (LLMs), the process of tailoring LLMs to individual users' preferences, is an emerging research direction spanning the area of NLP and personalization. In this survey, we present an analysis of works on personalized alignment and modeling for LLMs. We introduce a taxonomy of preference alignment techniques, including training time, inference time, and additionally, user-modeling based methods. We provide analysis and discussion on the strengths and limitations of each group of techniques and then cover evaluation, benchmarks, as well as open problems in the field.
△ Less
Submitted 9 April, 2025;
originally announced April 2025.
-
PathSegDiff: Pathology Segmentation using Diffusion model representations
Authors:
Sachin Kumar Danisetty,
Alexandros Graikos,
Srikar Yellapragada,
Dimitris Samaras
Abstract:
Image segmentation is crucial in many computational pathology pipelines, including accurate disease diagnosis, subtyping, outcome, and survivability prediction. The common approach for training a segmentation model relies on a pre-trained feature extractor and a dataset of paired image and mask annotations. These are used to train a lightweight prediction model that translates features into per-pi…
▽ More
Image segmentation is crucial in many computational pathology pipelines, including accurate disease diagnosis, subtyping, outcome, and survivability prediction. The common approach for training a segmentation model relies on a pre-trained feature extractor and a dataset of paired image and mask annotations. These are used to train a lightweight prediction model that translates features into per-pixel classes. The choice of the feature extractor is central to the performance of the final segmentation model, and recent literature has focused on finding tasks to pre-train the feature extractor. In this paper, we propose PathSegDiff, a novel approach for histopathology image segmentation that leverages Latent Diffusion Models (LDMs) as pre-trained featured extractors. Our method utilizes a pathology-specific LDM, guided by a self-supervised encoder, to extract rich semantic information from H\&E stained histopathology images. We employ a simple, fully convolutional network to process the features extracted from the LDM and generate segmentation masks. Our experiments demonstrate significant improvements over traditional methods on the BCSS and GlaS datasets, highlighting the effectiveness of domain-specific diffusion pre-training in capturing intricate tissue structures and enhancing segmentation accuracy in histopathology images.
△ Less
Submitted 9 April, 2025;
originally announced April 2025.
-
Steering off Course: Reliability Challenges in Steering Language Models
Authors:
Patrick Queiroz Da Silva,
Hari Sethuraman,
Dheeraj Rajagopal,
Hannaneh Hajishirzi,
Sachin Kumar
Abstract:
Steering methods for language models (LMs) have gained traction as lightweight alternatives to fine-tuning, enabling targeted modifications to model activations. However, prior studies primarily report results on a few models, leaving critical gaps in understanding the robustness of these methods. In this work, we systematically examine three prominent steering methods -- DoLa, function vectors, a…
▽ More
Steering methods for language models (LMs) have gained traction as lightweight alternatives to fine-tuning, enabling targeted modifications to model activations. However, prior studies primarily report results on a few models, leaving critical gaps in understanding the robustness of these methods. In this work, we systematically examine three prominent steering methods -- DoLa, function vectors, and task vectors. In contrast to the original studies, which evaluated a handful of models, we test up to 36 models belonging to 14 families with sizes ranging from 1.5B to 70B parameters. Our experiments reveal substantial variability in the effectiveness of the steering approaches, with a large number of models showing no improvement and at times degradation in steering performance. Our analysis demonstrate fundamental flaws in the assumptions underlying these methods, challenging their reliability as scalable steering solutions.
△ Less
Submitted 6 April, 2025;
originally announced April 2025.
-
TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining
Authors:
Jeffrey Li,
Mohammadreza Armandpour,
Iman Mirzadeh,
Sachin Mehta,
Vaishaal Shankar,
Raviteja Vemulapalli,
Samy Bengio,
Oncel Tuzel,
Mehrdad Farajtabar,
Hadi Pouransari,
Fartash Faghri
Abstract:
Large Language Models (LLMs) trained on historical web data inevitably become outdated. We investigate evaluation strategies and update methods for LLMs as new data becomes available. We introduce a web-scale dataset for time-continual pretraining of LLMs derived from 114 dumps of Common Crawl (CC) - orders of magnitude larger than previous continual language modeling benchmarks. We also design ti…
▽ More
Large Language Models (LLMs) trained on historical web data inevitably become outdated. We investigate evaluation strategies and update methods for LLMs as new data becomes available. We introduce a web-scale dataset for time-continual pretraining of LLMs derived from 114 dumps of Common Crawl (CC) - orders of magnitude larger than previous continual language modeling benchmarks. We also design time-stratified evaluations across both general CC data and specific domains (Wikipedia, StackExchange, and code documentation) to assess how well various continual learning methods adapt to new data while retaining past knowledge. Our findings demonstrate that, on general CC data, autoregressive meta-schedules combined with a fixed-ratio replay of older data can achieve comparable held-out loss to re-training from scratch, while requiring significantly less computation (2.6x). However, the optimal balance between incorporating new data and replaying old data differs as replay is crucial to avoid forgetting on generic web data but less so on specific domains.
△ Less
Submitted 6 June, 2025; v1 submitted 2 April, 2025;
originally announced April 2025.
-
A Note on Function Correcting Codes for b-Symbol Read Channels
Authors:
Sachin Sampath,
B. Sundar Rajan
Abstract:
Function-Correcting Codes (FCCs) is a novel paradigm in Error Control Coding introduced by Lenz et. al. 2023 for the binary substitution channel \cite{FCC}. FCCs aim to protect the function evaluation of data against errors instead of the data itself, thereby relaxing the redundancy requirements of the code. Later R. Premlal et. al. \cite{LFCC} gave new bounds on the optimal redundancy of FCCs and…
▽ More
Function-Correcting Codes (FCCs) is a novel paradigm in Error Control Coding introduced by Lenz et. al. 2023 for the binary substitution channel \cite{FCC}. FCCs aim to protect the function evaluation of data against errors instead of the data itself, thereby relaxing the redundancy requirements of the code. Later R. Premlal et. al. \cite{LFCC} gave new bounds on the optimal redundancy of FCCs and also extensively studied FCCs for linear functions. The notion of FCCs has also been extended to different channels such as symbol-pair read channel over the binary field by Xia et. al. \cite{FCSPC} and b-symbol read channel over finite fields by A.Singh et. al. \cite{FCBSC} In this work, we study FCCs for linear functions for the b-symbol read channel. We provide the Plotkin-like bound on FCCs for b-symbol read channel which reduces to a Plotkin-like bound for FCCs for the symbol-pair read channel when $b$=2. FCCs reduce to classical Error Correcting Codes (ECCs) when the function is bijective. Analogous to this our bound reduces to the Plotkin-bound for classical ECCS for both the b-symbol and symbol-pair read channels \cite{Plotkin-b-symbol, Plotkin-symbol-pair} when we consider linear bijective functions.
△ Less
Submitted 29 March, 2025;
originally announced March 2025.
-
Parametric Shadow Control for Portrait Generation in Text-to-Image Diffusion Models
Authors:
Haoming Cai,
Tsung-Wei Huang,
Shiv Gehlot,
Brandon Y. Feng,
Sachin Shah,
Guan-Ming Su,
Christopher Metzler
Abstract:
Text-to-image diffusion models excel at generating diverse portraits, but lack intuitive shadow control. Existing editing approaches, as post-processing, struggle to offer effective manipulation across diverse styles. Additionally, these methods either rely on expensive real-world light-stage data collection or require extensive computational resources for training. To address these limitations, w…
▽ More
Text-to-image diffusion models excel at generating diverse portraits, but lack intuitive shadow control. Existing editing approaches, as post-processing, struggle to offer effective manipulation across diverse styles. Additionally, these methods either rely on expensive real-world light-stage data collection or require extensive computational resources for training. To address these limitations, we introduce Shadow Director, a method that extracts and manipulates hidden shadow attributes within well-trained diffusion models. Our approach uses a small estimation network that requires only a few thousand synthetic images and hours of training-no costly real-world light-stage data needed. Shadow Director enables parametric and intuitive control over shadow shape, placement, and intensity during portrait generation while preserving artistic integrity and identity across diverse styles. Despite training only on synthetic data built on real-world identities, it generalizes effectively to generated portraits with diverse styles, making it a more accessible and resource-friendly solution.
△ Less
Submitted 7 April, 2025; v1 submitted 27 March, 2025;
originally announced March 2025.
-
Overtrained Language Models Are Harder to Fine-Tune
Authors:
Jacob Mitchell Springer,
Sachin Goyal,
Kaiyue Wen,
Tanishq Kumar,
Xiang Yue,
Sadhika Malladi,
Graham Neubig,
Aditi Raghunathan
Abstract:
Large language models are pre-trained on ever-growing token budgets under the assumption that better pre-training performance translates to improved downstream models. In this work, we challenge this assumption and show that extended pre-training can make models harder to fine-tune, leading to degraded final performance. We term this phenomenon catastrophic overtraining. For example, the instructi…
▽ More
Large language models are pre-trained on ever-growing token budgets under the assumption that better pre-training performance translates to improved downstream models. In this work, we challenge this assumption and show that extended pre-training can make models harder to fine-tune, leading to degraded final performance. We term this phenomenon catastrophic overtraining. For example, the instruction-tuned OLMo-1B model pre-trained on 3T tokens leads to over 2% worse performance on multiple standard LLM benchmarks than its 2.3T token counterpart. Through controlled experiments and theoretical analysis, we show that catastrophic overtraining arises from a systematic increase in the broad sensitivity of pre-trained parameters to modifications, including but not limited to fine-tuning. Our findings call for a critical reassessment of pre-training design that considers the downstream adaptability of the model.
△ Less
Submitted 27 March, 2025; v1 submitted 24 March, 2025;
originally announced March 2025.
-
Bridging Emotions and Architecture: Sentiment Analysis in Modern Distributed Systems
Authors:
Mahak Shah,
Akaash Vishal Hazarika,
Meetu Malhotra,
Sachin C. Patil,
Joshit Mohanty
Abstract:
Sentiment analysis is a field within NLP that has gained importance because it is applied in various areas such as; social media surveillance, customer feedback evaluation and market research. At the same time, distributed systems allow for effective processing of large amounts of data. Therefore, this paper examines how sentiment analysis converges with distributed systems by concentrating on dif…
▽ More
Sentiment analysis is a field within NLP that has gained importance because it is applied in various areas such as; social media surveillance, customer feedback evaluation and market research. At the same time, distributed systems allow for effective processing of large amounts of data. Therefore, this paper examines how sentiment analysis converges with distributed systems by concentrating on different approaches, challenges and future investigations. Furthermore, we do an extensive experiment where we train sentiment analysis models using both single node configuration and distributed architecture to bring out the benefits and shortcomings of each method in terms of performance and accuracy.
△ Less
Submitted 23 March, 2025;
originally announced March 2025.
-
RESTRAIN: Reinforcement Learning-Based Secure Framework for Trigger-Action IoT Environment
Authors:
Md Morshed Alam,
Lokesh Chandra Das,
Sandip Roy,
Sachin Shetty,
Weichao Wang
Abstract:
Internet of Things (IoT) platforms with trigger-action capability allow event conditions to trigger actions in IoT devices autonomously by creating a chain of interactions. Adversaries exploit this chain of interactions to maliciously inject fake event conditions into IoT hubs, triggering unauthorized actions on target IoT devices to implement remote injection attacks. Existing defense mechanisms…
▽ More
Internet of Things (IoT) platforms with trigger-action capability allow event conditions to trigger actions in IoT devices autonomously by creating a chain of interactions. Adversaries exploit this chain of interactions to maliciously inject fake event conditions into IoT hubs, triggering unauthorized actions on target IoT devices to implement remote injection attacks. Existing defense mechanisms focus mainly on the verification of event transactions using physical event fingerprints to enforce the security policies to block unsafe event transactions. These approaches are designed to provide offline defense against injection attacks. The state-of-the-art online defense mechanisms offer real-time defense, but extensive reliability on the inference of attack impacts on the IoT network limits the generalization capability of these approaches. In this paper, we propose a platform-independent multi-agent online defense system, namely RESTRAIN, to counter remote injection attacks at runtime. RESTRAIN allows the defense agent to profile attack actions at runtime and leverages reinforcement learning to optimize a defense policy that complies with the security requirements of the IoT network. The experimental results show that the defense agent effectively takes real-time defense actions against complex and dynamic remote injection attacks and maximizes the security gain with minimal computational overhead.
△ Less
Submitted 12 March, 2025;
originally announced March 2025.
-
SegDesicNet: Lightweight Semantic Segmentation in Remote Sensing with Geo-Coordinate Embeddings for Domain Adaptation
Authors:
Sachin Verma,
Frank Lindseth,
Gabriel Kiss
Abstract:
Semantic segmentation is essential for analyzing highdefinition remote sensing images (HRSIs) because it allows the precise classification of objects and regions at the pixel level. However, remote sensing data present challenges owing to geographical location, weather, and environmental variations, making it difficult for semantic segmentation models to generalize across diverse scenarios. Existi…
▽ More
Semantic segmentation is essential for analyzing highdefinition remote sensing images (HRSIs) because it allows the precise classification of objects and regions at the pixel level. However, remote sensing data present challenges owing to geographical location, weather, and environmental variations, making it difficult for semantic segmentation models to generalize across diverse scenarios. Existing methods are often limited to specific data domains and require expert annotators and specialized equipment for semantic labeling. In this study, we propose a novel unsupervised domain adaptation technique for remote sensing semantic segmentation by utilizing geographical coordinates that are readily accessible in remote sensing setups as metadata in a dataset. To bridge the domain gap, we propose a novel approach that considers the combination of an imageś location encoding trait and the spherical nature of Earthś surface. Our proposed SegDesicNet module regresses the GRID positional encoding of the geo coordinates projected over the unit sphere to obtain the domain loss. Our experimental results demonstrate that the proposed SegDesicNet outperforms state of the art domain adaptation methods in remote sensing image segmentation, achieving an improvement of approximately ~6% in the mean intersection over union (MIoU) with a ~ 27\% drop in parameter count on benchmarked subsets of the publicly available FLAIR #1 dataset. We also benchmarked our method performance on the custom split of the ISPRS Potsdam dataset. Our algorithm seeks to reduce the modeling disparity between artificial neural networks and human comprehension of the physical world, making the technology more human centric and scalable.
△ Less
Submitted 11 March, 2025;
originally announced March 2025.
-
Leveraging LLMs for Mental Health: Detection and Recommendations from Social Discussions
Authors:
Vaishali Aggarwal,
Sachin Thukral,
Krushil Patel,
Arnab Chatterjee
Abstract:
Textual data from social platforms captures various aspects of mental health through discussions around and across issues, while users reach out for help and others sympathize and offer support. We propose a comprehensive framework that leverages Natural Language Processing (NLP) and Generative AI techniques to identify and assess mental health disorders, detect their severity, and create recommen…
▽ More
Textual data from social platforms captures various aspects of mental health through discussions around and across issues, while users reach out for help and others sympathize and offer support. We propose a comprehensive framework that leverages Natural Language Processing (NLP) and Generative AI techniques to identify and assess mental health disorders, detect their severity, and create recommendations for behavior change and therapeutic interventions based on users' posts on Reddit.
To classify the disorders, we use rule-based labeling methods as well as advanced pre-trained NLP models to extract nuanced semantic features from the data. We fine-tune domain-adapted and generic pre-trained NLP models based on predictions from specialized Large Language Models (LLMs) to improve classification accuracy. Our hybrid approach combines the generalization capabilities of pre-trained models with the domain-specific insights captured by LLMs, providing an improved understanding of mental health discourse. Our findings highlight the strengths and limitations of each model, offering valuable insights into their practical applicability.
This research potentially facilitates early detection and personalized care to aid practitioners and aims to facilitate timely interventions and improve overall well-being, thereby contributing to the broader field of mental health surveillance and digital health analytics.
△ Less
Submitted 3 March, 2025;
originally announced March 2025.
-
DROID: Discrete-Time Simulation for Ring-Oscillator-Based Ising Design
Authors:
Abhimanyu Kumar,
Ramprasath S.,
Chris H. Kim,
Ulya R. Karpuzcu,
Sachin S. Sapatnekar
Abstract:
Many combinatorial problems can be mapped to Ising machines, i.e., networks of coupled oscillators that settle to a minimum-energy ground state, from which the problem solution is inferred. This work proposes DROID, a novel event-driven method for simulating the evolution of a CMOS Ising machine to its ground state. The approach is accurate under general delay-phase relations that include the effe…
▽ More
Many combinatorial problems can be mapped to Ising machines, i.e., networks of coupled oscillators that settle to a minimum-energy ground state, from which the problem solution is inferred. This work proposes DROID, a novel event-driven method for simulating the evolution of a CMOS Ising machine to its ground state. The approach is accurate under general delay-phase relations that include the effects of the transistor nonlinearities and is computationally efficient. On a realistic-size all-to-all coupled ring oscillator array, DROID is nearly four orders of magnitude faster than a traditional HSPICE simulation in predicting the evolution of a coupled oscillator system and is demonstrated to attain a similar distribution of solutions as the hardware.
△ Less
Submitted 26 February, 2025;
originally announced February 2025.
-
Dynamic LLM Routing and Selection based on User Preferences: Balancing Performance, Cost, and Ethics
Authors:
Deepak Babu Piskala,
Vijay Raajaa,
Sachin Mishra,
Bruno Bozza
Abstract:
With the widespread deployment of large language models (LLMs) such as GPT4, BART, and LLaMA, the need for a system that can intelligently select the most suitable model for specific tasks while balancing cost, latency, accuracy, and ethical considerations has become increasingly important. Recognizing that not all tasks necessitate models with over 100 billion parameters, we introduce OptiRoute,…
▽ More
With the widespread deployment of large language models (LLMs) such as GPT4, BART, and LLaMA, the need for a system that can intelligently select the most suitable model for specific tasks while balancing cost, latency, accuracy, and ethical considerations has become increasingly important. Recognizing that not all tasks necessitate models with over 100 billion parameters, we introduce OptiRoute, an advanced model routing engine designed to dynamically select and route tasks to the optimal LLM based on detailed user-defined requirements. OptiRoute captures both functional (e.g., accuracy, speed, cost) and non-functional (e.g., helpfulness, harmlessness, honesty) criteria, leveraging lightweight task analysis and complexity estimation to efficiently match tasks with the best-fit models from a diverse array of LLMs. By employing a hybrid approach combining k-nearest neighbors (kNN) search and hierarchical filtering, OptiRoute optimizes for user priorities while minimizing computational overhead. This makes it ideal for real-time applications in cloud-based ML platforms, personalized AI services, and regulated industries.
△ Less
Submitted 23 February, 2025;
originally announced February 2025.
-
TESS 2: A Large-Scale Generalist Diffusion Language Model
Authors:
Jaesung Tae,
Hamish Ivison,
Sachin Kumar,
Arman Cohan
Abstract:
We introduce TESS 2, a general instruction-following diffusion language model that outperforms contemporary instruction-tuned diffusion models, as well as matches and sometimes exceeds strong autoregressive (AR) models. We train TESS 2 by first adapting a strong AR model via continued pretraining with the usual cross-entropy as diffusion loss, and then performing further instruction tuning. We fin…
▽ More
We introduce TESS 2, a general instruction-following diffusion language model that outperforms contemporary instruction-tuned diffusion models, as well as matches and sometimes exceeds strong autoregressive (AR) models. We train TESS 2 by first adapting a strong AR model via continued pretraining with the usual cross-entropy as diffusion loss, and then performing further instruction tuning. We find that adaptation training as well as the choice of the base model is crucial for training good instruction-following diffusion models. We further propose reward guidance, a novel and modular inference-time guidance procedure to align model outputs without needing to train the underlying model. Finally, we show that TESS 2 further improves with increased inference-time compute, highlighting the utility of diffusion LMs in having fine-grained controllability over the amount of compute used at inference time. Code and models are available at https://github.com/hamishivi/tess-2.
△ Less
Submitted 31 May, 2025; v1 submitted 19 February, 2025;
originally announced February 2025.
-
Interleaved Gibbs Diffusion for Constrained Generation
Authors:
Gautham Govind Anil,
Sachin Yadav,
Dheeraj Nagaraj,
Karthikeyan Shanmugam,
Prateek Jain
Abstract:
We introduce Interleaved Gibbs Diffusion (IGD), a novel generative modeling framework for mixed continuous-discrete data, focusing on constrained generation problems. Prior works on discrete and continuous-discrete diffusion models assume factorized denoising distribution for fast generation, which can hinder the modeling of strong dependencies between random variables encountered in constrained g…
▽ More
We introduce Interleaved Gibbs Diffusion (IGD), a novel generative modeling framework for mixed continuous-discrete data, focusing on constrained generation problems. Prior works on discrete and continuous-discrete diffusion models assume factorized denoising distribution for fast generation, which can hinder the modeling of strong dependencies between random variables encountered in constrained generation. IGD moves beyond this by interleaving continuous and discrete denoising algorithms via a discrete time Gibbs sampling type Markov chain. IGD provides flexibility in the choice of denoisers, allows conditional generation via state-space doubling and inference time scaling via the ReDeNoise method. Empirical evaluations on three challenging tasks-solving 3-SAT, generating molecule structures, and generating layouts-demonstrate state-of-the-art performance. Notably, IGD achieves a 7% improvement on 3-SAT out of the box and achieves state-of-the-art results in molecule generation without relying on equivariant diffusion or domain-specific architectures. We explore a wide range of modeling, and interleaving strategies along with hyperparameters in each of these problems.
△ Less
Submitted 19 February, 2025;
originally announced February 2025.
-
From Dense to Dynamic: Token-Difficulty Driven MoEfication of Pre-Trained LLMs
Authors:
Kumari Nishu,
Sachin Mehta,
Samira Abnar,
Mehrdad Farajtabar,
Maxwell Horton,
Mahyar Najibi,
Moin Nabi,
Minsik Cho,
Devang Naik
Abstract:
Training large language models (LLMs) for different inference constraints is computationally expensive, limiting control over efficiency-accuracy trade-offs. Moreover, once trained, these models typically process tokens uniformly, regardless of their complexity, leading to static and inflexible behavior. In this paper, we introduce a post-training optimization framework, DynaMoE, that adapts a pre…
▽ More
Training large language models (LLMs) for different inference constraints is computationally expensive, limiting control over efficiency-accuracy trade-offs. Moreover, once trained, these models typically process tokens uniformly, regardless of their complexity, leading to static and inflexible behavior. In this paper, we introduce a post-training optimization framework, DynaMoE, that adapts a pre-trained dense LLM to a token-difficulty-driven Mixture-of-Experts model with minimal fine-tuning cost. This adaptation makes the model dynamic, with sensitivity control to customize the balance between efficiency and accuracy. DynaMoE features a token-difficulty-aware router that predicts the difficulty of tokens and directs them to the appropriate sub-networks or experts, enabling larger experts to handle more complex tokens and smaller experts to process simpler ones. Our experiments demonstrate that DynaMoE can generate a range of adaptive model variants of the existing trained LLM with a single fine-tuning step, utilizing only $10B$ tokens, a minimal cost compared to the base model's training. Each variant offers distinct trade-offs between accuracy and performance. Compared to the baseline post-training optimization framework, Flextron, our method achieves similar aggregated accuracy across downstream tasks, despite using only $\frac{1}{9}\text{th}$ of their fine-tuning cost.
△ Less
Submitted 17 February, 2025;
originally announced February 2025.
-
SMAB: MAB based word Sensitivity Estimation Framework and its Applications in Adversarial Text Generation
Authors:
Saurabh Kumar Pandey,
Sachin Vashistha,
Debrup Das,
Somak Aditya,
Monojit Choudhury
Abstract:
To understand the complexity of sequence classification tasks, Hahn et al. (2021) proposed sensitivity as the number of disjoint subsets of the input sequence that can each be individually changed to change the output. Though effective, calculating sensitivity at scale using this framework is costly because of exponential time complexity. Therefore, we introduce a Sensitivity-based Multi-Armed Ban…
▽ More
To understand the complexity of sequence classification tasks, Hahn et al. (2021) proposed sensitivity as the number of disjoint subsets of the input sequence that can each be individually changed to change the output. Though effective, calculating sensitivity at scale using this framework is costly because of exponential time complexity. Therefore, we introduce a Sensitivity-based Multi-Armed Bandit framework (SMAB), which provides a scalable approach for calculating word-level local (sentence-level) and global (aggregated) sensitivities concerning an underlying text classifier for any dataset. We establish the effectiveness of our approach through various applications. We perform a case study on CHECKLIST generated sentiment analysis dataset where we show that our algorithm indeed captures intuitively high and low-sensitive words. Through experiments on multiple tasks and languages, we show that sensitivity can serve as a proxy for accuracy in the absence of gold data. Lastly, we show that guiding perturbation prompts using sensitivity values in adversarial example generation improves attack success rate by 15.58%, whereas using sensitivity as an additional reward in adversarial paraphrase generation gives a 12.00% improvement over SOTA approaches. Warning: Contains potentially offensive content.
△ Less
Submitted 10 February, 2025;
originally announced February 2025.
-
Accelerating OTA Circuit Design: Transistor Sizing Based on a Transformer Model and Precomputed Lookup Tables
Authors:
Subhadip Ghosh,
Endalk Y. Gebru,
Chandramouli V. Kashyap,
Ramesh Harjani,
Sachin S. Sapatnekar
Abstract:
Device sizing is crucial for meeting performance specifications in operational transconductance amplifiers (OTAs), and this work proposes an automated sizing framework based on a transformer model. The approach first leverages the driving-point signal flow graph (DP-SFG) to map an OTA circuit and its specifications into transformer-friendly sequential data. A specialized tokenization approach is a…
▽ More
Device sizing is crucial for meeting performance specifications in operational transconductance amplifiers (OTAs), and this work proposes an automated sizing framework based on a transformer model. The approach first leverages the driving-point signal flow graph (DP-SFG) to map an OTA circuit and its specifications into transformer-friendly sequential data. A specialized tokenization approach is applied to the sequential data to expedite the training of the transformer on a diverse range of OTA topologies, under multiple specifications. Under specific performance constraints, the trained transformer model is used to accurately predict DP-SFG parameters in the inference phase. The predicted DP-SFG parameters are then translated to transistor sizes using a precomputed look-up table-based approach inspired by the gm/Id methodology. In contrast to previous conventional or machine-learning-based methods, the proposed framework achieves significant improvements in both speed and computational efficiency by reducing the need for expensive SPICE simulations within the optimization loop; instead, almost all SPICE simulations are confined to the one-time training phase. The method is validated on a variety of unseen specifications, and the sizing solution demonstrates over 90% success in meeting specifications with just one SPICE simulation for validation, and 100% success with 3-5 additional SPICE simulations.
△ Less
Submitted 5 February, 2025;
originally announced February 2025.
-
Evaluating Fault Tolerance and Scalability in Distributed File Systems: A Case Study of GFS, HDFS, and MinIO
Authors:
Shubham Malhotra,
Fnu Yashu,
Muhammad Saqib,
Dipkumar Mehta,
Jagdish Jangid,
Sachin Dixit
Abstract:
Distributed File Systems (DFS) are essential for managing vast datasets across multiple servers, offering benefits in scalability, fault tolerance, and data accessibility. This paper presents a comprehensive evaluation of three prominent DFSs - Google File System (GFS), Hadoop Distributed File System (HDFS), and MinIO - focusing on their fault tolerance mechanisms and scalability under varying dat…
▽ More
Distributed File Systems (DFS) are essential for managing vast datasets across multiple servers, offering benefits in scalability, fault tolerance, and data accessibility. This paper presents a comprehensive evaluation of three prominent DFSs - Google File System (GFS), Hadoop Distributed File System (HDFS), and MinIO - focusing on their fault tolerance mechanisms and scalability under varying data loads and client demands. Through detailed analysis, how these systems handle data redundancy, server failures, and client access protocols, ensuring reliability in dynamic, large-scale environments is assessed. In addition, the impact of system design on performance, particularly in distributed cloud and computing architectures is assessed. By comparing the strengths and limitations of each DFS, the paper provides practical insights for selecting the most appropriate system for different enterprise needs, from high availability storage to big data analytics.
△ Less
Submitted 28 February, 2025; v1 submitted 3 February, 2025;
originally announced February 2025.
-
Optimizing Spot Instance Reliability and Security Using Cloud-Native Data and Tools
Authors:
Muhammad Saqib,
Shubham Malhotra,
Dipkumar Mehta,
Jagdish Jangid,
Fnu Yashu,
Sachin Dixit
Abstract:
This paper represents "Cloudlab", a comprehensive, cloud - native laboratory designed to support network security research and training. Built on Google Cloud and adhering to GitOps methodologies, Cloudlab facilitates the the creation, testing, and deployment of secure, containerized workloads using Kubernetes and serverless architectures. The lab integrates tools like Palo Alto Networks firewalls…
▽ More
This paper represents "Cloudlab", a comprehensive, cloud - native laboratory designed to support network security research and training. Built on Google Cloud and adhering to GitOps methodologies, Cloudlab facilitates the the creation, testing, and deployment of secure, containerized workloads using Kubernetes and serverless architectures. The lab integrates tools like Palo Alto Networks firewalls, Bridgecrew for "Security as Code," and automated GitHub workflows to establish a robust Continuous Integration/Continuous Machine Learning pipeline. By providing an adaptive and scalable environment, Cloudlab supports advanced security concepts such as role-based access control, Policy as Code, and container security. This initiative enables data scientists and engineers to explore cutting-edge practices in a dynamic cloud-native ecosystem, fostering innovation and improving operational resilience in modern IT infrastructures.
△ Less
Submitted 6 March, 2025; v1 submitted 3 February, 2025;
originally announced February 2025.
-
Flow-based Domain Randomization for Learning and Sequencing Robotic Skills
Authors:
Aidan Curtis,
Eric Li,
Michael Noseworthy,
Nishad Gothoskar,
Sachin Chitta,
Hui Li,
Leslie Pack Kaelbling,
Nicole Carey
Abstract:
Domain randomization in reinforcement learning is an established technique for increasing the robustness of control policies trained in simulation. By randomizing environment properties during training, the learned policy can become robust to uncertainties along the randomized dimensions. While the environment distribution is typically specified by hand, in this paper we investigate automatically…
▽ More
Domain randomization in reinforcement learning is an established technique for increasing the robustness of control policies trained in simulation. By randomizing environment properties during training, the learned policy can become robust to uncertainties along the randomized dimensions. While the environment distribution is typically specified by hand, in this paper we investigate automatically discovering a sampling distribution via entropy-regularized reward maximization of a normalizing-flow-based neural sampling distribution. We show that this architecture is more flexible and provides greater robustness than existing approaches that learn simpler, parameterized sampling distributions, as demonstrated in six simulated and one real-world robotics domain. Lastly, we explore how these learned sampling distributions, combined with a privileged value function, can be used for out-of-distribution detection in an uncertainty-aware multi-step manipulation planner.
△ Less
Submitted 5 May, 2025; v1 submitted 3 February, 2025;
originally announced February 2025.
-
Deep Reinforcement Learning for Dynamic Resource Allocation in Wireless Networks
Authors:
Shubham Malhotra,
Fnu Yashu,
Muhammad Saqib,
Dipkumar Mehta,
Jagdish Jangid,
Sachin Dixit
Abstract:
This report investigates the application of deep reinforcement learning (DRL) algorithms for dynamic resource allocation in wireless communication systems. An environment that includes a base station, multiple antennas, and user equipment is created. Using the RLlib library, various DRL algorithms such as Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) are then applied. These algorithm…
▽ More
This report investigates the application of deep reinforcement learning (DRL) algorithms for dynamic resource allocation in wireless communication systems. An environment that includes a base station, multiple antennas, and user equipment is created. Using the RLlib library, various DRL algorithms such as Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) are then applied. These algorithms are compared based on their ability to optimize resource allocation, focusing on the impact of different learning rates and scheduling policies. The findings demonstrate that the choice of algorithm and learning rate significantly influences system performance, with DRL providing more efficient resource allocation compared to traditional methods.
△ Less
Submitted 13 March, 2025; v1 submitted 3 February, 2025;
originally announced February 2025.
-
Humanity's Last Exam
Authors:
Long Phan,
Alice Gatti,
Ziwen Han,
Nathaniel Li,
Josephina Hu,
Hugh Zhang,
Chen Bo Calvin Zhang,
Mohamed Shaaban,
John Ling,
Sean Shi,
Michael Choi,
Anish Agrawal,
Arnav Chopra,
Adam Khoja,
Ryan Kim,
Richard Ren,
Jason Hausenloy,
Oliver Zhang,
Mantas Mazeika,
Dmitry Dodonov,
Tung Nguyen,
Jaeho Lee,
Daron Anderson,
Mikhail Doroshenko,
Alun Cennyth Stokes
, et al. (1084 additional authors not shown)
Abstract:
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of…
▽ More
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
△ Less
Submitted 19 April, 2025; v1 submitted 24 January, 2025;
originally announced January 2025.
-
YouLeQD: Decoding the Cognitive Complexity of Questions and Engagement in Online Educational Videos from Learners' Perspectives
Authors:
Nong Ming,
Sachin Sharma,
Jiho Noh
Abstract:
Questioning is a fundamental aspect of education, as it helps assess students' understanding, promotes critical thinking, and encourages active engagement. With the rise of artificial intelligence in education, there is a growing interest in developing intelligent systems that can automatically generate and answer questions and facilitate interactions in both virtual and in-person education settin…
▽ More
Questioning is a fundamental aspect of education, as it helps assess students' understanding, promotes critical thinking, and encourages active engagement. With the rise of artificial intelligence in education, there is a growing interest in developing intelligent systems that can automatically generate and answer questions and facilitate interactions in both virtual and in-person education settings. However, to develop effective AI models for education, it is essential to have a fundamental understanding of questioning. In this study, we created the YouTube Learners' Questions on Bloom's Taxonomy Dataset (YouLeQD), which contains learner-posed questions from YouTube lecture video comments. Along with the dataset, we developed two RoBERTa-based classification models leveraging Large Language Models to detect questions and analyze their cognitive complexity using Bloom's Taxonomy. This dataset and our findings provide valuable insights into the cognitive complexity of learner-posed questions in educational videos and their relationship with interaction metrics. This can aid in the development of more effective AI models for education and improve the overall learning experience for students.
△ Less
Submitted 20 January, 2025;
originally announced January 2025.
-
Efficient auto-labeling of large-scale poultry datasets (ALPD) using an ensemble model with self- and active-learning approaches
Authors:
Ramesh Bahadur Bist,
Lilong Chai,
Shawna Weimer,
Hannah Atungulua,
Chantel Pennicott,
Xiao Yang,
Sachin Subedi,
Chaitanya Pallerla,
Yang Tian,
Dongyi Wang
Abstract:
The rapid growth of artificial intelligence in poultry farming has highlighted the challenge of efficiently labeling large, diverse datasets. Manual annotation is time-consuming and costly, making it impractical for modern systems that continuously generate data. This study addresses this challenge by exploring semi-supervised auto-labeling methods, integrating self and active learning approaches…
▽ More
The rapid growth of artificial intelligence in poultry farming has highlighted the challenge of efficiently labeling large, diverse datasets. Manual annotation is time-consuming and costly, making it impractical for modern systems that continuously generate data. This study addresses this challenge by exploring semi-supervised auto-labeling methods, integrating self and active learning approaches to develop an efficient, label-scarce framework for auto-labeling large poultry datasets (ALPD). For this study, video data were collected from broilers and laying hens housed. Various machine learning models, including zero-shot models and supervised models, were utilized for broilers and hens detection. The results showed that YOLOv8s-World and YOLOv9s performed better when compared performance metrics for broiler and hen detection under supervised learning, while among the semi-supervised model, YOLOv8s-ALPD achieved the highest precision (96.1%) and recall (99%) with an RMSE of 1.87. The hybrid YOLO-World model, incorporating the optimal YOLOv8s backbone with zero-shot models, demonstrated the highest overall performance. It achieved a precision of 99.2%, recall of 99.4%, and an F1 score of 98.7% for detection. In addition, the semi-supervised models with minimal human intervention (active learning) reduced annotation time by over 80% compared to full manual labeling. Moreover, integrating zero-shot models with the best models enhanced broiler and hen detection, achieving comparable results to supervised models while significantly increasing speed. In conclusion, integrating semi-supervised auto-labeling and zero-shot models significantly improves detection accuracy. It reduces manual annotation efforts, offering a promising solution to optimize AI-driven systems in poultry farming, advancing precision livestock management, and promoting more sustainable practices.
△ Less
Submitted 21 February, 2025; v1 submitted 18 January, 2025;
originally announced January 2025.