Search | arXiv e-print repository

How Does LLM Reasoning Work for Code? A Survey and a Call to Action

Authors: Ira Ceka, Saurabh Pujar, Irene Manotas, Gail Kaiser, Baishakhi Ray, Shyam Ramji

Abstract: The rise of large language models (LLMs) has led to dramatic improvements across a wide range of natural language tasks. These advancements have extended into the domain of code, facilitating complex tasks such as code generation, translation, summarization, and repair. However, their utility for real-world deployment in-the-wild has only recently been studied, particularly on software engineering… ▽ More The rise of large language models (LLMs) has led to dramatic improvements across a wide range of natural language tasks. These advancements have extended into the domain of code, facilitating complex tasks such as code generation, translation, summarization, and repair. However, their utility for real-world deployment in-the-wild has only recently been studied, particularly on software engineering (SWE) tasks such as GitHub issue resolution. In this study, we examine the code reasoning techniques that underlie the ability to perform such tasks, and examine the paradigms used to drive their performance. Our contributions in this paper are: (1) the first dedicated survey on code reasoning for code tasks, highlighting overarching strategies, hybrid and agentic approaches; (2) a taxonomy of various techniques used to drive code reasoning; (3) a comprehensive overview of performance on common benchmarks and a showcase of new, under-explored benchmarks with high potential in SWE; (4) an exploration on how core properties of code can be used to explain different reasoning techniques; and (5) gaps and potentially under-explored areas for future research. △ Less

Submitted 16 June, 2025; originally announced June 2025.

arXiv:2506.13805 [pdf, ps, other]

Dr. GPT Will See You Now, but Should It? Exploring the Benefits and Harms of Large Language Models in Medical Diagnosis using Crowdsourced Clinical Cases

Authors: Bonam Mingole, Aditya Majumdar, Firdaus Ahmed Choudhury, Jennifer L. Kraschnewski, Shyam S. Sundar, Amulya Yadav

Abstract: The proliferation of Large Language Models (LLMs) in high-stakes applications such as medical (self-)diagnosis and preliminary triage raises significant ethical and practical concerns about the effectiveness, appropriateness, and possible harmfulness of the use of these technologies for health-related concerns and queries. Some prior work has considered the effectiveness of LLMs in answering exper… ▽ More The proliferation of Large Language Models (LLMs) in high-stakes applications such as medical (self-)diagnosis and preliminary triage raises significant ethical and practical concerns about the effectiveness, appropriateness, and possible harmfulness of the use of these technologies for health-related concerns and queries. Some prior work has considered the effectiveness of LLMs in answering expert-written health queries/prompts, questions from medical examination banks, or queries based on pre-existing clinical cases. Unfortunately, these existing studies completely ignore an in-the-wild evaluation of the effectiveness of LLMs in answering everyday health concerns and queries typically asked by general users, which corresponds to the more prevalent use case for LLMs. To address this research gap, this paper presents the findings from a university-level competition that leveraged a novel, crowdsourced approach for evaluating the effectiveness of LLMs in answering everyday health queries. Over the course of a week, a total of 34 participants prompted four publicly accessible LLMs with 212 real (or imagined) health concerns, and the LLM generated responses were evaluated by a team of nine board-certified physicians. At a high level, our findings indicate that on average, 76% of the 212 LLM responses were deemed to be accurate by physicians. Further, with the help of medical professionals, we investigated whether RAG versions of these LLMs (powered with a comprehensive medical knowledge base) can improve the quality of responses generated by LLMs. Finally, we also derive qualitative insights to explain our quantitative findings by conducting interviews with seven medical professionals who were shown all the prompts in our competition. This paper aims to provide a more grounded understanding of how LLMs perform in real-world everyday health communication. △ Less

Submitted 13 June, 2025; originally announced June 2025.

arXiv:2506.08311 [pdf, ps, other]

Understanding Software Engineering Agents Through the Lens of Traceability: An Empirical Study

Authors: Ira Ceka, Saurabh Pujar, Shyam Ramji, Luca Buratti, Gail Kaiser, Baishakhi Ray

Abstract: With the advent of large language models (LLMs), software engineering agents (SWE agents) have emerged as a powerful paradigm for automating a range of software tasks -- from code generation and repair to test case synthesis. These agents operate autonomously by interpreting user input and responding to environmental feedback. While various agent architectures have demonstrated strong empirical pe… ▽ More With the advent of large language models (LLMs), software engineering agents (SWE agents) have emerged as a powerful paradigm for automating a range of software tasks -- from code generation and repair to test case synthesis. These agents operate autonomously by interpreting user input and responding to environmental feedback. While various agent architectures have demonstrated strong empirical performance, the internal decision-making worfklows that drive their behavior remain poorly understood. Deeper insight into these workflows hold promise for improving both agent reliability and efficiency. In this work, we present the first systematic study of SWE agent behavior through the lens of execution traces. Our contributions are as follows: (1) we propose the first taxonomy of decision-making pathways across five representative agents; (2) using this taxonomy, we identify three core components essential to agent success -- bug localization, patch generation, and reproduction test generation -- and study each in depth; (3) we study the impact of test generation on successful patch production; and analyze strategies that can lead to successful test generation; (4) we further conduct the first large-scale code clone analysis comparing agent-generated and developer-written patches and provide a qualitative study revealing structural and stylistic differences in patch content. Together, these findings offer novel insights into agent design and open avenues for building agents that are both more effective and more aligned with human development practices. △ Less

Submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.06756 [pdf, ps, other]

Can Quantized Audio Language Models Perform Zero-Shot Spoofing Detection?

Authors: Bikash Dutta, Rishabh Ranjan, Shyam Sathvik, Mayank Vatsa, Richa Singh

Abstract: Quantization is essential for deploying large audio language models (LALMs) efficiently in resource-constrained environments. However, its impact on complex tasks, such as zero-shot audio spoofing detection, remains underexplored. This study evaluates the zero-shot capabilities of five LALMs, GAMA, LTU-AS, MERaLiON, Qwen-Audio, and SALMONN, across three distinct datasets: ASVspoof2019, In-the-Wild… ▽ More Quantization is essential for deploying large audio language models (LALMs) efficiently in resource-constrained environments. However, its impact on complex tasks, such as zero-shot audio spoofing detection, remains underexplored. This study evaluates the zero-shot capabilities of five LALMs, GAMA, LTU-AS, MERaLiON, Qwen-Audio, and SALMONN, across three distinct datasets: ASVspoof2019, In-the-Wild, and WaveFake, and investigates their robustness to quantization (FP32, FP16, INT8). Despite high initial spoof detection accuracy, our analysis demonstrates severe predictive biases toward spoof classification across all models, rendering their practical performance equivalent to random classification. Interestingly, quantization to FP16 precision resulted in negligible performance degradation compared to FP32, effectively halving memory and computational requirements without materially impacting accuracy. However, INT8 quantization intensified model biases, significantly degrading balanced accuracy. These findings highlight critical architectural limitations and emphasize FP16 quantization as an optimal trade-off, providing guidelines for practical deployment and future model refinement. △ Less

Submitted 7 June, 2025; originally announced June 2025.

Comments: Accepted in Interspeech 2025

arXiv:2506.05321 [pdf, other]

LSM-2: Learning from Incomplete Wearable Sensor Data

Authors: Maxwell A. Xu, Girish Narayanswamy, Kumar Ayush, Dimitris Spathis, Shun Liao, Shyam A. Tailor, Ahmed Metwally, A. Ali Heydari, Yuwei Zhang, Jake Garrison, Samy Abdel-Ghaffar, Xuhai Xu, Ken Gu, Jacob Sunshine, Ming-Zher Poh, Yun Liu, Tim Althoff, Shrikanth Narayanan, Pushmeet Kohli, Mark Malhotra, Shwetak Patel, Yuzhe Yang, James M. Rehg, Xin Liu, Daniel McDuff

Abstract: Foundation models, a cornerstone of recent advancements in machine learning, have predominantly thrived on complete and well-structured data. Wearable sensor data frequently suffers from significant missingness, posing a substantial challenge for self-supervised learning (SSL) models that typically assume complete data inputs. This paper introduces the second generation of Large Sensor Model (LSM-… ▽ More Foundation models, a cornerstone of recent advancements in machine learning, have predominantly thrived on complete and well-structured data. Wearable sensor data frequently suffers from significant missingness, posing a substantial challenge for self-supervised learning (SSL) models that typically assume complete data inputs. This paper introduces the second generation of Large Sensor Model (LSM-2) with Adaptive and Inherited Masking (AIM), a novel SSL approach that learns robust representations directly from incomplete data without requiring explicit imputation. AIM's core novelty lies in its use of learnable mask tokens to model both existing ("inherited") and artificially introduced missingness, enabling it to robustly handle fragmented real-world data during inference. Pre-trained on an extensive dataset of 40M hours of day-long multimodal sensor data, our LSM-2 with AIM achieves the best performance across a diverse range of tasks, including classification, regression and generative modeling. Furthermore, LSM-2 with AIM exhibits superior scaling performance, and critically, maintains high performance even under targeted missingness scenarios, reflecting clinically coherent patterns, such as the diagnostic value of nighttime biosignals for hypertension prediction. This makes AIM a more reliable choice for real-world wearable data applications. △ Less

Submitted 5 June, 2025; originally announced June 2025.

Comments: Xu and Narayanswamy are co-first authors. McDuff and Liu are co-last authors

arXiv:2506.03910 [pdf, ps, other]

Enhancing Experimental Efficiency in Materials Design: A Comparative Study of Taguchi and Machine Learning Methods

Authors: Shyam Prabhu, P Akshay Kumar, Antov Selwinston, Pavan Taduvai, Shreya Bairi, Rohit Batra

Abstract: Materials design problems often require optimizing multiple variables, rendering full factorial exploration impractical. Design of experiment (DOE) methods, such as Taguchi technique, are commonly used to efficiently sample the design space but they inherently lack the ability to capture non-linear dependency of process variables. In this work, we demonstrate how machine learning (ML) methods can… ▽ More Materials design problems often require optimizing multiple variables, rendering full factorial exploration impractical. Design of experiment (DOE) methods, such as Taguchi technique, are commonly used to efficiently sample the design space but they inherently lack the ability to capture non-linear dependency of process variables. In this work, we demonstrate how machine learning (ML) methods can be used to overcome these limitations. We compare the performance of Taguchi method against an active learning based Gaussian process regression (GPR) model in a wire arc additive manufacturing (WAAM) process to accurately predict aspects of bead geometry, including penetration depth, bead width, and height. While Taguchi method utilized a three-factor, five-level L25 orthogonal array to suggest weld parameters, the GPR model used an uncertainty-based exploration acquisition function coupled with latin hypercube sampling for initial training data. Accuracy and efficiency of both models was evaluated on 15 test cases, with GPR outperforming Taguchi in both metrics. This work applies to broader materials processing domain requiring efficient exploration of complex parameters. △ Less

Submitted 4 June, 2025; originally announced June 2025.

Comments: 7 pages, 3 figures

arXiv:2505.24108 [pdf, ps, other]

Federated Foundation Model for GI Endoscopy Images

Authors: Alina Devkota, Annahita Amireskandari, Joel Palko, Shyam Thakkar, Donald Adjeroh, Xiajun Jiang, Binod Bhattarai, Prashnna K. Gyawali

Abstract: Gastrointestinal (GI) endoscopy is essential in identifying GI tract abnormalities in order to detect diseases in their early stages and improve patient outcomes. Although deep learning has shown success in supporting GI diagnostics and decision-making, these models require curated datasets with labels that are expensive to acquire. Foundation models offer a promising solution by learning general-… ▽ More Gastrointestinal (GI) endoscopy is essential in identifying GI tract abnormalities in order to detect diseases in their early stages and improve patient outcomes. Although deep learning has shown success in supporting GI diagnostics and decision-making, these models require curated datasets with labels that are expensive to acquire. Foundation models offer a promising solution by learning general-purpose representations, which can be finetuned for specific tasks, overcoming data scarcity. Developing foundation models for medical imaging holds significant potential, but the sensitive and protected nature of medical data presents unique challenges. Foundation model training typically requires extensive datasets, and while hospitals generate large volumes of data, privacy restrictions prevent direct data sharing, making foundation model training infeasible in most scenarios. In this work, we propose a FL framework for training foundation models for gastroendoscopy imaging, enabling data to remain within local hospital environments while contributing to a shared model. We explore several established FL algorithms, assessing their suitability for training foundation models without relying on task-specific labels, conducting experiments in both homogeneous and heterogeneous settings. We evaluate the trained foundation model on three critical downstream tasks--classification, detection, and segmentation--and demonstrate that it achieves improved performance across all tasks, highlighting the effectiveness of our approach in a federated, privacy-preserving setting. △ Less

Submitted 5 June, 2025; v1 submitted 29 May, 2025; originally announced May 2025.

Comments: 11 pages, 11 figures, submitted to BHI2025

ACM Class: I.2.10; I.4; I.5

arXiv:2505.09610 [pdf, ps, other]

Customizing a Large Language Model for VHDL Design of High-Performance Microprocessors

Authors: Nicolas Dupuis, Ravi Nair, Shyam Ramji, Sean McClintock, Nishant Chauhan, Priyanka Nagpal, Bart Blaner, Ken Valk, Leon Stok, Ruchir Puri

Abstract: The use of Large Language Models (LLMs) in hardware design has taken off in recent years, principally through its incorporation in tools that increase chip designer productivity. There has been considerable discussion about the use of LLMs in RTL specifications of chip designs, for which the two most popular languages are Verilog and VHDL. LLMs and their use in Verilog design has received signific… ▽ More The use of Large Language Models (LLMs) in hardware design has taken off in recent years, principally through its incorporation in tools that increase chip designer productivity. There has been considerable discussion about the use of LLMs in RTL specifications of chip designs, for which the two most popular languages are Verilog and VHDL. LLMs and their use in Verilog design has received significant attention due to the higher popularity of the language, but little attention so far has been given to VHDL despite its continued popularity in the industry. There has also been little discussion about the unique needs of organizations that engage in high-performance processor design, and techniques to deploy AI solutions in these settings. In this paper, we describe our journey in developing a Large Language Model (LLM) specifically for the purpose of explaining VHDL code, a task that has particular importance in an organization with decades of experience and assets in high-performance processor design. We show how we developed test sets specific to our needs and used them for evaluating models as we performed extended pretraining (EPT) of a base LLM. Expert evaluation of the code explanations produced by the EPT model increased to 69% compared to a base model rating of 43%. We further show how we developed an LLM-as-a-judge to gauge models similar to expert evaluators. This led us to deriving and evaluating a host of new models, including an instruction-tuned version of the EPT model with an expected expert evaluator rating of 71%. Our experiments also indicate that with the potential use of newer base models, this rating can be pushed to 85% and beyond. We conclude with a discussion on further improving the quality of hardware design LLMs using exciting new developments in the Generative AI world. △ Less

Submitted 14 May, 2025; originally announced May 2025.

arXiv:2504.20435 [pdf, other]

AI Assisted Cervical Cancer Screening for Cytology Samples in Developing Countries

Authors: Love Panta, Suraj Prasai, Karishma Malla Vaidya, Shyam Shrestha, Suresh Manandhar

Abstract: Cervical cancer remains a significant health challenge, with high incidence and mortality rates, particularly in transitioning countries. Conventional Liquid-Based Cytology(LBC) is a labor-intensive process, requires expert pathologists and is highly prone to errors, highlighting the need for more efficient screening methods. This paper introduces an innovative approach that integrates low-cost bi… ▽ More Cervical cancer remains a significant health challenge, with high incidence and mortality rates, particularly in transitioning countries. Conventional Liquid-Based Cytology(LBC) is a labor-intensive process, requires expert pathologists and is highly prone to errors, highlighting the need for more efficient screening methods. This paper introduces an innovative approach that integrates low-cost biological microscopes with our simple and efficient AI algorithms for automated whole-slide analysis. Our system uses a motorized microscope to capture cytology images, which are then processed through an AI pipeline involving image stitching, cell segmentation, and classification. We utilize the lightweight UNet-based model involving human-in-the-loop approach to train our segmentation model with minimal ROIs. CvT-based classification model, trained on the SIPaKMeD dataset, accurately categorizes five cell types. Our framework offers enhanced accuracy and efficiency in cervical cancer screening compared to various state-of-art methods, as demonstrated by different evaluation metrics. △ Less

Submitted 29 April, 2025; originally announced April 2025.

arXiv:2504.18715 [pdf, other]

Spatial Speech Translation: Translating Across Space With Binaural Hearables

Authors: Tuochao Chen, Qirui Wang, Runlin He, Shyam Gollakota

Abstract: Imagine being in a crowded space where people speak a different language and having hearables that transform the auditory space into your native language, while preserving the spatial cues for all speakers. We introduce spatial speech translation, a novel concept for hearables that translate speakers in the wearer's environment, while maintaining the direction and unique voice characteristics of e… ▽ More Imagine being in a crowded space where people speak a different language and having hearables that transform the auditory space into your native language, while preserving the spatial cues for all speakers. We introduce spatial speech translation, a novel concept for hearables that translate speakers in the wearer's environment, while maintaining the direction and unique voice characteristics of each speaker in the binaural output. To achieve this, we tackle several technical challenges spanning blind source separation, localization, real-time expressive translation, and binaural rendering to preserve the speaker directions in the translated audio, while achieving real-time inference on the Apple M2 silicon. Our proof-of-concept evaluation with a prototype binaural headset shows that, unlike existing models, which fail in the presence of interference, we achieve a BLEU score of up to 22.01 when translating between languages, despite strong interference from other speakers in the environment. User studies further confirm the system's effectiveness in spatially rendering the translated speech in previously unseen real-world reverberant environments. Taking a step back, this work marks the first step towards integrating spatial perception into speech translation. △ Less

Submitted 25 April, 2025; originally announced April 2025.

Comments: Accepted by CHI2025

arXiv:2504.04764 [pdf, other]

Enhancing Leaf Disease Classification Using GAT-GCN Hybrid Model

Authors: Shyam Sundhar, Riya Sharma, Priyansh Maheshwari, Suvidha Rupesh Kumar, T. Sunil Kumar

Abstract: Agriculture plays a critical role in the global economy, providing livelihoods and ensuring food security for billions. As innovative agricultural practices become more widespread, the risk of crop diseases has increased, highlighting the urgent need for efficient, low-intervention disease identification methods. This research presents a hybrid model combining Graph Attention Networks (GATs) and G… ▽ More Agriculture plays a critical role in the global economy, providing livelihoods and ensuring food security for billions. As innovative agricultural practices become more widespread, the risk of crop diseases has increased, highlighting the urgent need for efficient, low-intervention disease identification methods. This research presents a hybrid model combining Graph Attention Networks (GATs) and Graph Convolution Networks (GCNs) for leaf disease classification. GCNs have been widely used for learning from graph-structured data, and GATs enhance this by incorporating attention mechanisms to focus on the most important neighbors. The methodology integrates superpixel segmentation for efficient feature extraction, partitioning images into meaningful, homogeneous regions that better capture localized features. The authors have employed an edge augmentation technique to enhance the robustness of the model. The edge augmentation technique has introduced a significant degree of generalization in the detection capabilities of the model. To further optimize training, weight initialization techniques are applied. The hybrid model is evaluated against the individual performance of the GCN and GAT models and the hybrid model achieved a precision of 0.9822, recall of 0.9818, and F1-score of 0.9818 in apple leaf disease classification, a precision of 0.9746, recall of 0.9744, and F1-score of 0.9743 in potato leaf disease classification, and a precision of 0.8801, recall of 0.8801, and F1-score of 0.8799 in sugarcane leaf disease classification. These results demonstrate the robustness and performance of the model, suggesting its potential to support sustainable agricultural practices through precise and effective disease detection. This work is a small step towards reducing the loss of crops and hence supporting sustainable goals of zero hunger and life on land. △ Less

Submitted 7 April, 2025; originally announced April 2025.

arXiv:2503.21204 [pdf, other]

Dimensional optimization of single-DOF planar rigid link-flapping mechanisms for high lift and low power

Authors: Shyam Sunder Nishad, Anupam Saxena

Abstract: Rigid link flapping mechanisms remain the most practical choice for flapping wing micro-aerial vehicles (MAVs) to carry useful payloads and onboard batteries for free flight due to their long-term durability and reliability. However, to achieve high agility and maneuverability-like insects-MAVs with these mechanisms require significant weight reduction. One approach involves using single-DOF plana… ▽ More Rigid link flapping mechanisms remain the most practical choice for flapping wing micro-aerial vehicles (MAVs) to carry useful payloads and onboard batteries for free flight due to their long-term durability and reliability. However, to achieve high agility and maneuverability-like insects-MAVs with these mechanisms require significant weight reduction. One approach involves using single-DOF planar rigid linkages, which are rarely optimized dimensionally for high lift and low power so that smaller motors and batteries could be used. We integrated a mechanism simulator based on a quasistatic nonlinear finite element method with an unsteady vortex lattice method-based aerodynamic analysis tool within an optimization routine. We optimized three different mechanism topologies from the literature. As a result, significant power savings were observed up to 42% in some cases, due to increased amplitude and higher lift coefficients resulting from optimized asymmetric sweeping velocity profiles. We also conducted an uncertainty analysis that revealed the need for high manufacturing tolerances to ensure reliable mechanism performance. The presented unified computational tool also facilitates the optimal selection of MAV components based on the payload and flight time requirements. △ Less

Submitted 27 March, 2025; originally announced March 2025.

arXiv:2503.16448 [pdf, ps, other]

Towards Open Diversity-Aware Social Interactions

Authors: Loizos Michael, Ivano Bison, Matteo Busso, Luca Cernuzzi, Amalia De Götzen, Shyam Diwakar, Kobi Gal, Amarsanaa Ganbold, George Gaskell, Daniel Gatica-Perez, Jessica Heesen, Daniele Miorandi, Salvador Ruiz-Correa, Laura Schelenz, Avi Segal, Carles Sierra, Hao Xu, Fausto Giunchiglia

Abstract: Social Media and the Internet have catalyzed an unprecedented potential for exposure to human diversity in terms of demographics, talents, opinions, knowledge, and the like. However, this potential has not come with new, much needed, instruments and skills to harness it. This paper presents our work on promoting richer and deeper social relations through the design and development of the "Internet… ▽ More Social Media and the Internet have catalyzed an unprecedented potential for exposure to human diversity in terms of demographics, talents, opinions, knowledge, and the like. However, this potential has not come with new, much needed, instruments and skills to harness it. This paper presents our work on promoting richer and deeper social relations through the design and development of the "Internet of Us", an online platform that uses diversity-aware Artificial Intelligence to mediate and empower human social interactions. We discuss the multiple facets of diversity in social settings, the multidisciplinary work that is required to reap the benefits of diversity, and the vision for a diversity-aware hybrid human-AI society. △ Less

Submitted 17 February, 2025; originally announced March 2025.

arXiv:2503.08796 [pdf, other]

Robust Multi-Objective Controlled Decoding of Large Language Models

Authors: Seongho Son, William Bankes, Sangwoong Yoon, Shyam Sundhar Ramesh, Xiaohang Tang, Ilija Bogunovic

Abstract: Test-time alignment of Large Language Models (LLMs) to human preferences offers a flexible way to generate responses aligned to diverse objectives without extensive retraining of LLMs. Existing methods achieve alignment to multiple objectives simultaneously (e.g., instruction-following, helpfulness, conciseness) by optimizing their corresponding reward functions. However, they often rely on predef… ▽ More Test-time alignment of Large Language Models (LLMs) to human preferences offers a flexible way to generate responses aligned to diverse objectives without extensive retraining of LLMs. Existing methods achieve alignment to multiple objectives simultaneously (e.g., instruction-following, helpfulness, conciseness) by optimizing their corresponding reward functions. However, they often rely on predefined weights or optimize for averages, sacrificing one objective for another and leading to unbalanced outcomes. To address this, we introduce Robust Multi-Objective Decoding (RMOD), a novel inference-time algorithm that optimizes for improving worst-case rewards. RMOD formalizes the robust decoding problem as a maximin two-player game between reward weights and the sampling policy, solving for the Nash equilibrium. We show that the game reduces to a convex optimization problem to find the worst-case weights, while the best response policy can be computed analytically. We also introduce a practical RMOD variant designed for efficient decoding with contemporary LLMs, incurring minimal computational overhead compared to non-robust Multi-Objective Decoding (MOD) methods. Our experimental results showcase the effectiveness of RMOD in generating responses equitably aligned with diverse objectives, outperforming baselines up to 20%. △ Less

Submitted 11 March, 2025; originally announced March 2025.

Comments: 24 pages, 9 figures

arXiv:2503.08437 [pdf, other]

doi 10.1007/978-3-031-80139-6_3

ICPR 2024 Competition on Rider Intention Prediction

Authors: Shankar Gangisetty, Abdul Wasi, Shyam Nandan Rai, C. V. Jawahar, Sajay Raj, Manish Prajapati, Ayesha Choudhary, Aaryadev Chandra, Dev Chandan, Shireen Chand, Suvaditya Mukherjee

Abstract: The recent surge in the vehicle market has led to an alarming increase in road accidents. This underscores the critical importance of enhancing road safety measures, particularly for vulnerable road users like motorcyclists. Hence, we introduce the rider intention prediction (RIP) competition that aims to address challenges in rider safety by proactively predicting maneuvers before they occur, the… ▽ More The recent surge in the vehicle market has led to an alarming increase in road accidents. This underscores the critical importance of enhancing road safety measures, particularly for vulnerable road users like motorcyclists. Hence, we introduce the rider intention prediction (RIP) competition that aims to address challenges in rider safety by proactively predicting maneuvers before they occur, thereby strengthening rider safety. This capability enables the riders to react to the potential incorrect maneuvers flagged by advanced driver assistance systems (ADAS). We collect a new dataset, namely, rider action anticipation dataset (RAAD) for the competition consisting of two tasks: single-view RIP and multi-view RIP. The dataset incorporates a spectrum of traffic conditions and challenging navigational maneuvers on roads with varying lighting conditions. For the competition, we received seventy-five registrations and five team submissions for inference of which we compared the methods of the top three performing teams on both the RIP tasks: one state-space model (Mamba2) and two learning-based approaches (SVM and CNN-LSTM). The results indicate that the state-space model outperformed the other methods across the entire dataset, providing a balanced performance across maneuver classes. The SVM-based RIP method showed the second-best performance when using random sampling and SMOTE. However, the CNN-LSTM method underperformed, primarily due to class imbalance issues, particularly struggling with minority classes. This paper details the proposed RAAD dataset and provides a summary of the submissions for the RIP 2024 competition. △ Less

Submitted 11 March, 2025; originally announced March 2025.

arXiv:2503.04764 [pdf]

Artificial intelligence for objective assessment of acrobatic movements: How to apply machine learning for identifying tumbling elements in cheer sports

Authors: Sophia Wesely, Ella Hofer, Robin Curth, Shyam Paryani, Nicole Mills, Olaf Ueberschär, Julia Westermayr

Abstract: Over the past four decades, cheerleading has evolved from a sideline activity at major sporting events into a professional, competitive sport with growing global popularity. Evaluating tumbling elements in cheerleading relies on both objective measures and subjective judgments, such as difficulty and execution quality. However, the complexity of tumbling - encompassing team synchronicity, ground i… ▽ More Over the past four decades, cheerleading has evolved from a sideline activity at major sporting events into a professional, competitive sport with growing global popularity. Evaluating tumbling elements in cheerleading relies on both objective measures and subjective judgments, such as difficulty and execution quality. However, the complexity of tumbling - encompassing team synchronicity, ground interactions, choreography, and artistic expression - makes objective assessment challenging. Artificial intelligence (AI) has revolutionized various scientific fields and industries through precise data-driven analyses, yet their application in acrobatic sports remains limited despite significant potential for enhancing performance evaluation and coaching. This study investigates the feasibility of using an AI-based approach with data from a single inertial measurement unit to accurately identify and objectively assess tumbling elements in standard cheerleading routines. A sample of 16 participants (13 females, 3 males) from a Division I collegiate cheerleading team wore a single inertial measurement unit at the dorsal pelvis. Over a 4-week seasonal preparation period, 1102 tumbling elements were recorded during regular practice sessions. Using triaxial accelerations and rotational speeds, various ML algorithms were employed to classify and evaluate the execution of tumbling manoeuvres. Results indicate that certain machine learning models can effectively identify different tumbling elements despite inter-individual variability and data noise, achieving high accuracy. These findings demonstrate the significant potential for integrating AI-driven assessments into cheerleading and other acrobatic sports, providing objective metrics that complement traditional judging methods. △ Less

Submitted 11 February, 2025; originally announced March 2025.

Comments: 17 pages, 11 figures

arXiv:2502.10516 [pdf, ps, other]

A new lower bound for multi-color discrepancy with applications to fair division

Authors: Ioannis Caragiannis, Kasper Green Larsen, Sudarshan Shyam

Abstract: A classical problem in combinatorics seeks colorings of low discrepancy. More concretely, the goal is to color the elements of a set system so that the number of appearances of any color among the elements in each set is as balanced as possible. We present a new lower bound for multi-color discrepancy, showing that there is a set system with $n$ subsets over a set of elements in which any $k$-colo… ▽ More A classical problem in combinatorics seeks colorings of low discrepancy. More concretely, the goal is to color the elements of a set system so that the number of appearances of any color among the elements in each set is as balanced as possible. We present a new lower bound for multi-color discrepancy, showing that there is a set system with $n$ subsets over a set of elements in which any $k$-coloring of the elements has discrepancy at least $Ω\left(\sqrt{\frac{n}{\ln{k}}}\right)$. This result improves the previously best-known lower bound of $Ω\left(\sqrt{\frac{n}{k}}\right)$ of Doerr and Srivastav [2003] and may have several applications. Here, we explore its implications on the feasibility of fair division concepts for instances with $n$ agents having valuations for a set of indivisible items. The first such concept is known as consensus $1/k$-division up to $d$ items (\cd$d$) and aims to allocate the items into $k$ bundles so that no matter which bundle each agent is assigned to, the allocation is envy-free up to $d$ items. The above lower bound implies that \cd$d$ can be infeasible for $d\in Ω\left(\sqrt{\frac{n}{\ln{k}}}\right)$. We furthermore extend our proof technique to show that there exist instances of the problem of allocating indivisible items to $k$ groups of $n$ agents in total so that envy-freeness and proportionality up to $d$ items are infeasible for $d\in Ω\left(\sqrt{\frac{n}{k\ln{k}}}\right)$ and $d\in Ω\left(\sqrt{\frac{n}{k^3\ln{k}}}\right)$, respectively. The lower bounds for fair division improve the currently best-known ones by Manurangsi and Suksompong [2022]. △ Less

Submitted 18 February, 2025; v1 submitted 14 February, 2025; originally announced February 2025.

arXiv:2502.08073 [pdf]

Large language models perpetuate bias in palliative care: development and analysis of the Palliative Care Adversarial Dataset (PCAD)

Authors: Naomi Akhras, Fares Antaki, Fannie Mottet, Olivia Nguyen, Shyam Sawhney, Sabrina Bajwah, Joanna M Davies

Abstract: Bias and inequity in palliative care disproportionately affect marginalised groups. Large language models (LLMs), such as GPT-4o, hold potential to enhance care but risk perpetuating biases present in their training data. This study aimed to systematically evaluate whether GPT-4o propagates biases in palliative care responses using adversarially designed datasets. In July 2024, GPT-4o was probed u… ▽ More Bias and inequity in palliative care disproportionately affect marginalised groups. Large language models (LLMs), such as GPT-4o, hold potential to enhance care but risk perpetuating biases present in their training data. This study aimed to systematically evaluate whether GPT-4o propagates biases in palliative care responses using adversarially designed datasets. In July 2024, GPT-4o was probed using the Palliative Care Adversarial Dataset (PCAD), and responses were evaluated by three palliative care experts in Canada and the United Kingdom using validated bias rubrics. The PCAD comprised PCAD-Direct (100 adversarial questions) and PCAD-Counterfactual (84 paired scenarios). These datasets targeted four care dimensions (access to care, pain management, advance care planning, and place of death preferences) and three identity axes (ethnicity, age, and diagnosis). Bias was detected in a substantial proportion of responses. For adversarial questions, the pooled bias rate was 0.33 (95% confidence interval [CI]: 0.28, 0.38); "allows biased premise" was the most frequently identified source of bias (0.47; 95% CI: 0.39, 0.55), such as failing to challenge stereotypes. For counterfactual scenarios, the pooled bias rate was 0.26 (95% CI: 0.20, 0.31), with "potential for withholding" as the most frequently identified source of bias (0.25; 95% CI: 0.18, 0.34), such as withholding interventions based on identity. Bias rates were consistent across care dimensions and identity axes. GPT-4o perpetuates biases in palliative care, with implications for clinical decision-making and equity. The PCAD datasets provide novel tools to assess and address LLM bias in palliative care. △ Less

Submitted 11 February, 2025; originally announced February 2025.

Comments: The complete PCAD datasets are available on Figshare: dx.doi.org/10.6084/m9.figshare.28396016

arXiv:2502.03950 [pdf, other]

LR0.FM: Low-Res Benchmark and Improving Robustness for Zero-Shot Classification in Foundation Models

Authors: Priyank Pathak, Shyam Marjit, Shruti Vyas, Yogesh S Rawat

Abstract: Visual-language foundation Models (FMs) exhibit remarkable zero-shot generalization across diverse tasks, largely attributed to extensive pre-training on largescale datasets. However, their robustness on low-resolution/pixelated (LR) images, a common challenge in real-world scenarios, remains underexplored. We introduce LR0.FM, a comprehensive benchmark evaluating the impact of low resolution on t… ▽ More Visual-language foundation Models (FMs) exhibit remarkable zero-shot generalization across diverse tasks, largely attributed to extensive pre-training on largescale datasets. However, their robustness on low-resolution/pixelated (LR) images, a common challenge in real-world scenarios, remains underexplored. We introduce LR0.FM, a comprehensive benchmark evaluating the impact of low resolution on the zero-shot classification performance of 10 FM(s) across 66 backbones and 15 datasets. We propose a novel metric, Weighted Aggregated Robustness, to address the limitations of existing metrics and better evaluate model performance across resolutions and datasets. Our key findings show that: (i) model size positively correlates with robustness to resolution degradation, (ii) pre-training dataset quality is more important than its size, and (iii) fine-tuned and higher resolution models are less robust against LR. Our analysis further reveals that the model makes semantically reasonable predictions at LR, and the lack of fine-grained details in input adversely impacts the model's initial layers more than the deeper layers. We use these insights and introduce a simple strategy, LR-TK0, to enhance the robustness of models without compromising their pre-trained weights. We demonstrate the effectiveness of LR-TK0 for robustness against low-resolution across several datasets and its generalization capability across backbones and other approaches. Code is available at https://github.com/shyammarjit/LR0.FM △ Less

Submitted 18 May, 2025; v1 submitted 6 February, 2025; originally announced February 2025.

Comments: Accepted to ICLR 2025

arXiv:2502.03347 [pdf, other]

doi 10.1145/3712289

DiversityOne: A Multi-Country Smartphone Sensor Dataset for Everyday Life Behavior Modeling

Authors: Matteo Busso, Andrea Bontempelli, Leonardo Javier Malcotti, Lakmal Meegahapola, Peter Kun, Shyam Diwakar, Chaitanya Nutakki, Marcelo Dario Rodas Britez, Hao Xu, Donglei Song, Salvador Ruiz Correa, Andrea-Rebeca Mendoza-Lara, George Gaskell, Sally Stares, Miriam Bidoglia, Amarsanaa Ganbold, Altangerel Chagnaa, Luca Cernuzzi, Alethia Hume, Ronald Chenu-Abente, Roy Alia Asiku, Ivan Kayongo, Daniel Gatica-Perez, Amalia de Götzen, Ivano Bison , et al. (1 additional authors not shown)

Abstract: Understanding everyday life behavior of young adults through personal devices, e.g., smartphones and smartwatches, is key for various applications, from enhancing the user experience in mobile apps to enabling appropriate interventions in digital health apps. Towards this goal, previous studies have relied on datasets combining passive sensor data with human-provided annotations or self-reports. H… ▽ More Understanding everyday life behavior of young adults through personal devices, e.g., smartphones and smartwatches, is key for various applications, from enhancing the user experience in mobile apps to enabling appropriate interventions in digital health apps. Towards this goal, previous studies have relied on datasets combining passive sensor data with human-provided annotations or self-reports. However, many existing datasets are limited in scope, often focusing on specific countries primarily in the Global North, involving a small number of participants, or using a limited range of pre-processed sensors. These limitations restrict the ability to capture cross-country variations of human behavior, including the possibility of studying model generalization, and robustness. To address this gap, we introduce DiversityOne, a dataset which spans eight countries (China, Denmark, India, Italy, Mexico, Mongolia, Paraguay, and the United Kingdom) and includes data from 782 college students over four weeks. DiversityOne contains data from 26 smartphone sensor modalities and 350K+ self-reports. As of today, it is one of the largest and most diverse publicly available datasets, while featuring extensive demographic and psychosocial survey data. DiversityOne opens the possibility of studying important research problems in ubiquitous computing, particularly in domain adaptation and generalization across countries, all research areas so far largely underexplored because of the lack of adequate datasets. △ Less

Submitted 5 February, 2025; originally announced February 2025.

arXiv:2502.01208 [pdf, ps, other]

On Almost Surely Safe Alignment of Large Language Models at Inference-Time

Authors: Xiaotong Ji, Shyam Sundhar Ramesh, Matthieu Zimmer, Ilija Bogunovic, Jun Wang, Haitham Bou Ammar

Abstract: We introduce a novel inference-time alignment approach for LLMs that aims to generate safe responses almost surely, i.e., with probability approaching one. Our approach models the generation of safe responses as a constrained Markov Decision Process (MDP) within the LLM's latent space. We augment a safety state that tracks the evolution of safety constraints and dynamically penalize unsafe generat… ▽ More We introduce a novel inference-time alignment approach for LLMs that aims to generate safe responses almost surely, i.e., with probability approaching one. Our approach models the generation of safe responses as a constrained Markov Decision Process (MDP) within the LLM's latent space. We augment a safety state that tracks the evolution of safety constraints and dynamically penalize unsafe generations to ensure the generation of safe responses. Consequently, we demonstrate formal safety guarantees w.r.t. the given cost model upon solving the MDP in the latent space with sufficiently large penalties. Building on this foundation, we propose InferenceGuard, a practical implementation that safely aligns LLMs without modifying the model weights. Empirically, we demonstrate that InferenceGuard effectively balances safety and task performance, outperforming existing inference-time alignment methods in generating safe and aligned responses. Our findings contribute to the advancement of safer LLM deployment through alignment at inference-time, thus presenting a promising alternative to resource-intensive, overfitting-prone alignment techniques like RLHF. △ Less

Submitted 20 June, 2025; v1 submitted 3 February, 2025; originally announced February 2025.

arXiv:2501.05108 [pdf, other]

Optimizing Multitask Industrial Processes with Predictive Action Guidance

Authors: Naval Kishore Mehta, Arvind, Shyam Sunder Prasad, Sumeet Saurav, Sanjay Singh

Abstract: Monitoring complex assembly processes is critical for maintaining productivity and ensuring compliance with assembly standards. However, variability in human actions and subjective task preferences complicate accurate task anticipation and guidance. To address these challenges, we introduce the Multi-Modal Transformer Fusion and Recurrent Units (MMTFRU) Network for egocentric activity anticipation… ▽ More Monitoring complex assembly processes is critical for maintaining productivity and ensuring compliance with assembly standards. However, variability in human actions and subjective task preferences complicate accurate task anticipation and guidance. To address these challenges, we introduce the Multi-Modal Transformer Fusion and Recurrent Units (MMTFRU) Network for egocentric activity anticipation, utilizing multimodal fusion to improve prediction accuracy. Integrated with the Operator Action Monitoring Unit (OAMU), the system provides proactive operator guidance, preventing deviations in the assembly process. OAMU employs two strategies: (1) Top-5 MMTF-RU predictions, combined with a reference graph and an action dictionary, for next-step recommendations; and (2) Top-1 MMTF-RU predictions, integrated with a reference graph, for detecting sequence deviations and predicting anomaly scores via an entropy-informed confidence mechanism. We also introduce Time-Weighted Sequence Accuracy (TWSA) to evaluate operator efficiency and ensure timely task completion. Our approach is validated on the industrial Meccano dataset and the largescale EPIC-Kitchens-55 dataset, demonstrating its effectiveness in dynamic environments. △ Less

Submitted 9 January, 2025; originally announced January 2025.

arXiv:2501.00078 [pdf, other]

Human-like Bots for Tactical Shooters Using Compute-Efficient Sensors

Authors: Niels Justesen, Maria Kaselimi, Sam Snodgrass, Miruna Vozaru, Matthew Schlegel, Jonas Wingren, Gabriella A. B. Barros, Tobias Mahlmann, Shyam Sudhakaran, Wesley Kerr, Albert Wang, Christoffer Holmgård, Georgios N. Yannakakis, Sebastian Risi, Julian Togelius

Abstract: Artificial intelligence (AI) has enabled agents to master complex video games, from first-person shooters like Counter-Strike to real-time strategy games such as StarCraft II and racing games like Gran Turismo. While these achievements are notable, applying these AI methods in commercial video game production remains challenging due to computational constraints. In commercial scenarios, the majori… ▽ More Artificial intelligence (AI) has enabled agents to master complex video games, from first-person shooters like Counter-Strike to real-time strategy games such as StarCraft II and racing games like Gran Turismo. While these achievements are notable, applying these AI methods in commercial video game production remains challenging due to computational constraints. In commercial scenarios, the majority of computational resources are allocated to 3D rendering, leaving limited capacity for AI methods, which often demand high computational power, particularly those relying on pixel-based sensors. Moreover, the gaming industry prioritizes creating human-like behavior in AI agents to enhance player experience, unlike academic models that focus on maximizing game performance. This paper introduces a novel methodology for training neural networks via imitation learning to play a complex, commercial-standard, VALORANT-like 2v2 tactical shooter game, requiring only modest CPU hardware during inference. Our approach leverages an innovative, pixel-free perception architecture using a small set of ray-cast sensors, which capture essential spatial information efficiently. These sensors allow AI to perform competently without the computational overhead of traditional methods. Models are trained to mimic human behavior using supervised learning on human trajectory data, resulting in realistic and engaging AI agents. Human evaluation tests confirm that our AI agents provide human-like gameplay experiences while operating efficiently under computational constraints. This offers a significant advancement in AI model development for tactical shooter games and possibly other genres. △ Less

Submitted 30 December, 2024; originally announced January 2025.

arXiv:2412.18200 [pdf, other]

Adapting Large Language Models for Improving TCP Fairness over WiFi

Authors: Shyam Kumar Shrestha, Shiva Raj Pokhrel, Jonathan Kua

Abstract: The new transmission control protocol (TCP) relies on Deep Learning (DL) for prediction and optimization, but requires significant manual effort to design deep neural networks (DNNs) and struggles with generalization in dynamic environments. Inspired by the success of large language models (LLMs), this study proposes TCP-LLM, a novel framework leveraging LLMs for TCP applications. TCP-LLM utilizes… ▽ More The new transmission control protocol (TCP) relies on Deep Learning (DL) for prediction and optimization, but requires significant manual effort to design deep neural networks (DNNs) and struggles with generalization in dynamic environments. Inspired by the success of large language models (LLMs), this study proposes TCP-LLM, a novel framework leveraging LLMs for TCP applications. TCP-LLM utilizes pre-trained knowledge to reduce engineering effort, enhance generalization, and deliver superior performance across diverse TCP tasks. Applied to reducing flow unfairness, adapting congestion control, and preventing starvation, TCP-LLM demonstrates significant improvements over TCP with minimal fine-tuning. △ Less

Submitted 24 December, 2024; originally announced December 2024.

arXiv:2412.16482 [pdf, other]

Learn2Mix: Training Neural Networks Using Adaptive Data Integration

Authors: Shyam Venkatasubramanian, Vahid Tarokh

Abstract: Accelerating model convergence within resource-constrained environments is critical to ensure fast and efficient neural network training. This work presents learn2mix, a novel training strategy that adaptively adjusts class proportions within batches, focusing on classes with higher error rates. Unlike classical training methods that use static class proportions, learn2mix continually adapts class… ▽ More Accelerating model convergence within resource-constrained environments is critical to ensure fast and efficient neural network training. This work presents learn2mix, a novel training strategy that adaptively adjusts class proportions within batches, focusing on classes with higher error rates. Unlike classical training methods that use static class proportions, learn2mix continually adapts class proportions during training, leading to faster convergence. Empirical evaluations conducted on benchmark datasets show that neural networks trained with learn2mix converge faster than those trained with existing approaches, achieving improved results for classification, regression, and reconstruction tasks under limited training resources and with imbalanced classes. Our empirical findings are supported by theoretical analysis. △ Less

Submitted 13 February, 2025; v1 submitted 20 December, 2024; originally announced December 2024.

arXiv:2411.18765 [pdf, ps, other]

Near-Optimal Trace Reconstruction for Mildly Separated Strings

Authors: Anders Aamand, Allen Liu, Shyam Narayanan

Abstract: In the trace reconstruction problem our goal is to learn an unknown string $x\in \{0,1\}^n$ given independent traces of $x$. A trace is obtained by independently deleting each bit of $x$ with some probability $δ$ and concatenating the remaining bits. It is a major open question whether the trace reconstruction problem can be solved with a polynomial number of traces when the deletion probability… ▽ More In the trace reconstruction problem our goal is to learn an unknown string $x\in \{0,1\}^n$ given independent traces of $x$. A trace is obtained by independently deleting each bit of $x$ with some probability $δ$ and concatenating the remaining bits. It is a major open question whether the trace reconstruction problem can be solved with a polynomial number of traces when the deletion probability $δ$ is constant. The best known upper bound and lower bounds are respectively $\exp(\tilde O(n^{1/5}))$ and $\tilde Ω(n^{3/2})$ both by Chase [Cha21b,Cha21a]. Our main result is that if the string $x$ is mildly separated, meaning that the number of zeros between any two ones in $x$ is at least polylog$n$, and if $δ$ is a sufficiently small constant, then the trace reconstruction problem can be solved with $O(n \log n)$ traces and in polynomial time. △ Less

Submitted 27 November, 2024; originally announced November 2024.

arXiv:2411.15242 [pdf, other]

The Zamba2 Suite: Technical Report

Authors: Paolo Glorioso, Quentin Anthony, Yury Tokpanov, Anna Golubeva, Vasudev Shyam, James Whittington, Jonathan Pilault, Beren Millidge

Abstract: In this technical report, we present the Zamba2 series -- a suite of 1.2B, 2.7B, and 7.4B parameter hybrid Mamba2-transformer models that achieve state of the art performance against the leading open-weights models of their class, while achieving substantial gains in inference latency, throughput, and memory efficiency. The Zamba2 series builds upon our initial work with Zamba1-7B, optimizing its… ▽ More In this technical report, we present the Zamba2 series -- a suite of 1.2B, 2.7B, and 7.4B parameter hybrid Mamba2-transformer models that achieve state of the art performance against the leading open-weights models of their class, while achieving substantial gains in inference latency, throughput, and memory efficiency. The Zamba2 series builds upon our initial work with Zamba1-7B, optimizing its architecture, training and annealing datasets, and training for up to three trillion tokens. We provide open-source weights for all models of the Zamba2 series as well as instruction-tuned variants that are strongly competitive against comparable instruct-tuned models of their class. We additionally open-source the pretraining dataset, which we call Zyda-2, used to train the Zamba2 series of models. The models and datasets used in this work are openly available at https://huggingface.co/Zyphra △ Less

Submitted 21 November, 2024; originally announced November 2024.

Comments: 21/11/24 initial upload

arXiv:2411.02298 [pdf, ps, other]

Sample-Efficient Private Learning of Mixtures of Gaussians

Authors: Hassan Ashtiani, Mahbod Majid, Shyam Narayanan

Abstract: We study the problem of learning mixtures of Gaussians with approximate differential privacy. We prove that roughly $kd^2 + k^{1.5} d^{1.75} + k^2 d$ samples suffice to learn a mixture of $k$ arbitrary $d$-dimensional Gaussians up to low total variation distance, with differential privacy. Our work improves over the previous best result [AAL24b] (which required roughly $k^2 d^4$ samples) and is pr… ▽ More We study the problem of learning mixtures of Gaussians with approximate differential privacy. We prove that roughly $kd^2 + k^{1.5} d^{1.75} + k^2 d$ samples suffice to learn a mixture of $k$ arbitrary $d$-dimensional Gaussians up to low total variation distance, with differential privacy. Our work improves over the previous best result [AAL24b] (which required roughly $k^2 d^4$ samples) and is provably optimal when $d$ is much larger than $k^2$. Moreover, we give the first optimal bound for privately learning mixtures of $k$ univariate (i.e., $1$-dimensional) Gaussians. Importantly, we show that the sample complexity for privately learning mixtures of univariate Gaussians is linear in the number of components $k$, whereas the previous best sample complexity [AAL21] was quadratic in $k$. Our algorithms utilize various techniques, including the inverse sensitivity mechanism [AD20b, AD20a, HKMN23], sample compression for distributions [ABDH+20], and methods for bounding volumes of sumsets. △ Less

Submitted 4 November, 2024; originally announced November 2024.

Comments: 52 pages. To appear in Neural Information Processing Systems (NeurIPS), 2024

arXiv:2410.23087 [pdf, other]

Statistical-Computational Trade-offs for Density Estimation

Authors: Anders Aamand, Alexandr Andoni, Justin Y. Chen, Piotr Indyk, Shyam Narayanan, Sandeep Silwal, Haike Xu

Abstract: We study the density estimation problem defined as follows: given $k$ distributions $p_1, \ldots, p_k$ over a discrete domain $[n]$, as well as a collection of samples chosen from a ``query'' distribution $q$ over $[n]$, output $p_i$ that is ``close'' to $q$. Recently~\cite{aamand2023data} gave the first and only known result that achieves sublinear bounds in {\em both} the sampling complexity and… ▽ More We study the density estimation problem defined as follows: given $k$ distributions $p_1, \ldots, p_k$ over a discrete domain $[n]$, as well as a collection of samples chosen from a ``query'' distribution $q$ over $[n]$, output $p_i$ that is ``close'' to $q$. Recently~\cite{aamand2023data} gave the first and only known result that achieves sublinear bounds in {\em both} the sampling complexity and the query time while preserving polynomial data structure space. However, their improvement over linear samples and time is only by subpolynomial factors. Our main result is a lower bound showing that, for a broad class of data structures, their bounds cannot be significantly improved. In particular, if an algorithm uses $O(n/\log^c k)$ samples for some constant $c>0$ and polynomial space, then the query time of the data structure must be at least $k^{1-O(1)/\log \log k}$, i.e., close to linear in the number of distributions $k$. This is a novel \emph{statistical-computational} trade-off for density estimation, demonstrating that any data structure must use close to a linear number of samples or take close to linear query time. The lower bound holds even in the realizable case where $q=p_i$ for some $i$, and when the distributions are flat (specifically, all distributions are uniform over half of the domain $[n]$). We also give a simple data structure for our lower bound instance with asymptotically matching upper bounds. Experiments show that the data structure is quite efficient in practice. △ Less

Submitted 30 October, 2024; originally announced October 2024.

Comments: To appear at NeurIPS 2024

arXiv:2410.19250 [pdf, other]

Have LLMs Reopened the Pandora's Box of AI-Generated Fake News?

Authors: Xinyu Wang, Wenbo Zhang, Sai Koneru, Hangzhi Guo, Bonam Mingole, S. Shyam Sundar, Sarah Rajtmajer, Amulya Yadav

Abstract: With the rise of AI-generated content spewed at scale from large language models (LLMs), genuine concerns about the spread of fake news have intensified. The perceived ability of LLMs to produce convincing fake news at scale poses new challenges for both human and automated fake news detection systems. To address this gap, this paper presents the findings from a university-level competition that a… ▽ More With the rise of AI-generated content spewed at scale from large language models (LLMs), genuine concerns about the spread of fake news have intensified. The perceived ability of LLMs to produce convincing fake news at scale poses new challenges for both human and automated fake news detection systems. To address this gap, this paper presents the findings from a university-level competition that aimed to explore how LLMs can be used by humans to create fake news, and to assess the ability of human annotators and AI models to detect it. A total of 110 participants used LLMs to create 252 unique fake news stories, and 84 annotators participated in the detection tasks. Our findings indicate that LLMs are ~68% more effective at detecting real news than humans. However, for fake news detection, the performance of LLMs and humans remains comparable (~60% accuracy). Additionally, we examine the impact of visual elements (e.g., pictures) in news on the accuracy of detecting fake news stories. Finally, we also examine various strategies used by fake news creators to enhance the credibility of their AI-generated content. This work highlights the increasing complexity of detecting AI-generated fake news, particularly in collaborative human-AI settings. △ Less

Submitted 29 March, 2025; v1 submitted 24 October, 2024; originally announced October 2024.

arXiv:2410.17863 [pdf, other]

CASCRNet: An Atrous Spatial Pyramid Pooling and Shared Channel Residual based Network for Capsule Endoscopy

Authors: K V Srinanda, M Manvith Prabhu, Shyam Lal

Abstract: This manuscript summarizes work on the Capsule Vision Challenge 2024 by MISAHUB. To address the multi-class disease classification task, which is challenging due to the complexity and imbalance in the Capsule Vision challenge dataset, this paper proposes CASCRNet (Capsule endoscopy-Aspp-SCR-Network), a parameter-efficient and novel model that uses Shared Channel Residual (SCR) blocks and Atrous Sp… ▽ More This manuscript summarizes work on the Capsule Vision Challenge 2024 by MISAHUB. To address the multi-class disease classification task, which is challenging due to the complexity and imbalance in the Capsule Vision challenge dataset, this paper proposes CASCRNet (Capsule endoscopy-Aspp-SCR-Network), a parameter-efficient and novel model that uses Shared Channel Residual (SCR) blocks and Atrous Spatial Pyramid Pooling (ASPP) blocks. Further, the performance of the proposed model is compared with other well-known approaches. The experimental results yield that proposed model provides better disease classification results. The proposed model was successful in classifying diseases with an F1 Score of 78.5% and a Mean AUC of 98.3%, which is promising given its compact architecture. △ Less

Submitted 27 November, 2024; v1 submitted 23 October, 2024; originally announced October 2024.

Comments: 8 pages, 4 figures

arXiv:2410.15467 [pdf, other]

Hey GPT, Can You be More Racist? Analysis from Crowdsourced Attempts to Elicit Biased Content from Generative AI

Authors: Hangzhi Guo, Pranav Narayanan Venkit, Eunchae Jang, Mukund Srinath, Wenbo Zhang, Bonam Mingole, Vipul Gupta, Kush R. Varshney, S. Shyam Sundar, Amulya Yadav

Abstract: The widespread adoption of large language models (LLMs) and generative AI (GenAI) tools across diverse applications has amplified the importance of addressing societal biases inherent within these technologies. While the NLP community has extensively studied LLM bias, research investigating how non-expert users perceive and interact with biases from these systems remains limited. As these technolo… ▽ More The widespread adoption of large language models (LLMs) and generative AI (GenAI) tools across diverse applications has amplified the importance of addressing societal biases inherent within these technologies. While the NLP community has extensively studied LLM bias, research investigating how non-expert users perceive and interact with biases from these systems remains limited. As these technologies become increasingly prevalent, understanding this question is crucial to inform model developers in their efforts to mitigate bias. To address this gap, this work presents the findings from a university-level competition, which challenged participants to design prompts for eliciting biased outputs from GenAI tools. We quantitatively and qualitatively analyze the competition submissions and identify a diverse set of biases in GenAI and strategies employed by participants to induce bias in GenAI. Our finding provides unique insights into how non-expert users perceive and interact with biases from GenAI tools. △ Less

Submitted 20 October, 2024; originally announced October 2024.

arXiv:2410.14643 [pdf, other]

Instance-Optimality in I/O-Efficient Sampling and Sequential Estimation

Authors: Shyam Narayanan, Václav Rozhoň, Jakub Tětek, Mikkel Thorup

Abstract: Suppose we have a memory storing $0$s and $1$s and we want to estimate the frequency of $1$s by sampling. We want to do this I/O-efficiently, exploiting that each read gives a block of $B$ bits at unit cost; not just one bit. If the input consists of uniform blocks: either all 1s or all 0s, then sampling a whole block at a time does not reduce the number of samples needed for estimation. On the ot… ▽ More Suppose we have a memory storing $0$s and $1$s and we want to estimate the frequency of $1$s by sampling. We want to do this I/O-efficiently, exploiting that each read gives a block of $B$ bits at unit cost; not just one bit. If the input consists of uniform blocks: either all 1s or all 0s, then sampling a whole block at a time does not reduce the number of samples needed for estimation. On the other hand, if bits are randomly permuted, then getting a block of $B$ bits is as good as getting $B$ independent bit samples. However, we do not want to make any such assumptions on the input. Instead, our goal is to have an algorithm with instance-dependent performance guarantees which stops sampling blocks as soon as we know that we have a probabilistically reliable estimate. We prove our algorithms to be instance-optimal among algorithms oblivious to the order of the blocks, which we argue is the strongest form of instance optimality we can hope for. We also present similar results for I/O-efficiently estimating mean with both additive and multiplicative error, estimating histograms, quantiles, as well as the empirical cumulative distribution function. We obtain our above results on I/O-efficient sampling by reducing to corresponding problems in the so-called sequential estimation. In this setting, one samples from an unknown distribution until one can provide an estimate with some desired error probability. We then provide non-parametric instance-optimal results for several fundamental problems: mean and quantile estimation, as well as learning mixture distributions with respect to $\ell_\infty$ and the so-called Kolmogorov-Smirnov distance. △ Less

Submitted 18 October, 2024; originally announced October 2024.

Comments: To appear at FOCS 2024

arXiv:2410.13638 [pdf, other]

Scaling Wearable Foundation Models

Authors: Girish Narayanswamy, Xin Liu, Kumar Ayush, Yuzhe Yang, Xuhai Xu, Shun Liao, Jake Garrison, Shyam Tailor, Jake Sunshine, Yun Liu, Tim Althoff, Shrikanth Narayanan, Pushmeet Kohli, Jiening Zhan, Mark Malhotra, Shwetak Patel, Samy Abdel-Ghaffar, Daniel McDuff

Abstract: Wearable sensors have become ubiquitous thanks to a variety of health tracking features. The resulting continuous and longitudinal measurements from everyday life generate large volumes of data; however, making sense of these observations for scientific and actionable insights is non-trivial. Inspired by the empirical success of generative modeling, where large neural networks learn powerful repre… ▽ More Wearable sensors have become ubiquitous thanks to a variety of health tracking features. The resulting continuous and longitudinal measurements from everyday life generate large volumes of data; however, making sense of these observations for scientific and actionable insights is non-trivial. Inspired by the empirical success of generative modeling, where large neural networks learn powerful representations from vast amounts of text, image, video, or audio data, we investigate the scaling properties of sensor foundation models across compute, data, and model size. Using a dataset of up to 40 million hours of in-situ heart rate, heart rate variability, electrodermal activity, accelerometer, skin temperature, and altimeter per-minute data from over 165,000 people, we create LSM, a multimodal foundation model built on the largest wearable-signals dataset with the most extensive range of sensor modalities to date. Our results establish the scaling laws of LSM for tasks such as imputation, interpolation and extrapolation, both across time and sensor modalities. Moreover, we highlight how LSM enables sample-efficient downstream learning for tasks like exercise and activity recognition. △ Less

Submitted 17 October, 2024; originally announced October 2024.

arXiv:2410.12219 [pdf, other]

OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities

Authors: Lichang Chen, Hexiang Hu, Mingda Zhang, Yiwen Chen, Zifeng Wang, Yandong Li, Pranav Shyam, Tianyi Zhou, Heng Huang, Ming-Hsuan Yang, Boqing Gong

Abstract: We introduce OmnixR, an evaluation suite designed to benchmark SoTA Omni-modality Language Models, such as GPT-4o and Gemini. Evaluating OLMs, which integrate multiple modalities such as text, vision, and audio, presents unique challenges. Particularly, the user message might often consist of multiple modalities, such that OLMs have to establish holistic understanding and reasoning across modaliti… ▽ More We introduce OmnixR, an evaluation suite designed to benchmark SoTA Omni-modality Language Models, such as GPT-4o and Gemini. Evaluating OLMs, which integrate multiple modalities such as text, vision, and audio, presents unique challenges. Particularly, the user message might often consist of multiple modalities, such that OLMs have to establish holistic understanding and reasoning across modalities to accomplish the task. Existing benchmarks are limited to single modality or dual-modality tasks, overlooking comprehensive multi-modal assessments of model reasoning. To address this, OmnixR offers two evaluation variants: (1)synthetic subset: a synthetic dataset generated automatically by translating text into multiple modalities--audio, images, video, and hybrids (Omnify). (2)realistic subset: a real-world dataset, manually curated and annotated by experts, for evaluating cross-modal reasoning in natural settings. OmnixR presents a unique evaluation towards assessing OLMs over a diverse mix of modalities, such as a question that involves video, audio, and text, providing a rigorous cross-modal reasoning testbed unlike any existing benchmarks. Our experiments find that all state-of-the-art OLMs struggle with OmnixR questions that require integrating information from multiple modalities to answer. Further analysis highlights differences in reasoning behavior, underscoring the challenges of omni-modal AI alignment. △ Less

Submitted 16 October, 2024; originally announced October 2024.

Comments: 19 pages, 6 figures, 12 tables

arXiv:2410.05281 [pdf, other]

Micrometer: Micromechanics Transformer for Predicting Mechanical Responses of Heterogeneous Materials

Authors: Sifan Wang, Tong-Rui Liu, Shyam Sankaran, Paris Perdikaris

Abstract: Heterogeneous materials, crucial in various engineering applications, exhibit complex multiscale behavior, which challenges the effectiveness of traditional computational methods. In this work, we introduce the Micromechanics Transformer ({\em Micrometer}), an artificial intelligence (AI) framework for predicting the mechanical response of heterogeneous materials, bridging the gap between advanced… ▽ More Heterogeneous materials, crucial in various engineering applications, exhibit complex multiscale behavior, which challenges the effectiveness of traditional computational methods. In this work, we introduce the Micromechanics Transformer ({\em Micrometer}), an artificial intelligence (AI) framework for predicting the mechanical response of heterogeneous materials, bridging the gap between advanced data-driven methods and complex solid mechanics problems. Trained on a large-scale high-resolution dataset of 2D fiber-reinforced composites, Micrometer can achieve state-of-the-art performance in predicting microscale strain fields across a wide range of microstructures, material properties under any loading conditions and We demonstrate the accuracy and computational efficiency of Micrometer through applications in computational homogenization and multiscale modeling, where Micrometer achieves 1\% error in predicting macroscale stress fields while reducing computational time by up to two orders of magnitude compared to conventional numerical solvers. We further showcase the adaptability of the proposed model through transfer learning experiments on new materials with limited data, highlighting its potential to tackle diverse scenarios in mechanical analysis of solid materials. Our work represents a significant step towards AI-driven innovation in computational solid mechanics, addressing the limitations of traditional numerical methods and paving the way for more efficient simulations of heterogeneous materials across various industrial applications. △ Less

Submitted 23 September, 2024; originally announced October 2024.

Comments: 36 pages, 12 figures, 9 tables

arXiv:2410.05180 [pdf, other]

Mitigating the Risk of Health Inequity Exacerbated by Large Language Models

Authors: Yuelyu Ji, Wenhe Ma, Sonish Sivarajkumar, Hang Zhang, Eugene Mathew Sadhu, Zhuochun Li, Xizhi Wu, Shyam Visweswaran, Yanshan Wang

Abstract: Recent advancements in large language models have demonstrated their potential in numerous medical applications, particularly in automating clinical trial matching for translational research and enhancing medical question answering for clinical decision support. However, our study shows that incorporating non decisive sociodemographic factors such as race, sex, income level, LGBT+ status, homeless… ▽ More Recent advancements in large language models have demonstrated their potential in numerous medical applications, particularly in automating clinical trial matching for translational research and enhancing medical question answering for clinical decision support. However, our study shows that incorporating non decisive sociodemographic factors such as race, sex, income level, LGBT+ status, homelessness, illiteracy, disability, and unemployment into the input of LLMs can lead to incorrect and harmful outputs for these populations. These discrepancies risk exacerbating existing health disparities if LLMs are widely adopted in healthcare. To address this issue, we introduce EquityGuard, a novel framework designed to detect and mitigate the risk of health inequities in LLM based medical applications. Our evaluation demonstrates its efficacy in promoting equitable outcomes across diverse populations. △ Less

Submitted 14 October, 2024; v1 submitted 7 October, 2024; originally announced October 2024.

arXiv:2409.15255 [pdf, other]

ZeroSCD: Zero-Shot Street Scene Change Detection

Authors: Shyam Sundar Kannan, Byung-Cheol Min

Abstract: Scene Change Detection is a challenging task in computer vision and robotics that aims to identify differences between two images of the same scene captured at different times. Traditional change detection methods rely on training models that take these image pairs as input and estimate the changes, which requires large amounts of annotated data, a costly and time-consuming process. To overcome th… ▽ More Scene Change Detection is a challenging task in computer vision and robotics that aims to identify differences between two images of the same scene captured at different times. Traditional change detection methods rely on training models that take these image pairs as input and estimate the changes, which requires large amounts of annotated data, a costly and time-consuming process. To overcome this, we propose ZeroSCD, a zero-shot scene change detection framework that eliminates the need for training. ZeroSCD leverages pre-existing models for place recognition and semantic segmentation, utilizing their features and outputs to perform change detection. In this framework, features extracted from the place recognition model are used to estimate correspondences and detect changes between the two images. These are then combined with segmentation results from the semantic segmentation model to precisely delineate the boundaries of the detected changes. Extensive experiments on benchmark datasets demonstrate that ZeroSCD outperforms several state-of-the-art methods in change detection accuracy, despite not being trained on any of the benchmark datasets, proving its effectiveness and adaptability across different scenarios. △ Less

Submitted 23 September, 2024; originally announced September 2024.

arXiv:2409.12941 [pdf, other]

Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation

Authors: Satyapriya Krishna, Kalpesh Krishna, Anhad Mohananey, Steven Schwarcz, Adam Stambler, Shyam Upadhyay, Manaal Faruqui

Abstract: Large Language Models (LLMs) have demonstrated significant performance improvements across various cognitive tasks. An emerging application is using LLMs to enhance retrieval-augmented generation (RAG) capabilities. These systems require LLMs to understand user queries, retrieve relevant information, and synthesize coherent and accurate responses. Given the increasing real-world deployment of such… ▽ More Large Language Models (LLMs) have demonstrated significant performance improvements across various cognitive tasks. An emerging application is using LLMs to enhance retrieval-augmented generation (RAG) capabilities. These systems require LLMs to understand user queries, retrieve relevant information, and synthesize coherent and accurate responses. Given the increasing real-world deployment of such systems, comprehensive evaluation becomes crucial. To this end, we propose FRAMES (Factuality, Retrieval, And reasoning MEasurement Set), a high-quality evaluation dataset designed to test LLMs' ability to provide factual responses, assess retrieval capabilities, and evaluate the reasoning required to generate final answers. While previous work has provided datasets and benchmarks to evaluate these abilities in isolation, FRAMES offers a unified framework that provides a clearer picture of LLM performance in end-to-end RAG scenarios. Our dataset comprises challenging multi-hop questions that require the integration of information from multiple sources. We present baseline results demonstrating that even state-of-the-art LLMs struggle with this task, achieving 0.40 accuracy with no retrieval. The accuracy is significantly improved with our proposed multi-step retrieval pipeline, achieving an accuracy of 0.66 (>50% improvement). We hope our work will help bridge evaluation gaps and assist in developing more robust and capable RAG systems. △ Less

Submitted 24 January, 2025; v1 submitted 19 September, 2024; originally announced September 2024.

Comments: Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL), 2025

arXiv:2409.10075 [pdf, other]

Steinmetz Neural Networks for Complex-Valued Data

Authors: Shyam Venkatasubramanian, Ali Pezeshki, Vahid Tarokh

Abstract: We introduce a new approach to processing complex-valued data using DNNs consisting of parallel real-valued subnetworks with coupled outputs. Our proposed class of architectures, referred to as Steinmetz Neural Networks, incorporates multi-view learning to construct more interpretable representations in the latent space. Moreover, we present the Analytic Neural Network, which incorporates a consis… ▽ More We introduce a new approach to processing complex-valued data using DNNs consisting of parallel real-valued subnetworks with coupled outputs. Our proposed class of architectures, referred to as Steinmetz Neural Networks, incorporates multi-view learning to construct more interpretable representations in the latent space. Moreover, we present the Analytic Neural Network, which incorporates a consistency penalty that encourages analytic signal representations in the latent space of the Steinmetz neural network. This penalty enforces a deterministic and orthogonal relationship between the real and imaginary components. Using an information-theoretic construction, we demonstrate that the generalization gap upper bound posited by the analytic neural network is lower than that of the general class of Steinmetz neural networks. Our numerical experiments depict the improved performance and robustness to additive noise, afforded by our proposed networks on benchmark datasets and synthetic examples. △ Less

Submitted 13 February, 2025; v1 submitted 16 September, 2024; originally announced September 2024.

arXiv:2408.15878 [pdf, other]

A Non-Traditional Approach to Assisting Data Address Translation

Authors: Shyam Murthy, Gurindar S Sohi

Abstract: This paper proposes a novel way to assist conventional data address translation. The approach, PC-Indexed Data Address Translation (PCAX), uses the PC of a load instruction, and not a data virtual address, to obtain the page table entry (PTE) for the data accessed by a load instruction. PCAX is intended to be used for a small subset of the static loads in a program. We observe that: (i) a small su… ▽ More This paper proposes a novel way to assist conventional data address translation. The approach, PC-Indexed Data Address Translation (PCAX), uses the PC of a load instruction, and not a data virtual address, to obtain the page table entry (PTE) for the data accessed by a load instruction. PCAX is intended to be used for a small subset of the static loads in a program. We observe that: (i) a small subset of static loads is responsible for most of the misses in a data translation lookaside buffer (DTLB), and (ii) often a dynamic instance of a static load instruction accesses the same PTE as the last dynamic instance, and consider PCAX for this subset. With PCAX the effective miss rate of a conventional DTLB can be cut down by a factor of 2-3X in many cases, and even more in some cases. PCAX is also beneficial in reducing the number of secondary TLB (STLB) misses. Since the tables used for PCAX can be accessed alongside instruction fetch, they can be slow, yet frequently provide a valid PTE even before the data address calculation. This results in a performance improvement, and reduced data address translation energy, in most cases. △ Less

Submitted 28 August, 2024; originally announced August 2024.

arXiv:2408.10330 [pdf, other]

doi 10.1007/978-981-96-0695-5_12

Meta-Learning in Audio and Speech Processing: An End to End Comprehensive Review

Authors: Athul Raimon, Shubha Masti, Shyam K Sateesh, Siyani Vengatagiri, Bhaskarjyoti Das

Abstract: This survey overviews various meta-learning approaches used in audio and speech processing scenarios. Meta-learning is used where model performance needs to be maximized with minimum annotated samples, making it suitable for low-sample audio processing. Although the field has made some significant contributions, audio meta-learning still lacks the presence of comprehensive survey papers. We presen… ▽ More This survey overviews various meta-learning approaches used in audio and speech processing scenarios. Meta-learning is used where model performance needs to be maximized with minimum annotated samples, making it suitable for low-sample audio processing. Although the field has made some significant contributions, audio meta-learning still lacks the presence of comprehensive survey papers. We present a systematic review of meta-learning methodologies in audio processing. This includes audio-specific discussions on data augmentation, feature extraction, preprocessing techniques, meta-learners, task selection strategies and also presents important datasets in audio, together with crucial real-world use cases. Through this extensive review, we aim to provide valuable insights and identify future research directions in the intersection of meta-learning and audio processing. △ Less

Submitted 19 August, 2024; originally announced August 2024.

Comments: Survey Paper (15 pages, 1 figure)

arXiv:2408.10328 [pdf, other]

Decoding Human Emotions: Analyzing Multi-Channel EEG Data using LSTM Networks

Authors: Shyam K Sateesh, Sparsh BK, Uma D

Abstract: Emotion recognition from electroencephalogram (EEG) signals is a thriving field, particularly in neuroscience and Human-Computer Interaction (HCI). This study aims to understand and improve the predictive accuracy of emotional state classification through metrics such as valence, arousal, dominance, and likeness by applying a Long Short-Term Memory (LSTM) network to analyze EEG signals. Using a po… ▽ More Emotion recognition from electroencephalogram (EEG) signals is a thriving field, particularly in neuroscience and Human-Computer Interaction (HCI). This study aims to understand and improve the predictive accuracy of emotional state classification through metrics such as valence, arousal, dominance, and likeness by applying a Long Short-Term Memory (LSTM) network to analyze EEG signals. Using a popular dataset of multi-channel EEG recordings known as DEAP, we look towards leveraging LSTM networks' properties to handle temporal dependencies within EEG signal data. This allows for a more comprehensive understanding and classification of emotional parameter states. We obtain accuracies of 89.89%, 90.33%, 90.70%, and 90.54% for arousal, valence, dominance, and likeness, respectively, demonstrating significant improvements in emotion recognition model capabilities. This paper elucidates the methodology and architectural specifics of our LSTM model and provides a benchmark analysis with existing papers. △ Less

Submitted 19 August, 2024; originally announced August 2024.

Comments: 13 pages, 3 figures; accepted at ICDSA '24 Conference, Jaipur, India

arXiv:2408.07832 [pdf, ps, other]

LADDER: Language-Driven Slice Discovery and Error Rectification in Vision Classifiers

Authors: Shantanu Ghosh, Rayan Syed, Chenyu Wang, Vaibhav Choudhary, Binxu Li, Clare B. Poynton, Shyam Visweswaran, Kayhan Batmanghelich

Abstract: Error slice discovery is crucial to diagnose and mitigate model errors. Current clustering or discrete attribute-based slice discovery methods face key limitations: 1) clustering results in incoherent slices, while assigning discrete attributes to slices leads to incomplete coverage of error patterns due to missing or insufficient attributes; 2) these methods lack complex reasoning, preventing the… ▽ More Error slice discovery is crucial to diagnose and mitigate model errors. Current clustering or discrete attribute-based slice discovery methods face key limitations: 1) clustering results in incoherent slices, while assigning discrete attributes to slices leads to incomplete coverage of error patterns due to missing or insufficient attributes; 2) these methods lack complex reasoning, preventing them from fully explaining model biases; 3) they fail to integrate \textit{domain knowledge}, limiting their usage in specialized fields \eg radiology. We propose\ladder (\underline{La}nguage-\underline{D}riven \underline{D}iscovery and \underline{E}rror \underline{R}ectification), to address the limitations by: (1) leveraging the flexibility of natural language to address incompleteness, (2) employing LLM's latent \textit{domain knowledge} and advanced reasoning to analyze sentences and derive testable hypotheses directly, identifying biased attributes, and form coherent error slices without clustering. Existing mitigation methods typically address only the worst-performing group, often amplifying errors in other subgroups. In contrast,\ladder generates pseudo attributes from the discovered hypotheses to mitigate errors across all biases without explicit attribute annotations or prior knowledge of bias. Rigorous evaluations on 6 datasets spanning natural and medical images -- comparing 200+ classifiers with diverse architectures, pretraining strategies, and LLMs -- show that\ladder consistently outperforms existing baselines in discovering and mitigating biases. △ Less

Submitted 29 May, 2025; v1 submitted 31 July, 2024; originally announced August 2024.

Comments: ACL 2025 (Findings). Code: https://github.com/batmanlab/Ladder

arXiv:2408.05190 [pdf, other]

Parameterized Verification of Timed Networks with Clock Invariants

Authors: Étienne André, Swen Jacobs, Shyam Lal Karra, Ocan Sankur

Abstract: We consider parameterized verification problems for networks of timed automata (TAs) that communicate via disjunctive guards or lossy broadcast. To this end, we first consider disjunctive timed networks (DTNs), i.e., networks of TAs that communicate via location guards that enable a transition only if there is another process in a certain location. We solve for the first time the general case with… ▽ More We consider parameterized verification problems for networks of timed automata (TAs) that communicate via disjunctive guards or lossy broadcast. To this end, we first consider disjunctive timed networks (DTNs), i.e., networks of TAs that communicate via location guards that enable a transition only if there is another process in a certain location. We solve for the first time the general case with clock invariants, and establish the decidability of the parameterized verification problem for local trace properties and for reachability of global configurations; Moreover, we prove that, surprisingly and unlike in other settings, this model is equivalent to lossy broadcast networks. △ Less

Submitted 9 August, 2024; originally announced August 2024.

Comments: 31 pages, 7 figures

ACM Class: D.2.4; F.1.2

arXiv:2408.04093 [pdf, other]

Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters

Authors: Vasudev Shyam, Jonathan Pilault, Emily Shepperd, Quentin Anthony, Beren Millidge

Abstract: Our formulation reveals that the reduction across the sequence axis can be efficiently computed in parallel through a tree reduction. Our algorithm, called Tree Attention, for parallelizing exact attention computation across multiple GPUs enables cross-device decoding to be performed asymptotically faster (up to 8x faster in our experiments) than state-of-the-art approaches such as Ring Attention,… ▽ More Our formulation reveals that the reduction across the sequence axis can be efficiently computed in parallel through a tree reduction. Our algorithm, called Tree Attention, for parallelizing exact attention computation across multiple GPUs enables cross-device decoding to be performed asymptotically faster (up to 8x faster in our experiments) than state-of-the-art approaches such as Ring Attention, while also requiring significantly less communication volume and incurring 2x less peak memory. We demonstrate that Tree Attention speeds up decoding up to 4x on Llama 3.1-8B and can be applied to a variety of hardware and networking setups such as H100 DGX nodes, AMD MI300x nodes, and PCIe connected NVIDIA RTX 4090s. Our code is publicly available here: https://github.com/Zyphra/tree_attention △ Less

Submitted 9 February, 2025; v1 submitted 7 August, 2024; originally announced August 2024.

arXiv:2408.00695 [pdf, other]

Accelerating Full Waveform Inversion By Transfer Learning

Authors: Divya Shyam Singh, Leon Herrmann, Qing Sun, Tim Bürchner, Felix Dietrich, Stefan Kollmannsberger

Abstract: Full waveform inversion (FWI) is a powerful tool for reconstructing material fields based on sparsely measured data obtained by wave propagation. For specific problems, discretizing the material field with a neural network (NN) improves the robustness and reconstruction quality of the corresponding optimization problem. We call this method NN-based FWI. Starting from an initial guess, the weights… ▽ More Full waveform inversion (FWI) is a powerful tool for reconstructing material fields based on sparsely measured data obtained by wave propagation. For specific problems, discretizing the material field with a neural network (NN) improves the robustness and reconstruction quality of the corresponding optimization problem. We call this method NN-based FWI. Starting from an initial guess, the weights of the NN are iteratively updated to fit the simulated wave signals to the sparsely measured data set. For gradient-based optimization, a suitable choice of the initial guess, i.e., a suitable NN weight initialization, is crucial for fast and robust convergence. In this paper, we introduce a novel transfer learning approach to further improve NN-based FWI. This approach leverages supervised pretraining to provide a better NN weight initialization, leading to faster convergence of the subsequent optimization problem. Moreover, the inversions yield physically more meaningful local minima. The network is pretrained to predict the unknown material field using the gradient information from the first iteration of conventional FWI. In our computational experiments on two-dimensional domains, the training data set consists of reference simulations with arbitrarily positioned elliptical voids of different shapes and orientations. We compare the performance of the proposed transfer learning NN-based FWI with three other methods: conventional FWI, NN-based FWI without pretraining and conventional FWI with an initial guess predicted from the pretrained NN. Our results show that transfer learning NN-based FWI outperforms the other methods in terms of convergence speed and reconstruction quality. △ Less

Submitted 1 August, 2024; originally announced August 2024.

arXiv:2407.20284 [pdf]

MLtoGAI: Semantic Web based with Machine Learning for Enhanced Disease Prediction and Personalized Recommendations using Generative AI

Authors: Shyam Dongre, Ritesh Chandra, Sonali Agarwal

Abstract: In modern healthcare, addressing the complexities of accurate disease prediction and personalized recommendations is both crucial and challenging. This research introduces MLtoGAI, which integrates Semantic Web technology with Machine Learning (ML) to enhance disease prediction and offer user-friendly explanations through ChatGPT. The system comprises three key components: a reusable disease ontol… ▽ More In modern healthcare, addressing the complexities of accurate disease prediction and personalized recommendations is both crucial and challenging. This research introduces MLtoGAI, which integrates Semantic Web technology with Machine Learning (ML) to enhance disease prediction and offer user-friendly explanations through ChatGPT. The system comprises three key components: a reusable disease ontology that incorporates detailed knowledge about various diseases, a diagnostic classification model that uses patient symptoms to detect specific diseases accurately, and the integration of Semantic Web Rule Language (SWRL) with ontology and ChatGPT to generate clear, personalized health advice. This approach significantly improves prediction accuracy and ensures results that are easy to understand, addressing the complexity of diseases and diverse symptoms. The MLtoGAI system demonstrates substantial advancements in accuracy and user satisfaction, contributing to developing more intelligent and accessible healthcare solutions. This innovative approach combines the strengths of ML algorithms with the ability to provide transparent, human-understandable explanations through ChatGPT, achieving significant improvements in prediction accuracy and user comprehension. By leveraging semantic technology and explainable AI, the system enhances the accuracy of disease prediction and ensures that the recommendations are relevant and easily understood by individual patients. Our research highlights the potential of integrating advanced technologies to overcome existing challenges in medical diagnostics, paving the way for future developments in intelligent healthcare systems. Additionally, the system is validated using 200 synthetic patient data records, ensuring robust performance and reliability. △ Less

Submitted 26 July, 2024; originally announced July 2024.

arXiv:2407.17129 [pdf, other]

doi 10.1145/3630106.3658939

Mapping the individual, social, and biospheric impacts of Foundation Models

Authors: Andrés Domínguez Hernández, Shyam Krishna, Antonella Maia Perini, Michael Katell, SJ Bennett, Ann Borda, Youmna Hashem, Semeli Hadjiloizou, Sabeehah Mahomed, Smera Jayadeva, Mhairi Aitken, David Leslie

Abstract: Responding to the rapid roll-out and large-scale commercialization of foundation models, large language models, and generative AI, an emerging body of work is shedding light on the myriad impacts these technologies are having across society. Such research is expansive, ranging from the production of discriminatory, fake and toxic outputs, and privacy and copyright violations, to the unjust extract… ▽ More Responding to the rapid roll-out and large-scale commercialization of foundation models, large language models, and generative AI, an emerging body of work is shedding light on the myriad impacts these technologies are having across society. Such research is expansive, ranging from the production of discriminatory, fake and toxic outputs, and privacy and copyright violations, to the unjust extraction of labor and natural resources. The same has not been the case in some of the most prominent AI governance initiatives in the global north like the UK's AI Safety Summit and the G7's Hiroshima process, which have influenced much of the international dialogue around AI governance. Despite the wealth of cautionary tales and evidence of algorithmic harm, there has been an ongoing over-emphasis within the AI governance discourse on technical matters of safety and global catastrophic or existential risks. This narrowed focus has tended to draw attention away from very pressing social and ethical challenges posed by the current brute-force industrialization of AI applications. To address such a visibility gap between real-world consequences and speculative risks, this paper offers a critical framework to account for the social, political, and environmental dimensions of foundation models and generative AI. We identify 14 categories of risks and harms and map them according to their individual, social, and biospheric impacts. We argue that this novel typology offers an integrative perspective to address the most urgent negative impacts of foundation models and their downstream applications. We conclude with recommendations on how this typology could be used to inform technical and normative interventions to advance responsible AI. △ Less

Submitted 24 July, 2024; originally announced July 2024.

Comments: ACM Conference on Fairness, Accountability, and Transparency (FAccT '24). Association for Computing Machinery, New York, NY, USA, 776-796

Journal ref: In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT '24). Association for Computing Machinery, New York, NY, USA, 776-796

arXiv:2407.11240 [pdf, other]

Making New Connections: LLMs as Puzzle Generators for The New York Times' Connections Word Game

Authors: Tim Merino, Sam Earle, Ryan Sudhakaran, Shyam Sudhakaran, Julian Togelius

Abstract: The Connections puzzle is a word association game published daily by The New York Times (NYT). In this game, players are asked to find groups of four words that are connected by a common theme. While solving a given Connections puzzle requires both semantic knowledge and abstract reasoning, generating novel puzzles additionally requires a form of metacognition: generators must be able to accuratel… ▽ More The Connections puzzle is a word association game published daily by The New York Times (NYT). In this game, players are asked to find groups of four words that are connected by a common theme. While solving a given Connections puzzle requires both semantic knowledge and abstract reasoning, generating novel puzzles additionally requires a form of metacognition: generators must be able to accurately model the downstream reasoning of potential solvers. In this paper, we investigate the ability of the GPT family of Large Language Models (LLMs) to generate challenging and creative word games for human players. We start with an analysis of the word game Connections and the unique challenges it poses as a Procedural Content Generation (PCG) domain. We then propose a method for generating Connections puzzles using LLMs by adapting a Tree of Thoughts (ToT) prompting approach. We evaluate this method by conducting a user study, asking human players to compare AI-generated puzzles against published Connections puzzles. Our findings show that LLMs are capable puzzle creators, and can generate diverse sets of enjoyable, challenging, and creative Connections puzzles as judged by human users. △ Less

Submitted 15 July, 2024; originally announced July 2024.

Showing 1–50 of 267 results for author: Shyam