-
Security Degradation in Iterative AI Code Generation -- A Systematic Analysis of the Paradox
Authors:
Shivani Shukla,
Himanshu Joshi,
Romilla Syed
Abstract:
The rapid adoption of Large Language Models(LLMs) for code generation has transformed software development, yet little attention has been given to how security vulnerabilities evolve through iterative LLM feedback. This paper analyzes security degradation in AI-generated code through a controlled experiment with 400 code samples across 40 rounds of "improvements" using four distinct prompting stra…
▽ More
The rapid adoption of Large Language Models(LLMs) for code generation has transformed software development, yet little attention has been given to how security vulnerabilities evolve through iterative LLM feedback. This paper analyzes security degradation in AI-generated code through a controlled experiment with 400 code samples across 40 rounds of "improvements" using four distinct prompting strategies. Our findings show a 37.6% increase in critical vulnerabilities after just five iterations, with distinct vulnerability patterns emerging across different prompting approaches. This evidence challenges the assumption that iterative LLM refinement improves code security and highlights the essential role of human expertise in the loop. We propose practical guidelines for developers to mitigate these risks, emphasizing the need for robust human validation between LLM iterations to prevent the paradoxical introduction of new security issues during supposedly beneficial code "improvements".
△ Less
Submitted 19 May, 2025;
originally announced June 2025.
-
A Cytology Dataset for Early Detection of Oral Squamous Cell Carcinoma
Authors:
Garima Jain,
Sanghamitra Pati,
Mona Duggal,
Amit Sethi,
Abhijeet Patil,
Gururaj Malekar,
Nilesh Kowe,
Jitender Kumar,
Jatin Kashyap,
Divyajeet Rout,
Deepali,
Hitesh,
Nishi Halduniya,
Sharat Kumar,
Heena Tabassum,
Rupinder Singh Dhaliwal,
Sucheta Devi Khuraijam,
Sushma Khuraijam,
Sharmila Laishram,
Simmi Kharb,
Sunita Singh,
K. Swaminadtan,
Ranjana Solanki,
Deepika Hemranjani,
Shashank Nath Singh
, et al. (12 additional authors not shown)
Abstract:
Oral squamous cell carcinoma OSCC is a major global health burden, particularly in several regions across Asia, Africa, and South America, where it accounts for a significant proportion of cancer cases. Early detection dramatically improves outcomes, with stage I cancers achieving up to 90 percent survival. However, traditional diagnosis based on histopathology has limited accessibility in low-res…
▽ More
Oral squamous cell carcinoma OSCC is a major global health burden, particularly in several regions across Asia, Africa, and South America, where it accounts for a significant proportion of cancer cases. Early detection dramatically improves outcomes, with stage I cancers achieving up to 90 percent survival. However, traditional diagnosis based on histopathology has limited accessibility in low-resource settings because it is invasive, resource-intensive, and reliant on expert pathologists. On the other hand, oral cytology of brush biopsy offers a minimally invasive and lower cost alternative, provided that the remaining challenges, inter observer variability and unavailability of expert pathologists can be addressed using artificial intelligence. Development and validation of robust AI solutions requires access to large, labeled, and multi-source datasets to train high capacity models that generalize across domain shifts. We introduce the first large and multicenter oral cytology dataset, comprising annotated slides stained with Papanicolaou(PAP) and May-Grunwald-Giemsa(MGG) protocols, collected from ten tertiary medical centers in India. The dataset is labeled and annotated by expert pathologists for cellular anomaly classification and detection, is designed to advance AI driven diagnostic methods. By filling the gap in publicly available oral cytology datasets, this resource aims to enhance automated detection, reduce diagnostic errors, and improve early OSCC diagnosis in resource-constrained settings, ultimately contributing to reduced mortality and better patient outcomes worldwide.
△ Less
Submitted 11 June, 2025;
originally announced June 2025.
-
Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation
Authors:
Jaechul Roh,
Varun Gandhi,
Shivani Anilkumar,
Arin Garg
Abstract:
Large Language Models (LLMs) have achieved remarkable success in tasks requiring complex reasoning, such as code generation, mathematical problem solving, and algorithmic synthesis -- especially when aided by reasoning tokens and Chain-of-Thought prompting. Yet, a core question remains: do these models truly reason, or do they merely exploit shallow statistical patterns? In this paper, we introduc…
▽ More
Large Language Models (LLMs) have achieved remarkable success in tasks requiring complex reasoning, such as code generation, mathematical problem solving, and algorithmic synthesis -- especially when aided by reasoning tokens and Chain-of-Thought prompting. Yet, a core question remains: do these models truly reason, or do they merely exploit shallow statistical patterns? In this paper, we introduce Chain-of-Code Collapse, where we systematically investigate the robustness of reasoning LLMs by introducing a suite of semantically faithful yet adversarially structured prompt perturbations. Our evaluation -- spanning 700 perturbed code generations derived from LeetCode-style problems -- applies transformations such as storytelling reframing, irrelevant constraint injection, example reordering, and numeric perturbation. We observe that while certain modifications severely degrade performance (with accuracy drops up to -42.1%), others surprisingly improve model accuracy by up to 35.3%, suggesting sensitivity not only to semantics but also to surface-level prompt dynamics. These findings expose the fragility and unpredictability of current reasoning systems, underscoring the need for more principles approaches to reasoning alignments and prompting robustness. We release our perturbation datasets and evaluation framework to promote further research in trustworthy and resilient LLM reasoning.
△ Less
Submitted 12 June, 2025; v1 submitted 7 June, 2025;
originally announced June 2025.
-
On the Comprehensibility of Multi-structured Financial Documents using LLMs and Pre-processing Tools
Authors:
Shivani Upadhyay,
Messiah Ataey,
Shariyar Murtuza,
Yifan Nie,
Jimmy Lin
Abstract:
The proliferation of complex structured data in hybrid sources, such as PDF documents and web pages, presents unique challenges for current Large Language Models (LLMs) and Multi-modal Large Language Models (MLLMs) in providing accurate answers. Despite the recent advancements of MLLMs, they still often falter when interpreting intricately structured information, such as nested tables and multi-di…
▽ More
The proliferation of complex structured data in hybrid sources, such as PDF documents and web pages, presents unique challenges for current Large Language Models (LLMs) and Multi-modal Large Language Models (MLLMs) in providing accurate answers. Despite the recent advancements of MLLMs, they still often falter when interpreting intricately structured information, such as nested tables and multi-dimensional plots, leading to hallucinations and erroneous outputs. This paper explores the capabilities of LLMs and MLLMs in understanding and answering questions from complex data structures found in PDF documents by leveraging industrial and open-source tools as part of a pre-processing pipeline. Our findings indicate that GPT-4o, a popular MLLM, achieves an accuracy of 56% on multi-structured documents when fed documents directly, and that integrating pre-processing tools raises the accuracy of LLMs to 61.3% for GPT-4o and 76% for GPT-4, and with lower overall cost. The code is publicly available at https://github.com/OGCDS/FinancialQA.
△ Less
Submitted 5 June, 2025;
originally announced June 2025.
-
TerraIncognita: A Dynamic Benchmark for Species Discovery Using Frontier Models
Authors:
Shivani Chiranjeevi,
Hossein Zaremehrjerdi,
Zi K. Deng,
Talukder Z. Jubery,
Ari Grele,
Arti Singh,
Asheesh K Singh,
Soumik Sarkar,
Nirav Merchant,
Harold F. Greeney,
Baskar Ganapathysubramanian,
Chinmay Hegde
Abstract:
The rapid global loss of biodiversity, particularly among insects, represents an urgent ecological crisis. Current methods for insect species discovery are manual, slow, and severely constrained by taxonomic expertise, hindering timely conservation actions. We introduce TerraIncognita, a dynamic benchmark designed to evaluate state-of-the-art multimodal models for the challenging problem of identi…
▽ More
The rapid global loss of biodiversity, particularly among insects, represents an urgent ecological crisis. Current methods for insect species discovery are manual, slow, and severely constrained by taxonomic expertise, hindering timely conservation actions. We introduce TerraIncognita, a dynamic benchmark designed to evaluate state-of-the-art multimodal models for the challenging problem of identifying unknown, potentially undescribed insect species from image data. Our benchmark dataset combines a mix of expertly annotated images of insect species likely known to frontier AI models, and images of rare and poorly known species, for which few/no publicly available images exist. These images were collected from underexplored biodiversity hotspots, realistically mimicking open-world discovery scenarios faced by ecologists. The benchmark assesses models' proficiency in hierarchical taxonomic classification, their capability to detect and abstain from out-of-distribution (OOD) samples representing novel species, and their ability to generate explanations aligned with expert taxonomic knowledge. Notably, top-performing models achieve over 90\% F1 at the Order level on known species, but drop below 2\% at the Species level, highlighting the sharp difficulty gradient from coarse to fine taxonomic prediction (Order $\rightarrow$ Family $\rightarrow$ Genus $\rightarrow$ Species). TerraIncognita will be updated regularly, and by committing to quarterly dataset expansions (of both known and novel species), will provide an evolving platform for longitudinal benchmarking of frontier AI methods. All TerraIncognita data, results, and future updates are available \href{https://baskargroup.github.io/TerraIncognita/}{here}.
△ Less
Submitted 29 May, 2025;
originally announced June 2025.
-
Searching Clinical Data Using Generative AI
Authors:
Karan Hanswadkar,
Anika Kanchi,
Shivani Tripathi,
Shi Qiao,
Rony Chatterjee,
Alekh Jindal
Abstract:
Artificial Intelligence (AI) is making a major impact on healthcare, particularly through its application in natural language processing (NLP) and predictive analytics. The healthcare sector has increasingly adopted AI for tasks such as clinical data analysis and medical code assignment. However, searching for clinical information in large and often unorganized datasets remains a manual and error-…
▽ More
Artificial Intelligence (AI) is making a major impact on healthcare, particularly through its application in natural language processing (NLP) and predictive analytics. The healthcare sector has increasingly adopted AI for tasks such as clinical data analysis and medical code assignment. However, searching for clinical information in large and often unorganized datasets remains a manual and error-prone process. Assisting this process with automations can help physicians improve their operational productivity significantly.
In this paper, we present a generative AI approach, coined SearchAI, to enhance the accuracy and efficiency of searching clinical data. Unlike traditional code assignment, which is a one-to-one problem, clinical data search is a one-to-many problem, i.e., a given search query can map to a family of codes. Healthcare professionals typically search for groups of related diseases, drugs, or conditions that map to many codes, and therefore, they need search tools that can handle keyword synonyms, semantic variants, and broad open-ended queries. SearchAI employs a hierarchical model that respects the coding hierarchy and improves the traversal of relationships from parent to child nodes. SearchAI navigates these hierarchies predictively and ensures that all paths are reachable without losing any relevant nodes.
To evaluate the effectiveness of SearchAI, we conducted a series of experiments using both public and production datasets. Our results show that SearchAI outperforms default hierarchical traversals across several metrics, including accuracy, robustness, performance, and scalability. SearchAI can help make clinical data more accessible, leading to streamlined workflows, reduced administrative burden, and enhanced coding and diagnostic accuracy.
△ Less
Submitted 29 May, 2025;
originally announced May 2025.
-
J-PARSE: Jacobian-based Projection Algorithm for Resolving Singularities Effectively in Inverse Kinematic Control of Serial Manipulators
Authors:
Shivani Guptasarma,
Matthew Strong,
Honghao Zhen,
Monroe Kennedy III
Abstract:
J-PARSE is a method for smooth first-order inverse kinematic control of a serial manipulator near kinematic singularities. The commanded end-effector velocity is interpreted component-wise, according to the available mobility in each dimension of the task space. First, a substitute "Safety" Jacobian matrix is created, keeping the aspect ratio of the manipulability ellipsoid above a threshold value…
▽ More
J-PARSE is a method for smooth first-order inverse kinematic control of a serial manipulator near kinematic singularities. The commanded end-effector velocity is interpreted component-wise, according to the available mobility in each dimension of the task space. First, a substitute "Safety" Jacobian matrix is created, keeping the aspect ratio of the manipulability ellipsoid above a threshold value. The desired motion is then projected onto non-singular and singular directions, and the latter projection scaled down by a factor informed by the threshold value. A right-inverse of the non-singular Safety Jacobian is applied to the modified command. In the absence of joint limits and collisions, this ensures smooth transition into and out of low-rank poses, guaranteeing asymptotic stability for target poses within the workspace, and stability for those outside. Velocity control with J-PARSE is benchmarked against the Least-Squares and Damped Least-Squares inversions of the Jacobian, and shows high accuracy in reaching and leaving singular target poses. By expanding the available workspace of manipulators, the method finds applications in servoing, teleoperation, and learning.
△ Less
Submitted 6 May, 2025; v1 submitted 1 May, 2025;
originally announced May 2025.
-
Chatbot Arena Meets Nuggets: Towards Explanations and Diagnostics in the Evaluation of LLM Responses
Authors:
Sahel Sharifymoghaddam,
Shivani Upadhyay,
Nandan Thakur,
Ronak Pradeep,
Jimmy Lin
Abstract:
Battles, or side-by-side comparisons in so-called arenas that elicit human preferences, have emerged as a popular approach for assessing the output quality of LLMs. Recently, this idea has been extended to retrieval-augmented generation (RAG) systems. While undoubtedly representing an advance in evaluation, battles have at least two drawbacks, particularly in the context of complex information-see…
▽ More
Battles, or side-by-side comparisons in so-called arenas that elicit human preferences, have emerged as a popular approach for assessing the output quality of LLMs. Recently, this idea has been extended to retrieval-augmented generation (RAG) systems. While undoubtedly representing an advance in evaluation, battles have at least two drawbacks, particularly in the context of complex information-seeking queries: they are neither explanatory nor diagnostic. Recently, the nugget evaluation methodology has emerged as a promising approach to evaluate the quality of RAG answers. Nuggets decompose long-form LLM-generated answers into atomic facts, highlighting important pieces of information necessary in a "good" response. In this work, we apply our AutoNuggetizer framework to analyze data from roughly 7K Search Arena battles provided by LMArena in a fully automatic manner. Our results show a significant correlation between nugget scores and human preferences, showcasing promise in our approach to explainable and diagnostic system evaluations. All the code necessary to reproduce results in our work is available in https://github.com/castorini/lmsys_nuggetize.
△ Less
Submitted 25 May, 2025; v1 submitted 28 April, 2025;
originally announced April 2025.
-
Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges
Authors:
Nandan Thakur,
Ronak Pradeep,
Shivani Upadhyay,
Daniel Campos,
Nick Craswell,
Jimmy Lin
Abstract:
Retrieval-augmented generation (RAG) enables large language models (LLMs) to generate answers with citations from source documents containing "ground truth", thereby reducing system hallucinations. A crucial factor in RAG evaluation is "support", whether the information in the cited documents supports the answer. To this end, we conducted a large-scale comparative study of 45 participant submissio…
▽ More
Retrieval-augmented generation (RAG) enables large language models (LLMs) to generate answers with citations from source documents containing "ground truth", thereby reducing system hallucinations. A crucial factor in RAG evaluation is "support", whether the information in the cited documents supports the answer. To this end, we conducted a large-scale comparative study of 45 participant submissions on 36 topics to the TREC 2024 RAG Track, comparing an automatic LLM judge (GPT-4o) against human judges for support assessment. We considered two conditions: (1) fully manual assessments from scratch and (2) manual assessments with post-editing of LLM predictions. Our results indicate that for 56% of the manual from-scratch assessments, human and GPT-4o predictions match perfectly (on a three-level scale), increasing to 72% in the manual with post-editing condition. Furthermore, by carefully analyzing the disagreements in an unbiased study, we found that an independent human judge correlates better with GPT-4o than a human judge, suggesting that LLM judges can be a reliable alternative for support assessment. To conclude, we provide a qualitative analysis of human and GPT-4o errors to help guide future iterations of support assessment.
△ Less
Submitted 21 April, 2025;
originally announced April 2025.
-
The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models
Authors:
Ronak Pradeep,
Nandan Thakur,
Shivani Upadhyay,
Daniel Campos,
Nick Craswell,
Jimmy Lin
Abstract:
Large Language Models (LLMs) have significantly enhanced the capabilities of information access systems, especially with retrieval-augmented generation (RAG). Nevertheless, the evaluation of RAG systems remains a barrier to continued progress, a challenge we tackle in this work by proposing an automatic evaluation framework that is validated against human annotations. We believe that the nugget ev…
▽ More
Large Language Models (LLMs) have significantly enhanced the capabilities of information access systems, especially with retrieval-augmented generation (RAG). Nevertheless, the evaluation of RAG systems remains a barrier to continued progress, a challenge we tackle in this work by proposing an automatic evaluation framework that is validated against human annotations. We believe that the nugget evaluation methodology provides a solid foundation for evaluating RAG systems. This approach, originally developed for the TREC Question Answering (QA) Track in 2003, evaluates systems based on atomic facts that should be present in good answers. Our efforts focus on "refactoring" this methodology, where we describe the AutoNuggetizer framework that specifically applies LLMs to both automatically create nuggets and automatically assign nuggets to system answers. In the context of the TREC 2024 RAG Track, we calibrate a fully automatic approach against strategies where nuggets are created manually or semi-manually by human assessors and then assigned manually to system answers. Based on results from a community-wide evaluation, we observe strong agreement at the run level between scores derived from fully automatic nugget evaluation and human-based variants. The agreement is stronger when individual framework components such as nugget assignment are automated independently. This suggests that our evaluation framework provides tradeoffs between effort and quality that can be used to guide the development of future RAG systems. However, further research is necessary to refine our approach, particularly in establishing robust per-topic agreement to diagnose system failures effectively.
△ Less
Submitted 21 April, 2025;
originally announced April 2025.
-
Learning Strategies in Particle Swarm Optimizer: A Critical Review and Performance Analysis
Authors:
Dikshit Chauhan,
Shivani,
P. N. Suganthan
Abstract:
Nature has long inspired the development of swarm intelligence (SI), a key branch of artificial intelligence that models collective behaviors observed in biological systems for solving complex optimization problems. Particle swarm optimization (PSO) is widely adopted among SI algorithms due to its simplicity and efficiency. Despite numerous learning strategies proposed to enhance PSO's performance…
▽ More
Nature has long inspired the development of swarm intelligence (SI), a key branch of artificial intelligence that models collective behaviors observed in biological systems for solving complex optimization problems. Particle swarm optimization (PSO) is widely adopted among SI algorithms due to its simplicity and efficiency. Despite numerous learning strategies proposed to enhance PSO's performance in terms of convergence speed, robustness, and adaptability, no comprehensive and systematic analysis of these strategies exists. We review and classify various learning strategies to address this gap, assessing their impact on optimization performance. Additionally, a comparative experimental evaluation is conducted to examine how these strategies influence PSO's search dynamics. Finally, we discuss open challenges and future directions, emphasizing the need for self-adaptive, intelligent PSO variants capable of addressing increasingly complex real-world problems.
△ Less
Submitted 16 April, 2025;
originally announced April 2025.
-
Advancements in Multimodal Differential Evolution: A Comprehensive Review and Future Perspectives
Authors:
Dikshit Chauhan,
Shivani,
Donghwi Jung,
Anupam Yadav
Abstract:
Multi-modal optimization involves identifying multiple global and local optima of a function, offering valuable insights into diverse optimal solutions within the search space. Evolutionary algorithms (EAs) excel at finding multiple solutions in a single run, providing a distinct advantage over classical optimization techniques that often require multiple restarts without guarantee of obtaining di…
▽ More
Multi-modal optimization involves identifying multiple global and local optima of a function, offering valuable insights into diverse optimal solutions within the search space. Evolutionary algorithms (EAs) excel at finding multiple solutions in a single run, providing a distinct advantage over classical optimization techniques that often require multiple restarts without guarantee of obtaining diverse solutions. Among these EAs, differential evolution (DE) stands out as a powerful and versatile optimizer for continuous parameter spaces. DE has shown significant success in multi-modal optimization by utilizing its population-based search to promote the formation of multiple stable subpopulations, each targeting different optima. Recent advancements in DE for multi-modal optimization have focused on niching methods, parameter adaptation, hybridization with other algorithms including machine learning, and applications across various domains. Given these developments, it is an opportune moment to present a critical review of the latest literature and identify key future research directions. This paper offers a comprehensive overview of recent DE advancements in multimodal optimization, including methods for handling multiple optima, hybridization with EAs, and machine learning, and highlights a range of real-world applications. Additionally, the paper outlines a set of compelling open problems and future research issues from multiple perspectives
△ Less
Submitted 1 April, 2025;
originally announced April 2025.
-
Are Rules Meant to be Broken? Understanding Multilingual Moral Reasoning as a Computational Pipeline with UniMoral
Authors:
Shivani Kumar,
David Jurgens
Abstract:
Moral reasoning is a complex cognitive process shaped by individual experiences and cultural contexts and presents unique challenges for computational analysis. While natural language processing (NLP) offers promising tools for studying this phenomenon, current research lacks cohesion, employing discordant datasets and tasks that examine isolated aspects of moral reasoning. We bridge this gap with…
▽ More
Moral reasoning is a complex cognitive process shaped by individual experiences and cultural contexts and presents unique challenges for computational analysis. While natural language processing (NLP) offers promising tools for studying this phenomenon, current research lacks cohesion, employing discordant datasets and tasks that examine isolated aspects of moral reasoning. We bridge this gap with UniMoral, a unified dataset integrating psychologically grounded and social-media-derived moral dilemmas annotated with labels for action choices, ethical principles, contributing factors, and consequences, alongside annotators' moral and cultural profiles. Recognizing the cultural relativity of moral reasoning, UniMoral spans six languages, Arabic, Chinese, English, Hindi, Russian, and Spanish, capturing diverse socio-cultural contexts. We demonstrate UniMoral's utility through a benchmark evaluations of three large language models (LLMs) across four tasks: action prediction, moral typology classification, factor attribution analysis, and consequence generation. Key findings reveal that while implicitly embedded moral contexts enhance the moral reasoning capability of LLMs, there remains a critical need for increasingly specialized approaches to further advance moral reasoning in these models.
△ Less
Submitted 19 February, 2025;
originally announced February 2025.
-
Localization of Vibrotactile Stimuli on the Face
Authors:
Shivani Guptasarma,
Allison M. Okamura,
Monroe Kennedy III
Abstract:
The face remains relatively unexplored as a target region for haptic feedback, despite providing a considerable surface area consisting of highly sensitive skin. There are promising applications for facial haptic feedback, especially in cases of severe upper limb loss or spinal cord injury, where the face is typically less impacted than other body parts. Moreover, the neural representation of the…
▽ More
The face remains relatively unexplored as a target region for haptic feedback, despite providing a considerable surface area consisting of highly sensitive skin. There are promising applications for facial haptic feedback, especially in cases of severe upper limb loss or spinal cord injury, where the face is typically less impacted than other body parts. Moreover, the neural representation of the face is adjacent to that of the hand, and phantom maps have been discovered between the fingertips and the cheeks. However, there is a dearth of compact devices for facial haptic feedback, and vibrotactile stimulation, a common modality of haptic feedback, has not been characterized for localization acuity on the face. We performed a localization experiment on the cheek, with an arrangement of off-the-shelf coin vibration motors. The study follows the methods of prior work studying other skin regions, in which participants attempt to identify the sites of discrete vibrotactile stimuli. We intend for our results to inform the future development of systems using vibrotactile feedback to convey information via the face.
△ Less
Submitted 6 February, 2025;
originally announced February 2025.
-
Precision Harvesting in Cluttered Environments: Integrating End Effector Design with Dual Camera Perception
Authors:
Kendall Koe,
Poojan Kalpeshbhai Shah,
Benjamin Walt,
Jordan Westphal,
Samhita Marri,
Shivani Kamtikar,
James Seungbum Nam,
Naveen Kumar Uppalapati,
Girish Krishnan,
Girish Chowdhary
Abstract:
Due to labor shortages in specialty crop industries, a need for robotic automation to increase agricultural efficiency and productivity has arisen. Previous manipulation systems perform well in harvesting in uncluttered and structured environments. High tunnel environments are more compact and cluttered in nature, requiring a rethinking of the large form factor systems and grippers. We propose a n…
▽ More
Due to labor shortages in specialty crop industries, a need for robotic automation to increase agricultural efficiency and productivity has arisen. Previous manipulation systems perform well in harvesting in uncluttered and structured environments. High tunnel environments are more compact and cluttered in nature, requiring a rethinking of the large form factor systems and grippers. We propose a novel codesigned framework incorporating a global detection camera and a local eye-in-hand camera that demonstrates precise localization of small fruits via closed-loop visual feedback and reliable error handling. Field experiments in high tunnels show our system can reach an average of 85.0\% of cherry tomato fruit in 10.98s on average.
△ Less
Submitted 31 January, 2025;
originally announced January 2025.
-
Examining the Expanding Role of Synthetic Data Throughout the AI Development Pipeline
Authors:
Shivani Kapania,
Stephanie Ballard,
Alex Kessler,
Jennifer Wortman Vaughan
Abstract:
Alongside the growth of generative AI, we are witnessing a surge in the use of synthetic data across all stages of the AI development pipeline. It is now common practice for researchers and practitioners to use one large generative model (which we refer to as an auxiliary model) to generate synthetic data that is used to train or evaluate another, reconfiguring AI workflows and reshaping the very…
▽ More
Alongside the growth of generative AI, we are witnessing a surge in the use of synthetic data across all stages of the AI development pipeline. It is now common practice for researchers and practitioners to use one large generative model (which we refer to as an auxiliary model) to generate synthetic data that is used to train or evaluate another, reconfiguring AI workflows and reshaping the very nature of data. While scholars have raised concerns over the risks of synthetic data, policy guidance and best practices for its responsible use have not kept up with these rapidly evolving industry trends, in part because we lack a clear picture of current practices and challenges. Our work aims to address this gap. Through 29 interviews with AI practitioners and responsible AI experts, we examine the expanding role of synthetic data in AI development. Our findings reveal how auxiliary models are now widely used across the AI development pipeline. Practitioners describe synthetic data as crucial for addressing data scarcity and providing a competitive edge, noting that evaluation of generative AI systems at scale would be infeasible without auxiliary models. However, they face challenges controlling the outputs of auxiliary models, generating data that accurately depict underrepresented groups, and scaling data validation practices that are based primarily on manual inspection. We detail general limitations of and ethical considerations for synthetic data and conclude with a proposal of concrete steps towards the development of best practices for its responsible use.
△ Less
Submitted 12 May, 2025; v1 submitted 30 January, 2025;
originally announced January 2025.
-
Statistical Undersampling with Mutual Information and Support Points
Authors:
Alex Mak,
Shubham Sahoo,
Shivani Pandey,
Yidan Yue,
Linglong Kong
Abstract:
Class imbalance and distributional differences in large datasets present significant challenges for classification tasks machine learning, often leading to biased models and poor predictive performance for minority classes. This work introduces two novel undersampling approaches: mutual information-based stratified simple random sampling and support points optimization. These methods prioritize re…
▽ More
Class imbalance and distributional differences in large datasets present significant challenges for classification tasks machine learning, often leading to biased models and poor predictive performance for minority classes. This work introduces two novel undersampling approaches: mutual information-based stratified simple random sampling and support points optimization. These methods prioritize representative data selection, effectively minimizing information loss. Empirical results across multiple classification tasks demonstrate that our methods outperform traditional undersampling techniques, achieving higher balanced classification accuracy. These findings highlight the potential of combining statistical concepts with machine learning to address class imbalance in practical applications.
△ Less
Submitted 18 December, 2024;
originally announced December 2024.
-
Analysis of Generalized Hebbian Learning Algorithm for Neuromorphic Hardware Using Spinnaker
Authors:
Shivani Sharma,
Darshika G. Perera
Abstract:
Neuromorphic computing, inspired by biological neural networks, has emerged as a promising approach for solving complex machine learning tasks with greater efficiency and lower power consumption. The integration of biologically plausible learning algorithms, such as the Generalized Hebbian Algorithm (GHA), is key to enhancing the performance of neuromorphic systems. In this paper, we explore the a…
▽ More
Neuromorphic computing, inspired by biological neural networks, has emerged as a promising approach for solving complex machine learning tasks with greater efficiency and lower power consumption. The integration of biologically plausible learning algorithms, such as the Generalized Hebbian Algorithm (GHA), is key to enhancing the performance of neuromorphic systems. In this paper, we explore the application of GHA in large-scale neuromorphic platforms, specifically SpiNNaker, a hardware designed to simulate large neural networks. Our results demonstrate significant improvements in classification accuracy, showcasing the potential of biologically inspired learning algorithms in advancing the field of neuromorphic computing.
△ Less
Submitted 18 November, 2024;
originally announced November 2024.
-
Initial Nugget Evaluation Results for the TREC 2024 RAG Track with the AutoNuggetizer Framework
Authors:
Ronak Pradeep,
Nandan Thakur,
Shivani Upadhyay,
Daniel Campos,
Nick Craswell,
Jimmy Lin
Abstract:
This report provides an initial look at partial results from the TREC 2024 Retrieval-Augmented Generation (RAG) Track. We have identified RAG evaluation as a barrier to continued progress in information access (and more broadly, natural language processing and artificial intelligence), and it is our hope that we can contribute to tackling the many challenges in this space. The central hypothesis w…
▽ More
This report provides an initial look at partial results from the TREC 2024 Retrieval-Augmented Generation (RAG) Track. We have identified RAG evaluation as a barrier to continued progress in information access (and more broadly, natural language processing and artificial intelligence), and it is our hope that we can contribute to tackling the many challenges in this space. The central hypothesis we explore in this work is that the nugget evaluation methodology, originally developed for the TREC Question Answering Track in 2003, provides a solid foundation for evaluating RAG systems. As such, our efforts have focused on "refactoring" this methodology, specifically applying large language models to both automatically create nuggets and to automatically assign nuggets to system answers. We call this the AutoNuggetizer framework. Within the TREC setup, we are able to calibrate our fully automatic process against a manual process whereby nuggets are created by human assessors semi-manually and then assigned manually to system answers. Based on initial results across 21 topics from 45 runs, we observe a strong correlation between scores derived from a fully automatic nugget evaluation and a (mostly) manual nugget evaluation by human assessors. This suggests that our fully automatic evaluation process can be used to guide future iterations of RAG systems.
△ Less
Submitted 14 November, 2024;
originally announced November 2024.
-
A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look
Authors:
Shivani Upadhyay,
Ronak Pradeep,
Nandan Thakur,
Daniel Campos,
Nick Craswell,
Ian Soboroff,
Hoa Trang Dang,
Jimmy Lin
Abstract:
The application of large language models to provide relevance assessments presents exciting opportunities to advance information retrieval, natural language processing, and beyond, but to date many unknowns remain. This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed in situ: the "standard" fully…
▽ More
The application of large language models to provide relevance assessments presents exciting opportunities to advance information retrieval, natural language processing, and beyond, but to date many unknowns remain. This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed in situ: the "standard" fully manual process that NIST has implemented for decades and three different alternatives that take advantage of LLMs to different extents using the open-source UMBRELA tool. This setup allows us to correlate system rankings induced by the different approaches to characterize tradeoffs between cost and quality. We find that in terms of nDCG@20, nDCG@100, and Recall@100, system rankings induced by automatically generated relevance assessments from UMBRELA correlate highly with those induced by fully manual assessments across a diverse set of 77 runs from 19 teams. Our results suggest that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness. Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits. Overall, human assessors appear to be stricter than UMBRELA in applying relevance criteria. Our work validates the use of LLMs in academic TREC-style evaluations and provides the foundation for future studies.
△ Less
Submitted 12 November, 2024;
originally announced November 2024.
-
'Simulacrum of Stories': Examining Large Language Models as Qualitative Research Participants
Authors:
Shivani Kapania,
William Agnew,
Motahhare Eslami,
Hoda Heidari,
Sarah Fox
Abstract:
The recent excitement around generative models has sparked a wave of proposals suggesting the replacement of human participation and labor in research and development--e.g., through surveys, experiments, and interviews--with synthetic research data generated by large language models (LLMs). We conducted interviews with 19 qualitative researchers to understand their perspectives on this paradigm sh…
▽ More
The recent excitement around generative models has sparked a wave of proposals suggesting the replacement of human participation and labor in research and development--e.g., through surveys, experiments, and interviews--with synthetic research data generated by large language models (LLMs). We conducted interviews with 19 qualitative researchers to understand their perspectives on this paradigm shift. Initially skeptical, researchers were surprised to see similar narratives emerge in the LLM-generated data when using the interview probe. However, over several conversational turns, they went on to identify fundamental limitations, such as how LLMs foreclose participants' consent and agency, produce responses lacking in palpability and contextual depth, and risk delegitimizing qualitative research methods. We argue that the use of LLMs as proxies for participants enacts the surrogate effect, raising ethical and epistemological concerns that extend beyond the technical limitations of current models to the core of whether LLMs fit within qualitative ways of knowing.
△ Less
Submitted 28 September, 2024;
originally announced September 2024.
-
A Multi-operator Ensemble LSHADE with Restart and Local Search Mechanisms for Single-objective Optimization
Authors:
Dikshit Chauhan,
Anupam Trivedi,
Shivani
Abstract:
In recent years, multi-operator and multi-method algorithms have succeeded, encouraging their combination within single frameworks. Despite promising results, there remains room for improvement as only some evolutionary algorithms (EAs) consistently excel across all optimization problems. This paper proposes mLSHADE-RL, an enhanced version of LSHADE-cnEpSin, which is one of the winners of the CEC…
▽ More
In recent years, multi-operator and multi-method algorithms have succeeded, encouraging their combination within single frameworks. Despite promising results, there remains room for improvement as only some evolutionary algorithms (EAs) consistently excel across all optimization problems. This paper proposes mLSHADE-RL, an enhanced version of LSHADE-cnEpSin, which is one of the winners of the CEC 2017 competition in real-parameter single-objective optimization. mLSHADE-RL integrates multiple EAs and search operators to improve performance further. Three mutation strategies such as DE/current-to-pbest-weight/1 with archive, DE/current-to-pbest/1 without archive, and DE/current-to-ordpbest-weight/1 are integrated in the original LSHADE-cnEpSin. A restart mechanism is also proposed to overcome the local optima tendency. Additionally, a local search method is applied in the later phase of the evolutionary procedure to enhance the exploitation capability of mLSHADE-RL. mLSHADE-RL is tested on 30 dimensions in the CEC 2024 competition on single objective bound constrained optimization, demonstrating superior performance over other state-of-the-art algorithms in producing high-quality solutions across various optimization scenarios.
△ Less
Submitted 24 September, 2024;
originally announced September 2024.
-
Real or Robotic? Assessing Whether LLMs Accurately Simulate Qualities of Human Responses in Dialogue
Authors:
Jonathan Ivey,
Shivani Kumar,
Jiayu Liu,
Hua Shen,
Sushrita Rakshit,
Rohan Raju,
Haotian Zhang,
Aparna Ananthasubramaniam,
Junghwan Kim,
Bowen Yi,
Dustin Wright,
Abraham Israeli,
Anders Giovanni Møller,
Lechen Zhang,
David Jurgens
Abstract:
Studying and building datasets for dialogue tasks is both expensive and time-consuming due to the need to recruit, train, and collect data from study participants. In response, much recent work has sought to use large language models (LLMs) to simulate both human-human and human-LLM interactions, as they have been shown to generate convincingly human-like text in many settings. However, to what ex…
▽ More
Studying and building datasets for dialogue tasks is both expensive and time-consuming due to the need to recruit, train, and collect data from study participants. In response, much recent work has sought to use large language models (LLMs) to simulate both human-human and human-LLM interactions, as they have been shown to generate convincingly human-like text in many settings. However, to what extent do LLM-based simulations \textit{actually} reflect human dialogues? In this work, we answer this question by generating a large-scale dataset of 100,000 paired LLM-LLM and human-LLM dialogues from the WildChat dataset and quantifying how well the LLM simulations align with their human counterparts. Overall, we find relatively low alignment between simulations and human interactions, demonstrating a systematic divergence along the multiple textual properties, including style and content. Further, in comparisons of English, Chinese, and Russian dialogues, we find that models perform similarly. Our results suggest that LLMs generally perform better when the human themself writes in a way that is more similar to the LLM's own style.
△ Less
Submitted 16 September, 2024; v1 submitted 12 September, 2024;
originally announced September 2024.
-
Target Detection for OTFS-Aided Cell-Free MIMO ISAC System
Authors:
Shivani Singh,
Amudheesan Nakkeeran,
Prem Singh,
Ekant Sharma,
Jyotsna Bapat
Abstract:
This letter focuses on enhancing target detection performance for a multi-user integrated sensing and communication (ISAC) system using orthogonal time frequency space (OTFS)-aided cell-free multiple-input multiple-output (MIMO) technology in high-speed vehicular environments. We propose a sensing-centric (SC) approach for target detection using communication signals with or without sensing signal…
▽ More
This letter focuses on enhancing target detection performance for a multi-user integrated sensing and communication (ISAC) system using orthogonal time frequency space (OTFS)-aided cell-free multiple-input multiple-output (MIMO) technology in high-speed vehicular environments. We propose a sensing-centric (SC) approach for target detection using communication signals with or without sensing signals. Power allocation is optimized to maximize the sensing signal-to-noise ratio (SNR) of the proposed SC scheme while ensuring a required quality-of-service (QoS) for the communication user equipment (UEs), and adhering to each access points (APs) power budget. Numerical results show that the proposed SC scheme vastly outperforms a communication-centric method that minimizes the total power consumed at the APs subject to the same constraints.
△ Less
Submitted 23 August, 2024;
originally announced August 2024.
-
Deep Learning-based Classification of Dementia using Image Representation of Subcortical Signals
Authors:
Shivani Ranjan,
Ayush Tripathi,
Harshal Shende,
Robin Badal,
Amit Kumar,
Pramod Yadav,
Deepak Joshi,
Lalan Kumar
Abstract:
Dementia is a neurological syndrome marked by cognitive decline. Alzheimer's disease (AD) and Frontotemporal dementia (FTD) are the common forms of dementia, each with distinct progression patterns. EEG, a non-invasive tool for recording brain activity, has shown potential in distinguishing AD from FTD and mild cognitive impairment (MCI). Previous studies have utilized various EEG features, such a…
▽ More
Dementia is a neurological syndrome marked by cognitive decline. Alzheimer's disease (AD) and Frontotemporal dementia (FTD) are the common forms of dementia, each with distinct progression patterns. EEG, a non-invasive tool for recording brain activity, has shown potential in distinguishing AD from FTD and mild cognitive impairment (MCI). Previous studies have utilized various EEG features, such as subband power and connectivity patterns to differentiate these conditions. However, artifacts in EEG signals can obscure crucial information, necessitating advanced signal processing techniques. This study aims to develop a deep learning-based classification system for dementia by analyzing scout time-series signals from deep brain regions, specifically the hippocampus, amygdala, and thalamus. The study utilizes scout time series extracted via the standardized low-resolution brain electromagnetic tomography (sLORETA) technique. The time series is converted to image representations using continuous wavelet transform (CWT) and fed as input to deep learning models. Two high-density EEG datasets are utilized to check for the efficacy of the proposed method: the online BrainLat dataset (comprising AD, FTD, and healthy controls (HC)) and the in-house IITD-AIIA dataset (including subjects with AD, MCI, and HC). Different classification strategies and classifier combinations have been utilized for the accurate mapping of classes on both datasets. The best results were achieved by using a product of probabilities from classifiers for left and right subcortical regions in conjunction with the DenseNet model architecture. It yields accuracies of 94.17$\%$ and 77.72$\%$ on the BrainLat and IITD-AIIA datasets, respectively. This highlights the potential of this approach for early and accurate differentiation of neurodegenerative disorders.
△ Less
Submitted 20 August, 2024;
originally announced August 2024.
-
Automated Sperm Morphology Analysis Based on Instance-Aware Part Segmentation
Authors:
Wenyuan Chen,
Haocong Song,
Changsheng Dai,
Aojun Jiang,
Guanqiao Shan,
Hang Liu,
Yanlong Zhou,
Khaled Abdalla,
Shivani N Dhanani,
Katy Fatemeh Moosavi,
Shruti Pathak,
Clifford Librach,
Zhuoran Zhang,
Yu Sun
Abstract:
Traditional sperm morphology analysis is based on tedious manual annotation. Automated morphology analysis of a high number of sperm requires accurate segmentation of each sperm part and quantitative morphology evaluation. State-of-the-art instance-aware part segmentation networks follow a "detect-then-segment" paradigm. However, due to sperm's slim shape, their segmentation suffers from large con…
▽ More
Traditional sperm morphology analysis is based on tedious manual annotation. Automated morphology analysis of a high number of sperm requires accurate segmentation of each sperm part and quantitative morphology evaluation. State-of-the-art instance-aware part segmentation networks follow a "detect-then-segment" paradigm. However, due to sperm's slim shape, their segmentation suffers from large context loss and feature distortion due to bounding box cropping and resizing during ROI Align. Moreover, morphology measurement of sperm tail is demanding because of the long and curved shape and its uneven width. This paper presents automated techniques to measure sperm morphology parameters automatically and quantitatively. A novel attention-based instance-aware part segmentation network is designed to reconstruct lost contexts outside bounding boxes and to fix distorted features, by refining preliminary segmented masks through merging features extracted by feature pyramid network. An automated centerline-based tail morphology measurement method is also proposed, in which an outlier filtering method and endpoint detection algorithm are designed to accurately reconstruct tail endpoints. Experimental results demonstrate that the proposed network outperformed the state-of-the-art top-down RP-R-CNN by 9.2% [AP]_vol^p, and the proposed automated tail morphology measurement method achieved high measurement accuracies of 95.34%,96.39%,91.2% for length, width and curvature, respectively.
△ Less
Submitted 31 July, 2024;
originally announced August 2024.
-
Large Language Models for Integrating Social Determinant of Health Data: A Case Study on Heart Failure 30-Day Readmission Prediction
Authors:
Chase Fensore,
Rodrigo M. Carrillo-Larco,
Shivani A. Patel,
Alanna A. Morris,
Joyce C. Ho
Abstract:
Social determinants of health (SDOH) $-$ the myriad of circumstances in which people live, grow, and age $-$ play an important role in health outcomes. However, existing outcome prediction models often only use proxies of SDOH as features. Recent open data initiatives present an opportunity to construct a more comprehensive view of SDOH, but manually integrating the most relevant data for individu…
▽ More
Social determinants of health (SDOH) $-$ the myriad of circumstances in which people live, grow, and age $-$ play an important role in health outcomes. However, existing outcome prediction models often only use proxies of SDOH as features. Recent open data initiatives present an opportunity to construct a more comprehensive view of SDOH, but manually integrating the most relevant data for individual patients becomes increasingly challenging as the volume and diversity of public SDOH data grows. Large language models (LLMs) have shown promise at automatically annotating structured data. Here, we conduct an end-to-end case study evaluating the feasibility of using LLMs to integrate SDOH data, and the utility of these SDOH features for clinical prediction. We first manually label 700+ variables from two publicly-accessible SDOH data sources to one of five semantic SDOH categories. Then, we benchmark performance of 9 open-source LLMs on this classification task. Finally, we train ML models to predict 30-day hospital readmission among 39k heart failure (HF) patients, and we compare the prediction performance of the categorized SDOH variables with standard clinical variables. Additionally, we investigate the impact of few-shot LLM prompting on LLM annotation performance, and perform a metadata ablation study on prompts to evaluate which information helps LLMs accurately annotate these variables. We find that some open-source LLMs can effectively, accurately annotate SDOH variables with zero-shot prompting without the need for fine-tuning. Crucially, when combined with standard clinical features, the LLM-annotated Neighborhood and Built Environment subset of the SDOH variables shows the best performance predicting 30-day readmission of HF patients.
△ Less
Submitted 12 July, 2024;
originally announced July 2024.
-
ProACT: An Augmented Reality Testbed for Intelligent Prosthetic Arms
Authors:
Shivani Guptasarma,
Monroe D. Kennedy III
Abstract:
Upper-limb amputees face tremendous difficulty in operating dexterous powered prostheses. Previous work has shown that aspects of prosthetic hand, wrist, or elbow control can be improved through "intelligent" control, by combining movement-based or gaze-based intent estimation with low-level robotic autonomy. However, no such solutions exist for whole-arm control. Moreover, hardware platforms for…
▽ More
Upper-limb amputees face tremendous difficulty in operating dexterous powered prostheses. Previous work has shown that aspects of prosthetic hand, wrist, or elbow control can be improved through "intelligent" control, by combining movement-based or gaze-based intent estimation with low-level robotic autonomy. However, no such solutions exist for whole-arm control. Moreover, hardware platforms for advanced prosthetic control are expensive, and existing simulation platforms are not well-designed for integration with robotics software frameworks. We present the Prosthetic Arm Control Testbed (ProACT), a platform for evaluating intelligent control methods for prosthetic arms in an immersive (Augmented Reality) simulation setting. We demonstrate the use of ProACT through preliminary studies, with non-amputee participants performing an adapted Box-and-Blocks task with and without intent estimation. We further discuss how our observations may inform the design of prosthesis control methods, as well as the design of future studies using the platform. To the best of our knowledge, this constitutes the first study of semi-autonomous control for complex whole-arm prostheses, the first study including sequential task modeling in the context of wearable prosthetic arms, and the first testbed of its kind. Towards the goal of supporting future research in intelligent prosthetics, the system is built upon on existing open-source frameworks for robotics, and is available at https://arm.stanford.edu/proact.
△ Less
Submitted 2 December, 2024; v1 submitted 6 July, 2024;
originally announced July 2024.
-
BioTrove: A Large Curated Image Dataset Enabling AI for Biodiversity
Authors:
Chih-Hsuan Yang,
Benjamin Feuer,
Zaki Jubery,
Zi K. Deng,
Andre Nakkab,
Md Zahid Hasan,
Shivani Chiranjeevi,
Kelly Marshall,
Nirmal Baishnab,
Asheesh K Singh,
Arti Singh,
Soumik Sarkar,
Nirav Merchant,
Chinmay Hegde,
Baskar Ganapathysubramanian
Abstract:
We introduce BioTrove, the largest publicly accessible dataset designed to advance AI applications in biodiversity. Curated from the iNaturalist platform and vetted to include only research-grade data, BioTrove contains 161.9 million images, offering unprecedented scale and diversity from three primary kingdoms: Animalia ("animals"), Fungi ("fungi"), and Plantae ("plants"), spanning approximately…
▽ More
We introduce BioTrove, the largest publicly accessible dataset designed to advance AI applications in biodiversity. Curated from the iNaturalist platform and vetted to include only research-grade data, BioTrove contains 161.9 million images, offering unprecedented scale and diversity from three primary kingdoms: Animalia ("animals"), Fungi ("fungi"), and Plantae ("plants"), spanning approximately 366.6K species. Each image is annotated with scientific names, taxonomic hierarchies, and common names, providing rich metadata to support accurate AI model development across diverse species and ecosystems.
We demonstrate the value of BioTrove by releasing a suite of CLIP models trained using a subset of 40 million captioned images, known as BioTrove-Train. This subset focuses on seven categories within the dataset that are underrepresented in standard image recognition models, selected for their critical role in biodiversity and agriculture: Aves ("birds"), Arachnida ("spiders/ticks/mites"), Insecta ("insects"), Plantae ("plants"), Fungi ("fungi"), Mollusca ("snails"), and Reptilia ("snakes/lizards"). To support rigorous assessment, we introduce several new benchmarks and report model accuracy for zero-shot learning across life stages, rare species, confounding species, and multiple taxonomic levels.
We anticipate that BioTrove will spur the development of AI models capable of supporting digital tools for pest control, crop monitoring, biodiversity assessment, and environmental conservation. These advancements are crucial for ensuring food security, preserving ecosystems, and mitigating the impacts of climate change. BioTrove is publicly available, easily accessible, and ready for immediate use.
△ Less
Submitted 27 January, 2025; v1 submitted 25 June, 2024;
originally announced June 2024.
-
UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor
Authors:
Shivani Upadhyay,
Ronak Pradeep,
Nandan Thakur,
Nick Craswell,
Jimmy Lin
Abstract:
Copious amounts of relevance judgments are necessary for the effective training and accurate evaluation of retrieval systems. Conventionally, these judgments are made by human assessors, rendering this process expensive and laborious. A recent study by Thomas et al. from Microsoft Bing suggested that large language models (LLMs) can accurately perform the relevance assessment task and provide huma…
▽ More
Copious amounts of relevance judgments are necessary for the effective training and accurate evaluation of retrieval systems. Conventionally, these judgments are made by human assessors, rendering this process expensive and laborious. A recent study by Thomas et al. from Microsoft Bing suggested that large language models (LLMs) can accurately perform the relevance assessment task and provide human-quality judgments, but unfortunately their study did not yield any reusable software artifacts. Our work presents UMBRELA (a recursive acronym that stands for UMbrela is the Bing RELevance Assessor), an open-source toolkit that reproduces the results of Thomas et al. using OpenAI's GPT-4o model and adds more nuance to the original paper. Across Deep Learning Tracks from TREC 2019 to 2023, we find that LLM-derived relevance judgments correlate highly with rankings generated by effective multi-stage retrieval systems. Our toolkit is designed to be easily extensible and can be integrated into existing multi-stage retrieval and evaluation pipelines, offering researchers a valuable resource for studying retrieval evaluation methodologies. UMBRELA will be used in the TREC 2024 RAG Track to aid in relevance assessments, and we envision our toolkit becoming a foundation for further innovation in the field. UMBRELA is available at https://github.com/castorini/umbrela.
△ Less
Submitted 10 June, 2024;
originally announced June 2024.
-
UniRAG: Universal Retrieval Augmentation for Large Vision Language Models
Authors:
Sahel Sharifymoghaddam,
Shivani Upadhyay,
Wenhu Chen,
Jimmy Lin
Abstract:
Recently, Large Vision Language Models (LVLMs) have unlocked many complex use cases that require Multi-Modal (MM) understanding (e.g., image captioning or visual question answering) and MM generation (e.g., text-guided image generation or editing) capabilities. To further improve the output fidelityof LVLMs we introduce UniRAG, a plug-and-play technique that adds relevant retrieved information to…
▽ More
Recently, Large Vision Language Models (LVLMs) have unlocked many complex use cases that require Multi-Modal (MM) understanding (e.g., image captioning or visual question answering) and MM generation (e.g., text-guided image generation or editing) capabilities. To further improve the output fidelityof LVLMs we introduce UniRAG, a plug-and-play technique that adds relevant retrieved information to prompts as few-shot examples during inference. Unlike the common belief that Retrieval Augmentation (RA) mainly improves generation or understanding of uncommon entities, our evaluation results on the MSCOCO dataset with common entities show that both proprietary models like GPT-4o and Gemini-Pro and smaller open-source models like LLaVA, LaVIT, and Emu2 significantly enhance their generation quality when their input prompts are augmented with relevant information retrieved by Vision-Language (VL) retrievers like UniIR models. All the necessary code to reproduce our results is available at https://github.com/castorini/UniRAG
△ Less
Submitted 9 March, 2025; v1 submitted 16 May, 2024;
originally announced May 2024.
-
LLMs Can Patch Up Missing Relevance Judgments in Evaluation
Authors:
Shivani Upadhyay,
Ehsan Kamalloo,
Jimmy Lin
Abstract:
Unjudged documents or holes in information retrieval benchmarks are considered non-relevant in evaluation, yielding no gains in measuring effectiveness. However, these missing judgments may inadvertently introduce biases into the evaluation as their prevalence for a retrieval model is heavily contingent on the pooling process. Thus, filling holes becomes crucial in ensuring reliable and accurate e…
▽ More
Unjudged documents or holes in information retrieval benchmarks are considered non-relevant in evaluation, yielding no gains in measuring effectiveness. However, these missing judgments may inadvertently introduce biases into the evaluation as their prevalence for a retrieval model is heavily contingent on the pooling process. Thus, filling holes becomes crucial in ensuring reliable and accurate evaluation. Collecting human judgment for all documents is cumbersome and impractical. In this paper, we aim at leveraging large language models (LLMs) to automatically label unjudged documents. Our goal is to instruct an LLM using detailed instructions to assign fine-grained relevance judgments to holes. To this end, we systematically simulate scenarios with varying degrees of holes by randomly dropping relevant documents from the relevance judgment in TREC DL tracks. Our experiments reveal a strong correlation between our LLM-based method and ground-truth relevance judgments. Based on our simulation experiments conducted on three TREC DL datasets, in the extreme scenario of retaining only 10% of judgments, our method achieves a Kendall tau correlation of 0.87 and 0.92 on an average for Vicuña-7B and GPT-3.5 Turbo respectively.
△ Less
Submitted 7 May, 2024;
originally announced May 2024.
-
Generative AI in Cybersecurity
Authors:
Shivani Metta,
Isaac Chang,
Jack Parker,
Michael P. Roman,
Arturo F. Ehuan
Abstract:
The dawn of Generative Artificial Intelligence (GAI), characterized by advanced models such as Generative Pre-trained Transformers (GPT) and other Large Language Models (LLMs), has been pivotal in reshaping the field of data analysis, pattern recognition, and decision-making processes. This surge in GAI technology has ushered in not only innovative opportunities for data processing and automation…
▽ More
The dawn of Generative Artificial Intelligence (GAI), characterized by advanced models such as Generative Pre-trained Transformers (GPT) and other Large Language Models (LLMs), has been pivotal in reshaping the field of data analysis, pattern recognition, and decision-making processes. This surge in GAI technology has ushered in not only innovative opportunities for data processing and automation but has also introduced significant cybersecurity challenges.
As GAI rapidly progresses, it outstrips the current pace of cybersecurity protocols and regulatory frameworks, leading to a paradox wherein the same innovations meant to safeguard digital infrastructures also enhance the arsenal available to cyber criminals. These adversaries, adept at swiftly integrating and exploiting emerging technologies, may utilize GAI to develop malware that is both more covert and adaptable, thus complicating traditional cybersecurity efforts.
The acceleration of GAI presents an ambiguous frontier for cybersecurity experts, offering potent tools for threat detection and response, while concurrently providing cyber attackers with the means to engineer more intricate and potent malware. Through the joint efforts of Duke Pratt School of Engineering, Coalfire, and Safebreach, this research undertakes a meticulous analysis of how malicious agents are exploiting GAI to augment their attack strategies, emphasizing a critical issue for the integrity of future cybersecurity initiatives. The study highlights the critical need for organizations to proactively identify and develop more complex defensive strategies to counter the sophisticated employment of GAI in malware creation.
△ Less
Submitted 2 May, 2024;
originally announced May 2024.
-
Predicting the Temporal Dynamics of Prosthetic Vision
Authors:
Yuchen Hou,
Laya Pullela,
Jiaxin Su,
Sriya Aluru,
Shivani Sista,
Xiankun Lu,
Michael Beyeler
Abstract:
Retinal implants are a promising treatment option for degenerative retinal disease. While numerous models have been developed to simulate the appearance of elicited visual percepts ("phosphenes"), these models often either focus solely on spatial characteristics or inadequately capture the complex temporal dynamics observed in clinical trials, which vary heavily across implant technologies, subjec…
▽ More
Retinal implants are a promising treatment option for degenerative retinal disease. While numerous models have been developed to simulate the appearance of elicited visual percepts ("phosphenes"), these models often either focus solely on spatial characteristics or inadequately capture the complex temporal dynamics observed in clinical trials, which vary heavily across implant technologies, subjects, and stimulus conditions. Here we introduce two computational models designed to accurately predict phosphene fading and persistence under varying stimulus conditions, cross-validated on behavioral data reported by nine users of the Argus II Retinal Prosthesis System. Both models segment the time course of phosphene perception into discrete intervals, decomposing phosphene fading and persistence into either sinusoidal or exponential components. Our spectral model demonstrates state-of-the-art predictions of phosphene intensity over time (r = 0.7 across all participants). Overall, this study lays the groundwork for enhancing prosthetic vision by improving our understanding of phosphene temporal dynamics.
△ Less
Submitted 1 May, 2024; v1 submitted 22 April, 2024;
originally announced April 2024.
-
"I'm categorizing LLM as a productivity tool": Examining ethics of LLM use in HCI research practices
Authors:
Shivani Kapania,
Ruiyi Wang,
Toby Jia-Jun Li,
Tianshi Li,
Hong Shen
Abstract:
Large language models are increasingly applied in real-world scenarios, including research and education. These models, however, come with well-known ethical issues, which may manifest in unexpected ways in human-computer interaction research due to the extensive engagement with human subjects. This paper reports on research practices related to LLM use, drawing on 16 semi-structured interviews an…
▽ More
Large language models are increasingly applied in real-world scenarios, including research and education. These models, however, come with well-known ethical issues, which may manifest in unexpected ways in human-computer interaction research due to the extensive engagement with human subjects. This paper reports on research practices related to LLM use, drawing on 16 semi-structured interviews and a survey conducted with 50 HCI researchers. We discuss the ways in which LLMs are already being utilized throughout the entire HCI research pipeline, from ideation to system development and paper writing. While researchers described nuanced understandings of ethical issues, they were rarely or only partially able to identify and address those ethical concerns in their own projects. This lack of action and reliance on workarounds was explained through the perceived lack of control and distributed responsibility in the LLM supply chain, the conditional nature of engaging with ethics, and competing priorities. Finally, we reflect on the implications of our findings and present opportunities to shape emerging norms of engaging with large language models in HCI research.
△ Less
Submitted 28 March, 2024;
originally announced March 2024.
-
SEGAA: A Unified Approach to Predicting Age, Gender, and Emotion in Speech
Authors:
Aron R,
Indra Sigicharla,
Chirag Periwal,
Mohanaprasad K,
Nithya Darisini P S,
Sourabh Tiwari,
Shivani Arora
Abstract:
The interpretation of human voices holds importance across various applications. This study ventures into predicting age, gender, and emotion from vocal cues, a field with vast applications. Voice analysis tech advancements span domains, from improving customer interactions to enhancing healthcare and retail experiences. Discerning emotions aids mental health, while age and gender detection are vi…
▽ More
The interpretation of human voices holds importance across various applications. This study ventures into predicting age, gender, and emotion from vocal cues, a field with vast applications. Voice analysis tech advancements span domains, from improving customer interactions to enhancing healthcare and retail experiences. Discerning emotions aids mental health, while age and gender detection are vital in various contexts. Exploring deep learning models for these predictions involves comparing single, multi-output, and sequential models highlighted in this paper. Sourcing suitable data posed challenges, resulting in the amalgamation of the CREMA-D and EMO-DB datasets. Prior work showed promise in individual predictions, but limited research considered all three variables simultaneously. This paper identifies flaws in an individual model approach and advocates for our novel multi-output learning architecture Speech-based Emotion Gender and Age Analysis (SEGAA) model. The experiments suggest that Multi-output models perform comparably to individual models, efficiently capturing the intricate relationships between variables and speech inputs, all while achieving improved runtime.
△ Less
Submitted 1 March, 2024;
originally announced March 2024.
-
Exploring the Efficacy of Large Language Models in Summarizing Mental Health Counseling Sessions: A Benchmark Study
Authors:
Prottay Kumar Adhikary,
Aseem Srivastava,
Shivani Kumar,
Salam Michael Singh,
Puneet Manuja,
Jini K Gopinath,
Vijay Krishnan,
Swati Kedia,
Koushik Sinha Deb,
Tanmoy Chakraborty
Abstract:
Comprehensive summaries of sessions enable an effective continuity in mental health counseling, facilitating informed therapy planning. Yet, manual summarization presents a significant challenge, diverting experts' attention from the core counseling process. This study evaluates the effectiveness of state-of-the-art Large Language Models (LLMs) in selectively summarizing various components of ther…
▽ More
Comprehensive summaries of sessions enable an effective continuity in mental health counseling, facilitating informed therapy planning. Yet, manual summarization presents a significant challenge, diverting experts' attention from the core counseling process. This study evaluates the effectiveness of state-of-the-art Large Language Models (LLMs) in selectively summarizing various components of therapy sessions through aspect-based summarization, aiming to benchmark their performance. We introduce MentalCLOUDS, a counseling-component guided summarization dataset consisting of 191 counseling sessions with summaries focused on three distinct counseling components (aka counseling aspects). Additionally, we assess the capabilities of 11 state-of-the-art LLMs in addressing the task of component-guided summarization in counseling. The generated summaries are evaluated quantitatively using standard summarization metrics and verified qualitatively by mental health professionals. Our findings demonstrate the superior performance of task-specific LLMs such as MentalLlama, Mistral, and MentalBART in terms of standard quantitative metrics such as Rouge-1, Rouge-2, Rouge-L, and BERTScore across all aspects of counseling components. Further, expert evaluation reveals that Mistral supersedes both MentalLlama and MentalBART based on six parameters -- affective attitude, burden, ethicality, coherence, opportunity costs, and perceived effectiveness. However, these models share the same weakness by demonstrating a potential for improvement in the opportunity costs and perceived effectiveness metrics.
△ Less
Submitted 29 February, 2024;
originally announced February 2024.
-
SemEval 2024 -- Task 10: Emotion Discovery and Reasoning its Flip in Conversation (EDiReF)
Authors:
Shivani Kumar,
Md Shad Akhtar,
Erik Cambria,
Tanmoy Chakraborty
Abstract:
We present SemEval-2024 Task 10, a shared task centred on identifying emotions and finding the rationale behind their flips within monolingual English and Hindi-English code-mixed dialogues. This task comprises three distinct subtasks - emotion recognition in conversation for code-mixed dialogues, emotion flip reasoning for code-mixed dialogues, and emotion flip reasoning for English dialogues. Pa…
▽ More
We present SemEval-2024 Task 10, a shared task centred on identifying emotions and finding the rationale behind their flips within monolingual English and Hindi-English code-mixed dialogues. This task comprises three distinct subtasks - emotion recognition in conversation for code-mixed dialogues, emotion flip reasoning for code-mixed dialogues, and emotion flip reasoning for English dialogues. Participating systems were tasked to automatically execute one or more of these subtasks. The datasets for these tasks comprise manually annotated conversations focusing on emotions and triggers for emotion shifts (The task data is available at https://github.com/LCS2-IIITD/EDiReF-SemEval2024.git). A total of 84 participants engaged in this task, with the most adept systems attaining F1-scores of 0.70, 0.79, and 0.76 for the respective subtasks. This paper summarises the results and findings from 24 teams alongside their system descriptions.
△ Less
Submitted 29 February, 2024;
originally announced February 2024.
-
SuperdropNet: a Stable and Accurate Machine Learning Proxy for Droplet-based Cloud Microphysics
Authors:
Shivani Sharma,
David Greenberg
Abstract:
Cloud microphysics has important consequences for climate and weather phenomena, and inaccurate representations can limit forecast accuracy. While atmospheric models increasingly resolve storms and clouds, the accuracy of the underlying microphysics remains limited by computationally expedient bulk moment schemes based on simplifying assumptions. Droplet-based Lagrangian schemes are more accurate…
▽ More
Cloud microphysics has important consequences for climate and weather phenomena, and inaccurate representations can limit forecast accuracy. While atmospheric models increasingly resolve storms and clouds, the accuracy of the underlying microphysics remains limited by computationally expedient bulk moment schemes based on simplifying assumptions. Droplet-based Lagrangian schemes are more accurate but are underutilized due to their large computational overhead. Machine learning (ML) based schemes can bridge this gap by learning from vast droplet-based simulation datasets, but have so far struggled to match the accuracy and stability of bulk moment schemes. To address this challenge, we developed SuperdropNet, an ML-based emulator of the Lagrangian superdroplet simulations. To improve accuracy and stability, we employ multi-step autoregressive prediction during training, impose physical constraints, and carefully control stochasticity in the training data. Superdropnet predicted hydrometeor states and cloud-to-rain transition times more accurately than previous ML emulators, and matched or outperformed bulk moment schemes in many cases. We further carried out detailed analyses to reveal how multistep autoregressive training improves performance, and how the performance of SuperdropNet and other microphysical schemes hydrometeors' mass, number and size distribution. Together our results suggest that ML models can effectively emulate cloud microphysics, in a manner consistent with droplet-based simulations.
△ Less
Submitted 28 February, 2024;
originally announced February 2024.
-
On the Effect of Image Resolution on Semantic Segmentation
Authors:
Ritambhara Singh,
Abhishek Jain,
Pietro Perona,
Shivani Agarwal,
Junfeng Yang
Abstract:
High-resolution semantic segmentation requires substantial computational resources. Traditional approaches in the field typically downscale the input images before processing and then upscale the low-resolution outputs back to their original dimensions. While this strategy effectively identifies broad regions, it often misses finer details. In this study, we demonstrate that a streamlined model ca…
▽ More
High-resolution semantic segmentation requires substantial computational resources. Traditional approaches in the field typically downscale the input images before processing and then upscale the low-resolution outputs back to their original dimensions. While this strategy effectively identifies broad regions, it often misses finer details. In this study, we demonstrate that a streamlined model capable of directly producing high-resolution segmentations can match the performance of more complex systems that generate lower-resolution results. By simplifying the network architecture, we enable the processing of images at their native resolution. Our approach leverages a bottom-up information propagation technique across various scales, which we have empirically shown to enhance segmentation accuracy. We have rigorously tested our method using leading-edge semantic segmentation datasets. Specifically, for the Cityscapes dataset, we further boost accuracy by applying the Noisy Student Training technique.
△ Less
Submitted 7 February, 2024;
originally announced February 2024.
-
Progressive Gradient Flow for Robust N:M Sparsity Training in Transformers
Authors:
Abhimanyu Rajeshkumar Bambhaniya,
Amir Yazdanbakhsh,
Suvinay Subramanian,
Sheng-Chun Kao,
Shivani Agrawal,
Utku Evci,
Tushar Krishna
Abstract:
N:M Structured sparsity has garnered significant interest as a result of relatively modest overhead and improved efficiency. Additionally, this form of sparsity holds considerable appeal for reducing the memory footprint owing to their modest representation overhead. There have been efforts to develop training recipes for N:M structured sparsity, they primarily focus on low-sparsity regions (…
▽ More
N:M Structured sparsity has garnered significant interest as a result of relatively modest overhead and improved efficiency. Additionally, this form of sparsity holds considerable appeal for reducing the memory footprint owing to their modest representation overhead. There have been efforts to develop training recipes for N:M structured sparsity, they primarily focus on low-sparsity regions ($\sim$50\%). Nonetheless, performance of models trained using these approaches tends to decline when confronted with high-sparsity regions ($>$80\%). In this work, we study the effectiveness of existing sparse training recipes at \textit{high-sparsity regions} and argue that these methods fail to sustain the model quality on par with low-sparsity regions. We demonstrate that the significant factor contributing to this disparity is the presence of elevated levels of induced noise in the gradient magnitudes. To mitigate this undesirable effect, we employ decay mechanisms to progressively restrict the flow of gradients towards pruned elements. Our approach improves the model quality by up to 2$\%$ and 5$\%$ in vision and language models at high sparsity regime, respectively. We also evaluate the trade-off between model accuracy and training compute cost in terms of FLOPs. At iso-training FLOPs, our method yields better performance compared to conventional sparse training recipes, exhibiting an accuracy improvement of up to 2$\%$. The source code is available at https://github.com/abhibambhaniya/progressive_gradient_flow_nm_sparsity.
△ Less
Submitted 7 February, 2024;
originally announced February 2024.
-
Harm Amplification in Text-to-Image Models
Authors:
Susan Hao,
Renee Shelby,
Yuchi Liu,
Hansa Srinivasan,
Mukul Bhutani,
Burcu Karagol Ayan,
Ryan Poplin,
Shivani Poddar,
Sarah Laszlo
Abstract:
Text-to-image (T2I) models have emerged as a significant advancement in generative AI; however, there exist safety concerns regarding their potential to produce harmful image outputs even when users input seemingly safe prompts. This phenomenon, where T2I models generate harmful representations that were not explicit in the input prompt, poses a potentially greater risk than adversarial prompts, l…
▽ More
Text-to-image (T2I) models have emerged as a significant advancement in generative AI; however, there exist safety concerns regarding their potential to produce harmful image outputs even when users input seemingly safe prompts. This phenomenon, where T2I models generate harmful representations that were not explicit in the input prompt, poses a potentially greater risk than adversarial prompts, leaving users unintentionally exposed to harms. Our paper addresses this issue by formalizing a definition for this phenomenon which we term harm amplification. We further contribute to the field by developing a framework of methodologies to quantify harm amplification in which we consider the harm of the model output in the context of user input. We then empirically examine how to apply these different methodologies to simulate real-world deployment scenarios including a quantification of disparate impacts across genders resulting from harm amplification. Together, our work aims to offer researchers tools to comprehensively address safety challenges in T2I systems and contribute to the responsible deployment of generative AI models.
△ Less
Submitted 15 August, 2024; v1 submitted 1 February, 2024;
originally announced February 2024.
-
Multiclass Learning from Noisy Labels for Non-decomposable Performance Measures
Authors:
Mingyuan Zhang,
Shivani Agarwal
Abstract:
There has been much interest in recent years in learning good classifiers from data with noisy labels. Most work on learning from noisy labels has focused on standard loss-based performance measures. However, many machine learning problems require using non-decomposable performance measures which cannot be expressed as the expectation or sum of a loss on individual examples; these include for exam…
▽ More
There has been much interest in recent years in learning good classifiers from data with noisy labels. Most work on learning from noisy labels has focused on standard loss-based performance measures. However, many machine learning problems require using non-decomposable performance measures which cannot be expressed as the expectation or sum of a loss on individual examples; these include for example the H-mean, Q-mean and G-mean in class imbalance settings, and the Micro $F_1$ in information retrieval. In this paper, we design algorithms to learn from noisy labels for two broad classes of multiclass non-decomposable performance measures, namely, monotonic convex and ratio-of-linear, which encompass all the above examples. Our work builds on the Frank-Wolfe and Bisection based methods of Narasimhan et al. (2015). In both cases, we develop noise-corrected versions of the algorithms under the widely studied family of class-conditional noise models. We provide regret (excess risk) bounds for our algorithms, establishing that even though they are trained on noisy data, they are Bayes consistent in the sense that their performance converges to the optimal performance w.r.t. the clean (non-noisy) distribution. Our experiments demonstrate the effectiveness of our algorithms in handling label noise.
△ Less
Submitted 23 April, 2024; v1 submitted 1 February, 2024;
originally announced February 2024.
-
Harmonizing Code-mixed Conversations: Personality-assisted Code-mixed Response Generation in Dialogues
Authors:
Shivani Kumar,
Tanmoy Chakraborty
Abstract:
Code-mixing, the blending of multiple languages within a single conversation, introduces a distinctive challenge, particularly in the context of response generation. Capturing the intricacies of code-mixing proves to be a formidable task, given the wide-ranging variations influenced by individual speaking styles and cultural backgrounds. In this study, we explore response generation within code-mi…
▽ More
Code-mixing, the blending of multiple languages within a single conversation, introduces a distinctive challenge, particularly in the context of response generation. Capturing the intricacies of code-mixing proves to be a formidable task, given the wide-ranging variations influenced by individual speaking styles and cultural backgrounds. In this study, we explore response generation within code-mixed conversations. We introduce a novel approach centered on harnessing the Big Five personality traits acquired in an unsupervised manner from the conversations to bolster the performance of response generation. These inferred personality attributes are seamlessly woven into the fabric of the dialogue context, using a novel fusion mechanism, PA3. It uses an effective two-step attention formulation to fuse the dialogue and personality information. This fusion not only enhances the contextual relevance of generated responses but also elevates the overall performance of the model. Our experimental results, grounded in a dataset comprising of multi-party Hindi-English code-mix conversations, highlight the substantial advantages offered by personality-infused models over their conventional counterparts. This is evident in the increase observed in ROUGE and BLUE scores for the response generation task when the identified personality is seamlessly integrated into the dialogue context. Qualitative assessment for personality identification and response generation aligns well with our quantitative results.
△ Less
Submitted 18 January, 2024;
originally announced January 2024.
-
MedSumm: A Multimodal Approach to Summarizing Code-Mixed Hindi-English Clinical Queries
Authors:
Akash Ghosh,
Arkadeep Acharya,
Prince Jha,
Aniket Gaudgaul,
Rajdeep Majumdar,
Sriparna Saha,
Aman Chadha,
Raghav Jain,
Setu Sinha,
Shivani Agarwal
Abstract:
In the healthcare domain, summarizing medical questions posed by patients is critical for improving doctor-patient interactions and medical decision-making. Although medical data has grown in complexity and quantity, the current body of research in this domain has primarily concentrated on text-based methods, overlooking the integration of visual cues. Also prior works in the area of medical quest…
▽ More
In the healthcare domain, summarizing medical questions posed by patients is critical for improving doctor-patient interactions and medical decision-making. Although medical data has grown in complexity and quantity, the current body of research in this domain has primarily concentrated on text-based methods, overlooking the integration of visual cues. Also prior works in the area of medical question summarisation have been limited to the English language. This work introduces the task of multimodal medical question summarization for codemixed input in a low-resource setting. To address this gap, we introduce the Multimodal Medical Codemixed Question Summarization MMCQS dataset, which combines Hindi-English codemixed medical queries with visual aids. This integration enriches the representation of a patient's medical condition, providing a more comprehensive perspective. We also propose a framework named MedSumm that leverages the power of LLMs and VLMs for this task. By utilizing our MMCQS dataset, we demonstrate the value of integrating visual information from images to improve the creation of medically detailed summaries. This multimodal strategy not only improves healthcare decision-making but also promotes a deeper comprehension of patient queries, paving the way for future exploration in personalized and responsive medical care. Our dataset, code, and pre-trained models will be made publicly available.
△ Less
Submitted 3 January, 2024;
originally announced January 2024.
-
Gemini: A Family of Highly Capable Multimodal Models
Authors:
Gemini Team,
Rohan Anil,
Sebastian Borgeaud,
Jean-Baptiste Alayrac,
Jiahui Yu,
Radu Soricut,
Johan Schalkwyk,
Andrew M. Dai,
Anja Hauth,
Katie Millican,
David Silver,
Melvin Johnson,
Ioannis Antonoglou,
Julian Schrittwieser,
Amelia Glaese,
Jilin Chen,
Emily Pitler,
Timothy Lillicrap,
Angeliki Lazaridou,
Orhan Firat,
James Molloy,
Michael Isard,
Paul R. Barham,
Tom Hennigan,
Benjamin Lee
, et al. (1326 additional authors not shown)
Abstract:
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr…
▽ More
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.
△ Less
Submitted 9 May, 2025; v1 submitted 18 December, 2023;
originally announced December 2023.
-
USM-Lite: Quantization and Sparsity Aware Fine-tuning for Speech Recognition with Universal Speech Models
Authors:
Shaojin Ding,
David Qiu,
David Rim,
Yanzhang He,
Oleg Rybakov,
Bo Li,
Rohit Prabhavalkar,
Weiran Wang,
Tara N. Sainath,
Zhonglin Han,
Jian Li,
Amir Yazdanbakhsh,
Shivani Agrawal
Abstract:
End-to-end automatic speech recognition (ASR) models have seen revolutionary quality gains with the recent development of large-scale universal speech models (USM). However, deploying these massive USMs is extremely expensive due to the enormous memory usage and computational cost. Therefore, model compression is an important research topic to fit USM-based ASR under budget in real-world scenarios…
▽ More
End-to-end automatic speech recognition (ASR) models have seen revolutionary quality gains with the recent development of large-scale universal speech models (USM). However, deploying these massive USMs is extremely expensive due to the enormous memory usage and computational cost. Therefore, model compression is an important research topic to fit USM-based ASR under budget in real-world scenarios. In this study, we propose a USM fine-tuning approach for ASR, with a low-bit quantization and N:M structured sparsity aware paradigm on the model weights, reducing the model complexity from parameter precision and matrix topology perspectives. We conducted extensive experiments with a 2-billion parameter USM on a large-scale voice search dataset to evaluate our proposed method. A series of ablation studies validate the effectiveness of up to int4 quantization and 2:4 sparsity. However, a single compression technique fails to recover the performance well under extreme setups including int2 quantization and 1:4 sparsity. By contrast, our proposed method can compress the model to have 9.4% of the size, at the cost of only 7.3% relative word error rate (WER) regressions. We also provided in-depth analyses on the results and discussions on the limitations and potential solutions, which would be valuable for future studies.
△ Less
Submitted 16 January, 2024; v1 submitted 13 December, 2023;
originally announced December 2023.
-
Automated Detection and Counting of Windows using UAV Imagery based Remote Sensing
Authors:
Dhruv Patel,
Shivani Chepuri,
Sarvesh Thakur,
K. Harikumar,
Ravi Kiran S.,
K. Madhava Krishna
Abstract:
Despite the technological advancements in the construction and surveying sector, the inspection of salient features like windows in an under-construction or existing building is predominantly a manual process. Moreover, the number of windows present in a building is directly related to the magnitude of deformation it suffers under earthquakes. In this research, a method to accurately detect and co…
▽ More
Despite the technological advancements in the construction and surveying sector, the inspection of salient features like windows in an under-construction or existing building is predominantly a manual process. Moreover, the number of windows present in a building is directly related to the magnitude of deformation it suffers under earthquakes. In this research, a method to accurately detect and count the number of windows of a building by deploying an Unmanned Aerial Vehicle (UAV) based remote sensing system is proposed. The proposed two-stage method automates the identification and counting of windows by developing computer vision pipelines that utilize data from UAV's onboard camera and other sensors. Quantitative and Qualitative results show the effectiveness of our proposed approach in accurately detecting and counting the windows compared to the existing method.
△ Less
Submitted 24 November, 2023;
originally announced November 2023.
-
The Uli Dataset: An Exercise in Experience Led Annotation of oGBV
Authors:
Arnav Arora,
Maha Jinadoss,
Cheshta Arora,
Denny George,
Brindaalakshmi,
Haseena Dawood Khan,
Kirti Rawat,
Div,
Ritash,
Seema Mathur,
Shivani Yadav,
Shehla Rashid Shora,
Rie Raut,
Sumit Pawar,
Apurva Paithane,
Sonia,
Vivek,
Dharini Priscilla,
Khairunnisha,
Grace Banu,
Ambika Tandon,
Rishav Thakker,
Rahul Dev Korra,
Aatman Vaidya,
Tarunima Prabhakar
Abstract:
Online gender based violence has grown concomitantly with adoption of the internet and social media. Its effects are worse in the Global majority where many users use social media in languages other than English. The scale and volume of conversations on the internet has necessitated the need for automated detection of hate speech, and more specifically gendered abuse. There is, however, a lack of…
▽ More
Online gender based violence has grown concomitantly with adoption of the internet and social media. Its effects are worse in the Global majority where many users use social media in languages other than English. The scale and volume of conversations on the internet has necessitated the need for automated detection of hate speech, and more specifically gendered abuse. There is, however, a lack of language specific and contextual data to build such automated tools. In this paper we present a dataset on gendered abuse in three languages- Hindi, Tamil and Indian English. The dataset comprises of tweets annotated along three questions pertaining to the experience of gender abuse, by experts who identify as women or a member of the LGBTQIA community in South Asia. Through this dataset we demonstrate a participatory approach to creating datasets that drive AI systems.
△ Less
Submitted 24 June, 2024; v1 submitted 15 November, 2023;
originally announced November 2023.
-
Considerations for the Control Design of Augmentative Robots
Authors:
Shivani Guptasarma,
Monroe Kennedy III
Abstract:
Robotic systems that are intended to augment human capabilities commonly require the use of semi-autonomous control and artificial sensing, while at the same time aiming to empower the user to make decisions and take actions. This work identifies principles and techniques from the literature that can help to resolve this apparent contradiction. It is postulated that augmentative robots must functi…
▽ More
Robotic systems that are intended to augment human capabilities commonly require the use of semi-autonomous control and artificial sensing, while at the same time aiming to empower the user to make decisions and take actions. This work identifies principles and techniques from the literature that can help to resolve this apparent contradiction. It is postulated that augmentative robots must function as tools that have partial agency, as collaborative agents that provide conditional transparency, and ideally, serve as extensions of the human body.
△ Less
Submitted 31 October, 2023; v1 submitted 30 October, 2023;
originally announced October 2023.