-
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Authors:
Gheorghe Comanici,
Eric Bieber,
Mike Schaekermann,
Ice Pasupat,
Noveen Sachdeva,
Inderjit Dhillon,
Marcel Blistein,
Ori Ram,
Dan Zhang,
Evan Rosen,
Luke Marris,
Sam Petulla,
Colin Gaffney,
Asaf Aharoni,
Nathan Lintz,
Tiago Cardal Pais,
Henrik Jacobsson,
Idan Szpektor,
Nan-Jiang Jiang,
Krishna Haridasan,
Ahmed Omran,
Nikunj Saunshi,
Dara Bahri,
Gaurav Mishra,
Eric Chu
, et al. (3278 additional authors not shown)
Abstract:
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal unde…
▽ More
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
△ Less
Submitted 7 July, 2025;
originally announced July 2025.
-
Beyond Retrieval: Ensembling Cross-Encoders and GPT Rerankers with LLMs for Biomedical QA
Authors:
Shashank Verma,
Fengyi Jiang,
Xiangning Xue
Abstract:
Biomedical semantic question answering rooted in information retrieval can play a crucial role in keeping up to date with vast, rapidly evolving and ever-growing biomedical literature. A robust system can help researchers, healthcare professionals and even layman users access relevant knowledge grounded in evidence. The BioASQ 2025 Task13b Challenge serves as an important benchmark, offering a com…
▽ More
Biomedical semantic question answering rooted in information retrieval can play a crucial role in keeping up to date with vast, rapidly evolving and ever-growing biomedical literature. A robust system can help researchers, healthcare professionals and even layman users access relevant knowledge grounded in evidence. The BioASQ 2025 Task13b Challenge serves as an important benchmark, offering a competitive platform for advancement of this space. This paper presents the methodologies and results from our participation in this challenge where we built a Retrieval-Augmented Generation (RAG) system that can answer biomedical questions by retrieving relevant PubMed documents and snippets to generate answers. For the retrieval task, we generated dense embeddings from biomedical articles for initial retrieval, and applied an ensemble of finetuned cross-encoders and large language models (LLMs) for re-ranking to identify top relevant documents. Our solution achieved an MAP@10 of 0.1581, placing 10th on the leaderboard for the retrieval task. For answer generation, we employed few-shot prompting of instruction-tuned LLMs. Our system achieved macro-F1 score of 0.95 for yes/no questions (rank 12), Mean Reciprocal Rank (MRR) of 0.64 for factoid questions (rank 1), mean-F1 score of 0.63 for list questions (rank 5), and ROUGE-SU4 F1 score of 0.29 for ideal answers (rank 11).
△ Less
Submitted 7 July, 2025;
originally announced July 2025.
-
DMD-Net: Deep Mesh Denoising Network
Authors:
Aalok Gangopadhyay,
Shashikant Verma,
Shanmuganathan Raman
Abstract:
We present Deep Mesh Denoising Network (DMD-Net), an end-to-end deep learning framework, for solving the mesh denoising problem. DMD-Net consists of a Graph Convolutional Neural Network in which aggregation is performed in both the primal as well as the dual graph. This is realized in the form of an asymmetric two-stream network, which contains a primal-dual fusion block that enables communication…
▽ More
We present Deep Mesh Denoising Network (DMD-Net), an end-to-end deep learning framework, for solving the mesh denoising problem. DMD-Net consists of a Graph Convolutional Neural Network in which aggregation is performed in both the primal as well as the dual graph. This is realized in the form of an asymmetric two-stream network, which contains a primal-dual fusion block that enables communication between the primal-stream and the dual-stream. We develop a Feature Guided Transformer (FGT) paradigm, which consists of a feature extractor, a transformer, and a denoiser. The feature extractor estimates the local features, that guide the transformer to compute a transformation, which is applied to the noisy input mesh to obtain a useful intermediate representation. This is further processed by the denoiser to obtain the denoised mesh. Our network is trained on a large scale dataset of 3D objects. We perform exhaustive ablation studies to demonstrate that each component in our network is essential for obtaining the best performance. We show that our method obtains competitive or better results when compared with the state-of-the-art mesh denoising algorithms. We demonstrate that our method is robust to various kinds of noise. We observe that even in the presence of extremely high noise, our method achieves excellent performance.
△ Less
Submitted 28 June, 2025;
originally announced June 2025.
-
SemFaceEdit: Semantic Face Editing on Generative Radiance Manifolds
Authors:
Shashikant Verma,
Shanmuganathan Raman
Abstract:
Despite multiple view consistency offered by 3D-aware GAN techniques, the resulting images often lack the capacity for localized editing. In response, generative radiance manifolds emerge as an efficient approach for constrained point sampling within volumes, effectively reducing computational demands and enabling the learning of fine details. This work introduces SemFaceEdit, a novel method that…
▽ More
Despite multiple view consistency offered by 3D-aware GAN techniques, the resulting images often lack the capacity for localized editing. In response, generative radiance manifolds emerge as an efficient approach for constrained point sampling within volumes, effectively reducing computational demands and enabling the learning of fine details. This work introduces SemFaceEdit, a novel method that streamlines the appearance and geometric editing process by generating semantic fields on generative radiance manifolds. Utilizing latent codes, our method effectively disentangles the geometry and appearance associated with different facial semantics within the generated image. In contrast to existing methods that can change the appearance of the entire radiance field, our method enables the precise editing of particular facial semantics while preserving the integrity of other regions. Our network comprises two key modules: the Geometry module, which generates semantic radiance and occupancy fields, and the Appearance module, which is responsible for predicting RGB radiance. We jointly train both modules in adversarial settings to learn semantic-aware geometry and appearance descriptors. The appearance descriptors are then conditioned on their respective semantic latent codes by the Appearance Module, facilitating disentanglement and enhanced control. Our experiments highlight SemFaceEdit's superior performance in semantic field-based editing, particularly in achieving improved radiance field disentanglement.
△ Less
Submitted 28 June, 2025;
originally announced June 2025.
-
An Audio-centric Multi-task Learning Framework for Streaming Ads Targeting on Spotify
Authors:
Shivam Verma,
Vivian Chen,
Darren Mei
Abstract:
Spotify, a large-scale multimedia platform, attracts over 675 million monthly active users who collectively consume millions of hours of music, podcasts, audiobooks, and video content. This diverse content consumption pattern introduces unique challenges for computational advertising, which must effectively integrate a variety of ad modalities, including audio, video, and display, within a single…
▽ More
Spotify, a large-scale multimedia platform, attracts over 675 million monthly active users who collectively consume millions of hours of music, podcasts, audiobooks, and video content. This diverse content consumption pattern introduces unique challenges for computational advertising, which must effectively integrate a variety of ad modalities, including audio, video, and display, within a single user experience. Traditional ad recommendation models, primarily designed for foregrounded experiences, often struggle to reconcile the platform's inherent audio-centrality with the demands of optimizing ad performance across multiple formats and modalities. To overcome these challenges, we introduce Cross-modal Adaptive Mixture-of-Experts (CAMoE), a novel framework for optimizing click-through rate (CTR) prediction in both audio-centric and multi-modal settings. CAMoE enhances traditional mixture-of-experts models by incorporating modality-aware task grouping, adaptive loss masking, and deep-cross networks (DCN) to capture complex feature interactions within a multi-modal ad ecosystem. Through extensive ablation studies, we demonstrate that this approach achieves near Pareto-optimal performance across audio, video, and display ad formats, significantly improving AUC-PR compared to conventional single-task and content-based multi-task learning baselines. When deployed at scale on Spotify's ad serving platform, CAMoE delivered substantial gains, yielding a 14.5% increase in CTR for audio ads, a 1.3% increase for video ads, and a 4.8% reduction in expected cost-per-click (eCPC) for audio slots.
△ Less
Submitted 23 June, 2025;
originally announced June 2025.
-
DeepRAG: Integrating Hierarchical Reasoning and Process Supervision for Biomedical Multi-Hop QA
Authors:
Yuelyu Ji,
Hang Zhang,
Shiven Verma,
Hui Ji,
Chun Li,
Yushui Han,
Yanshan Wang
Abstract:
We propose DeepRAG, a novel framework that integrates DeepSeek hierarchical question decomposition capabilities with RAG Gym unified retrieval-augmented generation optimization using process level supervision. Targeting the challenging MedHopQA biomedical question answering task, DeepRAG systematically decomposes complex queries into precise sub-queries and employs concept level reward signals inf…
▽ More
We propose DeepRAG, a novel framework that integrates DeepSeek hierarchical question decomposition capabilities with RAG Gym unified retrieval-augmented generation optimization using process level supervision. Targeting the challenging MedHopQA biomedical question answering task, DeepRAG systematically decomposes complex queries into precise sub-queries and employs concept level reward signals informed by the UMLS ontology to enhance biomedical accuracy. Preliminary evaluations on the MedHopQA dataset indicate that DeepRAG significantly outperforms baseline models, including standalone DeepSeek and RAG Gym, achieving notable improvements in both Exact Match and concept level accuracy.
△ Less
Submitted 31 May, 2025;
originally announced June 2025.
-
OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities
Authors:
Sahil Verma,
Keegan Hines,
Jeff Bilmes,
Charlotte Siska,
Luke Zettlemoyer,
Hila Gonen,
Chandan Singh
Abstract:
The emerging capabilities of large language models (LLMs) have sparked concerns about their immediate potential for harmful misuse. The core approach to mitigate these concerns is the detection of harmful queries to the model. Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-res…
▽ More
The emerging capabilities of large language models (LLMs) have sparked concerns about their immediate potential for harmful misuse. The core approach to mitigate these concerns is the detection of harmful queries to the model. Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-resource languages or prompts provided in non-text modalities such as image and audio). To tackle this challenge, we propose OMNIGUARD, an approach for detecting harmful prompts across languages and modalities. Our approach (i) identifies internal representations of an LLM/MLLM that are aligned across languages or modalities and then (ii) uses them to build a language-agnostic or modality-agnostic classifier for detecting harmful prompts. OMNIGUARD improves harmful prompt classification accuracy by 11.57\% over the strongest baseline in a multilingual setting, by 20.44\% for image-based prompts, and sets a new SOTA for audio-based prompts. By repurposing embeddings computed during generation, OMNIGUARD is also very efficient ($\approx 120 \times$ faster than the next fastest baseline). Code and data are available at: https://github.com/vsahil/OmniGuard.
△ Less
Submitted 29 May, 2025;
originally announced May 2025.
-
Deterministic Suffix-reading Automata
Authors:
R Keerthan,
B Srivathsan,
R Venkatesh,
Sagar Verma
Abstract:
We introduce deterministic suffix-reading automata (DSA), a new automaton model over finite words. Transitions in a DSA are labeled with words. From a state, a DSA triggers an outgoing transition on seeing a word ending with the transition's label. Therefore, rather than moving along an input word letter by letter, a DSA can jump along blocks of letters, with each block ending in a suitable suffix…
▽ More
We introduce deterministic suffix-reading automata (DSA), a new automaton model over finite words. Transitions in a DSA are labeled with words. From a state, a DSA triggers an outgoing transition on seeing a word ending with the transition's label. Therefore, rather than moving along an input word letter by letter, a DSA can jump along blocks of letters, with each block ending in a suitable suffix. This feature allows DSAs to recognize regular languages more concisely, compared to DFAs. In this work, we focus on questions around finding a minimal DSA for a regular language. The number of states is not a faithful measure of the size of a DSA, since the transition-labels contain strings of arbitrary length. Hence, we consider total-size (number of states + number of edges + total length of transition-labels) as the size measure of DSAs.
We start by formally defining the model and providing a DSA-to-DFA conversion that allows to compare the expressiveness and succinctness of DSA with related automata models. Our main technical contribution is a method to derive DSAs from a given DFA: a DFA-to-DSA conversion. We make a surprising observation that the smallest DSA derived from the canonical DFA of a regular language L need not be a minimal DSA for L. This observation leads to a fundamental bottleneck in deriving a minimal DSA for a regular language. In fact, we prove that given a DFA and a number k, the problem of deciding if there exists an equivalent DSA of total-size atmost k is NP-complete.
△ Less
Submitted 19 May, 2025; v1 submitted 14 May, 2025;
originally announced May 2025.
-
GRAIL: Graph Edit Distance and Node Alignment Using LLM-Generated Code
Authors:
Samidha Verma,
Arushi Goyal,
Ananya Mathur,
Ankit Anand,
Sayan Ranu
Abstract:
Graph Edit Distance (GED) is a widely used metric for measuring similarity between two graphs. Computing the optimal GED is NP-hard, leading to the development of various neural and non-neural heuristics. While neural methods have achieved improved approximation quality compared to non-neural approaches, they face significant challenges: (1) They require large amounts of ground truth data, which i…
▽ More
Graph Edit Distance (GED) is a widely used metric for measuring similarity between two graphs. Computing the optimal GED is NP-hard, leading to the development of various neural and non-neural heuristics. While neural methods have achieved improved approximation quality compared to non-neural approaches, they face significant challenges: (1) They require large amounts of ground truth data, which is itself NP-hard to compute. (2) They operate as black boxes, offering limited interpretability. (3) They lack cross-domain generalization, necessitating expensive retraining for each new dataset. We address these limitations with GRAIL, introducing a paradigm shift in this domain. Instead of training a neural model to predict GED, GRAIL employs a novel combination of large language models (LLMs) and automated prompt tuning to generate a program that is used to compute GED. This shift from predicting GED to generating programs imparts various advantages, including end-to-end interpretability and an autonomous self-evolutionary learning mechanism without ground-truth supervision. Extensive experiments on seven datasets confirm that GRAIL not only surpasses state-of-the-art GED approximation methods in prediction quality but also achieves robust cross-domain generalization across diverse graph distributions.
△ Less
Submitted 4 May, 2025;
originally announced May 2025.
-
CodeSSM: Towards State Space Models for Code Understanding
Authors:
Shweta Verma,
Abhinav Anand,
Mira Mezini
Abstract:
Although transformers are widely used for various code-specific tasks, they have some significant limitations. In this paper, we investigate State Space Models (SSMs) as a potential alternative to transformers for code understanding tasks, such as code retrieval, classification, and clone detection. Previous research has already demonstrated that SSMs are more compute-efficient than transformers.…
▽ More
Although transformers are widely used for various code-specific tasks, they have some significant limitations. In this paper, we investigate State Space Models (SSMs) as a potential alternative to transformers for code understanding tasks, such as code retrieval, classification, and clone detection. Previous research has already demonstrated that SSMs are more compute-efficient than transformers. In our work, we show that SSMs are also more sample-efficient and can effectively extrapolate to longer contexts (beyond the pretraining context) during fine-tuning. Through comprehensive experiments, we demonstrate that SSMs could serve as a viable alternative to transformers for code understanding tasks, while addressing some of the major limitations associated with transformers.
△ Less
Submitted 21 May, 2025; v1 submitted 2 May, 2025;
originally announced May 2025.
-
RAIL in the Wild: Operationalizing Responsible AI Evaluation Using Anthropic's Value Dataset
Authors:
Sumit Verma,
Pritam Prasun,
Arpit Jaiswal,
Pritish Kumar
Abstract:
As AI systems become embedded in real-world applications, ensuring they meet ethical standards is crucial. While existing AI ethics frameworks emphasize fairness, transparency, and accountability, they often lack actionable evaluation methods. This paper introduces a systematic approach using the Responsible AI Labs (RAIL) framework, which includes eight measurable dimensions to assess the normati…
▽ More
As AI systems become embedded in real-world applications, ensuring they meet ethical standards is crucial. While existing AI ethics frameworks emphasize fairness, transparency, and accountability, they often lack actionable evaluation methods. This paper introduces a systematic approach using the Responsible AI Labs (RAIL) framework, which includes eight measurable dimensions to assess the normative behavior of large language models (LLMs). We apply this framework to Anthropic's "Values in the Wild" dataset, containing over 308,000 anonymized conversations with Claude and more than 3,000 annotated value expressions. Our study maps these values to RAIL dimensions, computes synthetic scores, and provides insights into the ethical behavior of LLMs in real-world use.
△ Less
Submitted 30 April, 2025;
originally announced May 2025.
-
Enhanced Question-Answering for Skill-based learning using Knowledge-based AI and Generative AI
Authors:
Rahul K. Dass,
Rochan H. Madhusudhana,
Erin C. Deye,
Shashank Verma,
Timothy A. Bydlon,
Grace Brazil,
Ashok K. Goel
Abstract:
Supporting learners' understanding of taught skills in online settings is a longstanding challenge. While exercises and chat-based agents can evaluate understanding in limited contexts, this challenge is magnified when learners seek explanations that delve into procedural knowledge (how things are done) and reasoning (why things happen). We hypothesize that an intelligent agent's ability to unders…
▽ More
Supporting learners' understanding of taught skills in online settings is a longstanding challenge. While exercises and chat-based agents can evaluate understanding in limited contexts, this challenge is magnified when learners seek explanations that delve into procedural knowledge (how things are done) and reasoning (why things happen). We hypothesize that an intelligent agent's ability to understand and explain learners' questions about skills can be significantly enhanced using the TMK (Task-Method-Knowledge) model, a Knowledge-based AI framework. We introduce Ivy, an intelligent agent that leverages an LLM and iterative refinement techniques to generate explanations that embody teleological, causal, and compositional principles. Our initial evaluation demonstrates that this approach goes beyond the typical shallow responses produced by an agent with access to unstructured text, thereby substantially improving the depth and relevance of feedback. This can potentially ensure learners develop a comprehensive understanding of skills crucial for effective problem-solving in online environments.
△ Less
Submitted 10 April, 2025;
originally announced April 2025.
-
Right Prediction, Wrong Reasoning: Uncovering LLM Misalignment in RA Disease Diagnosis
Authors:
Umakanta Maharana,
Sarthak Verma,
Avarna Agarwal,
Prakashini Mruthyunjaya,
Dwarikanath Mahapatra,
Sakir Ahmed,
Murari Mandal
Abstract:
Large language models (LLMs) offer a promising pre-screening tool, improving early disease detection and providing enhanced healthcare access for underprivileged communities. The early diagnosis of various diseases continues to be a significant challenge in healthcare, primarily due to the nonspecific nature of early symptoms, the shortage of expert medical practitioners, and the need for prolonge…
▽ More
Large language models (LLMs) offer a promising pre-screening tool, improving early disease detection and providing enhanced healthcare access for underprivileged communities. The early diagnosis of various diseases continues to be a significant challenge in healthcare, primarily due to the nonspecific nature of early symptoms, the shortage of expert medical practitioners, and the need for prolonged clinical evaluations, all of which can delay treatment and adversely affect patient outcomes. With impressive accuracy in prediction across a range of diseases, LLMs have the potential to revolutionize clinical pre-screening and decision-making for various medical conditions. In this work, we study the diagnostic capability of LLMs for Rheumatoid Arthritis (RA) with real world patients data. Patient data was collected alongside diagnoses from medical experts, and the performance of LLMs was evaluated in comparison to expert diagnoses for RA disease prediction. We notice an interesting pattern in disease diagnosis and find an unexpected \textit{misalignment between prediction and explanation}. We conduct a series of multi-round analyses using different LLM agents. The best-performing model accurately predicts rheumatoid arthritis (RA) diseases approximately 95\% of the time. However, when medical experts evaluated the reasoning generated by the model, they found that nearly 68\% of the reasoning was incorrect. This study highlights a clear misalignment between LLMs high prediction accuracy and its flawed reasoning, raising important questions about relying on LLM explanations in clinical settings. \textbf{LLMs provide incorrect reasoning to arrive at the correct answer for RA disease diagnosis.}
△ Less
Submitted 9 April, 2025;
originally announced April 2025.
-
Rubric Is All You Need: Enhancing LLM-based Code Evaluation With Question-Specific Rubrics
Authors:
Aditya Pathak,
Rachit Gandhi,
Vaibhav Uttam,
Devansh,
Yashwanth Nakka,
Aaryan Raj Jindal,
Pratyush Ghosh,
Arnav Ramamoorthy,
Shreyash Verma,
Aditya Mittal,
Aashna Ased,
Chirag Khatri,
Jagat Sesh Challa,
Dhruv Kumar
Abstract:
Since the emergence of Large Language Models (LLMs) popularized by the release of GPT-3 and ChatGPT, LLMs have shown remarkable promise in programming-related tasks. While code generation using LLMs has become a popular field of research, code evaluation using LLMs remains under-explored. In this paper, we focus on LLM-based code evaluation and attempt to fill in the existing gaps. We propose mult…
▽ More
Since the emergence of Large Language Models (LLMs) popularized by the release of GPT-3 and ChatGPT, LLMs have shown remarkable promise in programming-related tasks. While code generation using LLMs has become a popular field of research, code evaluation using LLMs remains under-explored. In this paper, we focus on LLM-based code evaluation and attempt to fill in the existing gaps. We propose multi-agentic novel approaches using \emph{question-specific rubrics} tailored to the problem statement, arguing that these perform better for logical assessment than the existing approaches that use \emph{question-agnostic rubrics}. To address the lack of suitable evaluation datasets, we introduce two datasets: a Data Structures and Algorithms dataset containing 150 student submissions from a popular Data Structures and Algorithms practice website, and an Object Oriented Programming dataset comprising 80 student submissions from undergraduate computer science courses. In addition to using standard metrics (Spearman Correlation, Cohen's Kappa), we additionally propose a new metric called as Leniency, which quantifies evaluation strictness relative to expert assessment. Our comprehensive analysis demonstrates that \emph{question-specific rubrics} significantly enhance logical assessment of code in educational settings, providing better feedback aligned with instructional goals beyond mere syntactic correctness.
△ Less
Submitted 22 June, 2025; v1 submitted 31 March, 2025;
originally announced March 2025.
-
Intelligent IoT Attack Detection Design via ODLLM with Feature Ranking-based Knowledge Base
Authors:
Satvik Verma,
Qun Wang,
E. Wes Bethel
Abstract:
The widespread adoption of Internet of Things (IoT) devices has introduced significant cybersecurity challenges, particularly with the increasing frequency and sophistication of Distributed Denial of Service (DDoS) attacks. Traditional machine learning (ML) techniques often fall short in detecting such attacks due to the complexity of blended and evolving patterns. To address this, we propose a no…
▽ More
The widespread adoption of Internet of Things (IoT) devices has introduced significant cybersecurity challenges, particularly with the increasing frequency and sophistication of Distributed Denial of Service (DDoS) attacks. Traditional machine learning (ML) techniques often fall short in detecting such attacks due to the complexity of blended and evolving patterns. To address this, we propose a novel framework leveraging On-Device Large Language Models (ODLLMs) augmented with fine-tuning and knowledge base (KB) integration for intelligent IoT network attack detection. By implementing feature ranking techniques and constructing both long and short KBs tailored to model capacities, the proposed framework ensures efficient and accurate detection of DDoS attacks while overcoming computational and privacy limitations. Simulation results demonstrate that the optimized framework achieves superior accuracy across diverse attack types, especially when using compact models in edge computing environments. This work provides a scalable and secure solution for real-time IoT security, advancing the applicability of edge intelligence in cybersecurity.
△ Less
Submitted 27 March, 2025;
originally announced March 2025.
-
STEP: Simultaneous Tracking and Estimation of Pose for Animals and Humans
Authors:
Shashikant Verma,
Harish Katti,
Soumyaratna Debnath,
Yamuna Swamy,
Shanmuganathan Raman
Abstract:
We introduce STEP, a novel framework utilizing Transformer-based discriminative model prediction for simultaneous tracking and estimation of pose across diverse animal species and humans. We are inspired by the fact that the human brain exploits spatiotemporal continuity and performs concurrent localization and pose estimation despite the specialization of brain areas for form and motion processin…
▽ More
We introduce STEP, a novel framework utilizing Transformer-based discriminative model prediction for simultaneous tracking and estimation of pose across diverse animal species and humans. We are inspired by the fact that the human brain exploits spatiotemporal continuity and performs concurrent localization and pose estimation despite the specialization of brain areas for form and motion processing. Traditional discriminative models typically require predefined target states for determining model weights, a challenge we address through Gaussian Map Soft Prediction (GMSP) and Offset Map Regression Adapter (OMRA) Modules. These modules remove the necessity of keypoint target states as input, streamlining the process. Our method starts with a known target state in the initial frame of a given video sequence. It then seamlessly tracks the target and estimates keypoints of anatomical importance as output for subsequent frames. Unlike prevalent top-down pose estimation methods, our approach doesn't rely on per-frame target detections due to its tracking capability. This facilitates a significant advancement in inference efficiency and potential applications. We train and validate our approach on datasets encompassing diverse species. Our experiments demonstrate superior results compared to existing methods, opening doors to various applications, including but not limited to action recognition and behavioral analysis.
△ Less
Submitted 20 March, 2025; v1 submitted 17 March, 2025;
originally announced March 2025.
-
Connected Partitions via Connected Dominating Sets
Authors:
Aikaterini Niklanovits,
Kirill Simonov,
Shaily Verma,
Ziena Zeif
Abstract:
The classical theorem due to Győri and Lovász states that any $k$-connected graph $G$ admits a partition into $k$ connected subgraphs, where each subgraph has a prescribed size and contains a prescribed vertex, as long as the total size of target subgraphs is equal to the size of $G$. However, this result is notoriously evasive in terms of efficient constructions, and it is still unknown whether s…
▽ More
The classical theorem due to Győri and Lovász states that any $k$-connected graph $G$ admits a partition into $k$ connected subgraphs, where each subgraph has a prescribed size and contains a prescribed vertex, as long as the total size of target subgraphs is equal to the size of $G$. However, this result is notoriously evasive in terms of efficient constructions, and it is still unknown whether such a partition can be computed in polynomial time, even for $k = 5$.
We make progress towards an efficient constructive version of the Győri--Lovász theorem by considering a natural strengthening of the $k$-connectivity requirement. Specifically, we show that the desired connected partition can be found in polynomial time, if $G$ contains $k$ disjoint connected dominating sets. As a consequence of this result, we give several efficient approximate and exact constructive versions of the original Győri--Lovász theorem:
1. On general graphs, a Győri--Lovász partition with $k$ parts can be computed in polynomial time when the input graph has connectivity $Ω(k \cdot \log^2 n)$;
2. On convex bipartite graphs, connectivity of $4k$ is sufficient;
3. On biconvex graphs and interval graphs, connectivity of $k$ is sufficient, meaning that our algorithm gives a ``true'' constructive version of the theorem on these graph classes.
△ Less
Submitted 15 May, 2025; v1 submitted 17 March, 2025;
originally announced March 2025.
-
SegDesicNet: Lightweight Semantic Segmentation in Remote Sensing with Geo-Coordinate Embeddings for Domain Adaptation
Authors:
Sachin Verma,
Frank Lindseth,
Gabriel Kiss
Abstract:
Semantic segmentation is essential for analyzing highdefinition remote sensing images (HRSIs) because it allows the precise classification of objects and regions at the pixel level. However, remote sensing data present challenges owing to geographical location, weather, and environmental variations, making it difficult for semantic segmentation models to generalize across diverse scenarios. Existi…
▽ More
Semantic segmentation is essential for analyzing highdefinition remote sensing images (HRSIs) because it allows the precise classification of objects and regions at the pixel level. However, remote sensing data present challenges owing to geographical location, weather, and environmental variations, making it difficult for semantic segmentation models to generalize across diverse scenarios. Existing methods are often limited to specific data domains and require expert annotators and specialized equipment for semantic labeling. In this study, we propose a novel unsupervised domain adaptation technique for remote sensing semantic segmentation by utilizing geographical coordinates that are readily accessible in remote sensing setups as metadata in a dataset. To bridge the domain gap, we propose a novel approach that considers the combination of an imageś location encoding trait and the spherical nature of Earthś surface. Our proposed SegDesicNet module regresses the GRID positional encoding of the geo coordinates projected over the unit sphere to obtain the domain loss. Our experimental results demonstrate that the proposed SegDesicNet outperforms state of the art domain adaptation methods in remote sensing image segmentation, achieving an improvement of approximately ~6% in the mean intersection over union (MIoU) with a ~ 27\% drop in parameter count on benchmarked subsets of the publicly available FLAIR #1 dataset. We also benchmarked our method performance on the custom split of the ISPRS Potsdam dataset. Our algorithm seeks to reduce the modeling disparity between artificial neural networks and human comprehension of the physical world, making the technology more human centric and scalable.
△ Less
Submitted 11 March, 2025;
originally announced March 2025.
-
Qubit-Based Framework for Quantum Machine Learning: Bridging Classical Data and Quantum Algorithms
Authors:
Bhavna Bose,
Saurav Verma
Abstract:
This paper dives into the exciting and rapidly growing field of quantum computing, explaining its core ideas, current progress, and how it could revolutionize the way we solve complex problems. It starts by breaking down the basics, like qubits, quantum circuits, and how principles like superposition and entanglement make quantum computers fundamentally different-and far more powerful for certain…
▽ More
This paper dives into the exciting and rapidly growing field of quantum computing, explaining its core ideas, current progress, and how it could revolutionize the way we solve complex problems. It starts by breaking down the basics, like qubits, quantum circuits, and how principles like superposition and entanglement make quantum computers fundamentally different-and far more powerful for certain tasks-than the classical computers we use today. We also explore how quantum computing deals with complex problems and why it is uniquely suited for challenges classical systems struggle to handle. A big part of this paper focuses on Quantum Machine Learning (QML), where the strengths of quantum computing meet the world of artificial intelligence. By processing massive datasets and optimizing intricate algorithms, quantum systems offer new possibilities for machine learning. We highlight different approaches to combining quantum and classical computing, showing how they can work together to produce faster and more accurate results. Additionally, we explore the tools and platforms available-like TensorFlow Quantum, Qiskit and PennyLane-that are helping researchers and developers bring these theories to life. Of course, quantum computing has its hurdles. Challenges like scaling up hardware, correcting errors, and keeping qubits stable are significant roadblocks. Yet, with rapid advancements in cloud-based platforms and innovative technologies, the potential of quantum computing feels closer than ever. This paper aims to offer readers a clear and comprehensive introduction to quantum computing, its role in machine learning, and the immense possibilities it holds for the future of technology.
△ Less
Submitted 17 February, 2025;
originally announced February 2025.
-
Navigating the Social Welfare Frontier: Portfolios for Multi-objective Reinforcement Learning
Authors:
Cheol Woo Kim,
Jai Moondra,
Shresth Verma,
Madeleine Pollack,
Lingkai Kong,
Milind Tambe,
Swati Gupta
Abstract:
In many real-world applications of reinforcement learning (RL), deployed policies have varied impacts on different stakeholders, creating challenges in reaching consensus on how to effectively aggregate their preferences. Generalized $p$-means form a widely used class of social welfare functions for this purpose, with broad applications in fair resource allocation, AI alignment, and decision-makin…
▽ More
In many real-world applications of reinforcement learning (RL), deployed policies have varied impacts on different stakeholders, creating challenges in reaching consensus on how to effectively aggregate their preferences. Generalized $p$-means form a widely used class of social welfare functions for this purpose, with broad applications in fair resource allocation, AI alignment, and decision-making. This class includes well-known welfare functions such as Egalitarian, Nash, and Utilitarian welfare. However, selecting the appropriate social welfare function is challenging for decision-makers, as the structure and outcomes of optimal policies can be highly sensitive to the choice of $p$. To address this challenge, we study the concept of an $α$-approximate portfolio in RL, a set of policies that are approximately optimal across the family of generalized $p$-means for all $p \in [-\infty, 1]$. We propose algorithms to compute such portfolios and provide theoretical guarantees on the trade-offs among approximation factor, portfolio size, and computational efficiency. Experimental results on synthetic and real-world datasets demonstrate the effectiveness of our approach in summarizing the policy space induced by varying $p$ values, empowering decision-makers to navigate this landscape more effectively.
△ Less
Submitted 13 February, 2025;
originally announced February 2025.
-
A Parameterized Study of Secluded Structures in Directed Graphs
Authors:
Nadym Mallek,
Jonas Schmidt,
Shaily Verma
Abstract:
Given an undirected graph $G$ and an integer $k$, the Secluded $Π$-Subgraph problem asks you to find a maximum size induced subgraph that satisfies a property $Π$ and has at most $k$ neighbors in the rest of the graph. This problem has been extensively studied; however, there is no prior study of the problem in directed graphs. This question has been mentioned by Jansen et al. [ISAAC'23].
In thi…
▽ More
Given an undirected graph $G$ and an integer $k$, the Secluded $Π$-Subgraph problem asks you to find a maximum size induced subgraph that satisfies a property $Π$ and has at most $k$ neighbors in the rest of the graph. This problem has been extensively studied; however, there is no prior study of the problem in directed graphs. This question has been mentioned by Jansen et al. [ISAAC'23].
In this paper, we initiate the study of Secluded Subgraph problem in directed graphs by incorporating different notions of neighborhoods: in-neighborhood, out-neighborhood, and their union. Formally, we call these problems {\{In, Out, Total\}-Secluded $Π$-Subgraph, where given a directed graph $G$ and integers $k$, we want to find an induced subgraph satisfying $Π$ of maximum size that has at most $k$ in/out/total-neighbors in the rest of the graph, respectively. We investigate the parameterized complexity of these problems for different properties $Π$. In particular, we prove the following parameterized results: - We design an FPT algorithm for the Total-Secluded Strongly Connected Subgraph problem when parameterized by $k$. - We show that the In/Out-Secluded $\mathcal{F}$-Free Subgraph problem with parameter $k+w$ is W[1]-hard, where $\mathcal{F}$ is a family of directed graphs except any subgraph of a star graph whose edges are directed towards the center. This result also implies that In/Out-Secluded DAG is W[1]-hard, unlike the undirected variants of the two problems, which are FPT. - We design an FPT-algorithm for In/Out/Total-Secluded $α$-Bounded Subgraph when parameterized by $k$, where $α$-bounded graphs are a superclass of tournaments. - For undirected graphs, we improve the best-known FPT algorithm for Secluded Clique by providing a faster FPT algorithm that runs in time $1.6181^kn^{\mathcal{O}(1)}$.
△ Less
Submitted 9 February, 2025;
originally announced February 2025.
-
Humanity's Last Exam
Authors:
Long Phan,
Alice Gatti,
Ziwen Han,
Nathaniel Li,
Josephina Hu,
Hugh Zhang,
Chen Bo Calvin Zhang,
Mohamed Shaaban,
John Ling,
Sean Shi,
Michael Choi,
Anish Agrawal,
Arnav Chopra,
Adam Khoja,
Ryan Kim,
Richard Ren,
Jason Hausenloy,
Oliver Zhang,
Mantas Mazeika,
Dmitry Dodonov,
Tung Nguyen,
Jaeho Lee,
Daron Anderson,
Mikhail Doroshenko,
Alun Cennyth Stokes
, et al. (1084 additional authors not shown)
Abstract:
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of…
▽ More
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
△ Less
Submitted 19 April, 2025; v1 submitted 24 January, 2025;
originally announced January 2025.
-
Measuring Fairness in Financial Transaction Machine Learning Models
Authors:
Deniz Sezin Ayvaz,
Lorenzo Belenguer,
Hankun He,
Deborah Dormah Kanubala,
Mingxu Li,
Soung Low,
Carlos Mougan,
Faithful Chiagoziem Onwuegbuche,
Yulu Pi,
Natalia Sikora,
Dan Tran,
Shresth Verma,
Hanzhi Wang,
Skyler Xie,
Adeline Pelletier
Abstract:
Mastercard, a global leader in financial services, develops and deploys machine learning models aimed at optimizing card usage and preventing attrition through advanced predictive models. These models use aggregated and anonymized card usage patterns, including cross-border transactions and industry-specific spending, to tailor bank offerings and maximize revenue opportunities. Mastercard has esta…
▽ More
Mastercard, a global leader in financial services, develops and deploys machine learning models aimed at optimizing card usage and preventing attrition through advanced predictive models. These models use aggregated and anonymized card usage patterns, including cross-border transactions and industry-specific spending, to tailor bank offerings and maximize revenue opportunities. Mastercard has established an AI Governance program, based on its Data and Tech Responsibility Principles, to evaluate any built and bought AI for efficacy, fairness, and transparency. As part of this effort, Mastercard has sought expertise from the Turing Institute through a Data Study Group to better assess fairness in more complex AI/ML models. The Data Study Group challenge lies in defining, measuring, and mitigating fairness in these predictions, which can be complex due to the various interpretations of fairness, gaps in the research literature, and ML-operations challenges.
△ Less
Submitted 22 January, 2025; v1 submitted 18 January, 2025;
originally announced January 2025.
-
L3D-Pose: Lifting Pose for 3D Avatars from a Single Camera in the Wild
Authors:
Soumyaratna Debnath,
Harish Katti,
Shashikant Verma,
Shanmuganathan Raman
Abstract:
While 2D pose estimation has advanced our ability to interpret body movements in animals and primates, it is limited by the lack of depth information, constraining its application range. 3D pose estimation provides a more comprehensive solution by incorporating spatial depth, yet creating extensive 3D pose datasets for animals is challenging due to their dynamic and unpredictable behaviours in nat…
▽ More
While 2D pose estimation has advanced our ability to interpret body movements in animals and primates, it is limited by the lack of depth information, constraining its application range. 3D pose estimation provides a more comprehensive solution by incorporating spatial depth, yet creating extensive 3D pose datasets for animals is challenging due to their dynamic and unpredictable behaviours in natural settings. To address this, we propose a hybrid approach that utilizes rigged avatars and the pipeline to generate synthetic datasets to acquire the necessary 3D annotations for training. Our method introduces a simple attention-based MLP network for converting 2D poses to 3D, designed to be independent of the input image to ensure scalability for poses in natural environments. Additionally, we identify that existing anatomical keypoint detectors are insufficient for accurate pose retargeting onto arbitrary avatars. To overcome this, we present a lookup table based on a deep pose estimation method using a synthetic collection of diverse actions rigged avatars perform. Our experiments demonstrate the effectiveness and efficiency of this lookup table-based retargeting approach. Overall, we propose a comprehensive framework with systematically synthesized datasets for lifting poses from 2D to 3D and then utilize this to re-target motion from wild settings onto arbitrary avatars.
△ Less
Submitted 2 January, 2025;
originally announced January 2025.
-
COBRA: COmBinatorial Retrieval Augmentation for Few-Shot Adaptation
Authors:
Arnav M. Das,
Gantavya Bhatt,
Lilly Kumari,
Sahil Verma,
Jeff Bilmes
Abstract:
Retrieval augmentation, the practice of retrieving additional data from large auxiliary pools, has emerged as an effective technique for enhancing model performance in the low-data regime. Prior approaches have employed only nearest-neighbor based strategies for data selection, which retrieve auxiliary samples with high similarity to instances in the target task. However, these approaches are pron…
▽ More
Retrieval augmentation, the practice of retrieving additional data from large auxiliary pools, has emerged as an effective technique for enhancing model performance in the low-data regime. Prior approaches have employed only nearest-neighbor based strategies for data selection, which retrieve auxiliary samples with high similarity to instances in the target task. However, these approaches are prone to selecting highly redundant samples, since they fail to incorporate any notion of diversity. In our work, we first demonstrate that data selection strategies used in prior retrieval-augmented few-shot adaptation settings can be generalized using a class of functions known as Combinatorial Mutual Information (CMI) measures. We then propose COBRA (COmBinatorial Retrieval Augmentation), which employs an alternative CMI measure that considers both diversity and similarity to a target dataset. COBRA consistently outperforms previous retrieval approaches across image classification tasks and few-shot learning techniques when used to retrieve samples from LAION-2B. COBRA introduces negligible computational overhead to the cost of retrieval while providing significant gains in downstream model performance.
△ Less
Submitted 28 March, 2025; v1 submitted 23 December, 2024;
originally announced December 2024.
-
RouteNet-Fermi: Network Modeling With GNN (Analysis And Re-implementation)
Authors:
Shourya Verma,
Simran Kadadi,
Swathi Jayaprakash,
Arpan Kumar Mahapatra,
Ishaan Jain
Abstract:
Network performance modeling presents important challenges in modern computer networks due to increasing complexity, scale, and diverse traffic patterns. While traditional approaches like queuing theory and packet-level simulation have served as foundational tools, they face limitations in modeling complex traffic behaviors and scaling to large networks. This project presents an extended implement…
▽ More
Network performance modeling presents important challenges in modern computer networks due to increasing complexity, scale, and diverse traffic patterns. While traditional approaches like queuing theory and packet-level simulation have served as foundational tools, they face limitations in modeling complex traffic behaviors and scaling to large networks. This project presents an extended implementation of RouteNet-Fermi, a Graph Neural Network (GNN) architecture designed for network performance prediction, with additional recurrent neural network variants. We improve the the original architecture by implementing Long Short-Term Memory (LSTM) cells and Recurrent Neural Network (RNN) cells alongside the existing Gated Recurrent Unit (GRU) cells implementation. This work contributes to the understanding of recurrent neural architectures in GNN-based network modeling and provides a flexible framework for future experimentation with different cell types.
△ Less
Submitted 7 December, 2024;
originally announced December 2024.
-
Cross Domain Adaptation using Adversarial networks with Cyclic loss
Authors:
Manpreet Kaur,
Ankur Tomar,
Srijan Mishra,
Shashwat Verma
Abstract:
Deep Learning methods are highly local and sensitive to the domain of data they are trained with. Even a slight deviation from the domain distribution affects prediction accuracy of deep networks significantly. In this work, we have investigated a set of techniques aimed at increasing accuracy of generator networks which perform translation from one domain to the other in an adversarial setting. I…
▽ More
Deep Learning methods are highly local and sensitive to the domain of data they are trained with. Even a slight deviation from the domain distribution affects prediction accuracy of deep networks significantly. In this work, we have investigated a set of techniques aimed at increasing accuracy of generator networks which perform translation from one domain to the other in an adversarial setting. In particular, we experimented with activations, the encoder-decoder network architectures, and introduced a Loss called cyclic loss to constrain the Generator network so that it learns effective source-target translation. This machine learning problem is motivated by myriad applications that can be derived from domain adaptation networks like generating labeled data from synthetic inputs in an unsupervised fashion, and using these translation network in conjunction with the original domain network to generalize deep learning networks across domains.
△ Less
Submitted 2 December, 2024;
originally announced December 2024.
-
MILU: A Multi-task Indic Language Understanding Benchmark
Authors:
Sshubam Verma,
Mohammed Safi Ur Rahman Khan,
Vishwajeet Kumar,
Rudra Murthy,
Jaydeep Sen
Abstract:
Evaluating Large Language Models (LLMs) in low-resource and linguistically diverse languages remains a significant challenge in NLP, particularly for languages using non-Latin scripts like those spoken in India. Existing benchmarks predominantly focus on English, leaving substantial gaps in assessing LLM capabilities in these languages. We introduce MILU, a Multi task Indic Language Understanding…
▽ More
Evaluating Large Language Models (LLMs) in low-resource and linguistically diverse languages remains a significant challenge in NLP, particularly for languages using non-Latin scripts like those spoken in India. Existing benchmarks predominantly focus on English, leaving substantial gaps in assessing LLM capabilities in these languages. We introduce MILU, a Multi task Indic Language Understanding Benchmark, a comprehensive evaluation benchmark designed to address this gap. MILU spans 8 domains and 41 subjects across 11 Indic languages, reflecting both general and culturally specific knowledge. With an India-centric design, incorporates material from regional and state-level examinations, covering topics such as local history, arts, festivals, and laws, alongside standard subjects like science and mathematics. We evaluate over 42 LLMs, and find that current LLMs struggle with MILU, with GPT-4o achieving the highest average accuracy at 74 percent. Open multilingual models outperform language-specific fine-tuned models, which perform only slightly better than random baselines. Models also perform better in high resource languages as compared to low resource ones. Domain-wise analysis indicates that models perform poorly in culturally relevant areas like Arts and Humanities, Law and Governance compared to general fields like STEM. To the best of our knowledge, MILU is the first of its kind benchmark focused on Indic languages, serving as a crucial step towards comprehensive cultural evaluation. All code, benchmarks, and artifacts are publicly available to foster open research.
△ Less
Submitted 4 February, 2025; v1 submitted 4 November, 2024;
originally announced November 2024.
-
Proactive Detection and Calibration of Seasonal Advertisements with Multimodal Large Language Models
Authors:
Hamid Eghbalzadeh,
Shuai Shao,
Saurabh Verma,
Venugopal Mani,
Hongnan Wang,
Jigar Madia,
Vitali Karpinchyk,
Andrey Malevich
Abstract:
A myriad of factors affect large scale ads delivery systems and influence both user experience and revenue. One such factor is proactive detection and calibration of seasonal advertisements to help with increasing conversion and user satisfaction. In this paper, we present Proactive Detection and Calibration of Seasonal Advertisements (PDCaSA), a research problem that is of interest for the ads ra…
▽ More
A myriad of factors affect large scale ads delivery systems and influence both user experience and revenue. One such factor is proactive detection and calibration of seasonal advertisements to help with increasing conversion and user satisfaction. In this paper, we present Proactive Detection and Calibration of Seasonal Advertisements (PDCaSA), a research problem that is of interest for the ads ranking and recommendation community, both in the industrial setting as well as in research. Our paper provides detailed guidelines from various angles of this problem tested in, and motivated by a large-scale industrial ads ranking system. We share our findings including the clear statement of the problem and its motivation rooted in real-world systems, evaluation metrics, and sheds lights to the existing challenges, lessons learned, and best practices of data annotation and machine learning modeling to tackle this problem. Lastly, we present a conclusive solution we took during this research exploration: to detect seasonality, we leveraged Multimodal LLMs (MLMs) which on our in-house benchmark achieved 0.97 top F1 score. Based on our findings, we envision MLMs as a teacher for knowledge distillation, a machine labeler, and a part of the ensembled and tiered seasonality detection system, which can empower ads ranking systems with enriched seasonal information.
△ Less
Submitted 16 October, 2024;
originally announced November 2024.
-
Deterministic Suffix-reading Automata
Authors:
R Keerthan,
B Srivathsan,
R Venkatesh,
Sagar Verma
Abstract:
We introduce deterministic suffix-reading automata (DSA), a new automaton model over finite words. Transitions in a DSA are labeled with words. From a state, a DSA triggers an outgoing transition on seeing a word ending with the transition's label. Therefore, rather than moving along an input word letter by letter, a DSA can jump along blocks of letters, with each block ending in a suitable suffix…
▽ More
We introduce deterministic suffix-reading automata (DSA), a new automaton model over finite words. Transitions in a DSA are labeled with words. From a state, a DSA triggers an outgoing transition on seeing a word ending with the transition's label. Therefore, rather than moving along an input word letter by letter, a DSA can jump along blocks of letters, with each block ending in a suitable suffix. This feature allows DSAs to recognize regular languages more concisely, compared to DFAs. In this work, we focus on questions around finding a "minimal" DSA for a regular language. The number of states is not a faithful measure of the size of a DSA, since the transition-labels contain strings of arbitrary length. Hence, we consider total-size (number of states + number of edges + total length of transition-labels) as the size measure of DSAs.
We start by formally defining the model and providing a DSA-to-DFA conversion that allows to compare the expressiveness and succinctness of DSA with related automata models. Our main technical contribution is a method to derive DSAs from a given DFA: a DFA-to-DSA conversion. We make a surprising observation that the smallest DSA derived from the canonical DFA of a regular language L need not be a minimal DSA for L. This observation leads to a fundamental bottleneck in deriving a minimal DSA for a regular language. In fact, we prove that given a DFA and a number k, the problem of deciding if there exists an equivalent DSA of total-size at most k is NP-complete.
△ Less
Submitted 30 October, 2024;
originally announced October 2024.
-
Parameterized Saga of First-Fit and Last-Fit Coloring
Authors:
Akanksha Agrawal,
Daniel Lokshtanov,
Fahad Panolan,
Saket Saurabh,
Shaily Verma
Abstract:
The classic greedy coloring (first-fit) algorithm considers the vertices of an input graph $G$ in a given order and assigns the first available color to each vertex $v$ in $G$. In the {\sc Grundy Coloring} problem, the task is to find an ordering of the vertices that will force the greedy algorithm to use as many colors as possible. In the {\sc Partial Grundy Coloring}, the task is also to color t…
▽ More
The classic greedy coloring (first-fit) algorithm considers the vertices of an input graph $G$ in a given order and assigns the first available color to each vertex $v$ in $G$. In the {\sc Grundy Coloring} problem, the task is to find an ordering of the vertices that will force the greedy algorithm to use as many colors as possible. In the {\sc Partial Grundy Coloring}, the task is also to color the graph using as many colors as possible. This time, however, we may select both the ordering in which the vertices are considered and which color to assign the vertex. The only constraint is that the color assigned to a vertex $v$ is a color previously used for another vertex if such a color is available.
Whether {\sc Grundy Coloring} and {\sc Partial Grundy Coloring} admit fixed-parameter tractable (FPT) algorithms, algorithms with running time $f(k)n^{\OO(1)}$, where $k$ is the number of colors, was posed as an open problem by Zaker and by Effantin et al., respectively. Recently, Aboulker et al. (STACS 2020 and Algorithmica 2022) resolved the question for \Grundycol\ in the negative by showing that the problem is W[1]-hard. For {\sc Partial Grundy Coloring}, they obtain an FPT algorithm on graphs that do not contain $K_{i,j}$ as a subgraph (a.k.a. $K_{i,j}$-free graphs). Aboulker et al.~re-iterate the question of whether there exists an FPT algorithm for {\sc Partial Grundy Coloring} on general graphs and also asks whether {\sc Grundy Coloring} admits an FPT algorithm on $K_{i,j}$-free graphs. We give FPT algorithms for {\sc Partial Grundy Coloring} on general graphs and for {\sc Grundy Coloring} on $K_{i,j}$-free graphs, resolving both the questions in the affirmative. We believe that our new structural theorems for partial Grundy coloring and ``representative-family'' like sets for $K_{i,j}$-free graphs that we use in obtaining our results may have wider algorithmic applications.
△ Less
Submitted 27 October, 2024;
originally announced October 2024.
-
How Many Van Goghs Does It Take to Van Gogh? Finding the Imitation Threshold
Authors:
Sahil Verma,
Royi Rassin,
Arnav Das,
Gantavya Bhatt,
Preethi Seshadri,
Chirag Shah,
Jeff Bilmes,
Hannaneh Hajishirzi,
Yanai Elazar
Abstract:
Text-to-image models are trained using large datasets collected by scraping image-text pairs from the internet. These datasets often include private, copyrighted, and licensed material. Training models on such datasets enables them to generate images with such content, which might violate copyright laws and individual privacy. This phenomenon is termed imitation -- generation of images with conten…
▽ More
Text-to-image models are trained using large datasets collected by scraping image-text pairs from the internet. These datasets often include private, copyrighted, and licensed material. Training models on such datasets enables them to generate images with such content, which might violate copyright laws and individual privacy. This phenomenon is termed imitation -- generation of images with content that has recognizable similarity to its training images. In this work we study the relationship between a concept's frequency in the training dataset and the ability of a model to imitate it. We seek to determine the point at which a model was trained on enough instances to imitate a concept -- the imitation threshold. We posit this question as a new problem: Finding the Imitation Threshold (FIT) and propose an efficient approach that estimates the imitation threshold without incurring the colossal cost of training multiple models from scratch. We experiment with two domains -- human faces and art styles -- for which we create four datasets, and evaluate three text-to-image models which were trained on two pretraining datasets. Our results reveal that the imitation threshold of these models is in the range of 200-600 images, depending on the domain and the model. The imitation threshold can provide an empirical basis for copyright violation claims and acts as a guiding principle for text-to-image model developers that aim to comply with copyright and privacy laws. We release the code and data at \url{https://github.com/vsahil/MIMETIC-2.git} and the project's website is hosted at \url{https://how-many-van-goghs-does-it-take.github.io}.
△ Less
Submitted 19 October, 2024;
originally announced October 2024.
-
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark
Authors:
Himanshu Gupta,
Shreyas Verma,
Ujjwala Anantheswaran,
Kevin Scaria,
Mihir Parmar,
Swaroop Mishra,
Chitta Baral
Abstract:
Multi-modal Large Language Models (MLLMs) exhibit impressive problem-solving abilities in various domains, but their visual comprehension and abstract reasoning skills remain under-evaluated. To this end, we present PolyMATH, a challenging benchmark aimed at evaluating the general cognitive reasoning abilities of MLLMs. PolyMATH comprises 5,000 manually collected high-quality images of cognitive t…
▽ More
Multi-modal Large Language Models (MLLMs) exhibit impressive problem-solving abilities in various domains, but their visual comprehension and abstract reasoning skills remain under-evaluated. To this end, we present PolyMATH, a challenging benchmark aimed at evaluating the general cognitive reasoning abilities of MLLMs. PolyMATH comprises 5,000 manually collected high-quality images of cognitive textual and visual challenges across 10 distinct categories, including pattern recognition, spatial reasoning, and relative reasoning. We conducted a comprehensive, and quantitative evaluation of 15 MLLMs using four diverse prompting strategies, including Chain-of-Thought and Step-Back. The best scores achieved on PolyMATH are ~41%, ~36%, and ~27%, obtained by Claude-3.5 Sonnet, GPT-4o and Gemini-1.5 Pro respectively - highlighting the logical and visual complexity of these questions. A further fine-grained error analysis reveals that these models struggle to understand spatial relations and perform drawn-out, high-level reasoning. This is further strengthened by our ablation study estimating MLLM performance when given textual descriptions in place of diagrams. As evidenced by ~4% improvement over textual descriptions as opposed to actual images, we discover that models do not truly comprehend visual diagrams and the spatial information therein, and are thus prone to logical errors. Finally, we evaluate the OpenAI o1 models and find that their performance only matches the human baseline, highlighting the difficulty of the benchmark. The results on PolyMATH highlight the room for improvement in multi-modal reasoning and provide unique insights to guide the development of future MLLMs.
△ Less
Submitted 6 October, 2024;
originally announced October 2024.
-
AXCEL: Automated eXplainable Consistency Evaluation using LLMs
Authors:
P Aditya Sreekar,
Sahil Verma,
Suransh Chopra,
Sarik Ghazarian,
Abhishek Persad,
Narayanan Sadagopan
Abstract:
Large Language Models (LLMs) are widely used in both industry and academia for various tasks, yet evaluating the consistency of generated text responses continues to be a challenge. Traditional metrics like ROUGE and BLEU show a weak correlation with human judgment. More sophisticated metrics using Natural Language Inference (NLI) have shown improved correlations but are complex to implement, requ…
▽ More
Large Language Models (LLMs) are widely used in both industry and academia for various tasks, yet evaluating the consistency of generated text responses continues to be a challenge. Traditional metrics like ROUGE and BLEU show a weak correlation with human judgment. More sophisticated metrics using Natural Language Inference (NLI) have shown improved correlations but are complex to implement, require domain-specific training due to poor cross-domain generalization, and lack explainability. More recently, prompt-based metrics using LLMs as evaluators have emerged; while they are easier to implement, they still lack explainability and depend on task-specific prompts, which limits their generalizability. This work introduces Automated eXplainable Consistency Evaluation using LLMs (AXCEL), a prompt-based consistency metric which offers explanations for the consistency scores by providing detailed reasoning and pinpointing inconsistent text spans. AXCEL is also a generalizable metric which can be adopted to multiple tasks without changing the prompt. AXCEL outperforms both non-prompt and prompt-based state-of-the-art (SOTA) metrics in detecting inconsistencies across summarization by 8.7%, free text generation by 6.2%, and data-to-text conversion tasks by 29.4%. We also evaluate the influence of underlying LLMs on prompt based metric performance and recalibrate the SOTA prompt-based metrics with the latest LLMs for fair comparison. Further, we show that AXCEL demonstrates strong performance using open source LLMs.
△ Less
Submitted 25 September, 2024;
originally announced September 2024.
-
Credit Card Fraud Detection: A Deep Learning Approach
Authors:
Sourav Verma,
Joydip Dhar
Abstract:
Credit card is one of the most extensive methods of instalment for both online and offline mode of payment for electronic transactions in recent times. credit cards invention has provided significant ease in electronic transactions. However, it has also provided new fraud opportunities for criminals, which results in increased fraud rates. Substantial amount of money has been lost by many institut…
▽ More
Credit card is one of the most extensive methods of instalment for both online and offline mode of payment for electronic transactions in recent times. credit cards invention has provided significant ease in electronic transactions. However, it has also provided new fraud opportunities for criminals, which results in increased fraud rates. Substantial amount of money has been lost by many institutions and individuals due to fraudulent credit card transactions. Adapting improved and dynamic fraud recognition frameworks thus became essential for all credit card distributing banks to mitigate their losses. In fact, the problem of fraudulent credit card transactions implicates a number of relevant real-time challenges, namely: Concept drift, Class imbalance, and Verification latency. However, the vast majority of current systems are based on artificial intelligence (AI), Fuzzy logic, Machine Learning, Data mining, Genetic Algorithms, and so on, rely on assumptions that hardly address all the relevant challenges of fraud-detection system (FDS). This paper aims to understand & implement Deep Learning algorithms in order to obtain a high fraud coverage with very low false positive rate. Also, it aims to implement an auto-encoder as an unsupervised (semi-supervised) method of learning common patterns. Keywords: Credit card fraud, Fraud-detection system (FDS), Electronic transactions, Concept drift, Class imbalance, Verification latency, Machine Learning, Deep Learning
△ Less
Submitted 20 September, 2024;
originally announced September 2024.
-
Contextual Compression in Retrieval-Augmented Generation for Large Language Models: A Survey
Authors:
Sourav Verma
Abstract:
Large Language Models (LLMs) showcase remarkable abilities, yet they struggle with limitations such as hallucinations, outdated knowledge, opacity, and inexplicable reasoning. To address these challenges, Retrieval-Augmented Generation (RAG) has proven to be a viable solution, leveraging external databases to improve the consistency and coherence of generated content, especially valuable for compl…
▽ More
Large Language Models (LLMs) showcase remarkable abilities, yet they struggle with limitations such as hallucinations, outdated knowledge, opacity, and inexplicable reasoning. To address these challenges, Retrieval-Augmented Generation (RAG) has proven to be a viable solution, leveraging external databases to improve the consistency and coherence of generated content, especially valuable for complex, knowledge-rich tasks, and facilitates continuous improvement by leveraging domain-specific insights. By combining the intrinsic knowledge of LLMs with the vast, dynamic repositories of external databases, RAG achieves a synergistic effect. However, RAG is not without its limitations, including a limited context window, irrelevant information, and the high processing overhead for extensive contextual data. In this comprehensive work, we explore the evolution of Contextual Compression paradigms, providing an in-depth examination of the field. Finally, we outline the current challenges and suggest potential research and development directions, paving the way for future advancements in this area.
△ Less
Submitted 2 October, 2024; v1 submitted 20 September, 2024;
originally announced September 2024.
-
Li-MSD: A lightweight mitigation solution for DAO insider attack in RPL-based IoT
Authors:
Abhishek Verma,
Sachin Kumar Verma,
Avinash Chandra Pandey,
Jyoti Grover,
Girish Sharma
Abstract:
Many IoT applications run on a wireless infrastructure supported by resource-constrained nodes which is popularly known as Low-Power and Lossy Networks (LLNs). Currently, LLNs play a vital role in digital transformation of industries. The resource limitations of LLNs restrict the usage of traditional routing protocols and therefore require an energy-efficient routing solution. IETF's Routing Proto…
▽ More
Many IoT applications run on a wireless infrastructure supported by resource-constrained nodes which is popularly known as Low-Power and Lossy Networks (LLNs). Currently, LLNs play a vital role in digital transformation of industries. The resource limitations of LLNs restrict the usage of traditional routing protocols and therefore require an energy-efficient routing solution. IETF's Routing Protocol for Low-power Lossy Networks (RPL, pronounced 'ripple') is one of the most popular energy-efficient protocols for LLNs, specified in RFC 6550. In RPL, Destination Advertisement Object (DAO) control message is transmitted by a child node to pass on its reachability information to its immediate parent or root node. An attacker may exploit the insecure DAO sending mechanism of RPL to perform 'DAO insider attack' by transmitting DAO multiple times. This paper shows that an aggressive DAO insider attacker can drastically degrade network performance. We propose a Lightweight Mitigation Solution for DAO insider attack, which is termed as 'Li-MSD'. Li-MSD uses a blacklisting strategy to mitigate the attack and restore RPL performance, significantly. By using simulations, it is shown that Li-MSD outperforms the existing solution in the literature.
△ Less
Submitted 16 September, 2024;
originally announced September 2024.
-
Case Study: Leveraging GenAI to Build AI-based Surrogates and Regressors for Modeling Radio Frequency Heating in Fusion Energy Science
Authors:
E. Wes Bethel,
Vianna Cramer,
Alexander del Rio,
Lothar Narins,
Chris Pestano,
Satvik Verma,
Erick Arias,
Nicola Bertelli,
Talita Perciano,
Syun'ichi Shiraiwa,
Álvaro Sánchez Villar,
Greg Wallace,
John C. Wright
Abstract:
This work presents a detailed case study on using Generative AI (GenAI) to develop AI surrogates for simulation models in fusion energy research. The scope includes the methodology, implementation, and results of using GenAI to assist in model development and optimization, comparing these results with previous manually developed models.
This work presents a detailed case study on using Generative AI (GenAI) to develop AI surrogates for simulation models in fusion energy research. The scope includes the methodology, implementation, and results of using GenAI to assist in model development and optimization, comparing these results with previous manually developed models.
△ Less
Submitted 9 September, 2024;
originally announced September 2024.
-
Balancing Act: Prioritization Strategies for LLM-Designed Restless Bandit Rewards
Authors:
Shresth Verma,
Niclas Boehmer,
Lingkai Kong,
Milind Tambe
Abstract:
LLMs are increasingly used to design reward functions based on human preferences in Reinforcement Learning (RL). We focus on LLM-designed rewards for Restless Multi-Armed Bandits, a framework for allocating limited resources among agents. In applications such as public health, this approach empowers grassroots health workers to tailor automated allocation decisions to community needs. In the prese…
▽ More
LLMs are increasingly used to design reward functions based on human preferences in Reinforcement Learning (RL). We focus on LLM-designed rewards for Restless Multi-Armed Bandits, a framework for allocating limited resources among agents. In applications such as public health, this approach empowers grassroots health workers to tailor automated allocation decisions to community needs. In the presence of multiple agents, altering the reward function based on human preferences can impact subpopulations very differently, leading to complex tradeoffs and a multi-objective resource allocation problem. We are the first to present a principled method termed Social Choice Language Model for dealing with these tradeoffs for LLM-designed rewards for multiagent planners in general and restless bandits in particular. The novel part of our model is a transparent and configurable selection component, called an adjudicator, external to the LLM that controls complex tradeoffs via a user-selected social welfare function. Our experiments demonstrate that our model reliably selects more effective, aligned, and balanced reward functions compared to purely LLM-based approaches.
△ Less
Submitted 7 July, 2025; v1 submitted 21 August, 2024;
originally announced August 2024.
-
The Llama 3 Herd of Models
Authors:
Aaron Grattafiori,
Abhimanyu Dubey,
Abhinav Jauhri,
Abhinav Pandey,
Abhishek Kadian,
Ahmad Al-Dahle,
Aiesha Letman,
Akhil Mathur,
Alan Schelten,
Alex Vaughan,
Amy Yang,
Angela Fan,
Anirudh Goyal,
Anthony Hartshorn,
Aobo Yang,
Archi Mitra,
Archie Sravankumar,
Artem Korenev,
Arthur Hinsvark,
Arun Rao,
Aston Zhang,
Aurelien Rodriguez,
Austen Gregerson,
Ava Spataru,
Baptiste Roziere
, et al. (536 additional authors not shown)
Abstract:
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical…
▽ More
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.
△ Less
Submitted 23 November, 2024; v1 submitted 31 July, 2024;
originally announced July 2024.
-
Improving Health Information Access in the World's Largest Maternal Mobile Health Program via Bandit Algorithms
Authors:
Arshika Lalan,
Shresth Verma,
Paula Rodriguez Diaz,
Panayiotis Danassis,
Amrita Mahale,
Kumar Madhu Sudan,
Aparna Hegde,
Milind Tambe,
Aparna Taneja
Abstract:
Harnessing the wide-spread availability of cell phones, many nonprofits have launched mobile health (mHealth) programs to deliver information via voice or text to beneficiaries in underserved communities, with maternal and infant health being a key area of such mHealth programs. Unfortunately, dwindling listenership is a major challenge, requiring targeted interventions using limited resources. Th…
▽ More
Harnessing the wide-spread availability of cell phones, many nonprofits have launched mobile health (mHealth) programs to deliver information via voice or text to beneficiaries in underserved communities, with maternal and infant health being a key area of such mHealth programs. Unfortunately, dwindling listenership is a major challenge, requiring targeted interventions using limited resources. This paper focuses on Kilkari, the world's largest mHealth program for maternal and child care - with over 3 million active subscribers at a time - launched by India's Ministry of Health and Family Welfare (MoHFW) and run by the non-profit ARRMAN. We present a system called CHAHAK that aims to reduce automated dropouts as well as boost engagement with the program through the strategic allocation of interventions to beneficiaries. Past work in a similar domain has focused on a much smaller scale mHealth program and used markovian restless multiarmed bandits to optimize a single limited intervention resource. However this paper demonstrates the challenges in adopting a markovian approach in Kilkari; therefore CHAHAK instead relies on non-markovian time-series restless bandits, and optimizes multiple interventions to improve listenership. We use real Kilkari data from the Odisha state in India to show CHAHAK's effectiveness in harnessing multiple interventions to boost listenership, benefiting marginalized communities. When deployed CHAHAK will assist the largest maternal mHealth program to date.
△ Less
Submitted 14 May, 2024;
originally announced July 2024.
-
Machine Learning for ALSFRS-R Score Prediction: Making Sense of the Sensor Data
Authors:
Ritesh Mehta,
Aleksandar Pramov,
Shashank Verma
Abstract:
Amyotrophic Lateral Sclerosis (ALS) is characterized as a rapidly progressive neurodegenerative disease that presents individuals with limited treatment options in the realm of medical interventions and therapies. The disease showcases a diverse range of onset patterns and progression trajectories, emphasizing the critical importance of early detection of functional decline to enable tailored care…
▽ More
Amyotrophic Lateral Sclerosis (ALS) is characterized as a rapidly progressive neurodegenerative disease that presents individuals with limited treatment options in the realm of medical interventions and therapies. The disease showcases a diverse range of onset patterns and progression trajectories, emphasizing the critical importance of early detection of functional decline to enable tailored care strategies and timely therapeutic interventions. The present investigation, spearheaded by the iDPP@CLEF 2024 challenge, focuses on utilizing sensor-derived data obtained through an app. This data is used to construct various machine learning models specifically designed to forecast the advancement of the ALS Functional Rating Scale-Revised (ALSFRS-R) score, leveraging the dataset provided by the organizers. In our analysis, multiple predictive models were evaluated to determine their efficacy in handling ALS sensor data. The temporal aspect of the sensor data was compressed and amalgamated using statistical methods, thereby augmenting the interpretability and applicability of the gathered information for predictive modeling objectives. The models that demonstrated optimal performance were a naive baseline and ElasticNet regression. The naive model achieved a Mean Absolute Error (MAE) of 0.20 and a Root Mean Square Error (RMSE) of 0.49, slightly outperforming the ElasticNet model, which recorded an MAE of 0.22 and an RMSE of 0.50. Our comparative analysis suggests that while the naive approach yielded marginally better predictive accuracy, the ElasticNet model provides a robust framework for understanding feature contributions.
△ Less
Submitted 10 July, 2024;
originally announced July 2024.
-
Estimation of the Area and Precipitation Associated with a Tropical Cyclone Biparjoy by using Image Processing
Authors:
Shikha Verma,
Kuldeep Srivastava,
Akhilesh Tiwari,
Shekhar Verma
Abstract:
The rainfall associated with Topical Cyclone(TC) contributes a major amount to the annual rainfall in India. Due to the limited research on the quantitative precipitation associated with Tropical Cyclones (TC), the prediction of the amount of precipitation and area that it may cover remains a challenge. This paper proposes an approach to estimate the accumulated precipitation and impact on affecte…
▽ More
The rainfall associated with Topical Cyclone(TC) contributes a major amount to the annual rainfall in India. Due to the limited research on the quantitative precipitation associated with Tropical Cyclones (TC), the prediction of the amount of precipitation and area that it may cover remains a challenge. This paper proposes an approach to estimate the accumulated precipitation and impact on affected area using Remote Sensing data. For this study, an instance of Extremely Severe Cyclonic Storm, Biparjoy that formed over the Arabian Sea and hit India in 2023 is considered in which we have used the satellite images of IMERG-Late Run of Global Precipitation Measurement (GPM). Image processing techniques were employed to identify and extract precipitation clusters linked to the cyclone. The results indicate that Biparjoy contributed a daily average rainfall of 53.14 mm/day across India and the Arabian Sea, with the Indian boundary receiving 11.59 mm/day, covering an extensive 411.76 thousand square kilometers. The localized intensity and variability observed in states like Gujarat, Rajasthan, Madhya Pradesh, and Uttar Pradesh highlight the need for tailored response measures, emphasizing the importance of further research to enhance predictive models and disaster readiness, crucial for building resilience against the diverse impacts of tropical cyclones.
△ Less
Submitted 7 July, 2024;
originally announced July 2024.
-
Cutting Through the Noise: Boosting LLM Performance on Math Word Problems
Authors:
Ujjwala Anantheswaran,
Himanshu Gupta,
Kevin Scaria,
Shreyas Verma,
Chitta Baral,
Swaroop Mishra
Abstract:
Large Language Models (LLMs) excel at various tasks, including solving math word problems (MWPs), but struggle with real-world problems containing irrelevant information. To address this, we propose a prompting framework that generates adversarial variants of MWPs by adding irrelevant variables. We introduce a dataset, PROBLEMATHIC, containing both adversarial and non-adversarial MWPs. Our experim…
▽ More
Large Language Models (LLMs) excel at various tasks, including solving math word problems (MWPs), but struggle with real-world problems containing irrelevant information. To address this, we propose a prompting framework that generates adversarial variants of MWPs by adding irrelevant variables. We introduce a dataset, PROBLEMATHIC, containing both adversarial and non-adversarial MWPs. Our experiments reveal that LLMs are susceptible to distraction by numerical noise, resulting in an average relative performance drop of ~26% on adversarial MWPs. To mitigate this, we fine-tune LLMs (Llama-2, Mistral) on the adversarial samples from our dataset. Fine-tuning on adversarial training instances improves performance on adversarial MWPs by ~8%, indicating increased robustness to noise and improved ability to identify relevant data for reasoning. Finally, to assess the generalizability of our prompting framework, we introduce GSM-8K-Adv, an adversarial variant of the GSM-8K benchmark. LLMs continue to struggle when faced with adversarial information, reducing performance by up to 6%.
△ Less
Submitted 24 October, 2024; v1 submitted 30 May, 2024;
originally announced June 2024.
-
Finding Blind Spots in Evaluator LLMs with Interpretable Checklists
Authors:
Sumanth Doddapaneni,
Mohammed Safi Ur Rahman Khan,
Sshubam Verma,
Mitesh M. Khapra
Abstract:
Large Language Models (LLMs) are increasingly relied upon to evaluate text outputs of other LLMs, thereby influencing leaderboards and development decisions. However, concerns persist over the accuracy of these assessments and the potential for misleading conclusions. In this work, we investigate the effectiveness of LLMs as evaluators for text generation tasks. We propose FBI, a novel framework d…
▽ More
Large Language Models (LLMs) are increasingly relied upon to evaluate text outputs of other LLMs, thereby influencing leaderboards and development decisions. However, concerns persist over the accuracy of these assessments and the potential for misleading conclusions. In this work, we investigate the effectiveness of LLMs as evaluators for text generation tasks. We propose FBI, a novel framework designed to examine the proficiency of Evaluator LLMs in assessing four critical abilities in other LLMs: factual accuracy, instruction following, coherence in long-form writing, and reasoning proficiency. By introducing targeted perturbations in answers generated by LLMs, that clearly impact one of these key capabilities, we test whether an Evaluator LLM can detect these quality drops. By creating a total of 2400 perturbed answers covering 22 perturbation categories, we conduct a comprehensive study using different evaluation strategies on five prominent LLMs commonly used as evaluators in the literature. Our findings reveal significant shortcomings in current Evaluator LLMs, which failed to identify quality drops in over 50\% of cases on average. Single-answer and pairwise evaluations demonstrated notable limitations, whereas reference-based evaluations showed comparatively better performance. These results underscore the unreliable nature of current Evaluator LLMs and advocate for cautious implementation in practical applications. Code and data are available at https://github.com/AI4Bharat/FBI.
△ Less
Submitted 26 November, 2024; v1 submitted 19 June, 2024;
originally announced June 2024.
-
A Critical Study of What Code-LLMs (Do Not) Learn
Authors:
Abhinav Anand,
Shweta Verma,
Krishna Narasimhan,
Mira Mezini
Abstract:
Large Language Models trained on code corpora (code-LLMs) have demonstrated impressive performance in various coding assistance tasks. However, despite their increased size and training dataset, code-LLMs still have limitations such as suggesting codes with syntactic errors, variable misuse etc. Some studies argue that code-LLMs perform well on coding tasks because they use self-attention and hidd…
▽ More
Large Language Models trained on code corpora (code-LLMs) have demonstrated impressive performance in various coding assistance tasks. However, despite their increased size and training dataset, code-LLMs still have limitations such as suggesting codes with syntactic errors, variable misuse etc. Some studies argue that code-LLMs perform well on coding tasks because they use self-attention and hidden representations to encode relations among input tokens. However, previous works have not studied what code properties are not encoded by code-LLMs. In this paper, we conduct a fine-grained analysis of attention maps and hidden representations of code-LLMs. Our study indicates that code-LLMs only encode relations among specific subsets of input tokens. Specifically, by categorizing input tokens into syntactic tokens and identifiers, we found that models encode relations among syntactic tokens and among identifiers, but they fail to encode relations between syntactic tokens and identifiers. We also found that fine-tuned models encode these relations poorly compared to their pre-trained counterparts. Additionally, larger models with billions of parameters encode significantly less information about code than models with only a few hundred million parameters.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization
Authors:
Amir Saeidi,
Shivanshu Verma,
Aswin RRV,
Kashif Rasul,
Chitta Baral
Abstract:
Reinforcement Learning with Human Feedback (RLHF) enhances the alignment of Large Language Models (LLMs). However, its limitations have led to the development of Direct Preference Optimization (DPO), an RL-free approach designed to overcome these shortcomings. While studies have shown that DPO improves instruction-following capabilities, it negatively impacts the reasoning ability of LLMs. Additio…
▽ More
Reinforcement Learning with Human Feedback (RLHF) enhances the alignment of Large Language Models (LLMs). However, its limitations have led to the development of Direct Preference Optimization (DPO), an RL-free approach designed to overcome these shortcomings. While studies have shown that DPO improves instruction-following capabilities, it negatively impacts the reasoning ability of LLMs. Additionally, DPO is highly sensitive to judgment noise in preference datasets and the size of the training set. Although several modifications to DPO have been proposed, they still fail to fully resolve these issues. To address these limitations, we propose Triple Preference Optimization (TPO), a new preference learning method designed to enhance both reasoning and instruction-following abilities through one-step optimization. We compare TPO against DPO and its recent variants using state-of-the-art training setups, including both base and instruction-tuned models such as Mistral and Llama 3. Our evaluation covers a comprehensive range of chat-based and reasoning benchmarks. The results demonstrate that TPO achieves significant improvements over existing methods without substantially increasing response length across different dataset sizes. Specifically, TPO outperforms DPO and SimPO by up to 7.0% and 7.3% points on Arena-Hard, 12.2% and 13.3% points on MixEval-Hard, 10.4% and 10.1% points on MMLU-Pro, and 19.0% and 19.2% points on GSM8K, respectively. Furthermore, TPO achieves these improvements while requiring less data than DPO.
△ Less
Submitted 17 February, 2025; v1 submitted 26 May, 2024;
originally announced May 2024.
-
Perturbing the Gradient for Alleviating Meta Overfitting
Authors:
Manas Gogoi,
Sambhavi Tiwari,
Shekhar Verma
Abstract:
The reason for Meta Overfitting can be attributed to two factors: Mutual Non-exclusivity and the Lack of diversity, consequent to which a single global function can fit the support set data of all the meta-training tasks and fail to generalize to new unseen tasks. This issue is evidenced by low error rates on the meta-training tasks, but high error rates on new tasks. However, there can be a numbe…
▽ More
The reason for Meta Overfitting can be attributed to two factors: Mutual Non-exclusivity and the Lack of diversity, consequent to which a single global function can fit the support set data of all the meta-training tasks and fail to generalize to new unseen tasks. This issue is evidenced by low error rates on the meta-training tasks, but high error rates on new tasks. However, there can be a number of novel solutions to this problem keeping in mind any of the two objectives to be attained, i.e. to increase diversity in the tasks and to reduce the confidence of the model for some of the tasks. In light of the above, this paper proposes a number of solutions to tackle meta-overfitting on few-shot learning settings, such as few-shot sinusoid regression and few shot classification. Our proposed approaches demonstrate improved generalization performance compared to state-of-the-art baselines for learning in a non-mutually exclusive task setting. Overall, this paper aims to provide insights into tackling overfitting in meta-learning and to advance the field towards more robust and generalizable models.
△ Less
Submitted 10 November, 2024; v1 submitted 20 May, 2024;
originally announced May 2024.
-
Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks
Authors:
Amir Saeidi,
Shivanshu Verma,
Md Nayem Uddin,
Chitta Baral
Abstract:
This study evaluates Direct Preference Optimization (DPO) and its variants for aligning Large Language Models (LLMs) with human preferences, testing three configurations: (1) with Supervised Fine Tuning (SFT), (2) without SFT, and (3) without SFT but using an instruction tuned model. We further investigate how training set size influences model performance. Our evaluation spans 13 benchmarks cover…
▽ More
This study evaluates Direct Preference Optimization (DPO) and its variants for aligning Large Language Models (LLMs) with human preferences, testing three configurations: (1) with Supervised Fine Tuning (SFT), (2) without SFT, and (3) without SFT but using an instruction tuned model. We further investigate how training set size influences model performance. Our evaluation spans 13 benchmarks covering dialogue, reasoning, mathematical problem-solving, question answering, truthfulness, MT-Bench, Big Bench, and the Open LLM Leaderboard. We find that: (1) alignment methods often achieve near optimal performance even with smaller subsets of training data; (2) although they offer limited improvements on complex reasoning tasks, they enhance mathematical problem-solving; and (3) using an instruction tuned model improves truthfulness. These insights highlight the conditions under which alignment methods excel, as well as their limitations.
△ Less
Submitted 8 February, 2025; v1 submitted 22 April, 2024;
originally announced April 2024.
-
A Systematic Literature Review on Task Allocation and Performance Management Techniques in Cloud Data Center
Authors:
Nidhika Chauhan,
Navneet Kaur,
Kamaljit Singh Saini,
Sahil Verma,
Abdulatif Alabdulatif,
Ruba Abu Khurma,
Maribel Garcia-Arenas,
Pedro A. Castillo
Abstract:
As cloud computing usage grows, cloud data centers play an increasingly important role. To maximize resource utilization, ensure service quality, and enhance system performance, it is crucial to allocate tasks and manage performance effectively. The purpose of this study is to provide an extensive analysis of task allocation and performance management techniques employed in cloud data centers. The…
▽ More
As cloud computing usage grows, cloud data centers play an increasingly important role. To maximize resource utilization, ensure service quality, and enhance system performance, it is crucial to allocate tasks and manage performance effectively. The purpose of this study is to provide an extensive analysis of task allocation and performance management techniques employed in cloud data centers. The aim is to systematically categorize and organize previous research by identifying the cloud computing methodologies, categories, and gaps. A literature review was conducted, which included the analysis of 463 task allocations and 480 performance management papers. The review revealed three task allocation research topics and seven performance management methods. Task allocation research areas are resource allocation, load-Balancing, and scheduling. Performance management includes monitoring and control, power and energy management, resource utilization optimization, quality of service management, fault management, virtual machine management, and network management. The study proposes new techniques to enhance cloud computing work allocation and performance management. Short-comings in each approach can guide future research. The research's findings on cloud data center task allocation and performance management can assist academics, practitioners, and cloud service providers in optimizing their systems for dependability, cost-effectiveness, and scalability. Innovative methodologies can steer future research to fill gaps in the literature.
△ Less
Submitted 20 February, 2024;
originally announced February 2024.