-
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Authors:
Gheorghe Comanici,
Eric Bieber,
Mike Schaekermann,
Ice Pasupat,
Noveen Sachdeva,
Inderjit Dhillon,
Marcel Blistein,
Ori Ram,
Dan Zhang,
Evan Rosen,
Luke Marris,
Sam Petulla,
Colin Gaffney,
Asaf Aharoni,
Nathan Lintz,
Tiago Cardal Pais,
Henrik Jacobsson,
Idan Szpektor,
Nan-Jiang Jiang,
Krishna Haridasan,
Ahmed Omran,
Nikunj Saunshi,
Dara Bahri,
Gaurav Mishra,
Eric Chu
, et al. (3264 additional authors not shown)
Abstract:
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal unde…
▽ More
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
△ Less
Submitted 11 July, 2025; v1 submitted 7 July, 2025;
originally announced July 2025.
-
Agentic-R1: Distilled Dual-Strategy Reasoning
Authors:
Weihua Du,
Pranjal Aggarwal,
Sean Welleck,
Yiming Yang
Abstract:
Current long chain-of-thought (long-CoT) models excel at mathematical reasoning but rely on slow and error-prone natural language traces. Tool-augmented agents address arithmetic via code execution, but often falter on complex logical tasks. We introduce a fine-tuning framework, DualDistill, that distills complementary reasoning strategies from multiple teachers into a unified student model. Using…
▽ More
Current long chain-of-thought (long-CoT) models excel at mathematical reasoning but rely on slow and error-prone natural language traces. Tool-augmented agents address arithmetic via code execution, but often falter on complex logical tasks. We introduce a fine-tuning framework, DualDistill, that distills complementary reasoning strategies from multiple teachers into a unified student model. Using this approach, we train Agentic-R1, which dynamically selects the optimal strategy for each query, invoking tools for arithmetic and algorithmic problems, and using text-based reasoning for abstract ones. Our method improves accuracy across a range of tasks, including both computation-intensive and standard benchmarks, demonstrating the effectiveness of multi-strategy distillation in achieving robust and efficient reasoning. Our project is available at https://github.com/StigLidu/DualDistill
△ Less
Submitted 8 July, 2025;
originally announced July 2025.
-
Optimizing Federated Learning using Remote Embeddings for Graph Neural Networks
Authors:
Pranjal Naman,
Yogesh Simmhan
Abstract:
Graph Neural Networks (GNNs) have experienced rapid advancements in recent years due to their ability to learn meaningful representations from graph data structures. Federated Learning (FL) has emerged as a viable machine learning approach for training a shared model on decentralized data, addressing privacy concerns while leveraging parallelism. Existing methods that address the unique requiremen…
▽ More
Graph Neural Networks (GNNs) have experienced rapid advancements in recent years due to their ability to learn meaningful representations from graph data structures. Federated Learning (FL) has emerged as a viable machine learning approach for training a shared model on decentralized data, addressing privacy concerns while leveraging parallelism. Existing methods that address the unique requirements of federated GNN training using remote embeddings to enhance convergence accuracy are limited by their diminished performance due to large communication costs with a shared embedding server. In this paper, we present OpES, an optimized federated GNN training framework that uses remote neighbourhood pruning, and overlaps pushing of embeddings to the server with local training to reduce the network costs and training time. The modest drop in per-round accuracy due to pre-emptive push of embeddings is out-stripped by the reduction in per-round training time for large and dense graphs like Reddit and Products, converging up to $\approx2\times$ faster than the state-of-the-art technique using an embedding server and giving up to $20\%$ better accuracy than vanilla federated GNN learning.
△ Less
Submitted 14 June, 2025;
originally announced June 2025.
-
The Limits of Preference Data for Post-Training
Authors:
Eric Zhao,
Jessica Dai,
Pranjal Awasthi
Abstract:
Recent progress in strengthening the capabilities of large language models has stemmed from applying reinforcement learning to domains with automatically verifiable outcomes. A key question is whether we can similarly use RL to optimize for outcomes in domains where evaluating outcomes inherently requires human feedback; for example, in tasks like deep research and trip planning, outcome evaluatio…
▽ More
Recent progress in strengthening the capabilities of large language models has stemmed from applying reinforcement learning to domains with automatically verifiable outcomes. A key question is whether we can similarly use RL to optimize for outcomes in domains where evaluating outcomes inherently requires human feedback; for example, in tasks like deep research and trip planning, outcome evaluation is qualitative and there are many possible degrees of success. One attractive and scalable modality for collecting human feedback is preference data: ordinal rankings (pairwise or $k$-wise) that indicate, for $k$ given outcomes, which one is preferred. In this work, we study a critical roadblock: preference data fundamentally and significantly limits outcome-based optimization. Even with idealized preference data (infinite, noiseless, and online), the use of ordinal feedback can prevent obtaining even approximately optimal solutions. We formalize this impossibility using voting theory, drawing an analogy between how a model chooses to answer a query with how voters choose a candidate to elect. This indicates that grounded human scoring and algorithmic innovations are necessary for extending the success of RL post-training to domains demanding human feedback. We also explore why these limitations have disproportionately impacted RLHF when it comes to eliciting reasoning behaviors (e.g., backtracking) versus situations where RLHF has been historically successful (e.g., instruction-tuning and safety training), finding that the limitations of preference data primarily suppress RLHF's ability to elicit robust strategies -- a class that encompasses most reasoning behaviors.
△ Less
Submitted 26 May, 2025;
originally announced May 2025.
-
Ripple: Scalable Incremental GNN Inferencing on Large Streaming Graphs
Authors:
Pranjal Naman,
Yogesh Simmhan
Abstract:
Most real-world graphs are dynamic in nature, with continuous and rapid updates to the graph topology, and vertex and edge properties. Such frequent updates pose significant challenges for inferencing over Graph Neural Networks (GNNs). Current approaches that perform vertex-wise and layer-wise inferencing are impractical for dynamic graphs as they cause redundant computations, expand to large neig…
▽ More
Most real-world graphs are dynamic in nature, with continuous and rapid updates to the graph topology, and vertex and edge properties. Such frequent updates pose significant challenges for inferencing over Graph Neural Networks (GNNs). Current approaches that perform vertex-wise and layer-wise inferencing are impractical for dynamic graphs as they cause redundant computations, expand to large neighborhoods, and incur high communication costs for distributed setups, resulting in slow update propagation that often exceeds real-time latency requirements. This motivates the need for streaming GNN inference frameworks that are efficient and accurate over large, dynamic graphs. We propose Ripple, a framework that performs fast incremental updates of embeddings arising due to updates to the graph topology or vertex features. Ripple provides a generalized incremental programming model, leveraging the properties of the underlying aggregation functions employed by GNNs to efficiently propagate updates to the affected neighborhood and compute the exact new embeddings. Besides a single-machine design, we also extend this execution model to distributed inferencing, to support large graphs that do not fit in a single machine's memory. Ripple on a single machine achieves up to $\approx28000$ updates/sec for sparse graphs like Arxiv and $\approx1200$ updates/sec for larger and denser graphs like Products, with latencies of $0.1$ms--$1$s that are required for near-realtime applications. The distributed version of Ripple offers up to $\approx30\times$ better throughput over the baselines, due to $70\times$ lower communication costs during updates.
△ Less
Submitted 17 May, 2025;
originally announced May 2025.
-
Who is More Bayesian: Humans or ChatGPT?
Authors:
Tianshi Mu,
Pranjal Rawat,
John Rust,
Chengjun Zhang,
Qixuan Zhong
Abstract:
We compare the performance of human and artificially intelligent (AI) decision makers in simple binary classification tasks where the optimal decision rule is given by Bayes Rule. We reanalyze choices of human subjects gathered from laboratory experiments conducted by El-Gamal and Grether and Holt and Smith. We confirm that while overall, Bayes Rule represents the single best model for predicting…
▽ More
We compare the performance of human and artificially intelligent (AI) decision makers in simple binary classification tasks where the optimal decision rule is given by Bayes Rule. We reanalyze choices of human subjects gathered from laboratory experiments conducted by El-Gamal and Grether and Holt and Smith. We confirm that while overall, Bayes Rule represents the single best model for predicting human choices, subjects are heterogeneous and a significant share of them make suboptimal choices that reflect judgement biases described by Kahneman and Tversky that include the ``representativeness heuristic'' (excessive weight on the evidence from the sample relative to the prior) and ``conservatism'' (excessive weight on the prior relative to the sample). We compare the performance of AI subjects gathered from recent versions of large language models (LLMs) including several versions of ChatGPT. These general-purpose generative AI chatbots are not specifically trained to do well in narrow decision making tasks, but are trained instead as ``language predictors'' using a large corpus of textual data from the web. We show that ChatGPT is also subject to biases that result in suboptimal decisions. However we document a rapid evolution in the performance of ChatGPT from sub-human performance for early versions (ChatGPT 3.5) to superhuman and nearly perfect Bayesian classifications in the latest versions (ChatGPT 4o).
△ Less
Submitted 14 April, 2025;
originally announced April 2025.
-
SparseLoc: Sparse Open-Set Landmark-based Global Localization for Autonomous Navigation
Authors:
Pranjal Paul,
Vineeth Bhat,
Tejas Salian,
Mohammad Omama,
Krishna Murthy Jatavallabhula,
Naveen Arulselvan,
K. Madhava Krishna
Abstract:
Global localization is a critical problem in autonomous navigation, enabling precise positioning without reliance on GPS. Modern global localization techniques often depend on dense LiDAR maps, which, while precise, require extensive storage and computational resources. Recent approaches have explored alternative methods, such as sparse maps and learned features, but they suffer from poor robustne…
▽ More
Global localization is a critical problem in autonomous navigation, enabling precise positioning without reliance on GPS. Modern global localization techniques often depend on dense LiDAR maps, which, while precise, require extensive storage and computational resources. Recent approaches have explored alternative methods, such as sparse maps and learned features, but they suffer from poor robustness and generalization. We propose SparseLoc, a global localization framework that leverages vision-language foundation models to generate sparse, semantic-topometric maps in a zero-shot manner. It combines this map representation with a Monte Carlo localization scheme enhanced by a novel late optimization strategy, ensuring improved pose estimation. By constructing compact yet highly discriminative maps and refining localization through a carefully designed optimization schedule, SparseLoc overcomes the limitations of existing techniques, offering a more efficient and robust solution for global localization. Our system achieves over a 5X improvement in localization accuracy compared to existing sparse mapping techniques. Despite utilizing only 1/500th of the points of dense mapping methods, it achieves comparable performance, maintaining an average global localization error below 5m and 2 degrees on KITTI sequences.
△ Less
Submitted 30 March, 2025;
originally announced March 2025.
-
Demystifying the Accuracy-Interpretability Trade-Off: A Case Study of Inferring Ratings from Reviews
Authors:
Pranjal Atrey,
Michael P. Brundage,
Min Wu,
Sanghamitra Dutta
Abstract:
Interpretable machine learning models offer understandable reasoning behind their decision-making process, though they may not always match the performance of their black-box counterparts. This trade-off between interpretability and model performance has sparked discussions around the deployment of AI, particularly in critical applications where knowing the rationale of decision-making is essentia…
▽ More
Interpretable machine learning models offer understandable reasoning behind their decision-making process, though they may not always match the performance of their black-box counterparts. This trade-off between interpretability and model performance has sparked discussions around the deployment of AI, particularly in critical applications where knowing the rationale of decision-making is essential for trust and accountability. In this study, we conduct a comparative analysis of several black-box and interpretable models, focusing on a specific NLP use case that has received limited attention: inferring ratings from reviews. Through this use case, we explore the intricate relationship between the performance and interpretability of different models. We introduce a quantitative score called Composite Interpretability (CI) to help visualize the trade-off between interpretability and performance, particularly in the case of composite models. Our results indicate that, in general, the learning performance improves as interpretability decreases, but this relationship is not strictly monotonic, and there are instances where interpretable models are more advantageous.
△ Less
Submitted 10 March, 2025;
originally announced March 2025.
-
Combinatorial Optimization via LLM-driven Iterated Fine-tuning
Authors:
Pranjal Awasthi,
Sreenivas Gollapudi,
Ravi Kumar,
Kamesh Munagala
Abstract:
We present a novel way to integrate flexible, context-dependent constraints into combinatorial optimization by leveraging Large Language Models (LLMs) alongside traditional algorithms. Although LLMs excel at interpreting nuanced, locally specified requirements, they struggle with enforcing global combinatorial feasibility. To bridge this gap, we propose an iterated fine-tuning framework where algo…
▽ More
We present a novel way to integrate flexible, context-dependent constraints into combinatorial optimization by leveraging Large Language Models (LLMs) alongside traditional algorithms. Although LLMs excel at interpreting nuanced, locally specified requirements, they struggle with enforcing global combinatorial feasibility. To bridge this gap, we propose an iterated fine-tuning framework where algorithmic feedback progressively refines the LLM's output distribution. Interpreting this as simulated annealing, we introduce a formal model based on a "coarse learnability" assumption, providing sample complexity bounds for convergence. Empirical evaluations on scheduling, graph connectivity, and clustering tasks demonstrate that our framework balances the flexibility of locally expressed constraints with rigorous global optimization more effectively compared to baseline sampling methods. Our results highlight a promising direction for hybrid AI-driven combinatorial reasoning.
△ Less
Submitted 13 March, 2025; v1 submitted 10 March, 2025;
originally announced March 2025.
-
From Style to Facts: Mapping the Boundaries of Knowledge Injection with Finetuning
Authors:
Eric Zhao,
Pranjal Awasthi,
Nika Haghtalab
Abstract:
Finetuning provides a scalable and cost-effective means of customizing language models for specific tasks or response styles, with greater reliability than prompting or in-context learning. In contrast, the conventional wisdom is that injecting knowledge via finetuning results in brittle performance and poor generalization. We argue that the dichotomy of "task customization" (e.g., instruction tun…
▽ More
Finetuning provides a scalable and cost-effective means of customizing language models for specific tasks or response styles, with greater reliability than prompting or in-context learning. In contrast, the conventional wisdom is that injecting knowledge via finetuning results in brittle performance and poor generalization. We argue that the dichotomy of "task customization" (e.g., instruction tuning) and "knowledge injection" (e.g., teaching new facts) is a distinction without a difference. We instead identify concrete factors that explain the heterogeneous effectiveness observed with finetuning. To this end, we conduct a large-scale experimental study of finetuning the frontier Gemini v1.5 model family on a spectrum of datasets that are artificially engineered to interpolate between the strengths and failure modes of finetuning. Our findings indicate that question-answer training data formats provide much stronger knowledge generalization than document/article-style training data, numerical information can be harder for finetuning to retain than categorical information, and models struggle to apply finetuned knowledge during multi-step reasoning even when trained on similar examples -- all factors that render "knowledge injection" to be especially difficult, even after controlling for considerations like data augmentation and information volume. On the other hand, our findings also indicate that it is not fundamentally more difficult to finetune information about a real-world event than information about what a model's writing style should be.
△ Less
Submitted 7 March, 2025;
originally announced March 2025.
-
L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning
Authors:
Pranjal Aggarwal,
Sean Welleck
Abstract:
Reasoning language models have shown an uncanny ability to improve performance at test-time by ``thinking longer''-that is, by generating longer chain-of-thought sequences and hence using more compute. However, the length of their chain-of-thought reasoning is not controllable, making it impossible to allocate test-time compute to achieve a desired level of performance. We introduce Length Control…
▽ More
Reasoning language models have shown an uncanny ability to improve performance at test-time by ``thinking longer''-that is, by generating longer chain-of-thought sequences and hence using more compute. However, the length of their chain-of-thought reasoning is not controllable, making it impossible to allocate test-time compute to achieve a desired level of performance. We introduce Length Controlled Policy Optimization (LCPO), a simple reinforcement learning method that optimizes for accuracy and adherence to user-specified length constraints. We use LCPO to train L1, a reasoning language model that produces outputs satisfying a length constraint given in its prompt. L1's length control allows for smoothly trading off computational cost and accuracy on a wide range of tasks, and outperforms the state-of-the-art S1 method for length control. Furthermore, we uncover an unexpected short chain-of-thought capability in models trained with LCPO. For instance, our 1.5B L1 model surpasses GPT-4o at equal reasoning lengths. Overall, LCPO enables precise control over reasoning length, allowing for fine-grained allocation of test-time compute and accuracy. We release code and models at https://www.cmu-l3.github.io/l1
△ Less
Submitted 6 March, 2025;
originally announced March 2025.
-
Survey Perspective: The Role of Explainable AI in Threat Intelligence
Authors:
Nidhi Rastogi,
Devang Dhanuka,
Amulya Saxena,
Pranjal Mairal,
Le Nguyen
Abstract:
The increasing reliance on AI-based security tools in Security Operations Centers (SOCs) has transformed threat detection and response, yet analysts frequently struggle with alert overload, false positives, and lack of contextual relevance. The inability to effectively analyze AI-generated security alerts lead to inefficiencies in incident response and reduces trust in automated decision-making. I…
▽ More
The increasing reliance on AI-based security tools in Security Operations Centers (SOCs) has transformed threat detection and response, yet analysts frequently struggle with alert overload, false positives, and lack of contextual relevance. The inability to effectively analyze AI-generated security alerts lead to inefficiencies in incident response and reduces trust in automated decision-making. In this paper, we show results and analysis of our investigation of how SOC analysts navigate AI-based alerts, their challenges with current security tools, and how explainability (XAI) integrated into their security workflows has the potential to become an effective decision support. In this vein, we conducted an industry survey. Using the survey responses, we analyze how security analysts' process, retrieve, and prioritize alerts. Our findings indicate that most analysts have not yet adopted XAI-integrated tools, but they express high interest in attack attribution, confidence scores, and feature contribution explanations to improve interpretability, and triage efficiency. Based on our findings, we also propose practical design recommendations for XAI-enhanced security alert systems, enabling AI-based cybersecurity solutions to be more transparent, interpretable, and actionable.
△ Less
Submitted 3 March, 2025;
originally announced March 2025.
-
Programming with Pixels: Computer-Use Meets Software Engineering
Authors:
Pranjal Aggarwal,
Sean Welleck
Abstract:
Recent advancements in software engineering (SWE) agents have largely followed a $\textit{tool-based paradigm}$, where agents interact with hand-engineered tool APIs to perform specific tasks. While effective for specialized tasks, these methods fundamentally lack generalization, as they require predefined tools for each task and do not scale across programming languages and domains. We introduce…
▽ More
Recent advancements in software engineering (SWE) agents have largely followed a $\textit{tool-based paradigm}$, where agents interact with hand-engineered tool APIs to perform specific tasks. While effective for specialized tasks, these methods fundamentally lack generalization, as they require predefined tools for each task and do not scale across programming languages and domains. We introduce $\texttt{Programming with Pixels}$ (PwP), an agent environment that unifies software development tasks by enabling $\textit{computer-use agents}$-agents that operate directly within an IDE through visual perception, typing, and clicking, rather than relying on predefined tool APIs. To systematically evaluate these agents, we propose $\texttt{PwP-Bench}$, a benchmark that unifies existing SWE benchmarks spanning tasks across multiple programming languages, modalities, and domains under a task-agnostic state and action space. Our experiments demonstrate that general-purpose computer-use agents can approach or even surpass specialized tool-based agents on a variety of SWE tasks without the need for hand-engineered tools. However, our analysis shows that current models suffer from limited visual grounding and fail to exploit many IDE tools that could simplify their tasks. When agents can directly access IDE tools, without visual interaction, they show significant performance improvements, highlighting the untapped potential of leveraging built-in IDE capabilities. Our results establish PwP as a scalable testbed for building and evaluating the next wave of software engineering agents. We release code and data at https://programmingwithpixels.com
△ Less
Submitted 24 February, 2025;
originally announced February 2025.
-
MMTEB: Massive Multilingual Text Embedding Benchmark
Authors:
Kenneth Enevoldsen,
Isaac Chung,
Imene Kerboua,
Márton Kardos,
Ashwin Mathur,
David Stap,
Jay Gala,
Wissam Siblini,
Dominik Krzemiński,
Genta Indra Winata,
Saba Sturua,
Saiteja Utpala,
Mathieu Ciancone,
Marion Schaeffer,
Gabriel Sequeira,
Diganta Misra,
Shreeya Dhakal,
Jonathan Rystrøm,
Roman Solomatin,
Ömer Çağatan,
Akash Kundu,
Martin Bernstorff,
Shitao Xiao,
Akshita Sukhlecha,
Bhavish Pahwa
, et al. (61 additional authors not shown)
Abstract:
Text embeddings are typically evaluated on a limited set of tasks, which are constrained by language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) - a large-scale, community-driven expansion of MTEB, covering over 500 quality-controlled evaluation tasks across 250+ langua…
▽ More
Text embeddings are typically evaluated on a limited set of tasks, which are constrained by language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) - a large-scale, community-driven expansion of MTEB, covering over 500 quality-controlled evaluation tasks across 250+ languages. MMTEB includes a diverse set of challenging, novel tasks such as instruction following, long-document retrieval, and code retrieval, representing the largest multilingual collection of evaluation tasks for embedding models to date. Using this collection, we develop several highly multilingual benchmarks, which we use to evaluate a representative set of models. We find that while large language models (LLMs) with billions of parameters can achieve state-of-the-art performance on certain language subsets and task categories, the best-performing publicly available model is multilingual-e5-large-instruct with only 560 million parameters. To facilitate accessibility and reduce computational cost, we introduce a novel downsampling method based on inter-task correlation, ensuring a diverse selection while preserving relative model rankings. Furthermore, we optimize tasks such as retrieval by sampling hard negatives, creating smaller but effective splits. These optimizations allow us to introduce benchmarks that drastically reduce computational demands. For instance, our newly introduced zero-shot English benchmark maintains a ranking order similar to the full-scale version but at a fraction of the computational cost.
△ Less
Submitted 8 June, 2025; v1 submitted 19 February, 2025;
originally announced February 2025.
-
ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks
Authors:
Saurabh Jha,
Rohan Arora,
Yuji Watanabe,
Takumi Yanagawa,
Yinfang Chen,
Jackson Clark,
Bhavya Bhavya,
Mudit Verma,
Harshit Kumar,
Hirokuni Kitahara,
Noah Zheutlin,
Saki Takano,
Divya Pathak,
Felix George,
Xinbo Wu,
Bekir O. Turkkan,
Gerard Vanloo,
Michael Nidd,
Ting Dai,
Oishik Chatterjee,
Pranjal Gupta,
Suranjana Samanta,
Pooja Aggarwal,
Rong Lee,
Pavankumar Murali
, et al. (18 additional authors not shown)
Abstract:
Realizing the vision of using AI agents to automate critical IT tasks depends on the ability to measure and understand effectiveness of proposed solutions. We introduce ITBench, a framework that offers a systematic methodology for benchmarking AI agents to address real-world IT automation tasks. Our initial release targets three key areas: Site Reliability Engineering (SRE), Compliance and Securit…
▽ More
Realizing the vision of using AI agents to automate critical IT tasks depends on the ability to measure and understand effectiveness of proposed solutions. We introduce ITBench, a framework that offers a systematic methodology for benchmarking AI agents to address real-world IT automation tasks. Our initial release targets three key areas: Site Reliability Engineering (SRE), Compliance and Security Operations (CISO), and Financial Operations (FinOps). The design enables AI researchers to understand the challenges and opportunities of AI agents for IT automation with push-button workflows and interpretable metrics. ITBench includes an initial set of 94 real-world scenarios, which can be easily extended by community contributions. Our results show that agents powered by state-of-the-art models resolve only 13.8% of SRE scenarios, 25.2% of CISO scenarios, and 0% of FinOps scenarios. We expect ITBench to be a key enabler of AI-driven IT automation that is correct, safe, and fast.
△ Less
Submitted 7 February, 2025;
originally announced February 2025.
-
Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification
Authors:
Eric Zhao,
Pranjal Awasthi,
Sreenivas Gollapudi
Abstract:
Sampling-based search, a simple paradigm for utilizing test-time compute, involves generating multiple candidate responses and selecting the best one -- typically by having models self-verify each response for correctness. In this paper, we study the scaling trends governing sampling-based search. Among our findings is that simply scaling up a minimalist implementation of sampling-based search, us…
▽ More
Sampling-based search, a simple paradigm for utilizing test-time compute, involves generating multiple candidate responses and selecting the best one -- typically by having models self-verify each response for correctness. In this paper, we study the scaling trends governing sampling-based search. Among our findings is that simply scaling up a minimalist implementation of sampling-based search, using only random sampling and direct self-verification, provides a practical inference method that, for example, elevates the reasoning capabilities of Gemini v1.5 Pro above that of o1-Preview on popular benchmarks. We partially attribute the scalability of sampling-based search to a phenomenon of implicit scaling, where sampling a larger pool of responses in turn improves self-verification accuracy. We further identify two useful principles for improving self-verification capabilities with test-time compute: (1) comparing across responses provides helpful signals about the locations of errors and hallucinations, and (2) different model output styles are useful for different contexts -- chains of thought are useful for reasoning but harder to verify. We also find that, though accurate verification can be elicited, frontier models demonstrate remarkably weak out-of-box verification capabilities and introduce a benchmark to measure progress on these deficiencies.
△ Less
Submitted 20 February, 2025; v1 submitted 3 February, 2025;
originally announced February 2025.
-
GPD: Guided Polynomial Diffusion for Motion Planning
Authors:
Ajit Srikanth,
Parth Mahanjan,
Kallol Saha,
Vishal Mandadi,
Pranjal Paul,
Pawan Wadhwani,
Brojeshwar Bhowmick,
Arun Singh,
Madhava Krishna
Abstract:
Diffusion-based motion planners are becoming popular due to their well-established performance improvements, stemming from sample diversity and the ease of incorporating new constraints directly during inference. However, a primary limitation of the diffusion process is the requirement for a substantial number of denoising steps, especially when the denoising process is coupled with gradient-based…
▽ More
Diffusion-based motion planners are becoming popular due to their well-established performance improvements, stemming from sample diversity and the ease of incorporating new constraints directly during inference. However, a primary limitation of the diffusion process is the requirement for a substantial number of denoising steps, especially when the denoising process is coupled with gradient-based guidance. In this paper, we introduce, diffusion in the parametric space of trajectories, where the parameters are represented as Bernstein coefficients. We show that this representation greatly improves the effectiveness of the cost function guidance and the inference speed. We also introduce a novel stitching algorithm that leverages the diversity in diffusion-generated trajectories to produce collision-free trajectories with just a single cost function-guided model. We demonstrate that our approaches outperform current SOTA diffusion-based motion planners for manipulators and provide an ablation study on key components.
△ Less
Submitted 30 January, 2025;
originally announced January 2025.
-
Complexity of Minimal Faithful Permutation Degree for Fitting-free Groups
Authors:
Michael Levet,
Pranjal Srivastava,
Dhara Thakkar
Abstract:
In this paper, we investigate the complexity of computing the minimal faithful permutation degree for groups without abelian normal subgroups. When our groups are given as quotients of permutation groups, we establish that this problem is in $\textsf{P}$. Furthermore, in the setting of permutation groups, we obtain an upper bound of $\textsf{NC}$ for this problem. This improves upon the work of Da…
▽ More
In this paper, we investigate the complexity of computing the minimal faithful permutation degree for groups without abelian normal subgroups. When our groups are given as quotients of permutation groups, we establish that this problem is in $\textsf{P}$. Furthermore, in the setting of permutation groups, we obtain an upper bound of $\textsf{NC}$ for this problem. This improves upon the work of Das and Thakkar (STOC 2024), who established a Las Vegas polynomial-time algorithm for this class in the setting of permutation groups.
△ Less
Submitted 28 April, 2025; v1 submitted 27 January, 2025;
originally announced January 2025.
-
AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement
Authors:
Pranjal Aggarwal,
Bryan Parno,
Sean Welleck
Abstract:
Automated code generation with large language models has gained significant traction, but there remains no guarantee on the correctness of generated code. We aim to use formal verification to provide mathematical guarantees that the generated code is correct. However, generating formally verified code with LLMs is hindered by the scarcity of training data and the complexity of formal proofs. To ta…
▽ More
Automated code generation with large language models has gained significant traction, but there remains no guarantee on the correctness of generated code. We aim to use formal verification to provide mathematical guarantees that the generated code is correct. However, generating formally verified code with LLMs is hindered by the scarcity of training data and the complexity of formal proofs. To tackle this challenge, we introduce AlphaVerus, a self-improving framework that bootstraps formally verified code generation by iteratively translating programs from a higher-resource language and leveraging feedback from a verifier. AlphaVerus operates in three phases: exploration of candidate translations, Treefinement -- a novel tree search algorithm for program refinement using verifier feedback, and filtering misaligned specifications and programs to prevent reward hacking. Through this iterative process, AlphaVerus enables a LLaMA-3.1-70B model to generate verified code without human intervention or model finetuning. AlphaVerus shows an ability to generate formally verified solutions for HumanEval and MBPP, laying the groundwork for truly trustworthy code-generation agents.
△ Less
Submitted 8 December, 2024;
originally announced December 2024.
-
The Complexity of Order-Finding for ROABPs
Authors:
Vishwas Bhargava,
Pranjal Dutta,
Sumanta Ghosh,
Anamay Tengse
Abstract:
We study the \emph{order-finding problem} for Read-once Oblivious Algebraic Branching Programs (ROABPs). Given a polynomial $f$ and a parameter $w$, the goal is to find an order $σ$ in which $f$ has an ROABP of \emph{width} $w$. We show that this problem is NP-hard in the worst case, even when the input is a constant degree polynomial that is given in its dense representation. We provide a reducti…
▽ More
We study the \emph{order-finding problem} for Read-once Oblivious Algebraic Branching Programs (ROABPs). Given a polynomial $f$ and a parameter $w$, the goal is to find an order $σ$ in which $f$ has an ROABP of \emph{width} $w$. We show that this problem is NP-hard in the worst case, even when the input is a constant degree polynomial that is given in its dense representation. We provide a reduction from CutWidth to prove these results. Owing to the exactness of our reduction, all the known results for the hardness of approximation of Cutwidth also transfer directly to the order-finding problem. Additionally, we also show that any constant-approximation algorithm for the order-finding problem would imply a polynomial time approximation scheme (PTAS) for it.
On the algorithmic front, we design algorithms that solve the order-finding problem for generic ROABPs in polynomial time, when the width $w$ is polynomial in the individual degree $d$ of the polynomial $f$. That is, our algorithm is efficient for most/random ROABPs, and requires more time only on a lower-dimensional subspace (or subvariety) of ROABPs. Even when the individual degree is constant, our algorithm runs in time $n^{O(\log w)}$ for most/random ROABPs. This stands in strong contrast to the case of (Boolean) ROBPs, where only heuristic order-finding algorithms are known.
△ Less
Submitted 28 November, 2024;
originally announced November 2024.
-
Derandomizing Multivariate Polynomial Factoring for Low Degree Factors
Authors:
Pranjal Dutta,
Amit Sinhababu,
Thomas Thierauf
Abstract:
For a polynomial $f$ from a class $\mathcal{C}$ of polynomials, we show that the problem to compute all the constant degree irreducible factors of $f$ reduces in polynomial time to polynomial identity tests (PIT) for class $\mathcal{C}$ and divisibility tests of $f$ by constant degree polynomials. We apply the result to several classes $\mathcal{C}$ and obtain the constant degree factors in \begin…
▽ More
For a polynomial $f$ from a class $\mathcal{C}$ of polynomials, we show that the problem to compute all the constant degree irreducible factors of $f$ reduces in polynomial time to polynomial identity tests (PIT) for class $\mathcal{C}$ and divisibility tests of $f$ by constant degree polynomials. We apply the result to several classes $\mathcal{C}$ and obtain the constant degree factors in \begin{enumerate}
\item
polynomial time, for $\mathcal{C}$ being polynomials that have only constant degree factors,
\item
quasipolynomial time, for $\mathcal{C}$ being sparse polynomials,
\item
subexponential time, for $\mathcal{C}$ being polynomials that have constant-depth circuits. \end{enumerate} Result 2 and 3 were already shown by Kumar, Ramanathan, and Saptharishi with a different proof and their time complexities necessarily depend on black-box PITs for a related bigger class $\mathcal{C}'$. Our complexities vary on whether the input is given as a blackbox or whitebox.
We also show that the problem to compute the sparse factors of polynomial from a class $\mathcal{C}$ reduces in polynomial time to PIT for class $\mathcal{C}$, divisibility tests of $f$ by sparse polynomials, and irreducibility preserving bivariate projections for sparse polynomials. For $\mathcal{C}$ being sparse polynomials, it follows that it suffices to derandomize irreducibility preserving bivariate projections for sparse polynomials in order to compute all the sparse irreducible factors efficiently. When we consider factors of sparse polynomials that are sums of univariate polynomials, a subclass of sparse polynomials, we obtain a polynomial time algorithm. This was already shown by Volkovich with a different proof.
△ Less
Submitted 26 November, 2024;
originally announced November 2024.
-
Approximate counting of permutation patterns
Authors:
Omri Ben-Eliezer,
Slobodan Mitrović,
Pranjal Srivastava
Abstract:
We consider the problem of counting the copies of a length-$k$ pattern $σ$ in a sequence $f \colon [n] \to \mathbb{R}$, where a copy is a subset of indices $i_1 < \ldots < i_k \in [n]$ such that $f(i_j) < f(i_\ell)$ if and only if $σ(j) < σ(\ell)$. This problem is motivated by a range of connections and applications in ranking, nonparametric statistics, combinatorics, and fine-grained complexity,…
▽ More
We consider the problem of counting the copies of a length-$k$ pattern $σ$ in a sequence $f \colon [n] \to \mathbb{R}$, where a copy is a subset of indices $i_1 < \ldots < i_k \in [n]$ such that $f(i_j) < f(i_\ell)$ if and only if $σ(j) < σ(\ell)$. This problem is motivated by a range of connections and applications in ranking, nonparametric statistics, combinatorics, and fine-grained complexity, especially when $k$ is a small fixed constant.
Recent advances have significantly improved our understanding of counting and detecting patterns. Guillemot and Marx [2014] demonstrated that the detection variant is solvable in $O(n)$ time for any fixed $k$. Their proof has laid the foundations for the discovery of the twin-width, a concept that has notably advanced parameterized complexity in recent years. Counting, in contrast, is harder: it has a conditional lower bound of $n^{Ω(k / \log k)}$ [Berendsohn, Kozma, and Marx 2019] and is expected to be polynomially harder than detection as early as $k = 4$, given its equivalence to counting $4$-cycles in graphs [Dudek and Gawrychowski, 2020].
In this work, we design a deterministic near-linear time $(1+\varepsilon)$-approximation algorithm for counting $σ$-copies in $f$ for all $k \leq 5$. Combined with the conditional lower bound for $k=4$, this establishes the first known separation between approximate and exact algorithms for pattern counting. Interestingly, our algorithm leverages the Birgé decomposition -- a sublinear tool for monotone distributions widely used in distribution testing -- which, to our knowledge, has not been applied in a pattern counting context before.
△ Less
Submitted 7 November, 2024;
originally announced November 2024.
-
Algebraic metacomplexity and representation theory
Authors:
Maxim van den Berg,
Pranjal Dutta,
Fulvio Gesmundo,
Christian Ikenmeyer,
Vladimir Lysikov
Abstract:
In the algebraic metacomplexity framework we prove that the decomposition of metapolynomials into their isotypic components can be implemented efficiently, namely with only a quasipolynomial blowup in the circuit size. We use this to resolve an open question posed by Grochow, Kumar, Saks & Saraf (2017). Our result means that many existing algebraic complexity lower bound proofs can be efficiently…
▽ More
In the algebraic metacomplexity framework we prove that the decomposition of metapolynomials into their isotypic components can be implemented efficiently, namely with only a quasipolynomial blowup in the circuit size. We use this to resolve an open question posed by Grochow, Kumar, Saks & Saraf (2017). Our result means that many existing algebraic complexity lower bound proofs can be efficiently converted into isotypic lower bound proofs via highest weight metapolynomials, a notion studied in geometric complexity theory. In the context of algebraic natural proofs, it means that without loss of generality algebraic natural proofs can be assumed to be isotypic. Our proof is built on the Poincaré-Birkhoff-Witt theorem for Lie algebras and on Gelfand-Tsetlin theory, for which we give the necessary comprehensive background.
△ Less
Submitted 7 February, 2025; v1 submitted 5 November, 2024;
originally announced November 2024.
-
Approximating Auction Equilibria with Reinforcement Learning
Authors:
Pranjal Rawat
Abstract:
Traditional methods for computing equilibria in auctions become computationally intractable as auction complexity increases, particularly in multi-item and dynamic auctions. This paper introduces a self-play based reinforcement learning approach that employs advanced algorithms such as Proximal Policy Optimization and Neural Fictitious Self-Play to approximate Bayes-Nash equilibria. This framework…
▽ More
Traditional methods for computing equilibria in auctions become computationally intractable as auction complexity increases, particularly in multi-item and dynamic auctions. This paper introduces a self-play based reinforcement learning approach that employs advanced algorithms such as Proximal Policy Optimization and Neural Fictitious Self-Play to approximate Bayes-Nash equilibria. This framework allows for continuous action spaces, high-dimensional information states, and delayed payoffs. Through self-play, these algorithms can learn robust and near-optimal bidding strategies in auctions with known equilibria, including those with symmetric and asymmetric valuations, private and interdependent values, and multi-round auctions.
△ Less
Submitted 17 October, 2024;
originally announced October 2024.
-
Balancing Continuous Pre-Training and Instruction Fine-Tuning: Optimizing Instruction-Following in LLMs
Authors:
Ishan Jindal,
Chandana Badrinath,
Pranjal Bharti,
Lakkidi Vinay,
Sachin Dev Sharma
Abstract:
Large Language Models (LLMs) for public use require continuous pre-training to remain up-to-date with the latest data. The models also need to be fine-tuned with specific instructions to maintain their ability to follow instructions accurately. Typically, LLMs are released in two versions: the Base LLM, pre-trained on diverse data, and the instruction-refined LLM, additionally trained with specifi…
▽ More
Large Language Models (LLMs) for public use require continuous pre-training to remain up-to-date with the latest data. The models also need to be fine-tuned with specific instructions to maintain their ability to follow instructions accurately. Typically, LLMs are released in two versions: the Base LLM, pre-trained on diverse data, and the instruction-refined LLM, additionally trained with specific instructions for better instruction following. The question arises as to which model should undergo continuous pre-training to maintain its instruction-following abilities while also staying current with the latest data. In this study, we delve into the intricate relationship between continuous pre-training and instruction fine-tuning of the LLMs and investigate the impact of continuous pre-training on the instruction following abilities of both the base and its instruction finetuned model. Further, the instruction fine-tuning process is computationally intense and requires a substantial number of hand-annotated examples for the model to learn effectively. This study aims to find the most compute-efficient strategy to gain up-to-date knowledge and instruction-following capabilities without requiring any instruction data and fine-tuning. We empirically prove our findings on the LLaMa 3, 3.1 and Qwen 2, 2.5 family of base and instruction models, providing a comprehensive exploration of our hypotheses across varying sizes of pre-training data corpus and different LLMs settings.
△ Less
Submitted 14 October, 2024;
originally announced October 2024.
-
Taming the Ransomware Threats: Leveraging Prospect Theory for Rational Payment Decisions
Authors:
Pranjal Sharma
Abstract:
Day by day, the frequency of ransomware attacks on organizations is experiencing a significant surge. High-profile incidents involving major entities like Las Vegas giants MGM Resorts, Caesar Entertainment, and Boeing underscore the profound impact, posing substantial business barriers. When a sudden cyberattack occurs, organizations often find themselves at a loss, with a looming countdown to pay…
▽ More
Day by day, the frequency of ransomware attacks on organizations is experiencing a significant surge. High-profile incidents involving major entities like Las Vegas giants MGM Resorts, Caesar Entertainment, and Boeing underscore the profound impact, posing substantial business barriers. When a sudden cyberattack occurs, organizations often find themselves at a loss, with a looming countdown to pay the ransom, leading to a cascade of impromptu and unfavourable decisions. This paper adopts a novel approach, leveraging Prospect Theory, to elucidate the tactics employed by cyber attackers to entice organizations into paying the ransom. Furthermore, it introduces an algorithm based on Prospect Theory and an Attack Recovery Plan, enabling organizations to make informed decisions on whether to consent to the ransom demands or resist. This algorithm Ransomware Risk Analysis and Decision Support (RADS) uses Prospect Theory to re-instantiate the shifted reference manipulated as perceived gains by attackers and adjusts for the framing effect created due to time urgency. Additionally, leveraging application criticality and incorporating Prospect Theory's insights into under/over weighing probabilities, RADS facilitates informed decision-making that transcends the simplistic framework of "consent" or "resistance," enabling organizations to achieve optimal decisions.
△ Less
Submitted 15 September, 2024;
originally announced September 2024.
-
Towards Inducing Long-Context Abilities in Multilingual Neural Machine Translation Models
Authors:
Varun Gumma,
Pranjal A. Chitale,
Kalika Bali
Abstract:
Neural Machine Translation (NMT) models have traditionally used Sinusoidal Positional Embeddings (PEs), which often struggle to capture long-range dependencies and are inefficient for handling extended context or document-level translation tasks. This work addresses the challenge of transitioning pre-trained NMT models from absolute Sinusoidal PEs to Relative PEs, such as RoPE and ALiBi, without c…
▽ More
Neural Machine Translation (NMT) models have traditionally used Sinusoidal Positional Embeddings (PEs), which often struggle to capture long-range dependencies and are inefficient for handling extended context or document-level translation tasks. This work addresses the challenge of transitioning pre-trained NMT models from absolute Sinusoidal PEs to Relative PEs, such as RoPE and ALiBi, without compromising performance. We demonstrate that parameter-efficient fine-tuning, using only a small amount of high-quality data, can successfully facilitate this transition. Experimental results indicate that switching from Sinusoidal to Relative PEs results in competitive translation quality on sentence-level evaluation benchmarks. Additionally, models trained with RoPE consistently outperform those using ALiBi and Sinusoidal PEs on document-level benchmarks across both string-based metrics and qualitative evaluations. Moreover, we find that a small amount of long-context data in a few languages is sufficient for cross-lingual length generalization, thereby inducing long-context capabilities.
△ Less
Submitted 9 February, 2025; v1 submitted 21 August, 2024;
originally announced August 2024.
-
On Fourier analysis of sparse Boolean functions over certain Abelian groups
Authors:
Sourav Chakraborty,
Swarnalipa Datta,
Pranjal Dutta,
Arijit Ghosh,
Swagato Sanyal
Abstract:
Given an Abelian group G, a Boolean-valued function f: G -> {-1,+1}, is said to be s-sparse, if it has at most s-many non-zero Fourier coefficients over the domain G. In a seminal paper, Gopalan et al. proved "Granularity" for Fourier coefficients of Boolean valued functions over Z_2^n, that have found many diverse applications in theoretical computer science and combinatorics. They also studied s…
▽ More
Given an Abelian group G, a Boolean-valued function f: G -> {-1,+1}, is said to be s-sparse, if it has at most s-many non-zero Fourier coefficients over the domain G. In a seminal paper, Gopalan et al. proved "Granularity" for Fourier coefficients of Boolean valued functions over Z_2^n, that have found many diverse applications in theoretical computer science and combinatorics. They also studied structural results for Boolean functions over Z_2^n which are approximately Fourier-sparse. In this work, we obtain structural results for approximately Fourier-sparse Boolean valued functions over Abelian groups G of the form,G:= Z_{p_1}^{n_1} \times ... \times Z_{p_t}^{n_t}, for distinct primes p_i. We also obtain a lower bound of the form 1/(m^{2}s)^ceiling(phi(m)/2), on the absolute value of the smallest non-zero Fourier coefficient of an s-sparse function, where m=p_1 ... p_t, and phi(m)=(p_1-1) ... (p_t-1). We carefully apply probabilistic techniques from Gopalan et al., to obtain our structural results, and use some non-trivial results from algebraic number theory to get the lower bound.
We construct a family of at most s-sparse Boolean functions over Z_p^n, where p > 2, for arbitrarily large enough s, where the minimum non-zero Fourier coefficient is 1/omega(n). The "Granularity" result of Gopalan et al. implies that the absolute values of non-zero Fourier coefficients of any s-sparse Boolean valued function over Z_2^n are 1/O(s). So, our result shows that one cannot expect such a lower bound for general Abelian groups.
Using our new structural results on the Fourier coefficients of sparse functions, we design an efficient testing algorithm for Fourier-sparse Boolean functions, thata requires poly((ms)^phi(m),1/epsilon)-many queries. Further, we prove an Omega(sqrt{s}) lower bound on the query complexity of any adaptive sparsity testing algorithm.
△ Less
Submitted 26 June, 2024;
originally announced June 2024.
-
Learning Neural Networks with Sparse Activations
Authors:
Pranjal Awasthi,
Nishanth Dikkala,
Pritish Kamath,
Raghu Meka
Abstract:
A core component present in many successful neural network architectures, is an MLP block of two fully connected layers with a non-linear activation in between. An intriguing phenomenon observed empirically, including in transformer architectures, is that, after training, the activations in the hidden layer of this MLP block tend to be extremely sparse on any given input. Unlike traditional forms…
▽ More
A core component present in many successful neural network architectures, is an MLP block of two fully connected layers with a non-linear activation in between. An intriguing phenomenon observed empirically, including in transformer architectures, is that, after training, the activations in the hidden layer of this MLP block tend to be extremely sparse on any given input. Unlike traditional forms of sparsity, where there are neurons/weights which can be deleted from the network, this form of {\em dynamic} activation sparsity appears to be harder to exploit to get more efficient networks. Motivated by this we initiate a formal study of PAC learnability of MLP layers that exhibit activation sparsity. We present a variety of results showing that such classes of functions do lead to provable computational and statistical advantages over their non-sparse counterparts. Our hope is that a better theoretical understanding of {\em sparsely activated} networks would lead to methods that can exploit activation sparsity in practice.
△ Less
Submitted 25 June, 2024;
originally announced June 2024.
-
Evaluating Serverless Machine Learning Performance on Google Cloud Run
Authors:
Prerana Khatiwada,
Pranjal Dhakal
Abstract:
End-users can get functions-as-a-service from serverless platforms, which promise lower hosting costs, high availability, fault tolerance, and dynamic flexibility for hosting individual functions known as microservices. Machine learning tools are seen to be reliably useful, and the services created using these tools are in increasing demand on a large scale. The serverless platforms are uniquely s…
▽ More
End-users can get functions-as-a-service from serverless platforms, which promise lower hosting costs, high availability, fault tolerance, and dynamic flexibility for hosting individual functions known as microservices. Machine learning tools are seen to be reliably useful, and the services created using these tools are in increasing demand on a large scale. The serverless platforms are uniquely suited for hosting these machine learning services to be used for large-scale applications. These platforms are well known for their cost efficiency, fault tolerance, resource scaling, robust APIs for communication, and global reach. However, machine learning services are different from the web-services in that these serverless platforms were originally designed to host web services. We aimed to understand how these serverless platforms handle machine learning workloads with our study. We examine machine learning performance on one of the serverless platforms - Google Cloud Run, which is a GPU-less infrastructure that is not designed for machine learning application deployment.
△ Less
Submitted 23 June, 2024;
originally announced June 2024.
-
ReMI: A Dataset for Reasoning with Multiple Images
Authors:
Mehran Kazemi,
Nishanth Dikkala,
Ankit Anand,
Petar Devic,
Ishita Dasgupta,
Fangyu Liu,
Bahare Fatemi,
Pranjal Awasthi,
Dee Guo,
Sreenivas Gollapudi,
Ahmed Qureshi
Abstract:
With the continuous advancement of large language models (LLMs), it is essential to create new benchmarks to effectively evaluate their expanding capabilities and identify areas for improvement. This work focuses on multi-image reasoning, an emerging capability in state-of-the-art LLMs. We introduce ReMI, a dataset designed to assess LLMs' ability to Reason with Multiple Images. This dataset encom…
▽ More
With the continuous advancement of large language models (LLMs), it is essential to create new benchmarks to effectively evaluate their expanding capabilities and identify areas for improvement. This work focuses on multi-image reasoning, an emerging capability in state-of-the-art LLMs. We introduce ReMI, a dataset designed to assess LLMs' ability to Reason with Multiple Images. This dataset encompasses a diverse range of tasks, spanning various reasoning domains such as math, physics, logic, code, table/chart understanding, and spatial and temporal reasoning. It also covers a broad spectrum of characteristics found in multi-image reasoning scenarios. We have benchmarked several cutting-edge LLMs using ReMI and found a substantial gap between their performance and human-level proficiency. This highlights the challenges in multi-image reasoning and the need for further research. Our analysis also reveals the strengths and weaknesses of different models, shedding light on the types of reasoning that are currently attainable and areas where future models require improvement. To foster further research in this area, we are releasing ReMI publicly: https://huggingface.co/datasets/mehrankazemi/ReMI.
△ Less
Submitted 13 June, 2024;
originally announced June 2024.
-
Synchronous Programming with Refinement Types
Authors:
Jiawei Chen,
José Luiz Vargas de Mendonça,
Bereket Shimels Ayele,
Bereket Ngussie Bekele,
Shayan Jalili,
Pranjal Sharma,
Nicholas Wohlfeil,
Yicheng Zhang,
Jean-Baptiste Jeannin
Abstract:
Cyber-Physical Systems (CPS) consist of software interacting with the physical world, such as robots, vehicles, and industrial processes. CPS are frequently responsible for the safety of lives, property, or the environment, and so software correctness must be determined with a high degree of certainty. To that end, simply testing a CPS is insufficient, as its interactions with the physical world m…
▽ More
Cyber-Physical Systems (CPS) consist of software interacting with the physical world, such as robots, vehicles, and industrial processes. CPS are frequently responsible for the safety of lives, property, or the environment, and so software correctness must be determined with a high degree of certainty. To that end, simply testing a CPS is insufficient, as its interactions with the physical world may be difficult to predict, and unsafe conditions may not be immediately obvious. Formal verification can provide stronger safety guarantees but relies on the accuracy of the verified system in representing the real system. Bringing together verification and implementation can be challenging, as languages that are typically used to implement CPS are not easy to formally verify, and languages that lend themselves well to verification often abstract away low-level implementation details. Translation between verification and implementation languages is possible, but requires additional assurances in the translation process and increases software complexity; having both in a single language is desirable. This paper presents a formalization of MARVeLus, a CPS language which combines verification and implementation. We develop a metatheory for its synchronous refinement type system and demonstrate verified synchronous programs executing on real systems.
△ Less
Submitted 4 September, 2024; v1 submitted 10 June, 2024;
originally announced June 2024.
-
CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark
Authors:
David Romero,
Chenyang Lyu,
Haryo Akbarianto Wibowo,
Teresa Lynn,
Injy Hamed,
Aditya Nanda Kishore,
Aishik Mandal,
Alina Dragonetti,
Artem Abzaliev,
Atnafu Lambebo Tonja,
Bontu Fufa Balcha,
Chenxi Whitehouse,
Christian Salamea,
Dan John Velasco,
David Ifeoluwa Adelani,
David Le Meur,
Emilio Villa-Cueva,
Fajri Koto,
Fauzan Farooqui,
Frederico Belcavello,
Ganzorig Batnasan,
Gisela Vallejo,
Grainne Caulfield,
Guido Ivetta,
Haiyue Song
, et al. (51 additional authors not shown)
Abstract:
Visual Question Answering (VQA) is an important task in multimodal AI, and it is often used to test the ability of vision-language models to understand and reason on knowledge present in both visual and textual data. However, most of the current VQA models use datasets that are primarily focused on English and a few major world languages, with images that are typically Western-centric. While recen…
▽ More
Visual Question Answering (VQA) is an important task in multimodal AI, and it is often used to test the ability of vision-language models to understand and reason on knowledge present in both visual and textual data. However, most of the current VQA models use datasets that are primarily focused on English and a few major world languages, with images that are typically Western-centric. While recent efforts have tried to increase the number of languages covered on VQA datasets, they still lack diversity in low-resource languages. More importantly, although these datasets often extend their linguistic range via translation or some other approaches, they usually keep images the same, resulting in narrow cultural representation. To address these limitations, we construct CVQA, a new Culturally-diverse multilingual Visual Question Answering benchmark, designed to cover a rich set of languages and cultures, where we engage native speakers and cultural experts in the data collection process. As a result, CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions. We then benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models. This benchmark can serve as a probing evaluation suite for assessing the cultural capability and bias of multimodal models and hopefully encourage more research efforts toward increasing cultural awareness and linguistic diversity in this field.
△ Less
Submitted 4 November, 2024; v1 submitted 9 June, 2024;
originally announced June 2024.
-
Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation
Authors:
Bernd Bohnet,
Kevin Swersky,
Rosanne Liu,
Pranjal Awasthi,
Azade Nova,
Javier Snaider,
Hanie Sedghi,
Aaron T Parisi,
Michael Collins,
Angeliki Lazaridou,
Orhan Firat,
Noah Fiedel
Abstract:
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books. Previous efforts to construct such datasets relied on crowd-sourcing, but the emergence of transformers with a context size of 1 million or more tokens now enables entirely automatic approaches. Our objective is to test the capabilities of LLMs to analyze, unde…
▽ More
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books. Previous efforts to construct such datasets relied on crowd-sourcing, but the emergence of transformers with a context size of 1 million or more tokens now enables entirely automatic approaches. Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text, such as questions involving character arcs, broader themes, or the consequences of early actions later in the story. We propose a holistic pipeline for automatic data generation including question generation, answering, and model scoring using an ``Evaluator''. We find that a relative approach, comparing answers between models in a pairwise fashion and ranking with a Bradley-Terry model, provides a more consistent and differentiating scoring mechanism than an absolute scorer that rates answers individually. We also show that LLMs from different model families produce moderate agreement in their ratings. We ground our approach using the manually curated NarrativeQA dataset, where our evaluator shows excellent agreement with human judgement and even finds errors in the dataset. Using our automatic evaluation approach, we show that using an entire book as context produces superior reading comprehension performance compared to baseline no-context (parametric knowledge only) and retrieval-based approaches.
△ Less
Submitted 31 May, 2024;
originally announced June 2024.
-
Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure
Authors:
Hanseul Cho,
Jaeyoung Cha,
Pranjal Awasthi,
Srinadh Bhojanapalli,
Anupam Gupta,
Chulhee Yun
Abstract:
Even for simple arithmetic tasks like integer addition, it is challenging for Transformers to generalize to longer sequences than those encountered during training. To tackle this problem, we propose position coupling, a simple yet effective method that directly embeds the structure of the tasks into the positional encoding of a (decoder-only) Transformer. Taking a departure from the vanilla absol…
▽ More
Even for simple arithmetic tasks like integer addition, it is challenging for Transformers to generalize to longer sequences than those encountered during training. To tackle this problem, we propose position coupling, a simple yet effective method that directly embeds the structure of the tasks into the positional encoding of a (decoder-only) Transformer. Taking a departure from the vanilla absolute position mechanism assigning unique position IDs to each of the tokens, we assign the same position IDs to two or more "relevant" tokens; for integer addition tasks, we regard digits of the same significance as in the same position. On the empirical side, we show that with the proposed position coupling, our models trained on 1 to 30-digit additions can generalize up to 200-digit additions (6.67x of the trained length). On the theoretical side, we prove that a 1-layer Transformer with coupled positions can solve the addition task involving exponentially many digits, whereas any 1-layer Transformer without positional information cannot entirely solve it. We also demonstrate that position coupling can be applied to other algorithmic tasks such as Nx2 multiplication and a two-dimensional task.
△ Less
Submitted 30 October, 2024; v1 submitted 31 May, 2024;
originally announced May 2024.
-
RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs
Authors:
Shreyas Chaudhari,
Pranjal Aggarwal,
Vishvak Murahari,
Tanmay Rajpurohit,
Ashwin Kalyan,
Karthik Narasimhan,
Ameet Deshpande,
Bruno Castro da Silva
Abstract:
State-of-the-art large language models (LLMs) have become indispensable tools for various tasks. However, training LLMs to serve as effective assistants for humans requires careful consideration. A promising approach is reinforcement learning from human feedback (RLHF), which leverages human feedback to update the model in accordance with human preferences and mitigate issues like toxicity and hal…
▽ More
State-of-the-art large language models (LLMs) have become indispensable tools for various tasks. However, training LLMs to serve as effective assistants for humans requires careful consideration. A promising approach is reinforcement learning from human feedback (RLHF), which leverages human feedback to update the model in accordance with human preferences and mitigate issues like toxicity and hallucinations. Yet, an understanding of RLHF for LLMs is largely entangled with initial design choices that popularized the method and current research focuses on augmenting those choices rather than fundamentally improving the framework. In this paper, we analyze RLHF through the lens of reinforcement learning principles to develop an understanding of its fundamentals, dedicating substantial focus to the core component of RLHF -- the reward model. Our study investigates modeling choices, caveats of function approximation, and their implications on RLHF training algorithms, highlighting the underlying assumptions made about the expressivity of reward. Our analysis improves the understanding of the role of reward models and methods for their training, concurrently revealing limitations of the current methodology. We characterize these limitations, including incorrect generalization, model misspecification, and the sparsity of feedback, along with their impact on the performance of a language model. The discussion and analysis are substantiated by a categorical review of current literature, serving as a reference for researchers and practitioners to understand the challenges of RLHF and build upon existing efforts.
△ Less
Submitted 15 April, 2024; v1 submitted 12 April, 2024;
originally announced April 2024.
-
LeGo-Drive: Language-enhanced Goal-oriented Closed-Loop End-to-End Autonomous Driving
Authors:
Pranjal Paul,
Anant Garg,
Tushar Choudhary,
Arun Kumar Singh,
K. Madhava Krishna
Abstract:
Existing Vision-Language models (VLMs) estimate either long-term trajectory waypoints or a set of control actions as a reactive solution for closed-loop planning based on their rich scene comprehension. However, these estimations are coarse and are subjective to their "world understanding" which may generate sub-optimal decisions due to perception errors. In this paper, we introduce LeGo-Drive, wh…
▽ More
Existing Vision-Language models (VLMs) estimate either long-term trajectory waypoints or a set of control actions as a reactive solution for closed-loop planning based on their rich scene comprehension. However, these estimations are coarse and are subjective to their "world understanding" which may generate sub-optimal decisions due to perception errors. In this paper, we introduce LeGo-Drive, which aims to address this issue by estimating a goal location based on the given language command as an intermediate representation in an end-to-end setting. The estimated goal might fall in a non-desirable region, like on top of a car for a parking-like command, leading to inadequate planning. Hence, we propose to train the architecture in an end-to-end manner, resulting in iterative refinement of both the goal and the trajectory collectively. We validate the effectiveness of our method through comprehensive experiments conducted in diverse simulated environments. We report significant improvements in standard autonomous driving metrics, with a goal reaching Success Rate of 81%. We further showcase the versatility of LeGo-Drive across different driving scenarios and linguistic inputs, underscoring its potential for practical deployment in autonomous vehicles and intelligent transportation systems.
△ Less
Submitted 29 March, 2024;
originally announced March 2024.
-
A Machine learning and Empirical Bayesian Approach for Predictive Buying in B2B E-commerce
Authors:
Tuhin Subhra De,
Pranjal Singh,
Alok Patel
Abstract:
In the context of developing nations like India, traditional business to business (B2B) commerce heavily relies on the establishment of robust relationships, trust, and credit arrangements between buyers and sellers. Consequently, ecommerce enterprises frequently. Established in 2016 with a vision to revolutionize trade in India through technology, Udaan is the countrys largest business to busines…
▽ More
In the context of developing nations like India, traditional business to business (B2B) commerce heavily relies on the establishment of robust relationships, trust, and credit arrangements between buyers and sellers. Consequently, ecommerce enterprises frequently. Established in 2016 with a vision to revolutionize trade in India through technology, Udaan is the countrys largest business to business ecommerce platform. Udaan operates across diverse product categories, including lifestyle, electronics, home and employ telecallers to cultivate buyer relationships, streamline order placement procedures, and promote special promotions. The accurate anticipation of buyer order placement behavior emerges as a pivotal factor for attaining sustainable growth, heightening competitiveness, and optimizing the efficiency of these telecallers. To address this challenge, we have employed an ensemble approach comprising XGBoost and a modified version of Poisson Gamma model to predict customer order patterns with precision. This paper provides an in-depth exploration of the strategic fusion of machine learning and an empirical Bayesian approach, bolstered by the judicious selection of pertinent features. This innovative approach has yielded a remarkable 3 times increase in customer order rates, show casing its potential for transformative impact in the ecommerce industry.
△ Less
Submitted 12 March, 2024;
originally announced March 2024.
-
Stacking as Accelerated Gradient Descent
Authors:
Naman Agarwal,
Pranjal Awasthi,
Satyen Kale,
Eric Zhao
Abstract:
Stacking, a heuristic technique for training deep residual networks by progressively increasing the number of layers and initializing new layers by copying parameters from older layers, has proven quite successful in improving the efficiency of training deep neural networks. In this paper, we propose a theoretical explanation for the efficacy of stacking: viz., stacking implements a form of Nester…
▽ More
Stacking, a heuristic technique for training deep residual networks by progressively increasing the number of layers and initializing new layers by copying parameters from older layers, has proven quite successful in improving the efficiency of training deep neural networks. In this paper, we propose a theoretical explanation for the efficacy of stacking: viz., stacking implements a form of Nesterov's accelerated gradient descent. The theory also covers simpler models such as the additive ensembles constructed in boosting methods, and provides an explanation for a similar widely-used practical heuristic for initializing the new classifier in each round of boosting. We also prove that for certain deep linear residual networks, stacking does provide accelerated training, via a new potential function analysis of the Nesterov's accelerated gradient method which allows errors in updates. We conduct proof-of-concept experiments to validate our theory as well.
△ Less
Submitted 19 February, 2025; v1 submitted 7 March, 2024;
originally announced March 2024.
-
On Distributed Larger-Than-Memory Subset Selection With Pairwise Submodular Functions
Authors:
Maximilian Böther,
Abraham Sebastian,
Pranjal Awasthi,
Ana Klimovic,
Srikumar Ramalingam
Abstract:
Modern datasets span billions of samples, making training on all available data infeasible. Selecting a high quality subset helps in reducing training costs and enhancing model quality. Submodularity, a discrete analogue of convexity, is commonly used for solving such subset selection problems. However, existing algorithms for optimizing submodular functions are sequential, and the prior distribut…
▽ More
Modern datasets span billions of samples, making training on all available data infeasible. Selecting a high quality subset helps in reducing training costs and enhancing model quality. Submodularity, a discrete analogue of convexity, is commonly used for solving such subset selection problems. However, existing algorithms for optimizing submodular functions are sequential, and the prior distributed methods require at least one central machine to fit the target subset in DRAM. At billion datapoint scale, even the subset may not fit a single machine, and the sequential algorithms are prohibitively slow. In this paper, we relax the requirement of having a central machine for the target subset by proposing a novel distributed bounding algorithm with provable approximation guarantees. The algorithm iteratively bounds the minimum and maximum utility values to select high quality points and discard the unimportant ones. When bounding does not find the complete subset, we use a multi-round, partition-based distributed greedy algorithm to identify the remaining subset. We discuss how to implement these algorithms in a distributed data processing framework and empirically analyze different configurations. We find high quality subsets on CIFAR-100 and ImageNet with marginal or no loss in quality compared to centralized methods, and scale to a dataset with 13 billion points.
△ Less
Submitted 3 April, 2025; v1 submitted 26 February, 2024;
originally announced February 2024.
-
Post-Quantum Cryptography
Authors:
Pranjal,
Atul Chaturvedi
Abstract:
In this survey we propose to cover the prose of post-quantum cryptography over classical cryptography. We talk about the various cryptographic methods that are being practiced to safeguard our information. The future of secure communication is expected to be the implementation of quantum-safe cryptographic systems, and that in the post-quantum era, the development of post-quantum cryptography is e…
▽ More
In this survey we propose to cover the prose of post-quantum cryptography over classical cryptography. We talk about the various cryptographic methods that are being practiced to safeguard our information. The future of secure communication is expected to be the implementation of quantum-safe cryptographic systems, and that in the post-quantum era, the development of post-quantum cryptography is essential for ensuring the security of sensitive data.
△ Less
Submitted 16 February, 2024;
originally announced February 2024.
-
Majority Kernels: An Approach to Leverage Big Model Dynamics for Efficient Small Model Training
Authors:
Hanna Mazzawi,
Pranjal Awasthi,
Xavi Gonzalvo,
Srikumar Ramalingam
Abstract:
Recent breakthroughs and successful deployment of large language and vision models in a constrained environment predominantly follow a two phase approach. First, large models are trained to achieve peak performance, followed by a model shrinking method to meet hardware constraints; Methods like distillation, compression or quantization help leverage the highly performant large models to induce sma…
▽ More
Recent breakthroughs and successful deployment of large language and vision models in a constrained environment predominantly follow a two phase approach. First, large models are trained to achieve peak performance, followed by a model shrinking method to meet hardware constraints; Methods like distillation, compression or quantization help leverage the highly performant large models to induce smaller performant ones. Formally, this can be seen as the problem of identifying an optimal model of size $n$ from a larger model of size $k \cdot n$, where $k > 1$ is the overparameterization factor. This paper explores the hypothesis that a single training run can simultaneously train a larger model for performance and derive a smaller model for deployment.
Our contribution is an effective architectural change, namely, {\it Majority Kernels} that is compatible with the main standard architectures such as multi-layer perceptrons (MLPs), Residual networks (ResNets), and Transformers. We demonstrate that applying our technique can modify the training dynamics resulting in performance gains across architectures and tasks while maintaining the inference performance consistent. Furthermore, our approach adds minimal overhead to the cost incurred (wall clock time) at training time. The proposed approach shows strong performance on a wide variety of datasets and models, even outperforming strong baselines such as distilled ensembles as well as combinatorial optimization methods based on submodular optimization.
△ Less
Submitted 20 November, 2024; v1 submitted 7 February, 2024;
originally announced February 2024.
-
An Empirical Study of In-context Learning in LLMs for Machine Translation
Authors:
Pranjal A. Chitale,
Jay Gala,
Raj Dabre
Abstract:
Recent interest has surged in employing Large Language Models (LLMs) for machine translation (MT) via in-context learning (ICL) (Vilar et al., 2023). Most prior studies primarily focus on optimizing translation quality, with limited attention to understanding the specific aspects of ICL that influence the said quality. To this end, we perform the first of its kind, an exhaustive study of in-contex…
▽ More
Recent interest has surged in employing Large Language Models (LLMs) for machine translation (MT) via in-context learning (ICL) (Vilar et al., 2023). Most prior studies primarily focus on optimizing translation quality, with limited attention to understanding the specific aspects of ICL that influence the said quality. To this end, we perform the first of its kind, an exhaustive study of in-context learning for machine translation. We first establish that ICL is primarily example-driven and not instruction-driven. Following this, we conduct an extensive exploration of various aspects of the examples to understand their influence on downstream performance. Our analysis includes factors such as quality and quantity of demonstrations, spatial proximity, and source versus target originality. Further, we also investigate challenging scenarios involving indirectness and misalignment of examples to understand the limits of ICL. While we establish the significance of the quality of the target distribution over the source distribution of demonstrations, we further observe that perturbations sometimes act as regularizers, resulting in performance improvements. Surprisingly, ICL does not necessitate examples from the same task, and a related task with the same target distribution proves sufficient. We hope that our study acts as a guiding resource for considerations in utilizing ICL for MT. Our code is available on https://github.com/PranjalChitale/in-context-mt-analysis.
△ Less
Submitted 4 June, 2024; v1 submitted 22 January, 2024;
originally announced January 2024.
-
Fixed-parameter debordering of Waring rank
Authors:
Pranjal Dutta,
Fulvio Gesmundo,
Christian Ikenmeyer,
Gorav Jindal,
Vladimir Lysikov
Abstract:
Border complexity measures are defined via limits (or topological closures), so that any function which can approximated arbitrarily closely by low complexity functions itself has low border complexity. Debordering is the task of proving an upper bound on some non-border complexity measure in terms of a border complexity measure, thus getting rid of limits.
Debordering is at the heart of underst…
▽ More
Border complexity measures are defined via limits (or topological closures), so that any function which can approximated arbitrarily closely by low complexity functions itself has low border complexity. Debordering is the task of proving an upper bound on some non-border complexity measure in terms of a border complexity measure, thus getting rid of limits.
Debordering is at the heart of understanding the difference between Valiant's determinant vs permanent conjecture, and Mulmuley and Sohoni's variation which uses border determinantal complexity. The debordering of matrix multiplication tensors by Bini played a pivotal role in the development of efficient matrix multiplication algorithms. Consequently, debordering finds applications in both establishing computational complexity lower bounds and facilitating algorithm design. Currently, very few debordering results are known.
In this work, we study the question of debordering the border Waring rank of polynomials. Waring and border Waring rank are very well studied measures in the context of invariant theory, algebraic geometry, and matrix multiplication algorithms. For the first time, we obtain a Waring rank upper bound that is exponential in the border Waring rank and only linear in the degree. All previous known results were exponential in the degree. For polynomials with constant border Waring rank, our results imply an upper bound on the Waring rank linear in degree, which previously was only known for polynomials with border Waring rank at most 5.
△ Less
Submitted 15 January, 2024;
originally announced January 2024.
-
A Weighted K-Center Algorithm for Data Subset Selection
Authors:
Srikumar Ramalingam,
Pranjal Awasthi,
Sanjiv Kumar
Abstract:
The success of deep learning hinges on enormous data and large models, which require labor-intensive annotations and heavy computation costs. Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data, which can then be used to produce similar models as the ones trained with full data. Two prior methods are shown to achieve impressive re…
▽ More
The success of deep learning hinges on enormous data and large models, which require labor-intensive annotations and heavy computation costs. Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data, which can then be used to produce similar models as the ones trained with full data. Two prior methods are shown to achieve impressive results: (1) margin sampling that focuses on selecting points with high uncertainty, and (2) core-sets or clustering methods such as k-center for informative and diverse subsets. We are not aware of any work that combines these methods in a principled manner. To this end, we develop a novel and efficient factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions. To handle large datasets, we show a parallel algorithm to run on multiple machines with approximation guarantees. The proposed algorithm achieves similar or better performance compared to other strong baselines on vision datasets such as CIFAR-10, CIFAR-100, and ImageNet.
△ Less
Submitted 16 December, 2023;
originally announced December 2023.
-
Homogeneous Algebraic Complexity Theory and Algebraic Formulas
Authors:
Pranjal Dutta,
Fulvio Gesmundo,
Christian Ikenmeyer,
Gorav Jindal,
Vladimir Lysikov
Abstract:
We study algebraic complexity classes and their complete polynomials under \emph{homogeneous linear} projections, not just under the usual affine linear projections that were originally introduced by Valiant in 1979. These reductions are weaker yet more natural from a geometric complexity theory (GCT) standpoint, because the corresponding orbit closure formulations do not require the padding of po…
▽ More
We study algebraic complexity classes and their complete polynomials under \emph{homogeneous linear} projections, not just under the usual affine linear projections that were originally introduced by Valiant in 1979. These reductions are weaker yet more natural from a geometric complexity theory (GCT) standpoint, because the corresponding orbit closure formulations do not require the padding of polynomials. We give the \emph{first} complete polynomials for VF, the class of sequences of polynomials that admit small algebraic formulas, under homogeneous linear projections: The sum of the entries of the non-commutative elementary symmetric polynomial in 3 by 3 matrices of homogeneous linear forms.
Even simpler variants of the elementary symmetric polynomial are hard for the topological closure of a large subclass of VF: the sum of the entries of the non-commutative elementary symmetric polynomial in 2 by 2 matrices of homogeneous linear forms, and homogeneous variants of the continuant polynomial (Bringmann, Ikenmeyer, Zuiddam, JACM '18). This requires a careful study of circuits with arity-3 product gates.
△ Less
Submitted 28 November, 2023;
originally announced November 2023.
-
GEO: Generative Engine Optimization
Authors:
Pranjal Aggarwal,
Vishvak Murahari,
Tanmay Rajpurohit,
Ashwin Kalyan,
Karthik Narasimhan,
Ameet Deshpande
Abstract:
The advent of large language models (LLMs) has ushered in a new paradigm of search engines that use generative models to gather and summarize information to answer user queries. This emerging technology, which we formalize under the unified framework of generative engines (GEs), can generate accurate and personalized responses, rapidly replacing traditional search engines like Google and Bing. Gen…
▽ More
The advent of large language models (LLMs) has ushered in a new paradigm of search engines that use generative models to gather and summarize information to answer user queries. This emerging technology, which we formalize under the unified framework of generative engines (GEs), can generate accurate and personalized responses, rapidly replacing traditional search engines like Google and Bing. Generative Engines typically satisfy queries by synthesizing information from multiple sources and summarizing them using LLMs. While this shift significantly improves $\textit{user}$ utility and $\textit{generative search engine}$ traffic, it poses a huge challenge for the third stakeholder -- website and content creators. Given the black-box and fast-moving nature of generative engines, content creators have little to no control over $\textit{when}$ and $\textit{how}$ their content is displayed. With generative engines here to stay, we must ensure the creator economy is not disadvantaged. To address this, we introduce Generative Engine Optimization (GEO), the first novel paradigm to aid content creators in improving their content visibility in generative engine responses through a flexible black-box optimization framework for optimizing and defining visibility metrics. We facilitate systematic evaluation by introducing GEO-bench, a large-scale benchmark of diverse user queries across multiple domains, along with relevant web sources to answer these queries. Through rigorous evaluation, we demonstrate that GEO can boost visibility by up to $40\%$ in generative engine responses. Moreover, we show the efficacy of these strategies varies across domains, underscoring the need for domain-specific optimization methods. Our work opens a new frontier in information discovery systems, with profound implications for both developers of generative engines and content creators.
△ Less
Submitted 28 June, 2024; v1 submitted 16 November, 2023;
originally announced November 2023.
-
AutoMix: Automatically Mixing Language Models
Authors:
Pranjal Aggarwal,
Aman Madaan,
Ankit Anand,
Srividya Pranavi Potharaju,
Swaroop Mishra,
Pei Zhou,
Aditya Gupta,
Dheeraj Rajagopal,
Karthik Kappaganthu,
Yiming Yang,
Shyam Upadhyay,
Manaal Faruqui,
Mausam
Abstract:
Large language models (LLMs) are now available from cloud API providers in various sizes and configurations. While this diversity offers a broad spectrum of choices, effectively leveraging the options to optimize computational cost and performance remains challenging. In this work, we present Automix, an approach that strategically routes queries to larger LMs, based on the approximate correctness…
▽ More
Large language models (LLMs) are now available from cloud API providers in various sizes and configurations. While this diversity offers a broad spectrum of choices, effectively leveraging the options to optimize computational cost and performance remains challenging. In this work, we present Automix, an approach that strategically routes queries to larger LMs, based on the approximate correctness of outputs from a smaller LM. Central to Automix are two key technical contributions. First, it has a few-shot self-verification mechanism, which estimates the reliability of its own outputs without requiring extensive training. Second, given that self-verification can be noisy, it employs a POMDP based router that can effectively select an appropriately sized model, based on answer confidence. Experiments across five language models and five challenging datasets show that Automix consistently surpasses strong baselines, reducing computational cost by over 50% for comparable performance.
△ Less
Submitted 19 January, 2025; v1 submitted 19 October, 2023;
originally announced October 2023.
-
ALT-Pilot: Autonomous navigation with Language augmented Topometric maps
Authors:
Mohammad Omama,
Pranav Inani,
Pranjal Paul,
Sarat Chandra Yellapragada,
Krishna Murthy Jatavallabhula,
Sandeep Chinchali,
Madhava Krishna
Abstract:
We present an autonomous navigation system that operates without assuming HD LiDAR maps of the environment. Our system, ALT-Pilot, relies only on publicly available road network information and a sparse (and noisy) set of crowdsourced language landmarks. With the help of onboard sensors and a language-augmented topometric map, ALT-Pilot autonomously pilots the vehicle to any destination on the roa…
▽ More
We present an autonomous navigation system that operates without assuming HD LiDAR maps of the environment. Our system, ALT-Pilot, relies only on publicly available road network information and a sparse (and noisy) set of crowdsourced language landmarks. With the help of onboard sensors and a language-augmented topometric map, ALT-Pilot autonomously pilots the vehicle to any destination on the road network. We achieve this by leveraging vision-language models pre-trained on web-scale data to identify potential landmarks in a scene, incorporating vision-language features into the recursive Bayesian state estimation stack to generate global (route) plans, and a reactive trajectory planner and controller operating in the vehicle frame. We implement and evaluate ALT-Pilot in simulation and on a real, full-scale autonomous vehicle and report improvements over state-of-the-art topometric navigation systems by a factor of 3x on localization accuracy and 5x on goal reachability
△ Less
Submitted 3 October, 2023;
originally announced October 2023.
-
Improving Length-Generalization in Transformers via Task Hinting
Authors:
Pranjal Awasthi,
Anupam Gupta
Abstract:
It has been observed in recent years that transformers have problems with length generalization for certain types of reasoning and arithmetic tasks. In particular, the performance of a transformer model trained on tasks (say addition) up to a certain length (e.g., 5 digit numbers) drops sharply when applied to longer instances of the same problem. This work proposes an approach based on task hinti…
▽ More
It has been observed in recent years that transformers have problems with length generalization for certain types of reasoning and arithmetic tasks. In particular, the performance of a transformer model trained on tasks (say addition) up to a certain length (e.g., 5 digit numbers) drops sharply when applied to longer instances of the same problem. This work proposes an approach based on task hinting towards addressing length generalization. Our key idea is that while training the model on task-specific data, it is helpful to simultaneously train the model to solve a simpler but related auxiliary task as well.
We study the classical sorting problem as a canonical example to evaluate our approach. We design a multitask training framework and show that task hinting significantly improve length generalization. For sorting we show that it is possible to train models on data consisting of sequences having length at most $20$, and improve the test accuracy on sequences of length $100$ from less than 1% (for standard training) to more than 92% (via task hinting).
Our study uncovers several interesting aspects of length generalization. We observe that while several auxiliary tasks may seem natural a priori, their effectiveness in improving length generalization differs dramatically. We further use probing and visualization-based techniques to understand the internal mechanisms via which the model performs the task, and propose a theoretical construction consistent with the observed learning behaviors of the model. Based on our construction, we show that introducing a small number of length dependent parameters into the training procedure can further boost the performance on unseen lengths. Finally, we also show the efficacy of our task hinting based approach beyond sorting, giving hope that these techniques will be applicable in broader contexts.
△ Less
Submitted 1 October, 2023;
originally announced October 2023.