-
Rediscovery
Authors:
Martino Banchio,
Suraj Malladi
Abstract:
We model search in settings where decision makers know what can be found but not where to find it. A searcher faces a set of choices arranged by an observable attribute. Each period, she either selects a choice and pays a cost to learn about its quality, or she concludes search to take her best discovery to date. She knows that similar choices have similar qualities and uses this to guide her search. We identify robustly optimal search policies with a simple structure. Search is directional, recall is never invoked, there is a threshold stopping rule, and the policy at each history depends only on a simple index.
Submitted 28 April, 2025;
originally announced April 2025.
-
Overtrained Language Models Are Harder to Fine-Tune
Authors:
Jacob Mitchell Springer,
Sachin Goyal,
Kaiyue Wen,
Tanishq Kumar,
Xiang Yue,
Sadhika Malladi,
Graham Neubig,
Aditi Raghunathan
Abstract:
Large language models are pre-trained on ever-growing token budgets under the assumption that better pre-training performance translates to improved downstream models. In this work, we challenge this assumption and show that extended pre-training can make models harder to fine-tune, leading to degraded final performance. We term this phenomenon catastrophic overtraining. For example, the instruction-tuned OLMo-1B model pre-trained on 3T tokens leads to over 2% worse performance on multiple standard LLM benchmarks than its 2.3T token counterpart. Through controlled experiments and theoretical analysis, we show that catastrophic overtraining arises from a systematic increase in the broad sensitivity of pre-trained parameters to modifications, including but not limited to fine-tuning. Our findings call for a critical reassessment of pre-training design that considers the downstream adaptability of the model.
Submitted 27 March, 2025; v1 submitted 24 March, 2025;
originally announced March 2025.
-
Usable Privacy in Virtual Worlds: Design Implications for Data Collection Awareness and Control Interfaces in Virtual Reality
Authors:
Viktorija Paneva,
Verena Winterhalter,
Naga Sai Surya Vamsy Malladi,
Marvin Strauss,
Stefan Schneegass,
Florian Alt
Abstract:
Extended reality (XR) devices have become ubiquitous. They are equipped with arrays of sensors, collecting extensive user and environmental data, allowing inferences about sensitive user information users may not realize they are sharing. Current VR privacy notices largely replicate mechanisms from 2D interfaces, failing to leverage the unique affordances of virtual 3D environments. To address this, we conducted brainstorming and sketching sessions with novice game developers and designers, followed by privacy expert evaluations, to explore and refine privacy interfaces tailored for VR. Key challenges include balancing user engagement with privacy awareness, managing complex privacy information with user comprehension, and maintaining compliance and trust. We identify design implications such as thoughtful gamification, explicit and purpose-tied consent mechanisms, and granular, modifiable privacy control options. Our findings provide actionable guidance to researchers and practitioners for developing privacy-aware and user-friendly VR experiences.
Submitted 13 March, 2025;
originally announced March 2025.
-
BioAgents: Democratizing Bioinformatics Analysis with Multi-Agent Systems
Authors:
Nikita Mehandru,
Amanda K. Hall,
Olesya Melnichenko,
Yulia Dubinina,
Daniel Tsirulnikov,
David Bamman,
Ahmed Alaa,
Scott Saponas,
Venkat S. Malladi
Abstract:
Creating end-to-end bioinformatics workflows requires diverse domain expertise, which poses challenges for both junior and senior researchers as it demands a deep understanding of both genomics concepts and computational techniques. While large language models (LLMs) provide some assistance, they often fall short in providing the nuanced guidance needed to execute complex bioinformatics tasks, and require expensive computing resources to achieve high performance. We thus propose a multi-agent system built on small language models, fine-tuned on bioinformatics data, and enhanced with retrieval augmented generation (RAG). Our system, BioAgents, enables local operation and personalization using proprietary data. We observe performance comparable to human experts on conceptual genomics tasks, and suggest next steps to enhance code generation capabilities.
Submitted 10 January, 2025;
originally announced January 2025.
-
Metadata Conditioning Accelerates Language Model Pre-training
Authors:
Tianyu Gao,
Alexander Wettig,
Luxi He,
Yihe Dong,
Sadhika Malladi,
Danqi Chen
Abstract:
The vast diversity of styles, domains, and quality levels present in language model pre-training corpora is essential in developing general model capabilities, but efficiently learning and deploying the correct behaviors exemplified in each of these heterogeneous data sources is challenging. To address this, we propose a new method, termed Metadata Conditioning then Cooldown (MeCo), to incorporate additional learning cues during pre-training. MeCo first provides metadata (e.g., URLs like www.wikipedia.org) alongside the text during training and later uses a cooldown phase with only the standard text, thereby enabling the model to function normally even without metadata. MeCo significantly accelerates pre-training across different model scales (600M to 8B parameters) and training sources (C4, RefinedWeb, and DCLM). For instance, a 1.6B language model trained with MeCo matches the downstream task performance of standard pre-training while using 33% less data. Additionally, MeCo enables us to steer language models by conditioning the inference prompt on either real or fabricated metadata that encodes the desired properties of the output: for example, prepending wikipedia.org to reduce harmful generations or factquizmaster.com (fabricated) to improve common knowledge task performance. We also demonstrate that MeCo is compatible with different types of metadata, such as model-generated topics. MeCo is remarkably simple, adds no computational overhead, and demonstrates promise in producing more capable and steerable language models.
Submitted 27 June, 2025; v1 submitted 3 January, 2025;
originally announced January 2025.
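The two-phase data formatting described in the MeCo abstract above can be sketched roughly as follows. This is an illustration only, not the authors' implementation: the function name `format_meco_example`, the `URL: ` prefix template, and the 10% cooldown fraction are all assumptions made here for the example, since the abstract does not specify them.

```python
def format_meco_example(text, url, step, total_steps, cooldown_fraction=0.1):
    """Return the training string for one document at a given training step.

    Hypothetical helper: the prefix template and the default cooldown
    fraction are assumptions, not values taken from the abstract.
    """
    # Number of final steps that use plain text only (the cooldown phase).
    cooldown_steps = int(total_steps * cooldown_fraction)
    cooldown_start = total_steps - cooldown_steps
    if step >= cooldown_start:
        # Cooldown: standard text, so the model also works without metadata.
        return text
    # Main phase: metadata is placed alongside (here, before) the text.
    return f"URL: {url}\n\n{text}"


early = format_meco_example("hello", "en.wikipedia.org", step=0, total_steps=1000)
late = format_meco_example("hello", "en.wikipedia.org", step=900, total_steps=1000)
```

With these assumed settings, `early` carries the metadata prefix and `late` is the unmodified text. At inference time the same prefix could be prepended to a prompt to steer generation, as the abstract describes.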
-
Provable unlearning in topic modeling and downstream tasks
Authors:
Stanley Wei,
Sadhika Malladi,
Sanjeev Arora,
Amartya Sanyal
Abstract:
Machine unlearning algorithms are increasingly important as legal concerns arise around the provenance of training data, but verifying the success of unlearning is often difficult. Provable guarantees for unlearning are often limited to supervised learning settings. In this paper, we provide the first theoretical guarantees for unlearning in the pre-training and fine-tuning paradigm by studying topic models, simple bag-of-words language models that can be adapted to solve downstream tasks like retrieval and classification. First, we design a provably effective unlearning algorithm for topic models that incurs a computational overhead independent of the size of the original dataset. Our analysis additionally quantifies the deletion capacity of the model -- i.e., the number of examples that can be unlearned without incurring a significant cost in model performance. Finally, we formally extend our analyses to account for adaptation to a given downstream task. In particular, we design an efficient algorithm to perform unlearning after fine-tuning the topic model via a linear head. Notably, we show that it is easier to unlearn pre-training data from models that have been fine-tuned to a particular task, and one can unlearn this data without modifying the base model.
Submitted 20 April, 2025; v1 submitted 19 November, 2024;
originally announced November 2024.
-
Continuous Analysis: Evolution of Software Engineering and Reproducibility for Science
Authors:
Venkat S. Malladi,
Maria Yazykova,
Olesya Melnichenko,
Yulia Dubinina
Abstract:
Reproducibility in research remains hindered by complex systems involving data, models, tools, and algorithms. Studies highlight a reproducibility crisis due to a lack of standardized reporting, code and data sharing, and rigorous evaluation. This paper introduces the concept of Continuous Analysis to address the reproducibility challenges in scientific research, extending the DevOps lifecycle. Continuous Analysis proposes solutions through version control, analysis orchestration, and feedback mechanisms, enhancing the reliability of scientific results. By adopting CA, the scientific community can ensure the validity and generalizability of research outcomes, fostering transparency and collaboration and ultimately advancing the field.
Submitted 4 November, 2024;
originally announced November 2024.
-
Adaptive Data Optimization: Dynamic Sample Selection with Scaling Laws
Authors:
Yiding Jiang,
Allan Zhou,
Zhili Feng,
Sadhika Malladi,
J. Zico Kolter
Abstract:
The composition of pretraining data is a key determinant of foundation models' performance, but there is no standard guideline for allocating a limited computational budget across different data sources. Most current approaches either rely on extensive experiments with smaller models or dynamic data adjustments that also require proxy models, both of which significantly increase the workflow complexity and computational overhead. In this paper, we introduce Adaptive Data Optimization (ADO), an algorithm that optimizes data distributions in an online fashion, concurrent with model training. Unlike existing techniques, ADO does not require external knowledge, proxy models, or modifications to the model update. Instead, ADO uses per-domain scaling laws to estimate the learning potential of each domain during training and adjusts the data mixture accordingly, making it more scalable and easier to integrate. Experiments demonstrate that ADO can achieve comparable or better performance than prior methods while maintaining computational efficiency across different computation scales, offering a practical solution for dynamically adjusting data distribution without sacrificing flexibility or increasing costs. Beyond its practical benefits, ADO also provides a new perspective on data collection strategies via scaling laws.
Submitted 15 October, 2024;
originally announced October 2024.
-
Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization
Authors:
Noam Razin,
Sadhika Malladi,
Adithya Bhaskar,
Danqi Chen,
Sanjeev Arora,
Boris Hanin
Abstract:
Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Although these methods are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on the causes and implications of this counter-intuitive phenomenon, which we term likelihood displacement. We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning. As a simple example, training a model to prefer $\texttt{No}$ over $\texttt{Never}$ can sharply increase the probability of $\texttt{Yes}$. Moreover, when aligning the model to refuse unsafe prompts, we show that such displacement can unintentionally lead to unalignment, by shifting probability mass from preferred refusal responses to harmful responses (e.g., reducing the refusal rate of Llama-3-8B-Instruct from 74.4% to 33.4%). We theoretically characterize that likelihood displacement is driven by preferences that induce similar embeddings, as measured by a centered hidden embedding similarity (CHES) score. Empirically, the CHES score enables identifying which training samples contribute most to likelihood displacement in a given dataset. Filtering out these samples effectively mitigated unintentional unalignment in our experiments. More broadly, our results highlight the importance of curating data with sufficiently distinct preferences, for which we believe the CHES score may prove valuable.
Submitted 27 April, 2025; v1 submitted 11 October, 2024;
originally announced October 2024.
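The likelihood displacement described in the abstract above follows from the shape of the DPO objective, which depends only on the margin between preferred and dispreferred log-probabilities relative to a reference model. The sketch below uses the standard per-example DPO loss (which the abstract does not restate) with made-up log-probabilities; the numbers are purely illustrative and are not results from the paper.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard per-example DPO loss: -log sigmoid(beta * margin).

    logp_w / logp_l are policy log-probabilities of the preferred and
    dispreferred responses; ref_* are the reference model's values.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) == log(1 + exp(-m))
    return math.log1p(math.exp(-margin))


# Illustrative (invented) log-probabilities.
ref_w, ref_l = -2.0, -3.0

# Before training the policy equals the reference, so the margin is 0.
loss_before = dpo_loss(ref_w, ref_l, ref_w, ref_l)

# After training, BOTH log-probabilities dropped, but the dispreferred
# one dropped more, so the loss still improved.
new_w, new_l = -4.0, -8.0
loss_after = dpo_loss(new_w, new_l, ref_w, ref_l)
```

Here the loss decreases even though the preferred response became less likely (`new_w < ref_w`). The freed probability mass has to go somewhere, and the abstract's point is that it can land on responses with the opposite meaning.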
-
Progressive distillation induces an implicit curriculum
Authors:
Abhishek Panigrahi,
Bingbin Liu,
Sadhika Malladi,
Andrej Risteski,
Surbhi Goel
Abstract:
Knowledge distillation leverages a teacher model to improve the training of a student model. A persistent challenge is that a better teacher does not always yield a better student, to which a common mitigation is to use additional supervision from several "intermediate" teachers. One empirically validated variant of this principle is progressive distillation, where the student learns from successive intermediate checkpoints of the teacher. Using sparse parity as a sandbox, we identify an implicit curriculum as one mechanism through which progressive distillation accelerates the student's learning. This curriculum is available only through the intermediate checkpoints but not the final converged one, and imparts both empirical acceleration and a provable sample complexity benefit to the student. We then extend our investigation to Transformers trained on probabilistic context-free grammars (PCFGs) and real-world pre-training datasets (Wikipedia and Books). Through probing the teacher model, we identify an analogous implicit curriculum where the model progressively learns features that capture longer context. Our theoretical and empirical findings on sparse parity, complemented by empirical observations on more complex tasks, highlight the benefit of progressive distillation via implicit curriculum across setups.
Submitted 7 October, 2024;
originally announced October 2024.
-
MUSE: Machine Unlearning Six-Way Evaluation for Language Models
Authors:
Weijia Shi,
Jaechan Lee,
Yangsibo Huang,
Sadhika Malladi,
Jieyu Zhao,
Ari Holtzman,
Daogao Liu,
Luke Zettlemoyer,
Noah A. Smith,
Chiyuan Zhang
Abstract:
Language models (LMs) are trained on vast amounts of text data, which may include private and copyrighted content. Data owners may request the removal of their data from a trained model due to privacy or copyright concerns. However, exactly unlearning only these datapoints (i.e., retraining with the data removed) is intractable in modern-day models. This has led to the development of many approximate unlearning algorithms. The evaluation of the efficacy of these algorithms has traditionally been narrow in scope, failing to precisely quantify the success and practicality of the algorithm from the perspectives of both the model deployers and the data owners. We address this issue by proposing MUSE, a comprehensive machine unlearning evaluation benchmark that enumerates six diverse desirable properties for unlearned models: (1) no verbatim memorization, (2) no knowledge memorization, (3) no privacy leakage, (4) utility preservation on data not intended for removal, (5) scalability with respect to the size of removal requests, and (6) sustainability over sequential unlearning requests. Using these criteria, we benchmark how effectively eight popular unlearning algorithms on 7B-parameter LMs can unlearn Harry Potter books and news articles. Our results demonstrate that most algorithms can prevent verbatim memorization and knowledge memorization to varying degrees, but only one algorithm does not lead to severe privacy leakage. Furthermore, existing algorithms fail to meet deployers' expectations because they often degrade general model utility and also cannot sustainably accommodate successive unlearning requests or large-scale content removal. Our findings identify key issues with the practicality of existing unlearning algorithms on language models, and we release our benchmark to facilitate further evaluations: muse-bench.github.io
Submitted 14 July, 2024; v1 submitted 8 July, 2024;
originally announced July 2024.
-
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
Authors:
Zirui Wang,
Mengzhou Xia,
Luxi He,
Howard Chen,
Yitao Liu,
Richard Zhu,
Kaiqu Liang,
Xindi Wu,
Haotian Liu,
Sadhika Malladi,
Alexis Chevalier,
Sanjeev Arora,
Danqi Chen
Abstract:
Chart understanding plays a pivotal role when applying Multimodal Large Language Models (MLLMs) to real-world tasks such as analyzing scientific papers or financial reports. However, existing datasets often focus on oversimplified and homogeneous charts with template-based questions, leading to an over-optimistic measure of progress. We demonstrate that although open-source models can appear to outperform strong proprietary models on these benchmarks, a simple stress test with slightly different charts or questions can deteriorate performance by up to 34.5%. In this work, we propose CharXiv, a comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from arXiv papers. CharXiv includes two types of questions: 1) descriptive questions about examining basic chart elements and 2) reasoning questions that require synthesizing information across complex visual elements in the chart. To ensure quality, all charts and questions are handpicked, curated, and verified by human experts. Our results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model (i.e., GPT-4o), which achieves 47.1% accuracy, and the strongest open-source model (i.e., InternVL Chat V1.5), which achieves 29.2%. All models lag far behind human performance of 80.5%, underscoring weaknesses in the chart understanding capabilities of existing MLLMs. We hope CharXiv facilitates future research on MLLM chart understanding by providing a more realistic and faithful measure of progress. Project page and leaderboard: https://charxiv.github.io/
Submitted 26 June, 2024;
originally announced June 2024.
-
Preference Learning Algorithms Do Not Learn Preference Rankings
Authors:
Angelica Chen,
Sadhika Malladi,
Lily H. Zhang,
Xinyi Chen,
Qiuyi Zhang,
Rajesh Ranganath,
Kyunghyun Cho
Abstract:
Preference learning algorithms (e.g., RLHF and DPO) are frequently used to steer LLMs to produce generations that are more preferred by humans, but our understanding of their inner workings is still limited. In this work, we study the conventional wisdom that preference learning trains models to assign higher likelihoods to more preferred outputs than less preferred outputs, measured via ranking accuracy. Surprisingly, we find that most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets. We furthermore derive the idealized ranking accuracy that a preference-tuned LLM would achieve if it optimized the DPO or RLHF objective perfectly. We demonstrate that existing models exhibit a significant alignment gap -- i.e., a gap between the observed and idealized ranking accuracies. We attribute this discrepancy to the DPO objective, which is empirically and theoretically ill-suited to fix even mild ranking errors in the reference model, and derive a simple and efficient formula for quantifying the difficulty of learning a given preference datapoint. Finally, we demonstrate that ranking accuracy strongly correlates with the empirically popular win rate metric when the model is close to the reference model used in the objective, shedding further light on the differences between on-policy (e.g., RLHF) and off-policy (e.g., DPO) preference learning algorithms.
Submitted 31 October, 2024; v1 submitted 29 May, 2024;
originally announced May 2024.
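The ranking accuracy metric discussed in the abstract above, read as the fraction of preference pairs for which the model assigns a higher likelihood to the preferred output, can be computed along these lines. The function name and the example log-probabilities are hypothetical, and the paper may handle details such as ties or length normalization differently.

```python
def ranking_accuracy(pairs):
    """Fraction of (preferred, dispreferred) log-prob pairs ranked correctly.

    `pairs` is an iterable of (logp_preferred, logp_dispreferred) tuples.
    Ties are counted as incorrect in this sketch (an assumption).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(1 for logp_w, logp_l in pairs if logp_w > logp_l)
    return correct / len(pairs)


# Invented sequence log-probabilities for four preference pairs.
example_pairs = [(-12.0, -15.5), (-9.1, -8.7), (-20.3, -20.9), (-5.0, -4.2)]
accuracy = ranking_accuracy(example_pairs)
```

In this toy input two of the four pairs are ranked correctly, so `accuracy` is 0.5, in the same range as the sub-60% figures the abstract reports for preference-tuned models.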
-
The GA4GH Task Execution API: Enabling Easy Multi Cloud Task Execution
Authors:
Alexander Kanitz,
Matthew H. McLoughlin,
Liam Beckman,
Venkat S. Malladi,
Kyle P. Ellrott
Abstract:
The Global Alliance for Genomics and Health (GA4GH) Task Execution Service (TES) API is a standardized schema and API for describing and executing batch execution tasks. It provides a common way to submit and manage tasks to a variety of compute environments, including on-premises High Performance Computing and High Throughput Computing (HPC/HTC) systems, Cloud computing platforms, and hybrid environments. The TES API is designed to be flexible and extensible, allowing it to be adapted to a wide range of use cases, such as "bringing compute to the data" solutions for federated and distributed data analysis or load balancing across multi-cloud infrastructures. This API has been adopted by a number of different service providers and utilized by several workflow engines. Using its capabilities, genome research institutes are building hybrid compute systems to study life science.
Submitted 8 February, 2024;
originally announced May 2024.
-
LESS: Selecting Influential Data for Targeted Instruction Tuning
Authors:
Mengzhou Xia,
Sadhika Malladi,
Suchin Gururangan,
Sanjeev Arora,
Danqi Chen
Abstract:
Instruction tuning has unlocked powerful capabilities in large language models (LLMs), effectively using combined datasets to develop general-purpose chatbots. However, real-world applications often require a specialized suite of skills (e.g., reasoning). The challenge lies in identifying the most relevant data from these extensive datasets to effectively develop specific capabilities, a setting we frame as targeted instruction tuning. We propose LESS, an optimizer-aware and practically efficient algorithm to effectively estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection. Crucially, LESS adapts existing influence formulations to work with the Adam optimizer and variable-length instruction data. LESS first constructs a highly reusable and transferable gradient datastore with low-dimensional gradient features and then selects examples based on their similarity to few-shot examples embodying a specific capability. Experiments show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. Furthermore, the selected data is highly transferable: smaller models can be leveraged to select useful data for larger models and models from different families. Our qualitative analysis shows that our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
Submitted 12 June, 2024; v1 submitted 6 February, 2024;
originally announced February 2024.
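The selection step described in the LESS abstract above (score each training example by the similarity of its low-dimensional gradient features to those of a few target examples, then keep the top fraction) might look roughly like this. It is a simplified sketch, not the LESS algorithm itself: it assumes the gradient features are already computed, ignores the Adam-aware influence formulation, and aggregates over target examples with a maximum, which is an assumption made here. All names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_top_fraction(train_feats, target_feats, fraction=0.05):
    """Return indices of the training examples most similar to the targets."""
    # Score each training example by its best match among the target examples.
    scores = [max(cosine(g, t) for t in target_feats) for g in train_feats]
    k = max(1, int(len(train_feats) * fraction))
    ranked = sorted(range(len(train_feats)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy 2-D "gradient features"; real features would be much higher-dimensional.
train_feats = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
target_feats = [[1.0, 0.0]]
selected = select_top_fraction(train_feats, target_feats, fraction=0.5)
```

With the toy inputs, `selected` contains the two examples whose features point in nearly the same direction as the target. The default `fraction=0.05` mirrors the 5% budget mentioned in the abstract.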
-
Feel the Breeze: Promoting Relaxation in Virtual Reality using Mid-Air Haptics
Authors:
Naga Sai Surya Vamsy Malladi,
Viktorija Paneva,
Jörg Müller
Abstract:
Mid-air haptic interfaces employ focused ultrasound waves to generate touchless haptic sensations on the skin. Prior studies have demonstrated the potential positive impact of mid-air haptic feedback on virtual experiences, enhancing aspects such as enjoyment, immersion, and sense of agency. As a highly immersive environment, Virtual Reality (VR) is being explored as a tool for stress management and relaxation in current research. However, the impact of incorporating mid-air haptic stimuli into relaxing experiences in VR has not been studied thus far. In this paper, for the first time, we design a mid-air haptic stimulation that is congruent with a relaxing scene in VR, and conduct a user study investigating the effectiveness of this experience. Our user study encompasses three different conditions: a control group with no relaxation intervention, a VR-only relaxation experience, and a VR+Haptics relaxation experience that includes the mid-air haptic feedback. While we did not find any significant differences between the conditions, a trend suggesting that the VR+Haptics condition might be associated with greater pleasure emerged, requiring further validation with a larger sample size. These initial findings set the foundation for future investigations into leveraging multimodal interventions in VR, utilising mid-air haptics to potentially enhance relaxation experiences.
Submitted 18 August, 2023;
originally announced August 2023.
-
The Marginal Value of Momentum for Small Learning Rate SGD
Authors:
Runzhe Wang,
Sadhika Malladi,
Tianhao Wang,
Kaifeng Lyu,
Zhiyuan Li
Abstract:
Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without stochastic gradient noise. In stochastic optimization, such as training neural networks, folklore suggests that momentum may help deep learning optimization by reducing the variance of the stochastic gradient update, but previous theoretical analyses do not find momentum to offer any provable acceleration. Theoretical results in this paper clarify the role of momentum in stochastic settings where the learning rate is small and gradient noise is the dominant source of instability, suggesting that SGD with and without momentum behave similarly in the short and long time horizons. Experiments show that momentum indeed has limited benefits for both optimization and generalization in practical training regimes where the optimal learning rate is not very large, including small- to medium-batch training from scratch on ImageNet and fine-tuning language models on downstream tasks.
Submitted 15 April, 2024; v1 submitted 27 July, 2023;
originally announced July 2023.
-
Trainable Transformer in Transformer
Authors:
Abhishek Panigrahi,
Sadhika Malladi,
Mengzhou Xia,
Sanjeev Arora
Abstract:
Recent works attribute the capability of in-context learning (ICL) in large pre-trained language models to implicitly simulating and fine-tuning an internal model (e.g., linear or 2-layer MLP) during inference. However, such constructions require large memory overhead, which makes simulation of more sophisticated internal models intractable. In this work, we propose an efficient construction, Transformer in Transformer (in short, TinT), that allows a transformer to simulate and fine-tune complex models internally during inference (e.g., pre-trained language models). In particular, we introduce innovative approximation techniques that allow a TinT model with less than 2 billion parameters to simulate and fine-tune a 125 million parameter transformer model within a single forward pass. TinT accommodates many common transformer variants and its design ideas also improve the efficiency of past instantiations of simple models inside transformers. We conduct end-to-end experiments to validate the internal fine-tuning procedure of TinT on various language modeling and downstream tasks. For example, even with a limited one-step budget, we observe that TinT for an OPT-125M model improves performance by 4-16% absolute on average compared to OPT-125M. These findings suggest that large pre-trained language models are capable of performing intricate subroutines. To facilitate further work, a modular and extensible codebase for TinT is included.
Submitted 8 February, 2024; v1 submitted 3 July, 2023;
originally announced July 2023.
-
Fine-Tuning Language Models with Just Forward Passes
Authors:
Sadhika Malladi,
Tianyu Gao,
Eshaan Nichani,
Alex Damian,
Jason D. Lee,
Danqi Chen,
Sanjeev Arora
Abstract:
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zeroth-order optimizer (MeZO), adapting the classical ZO-SGD method to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference. For example, with a single A100 80GB GPU, MeZO can train a 30-billion parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and generation). Our results demonstrate that (1) MeZO significantly outperforms in-context learning and linear probing; (2) MeZO achieves comparable performance to fine-tuning with backpropagation across multiple tasks, with up to 12x memory reduction and up to 2x GPU-hour reduction in our implementation; (3) MeZO is compatible with both full-parameter and parameter-efficient tuning techniques such as LoRA and prefix tuning; (4) MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1). We support our empirical findings with theoretical insights, highlighting how adequate pre-training and task prompts enable MeZO to fine-tune huge models, despite classical ZO analyses suggesting otherwise.
Submitted 11 January, 2024; v1 submitted 26 May, 2023;
originally announced May 2023.
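The two-forward-pass gradient estimate described in the abstract above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `zo_sgd_step`, its parameters, and the trick of regenerating the perturbation from a seed are assumptions chosen to show how a zeroth-order SGD step can run in place without storing a gradient or a copy of the perturbation.

```python
import numpy as np

def zo_sgd_step(params, loss_fn, lr=1e-2, eps=1e-3, seed=0):
    """One in-place zeroth-order SGD step (illustrative sketch, hypothetical API).

    params  : list of float NumPy arrays, modified in place
    loss_fn : callable taking the list of arrays and returning a scalar loss
    """
    def perturb(scale):
        # Re-create the same Gaussian direction z from the seed each time,
        # so z never has to be kept in memory alongside the parameters.
        rng = np.random.default_rng(seed)
        for p in params:
            p += scale * eps * rng.standard_normal(p.shape)

    perturb(+1.0)                      # params are now theta + eps * z
    loss_plus = loss_fn(params)        # first forward pass
    perturb(-2.0)                      # params are now theta - eps * z
    loss_minus = loss_fn(params)       # second forward pass
    perturb(+1.0)                      # restore theta (up to rounding error)

    # Scalar finite-difference estimate of the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)

    # SGD update along z, again regenerating z from the seed.
    rng = np.random.default_rng(seed)
    for p in params:
        p -= lr * projected_grad * rng.standard_normal(p.shape)
    return projected_grad
```

Using a fresh seed per step (for example the step index) gives a new random direction each time while keeping the extra memory to a single scalar, which is the sense in which such a step can match the memory footprint of inference.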
-
A Kernel-Based View of Language Model Fine-Tuning
Authors:
Sadhika Malladi,
Alexander Wettig,
Dingli Yu,
Danqi Chen,
Sanjeev Arora
Abstract:
It has become standard to solve NLP tasks by fine-tuning pre-trained language models (LMs), especially in low-data settings. There is minimal theoretical understanding of empirical success, e.g., why fine-tuning a model with $10^8$ or more parameters on a couple dozen training points does not result in overfitting. We investigate whether the Neural Tangent Kernel (NTK) - which originated as a model to study the gradient descent dynamics of infinitely wide networks with suitable random initialization - describes fine-tuning of pre-trained LMs. This study was inspired by the decent performance of NTK for computer vision tasks (Wei et al., 2022). We extend the NTK formalism to Adam and use Tensor Programs (Yang, 2020) to characterize conditions under which the NTK lens may describe fine-tuning updates to pre-trained language models. Extensive experiments on 14 NLP tasks validate our theory and show that formulating the downstream task as a masked word prediction problem through prompting often induces kernel-based dynamics during fine-tuning. Finally, we use this kernel view to propose an explanation for the success of parameter-efficient subspace-based fine-tuning methods.
Submitted 6 June, 2023; v1 submitted 11 October, 2022;
originally announced October 2022.
-
On the SDEs and Scaling Rules for Adaptive Gradient Algorithms
Authors:
Sadhika Malladi,
Kaifeng Lyu,
Abhishek Panigrahi,
Sanjeev Arora
Abstract:
Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD. Analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there were no rigorously proven SDE approximations for these methods. This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of their correctness as well as experimental validation of their applicability to common large-scale vision and language settings. A key practical result is the derivation of a $\textit{square root scaling rule}$ to adjust the optimization hyperparameters of RMSprop and Adam when changing batch size, and its empirical validation in deep learning settings.
Submitted 31 October, 2024; v1 submitted 20 May, 2022;
originally announced May 2022.
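As a rough illustration of what a square root scaling rule looks like in practice, the sketch below rescales only the learning rate by the square root of the batch-size ratio. The helper name `sqrt_scaled_lr` is hypothetical, and the abstract speaks of adjusting "the optimization hyperparameters" in the plural, so the rule derived in the paper presumably covers more than the learning rate; those further adjustments are not reproduced here.

```python
import math

def sqrt_scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Scale a learning rate by sqrt(new_batch / base_batch).

    Illustrative only: assumes the learning rate grows with the square root
    of the batch-size ratio, in contrast to the linear rule used for SGD.
    """
    if base_batch <= 0 or new_batch <= 0:
        raise ValueError("batch sizes must be positive")
    kappa = new_batch / base_batch      # batch-size multiplier
    return base_lr * math.sqrt(kappa)
```

For instance, quadrupling the batch size doubles the learning rate under this sketch, where a linear rule would quadruple it.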
-
Deriving Explanation of Deep Visual Saliency Models
Authors:
Sai Phani Kumar Malladi,
Jayanta Mukhopadhyay,
Chaker Larabi,
Santanu Chaudhury
Abstract:
Deep neural networks have had a profound impact on achieving human-level performance in visual saliency prediction. However, it is still unclear how they learn the task and what this means for understanding the human visual system. In this work, we develop a technique to derive explainable saliency models from their corresponding deep neural architecture based saliency models by applying human perception theories and the conventional concepts of saliency. This technique helps us understand the learning pattern of the deep network at its intermediate layers through their activation maps. Initially, we consider two state-of-the-art deep saliency models, namely UNISAL and MSI-Net, for our interpretation. We use a set of biologically plausible log-Gabor filters to identify and reconstruct their activation maps using our explainable saliency model. The final saliency map is generated using these reconstructed activation maps. We also build our own deep saliency model, named cross-concatenated multi-scale residual block based network (CMRNet), for saliency prediction. Then, we evaluate and compare the performance of the explainable models derived from UNISAL, MSI-Net and CMRNet on three benchmark datasets with other state-of-the-art methods. Hence, we propose that this approach to explainability is generic and can be applied to any deep visual saliency model for interpretation.
Submitted 8 September, 2021;
originally announced September 2021.
-
On the Validity of Modeling SGD with Stochastic Differential Equations (SDEs)
Authors:
Zhiyuan Li,
Sadhika Malladi,
Sanjeev Arora
Abstract:
It is generally recognized that finite learning rate (LR), in contrast to infinitesimal LR, is important for good generalization in real-life deep nets. Most attempted explanations propose approximating finite-LR SGD with Ito Stochastic Differential Equations (SDEs), but formal justification for this approximation (e.g., (Li et al., 2019)) only applies to SGD with tiny LR. Experimental verification of the approximation appears computationally infeasible. The current paper clarifies the picture with the following contributions: (a) An efficient simulation algorithm SVAG that provably converges to the conventionally used Ito SDE approximation. (b) A theoretically motivated testable necessary condition for the SDE approximation and its most famous implication, the linear scaling rule (Goyal et al., 2017), to hold. (c) Experiments using this simulation to demonstrate that the previously proposed SDE approximation can meaningfully capture the training and generalization properties of common deep nets.
Submitted 16 June, 2021; v1 submitted 24 February, 2021;
originally announced February 2021.
-
A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks
Authors:
Nikunj Saunshi,
Sadhika Malladi,
Sanjeev Arora
Abstract:
Autoregressive language models, pretrained using large text corpora to do well on next word prediction, have been successful at solving many downstream tasks, even with zero-shot usage. However, there is little theoretical understanding of this success. This paper initiates a mathematical study of this phenomenon for the downstream task of text classification by considering the following questions: (1) What is the intuitive connection between the pretraining task of next word prediction and text classification? (2) How can we mathematically formalize this connection and quantify the benefit of language modeling? For (1), we hypothesize, and verify empirically, that classification tasks of interest can be reformulated as sentence completion tasks, thus making language modeling a meaningful pretraining task. With a mathematical formalization of this hypothesis, we make progress towards (2) and show that language models that are $\epsilon$-optimal in cross-entropy (log-perplexity) learn features that can linearly solve such classification tasks with $\mathcal{O}(\sqrt{\epsilon})$ error, thus demonstrating that doing well on language modeling can be beneficial for downstream tasks. We experimentally verify various assumptions and theoretical findings, and also use insights from the analysis to design a new objective function that performs well on some classification tasks.
Submitted 14 April, 2021; v1 submitted 7 October, 2020;
originally announced October 2020.
-
Estimating Dispersion Curves from Frequency Response Functions via Vector-Fitting
Authors:
Mohammad I. Albakri,
Vijaya V. N. Sriram Malladi,
Serkan Gugercin,
Pablo A. Tarazaga
Abstract:
Driven by the need for describing and understanding wave propagation in structural materials and components, several analytical, numerical, and experimental techniques have been developed to obtain dispersion curves. Accurate characterization of the structure (waveguide) under test is needed for analytical and numerical approaches. Experimental approaches, on the other hand, rely on analyzing waveforms as they propagate along the structure. Material inhomogeneity, reflections from boundaries, and the physical dimensions of the structure under test limit the frequency range over which dispersion curves can be measured. In this work, a new data-driven modeling approach for estimating dispersion curves is developed. This approach utilizes the relatively easy-to-measure, steady-state Frequency Response Functions (FRFs) to develop a state-space dynamical model of the structure under test. The developed model is then used to study the transient response of the structure and estimate its dispersion curves. This paper lays down the foundation of this approach and demonstrates its capabilities on a one-dimensional homogeneous beam using numerically calculated FRFs. Both in-plane and out-of-plane FRFs corresponding, respectively, to longitudinal (the first symmetric) and flexural (the first anti-symmetric) wave modes are analyzed. The effects of boundary conditions on the performance of this approach are also addressed.
Submitted 27 December, 2019;
originally announced December 2019.
-
Learning through the Grapevine: The Impact of Noise and the Breadth and Depth of Social Networks
Authors:
Matthew O. Jackson,
Suraj Malladi,
David McAdams
Abstract:
We examine how well people learn when information is noisily relayed from person to person; and we study how communication platforms can improve learning without censoring or fact-checking messages. We analyze learning as a function of social network depth (how many times information is relayed) and breadth (the number of relay chains accessed). Noise builds up as depth increases, so learning requires greater breadth. In the presence of mutations (deliberate or random) and transmission failures of messages, we characterize sharp thresholds for breadths above which receivers learn fully and below which they learn nothing. When there is uncertainty about mutation rates, optimizing learning requires either capping depth, or if that is not possible, limiting breadth by capping the number of people to whom someone can forward a message. Limiting breadth cuts the number of messages received but also decreases the fraction originating further from the receiver, and so can increase the signal to noise ratio. Finally, we extend our model to study learning from message survival: e.g., people are more likely to pass messages with one conclusion than another. We find that as depth grows, all learning comes from either the total number of messages received or from the content of received messages, but the learner does not need to pay attention to both.
Submitted 26 June, 2020; v1 submitted 8 December, 2018;
originally announced December 2018.
-
How to prevent type-flaw and multi-protocol attacks on cryptographic protocols under Exclusive-OR
Authors:
Sreekanth Malladi
Abstract:
Type-flaw attacks and multi-protocol attacks on security protocols have been frequently reported in the literature. Heather et al. and Guttman et al. have proven that these could be prevented by tagging encrypted components with distinct constants in a standard protocol model with free message algebra and perfect encryption. However, most "real-world" protocols such as SSL 3.0 are designed with the Exclusive-OR (XOR) operator that possesses algebraic properties, breaking the free algebra assumption. These algebraic properties induce equational theories that need to be considered when analyzing protocols that use the operator. This is the problem we consider in this paper: We prove that, under certain assumptions, tagging encrypted components still prevents type-flaw and multi-protocol attacks even in the presence of the XOR operator and its algebraic properties.
Submitted 19 June, 2010; v1 submitted 14 April, 2010;
originally announced April 2010.
-
Disabling equational theories in unification for cryptographic protocol analysis through tagging
Authors:
Sreekanth Malladi
Abstract:
In this paper, we show a new tagging scheme for cryptographic protocol messages. Under this tagging, equational theories of operators such as exclusive-or and binary addition are effectively disabled when terms are unified. We believe that this result has a significant impact on protocol analysis and security, since unification is at the heart of symbolic protocol analysis. Hence, disabling equational theories in unification implies disabling them altogether in protocol analysis for most operators and theories.
Submitted 9 April, 2010; v1 submitted 28 March, 2010;
originally announced March 2010.
-
How to prevent type-flaw attacks on security protocols under algebraic properties
Authors:
Sreekanth Malladi,
Pascal Lafourcade
Abstract:
Type-flaw attacks upon security protocols, wherein agents are led to misinterpret message types, have been reported frequently in the literature. Preventing them is crucial for protocol security and verification. Heather et al. proved that tagging every message field with its type prevents all type-flaw attacks under a free message algebra and perfect encryption system. In this paper, we prove that type-flaw attacks can be prevented with the same technique even under the ACUN algebraic properties of XOR, which is commonly used in "real-world" protocols such as SSL 3.0. Our proof method is general and can be easily extended to other monoidal operators that possess properties such as Inverse and Idempotence as well. We also discuss how tagging could be used to prevent type-flaw attacks under other properties such as associativity of pairing, commutative encryption, prefix property and homomorphic encryption.
Submitted 28 March, 2010;
originally announced March 2010.
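The ACUN properties of XOR mentioned in the abstract above (associativity, commutativity, a unit element, and nilpotence) are what break the free-algebra assumption: syntactically different terms can denote the same bit string. The following sketch checks these properties on byte strings; the helper names `xor_bytes` and `acun_holds` are hypothetical and are not taken from the paper.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))

def acun_holds(a: bytes, b: bytes, c: bytes) -> bool:
    """Check the four ACUN equations on concrete equal-length operands."""
    zero = bytes(len(a))  # the all-zero string acts as the unit element
    associative = xor_bytes(xor_bytes(a, b), c) == xor_bytes(a, xor_bytes(b, c))
    commutative = xor_bytes(a, b) == xor_bytes(b, a)
    unit = xor_bytes(a, zero) == a
    nilpotent = xor_bytes(a, a) == zero
    return associative and commutative and unit and nilpotent
```

A consequence is cancellation: `xor_bytes(xor_bytes(a, b), b)` equals `a`, so the compound term (a XOR b) XOR b is indistinguishable from the atomic term a, which is the kind of equality a free message algebra rules out.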
-
Protocol independence through disjoint encryption under Exclusive-OR
Authors:
Sreekanth Malladi
Abstract:
Multi-protocol attacks due to protocol interaction have been a notorious problem for security. Guttman and Thayer proved that they can be prevented by ensuring that encrypted messages are distinguishable across protocols, under a free algebra. In this paper, we prove that a similar suggestion prevents these attacks under commonly used operators such as Exclusive-OR that induce equational theories, breaking the free algebra assumption.
Submitted 9 May, 2010; v1 submitted 28 March, 2010;
originally announced March 2010.
-
Automatic analysis of distance bounding protocols
Authors:
Sreekanth Malladi,
Bezawada Bruhadeshwar,
Kishore Kothapalli
Abstract:
Distance bounding protocols are used by nodes in wireless networks to calculate upper bounds on their distances to other nodes. However, dishonest nodes in the network can render the calculations both illegitimate and inaccurate when they participate in protocol executions. It is important to analyze protocols for the possibility of such violations. Past efforts to analyze distance bounding protocols have only been manual. However, automated approaches are important since they are quite likely to find flaws that manual approaches cannot, as witnessed in the literature on the analysis of key establishment protocols. In this paper, we use the constraint solver tool to automatically analyze distance bounding protocols. We first formulate a new trace property called Secure Distance Bounding (SDB) that protocol executions must satisfy. We then classify the scenarios in which these protocols can operate considering the (dis)honesty of nodes and location of the attacker in the network. Finally, we extend the constraint solver so that it can be used to test protocols for violations of SDB in these scenarios and illustrate our technique on some published protocols.
Submitted 28 March, 2010;
originally announced March 2010.
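The abstract above does not say how the upper bound on distance is computed. A common assumption for distance bounding in general is a time-of-flight argument: a signal cannot travel faster than light, so a measured challenge-response round-trip time caps the distance. The sketch below encodes only that assumption; the function name `distance_upper_bound` is hypothetical.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0

def distance_upper_bound(round_trip_s: float) -> float:
    """Upper bound (in metres) on the distance to a responder.

    Assumes the challenge and response each cover the distance once at no
    more than the speed of light, so distance <= c * round_trip / 2.
    Responder processing time is deliberately not subtracted: ignoring it
    can only loosen the bound, never make it invalid.
    """
    if round_trip_s < 0:
        raise ValueError("round-trip time must be non-negative")
    return SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2.0
```

Under this assumption a responder that delays its reply can only appear farther away, never closer, which is why such a computation yields an upper bound rather than an estimate.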