-
Imaging at the quantum limit with convolutional neural networks
Authors:
Andrew H. Proppe,
Aaron Z. Goldberg,
Guillaume Thekkadath,
Noah Lupu-Gladstein,
Kyle M. Jordan,
Philip J. Bustard,
Frédéric Bouchard,
Duncan England,
Khabat Heshami,
Jeff S. Lundeen,
Benjamin J. Sussman
Abstract:
Deep neural networks have been shown to achieve exceptional performance for computer vision tasks like image recognition, segmentation, and reconstruction or denoising. Here, we evaluate the ultimate performance limits of deep convolutional neural network models for image reconstruction, by comparing them against the standard quantum limit set by shot-noise and the Heisenberg limit on precision. We train U-Net models on images of natural objects illuminated with coherent states of light, and find that the average mean-squared error of the reconstructions can surpass the standard quantum limit, and in some cases reaches the Heisenberg limit. Further, we train models on well-parameterized images for which we can calculate the quantum Cramér-Rao bound to determine the minimum possible measurable variance of an estimated parameter for a given probe state. We find the mean-squared error of the model predictions reaches these bounds calculated for the parameters, across a variety of parameterized images. These results suggest that deep convolutional neural networks can learn to become the optimal estimators allowed by the laws of physics, performing parameter estimation and image reconstruction at the ultimate possible limits of precision for the case of classical illumination of the object.
Submitted 16 June, 2025;
originally announced June 2025.
-
Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers
Authors:
Yixiao Huang,
Hanlin Zhu,
Tianyu Guo,
Jiantao Jiao,
Somayeh Sojoudi,
Michael I. Jordan,
Stuart Russell,
Song Mei
Abstract:
Large language models (LLMs) can acquire new knowledge through fine-tuning, but this process exhibits a puzzling duality: models can generalize remarkably from new facts, yet are also prone to hallucinating incorrect information. However, the reasons for this phenomenon remain poorly understood. In this work, we argue that both behaviors stem from a single mechanism known as out-of-context reasoning (OCR): the ability to deduce implications by associating concepts, even those without a causal link. Our experiments across five prominent LLMs confirm that OCR indeed drives both generalization and hallucination, depending on whether the associated concepts are causally related. To build a rigorous theoretical understanding of this phenomenon, we then formalize OCR as a synthetic factual recall task. We empirically show that a one-layer single-head attention-only transformer with factorized output and value matrices can learn to solve this task, while a model with combined weights cannot, highlighting the crucial role of matrix factorization. Our theoretical analysis shows that the OCR capability can be attributed to the implicit bias of gradient descent, which favors solutions that minimize the nuclear norm of the combined output-value matrix. This mathematical structure explains why the model learns to associate facts and implications with high sample efficiency, regardless of whether the correlation is causal or merely spurious. Ultimately, our work provides a theoretical foundation for understanding the OCR phenomenon, offering a new lens for analyzing and mitigating undesirable behaviors from knowledge injection.
Submitted 12 June, 2025;
originally announced June 2025.
-
Revisiting mean estimation over $\ell_p$ balls: Is the MLE optimal?
Authors:
Liviu Aolaritei,
Michael I. Jordan,
Reese Pathak,
Annie Ulichney
Abstract:
We revisit the problem of mean estimation on $\ell_p$ balls under additive Gaussian noise. When $p$ is strictly less than $2$, it is well understood that rate-optimal estimators must be nonlinear in the observations. In this work, we study the maximum likelihood estimator (MLE), which may be viewed as a nonlinear shrinkage procedure for mean estimation over $\ell_p$ balls. We demonstrate two phenomena for the behavior of the MLE, which depend on the noise level, the radius of the norm constraint, the dimension, and the norm index $p$. First, as a function of the dimension, for $p$ near $1$ or at least $2$, the MLE is minimax rate-optimal for all noise levels and all constraint radii. On the other hand, for $p$ between $1$ and $2$, there is a more striking behavior: for essentially all noise levels and radii for which nonlinear estimates are required, the MLE is minimax rate-suboptimal, despite being nonlinear in the observations. Our results also imply similar conclusions when given $n$ independent and identically distributed Gaussian samples, where we demonstrate that the MLE can be suboptimal by a polynomial factor in the sample size. Our lower bounds are constructive: whenever the MLE is rate-suboptimal, we provide explicit instances on which the MLE provably incurs suboptimal risk.
Submitted 12 June, 2025;
originally announced June 2025.
-
Sample Complexity and Representation Ability of Test-time Scaling Paradigms
Authors:
Baihe Huang,
Shanda Li,
Tianhao Wu,
Yiming Yang,
Ameet Talwalkar,
Kannan Ramchandran,
Michael I. Jordan,
Jiantao Jiao
Abstract:
Test-time scaling paradigms have significantly advanced the capabilities of large language models (LLMs) on complex tasks. Despite their empirical success, theoretical understanding of the sample efficiency of various test-time strategies -- such as self-consistency, best-of-$n$, and self-correction -- remains limited. In this work, we first establish a separation result between two repeated sampling strategies: self-consistency requires $\Theta(1/\Delta^2)$ samples to produce the correct answer, while best-of-$n$ only needs $\Theta(1/\Delta)$, where $\Delta < 1$ denotes the probability gap between the correct and second most likely answers. Next, we present an expressiveness result for the self-correction approach with verifier feedback: it enables Transformers to simulate online learning over a pool of experts at test time. Therefore, a single Transformer architecture can provably solve multiple tasks without prior knowledge of the specific task associated with a user query, extending the representation theory of Transformers from single-task to multi-task settings. Finally, we empirically validate our theoretical results, demonstrating the practical effectiveness of self-correction methods.
Submitted 12 June, 2025; v1 submitted 5 June, 2025;
originally announced June 2025.
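The two repeated-sampling strategies named in the abstract above can be illustrated with a minimal sketch. It is not the authors' code: the function names and the `score` callable (standing in for a verifier or reward model) are hypothetical, and the abstract does not say how answers are scored in the paper's setting.

```python
from collections import Counter
from typing import Callable, Sequence


def self_consistency(samples: Sequence[str]) -> str:
    """Majority vote: return the answer that occurs most often."""
    if not samples:
        raise ValueError("need at least one sample")
    return Counter(samples).most_common(1)[0][0]


def best_of_n(samples: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the sample with the highest score.

    `score` is a hypothetical stand-in for a verifier or reward model.
    """
    if not samples:
        raise ValueError("need at least one sample")
    return max(samples, key=score)


# Six sampled answers to the same query (illustrative data only).
answers = ["42", "41", "42", "40", "41", "42"]
majority = self_consistency(answers)
# A toy scorer that happens to prefer "40".
picked = best_of_n(answers, score=lambda a: 1.0 if a == "40" else 0.0)
```

Majority voting depends only on answer frequencies, while best-of-$n$ needs the preferred answer to appear at least once among the samples and relies on the scorer to pick it out.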
-
IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis
Authors:
Hanyu Li,
Haoyu Liu,
Tingyu Zhu,
Tianyu Guo,
Zeyu Zheng,
Xiaotie Deng,
Michael I. Jordan
Abstract:
Large Language Models (LLMs) show promise as data analysis agents, but existing benchmarks overlook the iterative nature of the field, where experts' decisions evolve with deeper insights into the dataset. To address this, we introduce IDA-Bench, a novel benchmark evaluating LLM agents in multi-round interactive scenarios. Derived from complex Kaggle notebooks, tasks are presented as sequential natural language instructions by an LLM-simulated user. Agent performance is judged by comparing its final numerical output to the human-derived baseline. Initial results show that even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on < 50% of the tasks, highlighting limitations not evident in single-turn tests. This work underscores the need to improve LLMs' multi-round capabilities for building more reliable data analysis agents, highlighting the necessity of achieving a balance between instruction following and reasoning.
Submitted 6 June, 2025; v1 submitted 23 May, 2025;
originally announced May 2025.
-
Backward Conformal Prediction
Authors:
Etienne Gauthier,
Francis Bach,
Michael I. Jordan
Abstract:
We introduce $\textit{Backward Conformal Prediction}$, a method that guarantees conformal coverage while providing flexible control over the size of prediction sets. Unlike standard conformal prediction, which fixes the coverage level and allows the conformal set size to vary, our approach defines a rule that constrains how prediction set sizes behave based on the observed data, and adapts the coverage level accordingly. Our method builds on two key foundations: (i) recent results by Gauthier et al. [2025] on post-hoc validity using e-values, which ensure marginal coverage of the form $\mathbb{P}(Y_{\rm test} \in \hat C_n^{\tilde\alpha}(X_{\rm test})) \ge 1 - \mathbb{E}[\tilde\alpha]$ up to a first-order Taylor approximation for any data-dependent miscoverage $\tilde\alpha$, and (ii) a novel leave-one-out estimator $\hat\alpha^{\rm LOO}$ of the marginal miscoverage $\mathbb{E}[\tilde\alpha]$ based on the calibration set, ensuring that the theoretical guarantees remain computable in practice. This approach is particularly useful in applications where large prediction sets are impractical such as medical diagnosis. We provide theoretical results and empirical evidence supporting the validity of our method, demonstrating that it maintains computable coverage guarantees while ensuring interpretable, well-controlled prediction set sizes.
Submitted 19 May, 2025;
originally announced May 2025.
-
Online Decision-Focused Learning
Authors:
Aymeric Capitaine,
Maxime Haddouche,
Eric Moulines,
Michael I. Jordan,
Etienne Boursier,
Alain Durmus
Abstract:
Decision-focused learning (DFL) is an increasingly popular paradigm for training predictive models whose outputs are used in decision-making tasks. Instead of merely optimizing for predictive accuracy, DFL trains models to directly minimize the loss associated with downstream decisions. This end-to-end strategy holds promise for tackling complex combinatorial problems; however, existing studies focus solely on scenarios where a fixed batch of data is available and the objective function does not change over time. We instead investigate DFL in dynamic environments where the objective function and data distribution evolve over time. This setting is challenging because the objective function has zero or undefined gradients -- which prevents the use of standard first-order optimization methods -- and is generally non-convex. To address these difficulties, we (i) regularize the objective to make it differentiable and (ii) make use of the optimism principle, based on a near-optimal oracle along with an appropriate perturbation. This leads to a practical online algorithm for which we establish bounds on the expected dynamic regret, both when the decision space is a simplex and when it is a general bounded convex polytope. Finally, we demonstrate the effectiveness of our algorithm by comparing its performance with a classic prediction-focused approach on a simple knapsack experiment.
Submitted 19 May, 2025;
originally announced May 2025.
-
Understanding In-context Learning of Addition via Activation Subspaces
Authors:
Xinyan Hu,
Kayo Yin,
Michael I. Jordan,
Jacob Steinhardt,
Lijie Chen
Abstract:
To perform in-context learning, language models must extract signals from individual few-shot examples, aggregate these into a learned prediction rule, and then apply this rule to new examples. How is this implemented in the forward pass of modern transformer models? To study this, we consider a structured family of few-shot learning tasks for which the true prediction rule is to add an integer $k$ to the input. We find that Llama-3-8B attains high accuracy on this task for a range of $k$, and localize its few-shot ability to just three attention heads via a novel optimization approach. We further show the extracted signals lie in a six-dimensional subspace, where four of the dimensions track the unit digit and the other two dimensions track overall magnitude. We finally examine how these heads extract information from individual few-shot examples, identifying a self-correction mechanism in which mistakes from earlier examples are suppressed by later examples. Our results demonstrate how tracking low-dimensional subspaces across a forward pass can provide insight into fine-grained computational structures.
Submitted 15 May, 2025; v1 submitted 8 May, 2025;
originally announced May 2025.
-
Experimental demonstration of a multi-particle collective measurement for optimal quantum state estimation
Authors:
Arman Mansouri,
Kyle M. Jordan,
Raphael A. Abrahao,
Jeff S. Lundeen
Abstract:
We experimentally demonstrate a two-particle collective measurement proposed as the optimal solution to a quantum state estimation game. Our results suggest that, in practice, the collective measurement strategy is at least as good as the best local approach, and it achieves a higher average fidelity when accounting for systematic errors. This photonic implementation uses a recently developed universal two-photon projective measurement based on Hong-Ou-Mandel interference, polarization-dependent loss, and unitary operations. We compare the performance to the case where the entangling component of the measurement is suppressed. We further apply the collective measurement to quantum state tomography, observing a near-optimal scaling of the infidelity with the total number of samples.
Submitted 13 May, 2025; v1 submitted 7 May, 2025;
originally announced May 2025.
-
Stochastic Optimization with Optimal Importance Sampling
Authors:
Liviu Aolaritei,
Bart P. G. Van Parys,
Henry Lam,
Michael I. Jordan
Abstract:
Importance Sampling (IS) is a widely used variance reduction technique for enhancing the efficiency of Monte Carlo methods, particularly in rare-event simulation and related applications. Despite its power, the performance of IS is often highly sensitive to the choice of the proposal distribution and frequently requires stochastic calibration techniques. While the design and analysis of IS have been extensively studied in estimation settings, applying IS within stochastic optimization introduces a unique challenge: the decision and the IS distribution are mutually dependent, creating a circular optimization structure. This interdependence complicates both the analysis of convergence for decision iterates and the efficiency of the IS scheme. In this paper, we propose an iterative gradient-based algorithm that jointly updates the decision variable and the IS distribution without requiring time-scale separation between the two. Our method achieves the lowest possible asymptotic variance and guarantees global convergence under convexity of the objective and mild assumptions on the IS distribution family. Furthermore, we show that these properties are preserved under linear constraints by incorporating a recent variant of Nesterov's dual averaging method.
Submitted 4 April, 2025;
originally announced April 2025.
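For readers unfamiliar with the estimation setting that the abstract above contrasts with stochastic optimization, here is a minimal sketch of plain importance sampling for a rare-event probability. It uses a fixed, hand-picked proposal and does not implement the paper's joint update of decision and IS distribution. The function names, the shifted-Gaussian proposal, and the seed are all illustrative assumptions.

```python
import math
import random


def rare_event_is(threshold: float, n: int, seed: int = 0) -> float:
    """Estimate P(X > threshold) for X ~ N(0, 1) by importance sampling.

    Assumed proposal: N(threshold, 1), the target shifted onto the rare region.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(threshold, 1.0)  # draw from the proposal
        if x > threshold:
            # Likelihood ratio phi(x) / phi(x - threshold)
            # = exp(-threshold * x + threshold**2 / 2).
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n


def exact_tail(threshold: float) -> float:
    """Exact Gaussian tail probability P(X > threshold)."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


estimate = rare_event_is(4.0, 100_000)
reference = exact_tail(4.0)
```

A naive Monte Carlo estimate with the same sample budget would see only a handful of exceedances of 4, which is why the choice of proposal matters so much in this setting.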
-
Universal Log-Optimality for General Classes of e-processes and Sequential Hypothesis Tests
Authors:
Ian Waudby-Smith,
Ricardo Sandoval,
Michael I. Jordan
Abstract:
We consider the problem of sequential hypothesis testing by betting. For a general class of composite testing problems -- which include bounded mean testing, equal mean testing for bounded random tuples, and some key ingredients of two-sample and independence testing as special cases -- we show that any $e$-process satisfying a certain sublinear regret bound is adaptively, asymptotically, and almost surely log-optimal for a composite alternative. This is a strong notion of optimality that has not previously been established for the aforementioned problems and we provide explicit test supermartingales and $e$-processes satisfying this notion in the more general case. Furthermore, we derive matching lower and upper bounds on the expected rejection time for the resulting sequential tests in all of these cases. The proofs of these results make weak, algorithm-agnostic moment assumptions and rely on a general-purpose proof technique involving the aforementioned regret and a family of numeraire portfolios. Finally, we discuss how all of these theorems hold in a distribution-uniform sense, a notion of log-optimality that is stronger still and seems to be new to the literature.
Submitted 3 April, 2025;
originally announced April 2025.
-
Minimum Volume Conformal Sets for Multivariate Regression
Authors:
Sacha Braun,
Liviu Aolaritei,
Michael I. Jordan,
Francis Bach
Abstract:
Conformal prediction provides a principled framework for constructing predictive sets with finite-sample validity. While much of the focus has been on univariate response variables, existing multivariate methods either impose rigid geometric assumptions or rely on flexible but computationally expensive approaches that do not explicitly optimize prediction set volume. We propose an optimization-driven framework based on a novel loss function that directly learns minimum-volume covering sets while ensuring valid coverage. This formulation naturally induces a new nonconformity score for conformal prediction, which adapts to the residual distribution and covariates. Our approach optimizes over prediction sets defined by arbitrary norm balls, including single and multi-norm formulations. Additionally, by jointly optimizing both the predictive model and predictive uncertainty, we obtain prediction sets that are tight, informative, and computationally efficient, as demonstrated in our experiments on real-world datasets.
Submitted 24 March, 2025;
originally announced March 2025.
-
E-Values Expand the Scope of Conformal Prediction
Authors:
Etienne Gauthier,
Francis Bach,
Michael I. Jordan
Abstract:
Conformal prediction is a powerful framework for distribution-free uncertainty quantification. The standard approach to conformal prediction relies on comparing the ranks of prediction scores: under exchangeability, the rank of a future test point cannot be too extreme relative to a calibration set. This rank-based method can be reformulated in terms of p-values. In this paper, we explore an alternative approach based on e-values, known as conformal e-prediction. E-values offer key advantages that cannot be achieved with p-values, enabling new theoretical and practical capabilities. In particular, we present three applications that leverage the unique strengths of e-values: batch anytime-valid conformal prediction, fixed-size conformal sets with data-dependent coverage, and conformal prediction under ambiguous ground truth. Overall, these examples demonstrate that e-value-based constructions provide a flexible expansion of the toolbox of conformal prediction.
Submitted 6 May, 2025; v1 submitted 17 March, 2025;
originally announced March 2025.
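The standard rank-based approach that the abstract above takes as its starting point can be sketched as split conformal prediction. This is the textbook construction, not the e-value method the paper proposes; the function names and the toy scores are illustrative assumptions.

```python
import math
from typing import Dict, Sequence, Set


def conformal_threshold(cal_scores: Sequence[float], alpha: float) -> float:
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    Under exchangeability, a test score exceeds this threshold with
    probability at most alpha.
    """
    n = len(cal_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: the set is everything.
        return math.inf
    return sorted(cal_scores)[k - 1]


def prediction_set(candidate_scores: Dict[str, float], threshold: float) -> Set[str]:
    """Keep every candidate label whose nonconformity score is within the threshold."""
    return {label for label, s in candidate_scores.items() if s <= threshold}


# Toy calibration scores (illustrative data only).
scores = [3.5, 0.5, 6.5, 1.5, 4.5, 2.5, 5.5]
tau = conformal_threshold(scores, alpha=0.25)
labels = prediction_set({"a": 1.0, "b": 5.5, "c": 6.0}, tau)
```

Here the coverage level is fixed in advance and the set size varies with the data, which is the behaviour the e-value constructions in the abstract are designed to relax.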
-
Resolving UnderEdit & OverEdit with Iterative & Neighbor-Assisted Model Editing
Authors:
Bhiman Kumar Baghel,
Scott M. Jordan,
Zheyuan Ryan Shi,
Xiang Lorraine Li
Abstract:
Large Language Models (LLMs) are widely deployed in downstream tasks, but keeping their knowledge up-to-date via retraining or fine-tuning is often computationally expensive. Model editing provides a more efficient alternative by updating a targeted subset of parameters, which often follows the locate-and-edit paradigm. Despite this efficiency, existing methods are limited: edits may fail to inject knowledge (UnderEdit) or unintentionally disrupt unrelated neighboring knowledge (OverEdit). To address these challenges, we propose two complementary methods: iterative model editing, which applies successive edits to mitigate UnderEdit, and neighbor-assisted model editing, which incorporates neighboring knowledge during editing to reduce OverEdit. Our extensive experiments show that these techniques improve editing performance across multiple LLMs, algorithms, and benchmarks, reducing UnderEdit by up to 38 percentage points and OverEdit by up to 6, while remaining broadly applicable to any locate-and-edit method.
Submitted 17 June, 2025; v1 submitted 14 March, 2025;
originally announced March 2025.
-
Datasets, Documents, and Repetitions: The Practicalities of Unequal Data Quality
Authors:
Alex Fang,
Hadi Pouransari,
Matt Jordan,
Alexander Toshev,
Vaishaal Shankar,
Ludwig Schmidt,
Tom Gunter
Abstract:
Data filtering has become a powerful tool for improving model performance while reducing computational cost. However, as large language model compute budgets continue to grow, the limited data volume provided by heavily filtered and deduplicated datasets will become a practical constraint. In efforts to better understand how to proceed, we study model performance at various compute budgets and across multiple pre-training datasets created through data filtering and deduplication. We find that, given appropriate modifications to the training recipe, repeating existing aggressively filtered datasets for up to ten epochs can outperform training on the ten times larger superset for a single epoch across multiple compute budget orders of magnitude. While this finding relies on repeating the dataset for many epochs, we also investigate repeats within these datasets at the document level. We find that not all documents within a dataset are equal, and we can create better datasets relative to a token budget by explicitly manipulating the counts of individual documents. We conclude by arguing that even as large language models scale, data filtering remains an important direction of research.
Submitted 10 March, 2025;
originally announced March 2025.
-
The Role of the Marketplace Operator in Inducing Competition
Authors:
Tiffany Ding,
Dominique Perrault-Joncas,
Orit Ronen,
Michael I. Jordan,
Dirk Bergemann,
Dean Foster,
Omer Gottesman
Abstract:
The steady rise of e-commerce marketplaces underscores the need to study a market structure that captures the key features of this setting. To this end, we consider a price-quantity Stackelberg duopoly in which the leader is the marketplace operator and the follower is an independent seller. The objective of the marketplace operator is to maximize a weighted sum of profit and a term capturing positive customer experience, whereas the independent seller solely seeks to maximize their own profit. Furthermore, the independent seller is required to share a fraction of their revenue with the marketplace operator for the privilege of selling on the platform. We derive the subgame-perfect Nash equilibrium of this game and find that the equilibrium strategies depend on the assumed rationing rule. We then consider practical implications for marketplace operators. Finally, we show that, under intensity rationing, consumer surplus and total welfare in the duopoly marketplace are always at least as high as under an independent seller monopoly, demonstrating that it is socially beneficial for the operator to join the market as a seller.
Submitted 9 March, 2025;
originally announced March 2025.
-
An Overview of Large Language Models for Statisticians
Authors:
Wenlong Ji,
Weizhe Yuan,
Emily Getzen,
Kyunghyun Cho,
Michael I. Jordan,
Song Mei,
Jason E Weston,
Weijie J. Su,
Jing Xu,
Linjun Zhang
Abstract:
Large Language Models (LLMs) have emerged as transformative tools in artificial intelligence (AI), exhibiting remarkable capabilities across diverse tasks such as text generation, reasoning, and decision-making. While their success has primarily been driven by advances in computational power and deep learning architectures, emerging problems -- in areas such as uncertainty quantification, decision-making, causal inference, and distribution shift -- require a deeper engagement with the field of statistics. This paper explores potential areas where statisticians can make important contributions to the development of LLMs, particularly those that aim to engender trustworthiness and transparency for human users. Thus, we focus on issues such as uncertainty quantification, interpretability, fairness, privacy, watermarking and model adaptation. We also consider possible roles for LLMs in statistical analysis. By bridging AI and statistics, we aim to foster a deeper collaboration that advances both the theoretical foundations and practical applications of LLMs, ultimately shaping their role in addressing complex societal challenges.
Submitted 24 February, 2025;
originally announced February 2025.
-
Conformal Prediction under Levy-Prokhorov Distribution Shifts: Robustness to Local and Global Perturbations
Authors:
Liviu Aolaritei,
Zheyu Oliver Wang,
Julie Zhu,
Michael I. Jordan,
Youssef Marzouk
Abstract:
Conformal prediction provides a powerful framework for constructing prediction intervals with finite-sample guarantees, yet its robustness under distribution shifts remains a significant challenge. This paper addresses this limitation by modeling distribution shifts using Levy-Prokhorov (LP) ambiguity sets, which capture both local and global perturbations. We provide a self-contained overview of LP ambiguity sets and their connections to popular metrics such as Wasserstein and Total Variation. We show that the link between conformal prediction and LP ambiguity sets is a natural one: by propagating the LP ambiguity set through the scoring function, we reduce complex high-dimensional distribution shifts to manageable one-dimensional distribution shifts, enabling exact quantification of worst-case quantiles and coverage. Building on this analysis, we construct robust conformal prediction intervals that remain valid under distribution shifts, explicitly linking LP parameters to interval width and confidence levels. Experimental results on real-world datasets demonstrate the effectiveness of the proposed approach.
Submitted 18 May, 2025; v1 submitted 19 February, 2025;
originally announced February 2025.
-
How Do LLMs Perform Two-Hop Reasoning in Context?
Authors:
Tianyu Guo,
Hanlin Zhu,
Ruiqi Zhang,
Jiantao Jiao,
Song Mei,
Michael I. Jordan,
Stuart Russell
Abstract:
"Socrates is human. All humans are mortal. Therefore, Socrates is mortal." This form of argument illustrates a typical pattern of two-hop reasoning. Formally, two-hop reasoning refers to the process of inferring a conclusion by making two logical steps, each connecting adjacent concepts, such that the final conclusion depends on the integration of both steps. It is one of the most fundamental components of human reasoning and plays a crucial role in both formal logic and everyday decision-making. Despite recent progress in large language models (LLMs), we surprisingly find that they can fail at solving simple two-hop reasoning problems when distractors are present. We observe on a synthetic dataset that pre-trained LLMs often resort to random guessing among all plausible conclusions. However, after a few steps of fine-tuning, models achieve near-perfect accuracy and exhibit strong length generalization. To understand the underlying mechanisms, we train a 3-layer Transformer from scratch on a synthetic two-hop reasoning task and reverse-engineer its internal information flow. We observe a clear progression in the attention logits throughout training. This reveals a sharp phase transition from an initial stage of random guessing to the emergence of a structured sequential query mechanism, where the model first retrieves the preceding and the bridge concepts in the early layers and then uses them to infer the final answer. Finally, we show that these dynamics can be captured by a minimal three-parameter attention-only network.
Submitted 28 May, 2025; v1 submitted 19 February, 2025;
originally announced February 2025.
-
Statistical Collusion by Collectives on Learning Platforms
Authors:
Etienne Gauthier,
Francis Bach,
Michael I. Jordan
Abstract:
As platforms increasingly rely on learning algorithms, collectives may form and seek ways to influence these platforms to align with their own interests. This can be achieved by coordinated submission of altered data. To evaluate the potential impact of such behavior, it is essential to understand the computations that collectives must perform to impact platforms in this way. In particular, collectives need to make a priori assessments of the effect of the collective before taking action, as they may face potential risks when modifying their data. Moreover, they need to develop implementable coordination algorithms based on quantities that can be inferred from observed data. We develop a framework that provides a theoretical and algorithmic treatment of these issues and present experimental results in a product evaluation domain.
Submitted 25 May, 2025; v1 submitted 7 February, 2025;
originally announced February 2025.
-
Online Decision-Making in Tree-Like Multi-Agent Games with Transfers
Authors:
Antoine Scheid,
Etienne Boursier,
Alain Durmus,
Eric Moulines,
Michael Jordan
Abstract:
The widespread deployment of machine learning systems raises challenges, such as dealing with interactions or competition between multiple learners. To that end, we study multi-agent sequential decision-making by considering principal-agent interactions in a tree structure. In this problem, the reward of a player is influenced by the actions of her children, who are all self-interested and non-cooperative, hence the complexity of making good decisions. Our main finding is that it is possible to steer all the players towards the globally optimal set of actions by simply allowing single-step transfers between them. A transfer is established between a principal and one of her agents: the principal actually offers the proposed payment if the agent picks the recommended action. The analysis poses specific challenges due to the intricate interactions between the nodes of the tree and the propagation of the regret within this tree. Considering a bandit setup, we propose algorithmic solutions for the players to end up being no-regret with respect to the optimal pair of actions and incentives. In the long run, allowing transfers between players makes them act as if they were collaborating, although they remain self-interested and non-cooperative: transfers restore efficiency.
Submitted 27 May, 2025; v1 submitted 31 January, 2025;
originally announced January 2025.
-
Rethinking Early Stopping: Refine, Then Calibrate
Authors:
Eugène Berta,
David Holzmüller,
Michael I. Jordan,
Francis Bach
Abstract:
Machine learning classifiers often produce probabilistic predictions that are critical for accurate and interpretable decision-making in various domains. The quality of these predictions is generally evaluated with proper losses like cross-entropy, which decompose into two components: calibration error assesses general under/overconfidence, while refinement error measures the ability to distinguish different classes. In this paper, we provide theoretical and empirical evidence that these two errors are not minimized simultaneously during training. Selecting the best training epoch based on validation loss thus leads to a compromise point that is suboptimal for both calibration error and, most importantly, refinement error. To address this, we introduce a new metric for early stopping and hyperparameter tuning that makes it possible to minimize refinement error during training. The calibration error is minimized after training, using standard techniques. Our method integrates seamlessly with any architecture and consistently improves performance across diverse classification tasks.
Submitted 31 January, 2025;
originally announced January 2025.
-
Prediction-Aware Learning in Multi-Agent Systems
Authors:
Aymeric Capitaine,
Etienne Boursier,
Eric Moulines,
Michael I. Jordan,
Alain Durmus
Abstract:
The framework of uncoupled online learning in multiplayer games has made significant progress in recent years. In particular, the development of time-varying games has considerably expanded its modeling capabilities. However, current regret bounds quickly become vacuous when the game undergoes significant variations over time, even when these variations are easy to predict. Intuitively, the ability of players to forecast future payoffs should lead to tighter guarantees, yet existing approaches fail to incorporate this aspect. This work aims to fill this gap by introducing a novel prediction-aware framework for time-varying games, where agents can forecast future payoffs and adapt their strategies accordingly. In this framework, payoffs depend on an underlying state of nature that agents predict in an online manner. To leverage these predictions, we propose the POWMU algorithm, a contextual extension of the optimistic Multiplicative Weight Update algorithm, for which we establish theoretical guarantees on social welfare and convergence to equilibrium. Our results demonstrate that, under bounded prediction errors, the proposed framework achieves performance comparable to the static setting. Finally, we empirically demonstrate the effectiveness of POWMU in a traffic routing experiment.
Submitted 31 January, 2025;
originally announced January 2025.
-
The Sample Complexity of Online Reinforcement Learning: A Multi-model Perspective
Authors:
Michael Muehlebach,
Zhiyu He,
Michael I. Jordan
Abstract:
We study the sample complexity of online reinforcement learning in the general setting of nonlinear dynamical systems with continuous state and action spaces. Our analysis accommodates a large class of dynamical systems ranging from a finite set of nonlinear candidate models to models with bounded and Lipschitz continuous dynamics, to systems that are parametrized by a compact and real-valued set of parameters. In the most general setting, our algorithm achieves a policy regret of $\mathcal{O}(N ε^2 + \mathrm{ln}(m(ε))/ε^2)$, where $N$ is the time horizon, $ε$ is a user-specified discretization width, and $m(ε)$ measures the complexity of the function class under consideration via its packing number. In the special case where the dynamics are parametrized by a compact and real-valued set of parameters (such as neural networks, transformers, etc.), we prove a policy regret of $\mathcal{O}(\sqrt{N p})$, where $p$ denotes the number of parameters, recovering earlier sample-complexity results that were derived for linear time-invariant dynamical systems. While this article focuses on characterizing sample complexity, the proposed algorithms are likely to be useful in practice, due to their simplicity, their ability to incorporate prior knowledge, and their benign transient behavior.
Submitted 20 May, 2025; v1 submitted 27 January, 2025;
originally announced January 2025.
-
Conformal Prediction Sets with Improved Conditional Coverage using Trust Scores
Authors:
Jivat Neet Kaur,
Michael I. Jordan,
Ahmed Alaa
Abstract:
Standard conformal prediction offers a marginal guarantee on coverage, but for prediction sets to be truly useful, they should ideally ensure coverage conditional on each test point. Unfortunately, it is impossible to achieve exact, distribution-free conditional coverage in finite samples. In this work, we propose an alternative conformal prediction algorithm that targets coverage where it matters most--in instances where a classifier is overconfident in its incorrect predictions. We start by dissecting miscoverage events in marginally-valid conformal prediction, and show that miscoverage rates vary based on the classifier's confidence and its deviation from the Bayes optimal classifier. Motivated by this insight, we develop a variant of conformal prediction that targets coverage conditional on a reduced set of two variables: the classifier's confidence in a prediction and a nonparametric trust score that measures its deviation from the Bayes classifier. Empirical evaluation on multiple image datasets shows that our method generally improves conditional coverage properties compared to standard conformal prediction, including class-conditional coverage, coverage over arbitrary subgroups, and coverage over demographic groups.
Submitted 9 February, 2025; v1 submitted 17 January, 2025;
originally announced January 2025.
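For orientation, the marginal guarantee of standard conformal prediction mentioned in this abstract can be illustrated with a minimal split-conformal sketch. This is the baseline procedure only, not the trust-score variant the paper proposes. The predictor `predict`, the synthetic data generator `sample`, and the absolute-residual score are all assumptions made for illustration.

```python
import math
import random


def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic to use
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(scores)[k - 1]


def predict(x):
    # Hypothetical fitted point predictor (assumed, not from the paper).
    return 2.0 * x


def sample(n, rng):
    # Synthetic regression data: y = 2x + Gaussian noise (illustrative only).
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + rng.gauss(0.0, 0.3)) for x in xs]


rng = random.Random(1)
calib, test = sample(500, rng), sample(2000, rng)
alpha = 0.1

# Nonconformity score: absolute residual on the held-out calibration set.
scores = [abs(y - predict(x)) for x, y in calib]
q = conformal_quantile(scores, alpha)

# The interval [predict(x) - q, predict(x) + q] covers y when the residual is <= q.
covered = sum(abs(y - predict(x)) <= q for x, y in test)
coverage = covered / len(test)
print(f"interval half-width {q:.3f}, empirical coverage {coverage:.3f}")
```

The empirical coverage should land near the nominal level of 0.9 on average over test points. Nothing in this construction controls coverage for any particular subgroup or test point, which is the gap the abstract describes.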
-
Gradient Equilibrium in Online Learning: Theory and Applications
Authors:
Anastasios N. Angelopoulos,
Michael I. Jordan,
Ryan J. Tibshirani
Abstract:
We present a new perspective on online learning that we refer to as gradient equilibrium: a sequence of iterates achieves gradient equilibrium if the average of gradients of losses along the sequence converges to zero. In general, this condition is not implied by, nor implies, sublinear regret. It turns out that gradient equilibrium is achievable by standard online learning methods such as gradient descent and mirror descent with constant step sizes (rather than decaying step sizes, as is usually required for no regret). Further, as we show through examples, gradient equilibrium translates into an interpretable and meaningful property in online prediction problems spanning regression, classification, quantile estimation, and others. Notably, we show that the gradient equilibrium framework can be used to develop a debiasing scheme for black-box predictions under arbitrary distribution shift, based on simple post hoc online descent updates. We also show that post hoc gradient updates can be used to calibrate predicted quantiles under distribution shift, and that the framework leads to unbiased Elo scores for pairwise preference prediction.
Submitted 18 February, 2025; v1 submitted 14 January, 2025;
originally announced January 2025.
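The debiasing idea in this abstract can be illustrated with a small sketch under simplifying assumptions: a scalar additive correction `theta`, squared loss, and synthetic black-box predictions with a constant bias of +2. The function name `debias_online` and the data are hypothetical, and the paper's actual scheme may differ in its details.

```python
import random


def debias_online(preds, labels, step=0.1):
    """Post hoc online gradient descent on an additive bias correction.

    Loss at round t: 0.5 * (y_t - (f_t + theta))**2, so the gradient with
    respect to theta is (f_t + theta - y_t). The step size stays constant.
    Returns the corrected predictions and the per-round gradients.
    """
    theta = 0.0
    corrected, grads = [], []
    for f, y in zip(preds, labels):
        p = f + theta      # corrected prediction, issued before y is revealed
        g = p - y          # gradient of the squared loss with respect to theta
        theta -= step * g  # constant-step gradient update
        corrected.append(p)
        grads.append(g)
    return corrected, grads


rng = random.Random(0)
T = 5000
labels = [rng.gauss(0.0, 1.0) for _ in range(T)]
preds = [y + 2.0 + rng.gauss(0.0, 0.5) for y in labels]  # biased by +2

corrected, grads = debias_online(preds, labels)
raw_bias = sum(f - y for f, y in zip(preds, labels)) / T
avg_grad = sum(grads) / T  # also the average residual of the corrected predictions
print(f"raw bias {raw_bias:.3f}, average gradient {avg_grad:.4f}")
```

With this loss the gradients telescope: their sum equals (initial `theta` minus final `theta`) divided by the step size, so the average gradient vanishes as long as the iterates stay bounded. Here the average gradient is exactly the average bias of the corrected predictions.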
-
2 OLMo 2 Furious
Authors:
Team OLMo,
Pete Walsh,
Luca Soldaini,
Dirk Groeneveld,
Kyle Lo,
Shane Arora,
Akshita Bhagia,
Yuling Gu,
Shengyi Huang,
Matt Jordan,
Nathan Lambert,
Dustin Schwenk,
Oyvind Tafjord,
Taira Anderson,
David Atkinson,
Faeze Brahman,
Christopher Clark,
Pradeep Dasigi,
Nouha Dziri,
Michal Guerquin,
Hamish Ivison,
Pang Wei Koh,
Jiacheng Liu,
Saumya Malik,
William Merrill,
et al. (15 additional authors not shown)
Abstract:
We present OLMo 2, the next generation of our fully open language models. OLMo 2 includes dense autoregressive models with improved architecture and training recipe, pretraining data mixtures, and instruction tuning recipes. Our modified model architecture and training recipe achieve both better training stability and improved per-token efficiency. Our updated pretraining data mixture introduces a new, specialized data mix called Dolmino Mix 1124, which significantly improves model capabilities across many downstream task benchmarks when introduced via late-stage curriculum training (i.e. specialized data during the annealing phase of pretraining). Finally, we incorporate best practices from Tülu 3 to develop OLMo 2-Instruct, focusing on permissive data and extending our final-stage reinforcement learning with verifiable rewards (RLVR). Our OLMo 2 base models sit at the Pareto frontier of performance to compute, often matching or outperforming open-weight only models like Llama 3.1 and Qwen 2.5 while using fewer FLOPs and with fully transparent training data, code, and recipe. Our fully open OLMo 2-Instruct models are competitive with or surpass open-weight only models of comparable size, including Qwen 2.5, Llama 3.1 and Gemma 2. We release all OLMo 2 artifacts openly -- models at 7B and 13B scales, both pretrained and post-trained, including their full training data, training code and recipes, training logs and thousands of intermediate checkpoints. The final instruction model is available on the Ai2 Playground as a free research demo.
Submitted 14 January, 2025; v1 submitted 31 December, 2024;
originally announced January 2025.
-
An Optimistic Algorithm for Online Convex Optimization with Adversarial Constraints
Authors:
Jordan Lekeufack,
Michael I. Jordan
Abstract:
We study Online Convex Optimization (OCO) with adversarial constraints, where an online algorithm must make sequential decisions to minimize both convex loss functions and cumulative constraint violations. We focus on a setting where the algorithm has access to predictions of the loss and constraint functions. Our results show that we can improve the current best bounds of $O(\sqrt{T})$ regret and $\tilde{O}(\sqrt{T})$ cumulative constraint violations to $O(\sqrt{E_T(f)})$ and $\tilde{O}(\sqrt{E_T(g^+)})$, respectively, where $E_T(f)$ and $E_T(g^+)$ represent the cumulative prediction errors of the loss and constraint functions. In the worst case, where $E_T(f) = O(T)$ and $E_T(g^+) = O(T)$ (assuming bounded gradients of the loss and constraint functions), our rates match the prior $O(\sqrt{T})$ results. However, when the loss and constraint predictions are accurate, our approach yields significantly smaller regret and cumulative constraint violations. Finally, we apply this to the setting of adversarial contextual bandits with sequential risk constraints, obtaining optimistic bounds of $O(\sqrt{E_T(f)} T^{1/3})$ regret and $O(\sqrt{E_T(g^+)} T^{1/3})$ constraint violation, yielding better performance than existing results when prediction quality is sufficiently high.
Submitted 12 March, 2025; v1 submitted 10 December, 2024;
originally announced December 2024.
-
Quadrupolar Density Structures in Driven Magnetic Reconnection Experiments with a Guide Field
Authors:
T. W. O. Varnish,
J. Chen,
S. Chowdhry,
R. Datta,
G. V. Dowhan,
L. S. Horan IV,
N. M. Jordan,
E. R. Neill,
A. P. Shah,
B. J. Sporer,
R. Shapovalov,
R. D. McBride,
J. D. Hare
Abstract:
Magnetic reconnection is a ubiquitous process in plasma physics, driving rapid and energetic events such as coronal mass ejections. Reconnection between magnetic fields with arbitrary shear can be decomposed into an anti-parallel, reconnecting component, and a non-reconnecting guide-field component which is parallel to the reconnecting electric field. This guide field modifies the structure of the reconnection layer and the reconnection rate. We present results from experiments on the MAIZE pulsed-power generator (500 kA peak current, 200 ns rise-time) which use two exploding wire arrays, tilted in opposite directions, to embed a guide field in the plasma flows with a relative strength $b\equiv B_g/B_{rec}=\text{0, 0.4, or 1}$. The reconnection layers in these experiments have widths which are less than the ion skin depth, $d_i=c/ω_{pi}$, indicating the importance of the Hall term, which generates a distinctive quadrupolar magnetic field structure along the separatrices of the reconnection layer. Using laser imaging interferometry, we observe quadrupolar structures in the line-integrated electron density, consistent with the interaction of the embedded guide field with the quadrupolar Hall field. Our measurements extend over much larger length scales ($40 d_i$) at higher $β$ ($\sim 1$) than previous experiments, providing an insight into the global structure of the reconnection layer.
Submitted 3 December, 2024;
originally announced December 2024.
-
Hydrodynamical simulations with strong indirect terms in Fargo-like codes: Numerical aspects of non-inertial frame and artificial viscosity
Authors:
Lucas M. Jordan,
Thomas Rometsch
Abstract:
Context. Binary star systems allow us to study the planet formation process under extreme conditions. In the early stages, these systems contain a circumbinary disk and a disk around each star. To model the interactions between these disks in the frame of one of the stars, strong fictitious forces must be included in the simulations. The original Fargo and the Fargo3D codes fail to correctly simulate such systems if the indirect term becomes too strong.
Aims. We present a different way to compute the indirect term which, together with a tensor artificial viscosity prescription, allows the Fargo code to simulate the circumbinary disks in a non-inertial frame of reference. In this way, the Fargo code can be used to study interactions between circumstellar and circumbinary disks.
Results. We find that updating the indirect term becomes relevant when the indirect term becomes stronger than the direct gravitational forces, which occurs for mass ratios of $q > 5\%$. The default artificial viscosity used in the Fargo code inherently produces artificial pressure in a non-inertial frame of reference even in the absence of shocks. This leads to artificial mass ejection from the Hill sphere, starting at brown dwarf masses ($q > 1\%$). These problems can be mitigated by using a tensor artificial viscosity formulation. For high mass ratios, $q > 1\%$, it also becomes important to initialize the disk in the center-of-mass frame. We expect our proposed changes to be relevant for other grid-based hydrodynamic codes where strong indirect terms occur, or for codes that use artificial viscosity.
Submitted 28 November, 2024;
originally announced November 2024.
-
Stability of Crossed-Field Amplifiers
Authors:
Christopher Swenson,
Ryan Revolinsky,
Adam Brusstar,
Emma Guerin,
Nicholas M. Jordan,
Y. Y. Lau,
Ronald Gilgenbach
Abstract:
This research examines the stability of crossed-field amplifiers (CFAs) and characterizes their different modes of operation: amplification, driven oscillation, and self-excited oscillation. The CFA used in this paper is the Recirculating Planar Crossed-Field Amplifier (RPCFA), which is a high power (MW) pulsed (300 ns) amplifier that operates around 3 GHz. Initially, the RPCFA is shown to be a stable amplifier with moderate gain (5.1 dB), but by either reducing the anode-cathode (AK) gap spacing or increasing the driving current, the amplifier operation transitions from amplification to oscillation. Depending on the operating conditions, these oscillations are either driven by the input RF signal or self-excited. These self-excited oscillations can have a lower synchronization phase velocity than the maximum velocity in the electron beam, implying that slower electrons within the Brillouin hub can interact with electromagnetic modes on the RF circuit. A cold tube analysis of the RPCFA shows that the Q-factor of certain modes on the RF circuit varies significantly when the AK gap geometry of the RPCFA is altered, which leads to a discrete shift in operating frequency. The operation of the RPCFA close to Hull cutoff is found to share some key features of magnetically insulated transmission line oscillators (MILO) that could also explain the dramatic frequency shift. Instantaneous phase analysis by Hilbert transforms can be used, in conjunction with the frequency and output power analysis, to determine the onset of the transition from amplification to oscillation, and to characterize the oscillation.
Submitted 4 December, 2024; v1 submitted 24 November, 2024;
originally announced November 2024.
-
Dimension-free Private Mean Estimation for Anisotropic Distributions
Authors:
Yuval Dagan,
Michael I. Jordan,
Xuelin Yang,
Lydia Zakynthinou,
Nikita Zhivotovskiy
Abstract:
We present differentially private algorithms for high-dimensional mean estimation. Previous private estimators on distributions over $\mathbb{R}^d$ suffer from a curse of dimensionality, as they require $Ω(d^{1/2})$ samples to achieve non-trivial error, even in cases where $O(1)$ samples suffice without privacy. This rate is unavoidable when the distribution is isotropic, namely, when the covariance is a multiple of the identity matrix, or when accuracy is measured with respect to the affine-invariant Mahalanobis distance. Yet, real-world data is often highly anisotropic, with signals concentrated on a small number of principal components. We develop estimators that are appropriate for such signals: our estimators are $(\varepsilon,δ)$-differentially private and have sample complexity that is dimension-independent for anisotropic subgaussian distributions. Given $n$ samples from a distribution with known covariance-proxy $Σ$ and unknown mean $μ$, we present an estimator $\hatμ$ that achieves error $\|\hatμ-μ\|_2\leq α$, as long as $n\gtrsim\mathrm{tr}(Σ)/α^2+ \mathrm{tr}(Σ^{1/2})/(α\varepsilon)$. In particular, when $\pmbσ^2=(σ_1^2, \ldots, σ_d^2)$ are the singular values of $Σ$, we have $\mathrm{tr}(Σ)=\|\pmbσ\|_2^2$ and $\mathrm{tr}(Σ^{1/2})=\|\pmbσ\|_1$, and hence our bound avoids dimension-dependence when the signal is concentrated in a few principal components. We show that this is the optimal sample complexity for this task up to logarithmic factors. Moreover, for the case of unknown covariance, we present an algorithm whose sample complexity has improved dependence on the dimension, from $d^{1/2}$ to $d^{1/4}$.
Submitted 1 November, 2024;
originally announced November 2024.
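To make the stated sample-complexity expression concrete, the following sketch simply evaluates $\mathrm{tr}(Σ)/α^2 + \mathrm{tr}(Σ^{1/2})/(α\varepsilon)$ for two illustrative spectra. It ignores the constants and logarithmic factors hidden in $\gtrsim$, implements no private estimator, and the function name, the spectra, and the values of $α$ and $\varepsilon$ are all assumptions chosen for illustration.

```python
import math


def private_mean_sample_bound(sigmas_sq, alpha, eps):
    """Evaluate tr(Sigma)/alpha^2 + tr(Sigma^{1/2})/(alpha * eps).

    sigmas_sq holds the values sigma_i^2 (the spectrum of Sigma).
    Constants and logarithmic factors are ignored.
    """
    trace = sum(sigmas_sq)                             # equals ||sigma||_2^2
    trace_sqrt = sum(math.sqrt(s) for s in sigmas_sq)  # equals ||sigma||_1
    return trace / alpha**2 + trace_sqrt / (alpha * eps)


d = 10_000
isotropic = [1.0] * d                       # Sigma is the identity
anisotropic = [1.0] * 5 + [1e-8] * (d - 5)  # signal in 5 principal components

n_iso = private_mean_sample_bound(isotropic, alpha=0.5, eps=1.0)
n_aniso = private_mean_sample_bound(anisotropic, alpha=0.5, eps=1.0)
print(f"isotropic: {n_iso:.0f}, anisotropic: {n_aniso:.1f}")
```

For the identity covariance both traces equal $d$, so the expression grows linearly with the dimension. For the concentrated spectrum it stays at a few tens even though $d = 10{,}000$, which is the dimension-independence the abstract refers to.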
-
Learning Variational Inequalities from Data: Fast Generalization Rates under Strong Monotonicity
Authors:
Eric Zhao,
Tatjana Chavdarova,
Michael Jordan
Abstract:
Variational inequalities (VIs) are a broad class of optimization problems encompassing machine learning problems ranging from standard convex minimization to more complex scenarios like min-max optimization and computing the equilibria of multi-player games. In convex optimization, strong convexity allows for fast statistical learning rates requiring only $Θ(1/ε)$ stochastic first-order oracle calls to find an $ε$-optimal solution, rather than the standard $Θ(1/ε^2)$ calls. This note provides a simple overview of how one can similarly obtain fast $Θ(1/ε)$ rates for learning VIs that satisfy strong monotonicity, a generalization of strong convexity. Specifically, we demonstrate that standard stability-based generalization arguments for convex minimization extend directly to VIs when the domain admits a small covering, or when the operator is integrable and suboptimality is measured by potential functions, such as when finding equilibria in multi-player games.
Submitted 18 February, 2025; v1 submitted 27 October, 2024;
originally announced October 2024.
-
Enhancing Feature-Specific Data Protection via Bayesian Coordinate Differential Privacy
Authors:
Maryam Aliakbarpour,
Syomantak Chaudhuri,
Thomas A. Courtade,
Alireza Fallah,
Michael I. Jordan
Abstract:
Local Differential Privacy (LDP) offers strong privacy guarantees without requiring users to trust external parties. However, LDP applies uniform protection to all data features, including less sensitive ones, which degrades performance of downstream tasks. To overcome this limitation, we propose a Bayesian framework, Bayesian Coordinate Differential Privacy (BCDP), that enables feature-specific privacy quantification. This more nuanced approach complements LDP by adjusting privacy protection according to the sensitivity of each feature, enabling improved performance of downstream tasks without compromising privacy. We characterize the properties of BCDP and articulate its connections with standard non-Bayesian privacy frameworks. We further apply our BCDP framework to the problems of private mean estimation and ordinary least-squares regression. The BCDP-based approach obtains improved accuracy compared to a purely LDP-based approach, without compromising on privacy.
Submitted 23 October, 2024;
originally announced October 2024.
-
Optimal Design for Reward Modeling in RLHF
Authors:
Antoine Scheid,
Etienne Boursier,
Alain Durmus,
Michael I. Jordan,
Pierre Ménard,
Eric Moulines,
Michal Valko
Abstract:
Reinforcement Learning from Human Feedback (RLHF) has become a popular approach to align language models (LMs) with human preferences. This method involves collecting a large dataset of human pairwise preferences across various text generations and using it to infer (implicitly or explicitly) a reward model. Numerous methods have been proposed to learn the reward model and align an LM with it. However, the costly process of collecting human preferences has received little attention and could benefit from theoretical insights. This paper addresses this issue and aims to formalize the reward training model in RLHF. We frame the selection of an effective dataset as a simple regret minimization task, using a linear contextual dueling bandit method. Given the potentially large number of arms, this approach is more coherent than the best-arm identification setting. We then propose an offline framework for solving this problem. Under appropriate assumptions - linearity of the reward model in the embedding space, and boundedness of the reward parameter - we derive bounds on the simple regret. Finally, we provide a lower bound that matches our upper bound up to constant and logarithmic terms. To our knowledge, this is the first theoretical contribution in this area to provide an offline approach as well as worst-case guarantees.
Submitted 23 October, 2024; v1 submitted 22 October, 2024;
originally announced October 2024.
-
Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs
Authors:
Tianyu Guo,
Druv Pai,
Yu Bai,
Jiantao Jiao,
Michael I. Jordan,
Song Mei
Abstract:
Practitioners have consistently observed three puzzling phenomena in transformer-based large language models (LLMs): attention sinks, value-state drains, and residual-state peaks, collectively referred to as extreme-token phenomena. These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights, exhibiting significantly smaller value states, and having much larger residual-state norms than those of other tokens. These extreme tokens give rise to various challenges in LLM inference, quantization, and interpretability.
We elucidate the mechanisms behind extreme-token phenomena. First, we show that these phenomena arise in very simple architectures -- transformers with one to three layers -- trained on a toy model, the Bigram-Backcopy (BB) task. In this setting, we identify an active-dormant mechanism, where attention heads become sinks for specific input domains while remaining non-sinks for others. Our theoretical analysis of the training dynamics reveals that these phenomena are driven by a mutual reinforcement mechanism. Building on these insights, we propose strategies to mitigate extreme-token phenomena during pretraining, including replacing softmax with ReLU and Adam with SGD. Next, we extend our analysis to pretrained LLMs, including Llama and OLMo, showing that many attention heads exhibit a similar active-dormant mechanism as in the BB task, and that the mutual reinforcement mechanism also governs the emergence of extreme-token phenomena during LLM pretraining. Our results reveal that many of the static and dynamic properties of extreme-token phenomena predicted by the BB task align with observations in pretrained LLMs.
Submitted 7 November, 2024; v1 submitted 17 October, 2024;
originally announced October 2024.
-
Wide-field microwave magnetic field imaging with nitrogen-vacancy centers in diamond
Authors:
Luca Basso,
Pauli Kehayias,
Jacob Henshaw,
Gajadhar Joshi,
Michael P. Lilly,
Matthew B. Jordan,
Andrew M. Mounce
Abstract:
Non-invasive imaging of microwave (MW) magnetic fields with microscale lateral resolution is pivotal for various applications, such as MW technologies and integrated circuit failure analysis. Diamond nitrogen-vacancy (NV) center magnetometry has emerged as an ideal tool, offering $μ$m-scale resolution, millimeter-scale field of view, high sensitivity, and non-invasive imaging compatible with diverse samples. However, up until now, it has been predominantly used for imaging of static or low-frequency magnetic fields or, concerning MW field imaging, to directly characterize the same microwave device used to drive the NV spin transitions. In this work we leverage an NV center ensemble in diamond for wide-field imaging of MW magnetic fields generated by a test device employing a differential measurement protocol. The microscope is equipped with a MW loop to induce Rabi oscillations between NV spin states, and the MW field from the device-under-test is measured through local deviations in the Rabi frequency. This differential protocol yields magnetic field maps of a 2.57 GHz MW field with a sensitivity of $\sim$ 9 $μ$T Hz$^{-1/2}$ for a total measurement duration of $T = 357$ s, covering a $340\times340$ $μ$m$^2$ field of view with a $μ$m-scale spatial resolution and a DUT input power dynamic range of 30 dB. This work demonstrates a novel NV magnetometry protocol, based on differential Rabi frequency measurement, that extends NV wide-field imaging capabilities to imaging of weak MW magnetic fields that would be difficult to measure directly through standard NV Rabi magnetometry.
Submitted 18 October, 2024; v1 submitted 24 September, 2024;
originally announced September 2024.
-
Safety vs. Performance: How Multi-Objective Learning Reduces Barriers to Market Entry
Authors:
Meena Jagadeesan,
Michael I. Jordan,
Jacob Steinhardt
Abstract:
Emerging marketplaces for large language models and other large-scale machine learning (ML) models appear to exhibit market concentration, which has raised concerns about whether there are insurmountable barriers to entry in such markets. In this work, we study this issue from both an economic and an algorithmic point of view, focusing on a phenomenon that reduces barriers to entry. Specifically, an incumbent company risks reputational damage unless its model is sufficiently aligned with safety objectives, whereas a new company can more easily avoid reputational damage. To study this issue formally, we define a multi-objective high-dimensional regression framework that captures reputational damage, and we characterize the number of data points that a new company needs to enter the market. Our results demonstrate how multi-objective considerations can fundamentally reduce barriers to entry -- the required number of data points can be significantly smaller than the incumbent company's dataset size. En route to proving these results, we develop scaling laws for high-dimensional linear regression in multi-objective environments, showing that the scaling rate becomes slower when the dataset size is large, which could be of independent interest.
Submitted 5 September, 2024;
originally announced September 2024.
-
Graph-based Modeling and Simulation of Emergency Services Communication Systems
Authors:
Jardi Martinez Jordan,
Michael Stiber
Abstract:
Emergency Services Communication Systems (ESCS) are evolving into Internet Protocol based communication networks, promising enhancements to their function, availability, and resilience. This increase in complexity and cyber-attack surface demands better understanding of these systems' breakdown dynamics under extreme circumstances. Existing ESCS research largely overlooks simulation and the little work that exists focuses primarily on cybersecurity threats and neglects critical factors such as non-stationarity of call arrivals. This paper introduces a robust, adaptable graph-based simulation framework and essential mathematical models for ESCS simulation. The framework uses a representation of ESCSes where each vertex is a communicating finite-state machine that exchanges messages along edges and whose behavior is governed by a discrete event queuing model. Call arrival burstiness and its connection to emergency incidents is modeled through a cluster point process. Model applicability is demonstrated through simulations of the Seattle Police Department ESCS. Ongoing work is developing GPU implementation and exploring use in cybersecurity tabletop exercises.
Submitted 3 September, 2024;
originally announced September 2024.
-
Two-Timescale Gradient Descent Ascent Algorithms for Nonconvex Minimax Optimization
Authors:
Tianyi Lin,
Chi Jin,
Michael. I. Jordan
Abstract:
We provide a unified analysis of two-timescale gradient descent ascent (TTGDA) for solving structured nonconvex minimax optimization problems in the form of $\min_\textbf{x} \max_{\textbf{y} \in Y} f(\textbf{x}, \textbf{y})$, where the objective function $f(\textbf{x}, \textbf{y})$ is nonconvex in $\textbf{x}$ and concave in $\textbf{y}$, and the constraint set $Y \subseteq \mathbb{R}^n$ is convex and bounded. In the convex-concave setting, the single-timescale gradient descent ascent (GDA) algorithm is widely used in applications and has been shown to have strong convergence guarantees. In more general settings, however, it can fail to converge. Our contribution is to design TTGDA algorithms that are effective beyond the convex-concave setting, efficiently finding a stationary point of the function $Φ(\cdot) := \max_{\textbf{y} \in Y} f(\cdot, \textbf{y})$. We also establish theoretical bounds on the complexity of solving both smooth and nonsmooth nonconvex-concave minimax optimization problems. To the best of our knowledge, this is the first systematic analysis of TTGDA for nonconvex minimax optimization, shedding light on its superior performance in training generative adversarial networks (GANs) and in other real-world application problems.
Submitted 27 January, 2025; v1 submitted 21 August, 2024;
originally announced August 2024.
-
Enhancement of Photoresponse for InGaAs Infrared Photodetectors Using Plasmonic WO3-x/CsyWO3-x Nanocrystals
Authors:
Zach D. Merino,
Gyorgy Jaics,
Andrew W. M. Jordan,
Arjun Shetty,
Penghui Yin,
Man C. Tam,
Xinning Wang,
Zbig. R. Wasilewski,
Pavle V. Radovanovic,
Jonathan Baugh
Abstract:
Fast and accurate detection of light in the near-infrared (NIR) spectral range plays a crucial role in modern society, from alleviating speed and capacity bottlenecks in optical communications to enhancing the control and safety of autonomous vehicles through NIR imaging systems. Several technological platforms are currently under investigation to improve NIR photodetection, aiming to surpass the performance of established III-V semiconductor p-i-n (PIN) junction technology. These platforms include in situ-grown inorganic nanocrystals and nanowire arrays, as well as hybrid organic-inorganic materials such as graphene-perovskite heterostructures. However, challenges remain in nanocrystal and nanowire growth, large-area fabrication of high-quality 2D materials, and the fabrication of devices for practical applications. Here, we explore the potential for tailored semiconductor nanocrystals to enhance the responsivity of planar metal-semiconductor-metal (MSM) photodetectors. MSM technology offers ease of fabrication and fast response times compared to PIN detectors. We observe enhancement of the optical-to-electric conversion efficiency by up to a factor of ~2.5 through the application of plasmonically-active semiconductor nanorods and nanocrystals. We present a protocol for synthesizing and rapidly testing the performance of non-stoichiometric tungsten oxide (WO$_{3-x}$) nanorods and cesium-doped tungsten oxide (Cs$_y$WO$_{3-x}$) hexagonal nanoprisms prepared in colloidal suspensions and drop-cast onto photodetector surfaces. The results demonstrate the potential for a cost-effective and scalable method exploiting tailored nanocrystals to improve the performance of NIR optoelectronic devices.
Submitted 26 August, 2024; v1 submitted 19 August, 2024;
originally announced August 2024.
-
Two-dimensional simulations of disks in close binaries: Simulating outburst cycles in cataclysmic variables
Authors:
Lucas M. Jordan,
Dennis Wehner,
Rolf Kuiper
Abstract:
Previous simulations of cataclysmic variables either studied the quiescent state or the outburst state in multiple dimensions, or simulated complete outburst cycles in one dimension using simplified models for the gravitational torques. We self-consistently simulate complete outburst cycles of normal and superoutbursts in cataclysmic variable systems in two dimensions. We study the effect of different $α$ viscosity parameters, mass transfer rates, and binary mass ratios on the disk luminosities, outburst occurrence rates, and superhumps. We simulate non-isothermal, viscous accretion disks in cataclysmic variable systems using a modified version of the FARGO code with an updated equation of state and a cooling function designed to reproduce s-curve behavior. Our simulations can model complete outburst cycles using the thermal tidal instability model. We find higher superhump amplitudes and stronger gravitational torques than previous studies, resulting in better agreement with observations.
Submitted 23 July, 2024;
originally announced July 2024.
-
Unravelling in Collaborative Learning
Authors:
Aymeric Capitaine,
Etienne Boursier,
Antoine Scheid,
Eric Moulines,
Michael I. Jordan,
El-Mahdi El-Mhamdi,
Alain Durmus
Abstract:
Collaborative learning offers a promising avenue for leveraging decentralized data. However, collaboration in groups of strategic learners is not a given. In this work, we consider strategic agents who wish to train a model together but have sampling distributions of different quality. The collaboration is organized by a benevolent aggregator who gathers samples so as to maximize total welfare, but is unaware of data quality. This setting allows us to shed light on the deleterious effect of adverse selection in collaborative learning. More precisely, we demonstrate that when data quality indices are private, the coalition may undergo a phenomenon known as unravelling, wherein it shrinks up to the point that it becomes empty or solely comprised of the worst agent. We show how this issue can be addressed without making use of external transfers, by proposing a novel method inspired by probabilistic verification. This approach makes the grand coalition a Nash equilibrium with high probability despite information asymmetry, thereby breaking unravelling.
Submitted 10 December, 2024; v1 submitted 19 July, 2024;
originally announced July 2024.
-
Learning to Mitigate Externalities: the Coase Theorem with Hindsight Rationality
Authors:
Antoine Scheid,
Aymeric Capitaine,
Etienne Boursier,
Eric Moulines,
Michael I Jordan,
Alain Durmus
Abstract:
In economic theory, the concept of externality refers to any indirect effect resulting from an interaction between players that affects the social welfare. Most of the models within which externality has been studied assume that agents have perfect knowledge of their environment and preferences. This is a major hindrance to the practical implementation of many proposed solutions. To address this issue, we consider a two-player bandit setting where the actions of one of the players affect the other player and we extend the Coase theorem [Coase, 1960]. This result shows that the optimal approach for maximizing the social welfare in the presence of externality is to establish property rights, i.e., enable transfers and bargaining between the players. Our work removes the classical assumption that bargainers possess perfect knowledge of the underlying game. We first demonstrate that in the absence of property rights, the social welfare breaks down. We then design a policy for the players which allows them to learn a bargaining strategy which maximizes the total welfare, recovering the Coase theorem under uncertainty.
Submitted 28 January, 2025; v1 submitted 28 June, 2024;
originally announced June 2024.
-
Automatically Adaptive Conformal Risk Control
Authors:
Vincent Blot,
Anastasios N Angelopoulos,
Michael I Jordan,
Nicolas J-B Brunel
Abstract:
Science and technology have a growing need for effective mechanisms that ensure reliable, controlled performance from black-box machine learning algorithms. These performance guarantees should ideally hold conditionally on the input; that is, the performance guarantees should hold, at least approximately, no matter what the input. However, beyond stylized discrete groupings such as ethnicity and gender, the right notion of conditioning can be difficult to define. For example, in problems such as image segmentation, we want the uncertainty to reflect the intrinsic difficulty of the test sample, but this may be difficult to capture via a conditioning event. Building on the recent work of Gibbs et al. [2023], we propose a methodology for achieving approximate conditional control of statistical risks (the expected value of loss functions) by adapting to the difficulty of test samples. Our framework goes beyond traditional conditional risk control based on user-provided conditioning events to the algorithmic, data-driven determination of appropriate function classes for conditioning. We apply this framework to various regression and segmentation tasks, enabling finer-grained control over model performance and demonstrating that by continuously monitoring and adjusting these parameters, we can achieve superior precision compared to conventional risk-control methods.
Submitted 27 March, 2025; v1 submitted 25 June, 2024;
originally announced June 2024.
-
Position: Benchmarking is Limited in Reinforcement Learning Research
Authors:
Scott M. Jordan,
Adam White,
Bruno Castro da Silva,
Martha White,
Philip S. Thomas
Abstract:
Novel reinforcement learning algorithms, or improvements on existing ones, are commonly justified by evaluating their performance on benchmark environments and are compared to an ever-changing set of standard algorithms. However, despite numerous calls for improvements, experimental practices continue to produce misleading or unsupported claims. One reason for the ongoing substandard practices is that conducting rigorous benchmarking experiments requires substantial computational time. This work investigates the sources of increased computation costs in rigorous experiment designs. We show that conducting rigorous performance benchmarks will likely have computational costs that are often prohibitive. As a result, we argue for using an additional experimentation paradigm to overcome the limitations of benchmarking.
Submitted 23 June, 2024;
originally announced June 2024.
-
Defection-Free Collaboration between Competitors in a Learning System
Authors:
Mariel Werner,
Sai Praneeth Karimireddy,
Michael I. Jordan
Abstract:
We study collaborative learning systems in which the participants are competitors who will defect from the system if they lose revenue by collaborating. As such, we frame the system as a duopoly of competitive firms who are each engaged in training machine-learning models and selling their predictions to a market of consumers. We first examine a fully collaborative scheme in which both firms share their models with each other and show that this leads to a market collapse with the revenues of both firms going to zero. We next show that one-sided collaboration in which only the firm with the lower-quality model shares improves the revenue of both firms. Finally, we propose a more equitable, *defection-free* scheme in which both firms share with each other while losing no revenue, and we show that our algorithm converges to the Nash bargaining solution.
Submitted 22 June, 2024;
originally announced June 2024.
-
DataComp-LM: In search of the next generation of training sets for language models
Authors:
Jeffrey Li,
Alex Fang,
Georgios Smyrnis,
Maor Ivgi,
Matt Jordan,
Samir Gadre,
Hritik Bansal,
Etash Guha,
Sedrick Keh,
Kushal Arora,
Saurabh Garg,
Rui Xin,
Niklas Muennighoff,
Reinhard Heckel,
Jean Mercat,
Mayee Chen,
Suchin Gururangan,
Mitchell Wortsman,
Alon Albalak,
Yonatan Bitton,
Marianna Nezhurina,
Amro Abbas,
Cheng-Yu Hsieh,
Dhruba Ghosh,
Josh Gardner
, et al. (34 additional authors not shown)
Abstract:
We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline, enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-Baseline represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute. Our baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% & 66%), and performs similarly on an average of 53 natural language understanding tasks while being trained with 6.6x less compute than Llama 3 8B. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation.
Submitted 21 April, 2025; v1 submitted 17 June, 2024;
originally announced June 2024.
-
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens
Authors:
Anas Awadalla,
Le Xue,
Oscar Lo,
Manli Shu,
Hannah Lee,
Etash Kumar Guha,
Matt Jordan,
Sheng Shen,
Mohamed Awadalla,
Silvio Savarese,
Caiming Xiong,
Ran Xu,
Yejin Choi,
Ludwig Schmidt
Abstract:
Multimodal interleaved datasets featuring free-form interleaved sequences of images and text are crucial for training frontier large multimodal models (LMMs). Despite the rapid progression of open-source LMMs, there remains a pronounced scarcity of large-scale, diverse open-source multimodal interleaved datasets. In response, we introduce MINT-1T, the most extensive and diverse open-source Multimodal INTerleaved dataset to date. MINT-1T comprises one trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. As scaling multimodal interleaved datasets requires substantial engineering effort, sharing the data curation process and releasing the dataset greatly benefits the community. Our experiments show that LMMs trained on MINT-1T rival the performance of models trained on the previous leading dataset, OBELICS. Our data and code will be released at https://github.com/mlfoundations/MINT-1T.
Submitted 30 October, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
-
Fairness-Aware Meta-Learning via Nash Bargaining
Authors:
Yi Zeng,
Xuelin Yang,
Li Chen,
Cristian Canton Ferrer,
Ming Jin,
Michael I. Jordan,
Ruoxi Jia
Abstract:
To address issues of group-level fairness in machine learning, it is natural to adjust model parameters based on specific fairness objectives over a sensitive-attributed validation set. Such an adjustment procedure can be cast within a meta-learning framework. However, naive integration of fairness goals via meta-learning can cause hypergradient conflicts for subgroups, resulting in unstable convergence and compromising model performance and fairness. To navigate this issue, we frame the resolution of hypergradient conflicts as a multi-player cooperative bargaining game. We introduce a two-stage meta-learning framework in which the first stage involves the use of a Nash Bargaining Solution (NBS) to resolve hypergradient conflicts and steer the model toward the Pareto front, and the second stage optimizes with respect to specific fairness goals. Our method is supported by theoretical results, notably a proof of the NBS for gradient aggregation free from linear independence assumptions, a proof of Pareto improvement, and a proof of monotonic improvement in validation loss. We also show empirical effects across various fairness objectives in six key fairness datasets and two image classification tasks.
Submitted 11 June, 2024;
originally announced June 2024.