Showing 1–23 of 23 results for author: Boix-Adsera, E

  1. arXiv:2505.21825  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    Let Me Think! A Long Chain-of-Thought Can Be Worth Exponentially Many Short Ones

    Authors: Parsa Mirtaheri, Ezra Edelman, Samy Jelassi, Eran Malach, Enric Boix-Adsera

    Abstract: Inference-time computation has emerged as a promising scaling axis for improving large language model reasoning. However, despite yielding impressive performance, the optimal allocation of inference-time computation remains poorly understood. A central question is whether to prioritize sequential scaling (e.g., longer chains of thought) or parallel scaling (e.g., majority voting across multiple sh…

    Submitted 27 May, 2025; originally announced May 2025.

  2. arXiv:2505.06839  [pdf, other]

    cs.LG cs.AI stat.ML

    The power of fine-grained experts: Granularity boosts expressivity in Mixture of Experts

    Authors: Enric Boix-Adsera, Philippe Rigollet

    Abstract: Mixture-of-Experts (MoE) layers are increasingly central to frontier model architectures. By selectively activating parameters, they reduce computational cost while scaling total parameter count. This paper investigates the impact of the number of active experts, termed granularity, comparing architectures with many (e.g., 8 per layer in DeepSeek) to those with fewer (e.g., 1 per layer in Llama-4…

    Submitted 11 May, 2025; originally announced May 2025.

  3. arXiv:2502.03708  [pdf, other]

    cs.CL cs.AI stat.ML

    Toward universal steering and monitoring of AI models

    Authors: Daniel Beaglehole, Adityanarayanan Radhakrishnan, Enric Boix-Adserà, Mikhail Belkin

    Abstract: Modern AI models contain much of human knowledge, yet understanding of their internal representation of this knowledge remains elusive. Characterizing the structure and properties of this representation will lead to improvements in model capabilities and development of effective safeguards. Building on recent advances in feature learning, we develop an effective, scalable approach for extracting l…

    Submitted 28 May, 2025; v1 submitted 5 February, 2025; originally announced February 2025.

  4. arXiv:2501.19149  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    On the inductive bias of infinite-depth ResNets and the bottleneck rank

    Authors: Enric Boix-Adsera

    Abstract: We compute the minimum-norm weights of a deep linear ResNet, and find that the inductive bias of this architecture lies between minimizing nuclear norm and rank. This implies that, with appropriate hyperparameters, deep nonlinear ResNets have an inductive bias towards minimizing bottleneck rank.

    Submitted 31 January, 2025; originally announced January 2025.

    Comments: 10 pages

  5. arXiv:2403.09053  [pdf, other]

    cs.LG cs.AI cs.NE

    Towards a theory of model distillation

    Authors: Enric Boix-Adsera

    Abstract: Distillation is the task of replacing a complicated machine learning model with a simpler model that approximates the original [BCNM06,HVD15]. Despite many practical applications, basic questions about the extent to which models can be distilled, and the runtime and amount of data needed to distill, remain largely open. To study these questions, we initiate a general theory of distillation, defi…

    Submitted 4 May, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

    Comments: 46 pages, 5 figures. Please reach out with comments! Feedback is welcome

  6. arXiv:2311.07064  [pdf, other]

    cs.CL

    Prompts have evil twins

    Authors: Rimon Melamed, Lucas H. McCabe, Tanay Wakhare, Yejin Kim, H. Howie Huang, Enric Boix-Adsera

    Abstract: We discover that many natural-language prompts can be replaced by corresponding prompts that are unintelligible to humans but that provably elicit similar behavior in language models. We call these prompts "evil twins" because they are obfuscated and uninterpretable (evil), but at the same time mimic the functionality of the original natural-language prompts (twins). Remarkably, evil twins transfe…

    Submitted 6 October, 2024; v1 submitted 12 November, 2023; originally announced November 2023.

    Comments: EMNLP 2024 Main, camera-ready

  7. arXiv:2310.09753  [pdf, other]

    cs.CL cs.AI cs.LG

    When can transformers reason with abstract symbols?

    Authors: Enric Boix-Adsera, Omid Saremi, Emmanuel Abbe, Samy Bengio, Etai Littwin, Joshua Susskind

    Abstract: We investigate the capabilities of transformer models on relational reasoning tasks. In these tasks, models are trained on a set of strings encoding abstract relations, and are then tested out-of-distribution on data that contains symbols that did not appear in the training dataset. We prove that for any relational reasoning task in a large family of tasks, transformers learn the abstract relation…

    Submitted 16 April, 2024; v1 submitted 15 October, 2023; originally announced October 2023.

    Comments: 25 figures

  8. arXiv:2306.07042  [pdf, other]

    cs.LG

    Transformers learn through gradual rank increase

    Authors: Enric Boix-Adsera, Etai Littwin, Emmanuel Abbe, Samy Bengio, Joshua Susskind

    Abstract: We identify incremental learning dynamics in transformers, where the difference between trained and initial weights progressively increases in rank. We rigorously prove this occurs under the simplifying assumptions of diagonal weight matrices and small initialization. Our experiments support the theory and also show that the phenomenon can occur in practice without the simplifying assumptions.

    Submitted 10 December, 2023; v1 submitted 12 June, 2023; originally announced June 2023.

    Comments: 39 pages, to appear in NeurIPS 2023

  9. arXiv:2305.13141  [pdf, ps, other]

    cs.LG

    Tight conditions for when the NTK approximation is valid

    Authors: Enric Boix-Adsera, Etai Littwin

    Abstract: We study when the neural tangent kernel (NTK) approximation is valid for training a model with the square loss. In the lazy training setting of Chizat et al. 2019, we show that rescaling the model by a factor of $α= O(T)$ suffices for the NTK approximation to be valid until training time $T$. Our bound is tight and improves on the previous bound of Chizat et al. 2019, which required a larger resca…

    Submitted 5 November, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: Accepted to TMLR. Added proof flowchart

  10. arXiv:2302.11055  [pdf, other]

    cs.LG stat.ML

    SGD learning on neural networks: leap complexity and saddle-to-saddle dynamics

    Authors: Emmanuel Abbe, Enric Boix-Adsera, Theodor Misiakiewicz

    Abstract: We investigate the time complexity of SGD learning on fully-connected neural networks with isotropic data. We put forward a complexity measure -- the leap -- which measures how "hierarchical" target functions are. For $d$-dimensional uniform Boolean or isotropic Gaussian data, our main conjecture states that the time complexity to learn a function $f$ with low-dimensional support is…

    Submitted 31 August, 2023; v1 submitted 21 February, 2023; originally announced February 2023.

  11. arXiv:2210.06545  [pdf, other]

    cs.LG

    GULP: a prediction-based metric between representations

    Authors: Enric Boix-Adsera, Hannah Lawrence, George Stepaniants, Philippe Rigollet

    Abstract: Comparing the representations learned by different neural networks has recently emerged as a key tool to understand various architectures and ultimately optimize them. In this work, we introduce GULP, a family of distance measures between representations that is explicitly motivated by downstream predictive tasks. By construction, GULP provides uniform control over the difference in prediction per…

    Submitted 12 October, 2022; originally announced October 2022.

    Comments: 34 pages, 24 figures, to appear in NeurIPS'22

  12. arXiv:2208.03113  [pdf, ps, other]

    cs.LG

    On the non-universality of deep learning: quantifying the cost of symmetry

    Authors: Emmanuel Abbe, Enric Boix-Adsera

    Abstract: We prove limitations on what neural networks trained by noisy gradient descent (GD) can efficiently learn. Our results apply whenever GD training is equivariant, which holds for many standard architectures and initializations. As applications, (i) we characterize the functions that fully-connected networks can weak-learn on the binary hypercube and unit sphere, demonstrating that depth-2 is as pow…

    Submitted 14 October, 2022; v1 submitted 5 August, 2022; originally announced August 2022.

    Comments: Improved exposition, to appear in NeurIPS'22

  13. arXiv:2202.08658  [pdf, other]

    cs.LG cs.DS stat.ML

    The merged-staircase property: a necessary and nearly sufficient condition for SGD learning of sparse functions on two-layer neural networks

    Authors: Emmanuel Abbe, Enric Boix-Adsera, Theodor Misiakiewicz

    Abstract: It is currently known how to characterize functions that neural networks can learn with SGD for two extremal parameterizations: neural networks in the linear regime, and neural networks with no structural constraints. However, for the main parametrization of interest (non-linear but regular networks) no tight characterization has yet been achieved, despite significant developments. We take a ste…

    Submitted 26 August, 2024; v1 submitted 17 February, 2022; originally announced February 2022.

  14. arXiv:2108.10573  [pdf, other]

    cs.LG cs.DS cs.NE stat.ML

    The staircase property: How hierarchical structure can guide deep learning

    Authors: Emmanuel Abbe, Enric Boix-Adsera, Matthew Brennan, Guy Bresler, Dheeraj Nagaraj

    Abstract: This paper identifies a structural property of data distributions that enables deep neural networks to learn hierarchically. We define the "staircase" property for functions over the Boolean hypercube, which posits that high-order Fourier coefficients are reachable from lower-order Fourier coefficients along increasing chains. We prove that functions satisfying this property can be learned in poly…

    Submitted 23 November, 2021; v1 submitted 24 August, 2021; originally announced August 2021.

    Comments: 60 pages, accepted to NeurIPS '21

  15. arXiv:2106.03969  [pdf, other]

    cs.LG cs.DS cs.IT math.ST

    Chow-Liu++: Optimal Prediction-Centric Learning of Tree Ising Models

    Authors: Enric Boix-Adsera, Guy Bresler, Frederic Koehler

    Abstract: We consider the problem of learning a tree-structured Ising model from data, such that subsequent predictions computed using the model are accurate. Concretely, we aim to learn a model such that posteriors $P(X_i|X_S)$ for small sets of variables $S$ are accurate. Since its introduction more than 50 years ago, the Chow-Liu algorithm, which efficiently computes the maximum likelihood tree, has been…

    Submitted 23 November, 2021; v1 submitted 7 June, 2021; originally announced June 2021.

    Comments: 49 pages, 3 figures, to appear in FOCS'21

  16. arXiv:2101.01100  [pdf, other]

    math.OC cs.CC cs.DS cs.LG

    Wasserstein barycenters are NP-hard to compute

    Authors: Jason M. Altschuler, Enric Boix-Adsera

    Abstract: Computing Wasserstein barycenters (a.k.a. Optimal Transport barycenters) is a fundamental problem in geometry which has recently attracted considerable attention due to many applications in data science. While there exist polynomial-time algorithms in any fixed dimension, all known running times suffer exponentially in the dimension. It is an open question whether this exponential dependence is im…

    Submitted 3 December, 2021; v1 submitted 4 January, 2021; originally announced January 2021.

    Comments: to appear in SIAM Journal on Mathematics of Data Science (SIMODS)

    Journal ref: SIAM Journal on Mathematics of Data Science, 4(1), 179-203, 2022

  17. arXiv:2012.05398  [pdf, ps, other]

    math.OC cs.CC cs.DS cs.LG

    Hardness results for Multimarginal Optimal Transport problems

    Authors: Jason M. Altschuler, Enric Boix-Adsera

    Abstract: Multimarginal Optimal Transport (MOT) is the problem of linear programming over joint probability distributions with fixed marginals. A key issue in many applications is the complexity of solving MOT: the linear program has exponential size in the number of marginals k and their support sizes n. A recent line of work has shown that MOT is poly(n,k)-time solvable for certain families of costs that…

    Submitted 9 December, 2020; originally announced December 2020.

    Comments: For expository purposes, some of these results were moved from v1 of arXiv 2008.03006. The current drafts of these papers have no overlapping results. arXiv admin note: text overlap with arXiv:2008.03006

    Journal ref: Discrete Optimization, 42, 100669, 2021. (21 pages)

  18. arXiv:2008.03006  [pdf, other]

    math.OC cs.DS cs.LG math.NA

    Polynomial-time algorithms for Multimarginal Optimal Transport problems with structure

    Authors: Jason M. Altschuler, Enric Boix-Adsera

    Abstract: Multimarginal Optimal Transport (MOT) has attracted significant interest due to applications in machine learning, statistics, and the sciences. However, in most applications, the success of MOT is severely limited by a lack of efficient algorithms. Indeed, MOT in general requires exponential time in the number of marginals k and their support sizes n. This paper develops a general theory about wha…

    Submitted 16 July, 2022; v1 submitted 7 August, 2020; originally announced August 2020.

    Comments: v4: to appear in Mathematical Programming, improved exposition and refs, no changes to technical results

  19. arXiv:2006.08012  [pdf, other]

    math.OC cs.CG cs.DS cs.LG

    Wasserstein barycenters can be computed in polynomial time in fixed dimension

    Authors: Jason M. Altschuler, Enric Boix-Adsera

    Abstract: Computing Wasserstein barycenters is a fundamental geometric problem with widespread applications in machine learning, statistics, and computer graphics. However, it is unknown whether Wasserstein barycenters can be computed in polynomial time, either exactly or to high precision (i.e., with $\textrm{polylog}(1/\varepsilon)$ runtime dependence). This paper answers these questions in the affirmativ…

    Submitted 9 December, 2020; v1 submitted 14 June, 2020; originally announced June 2020.

    Comments: 15 pages + refs, 5 figs. Improved exposition. Title has been updated for clarity

    Journal ref: Journal of Machine Learning Research (JMLR), 22, 1-19, 2021

  20. arXiv:2002.05240  [pdf, other]

    cs.GT econ.TH

    The Multiplayer Colonel Blotto Game

    Authors: Enric Boix-Adserà, Benjamin L. Edelman, Siddhartha Jayanti

    Abstract: We initiate the study of the natural multiplayer generalization of the classic continuous Colonel Blotto game. The two-player Blotto game, introduced by Borel as a model of resource competition across $n$ simultaneous fronts, has been studied extensively for a century and seen numerous applications throughout the social sciences. Our work defines the multiplayer Colonel Blotto game and derives Nas…

    Submitted 21 May, 2021; v1 submitted 12 February, 2020; originally announced February 2020.

    Comments: 24 pages; minor additions to introduction

  21. arXiv:1912.03824  [pdf, ps, other]

    cs.DS cs.CC cs.DC

    Approximating the Determinant of Well-Conditioned Matrices by Shallow Circuits

    Authors: Enric Boix-Adserà, Lior Eldar, Saeed Mehraban

    Abstract: The determinant can be computed by classical circuits of depth $O(\log^2 n)$, and therefore it can also be computed in classical space $O(\log^2 n)$. Recent progress by Ta-Shma [Ta13] implies a method to approximate the determinant of Hermitian matrices with condition number $κ$ in quantum space $O(\log n + \log κ)$. However, it is not known how to perform the task in less than $O(\log^2 n)$ space…

    Submitted 8 December, 2019; originally announced December 2019.

    Comments: 24 pages

  22. arXiv:1903.08247  [pdf, ps, other]

    cs.CC cs.DS math.CO math.PR

    The Average-Case Complexity of Counting Cliques in Erdos-Renyi Hypergraphs

    Authors: Enric Boix-Adserà, Matthew Brennan, Guy Bresler

    Abstract: We consider the problem of counting $k$-cliques in $s$-uniform Erdos-Renyi hypergraphs $G(n,c,s)$ with edge density $c$, and show that its fine-grained average-case complexity can be based on its worst-case complexity. We prove the following: 1. Dense Erdos-Renyi graphs and hypergraphs: Counting $k$-cliques on $G(n,c,s)$ with $k$ and $c$ constant matches its worst-case time complexity up to a…

    Submitted 21 July, 2021; v1 submitted 19 March, 2019; originally announced March 2019.

    Comments: 44 pages, 2 figures, appeared in FOCS'19, accepted to SICOMP special edition

  23. arXiv:1902.02431  [pdf, ps, other]

    math.PR cs.IT

    Subadditivity Beyond Trees and the Chi-Squared Mutual Information

    Authors: Emmanuel Abbe, Enric Boix-Adserà

    Abstract: In 2000, Evans et al. [Eva+00] proved the subadditivity of the mutual information in the broadcasting on tree model with binary vertex labels and symmetric channels. They raised the question of whether such subadditivity extends to loopy graphs in some appropriate way. We recently proposed such an extension that applies to general graphs and binary vertex labels [AB18], using synchronization model…

    Submitted 6 February, 2019; originally announced February 2019.

    Comments: 16 pages