
Let Me Think! A Long Chain-of-Thought Can Be Worth Exponentially Many Short Ones

Authors: Parsa Mirtaheri, Ezra Edelman, Samy Jelassi, Eran Malach, Enric Boix-Adsera

Abstract: Inference-time computation has emerged as a promising scaling axis for improving large language model reasoning. However, despite yielding impressive performance, the optimal allocation of inference-time computation remains poorly understood. A central question is whether to prioritize sequential scaling (e.g., longer chains of thought) or parallel scaling (e.g., majority voting across multiple short chains of thought). In this work, we seek to illuminate the landscape of test-time scaling by demonstrating the existence of reasoning settings where sequential scaling offers an exponential advantage over parallel scaling. These settings are based on graph connectivity problems in challenging distributions of graphs. We validate our theoretical findings with comprehensive experiments across a range of language models, including models trained from scratch for graph connectivity with different chain of thought strategies as well as large reasoning models.

Submitted 27 May, 2025; originally announced May 2025.

arXiv:2505.06839 [pdf, other]

The power of fine-grained experts: Granularity boosts expressivity in Mixture of Experts

Authors: Enric Boix-Adsera, Philippe Rigollet

Abstract: Mixture-of-Experts (MoE) layers are increasingly central to frontier model architectures. By selectively activating parameters, they reduce computational cost while scaling total parameter count. This paper investigates the impact of the number of active experts, termed granularity, comparing architectures with many (e.g., 8 per layer in DeepSeek) to those with fewer (e.g., 1 per layer in Llama-4 models). We prove an exponential separation in network expressivity based on this design parameter, suggesting that models benefit from higher granularity. Experimental results corroborate our theoretical findings and illustrate this separation.

Submitted 11 May, 2025; originally announced May 2025.

arXiv:2502.03708 [pdf, other]

Toward universal steering and monitoring of AI models

Authors: Daniel Beaglehole, Adityanarayanan Radhakrishnan, Enric Boix-Adserà, Mikhail Belkin

Abstract: Modern AI models contain much of human knowledge, yet understanding of their internal representation of this knowledge remains elusive. Characterizing the structure and properties of this representation will lead to improvements in model capabilities and development of effective safeguards. Building on recent advances in feature learning, we develop an effective, scalable approach for extracting linear representations of general concepts in large-scale AI models (language models, vision-language models, and reasoning models). We show how these representations enable model steering, through which we expose vulnerabilities, mitigate misaligned behaviors, and improve model capabilities. Additionally, we demonstrate that concept representations are remarkably transferable across human languages and combinable to enable multi-concept steering. Through quantitative analysis across hundreds of concepts, we find that newer, larger models are more steerable and steering can improve model capabilities beyond standard prompting. We show how concept representations are effective for monitoring misaligned content (hallucinations, toxic content). We demonstrate that predictive models built using concept representations are more accurate for monitoring misaligned content than using models that judge outputs directly. Together, our results illustrate the power of using internal representations to map the knowledge in AI models, advance AI safety, and improve model capabilities.

Submitted 28 May, 2025; v1 submitted 5 February, 2025; originally announced February 2025.

arXiv:2501.19149 [pdf, ps, other]

On the inductive bias of infinite-depth ResNets and the bottleneck rank

Authors: Enric Boix-Adsera

Abstract: We compute the minimum-norm weights of a deep linear ResNet, and find that the inductive bias of this architecture lies between minimizing nuclear norm and rank. This implies that, with appropriate hyperparameters, deep nonlinear ResNets have an inductive bias towards minimizing bottleneck rank.

Submitted 31 January, 2025; originally announced January 2025.

Comments: 10 pages

arXiv:2403.09053 [pdf, other]

Towards a theory of model distillation

Authors: Enric Boix-Adsera

Abstract: Distillation is the task of replacing a complicated machine learning model with a simpler model that approximates the original [BCNM06,HVD15]. Despite many practical applications, basic questions about the extent to which models can be distilled, and the runtime and amount of data needed to distill, remain largely open. To study these questions, we initiate a general theory of distillation, defining PAC-distillation in an analogous way to PAC-learning [Val84]. As applications of this theory: (1) we propose new algorithms to extract the knowledge stored in the trained weights of neural networks -- we show how to efficiently distill neural networks into succinct, explicit decision tree representations when possible by using the "linear representation hypothesis"; and (2) we prove that distillation can be much cheaper than learning from scratch, and make progress on characterizing its complexity.

Submitted 4 May, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

Comments: 46 pages, 5 figures. Please reach out with comments! Feedback is welcome

arXiv:2311.07064 [pdf, other]

Prompts have evil twins

Authors: Rimon Melamed, Lucas H. McCabe, Tanay Wakhare, Yejin Kim, H. Howie Huang, Enric Boix-Adsera

Abstract: We discover that many natural-language prompts can be replaced by corresponding prompts that are unintelligible to humans but that provably elicit similar behavior in language models. We call these prompts "evil twins" because they are obfuscated and uninterpretable (evil), but at the same time mimic the functionality of the original natural-language prompts (twins). Remarkably, evil twins transfer between models. We find these prompts by solving a maximum-likelihood problem which has applications of independent interest.

Submitted 6 October, 2024; v1 submitted 12 November, 2023; originally announced November 2023.

Comments: EMNLP 2024 Main, camera-ready

arXiv:2310.09753 [pdf, other]

When can transformers reason with abstract symbols?

Authors: Enric Boix-Adsera, Omid Saremi, Emmanuel Abbe, Samy Bengio, Etai Littwin, Joshua Susskind

Abstract: We investigate the capabilities of transformer models on relational reasoning tasks. In these tasks, models are trained on a set of strings encoding abstract relations, and are then tested out-of-distribution on data that contains symbols that did not appear in the training dataset. We prove that for any relational reasoning task in a large family of tasks, transformers learn the abstract relations and generalize to the test set when trained by gradient descent on sufficiently large quantities of training data. This is in contrast to classical fully-connected networks, which we prove fail to learn to reason. Our results inspire modifications of the transformer architecture that add only two trainable parameters per head, and that we empirically demonstrate improve data efficiency for learning to reason.

Submitted 16 April, 2024; v1 submitted 15 October, 2023; originally announced October 2023.

Comments: 25 figures

arXiv:2306.07042 [pdf, other]

Transformers learn through gradual rank increase

Authors: Enric Boix-Adsera, Etai Littwin, Emmanuel Abbe, Samy Bengio, Joshua Susskind

Abstract: We identify incremental learning dynamics in transformers, where the difference between trained and initial weights progressively increases in rank. We rigorously prove this occurs under the simplifying assumptions of diagonal weight matrices and small initialization. Our experiments support the theory and also show that the phenomenon can occur in practice without the simplifying assumptions.

Submitted 10 December, 2023; v1 submitted 12 June, 2023; originally announced June 2023.

Comments: 39 pages, to appear in NeurIPS 2023

arXiv:2305.13141 [pdf, ps, other]

Tight conditions for when the NTK approximation is valid

Authors: Enric Boix-Adsera, Etai Littwin

Abstract: We study when the neural tangent kernel (NTK) approximation is valid for training a model with the square loss. In the lazy training setting of Chizat et al. 2019, we show that rescaling the model by a factor of $\alpha = O(T)$ suffices for the NTK approximation to be valid until training time $T$. Our bound is tight and improves on the previous bound of Chizat et al. 2019, which required a larger rescaling factor of $\alpha = O(T^2)$.

Submitted 5 November, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

Comments: Accepted to TMLR. Added proof flowchart

arXiv:2302.11055 [pdf, other]

SGD learning on neural networks: leap complexity and saddle-to-saddle dynamics

Authors: Emmanuel Abbe, Enric Boix-Adsera, Theodor Misiakiewicz

Abstract: We investigate the time complexity of SGD learning on fully-connected neural networks with isotropic data. We put forward a complexity measure -- the leap -- which measures how "hierarchical" target functions are. For $d$-dimensional uniform Boolean or isotropic Gaussian data, our main conjecture states that the time complexity to learn a function $f$ with low-dimensional support is $\tilde{\Theta}(d^{\max(\mathrm{Leap}(f),2)})$. We prove a version of this conjecture for a class of functions on Gaussian isotropic data and 2-layer neural networks, under additional technical assumptions on how SGD is run. We show that the training sequentially learns the function support with a saddle-to-saddle dynamic. Our result departs from [Abbe et al. 2022] by going beyond leap 1 (merged-staircase functions), and by going beyond the mean-field and gradient flow approximations that prohibit the full complexity control obtained here. Finally, we note that this gives an SGD complexity for the full training trajectory that matches that of Correlational Statistical Query (CSQ) lower-bounds.

Submitted 31 August, 2023; v1 submitted 21 February, 2023; originally announced February 2023.

arXiv:2210.06545 [pdf, other]

GULP: a prediction-based metric between representations

Authors: Enric Boix-Adsera, Hannah Lawrence, George Stepaniants, Philippe Rigollet

Abstract: Comparing the representations learned by different neural networks has recently emerged as a key tool to understand various architectures and ultimately optimize them. In this work, we introduce GULP, a family of distance measures between representations that is explicitly motivated by downstream predictive tasks. By construction, GULP provides uniform control over the difference in prediction performance between two representations, with respect to regularized linear prediction tasks. Moreover, it satisfies several desirable structural properties, such as the triangle inequality and invariance under orthogonal transformations, and thus lends itself to data embedding and visualization. We extensively evaluate GULP relative to other methods, and demonstrate that it correctly differentiates between architecture families, converges over the course of training, and captures generalization performance on downstream linear tasks.

Submitted 12 October, 2022; originally announced October 2022.

Comments: 34 pages, 24 figures, to appear in NeurIPS'22

arXiv:2208.03113 [pdf, ps, other]

On the non-universality of deep learning: quantifying the cost of symmetry

Authors: Emmanuel Abbe, Enric Boix-Adsera

Abstract: We prove limitations on what neural networks trained by noisy gradient descent (GD) can efficiently learn. Our results apply whenever GD training is equivariant, which holds for many standard architectures and initializations. As applications, (i) we characterize the functions that fully-connected networks can weak-learn on the binary hypercube and unit sphere, demonstrating that depth-2 is as powerful as any other depth for this task; (ii) we extend the merged-staircase necessity result for learning with latent low-dimensional structure [ABM22] to beyond the mean-field regime. Under cryptographic assumptions, we also show hardness results for learning with fully-connected networks trained by stochastic gradient descent (SGD).

Submitted 14 October, 2022; v1 submitted 5 August, 2022; originally announced August 2022.

Comments: Improved exposition, to appear in NeurIPS'22

arXiv:2202.08658 [pdf, other]

The merged-staircase property: a necessary and nearly sufficient condition for SGD learning of sparse functions on two-layer neural networks

Authors: Emmanuel Abbe, Enric Boix-Adsera, Theodor Misiakiewicz

Abstract: It is currently known how to characterize functions that neural networks can learn with SGD for two extremal parameterizations: neural networks in the linear regime, and neural networks with no structural constraints. However, for the main parametrization of interest (non-linear but regular networks) no tight characterization has yet been achieved, despite significant developments. We take a step in this direction by considering depth-2 neural networks trained by SGD in the mean-field regime. We consider functions on binary inputs that depend on a latent low-dimensional subspace (i.e., small number of coordinates). This regime is of interest since it is poorly understood how neural networks routinely tackle high-dimensional datasets and adapt to latent low-dimensional structure without suffering from the curse of dimensionality. Accordingly, we study SGD-learnability with $O(d)$ sample complexity in a large ambient dimension $d$. Our main results characterize a hierarchical property, the "merged-staircase property", that is both necessary and nearly sufficient for learning in this setting. We further show that non-linear training is necessary: for this class of functions, linear methods on any feature map (e.g., the NTK) are not capable of learning efficiently. The key tools are a new "dimension-free" dynamics approximation result that applies to functions defined on a latent space of low-dimension, a proof of global convergence based on polynomial identity testing, and an improvement of lower bounds against linear methods for non-almost orthogonal functions.

Submitted 26 August, 2024; v1 submitted 17 February, 2022; originally announced February 2022.

arXiv:2108.10573 [pdf, other]

The staircase property: How hierarchical structure can guide deep learning

Authors: Emmanuel Abbe, Enric Boix-Adsera, Matthew Brennan, Guy Bresler, Dheeraj Nagaraj

Abstract: This paper identifies a structural property of data distributions that enables deep neural networks to learn hierarchically. We define the "staircase" property for functions over the Boolean hypercube, which posits that high-order Fourier coefficients are reachable from lower-order Fourier coefficients along increasing chains. We prove that functions satisfying this property can be learned in polynomial time using layerwise stochastic coordinate descent on regular neural networks -- a class of network architectures and initializations that have homogeneity properties. Our analysis shows that for such staircase functions and neural networks, the gradient-based algorithm learns high-level features by greedily combining lower-level features along the depth of the network. We further back our theoretical results with experiments showing that staircase functions are also learnable by more standard ResNet architectures with stochastic gradient descent. Both the theoretical and experimental results support the fact that staircase properties have a role to play in understanding the capabilities of gradient-based learning on regular networks, in contrast to general polynomial-size networks that can emulate any SQ or PAC algorithms as recently shown.

Submitted 23 November, 2021; v1 submitted 24 August, 2021; originally announced August 2021.

Comments: 60 pages, accepted to NeurIPS '21

arXiv:2106.03969 [pdf, other]

Chow-Liu++: Optimal Prediction-Centric Learning of Tree Ising Models

Authors: Enric Boix-Adsera, Guy Bresler, Frederic Koehler

Abstract: We consider the problem of learning a tree-structured Ising model from data, such that subsequent predictions computed using the model are accurate. Concretely, we aim to learn a model such that posteriors $P(X_i|X_S)$ for small sets of variables $S$ are accurate. Since its introduction more than 50 years ago, the Chow-Liu algorithm, which efficiently computes the maximum likelihood tree, has been the benchmark algorithm for learning tree-structured graphical models. A bound on the sample complexity of the Chow-Liu algorithm with respect to the prediction-centric local total variation loss was shown in [BK19]. While those results demonstrated that it is possible to learn a useful model even when recovering the true underlying graph is impossible, their bound depends on the maximum strength of interactions and thus does not achieve the information-theoretic optimum. In this paper, we introduce a new algorithm that carefully combines elements of the Chow-Liu algorithm with tree metric reconstruction methods to efficiently and optimally learn tree Ising models under a prediction-centric loss. Our algorithm is robust to model misspecification and adversarial corruptions. In contrast, we show that the celebrated Chow-Liu algorithm can be arbitrarily suboptimal.

Submitted 23 November, 2021; v1 submitted 7 June, 2021; originally announced June 2021.

Comments: 49 pages, 3 figures, to appear in FOCS'21

arXiv:2101.01100 [pdf, other]

doi 10.1137/21M1390062

Wasserstein barycenters are NP-hard to compute

Authors: Jason M. Altschuler, Enric Boix-Adsera

Abstract: Computing Wasserstein barycenters (a.k.a. Optimal Transport barycenters) is a fundamental problem in geometry which has recently attracted considerable attention due to many applications in data science. While there exist polynomial-time algorithms in any fixed dimension, all known running times suffer exponentially in the dimension. It is an open question whether this exponential dependence is improvable to a polynomial dependence. This paper proves that unless P=NP, the answer is no. This uncovers a "curse of dimensionality" for Wasserstein barycenter computation which does not occur for Optimal Transport computation. Moreover, our hardness results for computing Wasserstein barycenters extend to approximate computation, to seemingly simple cases of the problem, and to averaging probability distributions in other Optimal Transport metrics.

Submitted 3 December, 2021; v1 submitted 4 January, 2021; originally announced January 2021.

Comments: to appear in SIAM Journal on Mathematics of Data Science (SIMODS)

Journal ref: SIAM Journal on Mathematics of Data Science, 4(1), 179-203, 2022

arXiv:2012.05398 [pdf, ps, other]

doi 10.1016/j.disopt.2021.100669

Hardness results for Multimarginal Optimal Transport problems

Authors: Jason M. Altschuler, Enric Boix-Adsera

Abstract: Multimarginal Optimal Transport (MOT) is the problem of linear programming over joint probability distributions with fixed marginals. A key issue in many applications is the complexity of solving MOT: the linear program has exponential size in the number of marginals k and their support sizes n. A recent line of work has shown that MOT is poly(n,k)-time solvable for certain families of costs that have poly(n,k)-size implicit representations. However, it is unclear what further families of costs this line of algorithmic research can encompass. In order to understand these fundamental limitations, this paper initiates the study of intractability results for MOT. Our main technical contribution is developing a toolkit for proving NP-hardness and inapproximability results for MOT problems. We demonstrate this toolkit by using it to establish the intractability of a number of MOT problems studied in the literature that have resisted previous algorithmic efforts. For instance, we provide evidence that repulsive costs make MOT intractable by showing that several such problems of interest are NP-hard to solve--even approximately.

Submitted 9 December, 2020; originally announced December 2020.

Comments: For expository purposes, some of these results were moved from v1 of arXiv 2008.03006. The current drafts of these papers have no overlapping results. arXiv admin note: text overlap with arXiv:2008.03006

Journal ref: Discrete Optimization, 42, 100669, 2021. (21 pages)

arXiv:2008.03006 [pdf, other]

Polynomial-time algorithms for Multimarginal Optimal Transport problems with structure

Authors: Jason M. Altschuler, Enric Boix-Adsera

Abstract: Multimarginal Optimal Transport (MOT) has attracted significant interest due to applications in machine learning, statistics, and the sciences. However, in most applications, the success of MOT is severely limited by a lack of efficient algorithms. Indeed, MOT in general requires exponential time in the number of marginals k and their support sizes n. This paper develops a general theory about what "structure" makes MOT solvable in poly(n,k) time. We develop a unified algorithmic framework for solving MOT in poly(n,k) time by characterizing the "structure" that different algorithms require in terms of simple variants of the dual feasibility oracle. This framework has several benefits. First, it enables us to show that the Sinkhorn algorithm, which is currently the most popular MOT algorithm, requires strictly more structure than other algorithms do to solve MOT in poly(n,k) time. Second, our framework makes it much simpler to develop poly(n,k) time algorithms for a given MOT problem. In particular, it is necessary and sufficient to (approximately) solve the dual feasibility oracle -- which is much more amenable to standard algorithmic techniques. We illustrate this ease-of-use by developing poly(n,k) time algorithms for three general classes of MOT cost structures: (1) graphical structure; (2) set-optimization structure; and (3) low-rank plus sparse structure. For structure (1), we recover the known result that Sinkhorn has poly(n,k) runtime; moreover, we provide the first poly(n,k) time algorithms for computing solutions that are exact and sparse. For structures (2)-(3), we give the first poly(n,k) time algorithms, even for approximate computation. Together, these three structures encompass many -- if not most -- current applications of MOT.

Submitted 16 July, 2022; v1 submitted 7 August, 2020; originally announced August 2020.

Comments: v4: to appear in Mathematical Programming, improved exposition and refs, no changes to technical results

arXiv:2006.08012 [pdf, other]

Wasserstein barycenters can be computed in polynomial time in fixed dimension

Authors: Jason M. Altschuler, Enric Boix-Adsera

Abstract: Computing Wasserstein barycenters is a fundamental geometric problem with widespread applications in machine learning, statistics, and computer graphics. However, it is unknown whether Wasserstein barycenters can be computed in polynomial time, either exactly or to high precision (i.e., with $\textrm{polylog}(1/\varepsilon)$ runtime dependence). This paper answers these questions in the affirmative for any fixed dimension. Our approach is to solve an exponential-size linear programming formulation by efficiently implementing the corresponding separation oracle using techniques from computational geometry.

Submitted 9 December, 2020; v1 submitted 14 June, 2020; originally announced June 2020.

Comments: 15 pages + refs, 5 figs. Improved exposition. Title has been updated for clarity

Journal ref: Journal of Machine Learning Research (JMLR), 22, 1-19, 2021

arXiv:2002.05240 [pdf, other]

The Multiplayer Colonel Blotto Game

Authors: Enric Boix-Adserà, Benjamin L. Edelman, Siddhartha Jayanti

Abstract: We initiate the study of the natural multiplayer generalization of the classic continuous Colonel Blotto game. The two-player Blotto game, introduced by Borel as a model of resource competition across $n$ simultaneous fronts, has been studied extensively for a century and seen numerous applications throughout the social sciences. Our work defines the multiplayer Colonel Blotto game and derives Nash equilibria for various settings of $k$ (number of players) and $n$. We also introduce a "Boolean" version of Blotto that becomes interesting in the multiplayer setting. The main technical difficulty of our work, as in the two-player theoretical literature, is the challenge of coupling various marginal distributions into a joint distribution satisfying a strict sum constraint. In contrast to previous works in the continuous setting, we derive our couplings algorithmically in the form of efficient sampling algorithms.

Submitted 21 May, 2021; v1 submitted 12 February, 2020; originally announced February 2020.

Comments: 24 pages; minor additions to introduction

arXiv:1912.03824 [pdf, ps, other]

Approximating the Determinant of Well-Conditioned Matrices by Shallow Circuits

Authors: Enric Boix-Adserà, Lior Eldar, Saeed Mehraban

Abstract: The determinant can be computed by classical circuits of depth $O(\log^2 n)$, and therefore it can also be computed in classical space $O(\log^2 n)$. Recent progress by Ta-Shma [Ta13] implies a method to approximate the determinant of Hermitian matrices with condition number $κ$ in quantum space $O(\log n + \log κ)$. However, it is not known how to perform the task in less than $O(\log^2 n)$ space using classical resources only. In this work, we show that the condition number of a matrix implies an upper bound on the depth complexity (and therefore also on the space complexity) for this task: the determinant of Hermitian matrices with condition number $κ$ can be approximated to inverse polynomial relative error with classical circuits of depth $\tilde O(\log n \cdot \log κ)$, and in particular one can approximate the determinant for sufficiently well-conditioned matrices in depth $\tilde{O}(\log n)$. Our algorithm combines Barvinok's recent complex-analytic approach for approximating combinatorial counting problems [Bar16] with the Valiant-Berkowitz-Skyum-Rackoff depth-reduction theorem for low-degree arithmetic circuits [Val83].

Submitted 8 December, 2019; originally announced December 2019.

Comments: 24 pages

arXiv:1903.08247 [pdf, ps, other]

The Average-Case Complexity of Counting Cliques in Erdos-Renyi Hypergraphs

Authors: Enric Boix-Adserà, Matthew Brennan, Guy Bresler

Abstract: We consider the problem of counting $k$-cliques in $s$-uniform Erdos-Renyi hypergraphs $G(n,c,s)$ with edge density $c$, and show that its fine-grained average-case complexity can be based on its worst-case complexity. We prove the following: 1. Dense Erdos-Renyi graphs and hypergraphs: Counting $k$-cliques on $G(n,c,s)$ with $k$ and $c$ constant matches its worst-case time complexity up to a $\mathrm{polylog}(n)$ factor. Assuming randomized ETH, it takes $n^{Ω(k)}$ time to count $k$-cliques in $G(n,c,s)$ if $k$ and $c$ are constant. 2. Sparse Erdos-Renyi graphs and hypergraphs: When $c = Θ(n^{-α})$, we give several algorithms exploiting the sparsity of $G(n, c, s)$ that are faster than the best known worst-case algorithms. Complementing this, based on a fine-grained worst-case assumption, our results imply a different average-case phase diagram for each fixed $α$ depicting a tradeoff between a runtime lower bound and $k$. Surprisingly, in the hypergraph case ($s \ge 3$), these lower bounds are tight against our algorithms exactly when $c$ is above the Erdős-Rényi $k$-clique percolation threshold. This is the first worst-case-to-average-case hardness reduction for a problem on Erdős-Rényi hypergraphs that we are aware of. We also give a variant of our result for computing the parity of the $k$-clique count that tolerates higher error probability.

Submitted 21 July, 2021; v1 submitted 19 March, 2019; originally announced March 2019.

Comments: 44 pages, 2 figures, appeared in FOCS'19, accepted to SICOMP special edition

arXiv:1902.02431 [pdf, ps, other]

Subadditivity Beyond Trees and the Chi-Squared Mutual Information

Authors: Emmanuel Abbe, Enric Boix-Adserà

Abstract: In 2000, Evans et al. [Eva+00] proved the subadditivity of the mutual information in the broadcasting on tree model with binary vertex labels and symmetric channels. They raised the question of whether such subadditivity extends to loopy graphs in some appropriate way. We recently proposed such an extension that applies to general graphs and binary vertex labels [AB18], using synchronization models and relying on percolation bounds. This extension, however, requires the edge channels to be symmetric on the product of the adjacent spins. A more general version of such a percolation bound that applies to asymmetric channels is also obtained in [PW18], relying on the SDPI, but the subadditivity property does not follow with such generalizations. In this note, we provide a new result showing that the subadditivity property still holds for arbitrary (asymmetric) channels acting on the product of spins, when the graphs are restricted to be series-parallel. The proof relies on the use of the Chi-squared mutual information rather than the classical mutual information, and various properties of the former are discussed. We also present a generalization of the broadcasting on tree model (the synchronization on tree) where the bound from [PW18] relying on the SDPI can be significantly looser than the bound resulting from the Chi-squared subadditivity property presented here.

Submitted 6 February, 2019; originally announced February 2019.

Comments: 16 pages

Showing 1–23 of 23 results for author: Boix-Adsera, E