-
Matrix Chaos Inequalities and Chaos of Combinatorial Type
Authors:
Afonso S. Bandeira,
Kevin Lucca,
Petar Nizić-Nikolac,
Ramon van Handel
Abstract:
Matrix concentration inequalities and their recently discovered sharp counterparts provide powerful tools to bound the spectrum of random matrices whose entries are linear functions of independent random variables. However, in many applications in theoretical computer science and in other areas one encounters more general random matrix models, called matrix chaoses, whose entries are polynomials of independent random variables. Such models have often been studied on a case-by-case basis using ad-hoc methods that can yield suboptimal dimensional factors.
In this paper we provide general matrix concentration inequalities for matrix chaoses, which enable the treatment of such models in a systematic manner. These inequalities are expressed in terms of flattenings of the coefficients of the matrix chaos. We further identify a special family of matrix chaoses of combinatorial type for which the flattening parameters can be computed mechanically by a simple rule. This allows us to provide a unified treatment of and improved bounds for matrix chaoses that arise in a variety of applications, including graph matrices, Khatri-Rao matrices, and matrices that arise in average case analysis of the sum-of-squares hierarchy.
Submitted 31 March, 2025; v1 submitted 24 December, 2024;
originally announced December 2024.
-
A Theory of Universal Learning
Authors:
Olivier Bousquet,
Steve Hanneke,
Shay Moran,
Ramon van Handel,
Amir Yehudayoff
Abstract:
How quickly can a given class of concepts be learned from examples? It is common to measure the performance of a supervised machine learning algorithm by plotting its "learning curve", that is, the decay of the error rate as a function of the number of training examples. However, the classical theoretical framework for understanding learnability, the PAC model of Vapnik-Chervonenkis and Valiant, does not explain the behavior of learning curves: the distribution-free PAC model of learning can only bound the upper envelope of the learning curves over all possible data distributions. This does not match the practice of machine learning, where the data source is typically fixed in any given scenario, while the learner may choose the number of training examples on the basis of factors such as computational resources and desired accuracy.
In this paper, we study an alternative learning model that better captures such practical aspects of machine learning, but still gives rise to a complete theory of the learnable in the spirit of the PAC model. More precisely, we consider the problem of universal learning, which aims to understand the performance of learning algorithms on every data distribution, but without requiring uniformity over the distribution. The main result of this paper is a remarkable trichotomy: there are only three possible rates of universal learning. More precisely, we show that the learning curves of any given concept class decay at either an exponential, a linear, or an arbitrarily slow rate. Moreover, each of these cases is completely characterized by appropriate combinatorial parameters, and we exhibit optimal learning algorithms that achieve the best possible rate in each case.
For concreteness, we consider in this paper only the realizable case, though analogous results are expected to extend to more general learning scenarios.
Submitted 9 November, 2020;
originally announced November 2020.
-
Second-Order Converses via Reverse Hypercontractivity
Authors:
Jingbo Liu,
Ramon van Handel,
Sergio Verdú
Abstract:
A strong converse shows that no procedure can beat the asymptotic (as blocklength $n\to\infty$) fundamental limit of a given information-theoretic problem for any fixed error probability. A second-order converse strengthens this conclusion by showing that the asymptotic fundamental limit cannot be exceeded by more than $O(\tfrac{1}{\sqrt{n}})$. While strong converses are achieved in a broad range of information-theoretic problems by virtue of the "blowing-up method"---a powerful methodology due to Ahlswede, Gács and Körner (1976) based on concentration of measure---this method is fundamentally unable to attain second-order converses and is restricted to finite-alphabet settings. Capitalizing on reverse hypercontractivity of Markov semigroups and functional inequalities, this paper develops the "smoothing-out" method, an alternative to the blowing-up approach that does not rely on finite alphabets and that leads to second-order converses in a variety of information-theoretic problems that were out of reach of previous methods.
Submitted 14 November, 2019; v1 submitted 25 December, 2018;
originally announced December 2018.
-
Ergodicity, Decisions, and Partial Information
Authors:
Ramon van Handel
Abstract:
In the simplest sequential decision problem for an ergodic stochastic process X, at each time n a decision u_n is made as a function of past observations X_0,...,X_{n-1}, and a loss l(u_n,X_n) is incurred. In this setting, it is known that one may choose (under a mild integrability assumption) a decision strategy whose pathwise time-average loss is asymptotically smaller than that of any other strategy. The corresponding problem in the case of partial information proves to be much more delicate, however: if the process X is not observable, but decisions must be based on the observation of a different process Y, the existence of pathwise optimal strategies is not guaranteed.
The aim of this paper is to exhibit connections between pathwise optimal strategies and notions from ergodic theory. The sequential decision problem is developed in the general setting of an ergodic dynamical system (Ω,B,P,T) with partial information Y\subseteq B. The existence of pathwise optimal strategies is grounded in two basic properties: the conditional ergodic theory of the dynamical system, and the complexity of the loss function. When the loss function is not too complex, a general sufficient condition for the existence of pathwise optimal strategies is that the dynamical system is a conditional K-automorphism relative to the past observations \bigvee_n T^n Y. If the conditional ergodicity assumption is strengthened, the complexity assumption can be weakened. Several examples demonstrate the interplay between complexity and ergodicity, which does not arise in the case of full information. Our results also yield a decision-theoretic characterization of weak mixing in ergodic theory, and establish pathwise optimality of ergodic nonlinear filters.
Submitted 15 August, 2012;
originally announced August 2012.
-
A complete solution to Blackwell's unique ergodicity problem for hidden Markov chains
Authors:
Pavel Chigansky,
Ramon van Handel
Abstract:
We develop necessary and sufficient conditions for uniqueness of the invariant measure of the filtering process associated to an ergodic hidden Markov model in a finite or countable state space. These results provide a complete solution to a problem posed by Blackwell (1957), and subsume earlier partial results due to Kaijser, Kochman and Reeds. The proofs of our main results are based on the stability theory of nonlinear filters.
Submitted 15 November, 2010; v1 submitted 19 October, 2009;
originally announced October 2009.
-
On the minimal penalty for Markov order estimation
Authors:
Ramon van Handel
Abstract:
We show that large-scale typicality of Markov sample paths implies that the likelihood ratio statistic satisfies a law of the iterated logarithm uniformly to the same scale. As a consequence, the penalized likelihood Markov order estimator is strongly consistent for penalties growing as slowly as log log n when an upper bound is imposed on the order, which may grow as rapidly as log n. Our method of proof, using techniques from empirical process theory, does not rely on the explicit expression for the maximum likelihood estimator in the Markov case and could therefore be applicable in other settings.
Submitted 25 August, 2009;
originally announced August 2009.
-
When do nonlinear filters achieve maximal accuracy?
Authors:
Ramon van Handel
Abstract:
The nonlinear filter for an ergodic signal observed in white noise is said to achieve maximal accuracy if the stationary filtering error vanishes as the signal-to-noise ratio diverges. We give a general characterization of the maximal accuracy property in terms of various systems-theoretic notions. When the signal state space is a finite set, explicit necessary and sufficient conditions are obtained, while the linear Gaussian case reduces to a classic result of Kwakernaak and Sivan (1972).
Submitted 1 July, 2009; v1 submitted 8 January, 2009;
originally announced January 2009.