Search | arXiv e-print repository

Quantitative Clustering in Mean-Field Transformer Models

Authors: Shi Chen, Zhengjiang Lin, Yury Polyanskiy, Philippe Rigollet

Abstract: The evolution of tokens through a deep transformer models can be modeled as an interacting particle system that has been shown to exhibit an asymptotic clustering behavior akin to the synchronization phenomenon in Kuramoto models. In this work, we investigate the long-time clustering of mean-field transformer models. More precisely, we establish exponential rates of contraction to a Dirac point ma… ▽ More The evolution of tokens through a deep transformer models can be modeled as an interacting particle system that has been shown to exhibit an asymptotic clustering behavior akin to the synchronization phenomenon in Kuramoto models. In this work, we investigate the long-time clustering of mean-field transformer models. More precisely, we establish exponential rates of contraction to a Dirac point mass for any suitably regular initialization under some assumptions on the parameters of transformer models, any suitably regular mean-field initialization synchronizes exponentially fast with some quantitative rates. △ Less

Submitted 30 April, 2025; v1 submitted 20 April, 2025; originally announced April 2025.

Comments: 48 pages, 4 figures

arXiv:2503.20193 [pdf, ps, other]

Nonparametric MLE for Gaussian Location Mixtures: Certified Computation and Generic Behavior

Authors: Yury Polyanskiy, Mark Sellke

Abstract: We study the nonparametric maximum likelihood estimator $\widehatπ$ for Gaussian location mixtures in one dimension. It has been known since (Lindsay, 1983) that given an $n$-point dataset, this estimator always returns a mixture with at most $n$ components, and more recently (Wu-Polyanskiy, 2020) gave a sharp $O(\log n)$ bound for subgaussian data. In this work we study computational aspects of… ▽ More We study the nonparametric maximum likelihood estimator $\widehatπ$ for Gaussian location mixtures in one dimension. It has been known since (Lindsay, 1983) that given an $n$-point dataset, this estimator always returns a mixture with at most $n$ components, and more recently (Wu-Polyanskiy, 2020) gave a sharp $O(\log n)$ bound for subgaussian data. In this work we study computational aspects of $\widehatπ$. We provide an algorithm which for small enough $\varepsilon>0$ computes an $\varepsilon$-approximation of $\widehatπ$ in Wasserstein distance in time $K+Cnk^2\log\log(1/\varepsilon)$. Here $K$ is data-dependent but independent of $\varepsilon$, while $C$ is an absolute constant and $k=|supp(\widehatπ)|\leq n$ is the number of atoms in $\widehatπ$. We also certifiably compute the exact value of $|supp(\widehatπ)|$ in finite time. These guarantees hold almost surely whenever the dataset $(x_1,\dots,x_n)\in [-cn^{1/4},cn^{1/4}]$ consists of independent points from a probability distribution with a density (relative to Lebesgue measure). We also show the distribution of $\widehatπ$ conditioned to be $k$-atomic admits a density on the associated $2k-1$ dimensional parameter space for all $k\leq \sqrt{n}/3$, and almost sure locally linear convergence of the EM algorithm. One key tool is a classical Fourier analytic estimate for non-degenerate curves. △ Less

Submitted 25 March, 2025; originally announced March 2025.

Comments: 43 pages

arXiv:2503.17823 [pdf, ps, other]

On the Minimax Regret of Sequential Probability Assignment via Square-Root Entropy

Authors: Zeyu Jia, Yury Polyanskiy, Alexander Rakhlin

Abstract: We study the problem of sequential probability assignment under logarithmic loss, both with and without side information. Our objective is to analyze the minimax regret -- a notion extensively studied in the literature -- in terms of geometric quantities, such as covering numbers and scale-sensitive dimensions. We show that the minimax regret for the case of no side information (equivalently, the… ▽ More We study the problem of sequential probability assignment under logarithmic loss, both with and without side information. Our objective is to analyze the minimax regret -- a notion extensively studied in the literature -- in terms of geometric quantities, such as covering numbers and scale-sensitive dimensions. We show that the minimax regret for the case of no side information (equivalently, the Shtarkov sum) can be upper bounded in terms of sequential square-root entropy, a notion closely related to Hellinger distance. For the problem of sequential probability assignment with side information, we develop both upper and lower bounds based on the aforementioned entropy. The lower bound matches the upper bound, up to log factors, for classes in the Donsker regime (according to our definition of entropy). △ Less

Submitted 22 March, 2025; originally announced March 2025.

arXiv:2502.14988 [pdf, ps, other]

Partial and Exact Recovery of a Random Hypergraph from its Graph Projection

Authors: Guy Bresler, Chenghao Guo, Yury Polyanskiy, Andrew Yao

Abstract: Consider a $d$-uniform random hypergraph on $n$ vertices in which hyperedges are included iid so that the average degree is $n^δ$. The projection of a hypergraph is a graph on the same $n$ vertices where an edge connects two vertices if and only if they belong to some hyperedge. The goal is to reconstruct the hypergraph given its projection. An earlier work of Bresler, Guo, and Polyanskiy (COLT 20… ▽ More Consider a $d$-uniform random hypergraph on $n$ vertices in which hyperedges are included iid so that the average degree is $n^δ$. The projection of a hypergraph is a graph on the same $n$ vertices where an edge connects two vertices if and only if they belong to some hyperedge. The goal is to reconstruct the hypergraph given its projection. An earlier work of Bresler, Guo, and Polyanskiy (COLT 2024) showed that exact recovery for $d=3$ is possible if and only if $δ< 2/5$. This work completely resolves the question for all values of $d$ for both exact and partial recovery and for both cases of whether multiplicity information about each edge is available or not. In addition, we show that the reconstruction fidelity undergoes an all-or-nothing transition at a threshold. In particular, this resolves all conjectures from Bresler, Guo, and Polyanskiy (COLT 2024). △ Less

Submitted 20 February, 2025; originally announced February 2025.

arXiv:2502.09844 [pdf, other]

Solving Empirical Bayes via Transformers

Authors: Anzo Teh, Mark Jabbour, Yury Polyanskiy

Abstract: This work applies modern AI tools (transformers) to solving one of the oldest statistical problems: Poisson means under empirical Bayes (Poisson-EB) setting. In Poisson-EB a high-dimensional mean vector $θ$ (with iid coordinates sampled from an unknown prior $π$) is estimated on the basis of $X=\mathrm{Poisson}(θ)$. A transformer model is pre-trained on a set of synthetically generated pairs… ▽ More This work applies modern AI tools (transformers) to solving one of the oldest statistical problems: Poisson means under empirical Bayes (Poisson-EB) setting. In Poisson-EB a high-dimensional mean vector $θ$ (with iid coordinates sampled from an unknown prior $π$) is estimated on the basis of $X=\mathrm{Poisson}(θ)$. A transformer model is pre-trained on a set of synthetically generated pairs $(X,θ)$ and learns to do in-context learning (ICL) by adapting to unknown $π$. Theoretically, we show that a sufficiently wide transformer can achieve vanishing regret with respect to an oracle estimator who knows $π$ as dimension grows to infinity. Practically, we discover that already very small models (100k parameters) are able to outperform the best classical algorithm (non-parametric maximum likelihood, or NPMLE) both in runtime and validation loss, which we compute on out-of-distribution synthetic data as well as real-world datasets (NHL hockey, MLB baseball, BookCorpusOpen). Finally, by using linear probes, we confirm that the transformer's EB estimator appears to internally work differently from either NPMLE or Robbins' estimators. △ Less

Submitted 27 May, 2025; v1 submitted 13 February, 2025; originally announced February 2025.

Comments: 28 pages, 6 figures, 12 tables. Code at https://github.com/Anzoteh96/eb-transformers

arXiv:2502.09720 [pdf, ps, other]

NestQuant: Nested Lattice Quantization for Matrix Products and LLMs

Authors: Semyon Savkin, Eitan Porat, Or Ordentlich, Yury Polyanskiy

Abstract: Post-training quantization (PTQ) has emerged as a critical technique for efficient deployment of large language models (LLMs). This work proposes NestQuant, a novel PTQ scheme for weights and activations that is based on self-similar nested lattices. Recent works have mathematically shown such quantizers to be information-theoretically optimal for low-precision matrix multiplication. We implement… ▽ More Post-training quantization (PTQ) has emerged as a critical technique for efficient deployment of large language models (LLMs). This work proposes NestQuant, a novel PTQ scheme for weights and activations that is based on self-similar nested lattices. Recent works have mathematically shown such quantizers to be information-theoretically optimal for low-precision matrix multiplication. We implement a practical low-complexity version of NestQuant based on Gosset lattice, making it a drop-in quantizer for any matrix multiplication step (e.g., in self-attention, MLP etc). For example, NestQuant quantizes weights, KV-cache, and activations of Llama-3-8B to 4 bits, achieving perplexity of 6.6 on wikitext2. This represents more than 55% reduction in perplexity gap with respect to unquantized model (perplexity of 6.14) compared to state-of-the-art Metas SpinQuant (perplexity 7.3), OstQuant (7.3) and QuaRot (8.2). Comparisons on bigger models (up to 70B) and on various LLM evaluation benchmarks confirm uniform superiority of NestQuant. △ Less

Submitted 11 June, 2025; v1 submitted 13 February, 2025; originally announced February 2025.

Comments: 23 pages

arXiv:2502.08840 [pdf, ps, other]

Thresholds for Reconstruction of Random Hypergraphs From Graph Projections

Authors: Guy Bresler, Chenghao Guo, Yury Polyanskiy

Abstract: The graph projection of a hypergraph is a simple graph with the same vertex set and with an edge between each pair of vertices that appear in a hyperedge. We consider the problem of reconstructing a random $d$-uniform hypergraph from its projection. Feasibility of this task depends on $d$ and the density of hyperedges in the random hypergraph. For $d=3$ we precisely determine the threshold, while… ▽ More The graph projection of a hypergraph is a simple graph with the same vertex set and with an edge between each pair of vertices that appear in a hyperedge. We consider the problem of reconstructing a random $d$-uniform hypergraph from its projection. Feasibility of this task depends on $d$ and the density of hyperedges in the random hypergraph. For $d=3$ we precisely determine the threshold, while for $d\geq 4$ we give bounds. All of our feasibility results are obtained by exhibiting an efficient algorithm for reconstructing the original hypergraph, while infeasibility is information-theoretic. Our results also apply to mildly inhomogeneous random hypergrahps, including hypergraph stochastic block models (HSBM). A consequence of our results is an optimal HSBM recovery algorithm, improving on a result of Guadio and Joshi in 2023. △ Less

Submitted 12 February, 2025; originally announced February 2025.

Comments: 32 pages

arXiv:2501.00762 [pdf, other]

Residual connections provably mitigate oversmoothing in graph neural networks

Authors: Ziang Chen, Zhengjiang Lin, Shi Chen, Yury Polyanskiy, Philippe Rigollet

Abstract: Graph neural networks (GNNs) have achieved remarkable empirical success in processing and representing graph-structured data across various domains. However, a significant challenge known as "oversmoothing" persists, where vertex features become nearly indistinguishable in deep GNNs, severely restricting their expressive power and practical utility. In this work, we analyze the asymptotic oversmoo… ▽ More Graph neural networks (GNNs) have achieved remarkable empirical success in processing and representing graph-structured data across various domains. However, a significant challenge known as "oversmoothing" persists, where vertex features become nearly indistinguishable in deep GNNs, severely restricting their expressive power and practical utility. In this work, we analyze the asymptotic oversmoothing rates of deep GNNs with and without residual connections by deriving explicit convergence rates for a normalized vertex similarity measure. Our analytical framework is grounded in the multiplicative ergodic theorem. Furthermore, we demonstrate that adding residual connections effectively mitigates or prevents oversmoothing across several broad families of parameter distributions. The theoretical findings are strongly supported by numerical experiments. △ Less

Submitted 3 January, 2025; v1 submitted 1 January, 2025; originally announced January 2025.

arXiv:2411.04990 [pdf, other]

Clustering in Causal Attention Masking

Authors: Nikita Karagodin, Yury Polyanskiy, Philippe Rigollet

Abstract: This work presents a modification of the self-attention dynamics proposed by Geshkovski et al. (arXiv:2312.10794) to better reflect the practically relevant, causally masked attention used in transformer architectures for generative AI. This modification translates into an interacting particle system that cannot be interpreted as a mean-field gradient flow. Despite this loss of structure, we signi… ▽ More This work presents a modification of the self-attention dynamics proposed by Geshkovski et al. (arXiv:2312.10794) to better reflect the practically relevant, causally masked attention used in transformer architectures for generative AI. This modification translates into an interacting particle system that cannot be interpreted as a mean-field gradient flow. Despite this loss of structure, we significantly strengthen the results of Geshkovski et al. (arXiv:2312.10794) in this context: While previous rigorous results focused on cases where all three matrices (Key, Query, and Value) were scaled identities, we prove asymptotic convergence to a single cluster for arbitrary key-query matrices and a value matrix equal to the identity. Additionally, we establish a connection to the classical Rényi parking problem from combinatorial geometry to make initial theoretical steps towards demonstrating the existence of meta-stable states. △ Less

Submitted 10 November, 2024; v1 submitted 7 November, 2024; originally announced November 2024.

Comments: 38th Conference on Neural Information Processing Systems (NeurIPS 2024), 22 pages, 6 figures

MSC Class: 68T07; 35Q68; 37N99; 82C22

arXiv:2410.13780 [pdf, other]

Optimal Quantization for Matrix Multiplication

Authors: Or Ordentlich, Yury Polyanskiy

Abstract: Recent work in machine learning community proposed multiple methods for performing lossy compression (quantization) of large matrices. This quantization is important for accelerating matrix multiplication (main component of large language models), which is often bottlenecked by the speed of loading these matrices from memory. Unlike classical vector quantization and rate-distortion theory, the goa… ▽ More Recent work in machine learning community proposed multiple methods for performing lossy compression (quantization) of large matrices. This quantization is important for accelerating matrix multiplication (main component of large language models), which is often bottlenecked by the speed of loading these matrices from memory. Unlike classical vector quantization and rate-distortion theory, the goal of these new compression algorithms is to be able to approximate not the matrices themselves, but their matrix product. Specifically, given a pair of real matrices $A,B$ an encoder (compressor) is applied to each of them independently producing descriptions with $R$ bits per entry. These representations subsequently are used by the decoder to estimate matrix product $A^\top B$. In this work, we provide a non-asymptotic lower bound on the mean squared error of this approximation (as a function of rate $R$) for the case of matrices $A,B$ with iid Gaussian entries. Algorithmically, we construct a universal quantizer based on nested lattices with an explicit guarantee of approximation error for any (non-random) pair of matrices $A$, $B$ in terms of only Frobenius norms $\|\bar{A}\|_F, \|\bar{B}\|_F$ and $\|\bar{A}^\top \bar{B}\|_F$, where $\bar{A},\bar{B}$ are versions of $A,B$ with zero-centered columns, respectively. For iid Gaussian matrices our quantizer achieves the lower bound and is, thus, asymptotically optimal. A practical low-complexity version of our quantizer achieves performance quite close to optimal. In addition, we derive rate-distortion function for matrix multiplication of iid Gaussian matrices, which exhibits an interesting phase-transition at $R\approx 0.906$ bit/entry. △ Less

Submitted 17 January, 2025; v1 submitted 17 October, 2024; originally announced October 2024.

arXiv:2410.06833 [pdf, other]

Dynamic metastability in the self-attention model

Authors: Borjan Geshkovski, Hugo Koubbi, Yury Polyanskiy, Philippe Rigollet

Abstract: We consider the self-attention model - an interacting particle system on the unit sphere, which serves as a toy model for Transformers, the deep neural network architecture behind the recent successes of large language models. We prove the appearance of dynamic metastability conjectured in [GLPR23] - although particles collapse to a single cluster in infinite time, they remain trapped near a confi… ▽ More We consider the self-attention model - an interacting particle system on the unit sphere, which serves as a toy model for Transformers, the deep neural network architecture behind the recent successes of large language models. We prove the appearance of dynamic metastability conjectured in [GLPR23] - although particles collapse to a single cluster in infinite time, they remain trapped near a configuration of several clusters for an exponentially long period of time. By leveraging a gradient flow interpretation of the system, we also connect our result to an overarching framework of slow motion of gradient flows proposed by Otto and Reznikoff [OR07] in the context of coarsening and the Allen-Cahn equation. We finally probe the dynamics beyond the exponentially long period of metastability, and illustrate that, under an appropriate time-rescaling, the energy reaches its global maximum in finite time and has a staircase profile, with trajectories manifesting saddle-to-saddle-like behavior, reminiscent of recent works in the analysis of training dynamics via gradient descent for two-layer neural networks. △ Less

Submitted 9 October, 2024; originally announced October 2024.

arXiv:2409.08839 [pdf, other]

doi 10.1109/OJCOMS.2025.3556319

RF Challenge: The Data-Driven Radio Frequency Signal Separation Challenge

Authors: Alejandro Lancho, Amir Weiss, Gary C. F. Lee, Tejas Jayashankar, Binoy Kurien, Yury Polyanskiy, Gregory W. Wornell

Abstract: We address the critical problem of interference rejection in radio-frequency (RF) signals using a data-driven approach that leverages deep-learning methods. A primary contribution of this paper is the introduction of the RF Challenge, which is a publicly available, diverse RF signal dataset for data-driven analyses of RF signal problems. Specifically, we adopt a simplified signal model for develop… ▽ More We address the critical problem of interference rejection in radio-frequency (RF) signals using a data-driven approach that leverages deep-learning methods. A primary contribution of this paper is the introduction of the RF Challenge, which is a publicly available, diverse RF signal dataset for data-driven analyses of RF signal problems. Specifically, we adopt a simplified signal model for developing and analyzing interference rejection algorithms. For this signal model, we introduce a set of carefully chosen deep learning architectures, incorporating key domain-informed modifications alongside traditional benchmark solutions to establish baseline performance metrics for this intricate, ubiquitous problem. Through extensive simulations involving eight different signal mixture types, we demonstrate the superior performance (in some cases, by two orders of magnitude) of architectures such as UNet and WaveNet over traditional methods like matched filtering and linear minimum mean square error estimation. Our findings suggest that the data-driven approach can yield scalable solutions, in the sense that the same architectures may be similarly trained and deployed for different types of signals. Moreover, these findings further corroborate the promising potential of deep learning algorithms for enhancing communication systems, particularly via interference mitigation. This work also includes results from an open competition based on the RF Challenge, hosted at the 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'24). △ Less

Submitted 27 March, 2025; v1 submitted 13 September, 2024; originally announced September 2024.

Comments: 17 pages, 16 figures, to appear in the IEEE Open Journal of the Communications Society

Journal ref: IEEE Open Journal of the Communications Society, vol. 6, pp. 4083-4100, 2025

arXiv:2409.07689 [pdf, ps, other]

Entropy Contractions in Markov Chains: Half-Step, Full-Step and Continuous-Time

Authors: Pietro Caputo, Zongchen Chen, Yuzhou Gu, Yury Polyanskiy

Abstract: This paper considers the speed of convergence (mixing) of a finite Markov kernel $P$ with respect to the Kullback-Leibler divergence (entropy). Given a Markov kernel one defines either a discrete-time Markov chain (with the $n$-step transition kernel given by the matrix power $P^n$) or a continuous-time Markov process (with the time-$t$ transition kernel given by $e^{t(P-\mathrm{Id})}$). The contr… ▽ More This paper considers the speed of convergence (mixing) of a finite Markov kernel $P$ with respect to the Kullback-Leibler divergence (entropy). Given a Markov kernel one defines either a discrete-time Markov chain (with the $n$-step transition kernel given by the matrix power $P^n$) or a continuous-time Markov process (with the time-$t$ transition kernel given by $e^{t(P-\mathrm{Id})}$). The contraction of entropy for $n=1$ or $t=0+$ are characterized by the famous functional inequalities, the strong data processing inequality (SDPI) and the modified log-Sobolev inequality (MLSI), respectively. When $P=KK^*$ is written as the product of a kernel and its adjoint, one could also consider the ``half-step'' contraction, which is the SDPI for $K$, while the ``full-step'' contraction refers to the SDPI for $P$. The work [DMLM03] claimed that these contraction coefficients (half-step, full-step, and continuous-time) are generally within a constant factor of each other. We disprove this and related conjectures by working out a number of different counterexamples. In particular, we construct (a) a continuous-time Markov process that contracts arbitrarily faster than its discrete-time counterpart; and (b) a kernel $P$ such that $P^{m+1}$ contracts arbitrarily better than $P^m$. Hence, our main conclusion is that the four standard inequalities comparing five common notions of entropy and variance contraction are generally not improvable. In the process of analyzing the counterexamples, we survey and sharpen the tools for bounding the contraction coefficients and characterize properties of extremizers of the respective functional inequalities. As our examples range from Bernoulli-Laplace model, random walks on graphs, to birth-death chains, the paper is also intended as a tutorial on computing MLSI, SDPI and other constants for these types of commonly occurring Markov chains. △ Less

Submitted 11 September, 2024; originally announced September 2024.

arXiv:2408.09874 [pdf, other]

Unsourced Multiple Access: A Coding Paradigm for Massive Random Access

Authors: Gianluigi Liva, Yury Polyanskiy

Abstract: This paper is a tutorial introduction to the field of unsourced multiple access (UMAC) protocols. We first provide a historical survey of the evolution of random access protocols, focusing specifically on the case in which uncoordinated users share a wireless broadcasting medium. Next, we highlight the change of perspective originated by the UMAC model, in which the physical and medium access laye… ▽ More This paper is a tutorial introduction to the field of unsourced multiple access (UMAC) protocols. We first provide a historical survey of the evolution of random access protocols, focusing specifically on the case in which uncoordinated users share a wireless broadcasting medium. Next, we highlight the change of perspective originated by the UMAC model, in which the physical and medium access layer's protocols cooperate, thus reframing random access as a novel coding-theoretic problem. By now, a large variety of UMAC protocols (codes) emerged, necessitating a certain classification that we indeed propose here. Although some random access schemes require a radical change of the physical layer, others can be implemented with minimal changes to existing industry standards. As an example, we discuss a simple modification to the 5GNR Release 16 random access channel that builds on the UMAC theory and that dramatically improves energy efficiency for systems with even moderate number of simultaneous users (e.g., $5-10$ dB gain for $10-50$ users), and also enables handling of high number of users, something completely out of reach of the state-of-the-art. △ Less

Submitted 19 August, 2024; originally announced August 2024.

Comments: Proceedings of the IEEE, accepted for publication

arXiv:2405.03348 [pdf, other]

Evolution of the 5G New Radio Two-Step Random Access towards 6G Unsourced MAC

Authors: Patrick Agostini, Jean-Francois Chamberland, Federico Clazzer, Johannes Dommel, Gianluigi Liva, Andrea Munari, Krishna Narayanan, Yury Polyanskiy, Slawomir Stanczak, Zoran Utkovski

Abstract: This report summarizes some considerations on possible evolutions of grant-free random access in the next generation of the 3GPP wireless cellular standard. The analysis is carried out by mapping the problem to the recently-introduced unsourced multiple access channel (UMAC) setup. By doing so, the performance of existing solutions can be benchmarked with information-theoretic bounds, assessing th… ▽ More This report summarizes some considerations on possible evolutions of grant-free random access in the next generation of the 3GPP wireless cellular standard. The analysis is carried out by mapping the problem to the recently-introduced unsourced multiple access channel (UMAC) setup. By doing so, the performance of existing solutions can be benchmarked with information-theoretic bounds, assessing the potential gains that can be achieved over legacy 3GPP schemes. The study focuses on the two-step random access (2SRA) protocol introduced by Release 16 of the 5G New Radio standard, investigating its applicability to support large MTC / IoT terminal populations in a grant-free fashion. The analysis shows that the existing 2SRA scheme may not succeed in providing energy-efficient support to large user populations. Modifications to the protocol are proposed that enable remarkable gains in both energy and spectral efficiency while retaining a strong resemblance to the legacy protocol. △ Less

Submitted 6 May, 2024; originally announced May 2024.

Comments: Version 1.0 of the report

arXiv:2312.17701 [pdf, ps, other]

Density estimation using the perceptron

Authors: Patrik Róbert Gerber, Tianze Jiang, Yury Polyanskiy, Rui Sun

Abstract: We propose a new density estimation algorithm. Given $n$ i.i.d. observations from a distribution belonging to a class of densities on $\mathbb{R}^d$, our estimator outputs any density in the class whose "perceptron discrepancy" with the empirical distribution is at most $O(\sqrt{d / n})$. The perceptron discrepancy is defined as the largest difference in mass two distribution place on any halfspac… ▽ More We propose a new density estimation algorithm. Given $n$ i.i.d. observations from a distribution belonging to a class of densities on $\mathbb{R}^d$, our estimator outputs any density in the class whose "perceptron discrepancy" with the empirical distribution is at most $O(\sqrt{d / n})$. The perceptron discrepancy is defined as the largest difference in mass two distribution place on any halfspace. It is shown that this estimator achieves the expected total variation distance to the truth that is almost minimax optimal over the class of densities with bounded Sobolev norm and Gaussian mixtures. This suggests that the regularity of the prior distribution could be an explanation for the efficiency of the ubiquitous step in machine learning that replaces optimization over large function spaces with simpler parametric classes (such as discriminators of GANs). We also show that replacing the perceptron discrepancy with the generalized energy distance of Székely and Rizzo (2013) further improves total variation loss. The generalized energy distance between empirical distributions is easily computable and differentiable, which makes it especially useful for fitting generative models. To the best of our knowledge, it is the first "simple" distance with such properties that yields minimax optimal statistical guarantees. In addition, we shed light on the ubiquitous method of representing discrete data in domain $[k]$ via embedding vectors on a unit ball in $\mathbb{R}^d$. We show that taking $d \asymp \log (k)$ allows one to use simple linear probing to evaluate and estimate total variation distance, as well as recovering minimax optimal sample complexity for the class of discrete distributions on $[k]$. △ Less

Submitted 5 June, 2025; v1 submitted 29 December, 2023; originally announced December 2023.

Comments: 44 pages

MSC Class: 62G07

arXiv:2312.10794 [pdf, other]

A mathematical perspective on Transformers

Authors: Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, Philippe Rigollet

Abstract: Transformers play a central role in the inner workings of large language models. We develop a mathematical framework for analyzing Transformers based on their interpretation as interacting particle systems, which reveals that clusters emerge in long time. Our study explores the underlying theory and offers new perspectives for mathematicians as well as computer scientists. Transformers play a central role in the inner workings of large language models. We develop a mathematical framework for analyzing Transformers based on their interpretation as interacting particle systems, which reveals that clusters emerge in long time. Our study explores the underlying theory and offers new perspectives for mathematicians as well as computer scientists. △ Less

Submitted 12 August, 2024; v1 submitted 17 December, 2023; originally announced December 2023.

arXiv:2308.09043 [pdf, other]

Kernel-Based Tests for Likelihood-Free Hypothesis Testing

Authors: Patrik Róbert Gerber, Tianze Jiang, Yury Polyanskiy, Rui Sun

Abstract: Given $n$ observations from two balanced classes, consider the task of labeling an additional $m$ inputs that are known to all belong to \emph{one} of the two classes. Special cases of this problem are well-known: with complete knowledge of class distributions ($n=\infty$) the problem is solved optimally by the likelihood-ratio test; when $m=1$ it corresponds to binary classification; and when… ▽ More Given $n$ observations from two balanced classes, consider the task of labeling an additional $m$ inputs that are known to all belong to \emph{one} of the two classes. Special cases of this problem are well-known: with complete knowledge of class distributions ($n=\infty$) the problem is solved optimally by the likelihood-ratio test; when $m=1$ it corresponds to binary classification; and when $m\approx n$ it is equivalent to two-sample testing. The intermediate settings occur in the field of likelihood-free inference, where labeled samples are obtained by running forward simulations and the unlabeled sample is collected experimentally. In recent work it was discovered that there is a fundamental trade-off between $m$ and $n$: increasing the data sample $m$ reduces the amount $n$ of training/simulation data needed. In this work we (a) introduce a generalization where unlabeled samples come from a mixture of the two classes -- a case often encountered in practice; (b) study the minimax sample complexity for non-parametric classes of densities under \textit{maximum mean discrepancy} (MMD) separation; and (c) investigate the empirical performance of kernels parameterized by neural networks on two tasks: detection of the Higgs boson and detection of planted DDPM generated images amidst CIFAR-10 images. For both problems we confirm the existence of the theoretically predicted asymmetric $m$ vs $n$ trade-off. △ Less

Submitted 23 November, 2023; v1 submitted 17 August, 2023; originally announced August 2023.

Comments: 36 pages, 6 figures

arXiv:2307.02070 [pdf, other]

Empirical Bayes via ERM and Rademacher complexities: the Poisson model

Authors: Soham Jana, Yury Polyanskiy, Anzo Teh, Yihong Wu

Abstract: We consider the problem of empirical Bayes estimation for (multivariate) Poisson means. Existing solutions that have been shown theoretically optimal for minimizing the regret (excess risk over the Bayesian oracle that knows the prior) have several shortcomings. For example, the classical Robbins estimator does not retain the monotonicity property of the Bayes estimator and performs poorly under m… ▽ More We consider the problem of empirical Bayes estimation for (multivariate) Poisson means. Existing solutions that have been shown theoretically optimal for minimizing the regret (excess risk over the Bayesian oracle that knows the prior) have several shortcomings. For example, the classical Robbins estimator does not retain the monotonicity property of the Bayes estimator and performs poorly under moderate sample size. Estimators based on the minimum distance and non-parametric maximum likelihood (NPMLE) methods correct these issues, but are computationally expensive with complexity growing exponentially with dimension. Extending the approach of Barbehenn and Zhao (2022), in this work we construct monotone estimators based on empirical risk minimization (ERM) that retain similar theoretical guarantees and can be computed much more efficiently. Adapting the idea of offset Rademacher complexity Liang et al. (2015) to the non-standard loss and function class in empirical Bayes, we show that the shape-constrained ERM estimator attains the minimax regret within constant factors in one dimension and within logarithmic factors in multiple dimensions. △ Less

Submitted 5 July, 2023; originally announced July 2023.

Comments: 34 pages, 1 Figure, to appear in the 2023 Conference of Learning Theory (COLT)

arXiv:2307.01095 [pdf, other]

Coded Orthogonal Modulation for the Multi-Antenna Multiple-Access Channel

Authors: Alexander Fengler, Alejandro Lancho, Yury Polyanskiy

Abstract: This study focuses on (traditional and unsourced) multiple-access communication over a single transmit and multiple ($M$) receive antennas. We assume full or partial channel state information (CSI) at the receiver. It is known that to fully achieve the fundamental limits (even asymptotically) the decoder needs to jointly estimate all user codewords, doing which directly is computationally infeasib… ▽ More This study focuses on (traditional and unsourced) multiple-access communication over a single transmit and multiple ($M$) receive antennas. We assume full or partial channel state information (CSI) at the receiver. It is known that to fully achieve the fundamental limits (even asymptotically) the decoder needs to jointly estimate all user codewords, doing which directly is computationally infeasible. We propose a low-complexity solution, termed coded orthogonal modulation multiple-access (COMMA), in which users first encode their messages via a long (multi-user interference aware) outer code operating over a $q$-ary alphabet. These symbols are modulated onto $q$ orthogonal waveforms. At the decoder a multiple-measurement vector approximate message passing (MMV-AMP) algorithm estimates several candidates (out of $q$) for each user, with the remaining uncertainty resolved by the single-user outer decoders. Numerically, we show that COMMA outperforms a standard solution based on linear multiuser detection (MUD) with Gaussian signaling. Theoretically, we derive bounds and scaling laws for $M$, the number of users $K_a$, SNR, and $q$, allowing to quantify the trade-off between receive antennas and spectral efficiency. The orthogonal signaling scheme is applicable to unsourced random access and, with chirp sequences as basis, allows for low-complexity fast Fourier transform (FFT) based receivers that are resilient to frequency and timing offsets. △ Less

Submitted 3 July, 2023; originally announced July 2023.

arXiv:2306.16735 [pdf, ps, other]

Comparing Poisson and Gaussian channels (extended)

Authors: Anzo Teh, Yury Polyanskiy

Abstract: Consider a pair of input distributions which after passing through a Poisson channel become $ε$-close in total variation. We show that they must necessarily then be $ε^{0.5+o(1)}$-close after passing through a Gaussian channel as well. In the opposite direction, we show that distributions inducing $ε$-close outputs over the Gaussian channel must induce $ε^{1+o(1)}$-close outputs over the Poisson.… ▽ More Consider a pair of input distributions which after passing through a Poisson channel become $ε$-close in total variation. We show that they must necessarily then be $ε^{0.5+o(1)}$-close after passing through a Gaussian channel as well. In the opposite direction, we show that distributions inducing $ε$-close outputs over the Gaussian channel must induce $ε^{1+o(1)}$-close outputs over the Poisson. This quantifies a well-known intuition that ''smoothing'' induced by Poissonization and Gaussian convolution are similar. As an application, we improve a recent upper bound of Han-Miao-Shen'2021 for estimating mixing distribution of a Poisson mixture in Gaussian optimal transport distance from $n^{-0.1 + o(1)}$ to $n^{-0.25 + o(1)}$. △ Less

Submitted 29 June, 2023; originally announced June 2023.

Comments: 7 pages, 2 figures, to appear in the 2023 IEEE International Symposium on Information Theory (ISIT)

arXiv:2306.14411 [pdf, other]

Score-based Source Separation with Applications to Digital Communication Signals

Authors: Tejas Jayashankar, Gary C. F. Lee, Alejandro Lancho, Amir Weiss, Yury Polyanskiy, Gregory W. Wornell

Abstract: We propose a new method for separating superimposed sources using diffusion-based generative models. Our method relies only on separately trained statistical priors of independent sources to establish a new objective function guided by maximum a posteriori estimation with an $α$-posterior, across multiple levels of Gaussian smoothing. Motivated by applications in radio-frequency (RF) systems, we a… ▽ More We propose a new method for separating superimposed sources using diffusion-based generative models. Our method relies only on separately trained statistical priors of independent sources to establish a new objective function guided by maximum a posteriori estimation with an $α$-posterior, across multiple levels of Gaussian smoothing. Motivated by applications in radio-frequency (RF) systems, we are interested in sources with underlying discrete nature and the recovery of encoded bits from a signal of interest, as measured by the bit error rate (BER). Experimental results with RF mixtures demonstrate that our method results in a BER reduction of 95% over classical and existing learning-based methods. Our analysis demonstrates that our proposed method yields solutions that asymptotically approach the modes of an underlying discrete distribution. Furthermore, our method can be viewed as a multi-source extension to the recently proposed score distillation sampling scheme, shedding additional light on its use beyond conditional sampling. The project webpage is available at https://alpha-rgs.github.io △ Less

Submitted 17 January, 2024; v1 submitted 26 June, 2023; originally announced June 2023.

Comments: 34 pages, 18 figures, for associated project webpage see https://alpha-rgs.github.io

arXiv:2306.12308 [pdf, ps, other]

Entropic characterization of optimal rates for learning Gaussian mixtures

Authors: Zeyu Jia, Yury Polyanskiy, Yihong Wu

Abstract: We consider the question of estimating multi-dimensional Gaussian mixtures (GM) with compactly supported or subgaussian mixing distributions. Minimax estimation rate for this class (under Hellinger, TV and KL divergences) is a long-standing open question, even in one dimension. In this paper we characterize this rate (for all constant dimensions) in terms of the metric entropy of the class. Such c… ▽ More We consider the question of estimating multi-dimensional Gaussian mixtures (GM) with compactly supported or subgaussian mixing distributions. Minimax estimation rate for this class (under Hellinger, TV and KL divergences) is a long-standing open question, even in one dimension. In this paper we characterize this rate (for all constant dimensions) in terms of the metric entropy of the class. Such characterizations originate from seminal works of Le Cam (1973); Birge (1983); Haussler and Opper (1997); Yang and Barron (1999). However, for GMs a key ingredient missing from earlier work (and widely sought-after) is a comparison result showing that the KL and the squared Hellinger distance are within a constant multiple of each other uniformly over the class. Our main technical contribution is in showing this fact, from which we derive entropy characterization for estimation rate under Hellinger and KL. Interestingly, the sequential (online learning) estimation rate is characterized by the global entropy, while the single-step (batch) rate corresponds to local entropy, paralleling a similar result for the Gaussian sequence model recently discovered by Neykov (2022) and Mourtada (2023). Additionally, since Hellinger is a proper metric, our comparison shows that GMs under KL satisfy the triangle inequality within multiplicative constants, implying that proper and improper estimation rates coincide. △ Less

Submitted 27 June, 2023; v1 submitted 21 June, 2023; originally announced June 2023.

arXiv:2306.11085 [pdf, ps, other]

Minimax optimal testing by classification

Authors: Patrik Róbert Gerber, Yanjun Han, Yury Polyanskiy

Abstract: This paper considers an ML inspired approach to hypothesis testing known as classifier/classification-accuracy testing ($\mathsf{CAT}$). In $\mathsf{CAT}$, one first trains a classifier by feeding it labeled synthetic samples generated by the null and alternative distributions, which is then used to predict labels of the actual data samples. This method is widely used in practice when the null and… ▽ More This paper considers an ML inspired approach to hypothesis testing known as classifier/classification-accuracy testing ($\mathsf{CAT}$). In $\mathsf{CAT}$, one first trains a classifier by feeding it labeled synthetic samples generated by the null and alternative distributions, which is then used to predict labels of the actual data samples. This method is widely used in practice when the null and alternative are only specified via simulators (as in many scientific experiments). We study goodness-of-fit, two-sample ($\mathsf{TS}$) and likelihood-free hypothesis testing ($\mathsf{LFHT}$), and show that $\mathsf{CAT}$ achieves (near-)minimax optimal sample complexity in both the dependence on the total-variation ($\mathsf{TV}$) separation $ε$ and the probability of error $δ$ in a variety of non-parametric settings, including discrete distributions, $d$-dimensional distributions with a smooth density, and the Gaussian sequence model. In particular, we close the high probability sample complexity of $\mathsf{LFHT}$ for each class. As another highlight, we recover the minimax optimal complexity of $\mathsf{TS}$ over discrete distributions, which was recently established by Diakonikolas et al. (2021). The corresponding $\mathsf{CAT}$ simply compares empirical frequencies in the first half of the data, and rejects the null when the classification accuracy on the second half is better than random. △ Less

Submitted 19 June, 2023; originally announced June 2023.

arXiv:2305.09995 [pdf, ps, other]

Algorithmic Decorrelation and Planted Clique in Dependent Random Graphs: The Case of Extra Triangles

Authors: Guy Bresler, Chenghao Guo, Yury Polyanskiy

Abstract: We aim to understand the extent to which the noise distribution in a planted signal-plus-noise problem impacts its computational complexity. To that end, we consider the planted clique and planted dense subgraph problems, but in a different ambient graph. Instead of Erdős-Rényi $G(n,p)$, which has independent edges, we take the ambient graph to be the random graph with triangles (RGT) obtained by… ▽ More We aim to understand the extent to which the noise distribution in a planted signal-plus-noise problem impacts its computational complexity. To that end, we consider the planted clique and planted dense subgraph problems, but in a different ambient graph. Instead of Erdős-Rényi $G(n,p)$, which has independent edges, we take the ambient graph to be the random graph with triangles (RGT) obtained by adding triangles to $G(n,p)$. We show that the RGT can be efficiently mapped to the corresponding $G(n,p)$, and moreover, that the planted clique (or dense subgraph) is approximately preserved under this mapping. This constitutes the first average-case reduction transforming dependent noise to independent noise. Together with the easier direction of mapping the ambient graph from Erdős-Rényi to RGT, our results yield a strong equivalence between models. In order to prove our results, we develop a new general framework for reasoning about the validity of average-case reductions based on low sensitivity to perturbations. △ Less

Submitted 27 June, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

Comments: 71 pages

arXiv:2305.06985 [pdf, other]

On the Advantages of Asynchrony in the Unsourced MAC

Authors: Alexander Fengler, Alejandro Lancho, Krishna Narayanan, Yury Polyanskiy

Abstract: In this work we demonstrate how a lack of synchronization can in fact be advantageous in the problem of random access. Specifically, we consider a multiple-access problem over a frame-asynchronous 2-user binary-input adder channel in the unsourced setup (2-UBAC). Previous work has shown that under perfect synchronization the per-user rates achievable with linear codes over the 2-UBAC are limited b… ▽ More In this work we demonstrate how a lack of synchronization can in fact be advantageous in the problem of random access. Specifically, we consider a multiple-access problem over a frame-asynchronous 2-user binary-input adder channel in the unsourced setup (2-UBAC). Previous work has shown that under perfect synchronization the per-user rates achievable with linear codes over the 2-UBAC are limited by 0.5 bit per channel use (compared to the capacity of 0.75). In this paper, we first demonstrate that arbitrary small (even single-bit) shift between the user's frames enables (random) linear codes to attain full capacity of 0.75 bit/user. Furthermore, we derive density evolution equations for irregular LDPC codes, and prove (via concentration arguments) that they correctly track the asymptotic bit-error rate of a BP decoder. Optimizing the degree distributions we construct LDPC codes achieving per-user rates of 0.73 bit per channel use. △ Less

Submitted 11 May, 2023; originally announced May 2023.

Comments: Accepted for presentation at IEEE ISIT 2023

arXiv:2305.05465 [pdf, other]

The emergence of clusters in self-attention dynamics

Authors: Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, Philippe Rigollet

Abstract: Viewing Transformers as interacting particle systems, we describe the geometry of learned representations when the weights are not time dependent. We show that particles, representing tokens, tend to cluster toward particular limiting objects as time tends to infinity. Cluster locations are determined by the initial tokens, confirming context-awareness of representations learned by Transformers. U… ▽ More Viewing Transformers as interacting particle systems, we describe the geometry of learned representations when the weights are not time dependent. We show that particles, representing tokens, tend to cluster toward particular limiting objects as time tends to infinity. Cluster locations are determined by the initial tokens, confirming context-awareness of representations learned by Transformers. Using techniques from dynamical systems and partial differential equations, we show that the type of limiting object that emerges depends on the spectrum of the value matrix. Additionally, in the one-dimensional case we prove that the self-attention matrix converges to a low-rank Boolean matrix. The combination of these results mathematically confirms the empirical observation made by Vaswani et al. [VSP'17] that leaders appear in a sequence of tokens when processed by Transformers. △ Less

Submitted 12 February, 2024; v1 submitted 9 May, 2023; originally announced May 2023.

arXiv:2303.14689 [pdf, other]

Weak Recovery Threshold for the Hypergraph Stochastic Block Model

Authors: Yuzhou Gu, Yury Polyanskiy

Abstract: We study the weak recovery problem on the $r$-uniform hypergraph stochastic block model ($r$-HSBM) with two balanced communities. In HSBM a random graph is constructed by placing hyperedges with higher density if all vertices of a hyperedge share the same binary label, and weak recovery asks to recover a non-trivial fraction of the labels. We introduce a multi-terminal version of strong data proce… ▽ More We study the weak recovery problem on the $r$-uniform hypergraph stochastic block model ($r$-HSBM) with two balanced communities. In HSBM a random graph is constructed by placing hyperedges with higher density if all vertices of a hyperedge share the same binary label, and weak recovery asks to recover a non-trivial fraction of the labels. We introduce a multi-terminal version of strong data processing inequalities (SDPIs), which we call the multi-terminal SDPI, and use it to prove a variety of impossibility results for weak recovery. In particular, we prove that weak recovery is impossible below the Kesten-Stigum (KS) threshold if $r=3,4$, or a strength parameter $λ$ is at least $\frac 15$. Prior work Pal and Zhu (2021) established that weak recovery in HSBM is always possible above the KS threshold. Consequently, there is no information-computation gap for these cases, which (partially) resolves a conjecture of Angelini et al. (2015). To our knowledge this is the first impossibility result for HSBM weak recovery. As usual, we reduce the study of non-recovery of HSBM to the study of non-reconstruction in a related broadcasting on hypertrees (BOHT) model. While we show that BOHT's reconstruction threshold coincides with KS for $r=3,4$, surprisingly, we demonstrate that for $r\ge 7$ reconstruction is possible also below KS. This shows an interesting phase transition in the parameter $r$, and suggests that for $r\ge 7$, there might be an information-computation gap for the HSBM. For $r=5,6$ and large degree we propose an approach for showing non-reconstruction below KS, suggesting that $r=7$ is the correct threshold for onset of the new phase. △ Less

Submitted 27 June, 2023; v1 submitted 26 March, 2023; originally announced March 2023.

arXiv:2303.14688 [pdf, ps, other]

Uniqueness of BP fixed point for the Potts model and applications to community detection

Authors: Yuzhou Gu, Yury Polyanskiy

Abstract: In the study of sparse stochastic block models (SBMs) one often needs to analyze a distributional recursion, known as the belief propagation (BP) recursion. Uniqueness of the fixed point of this recursion implies several results about the SBM, including optimal recovery algorithms for SBM (Mossel et al. (2016)) and SBM with side information (Mossel and Xu (2016)), and a formula for SBM mutual info… ▽ More In the study of sparse stochastic block models (SBMs) one often needs to analyze a distributional recursion, known as the belief propagation (BP) recursion. Uniqueness of the fixed point of this recursion implies several results about the SBM, including optimal recovery algorithms for SBM (Mossel et al. (2016)) and SBM with side information (Mossel and Xu (2016)), and a formula for SBM mutual information (Abbe et al. (2021)). The 2-community case corresponds to an Ising model, for which Yu and Polyanskiy (2022) established uniqueness for all cases. In this paper we analyze the $q$-ary Potts model, i.e., broadcasting of $q$-ary spins on a Galton-Watson tree with expected offspring degree $d$ through Potts channels with second-largest eigenvalue $λ$. We allow the intermediate vertices to be observed through noisy channels (side information). We prove that BP uniqueness holds with and without side information when $dλ^2 \ge 1 + C \max\{λ, q^{-1}\}\log q$ for some absolute constant $C>0$ independent of $q,λ,d$. For large $q$ and $λ= o(1/\log q)$, this is asymptotically achieving the Kesten-Stigum threshold $dλ^2=1$. These results imply mutual information formulas and optimal recovery algorithms for the $q$-community SBM in the corresponding ranges. For $q\ge 4$, Sly (2011); Mossel et al. (2022) showed that there exist choices of $q,λ,d$ below Kesten-Stigum (i.e. $dλ^2 < 1$) but reconstruction is possible. Somewhat surprisingly, we show that in such regimes BP uniqueness does not hold at least in the presence of weak side information. Our technical tool is a theory of $q$-ary symmetric channels, that we initiate here, generalizing the classical and widely-utilized information-theoretic characterization of BMS (binary memoryless symmetric) channels. △ Less

Submitted 27 June, 2023; v1 submitted 26 March, 2023; originally announced March 2023.

arXiv:2303.06438 [pdf, ps, other]

doi 10.1109/ICASSP49357.2023.10096702

On Neural Architectures for Deep Learning-based Source Separation of Co-Channel OFDM Signals

Authors: Gary C. F. Lee, Amir Weiss, Alejandro Lancho, Yury Polyanskiy, Gregory W. Wornell

Abstract: We study the single-channel source separation problem involving orthogonal frequency-division multiplexing (OFDM) signals, which are ubiquitous in many modern-day digital communication systems. Related efforts have been pursued in monaural source separation, where state-of-the-art neural architectures have been adopted to train an end-to-end separator for audio signals (as 1-dimensional time serie… ▽ More We study the single-channel source separation problem involving orthogonal frequency-division multiplexing (OFDM) signals, which are ubiquitous in many modern-day digital communication systems. Related efforts have been pursued in monaural source separation, where state-of-the-art neural architectures have been adopted to train an end-to-end separator for audio signals (as 1-dimensional time series). In this work, through a prototype problem based on the OFDM source model, we assess -- and question -- the efficacy of using audio-oriented neural architectures in separating signals based on features pertinent to communication waveforms. Perhaps surprisingly, we demonstrate that in some configurations, where perfect separation is theoretically attainable, these audio-oriented neural architectures perform poorly in separating co-channel OFDM waveforms. Yet, we propose critical domain-informed modifications to the network parameterization, based on insights from OFDM structures, that can confer about 30 dB improvement in performance. △ Less

Submitted 15 March, 2023; v1 submitted 11 March, 2023; originally announced March 2023.

arXiv:2302.04658 [pdf, ps, other]

The Sample Complexity of Approximate Rejection Sampling with Applications to Smoothed Online Learning

Authors: Adam Block, Yury Polyanskiy

Abstract: Suppose we are given access to $n$ independent samples from distribution $μ$ and we wish to output one of them with the goal of making the output distributed as close as possible to a target distribution $ν$. In this work we show that the optimal total variation distance as a function of $n$ is given by $\tildeΘ(\frac{D}{f'(n)})$ over the class of all pairs $ν,μ$ with a bounded $f$-divergence… ▽ More Suppose we are given access to $n$ independent samples from distribution $μ$ and we wish to output one of them with the goal of making the output distributed as close as possible to a target distribution $ν$. In this work we show that the optimal total variation distance as a function of $n$ is given by $\tildeΘ(\frac{D}{f'(n)})$ over the class of all pairs $ν,μ$ with a bounded $f$-divergence $D_f(ν\|μ)\leq D$. Previously, this question was studied only for the case when the Radon-Nikodym derivative of $ν$ with respect to $μ$ is uniformly bounded. We then consider an application in the seemingly very different field of smoothed online learning, where we show that recent results on the minimax regret and the regret of oracle-efficient algorithms still hold even under relaxed constraints on the adversary (to have bounded $f$-divergence, as opposed to bounded Radon-Nikodym derivative). Finally, we also study efficacy of importance sampling for mean estimates uniform over a function class and compare importance sampling with rejection sampling. △ Less

Submitted 23 February, 2024; v1 submitted 9 February, 2023; originally announced February 2023.

Comments: Corrected mistake in proof of Lemma 27 from the COLT 2023 version of this paper

arXiv:2211.15242 [pdf, other]

Ising Model on Locally Tree-like Graphs: Uniqueness of Solutions to Cavity Equations

Authors: Qian Yu, Yury Polyanskiy

Abstract: In the study of Ising models on large locally tree-like graphs, in both rigorous and non-rigorous methods one is often led to understanding the so-called belief propagation distributional recursions and its fixed points. We prove that there is at most one non-trivial fixed point for Ising models with zero or certain random external fields. Previously this was only known for sufficiently ``low-temp… ▽ More In the study of Ising models on large locally tree-like graphs, in both rigorous and non-rigorous methods one is often led to understanding the so-called belief propagation distributional recursions and its fixed points. We prove that there is at most one non-trivial fixed point for Ising models with zero or certain random external fields. Previously this was only known for sufficiently ``low-temperature'' models. Our main innovation is in applying information-theoretic ideas of channel comparison leading to a new metric (degradation index) between binary-input-symmetric (BMS) channels under which the Belief Propagation (BP) operator is a strict contraction (albeit non-multiplicative). A key ingredient of our proof is a strengthening of the classical stringy tree lemma of (Evans-Kenyon-Peres-Schulman'00). Our result simultaneously closes the following 6 conjectures in the literature: 1) independence of robust reconstruction accuracy to leaf noise in broadcasting on trees (Mossel-Neeman-Sly'16); 2) uselessness of global information for a labeled 2-community stochastic block model, or 2-SBM (Kanade-Mossel-Schramm'16); 3) optimality of local algorithms for 2-SBM under noisy side information (Mossel-Xu'16); 4) uniqueness of BP fixed point in broadcasting on trees in the Gaussian (large degree) limit (ibid); 5) boundary irrelevance in broadcasting on trees (Abbe-Cornacchia-Gu-Polyanskiy'21); 6) characterization of entropy (and mutual information) of community labels given the graph in 2-SBM (ibid). △ Less

Submitted 31 July, 2023; v1 submitted 28 November, 2022; originally announced November 2022.

arXiv:2211.01126 [pdf, other]

Likelihood-free hypothesis testing

Authors: Patrik Róbert Gerber, Yury Polyanskiy

Abstract: Consider the problem of binary hypothesis testing. Given $Z$ coming from either $\mathbb P^{\otimes m}$ or $\mathbb Q^{\otimes m}$, to decide between the two with small probability of error it is sufficient, and in many cases necessary, to have $m\asymp1/ε^2$, where $ε$ measures the separation between $\mathbb P$ and $\mathbb Q$ in total variation ($\mathsf{TV}$). Achieving this, however, requires… ▽ More Consider the problem of binary hypothesis testing. Given $Z$ coming from either $\mathbb P^{\otimes m}$ or $\mathbb Q^{\otimes m}$, to decide between the two with small probability of error it is sufficient, and in many cases necessary, to have $m\asymp1/ε^2$, where $ε$ measures the separation between $\mathbb P$ and $\mathbb Q$ in total variation ($\mathsf{TV}$). Achieving this, however, requires complete knowledge of the distributions and can be done, for example, using the Neyman-Pearson test. In this paper we consider a variation of the problem which we call likelihood-free hypothesis testing, where access to $\mathbb P$ and $\mathbb Q$ is given through $n$ i.i.d. observations from each. In the case when $\mathbb P$ and $\mathbb Q$ are assumed to belong to a non-parametric family, we demonstrate the existence of a fundamental trade-off between $n$ and $m$ given by $nm\asymp n_\sf{GoF}^2(ε)$, where $n_\sf{GoF}(ε)$ is the minimax sample complexity of testing between the hypotheses $H_0:\, \mathbb P=\mathbb Q$ vs $H_1:\, \mathsf{TV}(\mathbb P,\mathbb Q)\geqε$. We show this for three families of distributions, in addition to the family of all discrete distributions for which we obtain a more complicated trade-off exhibiting an additional phase-transition. Our results demonstrate the possibility of testing without fully estimating $\mathbb P$ and $\mathbb Q$, provided $m \gg 1/ε^2$. △ Less

Submitted 7 March, 2024; v1 submitted 2 November, 2022; originally announced November 2022.

Comments: 58 pages, 1 figure

arXiv:2210.01951 [pdf, other]

Finite-Blocklength Results for the A-channel: Applications to Unsourced Random Access and Group Testing

Authors: Alejandro Lancho, Alexander Fengler, Yury Polyanskiy

Abstract: We present finite-blocklength achievability bounds for the unsourced A-channel. In this multiple-access channel, users noiselessly transmit codewords picked from a common codebook with entries generated from a $q$-ary alphabet. At each channel use, the receiver observes the set of different transmitted symbols but not their multiplicity. We show that the A-channel finds applications in unsourced r… ▽ More We present finite-blocklength achievability bounds for the unsourced A-channel. In this multiple-access channel, users noiselessly transmit codewords picked from a common codebook with entries generated from a $q$-ary alphabet. At each channel use, the receiver observes the set of different transmitted symbols but not their multiplicity. We show that the A-channel finds applications in unsourced random-access (URA) and group testing. Leveraging the insights provided by the finite-blocklength bounds and the connection between URA and non-adaptive group testing through the A-channel, we propose improved decoding methods for state-of-the-art A-channel codes and we showcase how A-channel codes provide a new class of structured group testing matrices. The developed bounds allow to evaluate the achievable error probabilities of group testing matrices based on random A-channel codes for arbitrary numbers of tests, items and defectives. We show that such a construction asymptotically achieves the optimal number of tests. In addition, every efficiently decodable A-channel code can be used to construct a group testing matrix with sub-linear recovery time. △ Less

Submitted 4 October, 2022; originally announced October 2022.

Comments: 11 pages, 4 figures, extended version of the paper presented at the 58th Annual Allerton Conference on Communication, Control, and Computing (2022)

arXiv:2209.04871 [pdf, other]

doi 10.1109/GLOBECOM48099.2022.10001513

Data-Driven Blind Synchronization and Interference Rejection for Digital Communication Signals

Authors: Alejandro Lancho, Amir Weiss, Gary C. F. Lee, Jennifer Tang, Yuheng Bu, Yury Polyanskiy, Gregory W. Wornell

Abstract: We study the potential of data-driven deep learning methods for separation of two communication signals from an observation of their mixture. In particular, we assume knowledge on the generation process of one of the signals, dubbed signal of interest (SOI), and no knowledge on the generation process of the second signal, referred to as interference. This form of the single-channel source separati… ▽ More We study the potential of data-driven deep learning methods for separation of two communication signals from an observation of their mixture. In particular, we assume knowledge on the generation process of one of the signals, dubbed signal of interest (SOI), and no knowledge on the generation process of the second signal, referred to as interference. This form of the single-channel source separation problem is also referred to as interference rejection. We show that capturing high-resolution temporal structures (nonstationarities), which enables accurate synchronization to both the SOI and the interference, leads to substantial performance gains. With this key insight, we propose a domain-informed neural network (NN) design that is able to improve upon both "off-the-shelf" NNs and classical detection and interference rejection methods, as demonstrated in our simulations. Our findings highlight the key role communication-specific domain knowledge plays in the development of data-driven approaches that hold the promise of unprecedented gains. △ Less

Submitted 11 September, 2022; originally announced September 2022.

Comments: 9 pages, 6 figures, accepted at IEEE GLOBECOM 2022 (this version contains extended proofs)

arXiv:2209.01328 [pdf, other]

Optimal empirical Bayes estimation for the Poisson model via minimum-distance methods

Authors: Soham Jana, Yury Polyanskiy, Yihong Wu

Abstract: The Robbins estimator is the most iconic and widely used procedure in the empirical Bayes literature for the Poisson model. On one hand, this method has been recently shown to be minimax optimal in terms of the regret (excess risk over the Bayesian oracle that knows the true prior) for various nonparametric classes of priors. On the other hand, it has been long recognized in practice that Robbins… ▽ More The Robbins estimator is the most iconic and widely used procedure in the empirical Bayes literature for the Poisson model. On one hand, this method has been recently shown to be minimax optimal in terms of the regret (excess risk over the Bayesian oracle that knows the true prior) for various nonparametric classes of priors. On the other hand, it has been long recognized in practice that Robbins estimator lacks the desired smoothness and monotonicity of Bayes estimators and can be easily derailed by those data points that were rarely observed before. Based on the minimum-distance distance method, we propose a suite of empirical Bayes estimators, including the classical nonparametric maximum likelihood, that outperform the Robbins method in a variety of synthetic and real data sets and retain its optimality in terms of minimax regret. △ Less

Submitted 7 May, 2024; v1 submitted 3 September, 2022; originally announced September 2022.

Comments: 28 pages, 7 figures, 3 tables. Added an extension to a multivariate setup and a comparison with the worst case prior

MSC Class: 62G05; 62G07

arXiv:2208.10325 [pdf, other]

doi 10.1109/MLSP55214.2022.9943311

Exploiting Temporal Structures of Cyclostationary Signals for Data-Driven Single-Channel Source Separation

Authors: Gary C. F. Lee, Amir Weiss, Alejandro Lancho, Jennifer Tang, Yuheng Bu, Yury Polyanskiy, Gregory W. Wornell

Abstract: We study the problem of single-channel source separation (SCSS), and focus on cyclostationary signals, which are particularly suitable in a variety of application domains. Unlike classical SCSS approaches, we consider a setting where only examples of the sources are available rather than their models, inspiring a data-driven approach. For source models with underlying cyclostationary Gaussian cons… ▽ More We study the problem of single-channel source separation (SCSS), and focus on cyclostationary signals, which are particularly suitable in a variety of application domains. Unlike classical SCSS approaches, we consider a setting where only examples of the sources are available rather than their models, inspiring a data-driven approach. For source models with underlying cyclostationary Gaussian constituents, we establish a lower bound on the attainable mean squared error (MSE) for any separation method, model-based or data-driven. Our analysis further reveals the operation for optimal separation and the associated implementation challenges. As a computationally attractive alternative, we propose a deep learning approach using a U-Net architecture, which is competitive with the minimum MSE estimator. We demonstrate in simulation that, with suitable domain-informed architectural choices, our U-Net method can approach the optimal performance with substantially reduced computational burden. △ Less

Submitted 22 August, 2022; originally announced August 2022.

arXiv:2205.03752 [pdf, other]

Efficient Representation of Large-Alphabet Probability Distributions

Authors: Aviv Adler, Jennifer Tang, Yury Polyanskiy

Abstract: A number of engineering and scientific problems require representing and manipulating probability distributions over large alphabets, which we may think of as long vectors of reals summing to $1$. In some cases it is required to represent such a vector with only $b$ bits per entry. A natural choice is to partition the interval $[0,1]$ into $2^b$ uniform bins and quantize entries to each bin indepe… ▽ More A number of engineering and scientific problems require representing and manipulating probability distributions over large alphabets, which we may think of as long vectors of reals summing to $1$. In some cases it is required to represent such a vector with only $b$ bits per entry. A natural choice is to partition the interval $[0,1]$ into $2^b$ uniform bins and quantize entries to each bin independently. We show that a minor modification of this procedure -- applying an entrywise non-linear function (compander) $f(x)$ prior to quantization -- yields an extremely effective quantization method. For example, for $b=8 (16)$ and $10^5$-sized alphabets, the quality of representation improves from a loss (under KL divergence) of $0.5 (0.1)$ bits/entry to $10^{-4} (10^{-9})$ bits/entry. Compared to floating point representations, our compander method improves the loss from $10^{-1}(10^{-6})$ to $10^{-4}(10^{-9})$ bits/entry. These numbers hold for both real-world data (word frequencies in books and DNA $k$-mer counts) and for synthetic randomly generated distributions. Theoretically, we set up a minimax optimality criterion and show that the compander $f(x) ~\propto~ \mathrm{ArcSinh}(\sqrt{(1/2) (K \log K) x})$ achieves near-optimal performance, attaining a KL-quantization loss of $\asymp 2^{-2b} \log^2 K$ for a $K$-letter alphabet and $b\to \infty$. Interestingly, a similar minimax criterion for the quadratic loss on the hypercube shows optimality of the standard uniform quantizer. This suggests that the $\mathrm{ArcSinh}$ quantizer is as fundamental for KL-distortion as the uniform quantizer for quadratic distortion. △ Less

Submitted 26 October, 2023; v1 submitted 7 May, 2022; originally announced May 2022.

Comments: corrected typos

arXiv:2205.02128 [pdf, ps, other]

Rate of convergence of the smoothed empirical Wasserstein distance

Authors: Adam Block, Zeyu Jia, Yury Polyanskiy, Alexander Rakhlin

Abstract: Consider an empirical measure $\mathbb{P}_n$ induced by $n$ iid samples from a $d$-dimensional $K$-subgaussian distribution $\mathbb{P}$ and let $γ= N(0,σ^2 I_d)$ be the isotropic Gaussian measure. We study the speed of convergence of the smoothed Wasserstein distance $W_2(\mathbb{P}_n * γ, \mathbb{P}*γ) = n^{-α+ o(1)}$ with $*$ being the convolution of measures. For $K<σ$ and in any dimension… ▽ More Consider an empirical measure $\mathbb{P}_n$ induced by $n$ iid samples from a $d$-dimensional $K$-subgaussian distribution $\mathbb{P}$ and let $γ= N(0,σ^2 I_d)$ be the isotropic Gaussian measure. We study the speed of convergence of the smoothed Wasserstein distance $W_2(\mathbb{P}_n * γ, \mathbb{P}*γ) = n^{-α+ o(1)}$ with $*$ being the convolution of measures. For $K<σ$ and in any dimension $d\ge 1$ we show that $α= {1\over2}$. For $K>σ$ in dimension $d=1$ we show that the rate is slower and is given by $α= {(σ^2 + K^2)^2\over 4 (σ^4 + K^4)} < 1/2$. This resolves several open problems in [GGNWP20], and in particular precisely identifies the amount of smoothing $σ$ needed to obtain a parametric rate. In addition, for any $d$-dimensional $K$-subgaussian distribution $\mathbb{P}$, we also establish that $D_{KL}(\mathbb{P}_n * γ\|\mathbb{P}*γ)$ has rate $O(1/n)$ for $K<σ$ but only slows down to $O({(\log n)^{d+1}\over n})$ for $K>σ$. The surprising difference of the behavior of $W_2^2$ and KL implies the failure of $T_{2}$-transportation inequality when $σ< K$. Consequently, it follows that for $K>σ$ the log-Sobolev inequality (LSI) for the Gaussian mixture $\mathbb{P} * N(0, σ^{2})$ cannot hold. This closes an open problem in [WW+16], who established the LSI under the condition $K<σ$ and asked if their bound can be improved. △ Less

Submitted 8 February, 2025; v1 submitted 4 May, 2022; originally announced May 2022.

arXiv:2204.06357 [pdf, other]

Linear Programs with Polynomial Coefficients and Applications to 1D Cellular Automata

Authors: Guy Bresler, Chenghao Guo, Yury Polyanskiy

Abstract: Given a matrix $A$ and vector $b$ with polynomial entries in $d$ real variables $δ=(δ_1,\ldots,δ_d)$ we consider the following notion of feasibility: the pair $(A,b)$ is locally feasible if there exists an open neighborhood $U$ of $0$ such that for every $δ\in U$ there exists $x$ satisfying $A(δ)x\ge b(δ)$ entry-wise. For $d=1$ we construct a polynomial time algorithm for deciding local feasibilit… ▽ More Given a matrix $A$ and vector $b$ with polynomial entries in $d$ real variables $δ=(δ_1,\ldots,δ_d)$ we consider the following notion of feasibility: the pair $(A,b)$ is locally feasible if there exists an open neighborhood $U$ of $0$ such that for every $δ\in U$ there exists $x$ satisfying $A(δ)x\ge b(δ)$ entry-wise. For $d=1$ we construct a polynomial time algorithm for deciding local feasibility. For $d \ge 2$ we show local feasibility is NP-hard. This also gives the first polynomial-time algorithm for the asymptotic linear program problem introduced by Jeroslow in 1973. As an application (which was the primary motivation for this work) we give a computer-assisted proof of ergodicity of the following elementary 1D cellular automaton: given the current state $η_t \in \{0,1\}^{\mathbb{Z}}$ the next state $η_{t+1}(n)$ at each vertex $n\in \mathbb{Z}$ is obtained by $η_{t+1}(n)= \text{NAND}\big(\text{BSC}_δ(η_t(n-1)), \text{BSC}_δ(η_t(n))\big)$. Here the binary symmetric channel $\text{BSC}_δ$ takes a bit as input and flips it with probability $δ$ (and leaves it unchanged with probability $1-δ$). It is shown that there exists $δ_0>0$ such that for all $0<δ<δ_0$ the distribution of $η_t$ converges to a unique stationary measure irrespective of the initial condition $η_0$. We also consider the problem of broadcasting information on the 2D-grid of noisy binary-symmetric channels $\text{BSC}_δ$, where each node may apply an arbitrary processing function to its input bits. We prove that there exists $δ_0'>0$ such that for all noise levels $0<δ<δ_0'$ it is impossible to broadcast information for any processing function, as conjectured by Makur, Mossel and Polyanskiy. △ Less

Submitted 9 May, 2023; v1 submitted 13 April, 2022; originally announced April 2022.

arXiv:2111.00559 [pdf, other]

doi 10.1109/TIT.2023.3247812

Capacity of Noisy Permutation Channels

Authors: Jennifer Tang, Yury Polyanskiy

Abstract: We establish the capacity of a class of communication channels introduced in [1]. The $n$-letter input from a finite alphabet is passed through a discrete memoryless channel $P_{Z|X}$ and then the output $n$-letter sequence is uniformly permuted. We show that the maximal communication rate (normalized by $\log n$) equals $1/2 (rank(P_{Z|X})-1)$ whenever $P_{Z|X}$ is strictly positive. This is done… ▽ More We establish the capacity of a class of communication channels introduced in [1]. The $n$-letter input from a finite alphabet is passed through a discrete memoryless channel $P_{Z|X}$ and then the output $n$-letter sequence is uniformly permuted. We show that the maximal communication rate (normalized by $\log n$) equals $1/2 (rank(P_{Z|X})-1)$ whenever $P_{Z|X}$ is strictly positive. This is done by establishing a converse bound matching the achievability of [1]. The two main ingredients of our proof are (1) a sharp bound on the entropy of a uniformly sampled vector from a type class and observed through a DMC; and (2) the covering $ε$-net of a probability simplex with Kullback-Leibler divergence as a metric. In addition to strictly positive DMC we also find the noisy permutation capacity for $q$-ary erasure channels, the Z-channel and others. △ Less

Submitted 26 October, 2023; v1 submitted 31 October, 2021; originally announced November 2021.

Comments: draft version updated to newer version

Journal ref: in IEEE Transactions on Information Theory, vol. 69, no. 7, pp. 4145-4162, July 2023

arXiv:2109.03943 [pdf, ps, other]

Sharp regret bounds for empirical Bayes and compound decision problems

Authors: Yury Polyanskiy, Yihong Wu

Abstract: We consider the classical problems of estimating the mean of an $n$-dimensional normally (with identity covariance matrix) or Poisson distributed vector under the squared loss. In a Bayesian setting the optimal estimator is given by the prior-dependent conditional mean. In a frequentist setting various shrinkage methods were developed over the last century. The framework of empirical Bayes, put fo… ▽ More We consider the classical problems of estimating the mean of an $n$-dimensional normally (with identity covariance matrix) or Poisson distributed vector under the squared loss. In a Bayesian setting the optimal estimator is given by the prior-dependent conditional mean. In a frequentist setting various shrinkage methods were developed over the last century. The framework of empirical Bayes, put forth by Robbins (1956), combines Bayesian and frequentist mindsets by postulating that the parameters are independent but with an unknown prior and aims to use a fully data-driven estimator to compete with the Bayesian oracle that knows the true prior. The central figure of merit is the regret, namely, the total excess risk over the Bayes risk in the worst case (over the priors). Although this paradigm was introduced more than 60 years ago, little is known about the asymptotic scaling of the optimal regret in the nonparametric setting. We show that for the Poisson model with compactly supported and subexponential priors, the optimal regret scales as $Θ((\frac{\log n}{\log\log n})^2)$ and $Θ(\log^3 n)$, respectively, both attained by the original estimator of Robbins. For the normal mean model, the regret is shown to be at least $Ω((\frac{\log n}{\log\log n})^2)$ and $Ω(\log^2 n)$ for compactly supported and subgaussian priors, respectively, the former of which resolves the conjecture of Singh (1979) on the impossibility of achieving bounded regret; before this work, the best regret lower bound was $Ω(1)$. In addition to the empirical Bayes setting, these results are shown to hold in the compound setting where the parameters are deterministic. As a side application, the construction in this paper also leads to improved or new lower bounds for density estimation of Gaussian and Poisson mixtures. △ Less

Submitted 10 September, 2021; v1 submitted 8 September, 2021; originally announced September 2021.

arXiv:2106.04018 [pdf, other]

Intrinsic Dimension Estimation Using Wasserstein Distances

Authors: Adam Block, Zeyu Jia, Yury Polyanskiy, Alexander Rakhlin

Abstract: It has long been thought that high-dimensional data encountered in many practical machine learning tasks have low-dimensional structure, i.e., the manifold hypothesis holds. A natural question, thus, is to estimate the intrinsic dimension of a given population distribution from a finite sample. We introduce a new estimator of the intrinsic dimension and provide finite sample, non-asymptotic guaran… ▽ More It has long been thought that high-dimensional data encountered in many practical machine learning tasks have low-dimensional structure, i.e., the manifold hypothesis holds. A natural question, thus, is to estimate the intrinsic dimension of a given population distribution from a finite sample. We introduce a new estimator of the intrinsic dimension and provide finite sample, non-asymptotic guarantees. We then apply our techniques to get new sample complexity bounds for Generative Adversarial Networks (GANs) depending only on the intrinsic dimension of the data. △ Less

Submitted 31 May, 2022; v1 submitted 7 June, 2021; originally announced June 2021.

arXiv:2102.00050 [pdf, ps, other]

Sequential prediction under log-loss and misspecification

Authors: Meir Feder, Yury Polyanskiy

Abstract: We consider the question of sequential prediction under the log-loss in terms of cumulative regret. Namely, given a hypothesis class of distributions, learner sequentially predicts the (distribution of the) next letter in sequence and its performance is compared to the baseline of the best constant predictor from the hypothesis class. The well-specified case corresponds to an additional assumption… ▽ More We consider the question of sequential prediction under the log-loss in terms of cumulative regret. Namely, given a hypothesis class of distributions, learner sequentially predicts the (distribution of the) next letter in sequence and its performance is compared to the baseline of the best constant predictor from the hypothesis class. The well-specified case corresponds to an additional assumption that the data-generating distribution belongs to the hypothesis class as well. Here we present results in the more general misspecified case. Due to special properties of the log-loss, the same problem arises in the context of competitive-optimality in density estimation, and model selection. For the $d$-dimensional Gaussian location hypothesis class, we show that cumulative regrets in the well-specified and misspecified cases asymptotically coincide. In other words, we provide an $o(1)$ characterization of the distribution-free (or PAC) regret in this case -- the first such result as far as we know. We recall that the worst-case (or individual-sequence) regret in this case is larger by an additive constant ${d\over 2} + o(1)$. Surprisingly, neither the traditional Bayesian estimators, nor the Shtarkov's normalized maximum likelihood achieve the PAC regret and our estimator requires special "robustification" against heavy-tailed data. In addition, we show two general results for misspecified regret: the existence and uniqueness of the optimal estimator, and the bound sandwiching the misspecified regret between well-specified regrets with (asymptotically) close hypotheses classes. △ Less

Submitted 15 September, 2021; v1 submitted 29 January, 2021; originally announced February 2021.

arXiv:2101.12601 [pdf, other]

Stochastic block model entropy and broadcasting on trees with survey

Authors: Emmanuel Abbe, Elisabetta Cornacchia, Yuzhou Gu, Yury Polyanskiy

Abstract: The limit of the entropy in the stochastic block model (SBM) has been characterized in the sparse regime for the special case of disassortative communities [COKPZ17] and for the classical case of assortative communities but in the dense regime [DAM16]. The problem has not been closed in the classical sparse and assortative case. This paper establishes the result in this case for any SNR besides fo… ▽ More The limit of the entropy in the stochastic block model (SBM) has been characterized in the sparse regime for the special case of disassortative communities [COKPZ17] and for the classical case of assortative communities but in the dense regime [DAM16]. The problem has not been closed in the classical sparse and assortative case. This paper establishes the result in this case for any SNR besides for the interval (1, 3.513). It further gives an approximation to the limit in this window. The result is obtained by expressing the global SBM entropy as an integral of local tree entropies in a broadcasting on tree model with erasure side-information. The main technical advancement then relies on showing the irrelevance of the boundary in such a model, also studied with variants in [KMS16], [MNS16] and [MX15]. In particular, we establish the uniqueness of the BP fixed point in the survey model for any SNR above 3.513 or below 1. This only leaves a narrow region in the plane between SNR and survey strength where the uniqueness of BP conjectured in these papers remains unproved. △ Less

Submitted 29 January, 2021; originally announced January 2021.

arXiv:2010.01987 [pdf, ps, other]

Strong data processing constant is achieved by binary inputs

Authors: Or Ordentlich, Yury Polyanskiy

Abstract: For any channel $P_{Y|X}$ the strong data processing constant is defined as the smallest number $η_{KL}\in[0,1]$ such that $I(U;Y)\le η_{KL} I(U;X)$ holds for any Markov chain $U-X-Y$. It is shown that the value of $η_{KL}$ is given by that of the best binary-input subchannel of $P_{Y|X}$. The same result holds for any $f$-divergence, verifying a conjecture of Cohen, Kemperman and Zbaganu (1998). For any channel $P_{Y|X}$ the strong data processing constant is defined as the smallest number $η_{KL}\in[0,1]$ such that $I(U;Y)\le η_{KL} I(U;X)$ holds for any Markov chain $U-X-Y$. It is shown that the value of $η_{KL}$ is given by that of the best binary-input subchannel of $P_{Y|X}$. The same result holds for any $f$-divergence, verifying a conjecture of Cohen, Kemperman and Zbaganu (1998). △ Less

Submitted 3 June, 2021; v1 submitted 15 September, 2020; originally announced October 2020.

Comments: 1 page

arXiv:2010.01390 [pdf, other]

doi 10.1109/TIT.2022.3177667

Broadcasting on Two-Dimensional Regular Grids

Authors: Anuran Makur, Elchanan Mossel, Yury Polyanskiy

Abstract: We study a specialization of the problem of broadcasting on directed acyclic graphs, namely, broadcasting on 2D regular grids. Consider a 2D regular grid with source vertex $X$ at layer $0$ and $k+1$ vertices at layer $k\geq 1$, which are at distance $k$ from $X$. Every vertex of the 2D regular grid has outdegree $2$, the vertices at the boundary have indegree $1$, and all other vertices have inde… ▽ More We study a specialization of the problem of broadcasting on directed acyclic graphs, namely, broadcasting on 2D regular grids. Consider a 2D regular grid with source vertex $X$ at layer $0$ and $k+1$ vertices at layer $k\geq 1$, which are at distance $k$ from $X$. Every vertex of the 2D regular grid has outdegree $2$, the vertices at the boundary have indegree $1$, and all other vertices have indegree $2$. At time $0$, $X$ is given a random bit. At time $k\geq 1$, each vertex in layer $k$ receives transmitted bits from its parents in layer $k-1$, where the bits pass through binary symmetric channels with noise level $δ\in(0,1/2)$. Then, each vertex combines its received bits using a common Boolean processing function to produce an output bit. The objective is to recover $X$ with probability of error better than $1/2$ from all vertices at layer $k$ as $k \rightarrow \infty$. Besides their natural interpretation in communication networks, such broadcasting processes can be construed as 1D probabilistic cellular automata (PCA) with boundary conditions that limit the number of sites at each time $k$ to $k+1$. We conjecture that it is impossible to propagate information in a 2D regular grid regardless of the noise level and the choice of processing function. In this paper, we make progress towards establishing this conjecture, and prove using ideas from percolation and coding theory that recovery of $X$ is impossible for any $δ$ provided that all vertices use either AND or XOR processing functions. Furthermore, we propose a martingale-based approach that establishes the impossibility of recovering $X$ for any $δ$ when all NAND processing functions are used if certain supermartingales can be rigorously constructed. We also provide numerical evidence for the existence of these supermartingales by computing explicit examples for different values of $δ$ via linear programming. △ Less

Submitted 16 September, 2022; v1 submitted 3 October, 2020; originally announced October 2020.

Comments: 38 pages, double column format, 2 figures

Journal ref: IEEE Transactions on Information Theory, vol. 68, no. 10, Oct. 2022

arXiv:2008.13372 [pdf, other]

Note on approximating the Laplace transform of a Gaussian on a complex disk

Authors: Yury Polyanskiy, Yihong Wu

Abstract: In this short note we study how well a Gaussian distribution can be approximated by distributions supported on $[-a,a]$. Perhaps, the natural conjecture is that for large $a$ the almost optimal choice is given by truncating the Gaussian to $[-a,a]$. Indeed, such approximation achieves the optimal rate of $e^{-Θ(a^2)}$ in terms of the $L_\infty$-distance between characteristic functions. However, i… ▽ More In this short note we study how well a Gaussian distribution can be approximated by distributions supported on $[-a,a]$. Perhaps, the natural conjecture is that for large $a$ the almost optimal choice is given by truncating the Gaussian to $[-a,a]$. Indeed, such approximation achieves the optimal rate of $e^{-Θ(a^2)}$ in terms of the $L_\infty$-distance between characteristic functions. However, if we consider the $L_\infty$-distance between Laplace transforms on a complex disk, the optimal rate is $e^{-Θ(a^2 \log a)}$, while truncation still only attains $e^{-Θ(a^2)}$. The optimal rate can be attained by the Gauss-Hermite quadrature. As corollary, we also construct a ``super-flat'' Gaussian mixture of $Θ(a^2)$ components with means in $[-a,a]$ and whose density has all derivatives bounded by $e^{-Ω(a^2 \log(a))}$ in the $O(1)$-neighborhood of the origin. △ Less

Submitted 31 August, 2020; originally announced August 2020.

arXiv:2008.08244 [pdf, ps, other]

Self-regularizing Property of Nonparametric Maximum Likelihood Estimator in Mixture Models

Authors: Yury Polyanskiy, Yihong Wu

Abstract: Introduced by Kiefer and Wolfowitz \cite{KW56}, the nonparametric maximum likelihood estimator (NPMLE) is a widely used methodology for learning mixture odels and empirical Bayes estimation. Sidestepping the non-convexity in mixture likelihood, the NPMLE estimates the mixing distribution by maximizing the total likelihood over the space of probability measures, which can be viewed as an extreme fo… ▽ More Introduced by Kiefer and Wolfowitz \cite{KW56}, the nonparametric maximum likelihood estimator (NPMLE) is a widely used methodology for learning mixture odels and empirical Bayes estimation. Sidestepping the non-convexity in mixture likelihood, the NPMLE estimates the mixing distribution by maximizing the total likelihood over the space of probability measures, which can be viewed as an extreme form of overparameterization. In this paper we discover a surprising property of the NPMLE solution. Consider, for example, a Gaussian mixture model on the real line with a subgaussian mixing distribution. Leveraging complex-analytic techniques, we show that with high probability the NPMLE based on a sample of size $n$ has $O(\log n)$ atoms (mass points), significantly improving the deterministic upper bound of $n$ due to Lindsay \cite{lindsay1983geometry1}. Notably, any such Gaussian mixture is statistically indistinguishable from a finite one with $O(\log n)$ components (and this is tight for certain mixtures). Thus, absent any explicit form of model selection, NPMLE automatically chooses the right model complexity, a property we term \emph{self-regularization}. Extensions to other exponential families are given. As a statistical application, we show that this structural property can be harnessed to bootstrap existing Hellinger risk bound of the (parametric) MLE for finite Gaussian mixtures to the NPMLE for general Gaussian mixtures, recovering a result of Zhang \cite{zhang2009generalized}. △ Less

Submitted 6 September, 2020; v1 submitted 18 August, 2020; originally announced August 2020.

Comments: Refer conjecture of [DYPS20] in Sec 5.4 to the arxiv version arXiv:1901.03264v4

arXiv:2005.10561 [pdf, ps, other]

Extrapolating the profile of a finite population

Authors: Soham Jana, Yury Polyanskiy, Yihong Wu

Abstract: We study a prototypical problem in empirical Bayes. Namely, consider a population consisting of $k$ individuals each belonging to one of $k$ types (some types can be empty). Without any structural restrictions, it is impossible to learn the composition of the full population having observed only a small (random) subsample of size $m = o(k)$. Nevertheless, we show that in the sublinear regime of… ▽ More We study a prototypical problem in empirical Bayes. Namely, consider a population consisting of $k$ individuals each belonging to one of $k$ types (some types can be empty). Without any structural restrictions, it is impossible to learn the composition of the full population having observed only a small (random) subsample of size $m = o(k)$. Nevertheless, we show that in the sublinear regime of $m =ω(k/\log k)$, it is possible to consistently estimate in total variation the \emph{profile} of the population, defined as the empirical distribution of the sizes of each type, which determines many symmetric properties of the population. We also prove that in the linear regime of $m=c k$ for any constant $c$ the optimal rate is $Θ(1/\log k)$. Our estimator is based on Wolfowitz's minimum distance method, which entails solving a linear program (LP) of size $k$. We show that there is a single infinite-dimensional LP whose value simultaneously characterizes the risk of the minimum distance estimator and certifies its minimax optimality. The sharp convergence rate is obtained by evaluating this LP using complex-analytic techniques. △ Less

Submitted 21 May, 2020; originally announced May 2020.

Showing 1–50 of 107 results for author: Polyanskiy, Y