-
Flattened one-bit stochastic gradient descent: compressed distributed optimization with controlled variance
Authors:
Alexander Stollenwerk,
Laurent Jacques
Abstract:
We propose a novel algorithm for distributed stochastic gradient descent (SGD) with compressed gradient communication in the parameter-server framework. Our gradient compression technique, named flattened one-bit stochastic gradient descent (FO-SGD), relies on two simple algorithmic ideas: (i) a one-bit quantization procedure leveraging the technique of dithering, and (ii) a randomized fast Walsh-Hadamard transform to flatten the stochastic gradient before quantization. As a result, the approximation of the true gradient in this scheme is biased, but it prevents commonly encountered algorithmic problems, such as exploding variance in the one-bit compression regime, deterioration of performance in the case of sparse gradients, and restrictive assumptions on the distribution of the stochastic gradients. In fact, we show SGD-like convergence guarantees under mild conditions. The compression technique can be used in both directions of worker-server communication, therefore admitting distributed optimization with full communication compression.
Submitted 17 May, 2024;
originally announced May 2024.
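The two ideas named in the FO-SGD abstract above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`fwht`, `compress`, `decompress`), the uniform dither on `[-lam, lam]`, and the decoding rule are assumptions chosen for illustration. The sketch flattens a gradient with random signs followed by an orthonormal Walsh-Hadamard transform, keeps one dithered sign bit per coordinate, and inverts the transform on the receiving side.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = np.asarray(x, dtype=float).copy()
    n = x.shape[0]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b          # butterfly: sums
            x[i + h:i + 2 * h] = a - b  # butterfly: differences
        h *= 2
    return x / np.sqrt(n)  # scaling makes the transform its own inverse

def compress(grad, signs, lam, rng):
    """Flatten with random signs + Hadamard, then keep one dithered sign bit per entry."""
    flat = fwht(signs * grad)
    dither = rng.uniform(-lam, lam, size=flat.shape)  # assumed dither distribution
    return np.where(flat + dither >= 0, 1.0, -1.0)

def decompress(bits, signs, lam):
    """Rescale the bits by lam and undo the flattening (assumed decoding rule)."""
    return signs * fwht(lam * bits)

rng = np.random.default_rng(0)
n = 16
signs = rng.choice([-1.0, 1.0], size=n)  # shared by sender and receiver
grad = np.zeros(n)
grad[3] = 1.0                            # a maximally sparse gradient
bits = compress(grad, signs, lam=0.5, rng=rng)
approx = decompress(bits, signs, lam=0.5)
```

In this sketch a one-sparse gradient is spread into entries of equal magnitude, so a single threshold scale `lam` suits every coordinate. For a flattened entry `z` with `|z| <= lam`, the expectation of `lam * sign(z + dither)` equals `z`; entries larger than `lam` are clipped, which is one way a bias can enter.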
-
Fast metric embedding into the Hamming cube
Authors:
Sjoerd Dirksen,
Shahar Mendelson,
Alexander Stollenwerk
Abstract:
We consider the problem of embedding a subset of $\mathbb{R}^n$ into a low-dimensional Hamming cube in an almost isometric way. We construct a simple, data-oblivious, and computationally efficient map that achieves this task with high probability: we first apply a specific structured random matrix, which we call the double circulant matrix; using that matrix requires linear storage and matrix-vector multiplication can be performed in near-linear time. We then binarize each vector by comparing each of its entries to a random threshold, selected uniformly at random from a well-chosen interval.
We estimate the number of bits required for this encoding scheme in terms of two natural geometric complexity parameters of the set - its Euclidean covering numbers and its localized Gaussian complexity. The estimate we derive turns out to be the best that one can hope for - up to logarithmic terms.
The key to the proof is a phenomenon of independent interest: we show that the double circulant matrix mimics the behavior of a Gaussian matrix in two important ways. First, it maps an arbitrary set in $\mathbb{R}^n$ into a set of well-spread vectors. Second, it yields a fast near-isometric embedding of any finite subset of $\ell_2^n$ into $\ell_1^m$. This embedding achieves the same dimension reduction as a Gaussian matrix in near-linear time, under an optimal condition - up to logarithmic factors - on the number of points to be embedded. This improves a well-known construction due to Ailon and Chazelle.
Submitted 6 September, 2022; v1 submitted 8 April, 2022;
originally announced April 2022.
-
Sharp estimates on random hyperplane tessellations
Authors:
Sjoerd Dirksen,
Shahar Mendelson,
Alexander Stollenwerk
Abstract:
We study the problem of generating a hyperplane tessellation of an arbitrary set $T$ in $\mathbb{R}^n$, ensuring that the Euclidean distance between any two points corresponds to the fraction of hyperplanes separating them up to a pre-specified error $\delta$. We focus on random Gaussian tessellations with uniformly distributed shifts and derive sharp bounds on the number of hyperplanes $m$ that are required. Surprisingly, our lower estimates falsify the conjecture that $m\sim \ell_*^2(T)/\delta^2$, where $\ell_*^2(T)$ is the Gaussian width of $T$, is optimal.
Submitted 13 January, 2022;
originally announced January 2022.
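The relation between Euclidean distance and the fraction of separating hyperplanes described in the abstract above can be checked numerically. The following is a rough sketch under stated assumptions, not the construction analysed in the paper: the names `tessellation_bits` and `estimate_distance`, the shift range `[-lam, lam]`, and the rescaling constant are illustrative choices. It relies on two standard facts: a uniform shift on `[-lam, lam]` separates two scalars in that interval with probability `|a - b| / (2 * lam)`, and for a standard Gaussian vector `g` the mean of `|<g, x - y>|` is `sqrt(2 / pi)` times the Euclidean distance.

```python
import numpy as np

def tessellation_bits(points, directions, shifts):
    """Record on which side of each shifted hyperplane every point lies.

    points: (k, n), directions: (m, n), shifts: (m,) -> boolean array (k, m).
    """
    return (points @ directions.T + shifts) >= 0

def estimate_distance(bits_x, bits_y, lam):
    """Rescale the fraction of separating hyperplanes to a distance estimate."""
    frac = np.mean(bits_x != bits_y)
    return np.sqrt(np.pi / 2) * 2 * lam * frac

rng = np.random.default_rng(1)
n, m, lam = 32, 20000, 4.0
directions = rng.normal(size=(m, n))    # Gaussian hyperplane normals
shifts = rng.uniform(-lam, lam, size=m) # uniformly distributed shifts
x = np.zeros(n)
y = np.zeros(n)
x[0], y[0] = 0.5, -0.5                  # Euclidean distance 1
bits = tessellation_bits(np.stack([x, y]), directions, shifts)
print(estimate_distance(bits[0], bits[1], lam))
```

The estimate is only meaningful when `lam` is large compared with the norms of the points, so that the projections rarely leave `[-lam, lam]`; the accuracy then improves as the number of hyperplanes `m` grows.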
-
The Separation Capacity of Random Neural Networks
Authors:
Sjoerd Dirksen,
Martin Genzel,
Laurent Jacques,
Alexander Stollenwerk
Abstract:
Neural networks with random weights appear in a variety of machine learning applications, most prominently as the initialization of many deep learning algorithms and as a computationally cheap alternative to fully learned neural networks. In the present article, we enhance the theoretical understanding of random neural networks by addressing the following data separation problem: under what conditions can a random neural network make two classes $\mathcal{X}^-, \mathcal{X}^+ \subset \mathbb{R}^d$ (with positive distance) linearly separable? We show that a sufficiently large two-layer ReLU-network with standard Gaussian weights and uniformly distributed biases can solve this problem with high probability. Crucially, the number of required neurons is explicitly linked to geometric properties of the underlying sets $\mathcal{X}^-, \mathcal{X}^+$ and their mutual arrangement. This instance-specific viewpoint allows us to overcome the usual curse of dimensionality (exponential width of the layers) in non-pathological situations where the data carries low-complexity structure. We quantify the relevant structure of the data in terms of a novel notion of mutual complexity (based on a localized version of Gaussian mean width), which leads to sound and informative separation guarantees. We connect our result with related lines of work on approximation, memorization, and generalization.
Submitted 28 November, 2022; v1 submitted 31 July, 2021;
originally announced August 2021.
-
A Unified Approach to Uniform Signal Recovery From Non-Linear Observations
Authors:
Martin Genzel,
Alexander Stollenwerk
Abstract:
Recent advances in quantized compressed sensing and high-dimensional estimation have shown that signal recovery is even feasible under strong non-linear distortions in the observation process. An important characteristic of associated guarantees is uniformity, i.e., recovery succeeds for an entire class of structured signals with a fixed measurement ensemble. However, despite significant results in various special cases, a general understanding of uniform recovery from non-linear observations is still missing. This paper develops a unified approach to this problem under the assumption of i.i.d. sub-Gaussian measurement vectors. Our main result shows that a simple least-squares estimator with any convex constraint can serve as a universal recovery strategy, which is outlier robust and does not require explicit knowledge of the underlying non-linearity. Based on empirical process theory, a key technical novelty is an approximative increment condition that can be implemented for all common types of non-linear models. This flexibility allows us to apply our approach to a variety of problems in non-linear compressed sensing and high-dimensional statistics, leading to several new and improved guarantees. Each of these applications is accompanied by a conceptually simple and systematic proof, which does not rely on any deeper properties of the observation model. On the other hand, known local stability properties can be incorporated into our framework in a plug-and-play manner, thereby implying near-optimal error bounds.
Submitted 6 January, 2022; v1 submitted 19 September, 2020;
originally announced September 2020.
-
Binarized Johnson-Lindenstrauss embeddings
Authors:
Sjoerd Dirksen,
Alexander Stollenwerk
Abstract:
We consider the problem of encoding a set of vectors into a minimal number of bits while preserving information on their Euclidean geometry. We show that this task can be accomplished by applying a Johnson-Lindenstrauss embedding and subsequently binarizing each vector by comparing each entry of the vector to a uniformly random threshold. Using this simple construction we produce two encodings of a dataset such that one can query Euclidean information for a pair of points using a small number of bit operations up to a desired additive error - Euclidean distances in the first case and inner products and squared Euclidean distances in the second. In the latter case, each point is encoded in near-linear time. The number of bits required for these encodings is quantified in terms of two natural complexity parameters of the dataset - its covering numbers and localized Gaussian complexity - and shown to be near-optimal.
Submitted 11 April, 2022; v1 submitted 17 September, 2020;
originally announced September 2020.
-
Quantized Compressed Sensing by Rectified Linear Units
Authors:
Hans Christian Jung,
Johannes Maly,
Lars Palzer,
Alexander Stollenwerk
Abstract:
This work is concerned with the problem of recovering high-dimensional signals $\mathbf{x} \in \mathbb{R}^n$ which belong to a convex set of low complexity from a small number of quantized measurements. We propose to estimate the signals via a convex program based on rectified linear units (ReLUs) for two different quantization schemes, namely one-bit and uniform multi-bit quantization. Assuming that the linear measurement process can be modelled by a sensing matrix with i.i.d. subgaussian rows, we obtain for both schemes near-optimal uniform reconstruction guarantees by adding well-designed noise to the linear measurements prior to the quantization step. In the one-bit case, we show that the program is robust against adversarial bit corruptions as well as additive noise on the linear measurements. Further, our analysis quantifies precisely how the rate-distortion relationship of the program changes depending on whether we seek reconstruction accuracies above or below the noise floor. The proofs rely on recent results by Dirksen and Mendelson on non-Gaussian hyperplane tessellations. Finally, we complement our theoretical analysis with numerical experiments which compare our method to other state-of-the-art methodologies.
Submitted 26 March, 2021; v1 submitted 18 November, 2019;
originally announced November 2019.
-
Robust 1-Bit Compressed Sensing via Hinge Loss Minimization
Authors:
Martin Genzel,
Alexander Stollenwerk
Abstract:
This work theoretically studies the problem of estimating a structured high-dimensional signal $x_0 \in \mathbb{R}^n$ from noisy $1$-bit Gaussian measurements. Our recovery approach is based on a simple convex program which uses the hinge loss function as data fidelity term. While such a risk minimization strategy is very natural to learn binary output models, such as in classification, its capacity to estimate a specific signal vector is largely unexplored. A major difficulty is that the hinge loss is just piecewise linear, so that its "curvature energy" is concentrated in a single point. This is substantially different from other popular loss functions considered in signal estimation, e.g., the square or logistic loss, which are at least locally strongly convex. It is therefore somewhat unexpected that we can still prove very similar types of recovery guarantees for the hinge loss estimator, even in the presence of strong noise. More specifically, our non-asymptotic error bounds show that stable and robust reconstruction of $x_0$ can be achieved with the optimal oversampling rate $O(m^{-1/2})$ in terms of the number of measurements $m$. Moreover, we permit a wide class of structural assumptions on the ground truth signal, in the sense that $x_0$ can belong to an arbitrary bounded convex set $K \subset \mathbb{R}^n$. The proofs of our main results rely on some recent advances in statistical learning theory due to Mendelson. In particular, we invoke an adapted version of Mendelson's small ball method that allows us to establish a quadratic lower bound on the error of the first order Taylor approximation of the empirical hinge loss function.
Submitted 30 May, 2020; v1 submitted 13 April, 2018;
originally announced April 2018.
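For orientation, a convex program of the kind described in the hinge loss abstract above could plausibly be written as follows. This is an assumed form and is not quoted from the paper: the measurement vectors $a_i \in \mathbb{R}^n$, the observed bits $y_i \in \{-1, +1\}$, and the normalization by $m$ are notation introduced here, while $K$ and $x_0$ are as in the abstract.

$$
\hat{x} \in \operatorname*{argmin}_{x \in K} \; \frac{1}{m} \sum_{i=1}^{m} \max\bigl\{0,\; 1 - y_i \langle a_i, x \rangle \bigr\}
$$

Each summand is piecewise linear in $x$ with a single kink at $y_i \langle a_i, x \rangle = 1$, which is the lack of local strong convexity that the abstract contrasts with the square and logistic losses.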