-
Near-Optimal and Tractable Estimation under Shift-Invariance
Authors:
Dmitrii M. Ostrovskii
Abstract:
How hard is it to estimate a discrete-time signal $(x_{1}, ..., x_{n}) \in \mathbb{C}^n$ satisfying an unknown linear recurrence relation of order $s$ and observed in i.i.d. complex Gaussian noise? The class of all such signals is parametric but extremely rich: it contains all exponential polynomials over $\mathbb{C}$ with total degree $s$, including harmonic oscillations with $s$ arbitrary frequencies. Geometrically, this class corresponds to the projection onto $\mathbb{C}^{n}$ of the union of all shift-invariant subspaces of $\mathbb{C}^\mathbb{Z}$ of dimension $s$. We show that the statistical complexity of this class, as measured by the squared minimax radius of the $(1-δ)$-confidence $\ell_2$-ball, is nearly the same as for the class of $s$-sparse signals, namely $O\left(s\log(en) + \log(δ^{-1})\right) \cdot \log^2(es) \cdot \log(en/s).$ Moreover, the corresponding near-minimax estimator is tractable, and it can be used to build a test statistic with a near-minimax detection threshold in the associated detection problem. These statistical results rest upon an approximation-theoretic one: we show that finite-dimensional shift-invariant subspaces admit compactly supported reproducing kernels whose Fourier spectra have nearly the smallest possible $\ell_p$-norms, for all $p \in [1,+\infty]$ at once.
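As a small illustration of the signal class described in this abstract (not the paper's estimator), the sketch below checks numerically that a sum of $s$ complex harmonic oscillations satisfies a linear recurrence of order $s$. The frequencies, amplitudes, and helper names are hypothetical choices made for the example; the recurrence coefficients are those of the polynomial whose roots are $e^{iω_k}$.

```python
import cmath

def recurrence_coeffs(roots):
    """Coefficients c[0..s] of prod_k (z - roots[k]), highest degree first."""
    coeffs = [1 + 0j]
    for r in roots:
        # Multiply the current polynomial by (z - r).
        shifted = coeffs + [0j]                      # polynomial times z
        scaled = [0j] + [-r * c for c in coeffs]     # polynomial times (-r)
        coeffs = [a + b for a, b in zip(shifted, scaled)]
    return coeffs

def max_recurrence_residual(x, coeffs):
    """Largest |sum_j c[j] * x[t + s - j]| over all valid t."""
    s = len(coeffs) - 1
    return max(
        abs(sum(c * x[t + s - j] for j, c in enumerate(coeffs)))
        for t in range(len(x) - s)
    )

# Hypothetical example: s = 3 frequencies with arbitrary complex amplitudes.
freqs = [0.3, 1.1, 2.5]
amps = [1.0, -0.5 + 0.2j, 2.0j]
n = 50
roots = [cmath.exp(1j * w) for w in freqs]
x = [sum(a * z ** t for a, z in zip(amps, roots)) for t in range(n)]

coeffs = recurrence_coeffs(roots)
residual = max_recurrence_residual(x, coeffs)  # zero up to rounding error
```

The recurrence holds for any amplitudes, which is why the class is parametric yet rich: the $s$ frequencies are free and unknown to the estimator.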
Submitted 9 January, 2025; v1 submitted 5 November, 2024;
originally announced November 2024.
-
Efficient and Near-Optimal Online Portfolio Selection
Authors:
Rémi Jézéquel,
Dmitrii M. Ostrovskii,
Pierre Gaillard
Abstract:
In the problem of online portfolio selection as formulated by Cover (1991), the trader repeatedly distributes her capital over $ d $ assets in each of $ T > 1 $ rounds, with the goal of maximizing the total return. Cover proposed an algorithm, termed Universal Portfolios, that performs nearly as well as the best (in hindsight) static assignment of a portfolio, with an $ O(d\log(T)) $ regret in terms of the logarithmic return. Without imposing any restrictions on the market, this guarantee is known to be worst-case optimal, and no other algorithm attaining it has been discovered so far. Unfortunately, Cover's algorithm crucially relies on computing a certain $ d $-dimensional integral which must be approximated in any implementation; this results in a prohibitive $ \tilde O(d^4(T+d)^{14}) $ per-round runtime for the fastest known implementation due to Kalai and Vempala (2002). We propose an algorithm for online portfolio selection that admits essentially the same regret guarantee as Universal Portfolios -- up to a constant factor and replacement of $ \log(T) $ with $ \log(T+d) $ -- yet has a drastically reduced runtime of $ \tilde O(d^2(T+d)) $ per round. The selected portfolio minimizes the current logarithmic loss regularized by the log-determinant of its Hessian -- equivalently, the hybrid logarithmic-volumetric barrier of the polytope specified by the asset return vectors. As such, our work reveals surprising connections of online portfolio selection with two classical topics in optimization theory: cutting-plane and interior-point algorithms.
Submitted 9 March, 2025; v1 submitted 28 September, 2022;
originally announced September 2022.
-
Nonconvex-Nonconcave Min-Max Optimization with a Small Maximization Domain
Authors:
Dmitrii M. Ostrovskii,
Babak Barazandeh,
Meisam Razaviyayn
Abstract:
We study the problem of finding approximate first-order stationary points in optimization problems of the form $\min_{x \in X} \max_{y \in Y} f(x,y)$, where the sets $X,Y$ are convex and $Y$ is compact. The objective function $f$ is smooth, but assumed neither convex in $x$ nor concave in $y$. Our approach relies upon replacing the function $f(x,\cdot)$ with its $k$th order Taylor approximation (in $y$) and finding a near-stationary point in the resulting surrogate problem. To guarantee its success, we establish the following result: let the Euclidean diameter of $Y$ be small in terms of the target accuracy $\varepsilon$, namely $O(\varepsilon^{\frac{2}{k+1}})$ for $k \in \mathbb{N}$ and $O(\varepsilon)$ for $k = 0$, with the constant factors controlled by certain regularity parameters of $f$; then any $\varepsilon$-stationary point in the surrogate problem remains $O(\varepsilon)$-stationary for the initial problem. Moreover, we show that these upper bounds are nearly optimal: the aforementioned reduction provably fails when the diameter of $Y$ is larger. For $0 \le k \le 2$ the surrogate function can be efficiently maximized in $y$; our general approximation result then leads to efficient algorithms for finding a near-stationary point in nonconvex-nonconcave min-max problems, for which we also provide convergence guarantees.
Submitted 8 October, 2021;
originally announced October 2021.
-
Near-Optimal Procedures for Model Discrimination with Non-Disclosure Properties
Authors:
Dmitrii M. Ostrovskii,
Mohamed Ndaoud,
Adel Javanmard,
Meisam Razaviyayn
Abstract:
Let $θ_0,θ_1 \in \mathbb{R}^d$ be the population risk minimizers associated with some loss $\ell:\mathbb{R}^d\times \mathcal{Z}\to\mathbb{R}$ and two distributions $\mathbb{P}_0,\mathbb{P}_1$ on $\mathcal{Z}$. The models $θ_0,θ_1$ are unknown, and $\mathbb{P}_0,\mathbb{P}_1$ can be accessed by drawing i.i.d. samples from them. Our work is motivated by the following model discrimination question: "What sizes of the samples from $\mathbb{P}_0$ and $\mathbb{P}_1$ allow one to distinguish between the two hypotheses $θ^*=θ_0$ and $θ^*=θ_1$ for given $θ^*\in\{θ_0,θ_1\}$?" Making the first steps towards answering it in full generality, we first consider the case of a well-specified linear model with squared loss. Here we provide matching upper and lower bounds on the sample complexity as given by $\min\{1/Δ^2,\sqrt{r}/Δ\}$ up to a constant factor; here $Δ$ is a measure of separation between $\mathbb{P}_0$ and $\mathbb{P}_1$ and $r$ is the rank of the design covariance matrix. We then extend this result in two directions: (i) for general parametric models in the asymptotic regime; (ii) for generalized linear models in small samples ($n\le r$) under weak moment assumptions. In both cases we derive sample complexity bounds of a similar form while allowing for model misspecification. In fact, our testing procedures only access $θ^*$ via a certain functional of the empirical risk. In addition, the number of observations that allows us to reach statistical confidence does not allow one to "resolve" the two models, that is, to recover $θ_0,θ_1$ up to $O(Δ)$ prediction accuracy. These two properties make it possible to use our framework in applied tasks where one would like to $\textit{identify}$ a prediction model, which can be proprietary, while guaranteeing that the model cannot actually be $\textit{inferred}$ by the identifying agent.
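To make the two regimes of the bound $\min\{1/Δ^2,\sqrt{r}/Δ\}$ concrete, here is a minimal sketch that evaluates it with the unspecified constant factor set to 1 (an assumption made for illustration only). The function name and the numeric inputs are hypothetical. The two terms cross at $Δ = 1/\sqrt{r}$: for larger separation the $1/Δ^2$ term is the smaller one, and for smaller separation the $\sqrt{r}/Δ$ term is.

```python
import math

def sample_complexity(separation, rank):
    """Evaluate min{1/Delta^2, sqrt(r)/Delta}, ignoring constant factors."""
    return min(1.0 / separation ** 2, math.sqrt(rank) / separation)

rank = 100                      # hypothetical rank r of the design covariance
crossover = 1.0 / math.sqrt(rank)

# Well-separated models (Delta above the crossover): 1/Delta^2 is active.
large_sep = sample_complexity(0.5, rank)

# Poorly separated models (Delta below the crossover): sqrt(r)/Delta is active.
small_sep = sample_complexity(0.01, rank)
```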
Submitted 10 July, 2021; v1 submitted 4 December, 2020;
originally announced December 2020.
-
Efficient Search of First-Order Nash Equilibria in Nonconvex-Concave Smooth Min-Max Problems
Authors:
Dmitrii M. Ostrovskii,
Andrew Lowy,
Meisam Razaviyayn
Abstract:
We propose an efficient algorithm for finding first-order Nash equilibria in min-max problems of the form $\min_{x \in X}\max_{y\in Y} F(x,y)$, where the objective function is smooth in both variables and concave with respect to $y$; the sets $X$ and $Y$ are convex and "projection-friendly," and $Y$ is compact. Our goal is to find an $(\varepsilon_x,\varepsilon_y)$-first-order Nash equilibrium with respect to a stationarity criterion that is stronger than the commonly used proximal gradient norm. The proposed approach is fairly simple: we perform approximate proximal-point iterations on the primal function, with inexact oracle provided by Nesterov's algorithm run on the regularized function $F(x_t,\cdot)$, $x_t$ being the current primal iterate. The resulting iteration complexity is $O(\varepsilon_x{}^{-2} \varepsilon_y{}^{-1/2})$ up to a logarithmic factor. As a byproduct, the choice $\varepsilon_y = O(\varepsilon_x{}^2)$ allows for the $O(\varepsilon_x{}^{-3})$ complexity of finding an $\varepsilon_x$-stationary point for the standard Moreau envelope of the primal function. Moreover, when the objective is strongly concave with respect to $y$, the complexity estimate for our algorithm improves to $O(\varepsilon_x{}^{-2}{κ_y}^{1/2})$ up to a logarithmic factor, where $κ_y$ is the condition number appropriately adjusted for coupling. In both scenarios, the complexity estimates are the best known so far, and are only known for the (weaker) proximal gradient norm criterion. Meanwhile, our approach is "user-friendly:" (i) the algorithm is built upon running a variant of Nesterov's accelerated algorithm as subroutine and avoids extragradient steps; (ii) the convergence analysis recycles the well-known results on accelerated methods with inexact oracle. Finally, we extend the approach to non-Euclidean proximal geometries.
Submitted 2 May, 2021; v1 submitted 18 February, 2020;
originally announced February 2020.
-
Efficient Primal-Dual Algorithms for Large-Scale Multiclass Classification
Authors:
Dmitry Babichev,
Dmitrii Ostrovskii,
Francis Bach
Abstract:
We develop efficient algorithms to train $\ell_1$-regularized linear classifiers with large dimensionality $d$ of the feature space, number of classes $k$, and sample size $n$. Our focus is on a special class of losses that includes, in particular, the multiclass hinge and logistic losses. Our approach combines several ideas: (i) passing to the equivalent saddle-point problem with a quasi-bilinear objective; (ii) applying stochastic mirror descent with a proper choice of geometry which guarantees a favorable accuracy bound; (iii) devising non-uniform sampling schemes to approximate the matrix products. In particular, for the multiclass hinge loss we propose a \textit{sublinear} algorithm with iterations performed in $O(d+n+k)$ arithmetic operations.
Submitted 11 February, 2019;
originally announced February 2019.
-
Affine Invariant Covariance Estimation for Heavy-Tailed Distributions
Authors:
Dmitrii Ostrovskii,
Alessandro Rudi
Abstract:
In this work we provide an estimator for the covariance matrix of a heavy-tailed multivariate distribution. We prove that the proposed estimator $\widehat{\mathbf{S}}$ admits an \textit{affine-invariant} bound of the form \[(1-\varepsilon) \mathbf{S} \preccurlyeq \widehat{\mathbf{S}} \preccurlyeq (1+\varepsilon) \mathbf{S}\] in high probability, where $\mathbf{S}$ is the unknown covariance matrix, and $\preccurlyeq$ is the positive semidefinite order on symmetric matrices. The result only requires the existence of fourth-order moments, and allows for $\varepsilon = O(\sqrt{κ^4 d\log(d/δ)/n})$ where $κ^4$ is a measure of kurtosis of the distribution, $d$ is the dimensionality of the space, $n$ is the sample size, and $1-δ$ is the desired confidence level. More generally, we can allow for regularization with level $λ$, in which case $d$ is replaced with the number of degrees of freedom. Denoting by $\text{cond}(\mathbf{S})$ the condition number of $\mathbf{S}$, the computational cost of the novel estimator is $O(d^2 n + d^3\log(\text{cond}(\mathbf{S})))$, which is comparable to the cost of the sample covariance estimator in the statistically interesting regime $n \ge d$. We consider applications of our estimator to eigenvalue estimation with relative error, and to ridge regression with heavy-tailed random design.
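The rate $\varepsilon = O(\sqrt{κ^4 d\log(d/δ)/n})$ can be inverted to get a rough sample-size requirement for a target relative error. The sketch below does this arithmetic with the hidden constant set to 1 by default, which is an assumption: the abstract does not state the constant, so the numbers only indicate scaling. Function names and inputs are hypothetical.

```python
import math

def relative_error(kappa4, d, n, delta, const=1.0):
    """Evaluate const * sqrt(kappa^4 * d * log(d/delta) / n)."""
    return const * math.sqrt(kappa4 * d * math.log(d / delta) / n)

def sample_size_for(eps, kappa4, d, delta, const=1.0):
    """Smallest integer n for which relative_error(...) is at most eps."""
    return math.ceil(const ** 2 * kappa4 * d * math.log(d / delta) / eps ** 2)

# Hypothetical inputs: kurtosis parameter 3, dimension 10, confidence 95%.
n_needed = sample_size_for(eps=0.1, kappa4=3.0, d=10, delta=0.05)
achieved = relative_error(kappa4=3.0, d=10, n=n_needed, delta=0.05)
```

Halving the target error quadruples the required sample size, while the dependence on the confidence level $δ$ is only logarithmic.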
Submitted 24 September, 2019; v1 submitted 8 February, 2019;
originally announced February 2019.
-
Beyond Least-Squares: Fast Rates for Regularized Empirical Risk Minimization through Self-Concordance
Authors:
Ulysse Marteau-Ferey,
Dmitrii Ostrovskii,
Francis Bach,
Alessandro Rudi
Abstract:
We consider learning methods based on the regularization of a convex empirical risk by a squared Hilbertian norm, a setting that includes linear predictors and non-linear predictors through positive-definite kernels. In order to go beyond the generic analysis leading to convergence rates of the excess risk as $O(1/\sqrt{n})$ from $n$ observations, we assume that the individual losses are self-concordant, that is, their third-order derivatives are bounded by their second-order derivatives. This setting includes least-squares, as well as all generalized linear models such as logistic and softmax regression. For this class of losses, we provide a bias-variance decomposition and show that the assumptions commonly made in least-squares regression, such as the source and capacity conditions, can be adapted to obtain fast non-asymptotic rates of convergence by improving the bias terms, the variance terms or both.
Submitted 18 June, 2019; v1 submitted 8 February, 2019;
originally announced February 2019.
-
Finite-sample analysis of M-estimators using self-concordance
Authors:
Dmitrii Ostrovskii,
Francis Bach
Abstract:
The classical asymptotic theory for parametric $M$-estimators guarantees that, in the limit of infinite sample size, the excess risk has a chi-square type distribution, even in the misspecified case. We demonstrate how self-concordance of the loss allows one to characterize the critical sample size sufficient to guarantee a chi-square type in-probability bound for the excess risk. Specifically, we consider two classes of losses: (i) self-concordant losses in the classical sense of Nesterov and Nemirovski, i.e., whose third derivative is uniformly bounded by the $3/2$ power of the second derivative; (ii) pseudo self-concordant losses, for which the power is removed. These classes contain losses corresponding to several generalized linear models, including the logistic loss and pseudo-Huber losses. Our basic result under minimal assumptions bounds the critical sample size by $O(d \cdot d_{\text{eff}}),$ where $d$ is the parameter dimension and $d_{\text{eff}}$ is the effective dimension that accounts for model misspecification. In contrast to the existing results, we only impose local assumptions that concern the population risk minimizer $θ_*$. Namely, we assume that the calibrated design, i.e., design scaled by the square root of the second derivative of the loss, is subgaussian at $θ_*$. Besides, for type-ii losses we require boundedness of a certain measure of curvature of the population risk at $θ_*$. Our improved result bounds the critical sample size from above as $O(\max\{d_{\text{eff}}, d \log d\})$ under slightly stronger assumptions. Namely, the local assumptions must hold in the neighborhood of $θ_*$ given by the Dikin ellipsoid of the population risk. Interestingly, we find that, for logistic regression with Gaussian design, there is no actual restriction of conditions: the subgaussian parameter and curvature measure remain near-constant over the Dikin ellipsoid. Finally, we extend some of these results to $\ell_1$-penalized estimators in high dimensions.
Submitted 30 November, 2020; v1 submitted 16 October, 2018;
originally announced October 2018.
-
Adaptive Denoising of Signals with Local Shift-Invariant Structure
Authors:
Zaid Harchaoui,
Anatoli Juditsky,
Arkadi Nemirovski,
Dmitrii Ostrovskii
Abstract:
We discuss the problem of adaptive discrete-time signal denoising in the situation where the signal to be recovered admits a "linear oracle" -- an unknown linear estimate that takes the form of convolution of observations with a time-invariant filter. It was shown by Juditsky and Nemirovski (2009) that when the $\ell_2$-norm of the oracle filter is small enough, such an oracle can be "mimicked" by an efficiently computable adaptive estimate of the same structure with an observation-driven filter. The filter in question was obtained as a solution to the optimization problem in which the $\ell_\infty$-norm of the Discrete Fourier Transform (DFT) of the estimation residual is minimized under a constraint on the $\ell_1$-norm of the filter DFT. In this paper, we discuss a new family of adaptive estimates which rely upon minimizing the $\ell_2$-norm of the estimation residual. We show that such estimators possess better statistical properties than those based on $\ell_\infty$-fit; in particular, we prove oracle inequalities for their $\ell_2$-loss and improved bounds for $\ell_2$- and pointwise losses. The oracle inequalities rely on the "approximate shift-invariance" assumption stating that the signal to be recovered is close to an (unknown) shift-invariant subspace. We also study the relationship of the approximate shift-invariance assumption with the "signal simplicity" assumption introduced in Juditsky and Nemirovski (2009) and discuss the application of the proposed approach to the denoising of harmonic oscillations.
Submitted 11 February, 2021; v1 submitted 11 June, 2018;
originally announced June 2018.
-
Efficient First-Order Algorithms for Adaptive Signal Denoising
Authors:
Dmitrii Ostrovskii,
Zaid Harchaoui
Abstract:
We consider the problem of discrete-time signal denoising, focusing on a specific family of non-linear convolution-type estimators. Each such estimator is associated with a time-invariant filter which is obtained adaptively, by solving a certain convex optimization problem. Adaptive convolution-type estimators were demonstrated to have favorable statistical properties. However, the question of their computational complexity remains largely unexplored, and in fact we are not aware of any publicly available implementation of these estimators. Our first contribution is an efficient implementation of these estimators via some known first-order proximal algorithms. Our second contribution is a computational complexity analysis of the proposed procedures, which takes into account their statistical nature and the related notion of statistical accuracy. The proposed procedures and their analysis are illustrated on a simulated data benchmark.
Submitted 12 June, 2018; v1 submitted 29 March, 2018;
originally announced March 2018.