Showing 1–28 of 28 results for author: Dereziński, M

Searching in archive stat.
  1. arXiv:2505.13723  [pdf, ps, other]

    cs.LG math.OC stat.ML

    Turbocharging Gaussian Process Inference with Approximate Sketch-and-Project

    Authors: Pratik Rathore, Zachary Frangella, Sachin Garg, Shaghayegh Fazliani, Michał Dereziński, Madeleine Udell

    Abstract: Gaussian processes (GPs) play an essential role in biostatistics, scientific machine learning, and Bayesian optimization for their ability to provide probabilistic predictions and model uncertainty. However, GP inference struggles to scale to large datasets (which are common in modern applications), since it requires the solution of a linear system whose size scales quadratically with the number o…

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: 28 pages, 6 figures, 2 tables

  2. arXiv:2501.11673  [pdf, other]

    math.NA cs.DS cs.LG math.OC stat.ML

    Randomized Kaczmarz Methods with Beyond-Krylov Convergence

    Authors: Michał Dereziński, Deanna Needell, Elizaveta Rebrova, Jiaming Yang

    Abstract: Randomized Kaczmarz methods form a family of linear system solvers which converge by repeatedly projecting their iterates onto randomly sampled equations. While effective in some contexts, such as highly over-determined least squares, Kaczmarz methods are traditionally deemed secondary to Krylov subspace methods, since this latter family of solvers can exploit outliers in the input's singular valu…

    Submitted 20 January, 2025; originally announced January 2025.

  3. arXiv:2411.08773  [pdf, ps, other]

    cs.DS cs.LG math.NA math.PR stat.ML

    Optimal Oblivious Subspace Embeddings with Near-optimal Sparsity

    Authors: Shabarish Chenakkod, Michał Dereziński, Xiaoyu Dong

    Abstract: An oblivious subspace embedding is a random $m\times n$ matrix $Π$ such that, for any $d$-dimensional subspace, with high probability $Π$ preserves the norms of all vectors in that subspace within a $1\pmε$ factor. In this work, we give an oblivious subspace embedding with the optimal dimension $m=Θ(d/ε^2)$ that has a near-optimal sparsity of $\tilde O(1/ε)$ non-zero entries per column of $Π$. Thi…

    Submitted 28 April, 2025; v1 submitted 13 November, 2024; originally announced November 2024.

    Comments: ICALP 2025

  4. arXiv:2407.10070  [pdf, other]

    cs.LG math.OC stat.ML

    Have ASkotch: A Neat Solution for Large-scale Kernel Ridge Regression

    Authors: Pratik Rathore, Zachary Frangella, Jiaming Yang, Michał Dereziński, Madeleine Udell

    Abstract: Kernel ridge regression (KRR) is a fundamental computational tool, appearing in problems that range from computational chemistry to health analytics, with a particular interest due to its starring role in Gaussian process regression. However, full KRR solvers are challenging to scale to large datasets: both direct (i.e., Cholesky decomposition) and iterative methods (i.e., PCG) incur prohibitive c…

    Submitted 21 February, 2025; v1 submitted 14 July, 2024; originally announced July 2024.

    Comments: 64 pages (including appendices), 16 figures, 5 tables

    MSC Class: 65F10; 68W20; 90C06

  5. arXiv:2406.11151  [pdf, other]

    cs.LG math.NA stat.ML

    Recent and Upcoming Developments in Randomized Numerical Linear Algebra for Machine Learning

    Authors: Michał Dereziński, Michael W. Mahoney

    Abstract: Large matrices arise in many machine learning and data analysis applications, including as representations of datasets, graphs, model weights, and first and second-order derivatives. Randomized Numerical Linear Algebra (RandNLA) is an area which uses randomness to develop improved algorithms for ubiquitous matrix problems. The area has reached a certain level of maturity; but recent hardware trend…

    Submitted 18 June, 2024; v1 submitted 16 June, 2024; originally announced June 2024.

  6. arXiv:2406.01478  [pdf, other]

    math.OC cs.LG stat.ML

    Stochastic Newton Proximal Extragradient Method

    Authors: Ruichen Jiang, Michał Dereziński, Aryan Mokhtari

    Abstract: Stochastic second-order methods achieve fast local convergence in strongly convex optimization by using noisy Hessian estimates to precondition the gradient. However, these methods typically reach superlinear convergence only when the stochastic Hessian noise diminishes, increasing per-iteration costs over time. Recent work in [arXiv:2204.09266] addressed this with a Hessian averaging scheme that…

    Submitted 11 November, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

    Comments: Accepted to NeurIPS 2024; 35 pages, 3 figures

  7. arXiv:2404.14758  [pdf, other]

    math.OC cs.LG stat.ML

    Second-order Information Promotes Mini-Batch Robustness in Variance-Reduced Gradients

    Authors: Sachin Garg, Albert S. Berahas, Michał Dereziński

    Abstract: We show that, for finite-sum minimization problems, incorporating partial second-order information of the objective function can dramatically improve the robustness to mini-batch size of variance-reduced stochastic gradient methods, making them more scalable while retaining their benefits over traditional Newton-type approaches. We demonstrate this phenomenon on a prototypical stochastic second-or…

    Submitted 23 April, 2024; originally announced April 2024.

    MSC Class: 65K05; 90C06; 90C30

  8. arXiv:2311.10680  [pdf, other]

    cs.DS cs.LG math.NA stat.ML

    Optimal Embedding Dimension for Sparse Subspace Embeddings

    Authors: Shabarish Chenakkod, Michał Dereziński, Xiaoyu Dong, Mark Rudelson

    Abstract: A random $m\times n$ matrix $S$ is an oblivious subspace embedding (OSE) with parameters $ε>0$, $δ\in(0,1/3)$ and $d\leq m\leq n$, if for any $d$-dimensional subspace $W\subseteq R^n$, $P\big(\,\forall_{x\in W}\ (1+ε)^{-1}\|x\|\leq\|Sx\|\leq (1+ε)\|x\|\,\big)\geq 1-δ.$ It is known that the embedding dimension of an OSE must satisfy $m\geq d$, and for any $θ> 0$, a Gaussian embedding matrix wit…

    Submitted 5 June, 2024; v1 submitted 17 November, 2023; originally announced November 2023.

    Comments: STOC 2024

  9. arXiv:2208.09585  [pdf, other]

    math.OC math.NA stat.ML

    Sharp Analysis of Sketch-and-Project Methods via a Connection to Randomized Singular Value Decomposition

    Authors: Michał Dereziński, Elizaveta Rebrova

    Abstract: Sketch-and-project is a framework which unifies many known iterative methods for solving linear systems and their variants, as well as further extensions to non-linear optimization problems. It includes popular methods such as randomized Kaczmarz, coordinate descent, variants of the Newton method in convex optimization, and others. In this paper, we develop a theoretical framework for obtaining sh…

    Submitted 18 September, 2023; v1 submitted 19 August, 2022; originally announced August 2022.

    MSC Class: 65F10; 68W20; 60B20

  10. arXiv:2206.10291  [pdf, other]

    cs.LG cs.DS math.ST stat.ML

    Algorithmic Gaussianization through Sketching: Converting Data into Sub-gaussian Random Designs

    Authors: Michał Dereziński

    Abstract: Algorithmic Gaussianization is a phenomenon that can arise when using randomized sketching or sampling methods to produce smaller representations of large datasets: For certain tasks, these sketched representations have been observed to exhibit many robust performance characteristics that are known to occur when a data sample comes from a sub-gaussian random design, which is a powerful statistical…

    Submitted 27 July, 2023; v1 submitted 21 June, 2022; originally announced June 2022.

  11. arXiv:2206.02702  [pdf, other]

    math.OC cs.LG stat.ML

    Stochastic Variance-Reduced Newton: Accelerating Finite-Sum Minimization with Large Batches

    Authors: Michał Dereziński

    Abstract: Stochastic variance reduction has proven effective at accelerating first-order algorithms for solving convex finite-sum optimization tasks such as empirical risk minimization. Incorporating second-order information has proven helpful in further improving the performance of these first-order methods. Yet, comparatively little is known about the benefits of using variance reduction to accelerate pop…

    Submitted 29 April, 2025; v1 submitted 6 June, 2022; originally announced June 2022.

  12. arXiv:2204.09266  [pdf, other]

    math.OC cs.LG stat.ML

    Hessian Averaging in Stochastic Newton Methods Achieves Superlinear Convergence

    Authors: Sen Na, Michał Dereziński, Michael W. Mahoney

    Abstract: We consider minimizing a smooth and strongly convex objective function using a stochastic Newton method. At each iteration, the algorithm is given an oracle access to a stochastic estimate of the Hessian matrix. The oracle model includes popular algorithms such as Subsampled Newton and Newton Sketch. Despite using second-order information, these existing methods do not exhibit superlinear converge…

    Submitted 28 November, 2022; v1 submitted 20 April, 2022; originally announced April 2022.

    Comments: 43 pages, 16 figures

  13. arXiv:2107.07480  [pdf, other]

    math.OC cs.DS cs.LG stat.ML

    Newton-LESS: Sparsification without Trade-offs for the Sketched Newton Update

    Authors: Michał Dereziński, Jonathan Lacotte, Mert Pilanci, Michael W. Mahoney

    Abstract: In second-order optimization, a potential bottleneck can be computing the Hessian matrix of the optimized function at every iteration. Randomized sketching has emerged as a powerful technique for constructing estimates of the Hessian which can be used to perform approximate Newton steps. This involves multiplication by a random sketching matrix, which introduces a trade-off between the computation…

    Submitted 15 July, 2021; originally announced July 2021.

  14. arXiv:2105.07320  [pdf, other]

    cs.DC stat.ML

    LocalNewton: Reducing Communication Bottleneck for Distributed Learning

    Authors: Vipul Gupta, Avishek Ghosh, Michal Derezinski, Rajiv Khanna, Kannan Ramchandran, Michael Mahoney

    Abstract: To address the communication bottleneck problem in distributed optimization within a master-worker framework, we propose LocalNewton, a distributed second-order algorithm with local averaging. In LocalNewton, the worker machines update their model in every iteration by finding a suitable second-order descent direction using only the data and model stored in their own local memory. We let the worke…

    Submitted 15 May, 2021; originally announced May 2021.

    Comments: To be published in Uncertainty in Artificial Intelligence (UAI) 2021

  15. arXiv:2011.10695  [pdf, ps, other]

    cs.DS cs.LG stat.ML

    Sparse sketches with small inversion bias

    Authors: Michał Dereziński, Zhenyu Liao, Edgar Dobriban, Michael W. Mahoney

    Abstract: For a tall $n\times d$ matrix $A$ and a random $m\times n$ sketching matrix $S$, the sketched estimate of the inverse covariance matrix $(A^\top A)^{-1}$ is typically biased: $E[(\tilde A^\top\tilde A)^{-1}]\ne(A^\top A)^{-1}$, where $\tilde A=SA$. This phenomenon, which we call inversion bias, arises, e.g., in statistics and distributed optimization, when averaging multiple independently construc…

    Submitted 9 July, 2021; v1 submitted 20 November, 2020; originally announced November 2020.

  16. arXiv:2007.01327  [pdf, other]

    cs.LG math.OC stat.ML

    Debiasing Distributed Second Order Optimization with Surrogate Sketching and Scaled Regularization

    Authors: Michał Dereziński, Burak Bartan, Mert Pilanci, Michael W. Mahoney

    Abstract: In distributed second order optimization, a standard strategy is to average many local estimates, each of which is based on a small sketch or batch of the data. However, the local estimates on each machine are typically biased, relative to the full solution on all of the data, and this can limit the effectiveness of averaging. Here, we introduce a new technique for debiasing the local estimates, w…

    Submitted 2 July, 2020; originally announced July 2020.

  17. arXiv:2006.16947  [pdf, other]

    cs.LG cs.DS stat.ML

    Sampling from a $k$-DPP without looking at all items

    Authors: Daniele Calandriello, Michał Dereziński, Michal Valko

    Abstract: Determinantal point processes (DPPs) are a useful probabilistic model for selecting a small diverse subset out of a large collection of items, with applications in summarization, stochastic optimization, active learning and more. Given a kernel function and a subset size $k$, our goal is to sample $k$ out of $n$ items with probability proportional to the determinant of the kernel matrix induced by…

    Submitted 30 June, 2020; originally announced June 2020.

  18. arXiv:2006.10653  [pdf, other]

    cs.LG stat.ML

    Precise expressions for random projections: Low-rank approximation and randomized Newton

    Authors: Michał Dereziński, Feynman Liang, Zhenyu Liao, Michael W. Mahoney

    Abstract: It is often desirable to reduce the dimensionality of a large dataset by projecting it onto a low-dimensional subspace. Matrix sketching has emerged as a powerful technique for performing such dimensionality reduction very efficiently. Even though there is an extensive literature on the worst-case performance of sketching, existing guarantees are typically very different from what is observed in p…

    Submitted 13 June, 2022; v1 submitted 18 June, 2020; originally announced June 2020.

    Comments: This version of the paper includes a correction to the assumptions in a technical result, Theorem 2. None of the other claims are affected by this change. The conference version of this paper does not include the correction, so we recommend to cite this arXiv version when referencing Theorem 2

  19. arXiv:2002.09073  [pdf, other]

    cs.LG stat.ML

    Improved guarantees and a multiple-descent curve for Column Subset Selection and the Nyström method

    Authors: Michał Dereziński, Rajiv Khanna, Michael W. Mahoney

    Abstract: The Column Subset Selection Problem (CSSP) and the Nyström method are among the leading tools for constructing small low-rank approximations of large datasets in machine learning and scientific computing. A fundamental question in this area is: how well can a data subset of size k compete with the best rank k approximation? We develop techniques which exploit spectral properties of the data matrix…

    Submitted 18 December, 2020; v1 submitted 20 February, 2020; originally announced February 2020.

    Comments: Minor typo corrections and clarifications; slight change in the title; moved part of the related work and background discussion to the appendix

  20. arXiv:1912.04533  [pdf, other]

    cs.LG math.ST stat.ML

    Exact expressions for double descent and implicit regularization via surrogate random design

    Authors: Michał Dereziński, Feynman Liang, Michael W. Mahoney

    Abstract: Double descent refers to the phase transition that is exhibited by the generalization error of unregularized learning models when varying the ratio between the number of parameters and the number of training samples. The recent success of highly over-parameterized machine learning models such as deep neural networks has motivated a theoretical analysis of the double descent phenomenon in classical…

    Submitted 18 June, 2020; v1 submitted 10 December, 2019; originally announced December 2019.

    Comments: Minor typo corrections and clarifications; moved the proofs into the appendix

  21. arXiv:1907.03411  [pdf, other]

    stat.ML cs.LG

    Unbiased estimators for random design regression

    Authors: Michał Dereziński, Manfred K. Warmuth, Daniel Hsu

    Abstract: In linear regression we wish to estimate the optimum linear least squares predictor for a distribution over $d$-dimensional input points and real-valued responses, based on a small sample. Under standard random design analysis, where the sample is drawn i.i.d. from the input distribution, the least squares solution for that sample can be viewed as the natural estimator of the optimum. Unfortunatel…

    Submitted 7 June, 2022; v1 submitted 8 July, 2019; originally announced July 2019.

  22. arXiv:1906.04133  [pdf, other]

    cs.LG stat.ML

    Bayesian experimental design using regularized determinantal point processes

    Authors: Michał Dereziński, Feynman Liang, Michael W. Mahoney

    Abstract: In experimental design, we are given $n$ vectors in $d$ dimensions, and our goal is to select $k\ll n$ of them to perform expensive measurements, e.g., to obtain labels/responses, for a linear regression task. Many statistical criteria have been proposed for choosing the optimal design, with popular choices including A- and D-optimality. If prior knowledge is given, typically in the form of a…

    Submitted 10 June, 2019; originally announced June 2019.

  23. arXiv:1905.13476  [pdf, other]

    cs.LG stat.ML

    Exact sampling of determinantal point processes with sublinear time preprocessing

    Authors: Michał Dereziński, Daniele Calandriello, Michal Valko

    Abstract: We study the complexity of sampling from a distribution over all index subsets of the set $\{1,...,n\}$ with the probability of a subset $S$ proportional to the determinant of the submatrix $\mathbf{L}_S$ of some $n\times n$ p.s.d. matrix $\mathbf{L}$, where $\mathbf{L}_S$ corresponds to the entries of $\mathbf{L}$ indexed by $S$. Known as a determinantal point process, this distribution is used i…

    Submitted 8 July, 2019; v1 submitted 31 May, 2019; originally announced May 2019.

  24. arXiv:1905.11546  [pdf, ps, other]

    cs.LG stat.ML

    Distributed estimation of the inverse Hessian by determinantal averaging

    Authors: Michał Dereziński, Michael W. Mahoney

    Abstract: In distributed optimization and distributed numerical linear algebra, we often encounter an inversion bias: if we want to compute a quantity that depends on the inverse of a sum of distributed matrices, then the sum of the inverses does not equal the inverse of the sum. An example of this occurs in distributed Newton's method, where we wish to compute (or implicitly work with) the inverse Hessian…

    Submitted 27 May, 2019; originally announced May 2019.

  25. arXiv:1902.00995  [pdf, ps, other]

    cs.LG stat.ML

    Minimax experimental design: Bridging the gap between statistical and worst-case approaches to least squares regression

    Authors: Michał Dereziński, Kenneth L. Clarkson, Michael W. Mahoney, Manfred K. Warmuth

    Abstract: In experimental design, we are given a large collection of vectors, each with a hidden response value that we assume derives from an underlying linear model, and we wish to pick a small subset of the vectors such that querying the corresponding responses will lead to a good estimator of the model. A classical approach in statistics is to assume the responses are linear, plus zero-mean i.i.d. Gauss…

    Submitted 3 February, 2019; originally announced February 2019.

  26. arXiv:1811.03717  [pdf, ps, other]

    cs.LG stat.ML

    Fast determinantal point processes via distortion-free intermediate sampling

    Authors: Michał Dereziński

    Abstract: Given a fixed $n\times d$ matrix $\mathbf{X}$, where $n\gg d$, we study the complexity of sampling from a distribution over all subsets of rows where the probability of a subset is proportional to the squared volume of the parallelepiped spanned by the rows (a.k.a. a determinantal point process). In this task, it is important to minimize the preprocessing cost of the procedure (performed once) as…

    Submitted 21 February, 2019; v1 submitted 8 November, 2018; originally announced November 2018.

  27. arXiv:1810.02453  [pdf, ps, other]

    cs.LG stat.ML

    Correcting the bias in least squares regression with volume-rescaled sampling

    Authors: Michał Dereziński, Manfred K. Warmuth, Daniel Hsu

    Abstract: Consider linear regression where the examples are generated by an unknown distribution on $R^d\times R$. Without any assumptions on the noise, the linear least squares solution for any i.i.d. sample will typically be biased w.r.t. the least squares optimum over the entire distribution. However, we show that if an i.i.d. sample of any size k is augmented by a certain small additional sample, then t…

    Submitted 4 October, 2018; originally announced October 2018.

  28. arXiv:1806.01969  [pdf, other]

    cs.LG stat.ML

    Reverse iterative volume sampling for linear regression

    Authors: Michał Dereziński, Manfred K. Warmuth

    Abstract: We study the following basic machine learning task: Given a fixed set of $d$-dimensional input points for a linear regression problem, we wish to predict a hidden response value for each of the points. We can only afford to attain the responses for a small subset of the points that are then used to construct linear predictions for all points in the dataset. The performance of the predictions is ev…

    Submitted 5 June, 2018; originally announced June 2018.