Showing 1–32 of 32 results for author: Klusowski, J M

  1. arXiv:2506.15643  [pdf, ps, other]

    stat.ML cs.LG

    Revisiting Randomization in Greedy Model Search

    Authors: Xin Chen, Jason M. Klusowski, Yan Shuo Tan, Chang Yu

    Abstract: Combining randomized estimators in an ensemble, such as via random forests, has become a fundamental technique in modern data science, but can be computationally expensive. Furthermore, the mechanism by which this improves predictive performance is poorly understood. We address these issues in the context of sparse linear regression by proposing and analyzing an ensemble of greedy forward selectio…

    Submitted 18 June, 2025; originally announced June 2025.

  2. arXiv:2411.10830  [pdf, other]

    cs.LG cs.AI math.OC

    One-Layer Transformer Provably Learns One-Nearest Neighbor In Context

    Authors: Zihao Li, Yuan Cao, Cheng Gao, Yihan He, Han Liu, Jason M. Klusowski, Jianqing Fan, Mengdi Wang

    Abstract: Transformers have achieved great success in recent years. Interestingly, transformers have shown particularly strong in-context learning capability -- even without fine-tuning, they are still able to solve unseen tasks well purely based on task-specific prompts. In this paper, we study the capability of one-layer transformers in learning one of the most classical nonparametric estimators, the one-…

    Submitted 16 November, 2024; originally announced November 2024.

  3. arXiv:2411.04394  [pdf, ps, other]

    stat.ML cs.DS cs.LG math.ST

    Statistical-Computational Trade-offs for Recursive Adaptive Partitioning Estimators

    Authors: Yan Shuo Tan, Jason M. Klusowski, Krishnakumar Balasubramanian

    Abstract: Models based on recursive adaptive partitioning such as decision trees and their ensembles are popular for high-dimensional regression as they can potentially avoid the curse of dimensionality. Because empirical risk minimization (ERM) is computationally infeasible, these models are typically trained using greedy algorithms. Although effective in many cases, these algorithms have been empirically…

    Submitted 18 November, 2024; v1 submitted 6 November, 2024; originally announced November 2024.

    MSC Class: 68Q32; 62G08 ACM Class: G.3

  4. arXiv:2410.23610  [pdf, other]

    stat.ML cs.LG math.ST

    Global Convergence in Training Large-Scale Transformers

    Authors: Cheng Gao, Yuan Cao, Zihao Li, Yihan He, Mengdi Wang, Han Liu, Jason Matthew Klusowski, Jianqing Fan

    Abstract: Despite the widespread success of Transformers across various domains, their optimization guarantees in large-scale model settings are not well understood. This paper rigorously analyzes the convergence properties of gradient flow in training Transformers with weight decay regularization. First, we construct the mean-field limit of large-scale Transformers, showing that as the model width and dept…

    Submitted 30 October, 2024; originally announced October 2024.

    Comments: to be published in 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

    MSC Class: 35Q93

  5. arXiv:2410.03968  [pdf, other]

    cs.LG cs.AI cs.GT math.OC

    Decoding Game: On Minimax Optimality of Heuristic Text Generation Strategies

    Authors: Sijin Chen, Omar Hagrass, Jason M. Klusowski

    Abstract: Decoding strategies play a pivotal role in text generation for modern language models, yet a puzzling gap divides theory and practice. Surprisingly, strategies that should intuitively be optimal, such as Maximum a Posteriori (MAP), often perform poorly in practice. Meanwhile, popular heuristic approaches like Top-$k$ and Nucleus sampling, which employ truncation and normalization of the conditiona…

    Submitted 16 May, 2025; v1 submitted 4 October, 2024; originally announced October 2024.

    Comments: 20 pages, accepted to ICLR 2025

  6. arXiv:2402.03447  [pdf, other]

    stat.ML cs.LG stat.ME

    Challenges in Variable Importance Ranking Under Correlation

    Authors: Annie Liang, Thomas Jemielita, Andy Liaw, Vladimir Svetnik, Lingkang Huang, Richard Baumgartner, Jason M. Klusowski

    Abstract: Variable importance plays a pivotal role in interpretable machine learning as it helps measure the impact of factors on the output of the prediction model. Model-agnostic methods based on the generation of "null" features via permutation (or related approaches) can be applied. Such analysis is often utilized in pharmaceutical applications due to its ability to interpret black-box models, including…

    Submitted 5 February, 2024; originally announced February 2024.

  7. arXiv:2401.00691  [pdf, ps, other]

    stat.ML cs.LG

    Stochastic Gradient Descent for Nonparametric Regression

    Authors: Xin Chen, Jason M. Klusowski

    Abstract: This paper introduces an iterative algorithm for training nonparametric additive models that enjoys favorable memory storage and computational requirements. The algorithm can be viewed as the functional counterpart of stochastic gradient descent, applied to the coefficients of a truncated basis expansion of the component functions. We show that the resulting estimator satisfies an oracle inequalit…

    Submitted 20 June, 2025; v1 submitted 1 January, 2024; originally announced January 2024.

  8. arXiv:2310.09702  [pdf, other]

    math.ST stat.ME stat.ML

    Inference with Mondrian Random Forests

    Authors: Matias D. Cattaneo, Jason M. Klusowski, William G. Underwood

    Abstract: Random forests are popular methods for regression and classification analysis, and many different variants have been proposed in recent years. One interesting example is the Mondrian random forest, in which the underlying constituent trees are constructed via a Mondrian process. We give precise bias and variance characterizations, along with a Berry-Esseen-type central limit theorem, for the Mondr…

    Submitted 8 April, 2025; v1 submitted 14 October, 2023; originally announced October 2023.

    Comments: 64 pages, 1 figure, 6 tables

    MSC Class: 62G08 (Primary); 62G05; 62G20 (Secondary)

  9. arXiv:2310.04606  [pdf, ps, other]

    stat.ML cs.LG math.ST

    Robust Transfer Learning with Unreliable Source Data

    Authors: Jianqing Fan, Cheng Gao, Jason M. Klusowski

    Abstract: This paper addresses challenges in robust transfer learning stemming from ambiguity in Bayes classifiers and weak transferable signals between the target and source distribution. We introduce a novel quantity called the "ambiguity level" that measures the discrepancy between the target and source regression functions, propose a simple transfer learning procedure, and establish a general theorem…

    Submitted 3 May, 2025; v1 submitted 6 October, 2023; originally announced October 2023.

    Comments: Accepted for publication in the Annals of Statistics

  10. arXiv:2309.09880  [pdf, other]

    stat.ML cs.LG

    Error Reduction from Stacked Regressions

    Authors: Xin Chen, Jason M. Klusowski, Yan Shuo Tan

    Abstract: Stacking regressions is an ensemble technique that forms linear combinations of different regression estimators to enhance predictive accuracy. The conventional approach uses cross-validation data to generate predictions from the constituent estimators, and least-squares with nonnegativity constraints to learn the combination weights. In this paper, we learn these weights analogously by minimizing…

    Submitted 7 October, 2024; v1 submitted 18 September, 2023; originally announced September 2023.

  11. arXiv:2309.00079  [pdf, other]

    cs.LG cs.AI math.OC stat.CO stat.ML

    On the Implicit Bias of Adam

    Authors: Matias D. Cattaneo, Jason M. Klusowski, Boris Shigida

    Abstract: In previous literature, backward error analysis was used to find ordinary differential equations (ODEs) approximating the gradient descent trajectory. It was found that finite step sizes implicitly regularize solutions because terms appearing in the ODEs penalize the two-norm of the loss gradients. We prove that the existence of similar implicit regularization in RMSProp and Adam depends on their…

    Submitted 16 June, 2024; v1 submitted 31 August, 2023; originally announced September 2023.

  12. arXiv:2307.07679  [pdf, ps, other]

    stat.ML cs.LG math.NA

    Sharp Convergence Rates for Matching Pursuit

    Authors: Jason M. Klusowski, Jonathan W. Siegel

    Abstract: We study the fundamental limits of matching pursuit, or the pure greedy algorithm, for approximating a target function $ f $ by a linear combination $f_n$ of $n$ elements from a dictionary. When the target function is contained in the variation space corresponding to the dictionary, many impressive works over the past few decades have obtained upper and lower bounds on the error $\|f-f_n\|$ of mat…

    Submitted 22 July, 2024; v1 submitted 14 July, 2023; originally announced July 2023.

  13. arXiv:2211.10805  [pdf, other]

    stat.ML cs.LG math.ST

    On the Pointwise Behavior of Recursive Partitioning and Its Implications for Heterogeneous Causal Effect Estimation

    Authors: Matias D. Cattaneo, Jason M. Klusowski, Peter M. Tian

    Abstract: Decision tree learning is increasingly being used for pointwise inference. Important applications include causal heterogeneous treatment effects and dynamic policy decisions, as well as conditional quantile regression and design of experiments, where tree estimation and inference are conducted at specific values of the covariates. In this paper, we call into question the use of decision trees (train…

    Submitted 6 February, 2024; v1 submitted 19 November, 2022; originally announced November 2022.

  14. arXiv:2210.14429  [pdf, other]

    math.ST stat.ME

    Convergence Rates of Oblique Regression Trees for Flexible Function Libraries

    Authors: Matias D. Cattaneo, Rajita Chandak, Jason M. Klusowski

    Abstract: We develop a theoretical framework for the analysis of oblique decision trees, where the splits at each decision node occur at linear combinations of the covariates (as opposed to conventional tree constructions that force axis-aligned splits involving only a single covariate). While this methodology has garnered significant attention from the computer science and optimization communities since th…

    Submitted 30 August, 2023; v1 submitted 25 October, 2022; originally announced October 2022.

  15. arXiv:2104.13881  [pdf, ps, other]

    stat.ML cs.LG math.ST

    Large Scale Prediction with Decision Trees

    Authors: Jason M. Klusowski, Peter M. Tian

    Abstract: This paper shows that decision trees constructed with Classification and Regression Trees (CART) and C4.5 methodology are consistent for regression and classification tasks, even when the number of predictor variables grows sub-exponentially with the sample size, under natural 0-norm and 1-norm sparsity constraints. The theory applies to a wide range of models, including (ordinary or logistic) add…

    Submitted 13 November, 2023; v1 submitted 28 April, 2021; originally announced April 2021.

  16. arXiv:2011.02683  [pdf, other]

    stat.ML cs.LG

    Nonparametric Variable Screening with Optimal Decision Stumps

    Authors: Jason M. Klusowski, Peter M. Tian

    Abstract: Decision trees and their ensembles are endowed with a rich set of diagnostic tools for ranking and screening variables in a predictive model. Despite the widespread use of tree-based variable importance measures, their theoretical properties have been challenging to pin down and therefore remain largely unexplored. To address this gap between theory and practice, we derive finite sample performance guara…

    Submitted 10 December, 2020; v1 submitted 5 November, 2020; originally announced November 2020.

  17. arXiv:2006.12625  [pdf, other]

    stat.ML cs.LG

    Good Classifiers are Abundant in the Interpolating Regime

    Authors: Ryan Theisen, Jason M. Klusowski, Michael W. Mahoney

    Abstract: Within the machine learning community, the widely used uniform convergence framework has been applied to answer the question of how complex, over-parameterized models can generalize well to new data. This approach bounds the test error of the worst-case model one could have fit to the data, but it has fundamental limitations. Inspired by the statistical mechanics approach to learning, we formally def…

    Submitted 4 March, 2021; v1 submitted 22 June, 2020; originally announced June 2020.

  18. arXiv:2006.04266  [pdf, other]

    stat.ML cs.LG math.ST

    Sparse learning with CART

    Authors: Jason M. Klusowski

    Abstract: Decision trees with binary splits are popularly constructed using Classification and Regression Trees (CART) methodology. For regression models, this approach recursively divides the data into two near-homogeneous daughter nodes according to a split point that maximizes the reduction in sum of squares error (the impurity) along a particular variable. This paper aims to study the statistical propert…

    Submitted 18 November, 2020; v1 submitted 7 June, 2020; originally announced June 2020.

  19. arXiv:1910.10245  [pdf, other]

    stat.ML cs.LG

    Global Capacity Measures for Deep ReLU Networks via Path Sampling

    Authors: Ryan Theisen, Jason M. Klusowski, Huan Wang, Nitish Shirish Keskar, Caiming Xiong, Richard Socher

    Abstract: Classical results on the statistical complexity of linear models have commonly identified the norm of the weights $\|w\|$ as a fundamental capacity measure. Generalizations of this measure to the setting of deep networks have been varied, though a frequently identified quantity is the product of weight norms of each layer. In this work, we show that for a large class of networks possessing a posit…

    Submitted 22 October, 2019; originally announced October 2019.

  20. arXiv:1906.10086  [pdf, other]

    stat.ML cs.LG

    Analyzing CART

    Authors: Jason M. Klusowski

    Abstract: Decision trees with binary splits are popularly constructed using Classification and Regression Trees (CART) methodology. For binary classification and regression models, this approach recursively divides the data into two near-homogeneous daughter nodes according to a split point that maximizes the reduction in sum of squares error (the impurity) along a particular variable. This paper aims to stu…

    Submitted 13 August, 2020; v1 submitted 24 June, 2019; originally announced June 2019.

  21. arXiv:1902.00800  [pdf, ps, other]

    stat.ML cs.LG

    Complexity, Statistical Risk, and Metric Entropy of Deep Nets Using Total Path Variation

    Authors: Andrew R. Barron, Jason M. Klusowski

    Abstract: For any ReLU network there is a representation in which the sum of the absolute values of the weights into each node is exactly $1$, and the input layer variables are multiplied by a value $V$ coinciding with the total variation of the path weights. Implications are given for Gaussian complexity, Rademacher complexity, statistical risk, and metric entropy, all of which are shown to be proportional…

    Submitted 6 February, 2019; v1 submitted 2 February, 2019; originally announced February 2019.

  22. arXiv:1809.03090  [pdf, ps, other]

    stat.ML cs.LG

    Approximation and Estimation for High-Dimensional Deep Learning Networks

    Authors: Andrew R. Barron, Jason M. Klusowski

    Abstract: It has been experimentally observed in recent years that multi-layer artificial neural networks have a surprising ability to generalize, even when trained with far more parameters than observations. Is there a theoretical basis for this? The best available bounds on their metric entropy and associated complexity measures are essentially linear in the number of parameters, which is inadequate to ex…

    Submitted 18 September, 2018; v1 submitted 9 September, 2018; originally announced September 2018.

  23. arXiv:1805.02587  [pdf, other]

    stat.ML cs.LG

    Sharp Analysis of a Simple Model for Random Forests

    Authors: Jason M. Klusowski

    Abstract: Random forests have become an important tool for improving accuracy in regression and classification problems since their inception by Leo Breiman in 2001. In this paper, we revisit a historically important random forest model originally proposed by Breiman in 2004 and later studied by Gérard Biau in 2012, where a feature is selected at random and the split occurs at the midpoint of the node alon…

    Submitted 22 June, 2020; v1 submitted 7 May, 2018; originally announced May 2018.

    MSC Class: 62G08; 68W20

  24. arXiv:1804.09879  [pdf, ps, other]

    math.ST

    Estimation of convex supports from noisy measurements

    Authors: Victor-Emmanuel Brunel, Jason M. Klusowski, Dana Yang

    Abstract: A popular class of problems in statistics deals with estimating the support of a density from $n$ observations drawn at random from a $d$-dimensional distribution. The one-dimensional case reduces to estimating the end points of a univariate density. In practice, an experimenter may only have access to a noisy version of the original data. Therefore, a more realistic model allows for the observatio…

    Submitted 25 April, 2018; originally announced April 2018.

    MSC Class: 62H12; 62G30

  25. arXiv:1802.07773  [pdf, other]

    math.ST cs.DM stat.ML

    Counting Motifs with Graph Sampling

    Authors: Jason M. Klusowski, Yihong Wu

    Abstract: Applied researchers often construct a network from a random sample of nodes in order to infer properties of the parent network. Two of the most widely used sampling schemes are subgraph sampling, where we sample each vertex independently with probability $p$ and observe the subgraph induced by the sampled vertices, and neighborhood sampling, where we additionally observe the edges between the samp…

    Submitted 21 February, 2018; originally announced February 2018.

  26. arXiv:1801.04339  [pdf, other]

    math.ST cs.DM cs.LG stat.ML

    Estimating the Number of Connected Components in a Graph via Subgraph Sampling

    Authors: Jason M. Klusowski, Yihong Wu

    Abstract: Learning properties of large graphs from samples has been an important problem in statistical network analysis since the early work of Goodman (1949) and Frank (1978). We revisit a problem formulated by Frank (1978) of estimating the number of connected components in a large graph based on the subgraph sampling model, in which we randomly sample a subset of the vert…

    Submitted 15 June, 2019; v1 submitted 12 January, 2018; originally announced January 2018.

    MSC Class: 62D05; 62C20

  27. arXiv:1712.10087  [pdf, ps, other]

    math.ST stat.ML

    Finite-sample risk bounds for maximum likelihood estimation with arbitrary penalties

    Authors: W. D. Brinda, Jason M. Klusowski

    Abstract: The MDL two-part coding index of resolvability provides a finite-sample upper bound on the statistical risk of penalized likelihood estimators over countable models. However, the bound does not apply to unpenalized maximum likelihood estimation or procedures with exceedingly small penalties. In this paper, we point out a more general inequality that holds for arbitrary penalties. In a…

    Submitted 28 December, 2017; originally announced December 2017.

    Comments: To appear in IEEE Transactions on Information Theory, 2018

    MSC Class: 62B10 94A20 94A15 62E17

  28. arXiv:1704.08231  [pdf, other]

    stat.ML

    Estimating the Coefficients of a Mixture of Two Linear Regressions by Expectation Maximization

    Authors: Jason M. Klusowski, Dana Yang, W. D. Brinda

    Abstract: We give convergence guarantees for estimating the coefficients of a symmetric mixture of two linear regressions by expectation maximization (EM). In particular, we show that the empirical EM iterates converge to the target parameter vector at the parametric rate, provided the algorithm is initialized in an unbounded cone. Specifically, if the initial guess has a sufficiently large cosine angle wi…

    Submitted 15 October, 2018; v1 submitted 26 April, 2017; originally announced April 2017.

    MSC Class: 62F10; 68W40

  29. arXiv:1702.02828  [pdf, ps, other]

    stat.ML cs.LG

    Minimax Lower Bounds for Ridge Combinations Including Neural Nets

    Authors: Jason M. Klusowski, Andrew R. Barron

    Abstract: Estimation of functions of $ d $ variables is considered using ridge combinations of the form $ \textstyle\sum_{k=1}^m c_{1,k} \phi(\textstyle\sum_{j=1}^d c_{0,j,k}x_j-b_k) $ where the activation function $ \phi $ is a function with bounded value and derivative. These include single-hidden layer neural networks, polynomials, and sinusoidal models. From a sample of size $ n $ of possibly noisy values at r…

    Submitted 9 February, 2017; originally announced February 2017.

    MSC Class: 62J02; 62G08; 68T05

  30. arXiv:1608.02280  [pdf, other]

    stat.ML

    Statistical Guarantees for Estimating the Centers of a Two-component Gaussian Mixture by EM

    Authors: Jason M. Klusowski, W. D. Brinda

    Abstract: Recently, a general method for analyzing the statistical accuracy of the EM algorithm has been developed and applied to some simple latent variable models [Balakrishnan et al. 2016]. In that method, the basin of attraction for valid initialization is required to be a ball around the truth. Using Stein's Lemma, we extend these results in the case of estimating the centers of a two-component Gaussia…

    Submitted 7 August, 2016; originally announced August 2016.

    MSC Class: 62F10; 62F15; 68W40

  31. arXiv:1607.07819  [pdf, ps, other]

    stat.ML math.ST

    Approximation by Combinations of ReLU and Squared ReLU Ridge Functions with $ \ell^1 $ and $ \ell^0 $ Controls

    Authors: Jason M. Klusowski, Andrew R. Barron

    Abstract: We establish $ L^{\infty} $ and $ L^2 $ error bounds for functions of many variables that are approximated by linear combinations of ReLU (rectified linear unit) and squared ReLU ridge functions with $ \ell^1 $ and $ \ell^0 $ controls on their inner and outer parameters. With the squared ReLU ridge function, we show that the $ L^2 $ approximation error is inversely proportional to the inner layer…

    Submitted 23 May, 2018; v1 submitted 26 July, 2016; originally announced July 2016.

    MSC Class: 62M45; 41A15

  32. arXiv:1607.01434  [pdf, ps, other]

    math.ST stat.ML

    Risk Bounds for High-dimensional Ridge Function Combinations Including Neural Networks

    Authors: Jason M. Klusowski, Andrew R. Barron

    Abstract: Let $ f^{\star} $ be a function on $ \mathbb{R}^d $ with an assumption of a spectral norm $ v_{f^{\star}} $. For various noise settings, we show that $ \mathbb{E}\|\hat{f} - f^{\star} \|^2 \leq \left(v^4_{f^{\star}}\frac{\log d}{n}\right)^{1/3} $, where $ n $ is the sample size and $ \hat{f} $ is either a penalized least squares estimator or a greedily obtained version of such using linear combina…

    Submitted 29 October, 2018; v1 submitted 5 July, 2016; originally announced July 2016.

    Comments: Submitted to Annals of Statistics

    MSC Class: 62J02; 62G08; 68T05