Showing 1–9 of 9 results for author: Wen, K

  1. arXiv:2502.08991  [pdf, ps, other]

    cs.LG stat.ML

    Task Generalization With AutoRegressive Compositional Structure: Can Learning From $D$ Tasks Generalize to $D^{T}$ Tasks?

    Authors: Amirhesam Abedsoltan, Huaqing Zhang, Kaiyue Wen, Hongzhou Lin, Jingzhao Zhang, Mikhail Belkin

    Abstract: Large language models (LLMs) exhibit remarkable task generalization, solving tasks they were never explicitly trained on with only a few demonstrations. This raises a fundamental question: When can learning from a small set of tasks generalize to a large task family? In this paper, we investigate task generalization through the lens of autoregressive compositional structure, where each task is a c…

    Submitted 8 June, 2025; v1 submitted 13 February, 2025; originally announced February 2025.

  2. arXiv:2410.05459  [pdf, other]

    cs.LG cs.CL stat.ML

    From Sparse Dependence to Sparse Attention: Unveiling How Chain-of-Thought Enhances Transformer Sample Efficiency

    Authors: Kaiyue Wen, Huaqing Zhang, Hongzhou Lin, Jingzhao Zhang

    Abstract: Chain-of-thought (CoT) significantly enhances the reasoning performance of large language models (LLMs). While current theoretical studies often attribute this improvement to increased expressiveness and computational capacity, we argue that expressiveness is not the primary limitation in the LLM regime, as current large models will fail on simple tasks. Using a parity-learning setup, we demonstrat…

    Submitted 5 March, 2025; v1 submitted 7 October, 2024; originally announced October 2024.

    Comments: 43 pages, 11 figures

  3. arXiv:2410.05192  [pdf, other]

    cs.LG cs.CL stat.ML

    Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective

    Authors: Kaiyue Wen, Zhiyuan Li, Jason Wang, David Hall, Percy Liang, Tengyu Ma

    Abstract: Training language models currently requires pre-determining a fixed compute budget because the typical cosine learning rate schedule depends on the total number of steps. In contrast, the Warmup-Stable-Decay (WSD) schedule uses a constant learning rate to produce a main branch of iterates that can in principle continue indefinitely without a pre-specified compute budget. Then, given any compute bu…

    Submitted 2 December, 2024; v1 submitted 7 October, 2024; originally announced October 2024.

    Comments: 45 pages, 13 figures

  4. arXiv:2402.18510  [pdf, other]

    cs.LG cs.CL stat.ML

    RNNs are not Transformers (Yet): The Key Bottleneck on In-context Retrieval

    Authors: Kaiyue Wen, Xingyu Dang, Kaifeng Lyu

    Abstract: This paper investigates the gap in representation powers of Recurrent Neural Networks (RNNs) and Transformers in the context of solving algorithmic problems. We focus on understanding whether RNNs, known for their memory efficiency in handling long sequences, can match the performance of Transformers, particularly when enhanced with Chain-of-Thought (CoT) prompting. Our theoretical analysis reveal…

    Submitted 6 December, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

    Comments: 42 pages, 6 figures

  5. arXiv:2312.01429  [pdf, other]

    cs.LG cs.CL stat.ML

    Transformers are uninterpretable with myopic methods: a case study with bounded Dyck grammars

    Authors: Kaiyue Wen, Yuchen Li, Bingbin Liu, Andrej Risteski

    Abstract: Interpretability methods aim to understand the algorithm implemented by a trained model (e.g., a Transformer) by examining various aspects of the model, such as the weight matrices or the attention patterns. In this work, through a combination of theoretical results and carefully controlled experiments on synthetic data, we take a critical view of methods that exclusively focus on individual parts…

    Submitted 3 December, 2023; originally announced December 2023.

  6. arXiv:2307.11007  [pdf, other]

    cs.LG math.OC stat.ML

    Sharpness Minimization Algorithms Do Not Only Minimize Sharpness To Achieve Better Generalization

    Authors: Kaiyue Wen, Zhiyuan Li, Tengyu Ma

    Abstract: Despite extensive studies, the underlying reason as to why overparameterized neural networks can generalize remains elusive. Existing theory shows that common stochastic optimizers prefer flatter minimizers of the training loss, and thus a natural potential explanation is that flatness implies generalization. This work critically examines this explanation. Through theoretical and empirical investi…

    Submitted 22 July, 2023; v1 submitted 20 July, 2023; originally announced July 2023.

    Comments: 34 pages, 11 figures

  7. arXiv:2211.16182  [pdf, other]

    math.ST stat.ME

    Residual permutation test for regression coefficient testing

    Authors: Kaiyue Wen, Tengyao Wang, Yuhao Wang

    Abstract: We consider the problem of testing whether a single coefficient is equal to zero in linear models when the dimension of covariates $p$ can be up to a constant fraction of sample size $n$. In this regime, an important topic is to propose tests with finite-sample valid size control without requiring the noise to follow strong distributional assumptions. In this paper, we propose a new method, called…

    Submitted 3 May, 2025; v1 submitted 29 November, 2022; originally announced November 2022.

    Journal ref: The Annals of Statistics 53.2 (2025): 724-748

  8. arXiv:2211.05729  [pdf, other]

    cs.LG math.OC stat.ML

    How Does Sharpness-Aware Minimization Minimize Sharpness?

    Authors: Kaiyue Wen, Tengyu Ma, Zhiyuan Li

    Abstract: Sharpness-Aware Minimization (SAM) is a highly effective regularization technique for improving the generalization of deep neural networks for various settings. However, the underlying working of SAM remains elusive because of various intriguing approximations in the theoretical characterizations. SAM intends to penalize a notion of sharpness of the model but implements a computationally efficient…

    Submitted 5 January, 2023; v1 submitted 10 November, 2022; originally announced November 2022.

    Comments: 94 pages, 1 figure

  9. arXiv:2206.00501  [pdf, other]

    cs.LG cs.AI stat.ML

    Benign Overfitting in Classification: Provably Counter Label Noise with Larger Models

    Authors: Kaiyue Wen, Jiaye Teng, Jingzhao Zhang

    Abstract: Studies on benign overfitting provide insights for the success of overparameterized deep learning models. In this work, we examine whether overfitting is truly benign in real-world classification tasks. We start with the observation that a ResNet model overfits benignly on Cifar10 but not benignly on ImageNet. To understand why benign overfitting fails in the ImageNet experiment, we theoretically…

    Submitted 3 April, 2023; v1 submitted 1 June, 2022; originally announced June 2022.

    Comments: Published as a conference paper at ICLR 2023