-
Gold after Randomized Sand: Model-X Split Knockoffs for Controlled Transformation Selection
Authors:
Yang Cao,
Hangyu Lin,
Xinwei Sun,
Yuan Yao
Abstract:
Controlling the False Discovery Rate (FDR) in variable selection is crucial for reproducibility and preventing over-selection, particularly with the increasing prevalence of predictive modeling. The Split Knockoff method, a recent extension of the canonical Knockoffs framework, offers finite-sample FDR control for selecting sparse transformations, finding applications across signal processing, economics, information technology, and the life sciences. However, its current formulation is limited to fixed design settings, restricting its use to linear models. The question of whether it can be generalized to random designs, thereby accommodating a broader range of models beyond the linear case -- similar to the Model-X Knockoff framework -- remains unanswered. A major challenge in addressing transformational sparsity within random design settings lies in reconciling the combination of a random design with a deterministic transformation. To overcome this limitation, we propose the Model-X Split Knockoff method. Our method achieves FDR control for transformation selection in random designs, bridging the gap between existing approaches. This is accomplished by introducing an auxiliary randomized design that interacts with both the existing random design and the deterministic transformation, enabling the construction of Model-X Split Knockoffs. Like the classical Model-X framework, our method provides provable finite-sample FDR control under known or accurately estimated covariate distributions, regardless of the conditional distribution of the response. Importantly, it guarantees at least the same selection power as Model-X Knockoffs when both are applicable. Empirical studies, including simulations and real-world applications to Alzheimer's disease imaging and university ranking analysis, demonstrate robust FDR control and suggest improved selection power over the original Model-X approach.
Submitted 2 July, 2025;
originally announced July 2025.
-
Red-Teaming Text-to-Image Systems by Rule-based Preference Modeling
Authors:
Yichuan Cao,
Yibo Miao,
Xiao-Shan Gao,
Yinpeng Dong
Abstract:
Text-to-image (T2I) models raise ethical and safety concerns due to their potential to generate inappropriate or harmful images. Evaluating these models' security through red-teaming is vital, yet white-box approaches are limited by their need for internal access, complicating their use with closed-source models. Moreover, existing black-box methods often assume knowledge about the model's specific defense mechanisms, limiting their utility in real-world commercial API scenarios. A significant challenge is how to evade unknown and diverse defense mechanisms. To overcome this difficulty, we propose a novel Rule-based Preference modeling Guided Red-Teaming (RPG-RT), which iteratively employs an LLM to modify prompts used as queries and leverages feedback from T2I systems to fine-tune the LLM. RPG-RT treats the feedback from each iteration as a prior, enabling the LLM to dynamically adapt to unknown defense mechanisms. Given that the feedback is often labeled and coarse-grained, making it difficult to utilize directly, we further propose rule-based preference modeling, which employs a set of rules to evaluate desired or undesired feedback, facilitating finer-grained control over the LLM's dynamic adaptation process. Extensive experiments on nineteen T2I systems with varied safety mechanisms, three online commercial API services, and T2V models verify the superiority and practicality of our approach.
Submitted 27 May, 2025;
originally announced May 2025.
-
Metric Graph Kernels via the Tropical Torelli Map
Authors:
Yueqi Cao,
Anthea Monod
Abstract:
We propose new graph kernels grounded in the study of metric graphs via tropical algebraic geometry. In contrast to conventional graph kernels that are based on graph combinatorics such as nodes, edges, and subgraphs, our graph kernels are purely based on the geometry and topology of the underlying metric space. A key characterizing property of our construction is its invariance under edge subdivision, making the kernels intrinsically well-suited for comparing graphs that represent different underlying spaces. We develop efficient algorithms for computing these kernels and analyze their complexity, showing that it depends primarily on the genus of the input graphs. Empirically, our kernels outperform existing methods in label-free settings, as demonstrated on both synthetic and real-world benchmark datasets. We further highlight their practical utility through an urban road network classification task.
Submitted 17 May, 2025;
originally announced May 2025.
-
Establishing Linear Surrogate Regret Bounds for Convex Smooth Losses via Convolutional Fenchel-Young Losses
Authors:
Yuzhou Cao,
Han Bao,
Lei Feng,
Bo An
Abstract:
Surrogate regret bounds, also known as excess risk bounds, bridge the gap between the convergence rates of surrogate and target losses, with linear bounds favorable for their lossless regret transfer. While convex smooth surrogate losses are particularly appealing due to their efficient estimation and optimization, a trade-off between smoothness and a linear regret bound has been widely believed to exist in the community. In other words, the better optimization and estimation properties of convex smooth surrogate losses may inevitably deteriorate after undergoing the regret transfer onto a target loss. We overcome this dilemma for arbitrary discrete target losses by constructing a convex smooth surrogate loss, which entails a linear surrogate regret bound composed with a tailored prediction link. The construction is based on Fenchel-Young losses generated by the convolutional negentropy, which are equivalent to the infimal convolution of a generalized negentropy and the target Bayes risk. Consequently, the infimal convolution enables us to derive a smooth loss while keeping the surrogate regret bound linear. We additionally benefit from the infimal convolution by obtaining a consistent estimator of the underlying class probability. Overall, our results are a novel demonstration of how convex analysis penetrates into optimization and statistical efficiency in risk minimization.
Submitted 14 May, 2025; v1 submitted 14 May, 2025;
originally announced May 2025.
-
High-Dimensional Importance-Weighted Information Criteria: Theory and Optimality
Authors:
Yong-Syun Cao,
Shinpei Imori,
Ching-Kang Ing
Abstract:
Imori and Ing (2025) proposed the importance-weighted orthogonal greedy algorithm (IWOGA) for model selection in high-dimensional misspecified regression models under covariate shift. To determine the number of IWOGA iterations, they introduced the high-dimensional importance-weighted information criterion (HDIWIC). They argued that the combined use of IWOGA and HDIWIC, IWOGA + HDIWIC, achieves an optimal trade-off between variance and squared bias, leading to optimal convergence rates in terms of conditional mean squared prediction error. In this article, we provide a theoretical justification for this claim by establishing the optimality of IWOGA + HDIWIC under a set of reasonable assumptions.
Submitted 10 May, 2025;
originally announced May 2025.
-
Interpretable Hybrid-Rule Temporal Point Processes
Authors:
Yunyang Cao,
Juekai Lin,
Hongye Wang,
Wenhao Li,
Bo Jin
Abstract:
Temporal Point Processes (TPPs) are widely used for modeling event sequences in various medical domains, such as disease onset prediction, progression analysis, and clinical decision support. Although TPPs effectively capture temporal dynamics, their lack of interpretability remains a critical challenge. Recent advancements have introduced interpretable TPPs. However, these methods fail to incorporate numerical features, thereby limiting their ability to generate precise predictions. To address this issue, we propose Hybrid-Rule Temporal Point Processes (HRTPP), a novel framework that integrates temporal logic rules with numerical features, improving both interpretability and predictive accuracy in event modeling. HRTPP comprises three key components: basic intensity for intrinsic event likelihood, rule-based intensity for structured temporal dependencies, and numerical feature intensity for dynamic probability modulation. To effectively discover valid rules, we introduce a two-phase rule mining strategy with Bayesian optimization. To evaluate our method, we establish a multi-criteria assessment framework, incorporating rule validity, model fitting, and temporal predictive accuracy. Experimental results on real-world medical datasets demonstrate that HRTPP outperforms state-of-the-art interpretable TPPs in terms of predictive performance and clinical interpretability. In case studies, the rules extracted by HRTPP explain the disease progression, offering valuable contributions to medical diagnosis.
Submitted 19 April, 2025; v1 submitted 15 April, 2025;
originally announced April 2025.
-
Transformer Learns Optimal Variable Selection in Group-Sparse Classification
Authors:
Chenyang Zhang,
Xuran Meng,
Yuan Cao
Abstract:
Transformers have demonstrated remarkable success across various applications. However, the success of transformers has not been understood in theory. In this work, we give a case study of how transformers can be trained to learn a classic statistical model with "group sparsity", where the input variables form multiple groups, and the label only depends on the variables from one of the groups. We theoretically demonstrate that a one-layer transformer trained by gradient descent can correctly leverage the attention mechanism to select variables, disregarding irrelevant ones and focusing on those beneficial for classification. We also demonstrate that a well-pretrained one-layer transformer can be adapted to new downstream tasks to achieve good prediction accuracy with a limited number of samples. Our study sheds light on how transformers effectively learn structured data.
Submitted 11 April, 2025;
originally announced April 2025.
-
Gradient Descent Robustly Learns the Intrinsic Dimension of Data in Training Convolutional Neural Networks
Authors:
Chenyang Zhang,
Peifeng Gao,
Difan Zou,
Yuan Cao
Abstract:
Modern neural networks are usually highly over-parameterized. Behind the wide usage of over-parameterized networks is the belief that, if the data are simple, then the trained network will be automatically equivalent to a simple predictor. Following this intuition, many existing works have studied different notions of "ranks" of neural networks and their relation to the rank of data. In this work, we study the rank of convolutional neural networks (CNNs) trained by gradient descent, with a specific focus on the robustness of the rank to image background noises. Specifically, we point out that, when adding background noises to images, the rank of the CNN trained with gradient descent is affected far less compared with the rank of the data. We support our claim with a theoretical case study, where we consider a particular data model to characterize low-rank clean images with added background noises. We prove that CNNs trained by gradient descent can learn the intrinsic dimension of clean images, despite the presence of relatively large background noises. We also conduct experiments on synthetic and real datasets to further validate our claim.
Submitted 11 April, 2025;
originally announced April 2025.
-
Prompt Optimization with Logged Bandit Data
Authors:
Haruka Kiyohara,
Daniel Yiming Cao,
Yuta Saito,
Thorsten Joachims
Abstract:
We study how to use naturally available user feedback, such as clicks, to optimize large language model (LLM) pipelines for generating personalized sentences using prompts. Naive approaches, which estimate the policy gradient in the prompt space, suffer either from variance caused by the large action space of prompts or bias caused by inaccurate reward predictions. To circumvent these challenges, we propose a novel kernel-based off-policy gradient method, which estimates the policy gradient by leveraging similarity among generated sentences, substantially reducing variance while suppressing the bias. Empirical results on our newly established suite of benchmarks demonstrate the effectiveness of the proposed approach in generating personalized descriptions for movie recommendations, particularly when the number of candidate prompts is large.
Submitted 3 April, 2025;
originally announced April 2025.
-
On the Robustness of Transformers against Context Hijacking for Linear Classification
Authors:
Tianle Li,
Chenyang Zhang,
Xingwu Chen,
Yuan Cao,
Difan Zou
Abstract:
Transformer-based Large Language Models (LLMs) have demonstrated powerful in-context learning capabilities. However, their predictions can be disrupted by factually correct context, a phenomenon known as context hijacking, revealing a significant robustness issue. To understand this phenomenon theoretically, we explore an in-context linear classification problem based on recent advances in linear transformers. In our setup, context tokens are designed as factually correct query-answer pairs, where the queries are similar to the final query but have opposite labels. Then, we develop a general theoretical analysis on the robustness of the linear transformers, which is formulated as a function of the model depth, training context lengths, and number of hijacking context tokens. A key finding is that a well-trained deeper transformer can achieve higher robustness, which aligns with empirical observations. We show that this improvement arises because deeper layers enable more fine-grained optimization steps, effectively mitigating interference from context hijacking. This is also well supported by our numerical experiments. Our findings provide theoretical insights into the benefits of deeper architectures and contribute to enhancing the understanding of transformer architectures.
Submitted 21 February, 2025;
originally announced February 2025.
-
Transformers versus the EM Algorithm in Multi-class Clustering
Authors:
Yihan He,
Hong-Yu Chen,
Yuan Cao,
Jianqing Fan,
Han Liu
Abstract:
LLMs demonstrate significant inference capacities in complicated machine learning tasks, using the Transformer model as their backbone. Motivated by the limited understanding of such models on unsupervised learning problems, we study the learning guarantees of Transformers in performing multi-class clustering of Gaussian Mixture Models. We develop a theory drawing strong connections between the Softmax Attention layers and the workflow of the EM algorithm on clustering the mixture of Gaussians. Our theory provides approximation bounds for the Expectation and Maximization steps by proving the universal approximation abilities of multivariate mappings by Softmax functions. In addition to the approximation guarantees, we also show that with a sufficient number of pre-training samples and an initialization, Transformers can achieve the minimax optimal rate for the problem considered. Our extensive simulations empirically verify our theory by revealing the strong learning capacities of Transformers even beyond the assumptions in the theory, shedding light on the powerful inference capacities of LLMs.
Submitted 9 February, 2025;
originally announced February 2025.
-
Transformers Simulate MLE for Sequence Generation in Bayesian Networks
Authors:
Yuan Cao,
Yihan He,
Dennis Wu,
Hong-Yu Chen,
Jianqing Fan,
Han Liu
Abstract:
Transformers have achieved significant success in various fields, notably excelling in tasks involving sequential data like natural language processing. Despite these achievements, the theoretical understanding of transformers' capabilities remains limited. In this paper, we investigate the theoretical capabilities of transformers to autoregressively generate sequences in Bayesian networks based on in-context maximum likelihood estimation (MLE). Specifically, we consider a setting where a context is formed by a set of independent sequences generated according to a Bayesian network. We demonstrate that there exists a simple transformer model that can (i) estimate the conditional probabilities of the Bayesian network according to the context, and (ii) autoregressively generate a new sample according to the Bayesian network with estimated conditional probabilities. We further demonstrate in extensive experiments that such a transformer does not only exist in theory, but can also be effectively obtained through training. Our analysis highlights the potential of transformers to learn complex probabilistic models and contributes to a better understanding of large language models as a powerful class of sequence generators.
Submitted 8 July, 2025; v1 submitted 5 January, 2025;
originally announced January 2025.
-
Learning Spectral Methods by Transformers
Authors:
Yihan He,
Yuan Cao,
Hong-Yu Chen,
Dennis Wu,
Jianqing Fan,
Han Liu
Abstract:
Transformers demonstrate significant advantages as the building block of modern LLMs. In this work, we study the capacities of Transformers in performing unsupervised learning. We show that multi-layered Transformers, given a sufficiently large set of pre-training instances, are able to learn the algorithms themselves and perform statistical estimation tasks given new instances. This learning paradigm is distinct from the in-context learning setup and is similar to the learning procedure of human brains where skills are learned through past experience. Theoretically, we prove that pre-trained Transformers can learn the spectral methods and use the classification of bi-class Gaussian mixture model as an example. Our proof is constructive using algorithmic design techniques. Our results are built upon the similarities of multi-layered Transformer architecture with the iterative recovery algorithms used in practice. Empirically, we verify the strong capacity of the multi-layered (pre-trained) Transformer on unsupervised learning through the lens of both the PCA and the Clustering tasks performed on the synthetic and real-world datasets.
Submitted 12 January, 2025; v1 submitted 2 January, 2025;
originally announced January 2025.
-
Towards Simple and Provable Parameter-Free Adaptive Gradient Methods
Authors:
Yuanzhe Tao,
Huizhuo Yuan,
Xun Zhou,
Yuan Cao,
Quanquan Gu
Abstract:
Optimization algorithms such as AdaGrad and Adam have significantly advanced the training of deep models by dynamically adjusting the learning rate during the optimization process. However, ad hoc tuning of learning rates poses a challenge, leading to inefficiencies in practice. To address this issue, recent research has focused on developing "learning-rate-free" or "parameter-free" algorithms that operate effectively without the need for learning rate tuning. Despite these efforts, existing parameter-free variants of AdaGrad and Adam tend to be overly complex and/or lack formal convergence guarantees. In this paper, we present AdaGrad++ and Adam++, novel and simple parameter-free variants of AdaGrad and Adam with convergence guarantees. We prove that AdaGrad++ achieves comparable convergence rates to AdaGrad in convex optimization without predefined learning rate assumptions. Similarly, Adam++ matches the convergence rate of Adam without relying on any conditions on the learning rates. Experimental results across various deep learning tasks validate the competitive performance of AdaGrad++ and Adam++.
Submitted 26 December, 2024;
originally announced December 2024.
-
On the Feature Learning in Diffusion Models
Authors:
Andi Han,
Wei Huang,
Yuan Cao,
Difan Zou
Abstract:
The predominant success of diffusion models in generative modeling has spurred significant interest in understanding their theoretical foundations. In this work, we propose a feature learning framework aimed at analyzing and comparing the training dynamics of diffusion models with those of traditional classification models. Our theoretical analysis demonstrates that diffusion models, due to the denoising objective, are encouraged to learn more balanced and comprehensive representations of the data. In contrast, neural networks with a similar architecture trained for classification tend to prioritize learning specific patterns in the data, often focusing on easy-to-learn components. To support these theoretical insights, we conduct several experiments on both synthetic and real-world datasets, which empirically validate our findings and highlight the distinct feature learning dynamics in diffusion models compared to classification.
Submitted 2 March, 2025; v1 submitted 1 December, 2024;
originally announced December 2024.
-
Can a Single Tree Outperform an Entire Forest?
Authors:
Qiangqiang Mao,
Yankai Cao
Abstract:
The prevailing mindset is that a single decision tree underperforms classic random forests in testing accuracy, despite its advantages in interpretability and lightweight structure. This study challenges such a mindset by significantly improving the testing accuracy of an oblique regression tree through our gradient-based entire tree optimization framework, making its performance comparable to the classic random forest. Our approach reformulates tree training as a differentiable unconstrained optimization task, employing a scaled sigmoid approximation strategy. To ameliorate numerical instability, we propose an algorithmic scheme that solves a sequence of increasingly accurate approximations. Additionally, a subtree polish strategy is implemented to reduce approximation errors accumulated across the tree. Extensive experiments on 16 datasets demonstrate that our optimized tree outperforms the classic random forest by an average improvement of $2.03\%$ in testing accuracy.
Submitted 25 November, 2024;
originally announced November 2024.
-
Performance Analysis of uRLLC in scalable Cell-free Radio Access Network System
Authors:
Ziyang Zhang,
Dongming Wang,
Yunxiang Guo,
Yang Cao,
Xiaohu You
Abstract:
As a critical component of beyond fifth-generation (B5G) and sixth-generation (6G) mobile communication systems, ultra-reliable low-latency communication (uRLLC) imposes stringent requirements on latency and reliability. In recent years, with the improvement of mobile communication networks, centralized and distributed processing schemes for cell-free massive multiple-input multiple-output (CF-mMIMO) have attracted significant research attention. This paper investigates the performance of a novel scalable cell-free radio access network (CF-RAN) architecture featuring multiple edge distributed units (EDUs) under the finite block length regime. Closed-form expressions for the upper and lower bounds of its expected spectral efficiency (SE) performance are derived, where centralized and fully distributed deployment can be treated as two special cases, respectively. Furthermore, the spatial distribution of user equipments (UEs) and remote radio units (RRUs) is examined, and the analysis reveals that the interleaving RRU deployment associated with the EDU can enhance SE performance under finite block length constraints with a specific transmission error probability. The paper also compares Monte Carlo simulation results with multi-RRU clustering-based collaborative processing, validating the accuracy of the space-time exchange theory in the scalable CF-RAN scenario. By deploying scalable EDUs, a practical trade-off between latency and reliability can be achieved through spatial degree-of-freedom (DoF), offering a distributed and scalable realization of the space-time exchange theory.
Submitted 12 December, 2024; v1 submitted 13 November, 2024;
originally announced November 2024.
-
Global Convergence in Training Large-Scale Transformers
Authors:
Cheng Gao,
Yuan Cao,
Zihao Li,
Yihan He,
Mengdi Wang,
Han Liu,
Jason Matthew Klusowski,
Jianqing Fan
Abstract:
Despite the widespread success of Transformers across various domains, their optimization guarantees in large-scale model settings are not well-understood. This paper rigorously analyzes the convergence properties of gradient flow in training Transformers with weight decay regularization. First, we construct the mean-field limit of large-scale Transformers, showing that as the model width and depth go to infinity, gradient flow converges to the Wasserstein gradient flow, which is represented by a partial differential equation. Then, we demonstrate that the gradient flow reaches a global minimum consistent with the PDE solution when the weight decay regularization parameter is sufficiently small. Our analysis is based on a series of novel mean-field techniques that adapt to Transformers. Compared with existing tools for deep networks (Lu et al., 2020) that demand homogeneity and global Lipschitz smoothness, we utilize a refined analysis assuming only $\textit{partial homogeneity}$ and $\textit{local Lipschitz smoothness}$. These new techniques may be of independent interest.
Submitted 30 October, 2024;
originally announced October 2024.
-
NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks
Authors:
Yongchang Hao,
Yanshuai Cao,
Lili Mou
Abstract:
The performance of neural networks improves when more parameters are used. However, the model sizes are constrained by the available on-device memory during training and inference. Although applying techniques like quantization can alleviate the constraint, they suffer from performance degradation. In this work, we introduce NeuZip, a new weight compression scheme based on the entropy of floating-point numbers in neural networks. With NeuZip, we are able to achieve memory-efficient training and inference without sacrificing performance. Notably, we significantly reduce the memory footprint of training a Llama-3 8B model from 31GB to less than 16GB, while keeping the training dynamics fully unchanged. In inference, our method can reduce memory usage by more than half while maintaining near-lossless performance. Our code is publicly available.
Submitted 27 October, 2024;
originally announced October 2024.
-
A Systematic Review of Machine Learning Approaches for Detecting Deceptive Activities on Social Media: Methods, Challenges, and Biases
Authors:
Yunchong Liu,
Xiaorui Shen,
Yeyubei Zhang,
Zhongyan Wang,
Yexin Tian,
Jianglai Dai,
Yuchen Cao
Abstract:
Social media platforms like Twitter, Facebook, and Instagram have facilitated the spread of misinformation, necessitating automated detection systems. This systematic review evaluates 36 studies that apply machine learning (ML) and deep learning (DL) models to detect fake news, spam, and fake accounts on social media. Using the Prediction model Risk Of Bias ASsessment Tool (PROBAST), the review identified key biases across the ML lifecycle: selection bias due to non-representative sampling, inadequate handling of class imbalance, insufficient linguistic preprocessing (e.g., negations), and inconsistent hyperparameter tuning. Although models such as Support Vector Machines (SVM), Random Forests, and Long Short-Term Memory (LSTM) networks showed strong potential, over-reliance on accuracy as an evaluation metric in imbalanced data settings was a common flaw. The review highlights the need for improved data preprocessing (e.g., resampling techniques), consistent hyperparameter tuning, and the use of appropriate metrics like precision, recall, F1 score, and AUROC. Addressing these limitations can lead to more reliable and generalizable ML/DL models for detecting deceptive content, ultimately contributing to the reduction of misinformation on social media.
Submitted 9 March, 2025; v1 submitted 26 October, 2024;
originally announced October 2024.
-
Initialization Matters: On the Benign Overfitting of Two-Layer ReLU CNN with Fully Trainable Layers
Authors:
Shuning Shang,
Xuran Meng,
Yuan Cao,
Difan Zou
Abstract:
Benign overfitting refers to how over-parameterized neural networks can fit training data perfectly and generalize well to unseen data. While this has been widely investigated theoretically, existing works are limited to two-layer networks with fixed output layers, where only the hidden weights are trained. We extend the analysis to two-layer ReLU convolutional neural networks (CNNs) with fully trainable layers, which is closer to practice. Our results show that the initialization scaling of the output layer is crucial to the training dynamics: large scales make training behave similarly to the fixed-output case, where the hidden layer grows rapidly while the output layer remains largely unchanged; in contrast, small scales result in more complex layer interactions, where the hidden layer initially grows to a specific ratio relative to the output layer, after which both layers jointly grow and maintain that ratio throughout training. Furthermore, in both settings, we provide nearly matching upper and lower bounds on the test errors, identifying the sharp conditions on the initialization scaling and signal-to-noise ratio (SNR) under which benign overfitting can or cannot be achieved. Numerical experiments back up the theoretical results.
Submitted 24 October, 2024;
originally announced October 2024.
-
Understanding the Benefits of SimCLR Pre-Training in Two-Layer Convolutional Neural Networks
Authors:
Han Zhang,
Yuan Cao
Abstract:
SimCLR is one of the most popular contrastive learning methods for vision tasks. It pre-trains deep neural networks based on a large amount of unlabeled data by teaching the model to distinguish between positive and negative pairs of augmented images. It is believed that SimCLR can pre-train a deep neural network to learn efficient representations that lead to better performance in subsequent supervised fine-tuning. Despite its effectiveness, our theoretical understanding of the underlying mechanisms of SimCLR is still limited. In this paper, we present a theoretical case study of the SimCLR method. Specifically, we consider training a two-layer convolutional neural network (CNN) to learn a toy image data model. We show that, under certain conditions on the number of labeled data, SimCLR pre-training combined with supervised fine-tuning achieves almost optimal test loss. Notably, the label complexity for SimCLR pre-training is far less demanding compared to direct training on supervised data. Our analysis sheds light on the benefits of SimCLR in learning with fewer labels.
Submitted 27 September, 2024;
originally announced September 2024.
-
Heterogeneous peer effects of college roommates on academic performance
Authors:
Yi Cao,
Tao Zhou,
Jian Gao
Abstract:
Understanding how student peers influence learning outcomes is crucial for effective education management in complex social systems. The complexities of peer selection and evolving peer relationships, however, pose challenges for identifying peer effects using static observational data. Here we use both null-model and regression approaches to examine peer effects using longitudinal data from 5,272 undergraduates, where roommate assignments are plausibly random upon enrollment and roommate relationships persist until graduation. Specifically, we construct a roommate null model by randomly shuffling students among dorm rooms and introduce an assimilation metric to quantify similarities in roommate academic performance. We find significantly larger assimilation in actual data than in the roommate null model, suggesting roommate peer effects, whereby roommates have more similar performance than expected by chance alone. Moreover, assimilation exhibits an overall increasing trend over time, suggesting that peer effects become stronger the longer roommates live together. Our regression analysis further reveals the moderating role of peer heterogeneity. In particular, when roommates perform similarly, the positive relationship between a student's future performance and their roommates' average prior performance is more pronounced, and their ordinal rank in the dorm room has an independent effect. Our findings contribute to understanding the role of college roommates in influencing student academic performance.
Submitted 29 May, 2024;
originally announced June 2024.
-
The Implicit Bias of Adam on Separable Data
Authors:
Chenyang Zhang,
Difan Zou,
Yuan Cao
Abstract:
Adam has become one of the most favored optimizers in deep learning problems. Despite its success in practice, numerous mysteries persist regarding its theoretical understanding. In this paper, we study the implicit bias of Adam in linear logistic regression. Specifically, we show that when the training data are linearly separable, Adam converges towards a linear classifier that achieves the maximum $\ell_\infty$-margin. Notably, for a general class of diminishing learning rates, this convergence occurs within polynomial time. Our results shed light on the difference between Adam and (stochastic) gradient descent from a theoretical perspective.
Submitted 15 June, 2024;
originally announced June 2024.
-
Ginger: An Efficient Curvature Approximation with Linear Complexity for General Neural Networks
Authors:
Yongchang Hao,
Yanshuai Cao,
Lili Mou
Abstract:
Second-order optimization approaches like the generalized Gauss-Newton method are considered more powerful as they utilize the curvature information of the objective function with preconditioning matrices. Despite offering tempting theoretical benefits, they are not easily applicable to modern deep learning. The main reason is the quadratic memory and cubic time complexity of computing the inverse of the matrix. These requirements are infeasible even with state-of-the-art hardware. In this work, we propose Ginger, an eigendecomposition for the inverse of the generalized Gauss-Newton matrix. Our method enjoys efficient linear memory and time complexity for each iteration. Instead of approximating the conditioning matrix, we directly maintain its inverse to make the approximation more accurate. We provide the convergence result of Ginger for non-convex objectives. Our experiments on different tasks with different model architectures verify the effectiveness of our method. Our code is publicly available.
Submitted 5 February, 2024;
originally announced February 2024.
-
Flora: Low-Rank Adapters Are Secretly Gradient Compressors
Authors:
Yongchang Hao,
Yanshuai Cao,
Lili Mou
Abstract:
Although large neural networks demonstrate remarkable abilities to complete different tasks, they require excessive memory to store the optimization states for training. To alleviate this, low-rank adaptation (LoRA) has been proposed to reduce the optimization states by training fewer parameters. However, LoRA restricts overall weight update matrices to be low-rank, limiting the model performance. In this work, we investigate the dynamics of LoRA and identify that it can be approximated by a random projection. Based on this observation, we propose Flora, which is able to achieve high-rank updates by resampling the projection matrices while enjoying the sublinear space complexity of optimization states. We conduct experiments across different tasks and model architectures to verify the effectiveness of our approach.
Submitted 12 June, 2024; v1 submitted 5 February, 2024;
originally announced February 2024.
-
Can overfitted deep neural networks in adversarial training generalize? -- An approximation viewpoint
Authors:
Zhongjie Shi,
Fanghui Liu,
Yuan Cao,
Johan A. K. Suykens
Abstract:
Adversarial training is a widely used method to improve the robustness of deep neural networks (DNNs) against adversarial perturbations. However, it is empirically observed that adversarial training on over-parameterized networks often suffers from \textit{robust overfitting}: it can achieve almost zero adversarial training error while the robust generalization performance is not promising. In this paper, we provide a theoretical understanding of the question of whether overfitted DNNs in adversarial training can generalize from an approximation viewpoint. Specifically, our main results are threefold: i) For classification, we prove by construction the existence of infinitely many adversarial training classifiers on over-parameterized DNNs that obtain arbitrarily small adversarial training error (overfitting), while achieving good robust generalization error under certain conditions concerning the data quality, how well separated the data are, and the perturbation level. ii) Linear over-parameterization (meaning that the number of parameters is only slightly larger than the sample size) is enough to ensure such existence if the target function is smooth enough. iii) For regression, our results demonstrate that there also exist infinitely many overfitted DNNs with linear over-parameterization in adversarial training that can achieve almost optimal rates of convergence for the standard generalization error. Overall, our analysis points out that robust overfitting can be avoided but the required model capacity will depend on the smoothness of the target function, while a robust generalization gap is inevitable. We hope our analysis will give a better understanding of the mathematical foundations of robustness in DNNs from an approximation view.
Submitted 24 January, 2024;
originally announced January 2024.
-
Handling The Non-Smooth Challenge in Tensor SVD: A Multi-Objective Tensor Recovery Framework
Authors:
Jingjing Zheng,
Wanglong Lu,
Wenzhe Wang,
Yankai Cao,
Xiaoqin Zhang,
Xianta Jiang
Abstract:
Recently, numerous tensor singular value decomposition (t-SVD)-based tensor recovery methods have shown promise in processing visual data, such as color images and videos. However, these methods often suffer from severe performance degradation when confronted with tensor data exhibiting non-smooth changes. Such data are commonly observed in real-world scenarios but ignored by traditional t-SVD-based methods. In this work, we introduce a novel tensor recovery model with a learnable tensor nuclear norm to address this challenge. We develop a new optimization algorithm named the Alternating Proximal Multiplier Method (APMM) to iteratively solve the proposed tensor completion model. Theoretical analysis demonstrates the convergence of the proposed APMM to the Karush-Kuhn-Tucker (KKT) point of the optimization problem. In addition, we propose a multi-objective tensor recovery framework based on APMM to efficiently explore the correlations of tensor data across its various dimensions, providing a new perspective on extending the t-SVD-based method to higher-order tensor cases. Numerical experiments demonstrate the effectiveness of the proposed method in tensor completion.
Submitted 13 July, 2024; v1 submitted 23 November, 2023;
originally announced November 2023.
-
Regression with Cost-based Rejection
Authors:
Xin Cheng,
Yuzhou Cao,
Haobo Wang,
Hongxin Wei,
Bo An,
Lei Feng
Abstract:
Learning with rejection is an important framework in which a model can refrain from making predictions to avoid critical mispredictions by balancing between prediction and rejection. Previous studies on cost-based rejection only focused on the classification setting, which cannot handle the continuous and infinite target space in the regression setting. In this paper, we investigate a novel regression problem called regression with cost-based rejection, where the model can decline to make predictions on some examples given certain rejection costs. To solve this problem, we first formulate the expected risk for this problem and then derive the Bayes optimal solution, which shows that the optimal model should decline to make predictions on the examples whose variance is larger than the rejection cost when the mean squared error is used as the evaluation metric. Furthermore, we propose to train the model by a surrogate loss function that considers rejection as binary classification and we provide conditions for the model consistency, which implies that the Bayes optimal solution can be recovered by our proposed surrogate loss. Extensive experiments demonstrate the effectiveness of our proposed method.
Submitted 8 November, 2023;
originally announced November 2023.
-
Split Knockoffs for Multiple Comparisons: Controlling the Directional False Discovery Rate
Authors:
Yang Cao,
Xinwei Sun,
Yuan Yao
Abstract:
Multiple comparisons in hypothesis testing often encounter structural constraints in various applications. For instance, in structural Magnetic Resonance Imaging for Alzheimer's Disease, the focus extends beyond examining atrophic brain regions to include comparisons of anatomically adjacent regions. These constraints can be modeled as linear transformations of parameters, where the sign patterns play a crucial role in estimating directional effects. This class of problems, encompassing total variations, wavelet transforms, fused LASSO, trend filtering, and more, presents an open challenge in effectively controlling the directional false discovery rate. In this paper, we propose an extended Split Knockoff method specifically designed to address the control of directional false discovery rate under linear transformations. Our proposed approach relaxes the stringent linear manifold constraint to its neighborhood, employing a variable splitting technique commonly used in optimization. This methodology yields an orthogonal design that benefits both power and directional false discovery rate control. By incorporating a sample splitting scheme, we achieve effective control of the directional false discovery rate, with a notable reduction to zero as the relaxed neighborhood expands. To demonstrate the efficacy of our method, we conduct simulation experiments and apply it to two real-world scenarios: Alzheimer's Disease analysis and human age comparisons.
Submitted 11 October, 2023;
originally announced October 2023.
-
The Implicit Bias of Batch Normalization in Linear Models and Two-layer Linear Convolutional Neural Networks
Authors:
Yuan Cao,
Difan Zou,
Yuanzhi Li,
Quanquan Gu
Abstract:
We study the implicit bias of batch normalization trained by gradient descent. We show that when learning a linear model with batch normalization for binary classification, gradient descent converges to a uniform margin classifier on the training data with an $\exp(-\Omega(\log^2 t))$ convergence rate. This distinguishes linear models with batch normalization from those without batch normalization in terms of both the type of implicit bias and the convergence rate. We further extend our result to a class of two-layer, single-filter linear convolutional neural networks, and show that batch normalization has an implicit bias towards a patch-wise uniform margin. Based on two examples, we demonstrate that patch-wise uniform margin classifiers can outperform the maximum margin classifiers in certain learning problems. Our results contribute to a better theoretical understanding of batch normalization.
Submitted 11 July, 2023; v1 submitted 20 June, 2023;
originally announced June 2023.
-
Per-Example Gradient Regularization Improves Learning Signals from Noisy Data
Authors:
Xuran Meng,
Yuan Cao,
Difan Zou
Abstract:
Gradient regularization, as described in \citet{barrett2021implicit}, is a highly effective technique for promoting flat minima during gradient descent. Empirical evidence suggests that this regularization technique can significantly enhance the robustness of deep learning models against noisy perturbations, while also reducing test error. In this paper, we explore the per-example gradient regularization (PEGR) and present a theoretical analysis that demonstrates its effectiveness in improving both test error and robustness against noise perturbations. Specifically, we adopt a signal-noise data model from \citet{cao2022benign} and show that PEGR can learn signals effectively while suppressing noise. In contrast, standard gradient descent struggles to distinguish the signal from the noise, leading to suboptimal generalization performance. Our analysis reveals that PEGR penalizes the variance of pattern learning, thus effectively suppressing the memorization of noises from the training data. These findings underscore the importance of variance control in deep learning training and offer useful insights for developing more effective training approaches.
Submitted 31 March, 2023;
originally announced March 2023.
-
The Benefits of Mixup for Feature Learning
Authors:
Difan Zou,
Yuan Cao,
Yuanzhi Li,
Quanquan Gu
Abstract:
Mixup, a simple data augmentation method that randomly mixes two data points via linear interpolation, has been extensively applied in various deep learning applications to gain better generalization. However, the theoretical underpinnings of its efficacy are not yet fully understood. In this paper, we seek a fundamental understanding of the benefits of Mixup. We first show that Mixup using different linear interpolation parameters for features and labels can still achieve similar performance to the standard Mixup. This indicates that the intuitive linearity explanation in Zhang et al. (2018) may not fully explain the success of Mixup. Then we perform a theoretical study of Mixup from the feature learning perspective. We consider a feature-noise data model and show that Mixup training can effectively learn the rare features (appearing in a small fraction of data) from its mixture with the common features (appearing in a large fraction of data). In contrast, standard training can only learn the common features but fails to learn the rare features, thus suffering from bad generalization performance. Moreover, our theoretical analysis also shows that the benefits of Mixup for feature learning are mostly gained in the early training phase, based on which we propose to apply early stopping in Mixup. Experimental results verify our theoretical findings and demonstrate the effectiveness of the early-stopped Mixup training.
Submitted 15 March, 2023;
originally announced March 2023.
-
Analyzing covariate clustering effects in healthcare cost subgroups: insights and applications for prediction
Authors:
Zhengxiao Li,
Yifan Huang,
Yang Cao
Abstract:
Healthcare cost prediction is a challenging task due to the high dimensionality and high correlation among covariates. Additionally, the skewed, heavy-tailed, and often multi-modal nature of cost data can complicate matters further due to unobserved heterogeneity. In this study, we propose a novel framework for finite mixture regression models that incorporates covariate clustering methods to better account for the effects of clustered covariates on subgroups of the outcome, which enables a more accurate characterization of the complex distribution of the data. The proposed framework can be formulated as a convex optimization problem with an additional penalty term based on the prior similarity of the covariates. To efficiently solve this optimization problem, a specialized EM-ADMM algorithm is proposed that integrates the alternating direction method of multipliers (ADMM) into the iterative process of the expectation-maximization (EM) algorithm. The convergence of the algorithm and the efficiency of the covariate clustering method are verified using simulation data, and the superiority of the approach over traditional regression techniques is demonstrated using two real Chinese medical expenditure datasets. Our empirical results provide valuable insights into the complex network graph of the covariates and can inform business practices, such as the design and pricing of medical insurance products.
Submitted 10 March, 2023;
originally announced March 2023.
-
Revisiting Discriminative vs. Generative Classifiers: Theory and Implications
Authors:
Chenyu Zheng,
Guoqiang Wu,
Fan Bao,
Yue Cao,
Chongxuan Li,
Jun Zhu
Abstract:
A large-scale deep model pre-trained on massive labeled or unlabeled data transfers well to downstream tasks. Linear evaluation freezes parameters in the pre-trained model and trains a linear classifier separately, which is efficient and attractive for transfer. However, little work has investigated the classifier in linear evaluation except for the default logistic regression. Inspired by the statistical efficiency of naive Bayes, the paper revisits the classical topic of discriminative vs. generative classifiers. Theoretically, the paper considers the surrogate loss instead of the zero-one loss in analyses and generalizes the classical results from binary cases to multiclass ones. We show that, under mild assumptions, multiclass naive Bayes requires $O(\log n)$ samples to approach its asymptotic error while the corresponding multiclass logistic regression requires $O(n)$ samples, where $n$ is the feature dimension. To establish it, we present a multiclass $\mathcal{H}$-consistency bound framework and an explicit bound for logistic loss, which are of independent interest. Simulation results on a mixture of Gaussians validate our theoretical findings. Experiments on various pre-trained deep vision models show that naive Bayes consistently converges faster as the number of data increases. Besides, naive Bayes shows promise in few-shot cases and we observe the "two regimes" phenomenon in pre-trained supervised models. Our code is available at https://github.com/ML-GSAI/Revisiting-Dis-vs-Gen-Classifiers.
Submitted 29 May, 2023; v1 submitted 5 February, 2023;
originally announced February 2023.
-
Inferring changes to the global carbon cycle with WOMBAT v2.0, a hierarchical flux-inversion framework
Authors:
Michael Bertolacci,
Andrew Zammit-Mangion,
Andrew Schuh,
Beata Bukosa,
Jenny Fisher,
Yi Cao,
Aleya Kaushik,
Noel Cressie
Abstract:
The natural cycles of the surface-to-atmosphere fluxes of carbon dioxide (CO$_2$) and other important greenhouse gases are changing in response to human influences. These changes need to be quantified to understand climate change and its impacts, but this is difficult to do because natural fluxes occur over large spatial and temporal scales. To infer trends in fluxes and identify phase shifts and amplitude changes in flux seasonal cycles, we construct a flux-inversion system that uses a novel spatially varying time-series decomposition of the fluxes, while also accommodating physical constraints on the fluxes. We incorporate these features into the Wollongong Methodology for Bayesian Assimilation of Trace-gases (WOMBAT, Zammit-Mangion et al., Geosci. Model Dev., 15, 2022), a hierarchical flux-inversion framework that yields posterior distributions for all unknowns in the underlying model. We apply the new method, which we call WOMBAT v2.0, to a mix of satellite observations of CO$_2$ mole fraction from the Orbiting Carbon Observatory-2 (OCO-2) satellite and direct measurements of CO$_2$ mole fraction from a variety of sources. We estimate the changes to CO$_2$ fluxes that occurred from January 2015 to December 2020, and compare our posterior estimates to those from an alternative method based on a bottom-up understanding of the physical processes involved. We find substantial trends in the fluxes, including that tropical ecosystems trended from being a net source to a net sink of CO$_2$ over the study period. We also find that the amplitude of the global seasonal cycle of ecosystem CO$_2$ fluxes increased over the study period by 0.11 PgC/month (an increase of 8%), and that the seasonal cycle of ecosystem CO$_2$ fluxes in the northern temperate and northern boreal regions shifted earlier in the year by 0.4-0.7 and 0.4-0.9 days, respectively (2.5th to 97.5th posterior percentiles).
Submitted 19 October, 2022;
originally announced October 2022.
-
$k$-Means Clustering for Persistent Homology
Authors:
Yueqi Cao,
Prudence Leung,
Anthea Monod
Abstract:
Persistent homology is a methodology central to topological data analysis that extracts and summarizes the topological features within a dataset as a persistence diagram; it has recently gained much popularity from its myriad successful applications to many domains. However, its algebraic construction induces a metric space of persistence diagrams with a highly complex geometry. In this paper, we prove convergence of the $k$-means clustering algorithm on persistence diagram space and establish theoretical properties of the solution to the optimization problem in the Karush--Kuhn--Tucker framework. Additionally, we perform numerical experiments on various representations of persistent homology, including embeddings of persistence diagrams as well as diagrams themselves and their generalizations as persistence measures; we find that $k$-means clustering performed directly on persistence diagrams and measures outperforms clustering on their vectorized representations.
Submitted 25 November, 2023; v1 submitted 18 October, 2022;
originally announced October 2022.
-
Multiple Descent in the Multiple Random Feature Model
Authors:
Xuran Meng,
Jianfeng Yao,
Yuan Cao
Abstract:
Recent works have demonstrated a double descent phenomenon in over-parameterized learning. Although this phenomenon has been investigated by recent works, it has not been fully understood in theory. In this paper, we investigate the multiple descent phenomenon in a class of multi-component prediction models. We first consider a ''double random feature model'' (DRFM) concatenating two types of random features, and study the excess risk achieved by the DRFM in ridge regression. We calculate the precise limit of the excess risk under the high dimensional framework where the training sample size, the dimension of data, and the dimension of random features tend to infinity proportionally. Based on the calculation, we further theoretically demonstrate that the risk curves of DRFMs can exhibit triple descent. We then provide a thorough experimental study to verify our theory. Finally, we extend our study to the ''multiple random feature model'' (MRFM), and show that MRFMs ensembling $K$ types of random features may exhibit $(K+1)$-fold descent. Our analysis points out that risk curves with a specific number of descents generally exist in learning multi-component prediction models.
Submitted 10 October, 2023; v1 submitted 21 August, 2022;
originally announced August 2022.
-
A Geometric Condition for Uniqueness of Fréchet Means of Persistence Diagrams
Authors:
Yueqi Cao,
Anthea Monod
Abstract:
The Fréchet mean is an important statistical summary and measure of centrality of data; it has been defined and studied for persistent homology captured by persistence diagrams. However, the complicated geometry of the space of persistence diagrams implies that the Fréchet mean for a given set of persistence diagrams is not necessarily unique, which prohibits theoretical guarantees for empirical means with respect to population means. In this paper, we derive a variance expression for a set of persistence diagrams exhibiting a multi-matching between the persistence points known as a grouping. Moreover, we propose a condition for groupings, which we refer to as flatness; we prove that sets of persistence diagrams that exhibit flat groupings give rise to unique Fréchet means. We derive a finite sample convergence result for general groupings, which results in convergence for Fréchet means if the groupings are flat. We then interpret flat groupings in a recently-proposed general framework of Fréchet means in Alexandrov geometry. Finally, we show that for manifold-valued data, the persistence diagrams can be truncated to construct flat groupings.
Submitted 2 January, 2025; v1 submitted 8 July, 2022;
originally announced July 2022.
-
Learning Optimal Flows for Non-Equilibrium Importance Sampling
Authors:
Yu Cao,
Eric Vanden-Eijnden
Abstract:
Many applications in computational sciences and statistical inference require the computation of expectations with respect to complex high-dimensional distributions with unknown normalization constants, as well as the estimation of these constants. Here we develop a method to perform these calculations based on generating samples from a simple base distribution, transporting them by the flow generated by a velocity field, and performing averages along these flowlines. This non-equilibrium importance sampling (NEIS) strategy is straightforward to implement and can be used for calculations with arbitrary target distributions. On the theory side, we discuss how to tailor the velocity field to the target and establish general conditions under which the proposed estimator is a perfect estimator with zero variance. We also draw connections between NEIS and approaches based on mapping a base distribution onto a target via a transport map. On the computational side, we show how to use deep learning to represent the velocity field by a neural network and train it towards the zero-variance optimum. These results are illustrated numerically on benchmark examples (with dimension up to $10$), where after training the velocity field, the variance of the NEIS estimator is up to $6$ orders of magnitude smaller than that of a vanilla estimator. We also compare the performance of NEIS with that of Neal's annealed importance sampling (AIS).
Submitted 24 October, 2022; v1 submitted 20 June, 2022;
originally announced June 2022.
-
Privacy Protection for Youth Risk Behavior Using Bayesian Data Synthesis: A Case Study to the YRBS
Authors:
Yixiao Cao,
Jingchen Hu
Abstract:
The large number of publicly available survey datasets of wide variety, albeit useful, raises respondent-level privacy concerns. The synthetic data approach to data privacy and confidentiality has been shown to be useful in terms of privacy protection and utility preservation. This paper aims at illustrating how synthetic data can facilitate the dissemination of highly sensitive information about youth risk behavior by presenting a case study of synthetic data for a sample of the Youth Risk Behavior Survey (YRBS). Given the categorical nature of almost all variables in YRBS, the Dirichlet Process mixture of products of multinomials (DPMPM) synthesizer is adopted to partially synthesize the YRBS sample. Detailed evaluations of utility and disclosure risks demonstrate that the generated synthetic data are able to significantly reduce the disclosure risks compared to the confidential YRBS sample while maintaining a high level of utility.
Submitted 22 May, 2022;
originally announced May 2022.
-
Approximating Persistent Homology for Large Datasets
Authors:
Yueqi Cao,
Anthea Monod
Abstract:
Persistent homology is an important methodology from topological data analysis which adapts theory from algebraic topology to data settings and has been successfully implemented in many applications. It produces a statistical summary in the form of a persistence diagram, which captures the shape and size of the data. Despite its widespread use, persistent homology is simply impossible to implement when a dataset is very large. In this paper we address the problem of finding a representative persistence diagram for prohibitively large datasets. We adapt the classical statistical method of bootstrapping, namely, drawing and studying multiple smaller subsamples from the large dataset. We show that the mean of the persistence diagrams of subsamples -- taken as a mean persistence measure computed from the subsamples -- is a valid approximation of the true persistent homology of the larger dataset. We give the rate of convergence of the mean persistence diagram to the true persistence diagram in terms of the number of subsamples and size of each subsample. Given the complex algebraic and geometric nature of persistent homology, we adapt the convexity and stability properties in the space of persistence diagrams together with random set theory to achieve our theoretical results for the general setting of point cloud data. We demonstrate our approach on simulated and real data, including an application of shape clustering on complex large-scale point cloud data.
Submitted 18 May, 2022; v1 submitted 19 April, 2022;
originally announced April 2022.
-
Benign Overfitting in Two-layer Convolutional Neural Networks
Authors:
Yuan Cao,
Zixiang Chen,
Mikhail Belkin,
Quanquan Gu
Abstract:
Modern neural networks often have great expressive power and can be trained to overfit the training data, while still achieving a good test performance. This phenomenon is referred to as "benign overfitting". Recently, a line of work has emerged that studies "benign overfitting" from the theoretical perspective. However, these works are limited to linear models or kernel/random feature models, and there is still a lack of theoretical understanding about when and how benign overfitting occurs in neural networks. In this paper, we study the benign overfitting phenomenon in training a two-layer convolutional neural network (CNN). We show that when the signal-to-noise ratio satisfies a certain condition, a two-layer CNN trained by gradient descent can achieve arbitrarily small training and test loss. On the other hand, when this condition does not hold, overfitting becomes harmful and the obtained CNN can only achieve a constant-level test loss. These together demonstrate a sharp phase transition between benign overfitting and harmful overfitting, driven by the signal-to-noise ratio. To the best of our knowledge, this is the first work that precisely characterizes the conditions under which benign overfitting can occur in training convolutional neural networks.
Submitted 14 June, 2022; v1 submitted 14 February, 2022;
originally announced February 2022.
-
Benign Overfitting in Adversarially Robust Linear Classification
Authors:
Jinghui Chen,
Yuan Cao,
Quanquan Gu
Abstract:
"Benign overfitting", where classifiers memorize noisy training data yet still achieve a good generalization performance, has drawn great attention in the machine learning community. To explain this surprising phenomenon, a series of works have provided theoretical justification in over-parameterized linear regression, classification, and kernel methods. However, it is not clear if benign overfitting still occurs in the presence of adversarial examples, i.e., examples with tiny and intentional perturbations to fool the classifiers. In this paper, we show that benign overfitting indeed occurs in adversarial training, a principled approach to defend against adversarial examples. In detail, we prove the risk bounds of the adversarially trained linear classifier on the mixture of sub-Gaussian data under $\ell_p$ adversarial perturbations. Our result suggests that under moderate perturbations, adversarially trained linear classifiers can achieve the near-optimal standard and adversarial risks, despite overfitting the noisy training data. Numerical experiments validate our theoretical findings.
Submitted 30 December, 2021;
originally announced December 2021.
-
Understanding How Encoder-Decoder Architectures Attend
Authors:
Kyle Aitken,
Vinay V Ramasesh,
Yuan Cao,
Niru Maheswaranathan
Abstract:
Encoder-decoder networks with attention have proven to be a powerful way to solve many sequence-to-sequence tasks. In these networks, attention aligns encoder and decoder states and is often used for visualizing network behavior. However, the mechanisms used by networks to generate appropriate attention matrices are still mysterious. Moreover, how these mechanisms vary depending on the particular architecture used for the encoder and decoder (recurrent, feed-forward, etc.) is also not well understood. In this work, we investigate how encoder-decoder networks solve different sequence-to-sequence tasks. We introduce a way of decomposing hidden states over a sequence into temporal (independent of input) and input-driven (independent of sequence position) components. This reveals how attention matrices are formed: depending on the task requirements, networks rely more heavily on either the temporal or input-driven components. These findings hold across both recurrent and feed-forward architectures despite their differences in forming the temporal components. Overall, our results provide new insight into the inner workings of attention-based encoder-decoder networks.
Submitted 28 October, 2021;
originally announced October 2021.
-
Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization
Authors:
Difan Zou,
Yuan Cao,
Yuanzhi Li,
Quanquan Gu
Abstract:
Adaptive gradient methods such as Adam have gained increasing popularity in deep learning optimization. However, it has been observed that compared with (stochastic) gradient descent, Adam can converge to a different solution with a significantly worse test error in many deep learning applications such as image classification, even with a fine-tuned regularization. In this paper, we provide a theoretical explanation for this phenomenon: we show that in the nonconvex setting of learning over-parameterized two-layer convolutional neural networks starting from the same random initialization, for a class of data distributions (inspired by image data), Adam and gradient descent (GD) can converge to different global solutions of the training objective with provably different generalization errors, even with weight decay regularization. In contrast, we show that if the training objective is convex, and the weight decay regularization is employed, any optimization algorithm, including Adam and GD, will converge to the same solution if the training is successful. This suggests that the inferior generalization performance of Adam is fundamentally tied to the nonconvex landscape of deep learning optimization.
Submitted 25 August, 2021;
originally announced August 2021.
-
Multi-Class Classification from Single-Class Data with Confidences
Authors:
Yuzhou Cao,
Lei Feng,
Senlin Shu,
Yitian Xu,
Bo An,
Gang Niu,
Masashi Sugiyama
Abstract:
Can we learn a multi-class classifier from only data of a single class? We show that without any assumptions on the loss functions, models, and optimizers, we can successfully learn a multi-class classifier from only data of a single class with a rigorous consistency guarantee when confidences (i.e., the class-posterior probabilities for all the classes) are available. Specifically, we propose an empirical risk minimization framework that is loss-/model-/optimizer-independent. Instead of constructing a boundary between the given class and other classes, our method can conduct discriminative classification between all the classes even if no data from the other classes are provided. We further theoretically and experimentally show that our method can be Bayes-consistent with a simple modification even if the provided confidences are highly noisy. Then, we provide an extension of our method for the case where data from a subset of all the classes are available. Experimental results demonstrate the effectiveness of our methods.
Submitted 16 June, 2021;
originally announced June 2021.
-
Multi-sample estimation of centered log-ratio matrix in microbiome studies
Authors:
Yezheng Li,
Hongzhe Li,
Yuanpei Cao
Abstract:
In microbiome studies, one of the ways of studying bacterial abundances is to estimate bacterial composition based on the sequencing read counts. Various transformations are then applied to such compositional data for downstream statistical analysis, among which the centered log-ratio (clr) transformation is most commonly used.
Due to limited sequencing depth and DNA dropouts, many rare bacterial taxa might not be captured in the final sequencing reads, which results in many zero counts. Naive composition estimation using count normalization leads to many zero proportions, which makes the clr transformation infeasible. This paper proposes a multi-sample approach to estimating the clr matrix directly in order to borrow information across samples and across species. Empirical results from real datasets suggest that the clr matrix over multiple samples is approximately low rank, which motivates a regularized maximum likelihood estimation with a nuclear norm penalty. An efficient optimization algorithm using the generalized accelerated proximal gradient is developed. Theoretical upper bounds of the estimation errors and of its corresponding singular subspace errors are established. Simulation studies demonstrate that the proposed estimator outperforms the naive estimators. The method is applied to a gut microbiome dataset and the American Gut Project.
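As a rough illustration of why zero proportions break the clr transformation, the sketch below applies the standard definition, clr(x)_i = log x_i minus the mean of log x, to a strictly positive composition and to a naively normalized count vector containing a zero. The function name `clr` and the example numbers are hypothetical and are not taken from the paper; the paper's regularized estimator is not reproduced here.

```python
import numpy as np

def clr(composition):
    """Centered log-ratio transform of a single composition.

    Standard definition: log of each part minus the mean of the logs.
    It is only well defined when every proportion is strictly positive.
    """
    log_x = np.log(np.asarray(composition, dtype=float))
    return log_x - log_x.mean()

# Hypothetical strictly positive composition: the transform is finite
# and its entries sum to zero (up to floating-point error).
positive = np.array([0.6, 0.15, 0.05, 0.2])
positive_clr = clr(positive)

# Hypothetical read counts with one unobserved taxon. Naive count
# normalization yields a zero proportion, and log(0) = -inf makes
# every entry of the clr vector non-finite.
counts = np.array([120, 30, 0, 50])
naive = counts / counts.sum()
with np.errstate(divide="ignore", invalid="ignore"):
    naive_clr = clr(naive)

print(positive_clr)
print(naive_clr)
```

A single zero is enough to contaminate the whole vector, because the mean of the logs becomes negative infinity; this is the infeasibility that motivates estimating the clr matrix directly.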
Submitted 15 June, 2021;
originally announced June 2021.
-
Risk Bounds for Over-parameterized Maximum Margin Classification on Sub-Gaussian Mixtures
Authors:
Yuan Cao,
Quanquan Gu,
Mikhail Belkin
Abstract:
Modern machine learning systems such as deep neural networks are often highly over-parameterized so that they can fit the noisy training data exactly, yet they can still achieve small test errors in practice. In this paper, we study this "benign overfitting" phenomenon of the maximum margin classifier for linear classification problems. Specifically, we consider data generated from sub-Gaussian mixtures, and provide a tight risk bound for the maximum margin linear classifier in the over-parameterized setting. Our results precisely characterize the condition under which benign overfitting can occur in linear classification problems, and improve on previous work. They also have direct implications for over-parameterized logistic regression.
Submitted 2 January, 2022; v1 submitted 28 April, 2021;
originally announced April 2021.
-
Topological Information Retrieval with Dilation-Invariant Bottleneck Comparative Measures
Authors:
Yueqi Cao,
Athanasios Vlontzos,
Luca Schmidtke,
Bernhard Kainz,
Anthea Monod
Abstract:
Appropriately representing elements in a database so that queries may be accurately matched is a central task in information retrieval; recently, this has been achieved by embedding the graphical structure of the database into a manifold in a hierarchy-preserving manner using a variety of metrics. Persistent homology is a tool commonly used in topological data analysis that is able to rigorously characterize a database in terms of both its hierarchy and connectivity structure. Computing persistent homology on a variety of embedded datasets reveals that some commonly used embeddings fail to preserve the connectivity. We show that those embeddings which successfully retain the database topology coincide in persistent homology by introducing two dilation-invariant comparative measures to capture this effect: in particular, they address the issue of metric distortion on manifolds. We provide an algorithm for their computation that exhibits greatly reduced time complexity over existing methods. We use these measures to perform the first instance of topology-based information retrieval and demonstrate its increased performance over the standard bottleneck distance for persistent homology. We showcase our approach on databases of different data varieties including text, videos, and medical images.
Submitted 6 July, 2022; v1 submitted 4 April, 2021;
originally announced April 2021.