-
Global Convergence of Iteratively Reweighted Least Squares for Robust Subspace Recovery
Authors:
Gilad Lerman,
Kang Li,
Tyler Maunu,
Teng Zhang
Abstract:
Robust subspace estimation is fundamental to many machine learning and data analysis tasks. Iteratively Reweighted Least Squares (IRLS) is an elegant and empirically effective approach to this problem, yet its theoretical properties remain poorly understood. This paper establishes that, under deterministic conditions, a variant of IRLS with dynamic smoothing regularization converges linearly to th…
▽ More
Robust subspace estimation is fundamental to many machine learning and data analysis tasks. Iteratively Reweighted Least Squares (IRLS) is an elegant and empirically effective approach to this problem, yet its theoretical properties remain poorly understood. This paper establishes that, under deterministic conditions, a variant of IRLS with dynamic smoothing regularization converges linearly to the underlying subspace from any initialization. We extend these guarantees to affine subspace estimation, a setting that lacks prior recovery theory. Additionally, we illustrate the practical benefits of IRLS through an application to low-dimensional neural network training. Our results provide the first global convergence guarantees for IRLS in robust subspace recovery and, more broadly, for nonconvex IRLS on a Riemannian manifold.
△ Less
Submitted 29 June, 2025; v1 submitted 25 June, 2025;
originally announced June 2025.
-
Almost Linear Convergence under Minimal Score Assumptions: Quantized Transition Diffusion
Authors:
Xunpeng Huang,
Yingyu Lin,
Nikki Lijing Kuang,
Hanze Dong,
Difan Zou,
Yian Ma,
Tong Zhang
Abstract:
Continuous diffusion models have demonstrated remarkable performance in data generation across various domains, yet their efficiency remains constrained by two critical limitations: (1) the local adjacency structure of the forward Markov process, which restricts long-range transitions in the data space, and (2) inherent biases introduced during the simulation of time-inhomogeneous reverse denoisin…
▽ More
Continuous diffusion models have demonstrated remarkable performance in data generation across various domains, yet their efficiency remains constrained by two critical limitations: (1) the local adjacency structure of the forward Markov process, which restricts long-range transitions in the data space, and (2) inherent biases introduced during the simulation of time-inhomogeneous reverse denoising processes. To address these challenges, we propose Quantized Transition Diffusion (QTD), a novel approach that integrates data quantization with discrete diffusion dynamics. Our method first transforms the continuous data distribution $p_*$ into a discrete one $q_*$ via histogram approximation and binary encoding, enabling efficient representation in a structured discrete latent space. We then design a continuous-time Markov chain (CTMC) with Hamming distance-based transitions as the forward process, which inherently supports long-range movements in the original data space. For reverse-time sampling, we introduce a \textit{truncated uniformization} technique to simulate the reverse CTMC, which can provably provide unbiased generation from $q_*$ under minimal score assumptions. Through a novel KL dynamic analysis of the reverse CTMC, we prove that QTD can generate samples with $O(d\ln^2(d/ε))$ score evaluations in expectation to approximate the $d$--dimensional target distribution $p_*$ within an $ε$ error tolerance. Our method not only establishes state-of-the-art inference efficiency but also advances the theoretical foundations of diffusion-based generative modeling by unifying discrete and continuous diffusion paradigms.
△ Less
Submitted 27 May, 2025;
originally announced May 2025.
-
Cyclic-Shift Sparse Kronecker Tensor Classifier for Signal-Region Detection in Neuroimaging
Authors:
Hsin-Hsiung Huang,
Yuh-Haur Chen,
Teng Zhang
Abstract:
This study proposes a cyclic-shift logistic sparse Kronecker product decomposition (SKPD) model for high-dimensional tensor data, enhancing the SKPD framework with a cyclic-shift mechanism for binary classification. The method enables interpretable and scalable analysis of brain MRI data, detecting disease-relevant regions through a structured low-rank factorization. By incorporating a second spat…
▽ More
This study proposes a cyclic-shift logistic sparse Kronecker product decomposition (SKPD) model for high-dimensional tensor data, enhancing the SKPD framework with a cyclic-shift mechanism for binary classification. The method enables interpretable and scalable analysis of brain MRI data, detecting disease-relevant regions through a structured low-rank factorization. By incorporating a second spatially shifted view of the data, the cyclic-shift logistic SKPD improves robustness to misalignment across subjects, a common challenge in neuroimaging. We provide asymptotic consistency guarantees under a restricted isometry condition adapted to logistic loss. Simulations confirm the model's ability to recover spatial signals under noise and identify optimal patch sizes for factor decomposition. Application to OASIS-1 and ADNI-1 datasets demonstrates that the model achieves strong classification accuracy and localizes estimated coefficients in clinically relevant brain regions, such as the hippocampus. A data-driven slice selection strategy further improves interpretability in 2D projections. The proposed framework offers a principled, interpretable, and computationally efficient tool for neuroimaging-based disease diagnosis, with potential extensions to multi-class settings and more complex transformations.
△ Less
Submitted 17 May, 2025;
originally announced May 2025.
-
Data Envelopment Analysis with Robust and Closest Targets:Integrating Full-Dimensional Efficient Facets for Risk-Resilient Benchmarking
Authors:
Xiuquan Huang,
Xi Wang,
Tao Zhang,
Xiaocang Xu,
Ali Emrouznejad
Abstract:
As the external environment become increasingly volatile and unpredictable, the selection of benchmarking targets in data envelopment analysis should account for their ability to consider risks; however, this aspect has not received sufficient attention. We propose a robust benchmarking target defined by the intersection of the maximum number of full-dimensional efficient facets, each representing…
▽ More
As the external environment become increasingly volatile and unpredictable, the selection of benchmarking targets in data envelopment analysis should account for their ability to consider risks; however, this aspect has not received sufficient attention. We propose a robust benchmarking target defined by the intersection of the maximum number of full-dimensional efficient facets, each representing a unique marginal substitution relationship. These targets can serve as robust projections for decision making units that are lacking prior risk information because they incorporate the maximum number of marginal substitution relationships. This enables decision makers to adjust their production through these relationships, thereby maximizing the likelihood of achieving globally optimal outcomes. Furthermore, we propose a novel, well-defined efficiency measure based on robust and closest targets. Finally, we demonstrate the application of the proposed measure using a dataset comprising 38 universities from China's 985 Project.
△ Less
Submitted 18 June, 2025; v1 submitted 9 May, 2025;
originally announced May 2025.
-
Multi-Step Consistency Models: Fast Generation with Theoretical Guarantees
Authors:
Nishant Jain,
Xunpeng Huang,
Yian Ma,
Tong Zhang
Abstract:
Consistency models have recently emerged as a compelling alternative to traditional SDE-based diffusion models. They offer a significant acceleration in generation by producing high-quality samples in very few steps. Despite their empirical success, a proper theoretic justification for their speed-up is still lacking. In this work, we address the gap by providing a theoretical analysis of consiste…
▽ More
Consistency models have recently emerged as a compelling alternative to traditional SDE-based diffusion models. They offer a significant acceleration in generation by producing high-quality samples in very few steps. Despite their empirical success, a proper theoretic justification for their speed-up is still lacking. In this work, we address the gap by providing a theoretical analysis of consistency models capable of mapping inputs at a given time to arbitrary points along the reverse trajectory. We show that one can achieve a KL divergence of order $ O(\varepsilon^2) $ using only $ O\left(\log\left(\frac{d}{\varepsilon}\right)\right) $ iterations with a constant step size. Additionally, under minimal assumptions on the data distribution (non smooth case) an increasingly common setting in recent diffusion model analyses we show that a similar KL convergence guarantee can be obtained, with the number of steps scaling as $ O\left(d \log\left(\frac{d}{\varepsilon}\right)\right) $. Going further, we also provide a theoretical analysis for estimation of such consistency models, concluding that accurate learning is feasible using small discretization steps, both in smooth and non-smooth settings. Notably, our results for the non-smooth case yield best in class convergence rates compared to existing SDE or ODE based analyses under minimal assumptions.
△ Less
Submitted 25 May, 2025; v1 submitted 2 May, 2025;
originally announced May 2025.
-
Capturing Conditional Dependence via Auto-regressive Diffusion Models
Authors:
Xunpeng Huang,
Yujin Han,
Difan Zou,
Yian Ma,
Tong Zhang
Abstract:
Diffusion models have demonstrated appealing performance in both image and video generation. However, many works discover that they struggle to capture important, high-level relationships that are present in the real world. For example, they fail to learn physical laws from data, and even fail to understand that the objects in the world exist in a stable fashion. This is due to the fact that impor…
▽ More
Diffusion models have demonstrated appealing performance in both image and video generation. However, many works discover that they struggle to capture important, high-level relationships that are present in the real world. For example, they fail to learn physical laws from data, and even fail to understand that the objects in the world exist in a stable fashion. This is due to the fact that important conditional dependence structures are not adequately captured in the vanilla diffusion models. In this work, we initiate an in-depth study on strengthening the diffusion model to capture the conditional dependence structures in the data. In particular, we examine the efficacy of the auto-regressive (AR) diffusion models for such purpose and develop the first theoretical results on the sampling error of AR diffusion models under (possibly) the mildest data assumption. Our theoretical findings indicate that, compared with typical diffusion models, the AR variant produces samples with a reduced gap in approximating the data conditional distribution. On the other hand, the overall inference time of the AR-diffusion models is only moderately larger than that for the vanilla diffusion models, making them still practical for large scale applications. We also provide empirical results showing that when there is clear conditional dependence structure in the data, the AR diffusion models captures such structure, whereas vanilla DDPM fails to do so. On the other hand, when there is no obvious conditional dependence across patches of the data, AR diffusion does not outperform DDPM.
△ Less
Submitted 30 April, 2025;
originally announced April 2025.
-
Discrimination-free Insurance Pricing with Privatized Sensitive Attributes
Authors:
Tianhe Zhang,
Suhan Liu,
Peng Shi
Abstract:
Fairness has emerged as a critical consideration in the landscape of machine learning algorithms, particularly as AI continues to transform decision-making across societal domains. To ensure that these algorithms are free from bias and do not discriminate against individuals based on sensitive attributes such as gender and race, the field of algorithmic bias has introduced various fairness concept…
▽ More
Fairness has emerged as a critical consideration in the landscape of machine learning algorithms, particularly as AI continues to transform decision-making across societal domains. To ensure that these algorithms are free from bias and do not discriminate against individuals based on sensitive attributes such as gender and race, the field of algorithmic bias has introduced various fairness concepts, along with methodologies to achieve these notions in different contexts. Despite the rapid advancement, not all sectors have embraced these fairness principles to the same extent. One specific sector that merits attention in this regard is insurance. Within the realm of insurance pricing, fairness is defined through a distinct and specialized framework. Consequently, achieving fairness according to established notions does not automatically ensure fair pricing in insurance. In particular, regulators are increasingly emphasizing transparency in pricing algorithms and imposing constraints on insurance companies on the collection and utilization of sensitive consumer attributes. These factors present additional challenges in the implementation of fairness in pricing algorithms. To address these complexities and comply with regulatory demands, we propose an efficient method for constructing fair models that are tailored to the insurance domain, using only privatized sensitive attributes. Notably, our approach ensures statistical guarantees, does not require direct access to sensitive attributes, and adapts to varying transparency requirements, addressing regulatory demands while ensuring fairness in insurance pricing.
△ Less
Submitted 16 April, 2025;
originally announced April 2025.
-
A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce
Authors:
Wei Xiong,
Jiarui Yao,
Yuhui Xu,
Bo Pang,
Lei Wang,
Doyen Sahoo,
Junnan Li,
Nan Jiang,
Tong Zhang,
Caiming Xiong,
Hanze Dong
Abstract:
Reinforcement learning (RL) has become a prevailing approach for fine-tuning large language models (LLMs) on complex reasoning tasks. Among recent methods, GRPO stands out for its empirical success in training models such as DeepSeek-R1, yet the sources of its effectiveness remain poorly understood. In this work, we revisit GRPO from a reinforce-like algorithm perspective and analyze its core comp…
▽ More
Reinforcement learning (RL) has become a prevailing approach for fine-tuning large language models (LLMs) on complex reasoning tasks. Among recent methods, GRPO stands out for its empirical success in training models such as DeepSeek-R1, yet the sources of its effectiveness remain poorly understood. In this work, we revisit GRPO from a reinforce-like algorithm perspective and analyze its core components. Surprisingly, we find that a simple rejection sampling baseline, RAFT, which trains only on positively rewarded samples, yields competitive performance than GRPO and PPO. Our ablation studies reveal that GRPO's main advantage arises from discarding prompts with entirely incorrect responses, rather than from its reward normalization. Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters both entirely incorrect and entirely correct samples. Reinforce-Rej improves KL efficiency and stability, serving as a lightweight yet effective alternative to more complex RL algorithms. We advocate RAFT as a robust and interpretable baseline, and suggest that future advances should focus on more principled designs for incorporating negative samples, rather than relying on them indiscriminately. Our findings provide guidance for future work in reward-based LLM post-training.
△ Less
Submitted 12 June, 2025; v1 submitted 15 April, 2025;
originally announced April 2025.
-
Dynamic Topic Analysis in Academic Journals using Convex Non-negative Matrix Factorization Method
Authors:
Yang Yang,
Tong Zhang,
Jian Wu,
Lijie Su
Abstract:
With the rapid advancement of large language models, academic topic identification and topic evolution analysis are crucial for enhancing AI's understanding capabilities. Dynamic topic analysis provides a powerful approach to capturing and understanding the temporal evolution of topics in large-scale datasets. This paper presents a two-stage dynamic topic analysis framework that incorporates conve…
▽ More
With the rapid advancement of large language models, academic topic identification and topic evolution analysis are crucial for enhancing AI's understanding capabilities. Dynamic topic analysis provides a powerful approach to capturing and understanding the temporal evolution of topics in large-scale datasets. This paper presents a two-stage dynamic topic analysis framework that incorporates convex optimization to improve topic consistency, sparsity, and interpretability. In Stage 1, a two-layer non-negative matrix factorization (NMF) model is employed to extract annual topics and identify key terms. In Stage 2, a convex optimization algorithm refines the dynamic topic structure using the convex NMF (cNMF) model, further enhancing topic integration and stability. Applying the proposed method to IEEE journal abstracts from 2004 to 2022 effectively identifies and quantifies emerging research topics, such as COVID-19 and digital twins. By optimizing sparsity differences in the clustering feature space between traditional and emerging research topics, the framework provides deeper insights into topic evolution and ranking analysis. Moreover, the NMF-cNMF model demonstrates superior stability in topic consistency. At sparsity levels of 0.4, 0.6, and 0.9, the proposed approach improves topic ranking stability by 24.51%, 56.60%, and 36.93%, respectively. The source code (to be open after publication) is available at https://github.com/meetyangyang/CDNMF.
△ Less
Submitted 23 March, 2025;
originally announced April 2025.
-
Using large language models to produce literature reviews: Usages and systematic biases of microphysics parametrizations in 2699 publications
Authors:
Tianhang Zhang,
Shengnan Fu,
David M. Schultz,
Zhonghua Zheng
Abstract:
Large language models afford opportunities for using computers for intensive tasks, realizing research opportunities that have not been considered before. One such opportunity could be a systematic interrogation of the scientific literature. Here, we show how a large language model can be used to construct a literature review of 2699 publications associated with microphysics parametrizations in th…
▽ More
Large language models afford opportunities for using computers for intensive tasks, realizing research opportunities that have not been considered before. One such opportunity could be a systematic interrogation of the scientific literature. Here, we show how a large language model can be used to construct a literature review of 2699 publications associated with microphysics parametrizations in the Weather and Research Forecasting (WRF) model, with the goal of learning how they were used and their systematic biases, when simulating precipitation. The database was constructed of publications identified from Web of Science and Scopus searches. The large language model GPT-4 Turbo was used to extract information about model configurations and performance from the text of 2699 publications. Our results reveal the landscape of how nine of the most popular microphysics parameterizations have been used around the world: Lin, Ferrier, WRF Single-Moment, Goddard Cumulus Ensemble, Morrison, Thompson, and WRF Double-Moment. More studies used one-moment parameterizations before 2020 and two-moment parameterizations after 2020. Seven out of nine parameterizations tended to overestimate precipitation. However, systematic biases of parameterizations differed in various regions. Except simulations using the Lin, Ferrier, and Goddard parameterizations that tended to underestimate precipitation over almost all locations, the remaining six parameterizations tended to overestimate, particularly over China, southeast Asia, western United States, and central Africa. This method could be used by other researchers to help understand how the increasingly massive body of scientific literature can be harnessed through the power of artificial intelligence to solve their research problems.
△ Less
Submitted 27 March, 2025;
originally announced March 2025.
-
Neural Posterior Estimation for Cataloging Astronomical Images with Spatially Varying Backgrounds and Point Spread Functions
Authors:
Aakash Patel,
Tianqing Zhang,
Camille Avestruz,
Jeffrey Regier,
the LSST Dark Energy Science Collaboration
Abstract:
Neural posterior estimation (NPE), a type of amortized variational inference, is a computationally efficient means of constructing probabilistic catalogs of light sources from astronomical images. To date, NPE has not been used to perform inference in models with spatially varying covariates. However, ground-based astronomical images have spatially varying sky backgrounds and point spread function…
▽ More
Neural posterior estimation (NPE), a type of amortized variational inference, is a computationally efficient means of constructing probabilistic catalogs of light sources from astronomical images. To date, NPE has not been used to perform inference in models with spatially varying covariates. However, ground-based astronomical images have spatially varying sky backgrounds and point spread functions (PSFs), and accounting for this variation is essential for constructing accurate catalogs of imaged light sources. In this work, we introduce a method of performing NPE with spatially varying backgrounds and PSFs. In this method, we generate synthetic catalogs and semi-synthetic images for these catalogs using randomly sampled PSF and background estimates from existing surveys. Using this data, we train a neural network, which takes an astronomical image and representations of its background and PSF as input, to output a probabilistic catalog. Our experiments with Sloan Digital Sky Survey data demonstrate the effectiveness of NPE in the presence of spatially varying backgrounds and PSFs for light source detection, star/galaxy separation, and flux measurement.
△ Less
Submitted 28 February, 2025;
originally announced March 2025.
-
Reverse Markov Learning: Multi-Step Generative Models for Complex Distributions
Authors:
Xinwei Shen,
Nicolai Meinshausen,
Tong Zhang
Abstract:
Learning complex distributions is a fundamental challenge in contemporary applications. Generative models, such as diffusion models, have demonstrated remarkable success in overcoming many limitations of traditional statistical methods. Shen and Meinshausen (2024) introduced engression, a generative approach based on scoring rules that maps noise (and covariates, if available) directly to data. Wh…
▽ More
Learning complex distributions is a fundamental challenge in contemporary applications. Generative models, such as diffusion models, have demonstrated remarkable success in overcoming many limitations of traditional statistical methods. Shen and Meinshausen (2024) introduced engression, a generative approach based on scoring rules that maps noise (and covariates, if available) directly to data. While effective, engression struggles with highly complex distributions, such as those encountered in image data. In this work, we extend engression to improve its capability in learning complex distributions. We propose a framework that defines a general forward process transitioning from the target distribution to a known distribution (e.g., Gaussian) and then learns a reverse Markov process using multiple engression models. This reverse process reconstructs the target distribution step by step. Our approach supports general forward processes, allows for dimension reduction, and naturally discretizes the generative process. As a special case, when using a diffusion-based forward process, our framework offers a method to discretize the training and inference of diffusion models efficiently. Empirical evaluations on simulated and climate data validate our theoretical insights, demonstrating the effectiveness of our approach in capturing complex distributions.
△ Less
Submitted 19 February, 2025;
originally announced February 2025.
-
In-Context Learning of Linear Dynamical Systems with Transformers: Error Bounds and Depth-Separation
Authors:
Frank Cole,
Yulong Lu,
Tianhao Zhang,
Yuxuan Zhao
Abstract:
This paper investigates approximation-theoretic aspects of the in-context learning capability of the transformers in representing a family of noisy linear dynamical systems. Our first theoretical result establishes an upper bound on the approximation error of multi-layer transformers with respect to an $L^2$-testing loss uniformly defined across tasks. This result demonstrates that transformers wi…
▽ More
This paper investigates approximation-theoretic aspects of the in-context learning capability of the transformers in representing a family of noisy linear dynamical systems. Our first theoretical result establishes an upper bound on the approximation error of multi-layer transformers with respect to an $L^2$-testing loss uniformly defined across tasks. This result demonstrates that transformers with logarithmic depth can achieve error bounds comparable with those of the least-squares estimator. In contrast, our second result establishes a non-diminishing lower bound on the approximation error for a class of single-layer linear transformers, which suggests a depth-separation phenomenon for transformers in the in-context learning of dynamical systems. Moreover, this second result uncovers a critical distinction in the approximation power of single-layer linear transformers when learning from IID versus non-IID data.
△ Less
Submitted 13 February, 2025; v1 submitted 12 February, 2025;
originally announced February 2025.
-
Logarithmic Regret for Online KL-Regularized Reinforcement Learning
Authors:
Heyang Zhao,
Chenlu Ye,
Wei Xiong,
Quanquan Gu,
Tong Zhang
Abstract:
Recent advances in Reinforcement Learning from Human Feedback (RLHF) have shown that KL-regularization plays a pivotal role in improving the efficiency of RL fine-tuning for large language models (LLMs). Despite its empirical advantage, the theoretical difference between KL-regularized RL and standard RL remains largely under-explored. While there is a recent line of work on the theoretical analys…
▽ More
Recent advances in Reinforcement Learning from Human Feedback (RLHF) have shown that KL-regularization plays a pivotal role in improving the efficiency of RL fine-tuning for large language models (LLMs). Despite its empirical advantage, the theoretical difference between KL-regularized RL and standard RL remains largely under-explored. While there is a recent line of work on the theoretical analysis of KL-regularized objective in decision making \citep{xiong2024iterative, xie2024exploratory,zhao2024sharp}, these analyses either reduce to the traditional RL setting or rely on strong coverage assumptions. In this paper, we propose an optimism-based KL-regularized online contextual bandit algorithm, and provide a novel analysis of its regret. By carefully leveraging the benign optimization landscape induced by the KL-regularization and the optimistic reward estimation, our algorithm achieves an $\mathcal{O}\big(η\log (N_{\mathcal R} T)\cdot d_{\mathcal R}\big)$ logarithmic regret bound, where $η, N_{\mathcal R},T,d_{\mathcal R}$ denote the KL-regularization parameter, the cardinality of the reward function class, number of rounds, and the complexity of the reward function class. Furthermore, we extend our algorithm and analysis to reinforcement learning by developing a novel decomposition over transition steps and also obtain a similar logarithmic regret bound.
△ Less
Submitted 30 May, 2025; v1 submitted 11 February, 2025;
originally announced February 2025.
-
Towards a Sharp Analysis of Offline Policy Learning for $f$-Divergence-Regularized Contextual Bandits
Authors:
Qingyue Zhao,
Kaixuan Ji,
Heyang Zhao,
Tong Zhang,
Quanquan Gu
Abstract:
Although many popular reinforcement learning algorithms are underpinned by $f$-divergence regularization, their sample complexity with respect to the \emph{regularized objective} still lacks a tight characterization. In this paper, we analyze $f$-divergence-regularized offline policy learning. For reverse Kullback-Leibler (KL) divergence, arguably the most commonly used one, we give the first…
▽ More
Although many popular reinforcement learning algorithms are underpinned by $f$-divergence regularization, their sample complexity with respect to the \emph{regularized objective} still lacks a tight characterization. In this paper, we analyze $f$-divergence-regularized offline policy learning. For reverse Kullback-Leibler (KL) divergence, arguably the most commonly used one, we give the first $\tilde{O}(ε^{-1})$ sample complexity under single-policy concentrability for contextual bandits, surpassing existing $\tilde{O}(ε^{-1})$ bound under all-policy concentrability and $\tilde{O}(ε^{-2})$ bound under single-policy concentrability. Our analysis for general function approximation leverages the principle of pessimism in the face of uncertainty to refine a mean-value-type argument to its extreme. This in turn leads to a novel moment-based technique, effectively bypassing the need for uniform control over the discrepancy between any two functions in the function class. We further propose a lower bound, demonstrating that a multiplicative dependency on single-policy concentrability is necessary to maximally exploit the strong convexity of reverse KL. In addition, for $f$-divergences with strongly convex $f$, to which reverse KL \emph{does not} belong, we show that the sharp sample complexity $\tildeΘ(ε^{-1})$ is achievable even without single-policy concentrability. In this case, the algorithm design can get rid of pessimistic estimators. We further extend our analysis to dueling bandits, and we believe these results take a significant step toward a comprehensive understanding of $f$-divergence-regularized policy learning.
△ Less
Submitted 30 May, 2025; v1 submitted 9 February, 2025;
originally announced February 2025.
-
Catoni Contextual Bandits are Robust to Heavy-tailed Rewards
Authors:
Chenlu Ye,
Yujia Jin,
Alekh Agarwal,
Tong Zhang
Abstract:
Typical contextual bandit algorithms assume that the rewards at each round lie in some fixed range $[0, R]$, and their regret scales polynomially with this reward range $R$. However, many practical scenarios naturally involve heavy-tailed rewards or rewards where the worst-case range can be substantially larger than the variance. In this paper, we develop an algorithmic approach building on Catoni…
▽ More
Typical contextual bandit algorithms assume that the rewards at each round lie in some fixed range $[0, R]$, and their regret scales polynomially with this reward range $R$. However, many practical scenarios naturally involve heavy-tailed rewards or rewards where the worst-case range can be substantially larger than the variance. In this paper, we develop an algorithmic approach building on Catoni's estimator from robust statistics, and apply it to contextual bandits with general function approximation. When the variance of the reward at each round is known, we use a variance-weighted regression approach and establish a regret bound that depends only on the cumulative reward variance and logarithmically on the reward range $R$ as well as the number of rounds $T$. For the unknown-variance case, we further propose a careful peeling-based algorithm and remove the need for cumbersome variance estimation. With additional dependence on the fourth moment, our algorithm also enjoys a variance-based bound with logarithmic reward-range dependence. Moreover, we demonstrate the optimality of the leading-order term in our regret bound through a matching lower bound.
△ Less
Submitted 4 February, 2025;
originally announced February 2025.
-
On The Concurrence of Layer-wise Preconditioning Methods and Provable Feature Learning
Authors:
Thomas T. Zhang,
Behrad Moniri,
Ansh Nagwekar,
Faraz Rahman,
Anton Xue,
Hamed Hassani,
Nikolai Matni
Abstract:
Layer-wise preconditioning methods are a family of memory-efficient optimization algorithms that introduce preconditioners per axis of each layer's weight tensors. These methods have seen a recent resurgence, demonstrating impressive performance relative to entry-wise ("diagonal") preconditioning methods such as Adam(W) on a wide range of neural network optimization tasks. Complementary to their p…
▽ More
Layer-wise preconditioning methods are a family of memory-efficient optimization algorithms that introduce preconditioners per axis of each layer's weight tensors. These methods have seen a recent resurgence, demonstrating impressive performance relative to entry-wise ("diagonal") preconditioning methods such as Adam(W) on a wide range of neural network optimization tasks. Complementary to their practical performance, we demonstrate that layer-wise preconditioning methods are provably necessary from a statistical perspective. To showcase this, we consider two prototypical models, linear representation learning and single-index learning, which are widely used to study how typical algorithms efficiently learn useful features to enable generalization. In these problems, we show SGD is a suboptimal feature learner when extending beyond ideal isotropic inputs $\mathbf{x} \sim \mathsf{N}(\mathbf{0}, \mathbf{I})$ and well-conditioned settings typically assumed in prior work. We demonstrate theoretically and numerically that this suboptimality is fundamental, and that layer-wise preconditioning emerges naturally as the solution. We further show that standard tools like Adam preconditioning and batch-norm only mildly mitigate these issues, supporting the unique benefits of layer-wise preconditioning.
△ Less
Submitted 3 February, 2025;
originally announced February 2025.
-
Divergence-Augmented Policy Optimization
Authors:
Qing Wang,
Yingru Li,
Jiechao Xiong,
Tong Zhang
Abstract:
In deep reinforcement learning, policy optimization methods need to deal with issues such as function approximation and the reuse of off-policy data. Standard policy gradient methods do not handle off-policy data well, leading to premature convergence and instability. This paper introduces a method to stabilize policy optimization when off-policy data are reused. The idea is to include a Bregman d…
▽ More
In deep reinforcement learning, policy optimization methods need to deal with issues such as function approximation and the reuse of off-policy data. Standard policy gradient methods do not handle off-policy data well, leading to premature convergence and instability. This paper introduces a method to stabilize policy optimization when off-policy data are reused. The idea is to include a Bregman divergence between the behavior policy that generates the data and the current policy to ensure small and safe policy updates with off-policy data. The Bregman divergence is calculated between the state distributions of two policies, instead of only on the action probabilities, leading to a divergence augmentation formulation. Empirical experiments on Atari games show that in the data-scarce scenario where the reuse of off-policy data becomes necessary, our method can achieve better performance than other state-of-the-art deep reinforcement learning algorithms.
△ Less
Submitted 24 January, 2025;
originally announced January 2025.
-
Fréchet Sufficient Dimension Reduction for Metric Space-Valued Data via Distance Covariance
Authors:
Hsin-Hsiung Huang,
Feng Yu,
Kang Li,
Teng Zhang
Abstract:
We propose a novel Fréchet sufficient dimension reduction (SDR) method based on kernel distance covariance, tailored for metric space-valued responses such as count data, probability densities, and other complex structures. The method leverages a kernel-based transformation to map metric space-valued responses into a feature space, enabling efficient dimension reduction. By incorporating kernel di…
▽ More
We propose a novel Fréchet sufficient dimension reduction (SDR) method based on kernel distance covariance, tailored for metric space-valued responses such as count data, probability densities, and other complex structures. The method leverages a kernel-based transformation to map metric space-valued responses into a feature space, enabling efficient dimension reduction. By incorporating kernel distance covariance, the proposed approach offers enhanced flexibility and adaptability for datasets with diverse and non-Euclidean characteristics. The effectiveness of the method is demonstrated through synthetic simulations and several real-world applications. In all cases, the proposed method runs faster and consistently outperforms the existing Fréchet SDR approaches, demonstrating its broad applicability and robustness in addressing complex data challenges.
△ Less
Submitted 17 December, 2024;
originally announced December 2024.
-
Selection of Ultrahigh-Dimensional Interactions Using $L_0$ Penalty
Authors:
Tonglin Zhang
Abstract:
Selecting interactions from an ultrahigh-dimensional statistical model with $n$ observations and $p$ variables when $p\gg n$ is difficult because the number of candidates for interactions is $p(p-1)/2$ and a selected model should satisfy the strong hierarchical (SH) restriction. A new method called the SHL0 is proposed to overcome the difficulty. The objective function of the SHL0 method is compos…
▽ More
Selecting interactions from an ultrahigh-dimensional statistical model with $n$ observations and $p$ variables when $p\gg n$ is difficult because the number of candidates for interactions is $p(p-1)/2$ and a selected model should satisfy the strong hierarchical (SH) restriction. A new method called the SHL0 is proposed to overcome the difficulty. The objective function of the SHL0 method is composed of a loglikelihood function and an $L_0$ penalty. A well-known approach in theoretical computer science called local combinatorial optimization is used to optimize the objective function. We show that any local solution of the SHL0 is consistent and enjoys the oracle properties, implying that it is unnecessary to use a global solution in practice. Three additional advantages are: a tuning parameter is used to penalize the main effects and interactions; a closed-form expression can derive the tuning parameter; and the idea can be extended to arbitrary ultrahigh-dimensional statistical models. The proposed method is more flexible than the previous methods for selecting interactions. A simulation study of the research shows that the proposed SHL0 outperforms its competitors.
△ Less
Submitted 5 December, 2024;
originally announced December 2024.
-
Unsupervised Variable Selection for Ultrahigh-Dimensional Clustering Analysis
Authors:
Tonglin Zhang,
Huyunting Huang
Abstract:
Compared to supervised variable selection, the research on unsupervised variable selection is far behind. A forward partial-variable clustering full-variable loss (FPCFL) method is proposed for the corresponding challenges. An advantage is that the FPCFL method can distinguish active, redundant, and uninformative variables, which the previous methods cannot achieve. Theoretical and simulation stud…
▽ More
Compared to supervised variable selection, the research on unsupervised variable selection is far behind. A forward partial-variable clustering full-variable loss (FPCFL) method is proposed for the corresponding challenges. An advantage is that the FPCFL method can distinguish active, redundant, and uninformative variables, which the previous methods cannot achieve. Theoretical and simulation studies show that the performance of a clustering method using all the variables can be worse if many uninformative variables are involved. Better results are expected if the uninformative variables are excluded. The research addresses a previous concern about how variable selection affects the performance of clustering. Rather than many previous methods attempting to select all the relevant variables, the proposed method selects a subset that can induce an equally good result. This phenomenon does not appear in the supervised variable selection problems.
△ Less
Submitted 28 November, 2024;
originally announced November 2024.
-
Double Descent in Portfolio Optimization: Dance between Theoretical Sharpe Ratio and Estimation Accuracy
Authors:
Yonghe Lu,
Yanrong Yang,
Terry Zhang
Abstract:
We study the relationship between model complexity and out-of-sample performance in the context of mean-variance portfolio optimization. Representing model complexity by the number of assets, we find that the performance of low-dimensional models initially improves with complexity but then declines due to overfitting. As model complexity becomes sufficiently high, the performance improves with com…
▽ More
We study the relationship between model complexity and out-of-sample performance in the context of mean-variance portfolio optimization. Representing model complexity by the number of assets, we find that the performance of low-dimensional models initially improves with complexity but then declines due to overfitting. As model complexity becomes sufficiently high, the performance improves with complexity again, resulting in a double ascent Sharpe ratio curve similar to the double descent phenomenon observed in artificial intelligence. The underlying mechanisms involve an intricate interaction between the theoretical Sharpe ratio and estimation accuracy. In high-dimensional models, the theoretical Sharpe ratio approaches its upper limit, and the overfitting problem is reduced because there are more parameters than data restrictions, which allows us to choose well-behaved parameters based on inductive bias.
△ Less
Submitted 27 November, 2024;
originally announced November 2024.
-
Sharp Analysis for KL-Regularized Contextual Bandits and RLHF
Authors:
Heyang Zhao,
Chenlu Ye,
Quanquan Gu,
Tong Zhang
Abstract:
Reverse-Kullback-Leibler (KL) regularization has emerged to be a predominant technique used to enhance policy optimization in reinforcement learning (RL) and reinforcement learning from human feedback (RLHF), which forces the learned policy to stay close to a reference policy. While the effectiveness and necessity of KL-regularization have been empirically demonstrated in various practical scenari…
▽ More
Reverse-Kullback-Leibler (KL) regularization has emerged to be a predominant technique used to enhance policy optimization in reinforcement learning (RL) and reinforcement learning from human feedback (RLHF), which forces the learned policy to stay close to a reference policy. While the effectiveness and necessity of KL-regularization have been empirically demonstrated in various practical scenarios, current theoretical analysis of KL-regularized RLHF still obtains the same $\mathcal{O}(1 / ε^2)$ sample complexity as problems without KL-regularization. To understand the fundamental distinction between policy learning objectives with KL-regularization and ones without KL-regularization, we are the first to theoretically demonstrate the power of KL-regularization by providing a sharp analysis for KL-regularized contextual bandits and RLHF, revealing an $\mathcal{O}(1 / ε)$ sample complexity when $ε$ is sufficiently small.
We further explore the role of data coverage in contextual bandits and RLHF. While the coverage assumption is commonly employed in offline RLHF to link the samples from the reference policy to the optimal policy, often at the cost of a multiplicative dependence on the coverage coefficient, its impact on the sample complexity of online RLHF remains unclear. Previous theoretical analyses of online RLHF typically require explicit exploration and additional structural assumptions on the reward function class. In contrast, we show that with sufficient coverage from the reference policy, a simple two-stage mixed sampling strategy can achieve a sample complexity with only an additive dependence on the coverage coefficient. Our results provide a comprehensive understanding of the roles of KL-regularization and data coverage in RLHF, shedding light on the design of more efficient RLHF algorithms.
△ Less
Submitted 11 February, 2025; v1 submitted 7 November, 2024;
originally announced November 2024.
-
Unsupervised Machine Learning for Detecting and Locating Human-Made Objects in 3D Point Cloud
Authors:
Hong Zhao,
Huyunting Huang,
Tonglin Zhang,
Baijian Yang,
Jin Wei-Kocsis,
Songlin Fei
Abstract:
A 3D point cloud is an unstructured, sparse, and irregular dataset, typically collected by airborne LiDAR systems over a geological region. Laser pulses emitted from these systems reflect off objects both on and above the ground, resulting in a dataset containing the longitude, latitude, and elevation of each point, as well as information about the corresponding laser pulse strengths. A widely stu…
▽ More
A 3D point cloud is an unstructured, sparse, and irregular dataset, typically collected by airborne LiDAR systems over a geological region. Laser pulses emitted from these systems reflect off objects both on and above the ground, resulting in a dataset containing the longitude, latitude, and elevation of each point, as well as information about the corresponding laser pulse strengths. A widely studied research problem, addressed in many previous works, is ground filtering, which involves partitioning the points into ground and non-ground subsets. This research introduces a novel task: detecting and identifying human-made objects amidst natural tree structures. This task is performed on the subset of non-ground points derived from the ground filtering stage. Marked Point Fields (MPFs) are used as models well-suited to these tasks. The proposed methodology consists of three stages: ground filtering, local information extraction (LIE), and clustering. In the ground filtering stage, a statistical method called One-Sided Regression (OSR) is introduced, addressing the limitations of prior ground filtering methods on uneven terrains. In the LIE stage, unsupervised learning methods are lacking. To mitigate this, a kernel-based method for the Hessian matrix of the MPF is developed. In the clustering stage, the Gaussian Mixture Model (GMM) is applied to the results of the LIE stage to partition the non-ground points into trees and human-made objects. The underlying assumption is that LiDAR points from trees exhibit a three-dimensional distribution, while those from human-made objects follow a two-dimensional distribution. The Hessian matrix of the MPF effectively captures this distinction. Experimental results demonstrate that the proposed ground filtering method outperforms previous techniques, and the LIE method successfully distinguishes between points representing trees and human-made objects.
△ Less
Submitted 25 October, 2024;
originally announced October 2024.
-
Guarantees for Nonlinear Representation Learning: Non-identical Covariates, Dependent Data, Fewer Samples
Authors:
Thomas T. Zhang,
Bruce D. Lee,
Ingvar Ziemann,
George J. Pappas,
Nikolai Matni
Abstract:
A driving force behind the diverse applicability of modern machine learning is the ability to extract meaningful features across many sources. However, many practical domains involve data that are non-identically distributed across sources, and statistically dependent within its source, violating vital assumptions in existing theoretical studies. Toward addressing these issues, we establish statis…
▽ More
A driving force behind the diverse applicability of modern machine learning is the ability to extract meaningful features across many sources. However, many practical domains involve data that are non-identically distributed across sources, and statistically dependent within its source, violating vital assumptions in existing theoretical studies. Toward addressing these issues, we establish statistical guarantees for learning general $\textit{nonlinear}$ representations from multiple data sources that admit different input distributions and possibly dependent data. Specifically, we study the sample-complexity of learning $T+1$ functions $f_\star^{(t)} \circ g_\star$ from a function class $\mathcal F \times \mathcal G$, where $f_\star^{(t)}$ are task specific linear functions and $g_\star$ is a shared nonlinear representation. A representation $\hat g$ is estimated using $N$ samples from each of $T$ source tasks, and a fine-tuning function $\hat f^{(0)}$ is fit using $N'$ samples from a target task passed through $\hat g$. We show that when $N \gtrsim C_{\mathrm{dep}} (\mathrm{dim}(\mathcal F) + \mathrm{C}(\mathcal G)/T)$, the excess risk of $\hat f^{(0)} \circ \hat g$ on the target task decays as $ν_{\mathrm{div}} \big(\frac{\mathrm{dim}(\mathcal F)}{N'} + \frac{\mathrm{C}(\mathcal G)}{N T} \big)$, where $C_{\mathrm{dep}}$ denotes the effect of data dependency, $ν_{\mathrm{div}}$ denotes an (estimatable) measure of $\textit{task-diversity}$ between the source and target tasks, and $\mathrm C(\mathcal G)$ denotes the complexity of the representation class $\mathcal G$. In particular, our analysis reveals: as the number of tasks $T$ increases, both the sample requirement and risk bound converge to that of $r$-dimensional regression as if $g_\star$ had been given, and the effect of dependency only enters the sample requirement, leaving the risk bound matching the iid setting.
△ Less
Submitted 14 October, 2024;
originally announced October 2024.
-
Exact MLE for Generalized Linear Mixed Models
Authors:
Tonglin Zhang
Abstract:
Exact MLE for generalized linear mixed models (GLMMs) is a long-standing problem unsolved until today. The proposed research solves the problem. In this problem, the main difficulty is caused by intractable integrals in the likelihood function when the response does not follow normal and the prior distribution for the random effects is specified by normal. Previous methods use Laplace approximatio…
▽ More
Exact MLE for generalized linear mixed models (GLMMs) is a long-standing problem unsolved until today. The proposed research solves the problem. In this problem, the main difficulty is caused by intractable integrals in the likelihood function when the response does not follow normal and the prior distribution for the random effects is specified by normal. Previous methods use Laplace approximations or Monte Carol simulations to compute the MLE approximately. These methods cannot provide the exact MLEs of the parameters and the hyperparameters. The exact MLE problem remains unsolved until the proposed work. The idea is to construct a sequence of mathematical functions in the optimization procedure. Optimization of the mathematical functions can be numerically computed. The result can lead to the exact MLEs of the parameters and hyperparameters. Because computing the likelihood is unnecessary, the proposed method avoids the main difficulty caused by the intractable integrals in the likelihood function.
△ Less
Submitted 10 October, 2024;
originally announced October 2024.
-
In-Context Learning of Linear Systems: Generalization Theory and Applications to Operator Learning
Authors:
Frank Cole,
Yulong Lu,
Wuzhe Xu,
Tianhao Zhang
Abstract:
We study theoretical guarantees for solving linear systems in-context using a linear transformer architecture. For in-domain generalization, we provide neural scaling laws that bound the generalization error in terms of the number of tasks and sizes of samples used in training and inference. For out-of-domain generalization, we find that the behavior of trained transformers under task distribution…
▽ More
We study theoretical guarantees for solving linear systems in-context using a linear transformer architecture. For in-domain generalization, we provide neural scaling laws that bound the generalization error in terms of the number of tasks and sizes of samples used in training and inference. For out-of-domain generalization, we find that the behavior of trained transformers under task distribution shifts depends crucially on the distribution of the tasks seen during training. We introduce a novel notion of task diversity and show that it defines a necessary and sufficient condition for pre-trained transformers generalize under task distribution shifts. We also explore applications of learning linear systems in-context, such as to in-context operator learning for PDEs. Finally, we provide some numerical experiments to validate the established theory.
△ Less
Submitted 23 May, 2025; v1 submitted 18 September, 2024;
originally announced September 2024.
-
Generalized Matrix Factor Model
Authors:
Xinbing Kong,
Tong Zhang
Abstract:
This article introduces a nonlinear generalized matrix factor model (GMFM) that allows for mixed-type variables, extending the scope of linear matrix factor models (LMFM) that are so far limited to handling continuous variables. We introduce a novel augmented Lagrange multiplier method, equivalent to the constraint maximum likelihood estimation, and carefully tailored to be locally concave around…
▽ More
This article introduces a nonlinear generalized matrix factor model (GMFM) that allows for mixed-type variables, extending the scope of linear matrix factor models (LMFM) that are so far limited to handling continuous variables. We introduce a novel augmented Lagrange multiplier method, equivalent to the constraint maximum likelihood estimation, and carefully tailored to be locally concave around the true factor and loading parameters. This statistically guarantees the local convexity of the negative Hessian matrix around the true parameters of the factors and loadings, which is nontrivial in the matrix factor modeling and leads to feasible central limit theorems of the estimated factors and loadings. We also theoretically establish the convergence rates of the estimated factor and loading matrices for the GMFM under general conditions that allow for correlations across samples, rows, and columns. Moreover, we provide a model selection criterion to determine the numbers of row and column factors consistently. To numerically compute the constraint maximum likelihood estimator, we provide two algorithms: two-stage alternating maximization and minorization maximization. Extensive simulation studies demonstrate GMFM's superiority in handling discrete and mixed-type variables. An empirical data analysis of the company's operating performance shows that GMFM does clustering and reconstruction well in the presence of discontinuous entries in the data matrix.
△ Less
Submitted 16 September, 2024;
originally announced September 2024.
-
Exact Posterior Mean and Covariance for Generalized Linear Mixed Models
Authors:
Tonglin Zhang
Abstract:
A novel method is proposed for the exact posterior mean and covariance of the random effects given the response in a generalized linear mixed model (GLMM) when the response does not follow normal. The research solves a long-standing problem in Bayesian statistics when an intractable integral appears in the posterior distribution. It is well-known that the posterior distribution of the random effec…
▽ More
A novel method is proposed for the exact posterior mean and covariance of the random effects given the response in a generalized linear mixed model (GLMM) when the response does not follow normal. The research solves a long-standing problem in Bayesian statistics when an intractable integral appears in the posterior distribution. It is well-known that the posterior distribution of the random effects given the response in a GLMM when the response does not follow normal contains intractable integrals. Previous methods rely on Monte Carlo simulations for the posterior distributions. They do not provide the exact posterior mean and covariance of the random effects given the response. The special integral computation (SIC) method is proposed to overcome the difficulty. The SIC method does not use the posterior distribution in the computation. It devises an optimization problem to reach the task. An advantage is that the computation of the posterior distribution is unnecessary. The proposed SIC avoids the main difficulty in Bayesian analysis when intractable integrals appear in the posterior distribution.
△ Less
Submitted 14 September, 2024;
originally announced September 2024.
-
Building Math Agents with Multi-Turn Iterative Preference Learning
Authors:
Wei Xiong,
Chengshuai Shi,
Jiaming Shen,
Aviv Rosenberg,
Zhen Qin,
Daniele Calandriello,
Misha Khalman,
Rishabh Joshi,
Bilal Piot,
Mohammad Saleh,
Chi Jin,
Tong Zhang,
Tianqi Liu
Abstract:
Recent studies have shown that large language models' (LLMs) mathematical problem-solving capabilities can be enhanced by integrating external tools, such as code interpreters, and employing multi-turn Chain-of-Thought (CoT) reasoning. While current methods focus on synthetic data generation and Supervised Fine-Tuning (SFT), this paper studies the complementary direct preference learning approach…
▽ More
Recent studies have shown that large language models' (LLMs) mathematical problem-solving capabilities can be enhanced by integrating external tools, such as code interpreters, and employing multi-turn Chain-of-Thought (CoT) reasoning. While current methods focus on synthetic data generation and Supervised Fine-Tuning (SFT), this paper studies the complementary direct preference learning approach to further improve model performance. However, existing direct preference learning algorithms are originally designed for the single-turn chat task, and do not fully address the complexities of multi-turn reasoning and external tool integration required for tool-integrated mathematical reasoning tasks. To fill in this gap, we introduce a multi-turn direct preference learning framework, tailored for this context, that leverages feedback from code interpreters and optimizes trajectory-level preferences. This framework includes multi-turn DPO and multi-turn KTO as specific implementations. The effectiveness of our framework is validated through training of various language models using an augmented prompt set from the GSM8K and MATH datasets. Our results demonstrate substantial improvements: a supervised fine-tuned Gemma-1.1-it-7B model's performance increased from 77.5% to 83.9% on GSM8K and from 46.1% to 51.2% on MATH. Similarly, a Gemma-2-it-9B model improved from 84.1% to 86.3% on GSM8K and from 51.0% to 54.5% on MATH.
△ Less
Submitted 27 February, 2025; v1 submitted 3 September, 2024;
originally announced September 2024.
-
Winners with Confidence: Discrete Argmin Inference with an Application to Model Selection
Authors:
Tianyu Zhang,
Hao Lee,
Jing Lei
Abstract:
We study the problem of finding the index of the minimum value of a vector from noisy observations. This problem is relevant in population/policy comparison, discrete maximum likelihood, and model selection. We develop an asymptotically normal test statistic, even in high-dimensional settings and with potentially many ties in the population mean vector, by integrating concepts and tools from cross…
▽ More
We study the problem of finding the index of the minimum value of a vector from noisy observations. This problem is relevant in population/policy comparison, discrete maximum likelihood, and model selection. We develop an asymptotically normal test statistic, even in high-dimensional settings and with potentially many ties in the population mean vector, by integrating concepts and tools from cross-validation and differential privacy. The key technical ingredient is a central limit theorem for globally dependent data. We also propose practical ways to select the tuning parameter that adapts to the signal landscape. Numerical experiments and data examples demonstrate the ability of the proposed method to achieve a favorable bias-variance trade-off in practical scenarios.
△ Less
Submitted 4 December, 2024; v1 submitted 4 August, 2024;
originally announced August 2024.
-
Practical Marketplace Optimization at Uber Using Causally-Informed Machine Learning
Authors:
Bobby Chen,
Siyu Chen,
Jason Dowlatabadi,
Yu Xuan Hong,
Vinayak Iyer,
Uday Mantripragada,
Rishabh Narang,
Apoorv Pandey,
Zijun Qin,
Abrar Sheikh,
Hongtao Sun,
Jiaqi Sun,
Matthew Walker,
Kaichen Wei,
Chen Xu,
Jingnan Yang,
Allen T. Zhang,
Guoqing Zhang
Abstract:
Budget allocation of marketplace levers, such as incentives for drivers and promotions for riders, has long been a technical and business challenge at Uber; understanding lever budget changes' impact and estimating cost efficiency to achieve predefined budgets is crucial, with the goal of optimal allocations that maximize business value; we introduce an end-to-end machine learning and optimization…
▽ More
Budget allocation of marketplace levers, such as incentives for drivers and promotions for riders, has long been a technical and business challenge at Uber; understanding lever budget changes' impact and estimating cost efficiency to achieve predefined budgets is crucial, with the goal of optimal allocations that maximize business value; we introduce an end-to-end machine learning and optimization procedure to automate budget decision-making for cities, relying on feature store, model training and serving, optimizers, and backtesting; proposing state-of-the-art deep learning (DL) estimator based on S-Learner and a novel tensor B-Spline regression model, we solve high-dimensional optimization with ADMM and primal-dual interior point convex optimization, substantially improving Uber's resource allocation efficiency.
△ Less
Submitted 26 July, 2024;
originally announced July 2024.
-
Traversing Pareto Optimal Policies: Provably Efficient Multi-Objective Reinforcement Learning
Authors:
Shuang Qiu,
Dake Zhang,
Rui Yang,
Boxiang Lyu,
Tong Zhang
Abstract:
This paper investigates multi-objective reinforcement learning (MORL), which focuses on learning Pareto optimal policies in the presence of multiple reward functions. Despite MORL's significant empirical success, there is still a lack of satisfactory understanding of various MORL optimization targets and efficient learning algorithms. Our work offers a systematic analysis of several optimization t…
▽ More
This paper investigates multi-objective reinforcement learning (MORL), which focuses on learning Pareto optimal policies in the presence of multiple reward functions. Despite MORL's significant empirical success, there is still a lack of satisfactory understanding of various MORL optimization targets and efficient learning algorithms. Our work offers a systematic analysis of several optimization targets to assess their abilities to find all Pareto optimal policies and controllability over learned policies by the preferences for different objectives. We then identify Tchebycheff scalarization as a favorable scalarization method for MORL. Considering the non-smoothness of Tchebycheff scalarization, we reformulate its minimization problem into a new min-max-max optimization problem. Then, for the stochastic policy class, we propose efficient algorithms using this reformulation to learn Pareto optimal policies. We first propose an online UCB-based algorithm to achieve an $\varepsilon$ learning error with an $\tilde{\mathcal{O}}(\varepsilon^{-2})$ sample complexity for a single given preference. To further reduce the cost of environment exploration under different preferences, we propose a preference-free framework that first explores the environment without pre-defined preferences and then generates solutions for any number of preferences. We prove that it only requires an $\tilde{\mathcal{O}}(\varepsilon^{-2})$ exploration complexity in the exploration phase and demands no additional exploration afterward. Lastly, we analyze the smooth Tchebycheff scalarization, an extension of Tchebycheff scalarization, which is proved to be more advantageous in distinguishing the Pareto optimal policies from other weakly Pareto optimal policies based on entry values of preference vectors. Furthermore, we extend our algorithms and theoretical analysis to accommodate this optimization target.
△ Less
Submitted 24 July, 2024;
originally announced July 2024.
-
Pessimism Meets Risk: Risk-Sensitive Offline Reinforcement Learning
Authors:
Dake Zhang,
Boxiang Lyu,
Shuang Qiu,
Mladen Kolar,
Tong Zhang
Abstract:
We study risk-sensitive reinforcement learning (RL), a crucial field due to its ability to enhance decision-making in scenarios where it is essential to manage uncertainty and minimize potential adverse outcomes. Particularly, our work focuses on applying the entropic risk measure to RL problems. While existing literature primarily investigates the online setting, there remains a large gap in unde…
▽ More
We study risk-sensitive reinforcement learning (RL), a crucial field due to its ability to enhance decision-making in scenarios where it is essential to manage uncertainty and minimize potential adverse outcomes. Particularly, our work focuses on applying the entropic risk measure to RL problems. While existing literature primarily investigates the online setting, there remains a large gap in understanding how to efficiently derive a near-optimal policy based on this risk measure using only a pre-collected dataset. We center on the linear Markov Decision Process (MDP) setting, a well-regarded theoretical framework that has yet to be examined from a risk-sensitive standpoint. In response, we introduce two provably sample-efficient algorithms. We begin by presenting a risk-sensitive pessimistic value iteration algorithm, offering a tight analysis by leveraging the structure of the risk-sensitive performance measure. To further improve the obtained bounds, we propose another pessimistic algorithm that utilizes variance information and reference-advantage decomposition, effectively improving both the dependence on the space dimension $d$ and the risk-sensitivity factor. To the best of our knowledge, we obtain the first provably efficient risk-sensitive offline RL algorithms.
△ Less
Submitted 10 July, 2024;
originally announced July 2024.
-
Aggregated Sure Independence Screening for Variable Selection with Interaction Structures
Authors:
Tonglin Zhang
Abstract:
A new method called the aggregated sure independence screening is proposed for the computational challenges in variable selection of interactions when the number of explanatory variables is much higher than the number of observations (i.e., $p\gg n$). In this problem, the two main challenges are the strong hierarchical restriction and the number of candidates for the main effects and interactions.…
▽ More
A new method called the aggregated sure independence screening is proposed for the computational challenges in variable selection of interactions when the number of explanatory variables is much higher than the number of observations (i.e., $p\gg n$). In this problem, the two main challenges are the strong hierarchical restriction and the number of candidates for the main effects and interactions. If $n$ is a few hundred and $p$ is ten thousand, then the memory needed for the augmented matrix of the full model is more than $100{\rm GB}$ in size, beyond the memory capacity of a personal computer. This issue can be solved by our proposed method but not by our competitors. Two advantages are that the proposed method can include important interactions even if the related main effects are weak or absent, and it can be combined with an arbitrary variable selection method for interactions. The research addresses the main concern for variable selection of interactions because it makes previous methods applicable to the case when $p$ is extremely large.
△ Less
Submitted 3 July, 2024;
originally announced July 2024.
-
Convolutional Unscented Kalman Filter for Multi-Object Tracking with Outliers
Authors:
Shiqi Liu,
Wenhan Cao,
Chang Liu,
Tianyi Zhang,
Shengbo Eben Li
Abstract:
Multi-object tracking (MOT) is an essential technique for navigation in autonomous driving. In tracking-by-detection systems, biases, false positives, and misses, which are referred to as outliers, are inevitable due to complex traffic scenarios. Recent tracking methods are based on filtering algorithms that overlook these outliers, leading to reduced tracking accuracy or even loss of the objects…
▽ More
Multi-object tracking (MOT) is an essential technique for navigation in autonomous driving. In tracking-by-detection systems, biases, false positives, and misses, which are referred to as outliers, are inevitable due to complex traffic scenarios. Recent tracking methods are based on filtering algorithms that overlook these outliers, leading to reduced tracking accuracy or even loss of the objects trajectory. To handle this challenge, we adopt a probabilistic perspective, regarding the generation of outliers as misspecification between the actual distribution of measurement data and the nominal measurement model used for filtering. We further demonstrate that, by designing a convolutional operation, we can mitigate this misspecification. Incorporating this operation into the widely used unscented Kalman filter (UKF) in commonly adopted tracking algorithms, we derive a variant of the UKF that is robust to outliers, called the convolutional UKF (ConvUKF). We show that ConvUKF maintains the Gaussian conjugate property, thus allowing for real-time tracking. We also prove that ConvUKF has a bounded tracking error in the presence of outliers, which implies robust stability. The experimental results on the KITTI and nuScenes datasets show improved accuracy compared to representative baseline algorithms for MOT tasks.
△ Less
Submitted 15 September, 2024; v1 submitted 3 June, 2024;
originally announced June 2024.
-
Analysis of Broken Randomized Experiments by Principal Stratification
Authors:
Qinqing Liu,
Xiang Peng,
Tao Zhang,
Yuhao Deng
Abstract:
Although randomized controlled trials have long been regarded as the ``gold standard'' for evaluating treatment effects, there is no natural prevention from post-treatment events. For example, non-compliance makes the actual treatment different from the assigned treatment, truncation-by-death renders the outcome undefined or ill-defined, and missingness prevents the outcomes from being measured. I…
▽ More
Although randomized controlled trials have long been regarded as the ``gold standard'' for evaluating treatment effects, there is no natural prevention from post-treatment events. For example, non-compliance makes the actual treatment different from the assigned treatment, truncation-by-death renders the outcome undefined or ill-defined, and missingness prevents the outcomes from being measured. In this paper, we develop a statistical analysis framework using principal stratification to investigate the treatment effect in broken randomized experiments. The average treatment effect in compliers and always-survivors is adopted as the target causal estimand. We establish the asymptotic property for the estimator. To relax the identification assumptions, we also propose an interventionist estimand defined in compliers by adjusting for baseline covariates. We apply the framework to study the effect of training on earnings in the Job Corps study and find that the training program improves employment and earnings in the long term.
△ Less
Submitted 23 April, 2025; v1 submitted 26 May, 2024;
originally announced May 2024.
-
Faster Sampling via Stochastic Gradient Proximal Sampler
Authors:
Xunpeng Huang,
Difan Zou,
Yi-An Ma,
Hanze Dong,
Tong Zhang
Abstract:
Stochastic gradients have been widely integrated into Langevin-based methods to improve their scalability and efficiency in solving large-scale sampling problems. However, the proximal sampler, which exhibits much faster convergence than Langevin-based algorithms in the deterministic setting Lee et al. (2021), has yet to be explored in its stochastic variants. In this paper, we study the Stochasti…
▽ More
Stochastic gradients have been widely integrated into Langevin-based methods to improve their scalability and efficiency in solving large-scale sampling problems. However, the proximal sampler, which exhibits much faster convergence than Langevin-based algorithms in the deterministic setting Lee et al. (2021), has yet to be explored in its stochastic variants. In this paper, we study the Stochastic Proximal Samplers (SPS) for sampling from non-log-concave distributions. We first establish a general framework for implementing stochastic proximal samplers and establish the convergence theory accordingly. We show that the convergence to the target distribution can be guaranteed as long as the second moment of the algorithm trajectory is bounded and restricted Gaussian oracles can be well approximated. We then provide two implementable variants based on Stochastic gradient Langevin dynamics (SGLD) and Metropolis-adjusted Langevin algorithm (MALA), giving rise to SPS-SGLD and SPS-MALA. We further show that SPS-SGLD and SPS-MALA can achieve $ε$-sampling error in total variation (TV) distance within $\tilde{\mathcal{O}}(dε^{-2})$ and $\tilde{\mathcal{O}}(d^{1/2}ε^{-2})$ gradient complexities, which outperform the best-known result by at least an $\tilde{\mathcal{O}}(d^{1/3})$ factor. This enhancement in performance is corroborated by our empirical studies on synthetic data with various dimensions, demonstrating the efficiency of our proposed algorithm.
△ Less
Submitted 26 May, 2024;
originally announced May 2024.
-
Reverse Transition Kernel: A Flexible Framework to Accelerate Diffusion Inference
Authors:
Xunpeng Huang,
Difan Zou,
Hanze Dong,
Yi Zhang,
Yi-An Ma,
Tong Zhang
Abstract:
To generate data from trained diffusion models, most inference algorithms, such as DDPM, DDIM, and other variants, rely on discretizing the reverse SDEs or their equivalent ODEs. In this paper, we view such approaches as decomposing the entire denoising diffusion process into several segments, each corresponding to a reverse transition kernel (RTK) sampling subproblem. Specifically, DDPM uses a Ga…
▽ More
To generate data from trained diffusion models, most inference algorithms, such as DDPM, DDIM, and other variants, rely on discretizing the reverse SDEs or their equivalent ODEs. In this paper, we view such approaches as decomposing the entire denoising diffusion process into several segments, each corresponding to a reverse transition kernel (RTK) sampling subproblem. Specifically, DDPM uses a Gaussian approximation for the RTK, resulting in low per-subproblem complexity but requiring a large number of segments (i.e., subproblems), which is conjectured to be inefficient. To address this, we develop a general RTK framework that enables a more balanced subproblem decomposition, resulting in $\tilde O(1)$ subproblems, each with strongly log-concave targets. We then propose leveraging two fast sampling algorithms, the Metropolis-Adjusted Langevin Algorithm (MALA) and Underdamped Langevin Dynamics (ULD), for solving these strongly log-concave subproblems. This gives rise to the RTK-MALA and RTK-ULD algorithms for diffusion inference. In theory, we further develop the convergence guarantees for RTK-MALA and RTK-ULD in total variation (TV) distance: RTK-ULD can achieve $ε$ target error within $\tilde{\mathcal O}(d^{1/2}ε^{-1})$ under mild conditions, and RTK-MALA enjoys a $\mathcal{O}(d^{2}\log(d/ε))$ convergence rate under slightly stricter conditions. These theoretical results surpass the state-of-the-art convergence rates for diffusion inference and are well supported by numerical experiments.
△ Less
Submitted 25 May, 2024;
originally announced May 2024.
-
RLHF Workflow: From Reward Modeling to Online RLHF
Authors:
Hanze Dong,
Wei Xiong,
Bo Pang,
Haoxiang Wang,
Han Zhao,
Yingbo Zhou,
Nan Jiang,
Doyen Sahoo,
Caiming Xiong,
Tong Zhang
Abstract:
We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report, which is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature. However, existing open-source RLHF projects are still largely confined to the offline learning setting. In this technical report, we aim to fill i…
▽ More
We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report, which is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature. However, existing open-source RLHF projects are still largely confined to the offline learning setting. In this technical report, we aim to fill in this gap and provide a detailed recipe that is easy to reproduce for online iterative RLHF. In particular, since online human feedback is usually infeasible for open-source communities with limited resources, we start by constructing preference models using a diverse set of open-source datasets and use the constructed proxy preference model to approximate human feedback. Then, we discuss the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed practical implementation. Our trained LLM achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as other academic benchmarks such as HumanEval and TruthfulQA. We have shown that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets. Further, we have made our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available. Please refer to https://github.com/RLHFlow/RLHF-Reward-Modeling and https://github.com/RLHFlow/Online-RLHF for more detailed information.
△ Less
Submitted 12 November, 2024; v1 submitted 13 May, 2024;
originally announced May 2024.
-
Efficient and Adaptive Posterior Sampling Algorithms for Bandits
Authors:
Bingshan Hu,
Zhiming Huang,
Tianyue H. Zhang,
Mathias Lécuyer,
Nidhi Hegde
Abstract:
We study Thompson Sampling-based algorithms for stochastic bandits with bounded rewards. As the existing problem-dependent regret bound for Thompson Sampling with Gaussian priors [Agrawal and Goyal, 2017] is vacuous when $T \le 288 e^{64}$, we derive a more practical bound that tightens the coefficient of the leading term %from $288 e^{64}$ to $1270$. Additionally, motivated by large-scale real-wo…
▽ More
We study Thompson Sampling-based algorithms for stochastic bandits with bounded rewards. As the existing problem-dependent regret bound for Thompson Sampling with Gaussian priors [Agrawal and Goyal, 2017] is vacuous when $T \le 288 e^{64}$, we derive a more practical bound that tightens the coefficient of the leading term %from $288 e^{64}$ to $1270$. Additionally, motivated by large-scale real-world applications that require scalability, adaptive computational resource allocation, and a balance in utility and computation, we propose two parameterized Thompson Sampling-based algorithms: Thompson Sampling with Model Aggregation (TS-MA-$α$) and Thompson Sampling with Timestamp Duelling (TS-TD-$α$), where $α\in [0,1]$ controls the trade-off between utility and computation. Both algorithms achieve $O \left(K\ln^{α+1}(T)/Δ\right)$ regret bound, where $K$ is the number of arms, $T$ is the finite learning horizon, and $Δ$ denotes the single round performance loss when pulling a sub-optimal arm.
△ Less
Submitted 2 May, 2024;
originally announced May 2024.
-
Distributionally Robust Reinforcement Learning with Interactive Data Collection: Fundamental Hardness and Near-Optimal Algorithm
Authors:
Miao Lu,
Han Zhong,
Tong Zhang,
Jose Blanchet
Abstract:
The sim-to-real gap, which represents the disparity between training and testing environments, poses a significant challenge in reinforcement learning (RL). A promising approach to addressing this challenge is distributionally robust RL, often framed as a robust Markov decision process (RMDP). In this framework, the objective is to find a robust policy that achieves good performance under the wors…
▽ More
The sim-to-real gap, which represents the disparity between training and testing environments, poses a significant challenge in reinforcement learning (RL). A promising approach to addressing this challenge is distributionally robust RL, often framed as a robust Markov decision process (RMDP). In this framework, the objective is to find a robust policy that achieves good performance under the worst-case scenario among all environments within a pre-specified uncertainty set centered around the training environment. Unlike previous work, which relies on a generative model or a pre-collected offline dataset enjoying good coverage of the deployment environment, we tackle robust RL via interactive data collection, where the learner interacts with the training environment only and refines the policy through trial and error. In this robust RL paradigm, two main challenges emerge: managing distributional robustness while striking a balance between exploration and exploitation during data collection. Initially, we establish that sample-efficient learning without additional assumptions is unattainable owing to the curse of support shift; i.e., the potential disjointedness of the distributional supports between the training and testing environments. To circumvent such a hardness result, we introduce the vanishing minimal value assumption to RMDPs with a total-variation (TV) distance robust set, postulating that the minimal value of the optimal robust value function is zero. We prove that such an assumption effectively eliminates the support shift issue for RMDPs with a TV distance robust set, and present an algorithm with a provable sample complexity guarantee. Our work makes the initial step to uncovering the inherent difficulty of robust RL via interactive data collection and sufficient conditions for designing a sample-efficient algorithm accompanied by sharp sample complexity analysis.
△ Less
Submitted 4 November, 2024; v1 submitted 4 April, 2024;
originally announced April 2024.
-
Theoretical Guarantees for the Subspace-Constrained Tyler's Estimator
Authors:
Gilad Lerman,
Feng Yu,
Teng Zhang
Abstract:
This work analyzes the subspace-constrained Tyler's estimator (STE) designed for recovering a low-dimensional subspace within a dataset that may be highly corrupted with outliers. It assumes a weak inlier-outlier model and allows the fraction of inliers to be smaller than a fraction that leads to computational hardness of the robust subspace recovery problem. It shows that in this setting, if the…
▽ More
This work analyzes the subspace-constrained Tyler's estimator (STE) designed for recovering a low-dimensional subspace within a dataset that may be highly corrupted with outliers. It assumes a weak inlier-outlier model and allows the fraction of inliers to be smaller than a fraction that leads to computational hardness of the robust subspace recovery problem. It shows that in this setting, if the initialization of STE, which is an iterative algorithm, satisfies a certain condition, then STE can effectively recover the underlying subspace. It further shows that under the generalized haystack model, STE initialized by the Tyler's M-estimator (TME), can recover the subspace when the fraction of iniliers is too small for TME to handle.
△ Less
Submitted 12 April, 2024; v1 submitted 27 March, 2024;
originally announced March 2024.
-
On the Benefits of Over-parameterization for Out-of-Distribution Generalization
Authors:
Yifan Hao,
Yong Lin,
Difan Zou,
Tong Zhang
Abstract:
In recent years, machine learning models have achieved success based on the independently and identically distributed assumption. However, this assumption can be easily violated in real-world applications, leading to the Out-of-Distribution (OOD) problem. Understanding how modern over-parameterized DNNs behave under non-trivial natural distributional shifts is essential, as current theoretical und…
▽ More
In recent years, machine learning models have achieved success based on the independently and identically distributed assumption. However, this assumption can be easily violated in real-world applications, leading to the Out-of-Distribution (OOD) problem. Understanding how modern over-parameterized DNNs behave under non-trivial natural distributional shifts is essential, as current theoretical understanding is insufficient. Existing theoretical works often provide meaningless results for over-parameterized models in OOD scenarios or even contradict empirical findings. To this end, we are investigating the performance of the over-parameterized model in terms of OOD generalization under the general benign overfitting conditions. Our analysis focuses on a random feature model and examines non-trivial natural distributional shifts, where the benign overfitting estimators demonstrate a constant excess OOD loss, despite achieving zero excess in-distribution (ID) loss. We demonstrate that in this scenario, further increasing the model's parameterization can significantly reduce the OOD loss. Intuitively, the variance term of ID loss remains low due to orthogonality of long-tail features, meaning overfitting noise during training generally doesn't raise testing loss. However, in OOD cases, distributional shift increases the variance term. Thankfully, the inherent shift is unrelated to individual x, maintaining the orthogonality of long-tail features. Expanding the hidden dimension can additionally improve this orthogonality by mapping the features into higher-dimensional spaces, thereby reducing the variance term. We further show that model ensembles also improve OOD loss, akin to increasing model capacity. These insights explain the empirical phenomenon of enhanced OOD generalization through model ensembles, supported by consistent simulations with theoretical results.
△ Less
Submitted 26 March, 2024;
originally announced March 2024.
-
A Sober Look at the Robustness of CLIPs to Spurious Features
Authors:
Qizhou Wang,
Yong Lin,
Yongqiang Chen,
Ludwig Schmidt,
Bo Han,
Tong Zhang
Abstract:
Large vision language models, such as CLIP, demonstrate impressive robustness to spurious features than single-modal models trained on ImageNet. However, existing test datasets are typically curated based on ImageNet-trained models, which aim to capture the spurious features inherited in ImageNet. Benchmarking CLIP models based on the ImageNet-oriented spurious features may not be sufficient to re…
▽ More
Large vision language models, such as CLIP, demonstrate impressive robustness to spurious features than single-modal models trained on ImageNet. However, existing test datasets are typically curated based on ImageNet-trained models, which aim to capture the spurious features inherited in ImageNet. Benchmarking CLIP models based on the ImageNet-oriented spurious features may not be sufficient to reflect the extent to which CLIP models are robust to spurious correlations within CLIP training data, e.g., LAION. To this end, we craft a new challenging dataset named CounterAnimal designed to reveal the reliance of CLIP models on realistic spurious features. Specifically, we split animal photos into groups according to the backgrounds, and then identify a pair of groups for each class where a CLIP model shows high-performance drops across the two groups. Our evaluations show that the spurious features captured by CounterAnimal are generically learned by CLIP models with different backbones and pre-train data, yet have limited influence for ImageNet models. We provide theoretical insights that the CLIP objective cannot offer additional robustness. Furthermore, we also re-evaluate strategies such as scaling up parameters and high-quality pre-trained data. We find that they still help mitigate the spurious features, providing a promising path for future developments.
△ Less
Submitted 2 November, 2024; v1 submitted 18 March, 2024;
originally announced March 2024.
-
An Improved Analysis of Langevin Algorithms with Prior Diffusion for Non-Log-Concave Sampling
Authors:
Xunpeng Huang,
Hanze Dong,
Difan Zou,
Tong Zhang
Abstract:
Understanding the dimension dependency of computational complexity in high-dimensional sampling problem is a fundamental problem, both from a practical and theoretical perspective. Compared with samplers with unbiased stationary distribution, e.g., Metropolis-adjusted Langevin algorithm (MALA), biased samplers, e.g., Underdamped Langevin Dynamics (ULD), perform better in low-accuracy cases just be…
▽ More
Understanding the dimension dependency of computational complexity in high-dimensional sampling problem is a fundamental problem, both from a practical and theoretical perspective. Compared with samplers with unbiased stationary distribution, e.g., Metropolis-adjusted Langevin algorithm (MALA), biased samplers, e.g., Underdamped Langevin Dynamics (ULD), perform better in low-accuracy cases just because a lower dimension dependency in their complexities. Along this line, Freund et al. (2022) suggest that the modified Langevin algorithm with prior diffusion is able to converge dimension independently for strongly log-concave target distributions. Nonetheless, it remains open whether such property establishes for more general cases. In this paper, we investigate the prior diffusion technique for the target distributions satisfying log-Sobolev inequality (LSI), which covers a much broader class of distributions compared to the strongly log-concave ones. In particular, we prove that the modified Langevin algorithm can also obtain the dimension-independent convergence of KL divergence with different step size schedules. The core of our proof technique is a novel construction of an interpolating SDE, which significantly helps to conduct a more accurate characterization of the discrete updates of the overdamped Langevin dynamics. Our theoretical analysis demonstrates the benefits of prior diffusion for a broader class of target distributions and provides new insights into developing faster sampling algorithms.
△ Less
Submitted 10 March, 2024;
originally announced March 2024.
-
Adaptive Projected Two-Sample Comparisons for Single-Cell Gene Expression Data
Authors:
Tianyu Zhang,
Jing Lei,
Kathryn Roeder
Abstract:
We study high-dimensional two-sample mean comparison and address the curse of dimensionality through data-adaptive projections. Leveraging the low-dimensional and localized signal structures commonly seen in single-cell genomics data, our first proposed method identifies a sparse, informative low-dimensional subspace and then performs statistical inference restricted to this subspace. To address t…
▽ More
We study high-dimensional two-sample mean comparison and address the curse of dimensionality through data-adaptive projections. Leveraging the low-dimensional and localized signal structures commonly seen in single-cell genomics data, our first proposed method identifies a sparse, informative low-dimensional subspace and then performs statistical inference restricted to this subspace. To address the double-dipping issue -- arising from using the same data for projection and inference -- we develop a debiased projected estimator using the semiparametric double-machine learning framework. The resulting inference not only has the usual frequentist validity but also provides useful information on the potential location of the signal due to the sparsity of the projection. Our second method uses a more flexible projection scheme to improve the power against the global null hypothesis and avoid the degeneracy issue commonly faced by existing methods. It is particularly useful when debiasing is practically challenging or when the informative signal is not well-captured by the subspace. Experiments on synthetic data and real datasets demonstrate the theoretical promise and interpretability of the proposed methods.
△ Less
Submitted 10 June, 2025; v1 submitted 8 March, 2024;
originally announced March 2024.
-
Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards
Authors:
Haoxiang Wang,
Yong Lin,
Wei Xiong,
Rui Yang,
Shizhe Diao,
Shuang Qiu,
Han Zhao,
Tong Zhang
Abstract:
Fine-grained control over large language models (LLMs) remains a significant challenge, hindering their adaptability to diverse user needs. While Reinforcement Learning from Human Feedback (RLHF) shows promise in aligning LLMs, its reliance on scalar rewards often limits its ability to capture diverse user preferences in real-world applications. To address this limitation, we introduce the Directi…
▽ More
Fine-grained control over large language models (LLMs) remains a significant challenge, hindering their adaptability to diverse user needs. While Reinforcement Learning from Human Feedback (RLHF) shows promise in aligning LLMs, its reliance on scalar rewards often limits its ability to capture diverse user preferences in real-world applications. To address this limitation, we introduce the Directional Preference Alignment (DPA) framework. Unlike the scalar-reward RLHF, DPA incorporates multi-objective reward modeling to represent diverse preference profiles. Additionally, DPA models user preferences as directions (i.e., unit vectors) in the reward space to achieve user-dependent preference control. Our method involves training a multi-objective reward model and then fine-tuning the LLM with a preference-conditioned variant of Rejection Sampling Finetuning (RSF), an RLHF method adopted by Llama 2. This method enjoys a better performance trade-off across various reward objectives. In comparison with the scalar-reward RLHF, DPA offers users intuitive control over LLM generation: they can arithmetically specify their desired trade-offs (e.g., more helpfulness with less verbosity). We also validate the effectiveness of DPA with real-world alignment experiments on Mistral-7B. Our method provides straightforward arithmetic control over the trade-off between helpfulness and verbosity while maintaining competitive performance with strong baselines such as Direct Preference Optimization (DPO).
△ Less
Submitted 6 March, 2024; v1 submitted 28 February, 2024;
originally announced February 2024.
-
Provably Efficient Partially Observable Risk-Sensitive Reinforcement Learning with Hindsight Observation
Authors:
Tonghe Zhang,
Yu Chen,
Longbo Huang
Abstract:
This work pioneers regret analysis of risk-sensitive reinforcement learning in partially observable environments with hindsight observation, addressing a gap in theoretical exploration. We introduce a novel formulation that integrates hindsight observations into a Partially Observable Markov Decision Process (POMDP) framework, where the goal is to optimize accumulated reward under the entropic ris…
▽ More
This work pioneers regret analysis of risk-sensitive reinforcement learning in partially observable environments with hindsight observation, addressing a gap in theoretical exploration. We introduce a novel formulation that integrates hindsight observations into a Partially Observable Markov Decision Process (POMDP) framework, where the goal is to optimize accumulated reward under the entropic risk measure. We develop the first provably efficient RL algorithm tailored for this setting. We also prove by rigorous analysis that our algorithm achieves polynomial regret $\tilde{O}\left(\frac{e^{|γ|H}-1}{|γ|H}H^2\sqrt{KHS^2OA}\right)$, which outperforms or matches existing upper bounds when the model degenerates to risk-neutral or fully observable settings. We adopt the method of change-of-measure and develop a novel analytical tool of beta vectors to streamline mathematical derivations. These techniques are of particular interest to the theoretical study of reinforcement learning.
△ Less
Submitted 28 February, 2024;
originally announced February 2024.
-
Towards Robust Model-Based Reinforcement Learning Against Adversarial Corruption
Authors:
Chenlu Ye,
Jiafan He,
Quanquan Gu,
Tong Zhang
Abstract:
This study tackles the challenges of adversarial corruption in model-based reinforcement learning (RL), where the transition dynamics can be corrupted by an adversary. Existing studies on corruption-robust RL mostly focus on the setting of model-free RL, where robust least-square regression is often employed for value function estimation. However, these techniques cannot be directly applied to mod…
▽ More
This study tackles the challenges of adversarial corruption in model-based reinforcement learning (RL), where the transition dynamics can be corrupted by an adversary. Existing studies on corruption-robust RL mostly focus on the setting of model-free RL, where robust least-square regression is often employed for value function estimation. However, these techniques cannot be directly applied to model-based RL. In this paper, we focus on model-based RL and take the maximum likelihood estimation (MLE) approach to learn transition model. Our work encompasses both online and offline settings. In the online setting, we introduce an algorithm called corruption-robust optimistic MLE (CR-OMLE), which leverages total-variation (TV)-based information ratios as uncertainty weights for MLE. We prove that CR-OMLE achieves a regret of $\tilde{\mathcal{O}}(\sqrt{T} + C)$, where $C$ denotes the cumulative corruption level after $T$ episodes. We also prove a lower bound to show that the additive dependence on $C$ is optimal. We extend our weighting technique to the offline setting, and propose an algorithm named corruption-robust pessimistic MLE (CR-PMLE). Under a uniform coverage condition, CR-PMLE exhibits suboptimality worsened by $\mathcal{O}(C/n)$, nearly matching the lower bound. To the best of our knowledge, this is the first work on corruption-robust model-based RL algorithms with provable guarantees.
△ Less
Submitted 20 July, 2024; v1 submitted 14 February, 2024;
originally announced February 2024.