Search | arXiv e-print repository

Optimal Convergence Rates of Deep Neural Network Classifiers

Authors: Zihan Zhang, Lei Shi, Ding-Xuan Zhou

Abstract: In this paper, we study the binary classification problem on $[0,1]^d$ under the Tsybakov noise condition (with exponent $s \in [0,\infty]$) and the compositional assumption. This assumption requires the conditional class probability function of the data distribution to be the composition of $q+1$ vector-valued multivariate functions, where each component function is either a maximum value functio… ▽ More In this paper, we study the binary classification problem on $[0,1]^d$ under the Tsybakov noise condition (with exponent $s \in [0,\infty]$) and the compositional assumption. This assumption requires the conditional class probability function of the data distribution to be the composition of $q+1$ vector-valued multivariate functions, where each component function is either a maximum value function or a Hölder-$β$ smooth function that depends only on $d_*$ of its input variables. Notably, $d_*$ can be significantly smaller than the input dimension $d$. We prove that, under these conditions, the optimal convergence rate for the excess 0-1 risk of classifiers is $$ \left( \frac{1}{n} \right)^{\frac{β\cdot(1\wedgeβ)^q}{{\frac{d_*}{s+1}+(1+\frac{1}{s+1})\cdotβ\cdot(1\wedgeβ)^q}}}\;\;\;, $$ which is independent of the input dimension $d$. Additionally, we demonstrate that ReLU deep neural networks (DNNs) trained with hinge loss can achieve this optimal convergence rate up to a logarithmic factor. This result provides theoretical justification for the excellent performance of ReLU DNNs in practical classification tasks, particularly in high-dimensional settings. The technique used to establish these results extends the oracle inequality presented in our previous work. The generalized approach is of independent interest. △ Less

Submitted 17 June, 2025; originally announced June 2025.

arXiv:2506.12751 [pdf, ps, other]

Single Index Bandits: Generalized Linear Contextual Bandits with Unknown Reward Functions

Authors: Yue Kang, Mingshuo Liu, Bongsoo Yi, Jing Lyu, Zhi Zhang, Doudou Zhou, Yao Li

Abstract: Generalized linear bandits have been extensively studied due to their broad applicability in real-world online decision-making problems. However, these methods typically assume that the expected reward function is known to the users, an assumption that is often unrealistic in practice. Misspecification of this link function can lead to the failure of all existing algorithms. In this work, we addre… ▽ More Generalized linear bandits have been extensively studied due to their broad applicability in real-world online decision-making problems. However, these methods typically assume that the expected reward function is known to the users, an assumption that is often unrealistic in practice. Misspecification of this link function can lead to the failure of all existing algorithms. In this work, we address this critical limitation by introducing a new problem of generalized linear bandits with unknown reward functions, also known as single index bandits. We first consider the case where the unknown reward function is monotonically increasing, and propose two novel and efficient algorithms, STOR and ESTOR, that achieve decent regrets under standard assumptions. Notably, our ESTOR can obtain the nearly optimal regret bound $\tilde{O}_T(\sqrt{T})$ in terms of the time horizon $T$. We then extend our methods to the high-dimensional sparse setting and show that the same regret rate can be attained with the sparsity index. Next, we introduce GSTOR, an algorithm that is agnostic to general reward functions, and establish regret bounds under a Gaussian design assumption. Finally, we validate the efficiency and effectiveness of our algorithms through experiments on both synthetic and real-world datasets. △ Less

Submitted 15 June, 2025; originally announced June 2025.

arXiv:2506.12412 [pdf, ps, other]

Cross-Domain Conditional Diffusion Models for Time Series Imputation

Authors: Kexin Zhang, Baoyu Jing, K. Selçuk Candan, Dawei Zhou, Qingsong Wen, Han Liu, Kaize Ding

Abstract: Cross-domain time series imputation is an underexplored data-centric research task that presents significant challenges, particularly when the target domain suffers from high missing rates and domain shifts in temporal dynamics. Existing time series imputation approaches primarily focus on the single-domain setting, which cannot effectively adapt to a new domain with domain shifts. Meanwhile, conv… ▽ More Cross-domain time series imputation is an underexplored data-centric research task that presents significant challenges, particularly when the target domain suffers from high missing rates and domain shifts in temporal dynamics. Existing time series imputation approaches primarily focus on the single-domain setting, which cannot effectively adapt to a new domain with domain shifts. Meanwhile, conventional domain adaptation techniques struggle with data incompleteness, as they typically assume the data from both source and target domains are fully observed to enable adaptation. For the problem of cross-domain time series imputation, missing values introduce high uncertainty that hinders distribution alignment, making existing adaptation strategies ineffective. Specifically, our proposed solution tackles this problem from three perspectives: (i) Data: We introduce a frequency-based time series interpolation strategy that integrates shared spectral components from both domains while retaining domain-specific temporal structures, constructing informative priors for imputation. (ii) Model: We design a diffusion-based imputation model that effectively learns domain-shared representations and captures domain-specific temporal dependencies with dedicated denoising networks. (iii) Algorithm: We further propose a cross-domain consistency alignment strategy that selectively regularizes output-level domain discrepancies, enabling effective knowledge transfer while preserving domain-specific characteristics. Extensive experiments on three real-world datasets demonstrate the superiority of our proposed approach. Our code implementation is available here. △ Less

Submitted 14 June, 2025; originally announced June 2025.

Comments: Accepted by ECML-PKDD 2025

arXiv:2506.00818 [pdf, other]

Generalized Linear Markov Decision Process

Authors: Sinian Zhang, Kaicheng Zhang, Ziping Xu, Tianxi Cai, Doudou Zhou

Abstract: The linear Markov Decision Process (MDP) framework offers a principled foundation for reinforcement learning (RL) with strong theoretical guarantees and sample efficiency. However, its restrictive assumption-that both transition dynamics and reward functions are linear in the same feature space-limits its applicability in real-world domains, where rewards often exhibit nonlinear or discrete struct… ▽ More The linear Markov Decision Process (MDP) framework offers a principled foundation for reinforcement learning (RL) with strong theoretical guarantees and sample efficiency. However, its restrictive assumption-that both transition dynamics and reward functions are linear in the same feature space-limits its applicability in real-world domains, where rewards often exhibit nonlinear or discrete structures. Motivated by applications such as healthcare and e-commerce, where data is scarce and reward signals can be binary or count-valued, we propose the Generalized Linear MDP (GLMDP) framework-an extension of the linear MDP framework-that models rewards using generalized linear models (GLMs) while maintaining linear transition dynamics. We establish the Bellman completeness of GLMDPs with respect to a new function class that accommodates nonlinear rewards and develop two offline RL algorithms: Generalized Pessimistic Value Iteration (GPEVI) and a semi-supervised variant (SS-GPEVI) that utilizes both labeled and unlabeled trajectories. Our algorithms achieve theoretical guarantees on policy suboptimality and demonstrate improved sample efficiency in settings where reward labels are expensive or limited. △ Less

Submitted 31 May, 2025; originally announced June 2025.

Comments: 34 pages, 9 figures

arXiv:2505.19209 [pdf, ps, other]

MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search

Authors: Zonglin Yang, Wanhao Liu, Ben Gao, Yujie Liu, Wei Li, Tong Xie, Lidong Bing, Wanli Ouyang, Erik Cambria, Dongzhan Zhou

Abstract: Large language models (LLMs) have shown promise in automating scientific hypothesis generation, yet existing approaches primarily yield coarse-grained hypotheses lacking critical methodological and experimental details. We introduce and formally define the novel task of fine-grained scientific hypothesis discovery, which entails generating detailed, experimentally actionable hypotheses from coarse… ▽ More Large language models (LLMs) have shown promise in automating scientific hypothesis generation, yet existing approaches primarily yield coarse-grained hypotheses lacking critical methodological and experimental details. We introduce and formally define the novel task of fine-grained scientific hypothesis discovery, which entails generating detailed, experimentally actionable hypotheses from coarse initial research directions. We frame this as a combinatorial optimization problem and investigate the upper limits of LLMs' capacity to solve it when maximally leveraged. Specifically, we explore four foundational questions: (1) how to best harness an LLM's internal heuristics to formulate the fine-grained hypothesis it itself would judge as the most promising among all the possible hypotheses it might generate, based on its own internal scoring-thus defining a latent reward landscape over the hypothesis space; (2) whether such LLM-judged better hypotheses exhibit stronger alignment with ground-truth hypotheses; (3) whether shaping the reward landscape using an ensemble of diverse LLMs of similar capacity yields better outcomes than defining it with repeated instances of the strongest LLM among them; and (4) whether an ensemble of identical LLMs provides a more reliable reward landscape than a single LLM. To address these questions, we propose a hierarchical search method that incrementally proposes and integrates details into the hypothesis, progressing from general concepts to specific experimental configurations. We show that this hierarchical process smooths the reward landscape and enables more effective optimization. Empirical evaluations on a new benchmark of expert-annotated fine-grained hypotheses from recent chemistry literature show that our method consistently outperforms strong baselines. △ Less

Submitted 25 May, 2025; originally announced May 2025.

arXiv:2505.12023 [pdf, ps, other]

Model-X Change-Point Detection of Conditional Distribution

Authors: Yiwen Huang, Yan Dong, Mengying Yan, Ziye Tian, Chuan Hong, Doudou Zhou, Molei Liu

Abstract: The dynamic nature of many real-world systems can lead to temporal outcome model shifts, causing a deterioration in model accuracy and reliability over time. This requires change-point detection on the outcome models to guide model retraining and adjustments. However, inferring the change point of conditional models is more prone to loss of validity or power than classic detection problems for mar… ▽ More The dynamic nature of many real-world systems can lead to temporal outcome model shifts, causing a deterioration in model accuracy and reliability over time. This requires change-point detection on the outcome models to guide model retraining and adjustments. However, inferring the change point of conditional models is more prone to loss of validity or power than classic detection problems for marginal distributions. This is due to both the temporal covariate shift and the complexity of the outcome model. To address these challenges, we propose a novel model-X Conditional Random Testing (CRT) method computationally enhanced with latent mixture model (LMM) distillation for simultaneous change-point detection and localization of the conditional outcome model. Built upon the model-X framework, our approach can effectively adjust for the potential bias caused by the temporal covariate shift and allow the flexible use of general machine learning methods for outcome modeling. It preserves good validity against complex or erroneous outcome models, even with imperfect knowledge of the temporal covariate shift learned from some auxiliary unlabeled data. Moreover, the incorporation of LMM distillation significantly reduces the computational burden of the CRT by eliminating the need for repeated complex model refitting in its resampling procedure and preserves the statistical validity and power well. Theoretical validity of the proposed method is justified. Extensive simulation studies and a real-world example demonstrate the statistical effectiveness and computational scalability of our method as well as its significant improvements over existing methods. △ Less

Submitted 20 May, 2025; v1 submitted 17 May, 2025; originally announced May 2025.

Comments: 13 pages, 5 figures

arXiv:2505.08198 [pdf, ps, other]

SIM-Shapley: A Stable and Computationally Efficient Approach to Shapley Value Approximation

Authors: Wangxuan Fan, Siqi Li, Doudou Zhou, Yohei Okada, Chuan Hong, Molei Liu, Nan Liu

Abstract: Explainable artificial intelligence (XAI) is essential for trustworthy machine learning (ML), particularly in high-stakes domains such as healthcare and finance. Shapley value (SV) methods provide a principled framework for feature attribution in complex models but incur high computational costs, limiting their scalability in high-dimensional settings. We propose Stochastic Iterative Momentum for… ▽ More Explainable artificial intelligence (XAI) is essential for trustworthy machine learning (ML), particularly in high-stakes domains such as healthcare and finance. Shapley value (SV) methods provide a principled framework for feature attribution in complex models but incur high computational costs, limiting their scalability in high-dimensional settings. We propose Stochastic Iterative Momentum for Shapley Value Approximation (SIM-Shapley), a stable and efficient SV approximation method inspired by stochastic optimization. We analyze variance theoretically, prove linear $Q$-convergence, and demonstrate improved empirical stability and low bias in practice on real-world datasets. In our numerical experiments, SIM-Shapley reduces computation time by up to 85% relative to state-of-the-art baselines while maintaining comparable feature attribution quality. Beyond feature attribution, our stochastic mini-batch iterative framework extends naturally to a broader class of sample average approximation problems, offering a new avenue for improving computational efficiency with stability guarantees. Code is publicly available at https://github.com/nliulab/SIM-Shapley. △ Less

Submitted 12 May, 2025; originally announced May 2025.

Comments: 21 pages, 6 figures, 5 tables

arXiv:2503.07976 [pdf, ps, other]

Two-Dimensional Deep ReLU CNN Approximation for Korobov Functions: A Constructive Approach

Authors: Qin Fang, Lei Shi, Min Xu, Ding-Xuan Zhou

Abstract: This paper investigates approximation capabilities of two-dimensional (2D) deep convolutional neural networks (CNNs), with Korobov functions serving as a benchmark. We focus on 2D CNNs, comprising multi-channel convolutional layers with zero-padding and ReLU activations, followed by a fully connected layer. We propose a fully constructive approach for building 2D CNNs to approximate Korobov functi… ▽ More This paper investigates approximation capabilities of two-dimensional (2D) deep convolutional neural networks (CNNs), with Korobov functions serving as a benchmark. We focus on 2D CNNs, comprising multi-channel convolutional layers with zero-padding and ReLU activations, followed by a fully connected layer. We propose a fully constructive approach for building 2D CNNs to approximate Korobov functions and provide rigorous analysis of the complexity of the constructed networks. Our results demonstrate that 2D CNNs achieve near-optimal approximation rates under the continuous weight selection model, significantly alleviating the curse of dimensionality. This work provides a solid theoretical foundation for 2D CNNs and illustrates their potential for broader applications in function approximation. △ Less

Submitted 10 March, 2025; originally announced March 2025.

arXiv:2412.16684 [pdf, other]

MATES: Multi-view Aggregated Two-Sample Test

Authors: Zexi Cai, Wenbo Fei, Doudou Zhou

Abstract: The two-sample test is a fundamental problem in statistics with a wide range of applications. In the realm of high-dimensional data, nonparametric methods have gained prominence due to their flexibility and minimal distributional assumptions. However, many existing methods tend to be more effective when the two distributions differ primarily in their first and/or second moments. In many real-world… ▽ More The two-sample test is a fundamental problem in statistics with a wide range of applications. In the realm of high-dimensional data, nonparametric methods have gained prominence due to their flexibility and minimal distributional assumptions. However, many existing methods tend to be more effective when the two distributions differ primarily in their first and/or second moments. In many real-world scenarios, distributional differences may arise in higher-order moments, rendering traditional methods less powerful. To address this limitation, we propose a novel framework to aggregate information from multiple moments to build a test statistic. Each moment is regarded as one view of the data and contributes to the detection of some specific type of discrepancy, thus allowing the test statistic to capture more complex distributional differences. The novel multi-view aggregated two-sample test (MATES) leverages a graph-based approach, where the test statistic is constructed from the weighted similarity graphs of the pooled sample. Under mild conditions on the multi-view weighted similarity graphs, we establish theoretical properties of MATES, including a distribution-free limiting distribution under the null hypothesis, which enables straightforward type-I error control. Extensive simulation studies demonstrate that MATES effectively distinguishes subtle differences between distributions. We further validate the method on the S&P100 data, showcasing its power in detecting complex distributional variations. △ Less

Submitted 21 December, 2024; originally announced December 2024.

arXiv:2410.23450 [pdf, other]

Return Augmented Decision Transformer for Off-Dynamics Reinforcement Learning

Authors: Ruhan Wang, Yu Yang, Zhishuai Liu, Dongruo Zhou, Pan Xu

Abstract: We study offline off-dynamics reinforcement learning (RL) to utilize data from an easily accessible source domain to enhance policy learning in a target domain with limited data. Our approach centers on return-conditioned supervised learning (RCSL), particularly focusing on the decision transformer (DT), which can predict actions conditioned on desired return guidance and complete trajectory histo… ▽ More We study offline off-dynamics reinforcement learning (RL) to utilize data from an easily accessible source domain to enhance policy learning in a target domain with limited data. Our approach centers on return-conditioned supervised learning (RCSL), particularly focusing on the decision transformer (DT), which can predict actions conditioned on desired return guidance and complete trajectory history. Previous works tackle the dynamics shift problem by augmenting the reward in the trajectory from the source domain to match the optimal trajectory in the target domain. However, this strategy can not be directly applicable in RCSL owing to (1) the unique form of the RCSL policy class, which explicitly depends on the return, and (2) the absence of a straightforward representation of the optimal trajectory distribution. We propose the Return Augmented Decision Transformer (RADT) method, where we augment the return in the source domain by aligning its distribution with that in the target domain. We provide the theoretical analysis demonstrating that the RCSL policy learned from RADT achieves the same level of suboptimality as would be obtained without a dynamics shift. We introduce two practical implementations RADT-DARA and RADT-MV respectively. Extensive experiments conducted on D4RL datasets reveal that our methods generally outperform dynamic programming based methods in off-dynamics RL scenarios. △ Less

Submitted 30 October, 2024; originally announced October 2024.

Comments: 26 pages, 10 tables, 10 figures

arXiv:2410.14092 [pdf, other]

Efficient Sparse PCA via Block-Diagonalization

Authors: Alberto Del Pia, Dekun Zhou, Yinglun Zhu

Abstract: Sparse Principal Component Analysis (Sparse PCA) is a pivotal tool in data analysis and dimensionality reduction. However, Sparse PCA is a challenging problem in both theory and practice: it is known to be NP-hard and current exact methods generally require exponential runtime. In this paper, we propose a novel framework to efficiently approximate Sparse PCA by (i) approximating the general input… ▽ More Sparse Principal Component Analysis (Sparse PCA) is a pivotal tool in data analysis and dimensionality reduction. However, Sparse PCA is a challenging problem in both theory and practice: it is known to be NP-hard and current exact methods generally require exponential runtime. In this paper, we propose a novel framework to efficiently approximate Sparse PCA by (i) approximating the general input covariance matrix with a re-sorted block-diagonal matrix, (ii) solving the Sparse PCA sub-problem in each block, and (iii) reconstructing the solution to the original problem. Our framework is simple and powerful: it can leverage any off-the-shelf Sparse PCA algorithm and achieve significant computational speedups, with a minor additive error that is linear in the approximation error of the block-diagonal matrix. Suppose $g(k, d)$ is the runtime of an algorithm (approximately) solving Sparse PCA in dimension $d$ and with sparsity constant $k$. Our framework, when integrated with this algorithm, reduces the runtime to $\mathcal{O}\left(\frac{d}{d^\star} \cdot g(k, d^\star) + d^2\right)$, where $d^\star \leq d$ is the largest block size of the block-diagonal matrix. For instance, integrating our framework with the Branch-and-Bound algorithm reduces the complexity from $g(k, d) = \mathcal{O}(k^3\cdot d^k)$ to $\mathcal{O}(k^3\cdot d \cdot (d^\star)^{k-1})$, demonstrating exponential speedups if $d^\star$ is small. We perform large-scale evaluations on many real-world datasets: for exact Sparse PCA algorithm, our method achieves an average speedup factor of 100.50, while maintaining an average approximation error of 0.61%; for approximate Sparse PCA algorithm, our method achieves an average speedup factor of 6.00 and an average approximation error of -0.91%, meaning that our method oftentimes finds better solutions. △ Less

Submitted 4 March, 2025; v1 submitted 17 October, 2024; originally announced October 2024.

Comments: 29 pages, 1 figure

arXiv:2410.06484 [pdf, other]

Model-assisted and Knowledge-guided Transfer Regression for the Underrepresented Population

Authors: Doudou Zhou, Mengyan Li, Tianxi Cai, Molei Liu

Abstract: Covariate shift and outcome model heterogeneity are two prominent challenges in leveraging external sources to improve risk modeling for underrepresented cohorts in paucity of accurate labels. We consider the transfer learning problem targeting some unlabeled minority sample encountering (i) covariate shift to the labeled source sample collected on a different cohort; and (ii) outcome model hetero… ▽ More Covariate shift and outcome model heterogeneity are two prominent challenges in leveraging external sources to improve risk modeling for underrepresented cohorts in paucity of accurate labels. We consider the transfer learning problem targeting some unlabeled minority sample encountering (i) covariate shift to the labeled source sample collected on a different cohort; and (ii) outcome model heterogeneity with some majority sample informative to the targeted minority model. In this scenario, we develop a novel model-assisted and knowledge-guided transfer learning targeting underrepresented population (MAKEUP) approach for high-dimensional regression models. Our MAKEUP approach includes a model-assisted debiasing step in response to the covariate shift, accompanied by a knowledge-guided sparsifying procedure leveraging the majority data to enhance learning on the minority group. We also develop a model selection method to avoid negative knowledge transfer that can work in the absence of gold standard labels on the target sample. Theoretical analyses show that MAKEUP provides efficient estimation for the target model on the minority group. It maintains robustness to the high complexity and misspecification of the nuisance models used for covariate shift correction, as well as adaptivity to the model heterogeneity and potential negative transfer between the majority and minority groups. Numerical studies demonstrate similar advantages in finite sample settings over existing approaches. We also illustrate our approach through a real-world application about the transfer learning of Type II diabetes genetic risk models on some underrepresented ancestry group. △ Less

Submitted 8 October, 2024; originally announced October 2024.

arXiv:2410.01047 [pdf, ps, other]

Spherical Analysis of Learning Nonlinear Functionals

Authors: Zhenyu Yang, Shuo Huang, Han Feng, Ding-Xuan Zhou

Abstract: In recent years, there has been growing interest in the field of functional neural networks. They have been proposed and studied with the aim of approximating continuous functionals defined on sets of functions on Euclidean domains. In this paper, we consider functionals defined on sets of functions on spheres. The approximation ability of deep ReLU neural networks is investigated by novel spheric… ▽ More In recent years, there has been growing interest in the field of functional neural networks. They have been proposed and studied with the aim of approximating continuous functionals defined on sets of functions on Euclidean domains. In this paper, we consider functionals defined on sets of functions on spheres. The approximation ability of deep ReLU neural networks is investigated by novel spherical analysis using an encoder-decoder framework. An encoder comes up first to accommodate the infinite-dimensional nature of the domain of functionals. It utilizes spherical harmonics to help us extract the latent finite-dimensional information of functions, which in turn facilitates in the next step of approximation analysis using fully connected neural networks. Moreover, real-world objects are frequently sampled discretely and are often corrupted by noise. Therefore, encoders with discrete input and those with discrete and random noise input are constructed, respectively. The approximation rates with different encoder structures are provided therein. △ Less

Submitted 1 October, 2024; originally announced October 2024.

arXiv:2409.07745 [pdf, other]

Generalized Independence Test for Modern Data

Authors: Mingshuo Liu, Doudou Zhou, Hao Chen

Abstract: The test of independence is a crucial component of modern data analysis. However, traditional methods often struggle with the complex dependency structures found in high-dimensional data. To overcome this challenge, we introduce a novel test statistic that captures intricate relationships using similarity and dissimilarity information derived from the data. The statistic exhibits strong power acro… ▽ More The test of independence is a crucial component of modern data analysis. However, traditional methods often struggle with the complex dependency structures found in high-dimensional data. To overcome this challenge, we introduce a novel test statistic that captures intricate relationships using similarity and dissimilarity information derived from the data. The statistic exhibits strong power across a broad range of alternatives for high-dimensional data, as demonstrated in extensive simulation studies. Under mild conditions, we show that the new test statistic converges to the $χ^2_4$ distribution under the permutation null distribution, ensuring straightforward type I error control. Furthermore, our research advances the moment method in proving the joint asymptotic normality of multiple double-indexed permutation statistics. We showcase the practical utility of this new test with an application to the Genotype-Tissue Expression dataset, where it effectively measures associations between human tissues. △ Less

Submitted 12 September, 2024; originally announced September 2024.

arXiv:2405.06415 [pdf, other]

Generalization analysis with deep ReLU networks for metric and similarity learning

Authors: Junyu Zhou, Puyu Wang, Ding-Xuan Zhou

Abstract: While considerable theoretical progress has been devoted to the study of metric and similarity learning, the generalization mystery is still missing. In this paper, we study the generalization performance of metric and similarity learning by leveraging the specific structure of the true metric (the target function). Specifically, by deriving the explicit form of the true metric for metric and simi… ▽ More While considerable theoretical progress has been devoted to the study of metric and similarity learning, the generalization mystery is still missing. In this paper, we study the generalization performance of metric and similarity learning by leveraging the specific structure of the true metric (the target function). Specifically, by deriving the explicit form of the true metric for metric and similarity learning with the hinge loss, we construct a structured deep ReLU neural network as an approximation of the true metric, whose approximation ability relies on the network complexity. Here, the network complexity corresponds to the depth, the number of nonzero weights and the computation units of the network. Consider the hypothesis space which consists of the structured deep ReLU networks, we develop the excess generalization error bounds for a metric and similarity learning problem by estimating the approximation error and the estimation error carefully. An optimal excess risk rate is derived by choosing the proper capacity of the constructed hypothesis space. To the best of our knowledge, this is the first-ever-known generalization analysis providing the excess generalization error for metric and similarity learning. In addition, we investigate the properties of the true metric of metric and similarity learning with general losses. △ Less

Submitted 10 May, 2024; originally announced May 2024.

Comments: 15 pages, 1 figure

arXiv:2403.16459 [pdf, other]

On the rates of convergence for learning with convolutional neural networks

Authors: Yunfei Yang, Han Feng, Ding-Xuan Zhou

Abstract: We study approximation and learning capacities of convolutional neural networks (CNNs) with one-side zero-padding and multiple channels. Our first result proves a new approximation bound for CNNs with certain constraint on the weights. Our second result gives new analysis on the covering number of feed-forward neural networks with CNNs as special cases. The analysis carefully takes into account th… ▽ More We study approximation and learning capacities of convolutional neural networks (CNNs) with one-side zero-padding and multiple channels. Our first result proves a new approximation bound for CNNs with certain constraint on the weights. Our second result gives new analysis on the covering number of feed-forward neural networks with CNNs as special cases. The analysis carefully takes into account the size of the weights and hence gives better bounds than the existing literature in some situations. Using these two results, we are able to derive rates of convergence for estimators based on CNNs in many learning problems. In particular, we establish minimax optimal convergence rates of the least squares based on CNNs for learning smooth functions in the nonparametric regression setting. For binary classification, we derive convergence rates for CNN classifiers with hinge loss and logistic loss. It is also shown that the obtained rates for classification are minimax optimal in some common settings. △ Less

Submitted 8 April, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

arXiv:2403.14926 [pdf, other]

Contrastive Learning on Multimodal Analysis of Electronic Health Records

Authors: Tianxi Cai, Feiqing Huang, Ryumei Nakada, Linjun Zhang, Doudou Zhou

Abstract: Electronic health record (EHR) systems contain a wealth of multimodal clinical data including structured data like clinical codes and unstructured data such as clinical notes. However, many existing EHR-focused studies has traditionally either concentrated on an individual modality or merged different modalities in a rather rudimentary fashion. This approach often results in the perception of stru… ▽ More Electronic health record (EHR) systems contain a wealth of multimodal clinical data including structured data like clinical codes and unstructured data such as clinical notes. However, many existing EHR-focused studies has traditionally either concentrated on an individual modality or merged different modalities in a rather rudimentary fashion. This approach often results in the perception of structured and unstructured data as separate entities, neglecting the inherent synergy between them. Specifically, the two important modalities contain clinically relevant, inextricably linked and complementary health information. A more complete picture of a patient's medical history is captured by the joint analysis of the two modalities of data. Despite the great success of multimodal contrastive learning on vision-language, its potential remains under-explored in the realm of multimodal EHR, particularly in terms of its theoretical understanding. To accommodate the statistical analysis of multimodal EHR data, in this paper, we propose a novel multimodal feature embedding generative model and design a multimodal contrastive loss to obtain the multimodal EHR feature representation. Our theoretical analysis demonstrates the effectiveness of multimodal learning compared to single-modality learning and connects the solution of the loss function to the singular value decomposition of a pointwise mutual information matrix. This connection paves the way for a privacy-preserving algorithm tailored for multimodal EHR feature representation learning. Simulation studies show that the proposed algorithm performs well under a variety of configurations. We further validate the clinical utility of the proposed algorithm in real-world EHR data. △ Less

Submitted 21 March, 2024; originally announced March 2024.

Comments: 34 pages

arXiv:2403.12284 [pdf, other]

The Wreaths of KHAN: Uniform Graph Feature Selection with False Discovery Rate Control

Authors: Jiajun Liang, Yue Liu, Doudou Zhou, Sinian Zhang, Junwei Lu

Abstract: Graphical models find numerous applications in biology, chemistry, sociology, neuroscience, etc. While substantial progress has been made in graph estimation, it remains largely unexplored how to select significant graph signals with uncertainty assessment, especially those graph features related to topological structures including cycles (i.e., wreaths), cliques, hubs, etc. These features play a… ▽ More Graphical models find numerous applications in biology, chemistry, sociology, neuroscience, etc. While substantial progress has been made in graph estimation, it remains largely unexplored how to select significant graph signals with uncertainty assessment, especially those graph features related to topological structures including cycles (i.e., wreaths), cliques, hubs, etc. These features play a vital role in protein substructure analysis, drug molecular design, and brain network connectivity analysis. To fill the gap, we propose a novel inferential framework for general high dimensional graphical models to select graph features with false discovery rate controlled. Our method is based on the maximum of $p$-values from single edges that comprise the topological feature of interest, thus is able to detect weak signals. Moreover, we introduce the $K$-dimensional persistent Homology Adaptive selectioN (KHAN) algorithm to select all the homological features within $K$ dimensions with the uniform control of the false discovery rate over continuous filtration levels. The KHAN method applies a novel discrete Gram-Schmidt algorithm to select statistically significant generators from the homology group. We apply the structural screening method to identify the important residues of the SARS-CoV-2 spike protein during the binding process to the ACE2 receptors. We score the residues for all domains in the spike protein by the $p$-value weighted filtration level in the network persistent homology for the closed, partially open, and open states and identify the residues crucial for protein conformational changes and thus being potential targets for inhibition. △ Less

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2403.11960 [pdf, other]

doi 10.1145/3627673.3679642

Causality-Aware Spatiotemporal Graph Neural Networks for Spatiotemporal Time Series Imputation

Authors: Baoyu Jing, Dawei Zhou, Kan Ren, Carl Yang

Abstract: Spatiotemporal time series are usually collected via monitoring sensors placed at different locations, which usually contain missing values due to various failures, such as mechanical damages and Internet outages. Imputing the missing values is crucial for analyzing time series. When recovering a specific data point, most existing methods consider all the information relevant to that point regardl… ▽ More Spatiotemporal time series are usually collected via monitoring sensors placed at different locations, which usually contain missing values due to various failures, such as mechanical damages and Internet outages. Imputing the missing values is crucial for analyzing time series. When recovering a specific data point, most existing methods consider all the information relevant to that point regardless of the cause-and-effect relationship. During data collection, it is inevitable that some unknown confounders are included, e.g., background noise in time series and non-causal shortcut edges in the constructed sensor network. These confounders could open backdoor paths and establish non-causal correlations between the input and output. Over-exploiting these non-causal correlations could cause overfitting. In this paper, we first revisit spatiotemporal time series imputation from a causal perspective and show how to block the confounders via the frontdoor adjustment. Based on the results of frontdoor adjustment, we introduce a novel Causality-Aware Spatiotemporal Graph Neural Network (Casper), which contains a novel Prompt Based Decoder (PBD) and a Spatiotemporal Causal Attention (SCA). PBD could reduce the impact of confounders and SCA could discover the sparse causal relationships among embeddings. Theoretical analysis reveals that SCA discovers causal relationships based on the values of gradients. We evaluate Casper on three real-world datasets, and the experimental results show that Casper could outperform the baselines and could effectively discover causal relationships. △ Less

Submitted 23 October, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

Comments: Accepted by CIKM'2024. Fixed typos

arXiv:2402.12875 [pdf, other]

Chain of Thought Empowers Transformers to Solve Inherently Serial Problems

Authors: Zhiyuan Li, Hong Liu, Denny Zhou, Tengyu Ma

Abstract: Instructing the model to generate a sequence of intermediate steps, a.k.a., a chain of thought (CoT), is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetics and symbolic reasoning tasks. However, the mechanism behind CoT remains unclear. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of… ▽ More Instructing the model to generate a sequence of intermediate steps, a.k.a., a chain of thought (CoT), is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetics and symbolic reasoning tasks. However, the mechanism behind CoT remains unclear. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness. Conceptually, CoT empowers the model with the ability to perform inherently serial computation, which is otherwise lacking in transformers, especially when depth is low. Given input length $n$, previous works have shown that constant-depth transformers with finite precision $\mathsf{poly}(n)$ embedding size can only solve problems in $\mathsf{TC}^0$ without CoT. We first show an even tighter expressiveness upper bound for constant-depth transformers with constant-bit precision, which can only solve problems in $\mathsf{AC}^0$, a proper subset of $ \mathsf{TC}^0$. However, with $T$ steps of CoT, constant-depth transformers using constant-bit precision and $O(\log n)$ embedding size can solve any problem solvable by boolean circuits of size $T$. Empirically, enabling CoT dramatically improves the accuracy for tasks that are hard for parallel computation, including the composition of permutation groups, iterated squaring, and circuit value problems, especially for low-depth transformers. △ Less

Submitted 21 September, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

Comments: 38 pages, 10 figures. Accepted by ICLR 2024

arXiv:2402.08998 [pdf, other]

Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path

Authors: Qiwei Di, Jiafan He, Dongruo Zhou, Quanquan Gu

Abstract: We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel, where an agent repeatedly interacts with a stochastic environment and seeks to reach certain goal state while minimizing the cumulative cost. Existing works often assume a strictly positive lower bound of the cost function or an upper bound of the expected length for the optimal policy. In this paper, we p… ▽ More We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel, where an agent repeatedly interacts with a stochastic environment and seeks to reach certain goal state while minimizing the cumulative cost. Existing works often assume a strictly positive lower bound of the cost function or an upper bound of the expected length for the optimal policy. In this paper, we propose a new algorithm to eliminate these restrictive assumptions. Our algorithm is based on extended value iteration with a fine-grained variance-aware confidence set, where the variance is estimated recursively from high-order moments. Our algorithm achieves an $\tilde{\mathcal O}(dB_*\sqrt{K})$ regret bound, where $d$ is the dimension of the feature mapping in the linear transition kernel, $B_*$ is the upper bound of the total cumulative cost for the optimal policy, and $K$ is the number of episodes. Our regret upper bound matches the $Ω(dB_*\sqrt{K})$ lower bound of linear mixture SSPs in Min et al. (2022), which suggests that our algorithm is nearly minimax optimal. △ Less

Submitted 14 February, 2024; originally announced February 2024.

Comments: 28 pages, 1 figure, In ICML 2023

arXiv:2401.02890 [pdf, other]

Nonlinear functional regression by functional deep neural network with kernel embedding

Authors: Zhongjie Shi, Jun Fan, Linhao Song, Ding-Xuan Zhou, Johan A. K. Suykens

Abstract: Recently, deep learning has been widely applied in functional data analysis (FDA) with notable empirical success. However, the infinite dimensionality of functional data necessitates an effective dimension reduction approach for functional learning tasks, particularly in nonlinear functional regression. In this paper, we introduce a functional deep neural network with an adaptive and discretizatio… ▽ More Recently, deep learning has been widely applied in functional data analysis (FDA) with notable empirical success. However, the infinite dimensionality of functional data necessitates an effective dimension reduction approach for functional learning tasks, particularly in nonlinear functional regression. In this paper, we introduce a functional deep neural network with an adaptive and discretization-invariant dimension reduction method. Our functional network architecture consists of three parts: first, a kernel embedding step that features an integral transformation with an adaptive smooth kernel; next, a projection step that utilizes eigenfunction bases based on a projection Mercer kernel for the dimension reduction; and finally, a deep ReLU neural network is employed for the prediction. Explicit rates of approximating nonlinear smooth functionals across various input function spaces by our proposed functional network are derived. Additionally, we conduct a generalization analysis for the empirical risk minimization (ERM) algorithm applied to our functional net, by employing a novel two-stage oracle inequality and the established functional approximation results. Ultimately, we conduct numerical experiments on both simulated and real datasets to demonstrate the effectiveness and benefits of our functional net. △ Less

Submitted 12 May, 2025; v1 submitted 5 January, 2024; originally announced January 2024.

arXiv:2312.15611 [pdf, other]

Inference of Dependency Knowledge Graph for Electronic Health Records

Authors: Zhiwei Xu, Ziming Gan, Doudou Zhou, Shuting Shen, Junwei Lu, Tianxi Cai

Abstract: The effective analysis of high-dimensional Electronic Health Record (EHR) data, with substantial potential for healthcare research, presents notable methodological challenges. Employing predictive modeling guided by a knowledge graph (KG), which enables efficient feature selection, can enhance both statistical efficiency and interpretability. While various methods have emerged for constructing KGs… ▽ More The effective analysis of high-dimensional Electronic Health Record (EHR) data, with substantial potential for healthcare research, presents notable methodological challenges. Employing predictive modeling guided by a knowledge graph (KG), which enables efficient feature selection, can enhance both statistical efficiency and interpretability. While various methods have emerged for constructing KGs, existing techniques often lack statistical certainty concerning the presence of links between entities, especially in scenarios where the utilization of patient-level EHR data is limited due to privacy concerns. In this paper, we propose the first inferential framework for deriving a sparse KG with statistical guarantee based on the dynamic log-linear topic model proposed by \cite{arora2016latent}. Within this model, the KG embeddings are estimated by performing singular value decomposition on the empirical pointwise mutual information matrix, offering a scalable solution. We then establish entrywise asymptotic normality for the KG low-rank estimator, enabling the recovery of sparse graph edges with controlled type I error. Our work uniquely addresses the under-explored domain of statistical inference about non-linear statistics under the low-rank temporal dependent models, a critical gap in existing research. We validate our approach through extensive simulation studies and then apply the method to real-world EHR data in constructing clinical KGs and generating clinical feature embeddings. △ Less

Submitted 24 December, 2023; originally announced December 2023.

arXiv:2311.14222 [pdf, other]

Risk Bounds of Accelerated SGD for Overparameterized Linear Regression

Authors: Xuheng Li, Yihe Deng, Jingfeng Wu, Dongruo Zhou, Quanquan Gu

Abstract: Accelerated stochastic gradient descent (ASGD) is a workhorse in deep learning and often achieves better generalization performance than SGD. However, existing optimization theory can only explain the faster convergence of ASGD, but cannot explain its better generalization. In this paper, we study the generalization of ASGD for overparameterized linear regression, which is possibly the simplest se… ▽ More Accelerated stochastic gradient descent (ASGD) is a workhorse in deep learning and often achieves better generalization performance than SGD. However, existing optimization theory can only explain the faster convergence of ASGD, but cannot explain its better generalization. In this paper, we study the generalization of ASGD for overparameterized linear regression, which is possibly the simplest setting of learning with overparameterization. We establish an instance-dependent excess risk bound for ASGD within each eigen-subspace of the data covariance matrix. Our analysis shows that (i) ASGD outperforms SGD in the subspace of small eigenvalues, exhibiting a faster rate of exponential decay for bias error, while in the subspace of large eigenvalues, its bias error decays slower than SGD; and (ii) the variance error of ASGD is always larger than that of SGD. Our result suggests that ASGD can outperform SGD when the difference between the initialization and the true weight vector is mostly confined to the subspace of small eigenvalues. Additionally, when our analysis is specialized to linear regression in the strongly convex setting, it yields a tighter bound for bias error than the best-known result. △ Less

Submitted 23 November, 2023; originally announced November 2023.

Comments: 85 pages, 5 figures

arXiv:2311.11563 [pdf]

Time-varying effect in the competing risks based on restricted mean time lost

Authors: Zhiyin Yu, Zhaojin Li, Chengfeng Zhang, Yawen Hou, Derun Zhou, Zheng Chen

Abstract: Patients with breast cancer tend to die from other diseases, so for studies that focus on breast cancer, a competing risks model is more appropriate. Considering subdistribution hazard ratio, which is used often, limited to model assumptions and clinical interpretation, we aimed to quantify the effects of prognostic factors by an absolute indicator, the difference in restricted mean time lost (RMT… ▽ More Patients with breast cancer tend to die from other diseases, so for studies that focus on breast cancer, a competing risks model is more appropriate. Considering subdistribution hazard ratio, which is used often, limited to model assumptions and clinical interpretation, we aimed to quantify the effects of prognostic factors by an absolute indicator, the difference in restricted mean time lost (RMTL), which is more intuitive. Additionally, prognostic factors may have dynamic effects (time-varying effects) in long-term follow-up. However, existing competing risks regression models only provide a static view of covariate effects, leading to a distorted assessment of the prognostic factor. To address this issue, we proposed a dynamic effect RMTL regression that can explore the between-group cumulative difference in mean life lost over a period of time and obtain the real-time effect by the speed of accumulation, as well as personalized predictions on a time scale. Through Monte Carlo simulation, we validated the dynamic effects estimated by the proposed regression having low bias and a coverage rate of around 95%. Applying this model to an elderly early-stage breast cancer cohort, we found that most factors had different patterns of dynamic effects, revealing meaningful physiological mechanisms underlying diseases. Moreover, from the perspective of prediction, the mean C-index in external validation reached 0.78. Dynamic effect RMTL regression can analyze both dynamic cumulative effects and real-time effects of covariates, providing a more comprehensive prognosis and better prediction when competing risks exist. △ Less

Submitted 20 November, 2023; originally announced November 2023.

arXiv:2309.04236 [pdf, other]

Adaptive Distributed Kernel Ridge Regression: A Feasible Distributed Learning Scheme for Data Silos

Authors: Di Wang, Xiaotong Liu, Shao-Bo Lin, Ding-Xuan Zhou

Abstract: Data silos, mainly caused by privacy and interoperability, significantly constrain collaborations among different organizations with similar data for the same purpose. Distributed learning based on divide-and-conquer provides a promising way to settle the data silos, but it suffers from several challenges, including autonomy, privacy guarantees, and the necessity of collaborations. This paper focu… ▽ More Data silos, mainly caused by privacy and interoperability, significantly constrain collaborations among different organizations with similar data for the same purpose. Distributed learning based on divide-and-conquer provides a promising way to settle the data silos, but it suffers from several challenges, including autonomy, privacy guarantees, and the necessity of collaborations. This paper focuses on developing an adaptive distributed kernel ridge regression (AdaDKRR) by taking autonomy in parameter selection, privacy in communicating non-sensitive information, and the necessity of collaborations in performance improvement into account. We provide both solid theoretical verification and comprehensive experiments for AdaDKRR to demonstrate its feasibility and effectiveness. Theoretically, we prove that under some mild conditions, AdaDKRR performs similarly to running the optimal learning algorithms on the whole data, verifying the necessity of collaborations and showing that no other distributed learning scheme can essentially beat AdaDKRR under the same conditions. Numerically, we test AdaDKRR on both toy simulations and two real-world applications to show that AdaDKRR is superior to other existing distributed learning schemes. All these results show that AdaDKRR is a feasible scheme to defend against data silos, which are highly desired in numerous application regions such as intelligent decision-making, pricing forecasting, and performance prediction for products. △ Less

Submitted 8 September, 2023; originally announced September 2023.

Comments: 46pages, 13figures

arXiv:2308.09605 [pdf, other]

Solving PDEs on Spheres with Physics-Informed Convolutional Neural Networks

Authors: Guanhang Lei, Zhen Lei, Lei Shi, Chenyu Zeng, Ding-Xuan Zhou

Abstract: Physics-informed neural networks (PINNs) have been demonstrated to be efficient in solving partial differential equations (PDEs) from a variety of experimental perspectives. Some recent studies have also proposed PINN algorithms for PDEs on surfaces, including spheres. However, theoretical understanding of the numerical performance of PINNs, especially PINNs on surfaces or manifolds, is still lack… ▽ More Physics-informed neural networks (PINNs) have been demonstrated to be efficient in solving partial differential equations (PDEs) from a variety of experimental perspectives. Some recent studies have also proposed PINN algorithms for PDEs on surfaces, including spheres. However, theoretical understanding of the numerical performance of PINNs, especially PINNs on surfaces or manifolds, is still lacking. In this paper, we establish rigorous analysis of the physics-informed convolutional neural network (PICNN) for solving PDEs on the sphere. By using and improving the latest approximation results of deep convolutional neural networks and spherical harmonic analysis, we prove an upper bound for the approximation error with respect to the Sobolev norm. Subsequently, we integrate this with innovative localization complexity analysis to establish fast convergence rates for PICNN. Our theoretical results are also confirmed and supplemented by our experiments. In light of these findings, we explore potential strategies for circumventing the curse of dimensionality that arises when solving high-dimensional PDEs. △ Less

Submitted 5 August, 2024; v1 submitted 18 August, 2023; originally announced August 2023.

arXiv:2308.08562 [pdf, other]

Bayesian Inference of Phenotypic Plasticity of Cancer Cells Based on Dynamic Model for Temporal Cell Proportion Data

Authors: Shuli Chen, Yuman Wang, Da Zhou, Jie Hu

Abstract: Mounting evidence underscores the prevalent hierarchical organization of cancer tissues. At the foundation of this hierarchy reside cancer stem cells, a subset of cells endowed with the pivotal role of engendering the entire cancer tissue through cell differentiation. In recent times, substantial attention has been directed towards the phenomenon of cancer cell plasticity, where the dynamic interc… ▽ More Mounting evidence underscores the prevalent hierarchical organization of cancer tissues. At the foundation of this hierarchy reside cancer stem cells, a subset of cells endowed with the pivotal role of engendering the entire cancer tissue through cell differentiation. In recent times, substantial attention has been directed towards the phenomenon of cancer cell plasticity, where the dynamic interconversion between cancer stem cells and non-stem cancer cells has garnered significant interest. Since the task of detecting cancer cell plasticity from empirical data remains a formidable challenge, we propose a Bayesian statistical framework designed to infer phenotypic plasticity within cancer cells, utilizing temporal data on cancer stem cell proportions. Our approach is grounded in a stochastic model, adept at capturing the dynamic behaviors of cells. Leveraging Bayesian analysis, we explore the moment equation governing cancer stem cell proportions, derived from the Kolmogorov forward equation of our stochastic model. With improved Euler method for ordinary differential equations, a new statistical method for parameter estimation in nonlinear ordinary differential equations models is developed, which also provides novel ideas for the study of compositional data. Extensive simulations robustly validate the efficacy of our proposed method. To further corroborate our findings, we apply our approach to analyze published data from SW620 colon cancer cell lines. Our results harmonize with \emph{in situ} experiments, thereby reinforcing the utility of our method in discerning and quantifying phenotypic plasticity within cancer cells. △ Less

Submitted 14 August, 2023; originally announced August 2023.

arXiv:2307.16792 [pdf, ps, other]

Classification with Deep Neural Networks and Logistic Loss

Authors: Zihan Zhang, Lei Shi, Ding-Xuan Zhou

Abstract: Deep neural networks (DNNs) trained with the logistic loss (i.e., the cross entropy loss) have made impressive advancements in various binary classification tasks. However, generalization analysis for binary classification with DNNs and logistic loss remains scarce. The unboundedness of the target function for the logistic loss is the main obstacle to deriving satisfactory generalization bounds. I… ▽ More Deep neural networks (DNNs) trained with the logistic loss (i.e., the cross entropy loss) have made impressive advancements in various binary classification tasks. However, generalization analysis for binary classification with DNNs and logistic loss remains scarce. The unboundedness of the target function for the logistic loss is the main obstacle to deriving satisfactory generalization bounds. In this paper, we aim to fill this gap by establishing a novel and elegant oracle-type inequality, which enables us to deal with the boundedness restriction of the target function, and using it to derive sharp convergence rates for fully connected ReLU DNN classifiers trained with logistic loss. In particular, we obtain optimal convergence rates (up to log factors) only requiring the Hölder smoothness of the conditional class probability $η$ of data. Moreover, we consider a compositional assumption that requires $η$ to be the composition of several vector-valued functions of which each component function is either a maximum value function or a Hölder smooth function only depending on a small number of its input variables. Under this assumption, we derive optimal convergence rates (up to log factors) which are independent of the input dimension of data. This result explains why DNN classifiers can perform well in practical high-dimensional classification problems. Besides the novel oracle-type inequality, the sharp convergence rates given in our paper also owe to a tight error bound for approximating the natural logarithm function near zero (where it is unbounded) by ReLU DNNs. In addition, we justify our claims for the optimality of rates by proving corresponding minimax lower bounds. All these results are new in the literature and will deepen our theoretical understanding of classification with DNNs. △ Less

Submitted 21 April, 2024; v1 submitted 31 July, 2023; originally announced July 2023.

arXiv:2307.12461 [pdf, ps, other]

Rates of Approximation by ReLU Shallow Neural Networks

Authors: Tong Mao, Ding-Xuan Zhou

Abstract: Neural networks activated by the rectified linear unit (ReLU) play a central role in the recent development of deep learning. The topic of approximating functions from Hölder spaces by these networks is crucial for understanding the efficiency of the induced learning algorithms. Although the topic has been well investigated in the setting of deep neural networks with many layers of hidden neurons,… ▽ More Neural networks activated by the rectified linear unit (ReLU) play a central role in the recent development of deep learning. The topic of approximating functions from Hölder spaces by these networks is crucial for understanding the efficiency of the induced learning algorithms. Although the topic has been well investigated in the setting of deep neural networks with many layers of hidden neurons, it is still open for shallow networks having only one hidden layer. In this paper, we provide rates of uniform approximation by these networks. We show that ReLU shallow neural networks with $m$ hidden neurons can uniformly approximate functions from the Hölder space $W_\infty^r([-1, 1]^d)$ with rates $O((\log m)^{\frac{1}{2} +d}m^{-\frac{r}{d}\frac{d+2}{d+4}})$ when $r<d/2 +2$. Such rates are very close to the optimal one $O(m^{-\frac{r}{d}})$ in the sense that $\frac{d+2}{d+4}$ is close to $1$, when the dimension $d$ is large. △ Less

Submitted 23 July, 2023; originally announced July 2023.

arXiv:2307.03487 [pdf, ps, other]

Learning Theory of Distribution Regression with Neural Networks

Authors: Zhongjie Shi, Zhan Yu, Ding-Xuan Zhou

Abstract: In this paper, we aim at establishing an approximation theory and a learning theory of distribution regression via a fully connected neural network (FNN). In contrast to the classical regression methods, the input variables of distribution regression are probability measures. Then we often need to perform a second-stage sampling process to approximate the actual information of the distribution. On… ▽ More In this paper, we aim at establishing an approximation theory and a learning theory of distribution regression via a fully connected neural network (FNN). In contrast to the classical regression methods, the input variables of distribution regression are probability measures. Then we often need to perform a second-stage sampling process to approximate the actual information of the distribution. On the other hand, the classical neural network structure requires the input variable to be a vector. When the input samples are probability distributions, the traditional deep neural network method cannot be directly used and the difficulty arises for distribution regression. A well-defined neural network structure for distribution inputs is intensively desirable. There is no mathematical model and theoretical analysis on neural network realization of distribution regression. To overcome technical difficulties and address this issue, we establish a novel fully connected neural network framework to realize an approximation theory of functionals defined on the space of Borel probability measures. Furthermore, based on the established functional approximation results, in the hypothesis space induced by the novel FNN structure with distribution inputs, almost optimal learning rates for the proposed distribution regression model up to logarithmic terms are derived via a novel two-stage error decomposition technique. △ Less

Submitted 7 July, 2023; originally announced July 2023.

arXiv:2306.08321 [pdf, other]

Nonparametric regression using over-parameterized shallow ReLU neural networks

Authors: Yunfei Yang, Ding-Xuan Zhou

Abstract: It is shown that over-parameterized neural networks can achieve minimax optimal rates of convergence (up to logarithmic factors) for learning functions from certain smooth function classes, if the weights are suitably constrained or regularized. Specifically, we consider the nonparametric regression of estimating an unknown $d$-variate function by using shallow ReLU neural networks. It is assumed… ▽ More It is shown that over-parameterized neural networks can achieve minimax optimal rates of convergence (up to logarithmic factors) for learning functions from certain smooth function classes, if the weights are suitably constrained or regularized. Specifically, we consider the nonparametric regression of estimating an unknown $d$-variate function by using shallow ReLU neural networks. It is assumed that the regression function is from the Hölder space with smoothness $α<(d+3)/2$ or a variation space corresponding to shallow neural networks, which can be viewed as an infinitely wide neural network. In this setting, we prove that least squares estimators based on shallow neural networks with certain norm constraints on the weights are minimax optimal, if the network width is sufficiently large. As a byproduct, we derive a new size-independent bound for the local Rademacher complexity of shallow ReLU neural networks, which may be of independent interest. △ Less

Submitted 15 May, 2024; v1 submitted 14 June, 2023; originally announced June 2023.

Journal ref: Journal of Machine Learning Research, 25(165):1-35, 2024

arXiv:2305.19640 [pdf, other]

Fine-grained analysis of non-parametric estimation for pairwise learning

Authors: Junyu Zhou, Shuo Huang, Han Feng, Puyu Wang, Ding-Xuan Zhou

Abstract: In this paper, we are concerned with the generalization performance of non-parametric estimation for pairwise learning. Most of the existing work requires the hypothesis space to be convex or a VC-class, and the loss to be convex. However, these restrictive assumptions limit the applicability of the results in studying many popular methods, especially kernel methods and neural networks. We signifi… ▽ More In this paper, we are concerned with the generalization performance of non-parametric estimation for pairwise learning. Most of the existing work requires the hypothesis space to be convex or a VC-class, and the loss to be convex. However, these restrictive assumptions limit the applicability of the results in studying many popular methods, especially kernel methods and neural networks. We significantly relax these restrictive assumptions and establish a sharp oracle inequality of the empirical minimizer with a general hypothesis space for the Lipschitz continuous pairwise losses. Our results can be used to handle a wide range of pairwise learning problems including ranking, AUC maximization, pairwise regression, and metric and similarity learning. As an application, we apply our general results to study pairwise least squares regression and derive an excess generalization bound that matches the minimax lower bound for pointwise least squares regression up to a logrithmic term. The key novelty here is to construct a structured deep ReLU neural network as an approximation of the true predictor and design the targeted hypothesis space consisting of the structured networks with controllable complexity. This successful application demonstrates that the obtained general results indeed help us to explore the generalization performance on a variety of problems that cannot be handled by existing approaches. △ Less

Submitted 21 June, 2024; v1 submitted 31 May, 2023; originally announced May 2023.

Comments: 30 pages, 1 figure

arXiv:2305.17126 [pdf, other]

Large Language Models as Tool Makers

Authors: Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, Denny Zhou

Abstract: Recent research has highlighted the potential of large language models (LLMs) to improve their problem-solving capabilities with the aid of suitable external tools. In our work, we further advance this concept by introducing a closed-loop framework, referred to as LLMs A s Tool Makers (LATM), where LLMs create their own reusable tools for problem-solving. Our approach consists of two phases: 1) to… ▽ More Recent research has highlighted the potential of large language models (LLMs) to improve their problem-solving capabilities with the aid of suitable external tools. In our work, we further advance this concept by introducing a closed-loop framework, referred to as LLMs A s Tool Makers (LATM), where LLMs create their own reusable tools for problem-solving. Our approach consists of two phases: 1) tool making: an LLM acts as the tool maker that crafts tools for a set of tasks. 2) tool using: another LLM acts as the tool user, which applies the tool built by the tool maker for problem-solving. On the problem-solving server side, tool-making enables continual tool generation and caching as new requests emerge. This framework enables subsequent requests to access cached tools via their corresponding APIs, enhancing the efficiency of task resolution. Recognizing that tool-making requires more sophisticated capabilities, we assign this task to a powerful, albeit resource-intensive, model. Conversely, the simpler tool-using phase is delegated to a lightweight model. This strategic division of labor allows the once-off cost of tool-making to be spread over multiple instances of tool-using, significantly reducing average costs while maintaining strong performance. Furthermore, our method offers a functional cache through the caching and reuse of tools, which stores the functionality of a class of requests instead of the natural language responses from LLMs, thus extending the applicability of the conventional cache mechanism. We evaluate our approach across various complex reasoning tasks, including Big-Bench tasks. With GPT-4 as the tool maker and GPT-3.5 as the tool user, LATM demonstrates performance equivalent to using GPT-4 for both roles, but with a significantly reduced inference cost. △ Less

Submitted 10 March, 2024; v1 submitted 26 May, 2023; originally announced May 2023.

Comments: Code available at https://github.com/ctlllll/LLM-ToolMaker

arXiv:2305.16891 [pdf, other]

Generalization Guarantees of Gradient Descent for Multi-Layer Neural Networks

Authors: Puyu Wang, Yunwen Lei, Di Wang, Yiming Ying, Ding-Xuan Zhou

Abstract: Recently, significant progress has been made in understanding the generalization of neural networks (NNs) trained by gradient descent (GD) using the algorithmic stability approach. However, most of the existing research has focused on one-hidden-layer NNs and has not addressed the impact of different network scaling parameters. In this paper, we greatly extend the previous work \cite{lei2022stabil… ▽ More Recently, significant progress has been made in understanding the generalization of neural networks (NNs) trained by gradient descent (GD) using the algorithmic stability approach. However, most of the existing research has focused on one-hidden-layer NNs and has not addressed the impact of different network scaling parameters. In this paper, we greatly extend the previous work \cite{lei2022stability,richards2021stability} by conducting a comprehensive stability and generalization analysis of GD for multi-layer NNs. For two-layer NNs, our results are established under general network scaling parameters, relaxing previous conditions. In the case of three-layer NNs, our technical contribution lies in demonstrating its nearly co-coercive property by utilizing a novel induction strategy that thoroughly explores the effects of over-parameterization. As a direct application of our general findings, we derive the excess risk rate of $O(1/\sqrt{n})$ for GD algorithms in both two-layer and three-layer NNs. This sheds light on sufficient or necessary conditions for under-parameterized and over-parameterized NNs trained by GD to attain the desired risk rate of $O(1/\sqrt{n})$. Moreover, we demonstrate that as the scaling parameter increases or the network complexity decreases, less over-parameterization is required for GD to achieve the desired error rates. Additionally, under a low-noise condition, we obtain a fast risk rate of $O(1/n)$ for GD in both two-layer and three-layer NNs. △ Less

Submitted 29 September, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

Comments: 38 pages, 2 figures

arXiv:2305.11965 [pdf, other]

Not All Semantics are Created Equal: Contrastive Self-supervised Learning with Automatic Temperature Individualization

Authors: Zi-Hao Qiu, Quanqi Hu, Zhuoning Yuan, Denny Zhou, Lijun Zhang, Tianbao Yang

Abstract: In this paper, we aim to optimize a contrastive loss with individualized temperatures in a principled and systematic manner for self-supervised learning. The common practice of using a global temperature parameter $τ$ ignores the fact that ``not all semantics are created equal", meaning that different anchor data may have different numbers of samples with similar semantics, especially when data ex… ▽ More In this paper, we aim to optimize a contrastive loss with individualized temperatures in a principled and systematic manner for self-supervised learning. The common practice of using a global temperature parameter $τ$ ignores the fact that ``not all semantics are created equal", meaning that different anchor data may have different numbers of samples with similar semantics, especially when data exhibits long-tails. First, we propose a new robust contrastive loss inspired by distributionally robust optimization (DRO), providing us an intuition about the effect of $τ$ and a mechanism for automatic temperature individualization. Then, we propose an efficient stochastic algorithm for optimizing the robust contrastive loss with a provable convergence guarantee without using large mini-batch sizes. Theoretical and experimental results show that our algorithm automatically learns a suitable $τ$ for each sample. Specifically, samples with frequent semantics use large temperatures to keep local semantic structures, while samples with rare semantics use small temperatures to induce more separable features. Our method not only outperforms prior strong baselines (e.g., SimCLR, CLIP) on unimodal and bimodal datasets with larger improvements on imbalanced data but also is less sensitive to hyper-parameters. To our best knowledge, this is the first methodical approach to optimizing a contrastive loss with individualized temperatures. △ Less

Submitted 19 May, 2023; originally announced May 2023.

Comments: 33 pages, 11 figures, accepted by ICML2023

arXiv:2305.07408 [pdf, other]

Distributed Gradient Descent for Functional Learning

Authors: Zhan Yu, Jun Fan, Zhongjie Shi, Ding-Xuan Zhou

Abstract: In recent years, different types of distributed and parallel learning schemes have received increasing attention for their strong advantages in handling large-scale data information. In the information era, to face the big data challenges {that} stem from functional data analysis very recently, we propose a novel distributed gradient descent functional learning (DGDFL) algorithm to tackle function… ▽ More In recent years, different types of distributed and parallel learning schemes have received increasing attention for their strong advantages in handling large-scale data information. In the information era, to face the big data challenges {that} stem from functional data analysis very recently, we propose a novel distributed gradient descent functional learning (DGDFL) algorithm to tackle functional data across numerous local machines (processors) in the framework of reproducing kernel Hilbert space. Based on integral operator approaches, we provide the first theoretical understanding of the DGDFL algorithm in many different aspects of the literature. On the way of understanding DGDFL, firstly, a data-based gradient descent functional learning (GDFL) algorithm associated with a single-machine model is proposed and comprehensively studied. Under mild conditions, confidence-based optimal learning rates of DGDFL are obtained without the saturation boundary on the regularity index suffered in previous works in functional regression. We further provide a semi-supervised DGDFL approach to weaken the restriction on the maximal number of local machines to ensure optimal rates. To our best knowledge, the DGDFL provides the first divide-and-conquer iterative training approach to functional learning based on data samples of intrinsically infinite-dimensional random functions (functional covariates) and enriches the methodologies for functional data analysis. △ Less

Submitted 21 July, 2024; v1 submitted 12 May, 2023; originally announced May 2023.

Comments: 48 pages

arXiv:2304.04443 [pdf, other]

Approximation of Nonlinear Functionals Using Deep ReLU Networks

Authors: Linhao Song, Jun Fan, Di-Rong Chen, Ding-Xuan Zhou

Abstract: In recent years, functional neural networks have been proposed and studied in order to approximate nonlinear continuous functionals defined on $L^p([-1, 1]^s)$ for integers $s\ge1$ and $1\le p<\infty$. However, their theoretical properties are largely unknown beyond universality of approximation or the existing analysis does not apply to the rectified linear unit (ReLU) activation function. To fil… ▽ More In recent years, functional neural networks have been proposed and studied in order to approximate nonlinear continuous functionals defined on $L^p([-1, 1]^s)$ for integers $s\ge1$ and $1\le p<\infty$. However, their theoretical properties are largely unknown beyond universality of approximation or the existing analysis does not apply to the rectified linear unit (ReLU) activation function. To fill in this void, we investigate here the approximation power of functional deep neural networks associated with the ReLU activation function by constructing a continuous piecewise linear interpolation under a simple triangulation. In addition, we establish rates of approximation of the proposed functional deep ReLU networks under mild regularity conditions. Finally, our study may also shed some light on the understanding of functional data learning algorithms. △ Less

Submitted 10 April, 2023; originally announced April 2023.

arXiv:2304.01561 [pdf, other]

doi 10.1007/s00365-024-09679-z

Optimal rates of approximation by shallow ReLU$^k$ neural networks and applications to nonparametric regression

Authors: Yunfei Yang, Ding-Xuan Zhou

Abstract: We study the approximation capacity of some variation spaces corresponding to shallow ReLU$^k$ neural networks. It is shown that sufficiently smooth functions are contained in these spaces with finite variation norms. For functions with less smoothness, the approximation rates in terms of the variation norm are established. Using these results, we are able to prove the optimal approximation rates… ▽ More We study the approximation capacity of some variation spaces corresponding to shallow ReLU$^k$ neural networks. It is shown that sufficiently smooth functions are contained in these spaces with finite variation norms. For functions with less smoothness, the approximation rates in terms of the variation norm are established. Using these results, we are able to prove the optimal approximation rates in terms of the number of neurons for shallow ReLU$^k$ neural networks. It is also shown how these results can be used to derive approximation bounds for deep neural networks and convolutional neural networks (CNNs). As applications, we study convergence rates for nonparametric regression using three ReLU neural network models: shallow neural network, over-parameterized neural network, and CNN. In particular, we show that shallow neural networks can achieve the minimax optimal rates for learning Hölder functions, which complements recent results for deep neural networks. It is also proven that over-parameterized (deep or shallow) neural networks can achieve nearly optimal rates for nonparametric regression. △ Less

Submitted 8 January, 2024; v1 submitted 4 April, 2023; originally announced April 2023.

Comments: Version 3 improves some approximation bounds by using recent results from arXiv:2307.15285

arXiv:2302.10371 [pdf, other]

Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency

Authors: Heyang Zhao, Jiafan He, Dongruo Zhou, Tong Zhang, Quanquan Gu

Abstract: Recently, several studies (Zhou et al., 2021a; Zhang et al., 2021b; Kim et al., 2021; Zhou and Gu, 2022) have provided variance-dependent regret bounds for linear contextual bandits, which interpolates the regret for the worst-case regime and the deterministic reward regime. However, these algorithms are either computationally intractable or unable to handle unknown variance of the noise. In this… ▽ More Recently, several studies (Zhou et al., 2021a; Zhang et al., 2021b; Kim et al., 2021; Zhou and Gu, 2022) have provided variance-dependent regret bounds for linear contextual bandits, which interpolates the regret for the worst-case regime and the deterministic reward regime. However, these algorithms are either computationally intractable or unable to handle unknown variance of the noise. In this paper, we present a novel solution to this open problem by proposing the first computationally efficient algorithm for linear bandits with heteroscedastic noise. Our algorithm is adaptive to the unknown variance of noise and achieves an $\tilde{O}(d \sqrt{\sum_{k = 1}^K σ_k^2} + d)$ regret, where $σ_k^2$ is the variance of the noise at the round $k$, $d$ is the dimension of the contexts and $K$ is the total number of rounds. Our results are based on an adaptive variance-aware confidence set enabled by a new Freedman-type concentration inequality for self-normalized martingales and a multi-layer structure to stratify the context vectors into different layers with different uniform upper bounds on the uncertainty. Furthermore, our approach can be extended to linear mixture Markov decision processes (MDPs) in reinforcement learning. We propose a variance-adaptive algorithm for linear mixture MDPs, which achieves a problem-dependent horizon-free regret bound that can gracefully reduce to a nearly constant regret for deterministic MDPs. Unlike existing nearly minimax optimal algorithms for linear mixture MDPs, our algorithm does not require explicit variance estimation of the transitional probabilities or the use of high-order moment estimators to attain horizon-free regret. We believe the techniques developed in this paper can have independent value for general online decision making problems. △ Less

Submitted 20 February, 2023; originally announced February 2023.

Comments: 43 pages, 2 tables

arXiv:2212.06132 [pdf, ps, other]

Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes

Authors: Jiafan He, Heyang Zhao, Dongruo Zhou, Quanquan Gu

Abstract: We study reinforcement learning (RL) with linear function approximation. For episodic time-inhomogeneous linear Markov decision processes (linear MDPs) whose transition probability can be parameterized as a linear function of a given feature mapping, we propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde O(d\sqrt{H^3K})$, where $d$ is the d… ▽ More We study reinforcement learning (RL) with linear function approximation. For episodic time-inhomogeneous linear Markov decision processes (linear MDPs) whose transition probability can be parameterized as a linear function of a given feature mapping, we propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde O(d\sqrt{H^3K})$, where $d$ is the dimension of the feature mapping, $H$ is the planning horizon, and $K$ is the number of episodes. Our algorithm is based on a weighted linear regression scheme with a carefully designed weight, which depends on a new variance estimator that (1) directly estimates the variance of the optimal value function, (2) monotonically decreases with respect to the number of episodes to ensure a better estimation accuracy, and (3) uses a rare-switching policy to update the value function estimator to control the complexity of the estimated value function class. Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest. △ Less

Submitted 3 November, 2023; v1 submitted 12 December, 2022; originally announced December 2022.

Comments: 33 pages, 1 table. In ICML 2023

arXiv:2210.10643 [pdf, other]

Towards Accurate Subgraph Similarity Computation via Neural Graph Pruning

Authors: Linfeng Liu, Xu Han, Dawei Zhou, Li-Ping Liu

Abstract: Subgraph similarity search, one of the core problems in graph search, concerns whether a target graph approximately contains a query graph. The problem is recently touched by neural methods. However, current neural methods do not consider pruning the target graph, though pruning is critically important in traditional calculations of subgraph similarities. One obstacle to applying pruning in neural… ▽ More Subgraph similarity search, one of the core problems in graph search, concerns whether a target graph approximately contains a query graph. The problem is recently touched by neural methods. However, current neural methods do not consider pruning the target graph, though pruning is critically important in traditional calculations of subgraph similarities. One obstacle to applying pruning in neural methods is {the discrete property of pruning}. In this work, we convert graph pruning to a problem of node relabeling and then relax it to a differentiable problem. Based on this idea, we further design a novel neural network to approximate a type of subgraph distance: the subgraph edit distance (SED). {In particular, we construct the pruning component using a neural structure, and the entire model can be optimized end-to-end.} In the design of the model, we propose an attention mechanism to leverage the information about the query graph and guide the pruning of the target graph. Moreover, we develop a multi-head pruning strategy such that the model can better explore multiple ways of pruning the target graph. The proposed model establishes new state-of-the-art results across seven benchmark datasets. Extensive analysis of the model indicates that the proposed model can reasonably prune the target graph for SED computation. The implementation of our algorithm is released at our Github repo: https://github.com/tufts-ml/Prune4SED. △ Less

Submitted 19 October, 2022; originally announced October 2022.

Journal ref: Transactions on Machine Learning Research (TMLR) October 2022

arXiv:2209.13762 [pdf, other]

Consensus Knowledge Graph Learning via Multi-view Sparse Low Rank Block Model

Authors: Tianxi Cai, Dong Xia, Luwan Zhang, Doudou Zhou

Abstract: Network analysis has been a powerful tool to unveil relationships and interactions among a large number of objects. Yet its effectiveness in accurately identifying important node-node interactions is challenged by the rapidly growing network size, with data being collected at an unprecedented granularity and scale. Common wisdom to overcome such high dimensionality is collapsing nodes into smaller… ▽ More Network analysis has been a powerful tool to unveil relationships and interactions among a large number of objects. Yet its effectiveness in accurately identifying important node-node interactions is challenged by the rapidly growing network size, with data being collected at an unprecedented granularity and scale. Common wisdom to overcome such high dimensionality is collapsing nodes into smaller groups and conducting connectivity analysis on the group level. Dividing efforts into two phases inevitably opens a gap in consistency and drives down efficiency. Consensus learning emerges as a new normal for common knowledge discovery with multiple data sources available. In this paper, we propose a unified multi-view sparse low-rank block model (msLBM) framework, which enables simultaneous grouping and connectivity analysis by combining multiple data sources. The msLBM framework efficiently represents overlapping information across large scale concepts and accommodates different types of heterogeneity across sources. Both features are desirable when analyzing high dimensional electronic health record (EHR) datasets from multiple health systems. An estimating procedure based on the alternating minimization algorithm is proposed. Our theoretical results demonstrate that a consensus knowledge graph can be more accurately learned by leveraging multi-source datasets, and statistically optimal rates can be achieved under mild conditions. Applications to the real world EHR data suggest that our proposed msLBM algorithm can more reliably reveal network structure among clinical concepts by effectively combining summary level EHR data from multiple health systems. △ Less

Submitted 4 October, 2024; v1 submitted 27 September, 2022; originally announced September 2022.

arXiv:2209.08005 [pdf, ps, other]

Stability and Generalization for Markov Chain Stochastic Gradient Methods

Authors: Puyu Wang, Yunwen Lei, Yiming Ying, Ding-Xuan Zhou

Abstract: Recently there is a large amount of work devoted to the study of Markov chain stochastic gradient methods (MC-SGMs) which mainly focus on their convergence analysis for solving minimization problems. In this paper, we provide a comprehensive generalization analysis of MC-SGMs for both minimization and minimax problems through the lens of algorithmic stability in the framework of statistical learni… ▽ More Recently there is a large amount of work devoted to the study of Markov chain stochastic gradient methods (MC-SGMs) which mainly focus on their convergence analysis for solving minimization problems. In this paper, we provide a comprehensive generalization analysis of MC-SGMs for both minimization and minimax problems through the lens of algorithmic stability in the framework of statistical learning theory. For empirical risk minimization (ERM) problems, we establish the optimal excess population risk bounds for both smooth and non-smooth cases by introducing on-average argument stability. For minimax problems, we develop a quantitative connection between on-average argument stability and generalization error which extends the existing results for uniform stability \cite{lei2021stability}. We further develop the first nearly optimal convergence rates for convex-concave problems both in expectation and with high probability, which, combined with our stability results, show that the optimal generalization bounds can be attained for both smooth and non-smooth cases. To the best of our knowledge, this is the first generalization analysis of SGMs when the gradients are sampled from a Markov process. △ Less

Submitted 16 September, 2022; originally announced September 2022.

arXiv:2209.04188 [pdf, ps, other]

Differentially Private Stochastic Gradient Descent with Low-Noise

Authors: Puyu Wang, Yunwen Lei, Yiming Ying, Ding-Xuan Zhou

Abstract: Modern machine learning algorithms aim to extract fine-grained information from data to provide accurate predictions, which often conflicts with the goal of privacy protection. This paper addresses the practical and theoretical importance of developing privacy-preserving machine learning algorithms that ensure good performance while preserving privacy. In this paper, we focus on the privacy and ut… ▽ More Modern machine learning algorithms aim to extract fine-grained information from data to provide accurate predictions, which often conflicts with the goal of privacy protection. This paper addresses the practical and theoretical importance of developing privacy-preserving machine learning algorithms that ensure good performance while preserving privacy. In this paper, we focus on the privacy and utility (measured by excess risk bounds) performances of differentially private stochastic gradient descent (SGD) algorithms in the setting of stochastic convex optimization. Specifically, we examine the pointwise problem in the low-noise setting for which we derive sharper excess risk bounds for the differentially private SGD algorithm. In the pairwise learning setting, we propose a simple differentially private SGD algorithm based on gradient perturbation. Furthermore, we develop novel utility bounds for the proposed algorithm, proving that it achieves optimal excess risk rates even for non-smooth losses. Notably, we establish fast learning rates for privacy-preserving pairwise learning under the low-noise condition, which is the first of its kind. △ Less

Submitted 14 July, 2023; v1 submitted 9 September, 2022; originally announced September 2022.

arXiv:2208.06972 [pdf, other]

Is the NFL's franchise tag fair to players?

Authors: Darwin Zhou

Abstract: There has been a consistent criticism over the past decade of the NFL franchise tag's monetary limitations due to its biased institutions in favor of the team rather than the player. But the question whether the NFL's franchise tag is fair or unfair to players has never been systematically studied. In this paper, I investigate the effects of NFL players' contract extensions when on a franchise tag… ▽ More There has been a consistent criticism over the past decade of the NFL franchise tag's monetary limitations due to its biased institutions in favor of the team rather than the player. But the question whether the NFL's franchise tag is fair or unfair to players has never been systematically studied. In this paper, I investigate the effects of NFL players' contract extensions when on a franchise tag compared to when they are not and analyze them through statistical and economic lens. Through my research, I find that indeed the current franchise tag designation is unfair to players when it comes to contract extension. I then propose a solution to remedy this unfairness, that is, removing the opportunity to franchise tag players for multiple years, and adding an option for the player to either test free agency but receive zero pay until they settle on a contract (the team can also match the offer) or sign the franchise tag, to provide more flexibility for the player and the team. △ Less

Submitted 15 August, 2022; v1 submitted 14 August, 2022; originally announced August 2022.

arXiv:2208.06528 [pdf, ps, other]

Dynamic Bayesian Learning for Spatiotemporal Mechanistic Models

Authors: Sudipto Banerjee, Xiang Chen, Ian Frankenburg, Daniel Zhou

Abstract: We develop an approach for Bayesian learning of spatiotemporal dynamical mechanistic models. Such learning consists of statistical emulation of the mechanistic system that can efficiently interpolate the output of the system from arbitrary inputs. The emulated learner can then be used to train the system from noisy data achieved by melding information from observed data with the emulated mechanist… ▽ More We develop an approach for Bayesian learning of spatiotemporal dynamical mechanistic models. Such learning consists of statistical emulation of the mechanistic system that can efficiently interpolate the output of the system from arbitrary inputs. The emulated learner can then be used to train the system from noisy data achieved by melding information from observed data with the emulated mechanistic system. This joint melding of mechanistic systems employ hierarchical state-space models with Gaussian process regression. Assuming the dynamical system is controlled by a finite collection of inputs, Gaussian process regression learns the effect of these parameters through a number of training runs, driving the stochastic innovations of the spatiotemporal state-space component. This enables efficient modeling of the dynamics over space and time. This article details exact inference with analytically accessible posterior distributions in hierarchical matrix-variate Normal and Wishart models in designing the emulator. This step obviates expensive iterative algorithms such as Markov chain Monte Carlo or variational approximations. We also show how emulation is applicable to large-scale emulation by designing a dynamic Bayesian transfer learning framework. Inference on mechanistic model parameters proceeds using Markov chain Monte Carlo as a post-emulation step using the emulator as a regression component. We demonstrate this framework through solving inverse problems arising in the analysis of ordinary and partial nonlinear differential equations and, in addition, to a black-box computer model generating spatiotemporal dynamics across a graphical model. △ Less

Submitted 9 July, 2025; v1 submitted 12 August, 2022; originally announced August 2022.

arXiv:2208.05363 [pdf, ps, other]

Learning Two-Player Mixture Markov Games: Kernel Function Approximation and Correlated Equilibrium

Authors: Chris Junchi Li, Dongruo Zhou, Quanquan Gu, Michael I. Jordan

Abstract: We consider learning Nash equilibria in two-player zero-sum Markov Games with nonlinear function approximation, where the action-value function is approximated by a function in a Reproducing Kernel Hilbert Space (RKHS). The key challenge is how to do exploration in the high-dimensional function space. We propose a novel online learning algorithm to find a Nash equilibrium by minimizing the duality… ▽ More We consider learning Nash equilibria in two-player zero-sum Markov Games with nonlinear function approximation, where the action-value function is approximated by a function in a Reproducing Kernel Hilbert Space (RKHS). The key challenge is how to do exploration in the high-dimensional function space. We propose a novel online learning algorithm to find a Nash equilibrium by minimizing the duality gap. At the core of our algorithms are upper and lower confidence bounds that are derived based on the principle of optimism in the face of uncertainty. We prove that our algorithm is able to attain an $O(\sqrt{T})$ regret with polynomial computational complexity, under very mild assumptions on the reward function and the underlying dynamic of the Markov Games. We also propose several extensions of our algorithm, including an algorithm with Bernstein-type bonus that can achieve a tighter regret bound, and another algorithm for model misspecification that can be applied to neural function approximation. △ Less

Submitted 10 August, 2022; originally announced August 2022.

Comments: 42 pages

arXiv:2208.05134 [pdf, other]

Doubly Robust Augmented Model Accuracy Transfer Inference with High Dimensional Features

Authors: Doudou Zhou, Molei Liu, Mengyan Li, Tianxi Cai

Abstract: Due to label scarcity and covariate shift happening frequently in real-world studies, transfer learning has become an essential technique to train models generalizable to some target populations using existing labeled source data. Most existing transfer learning research has been focused on model estimation, while there is a paucity of literature on transfer inference for model accuracy despite it… ▽ More Due to label scarcity and covariate shift happening frequently in real-world studies, transfer learning has become an essential technique to train models generalizable to some target populations using existing labeled source data. Most existing transfer learning research has been focused on model estimation, while there is a paucity of literature on transfer inference for model accuracy despite its importance. We propose a novel $\mathbf{D}$oubly $\mathbf{R}$obust $\mathbf{A}$ugmented $\mathbf{M}$odel $\mathbf{A}$ccuracy $\mathbf{T}$ransfer $\mathbf{I}$nferen$\mathbf{C}$e (DRAMATIC) method for point and interval estimation of commonly used classification performance measures in an unlabeled target population using labeled source data. Specifically, DRAMATIC derives and evaluates the risk model for a binary response $Y$ against some low dimensional predictors $\mathbf{A}$ on the target population, leveraging $Y$ from source data only and high dimensional adjustment features $\mathbf{X}$ from both the source and target data. The proposed estimators are doubly robust in the sense that they are $n^{1/2}$ consistent when at least one model is correctly specified and certain model sparsity assumptions hold. Simulation results demonstrate that the point estimation have negligible bias and the confidence intervals derived by DRAMATIC attain satisfactory empirical coverage levels. We further illustrate the utility of our method to transfer the genetic risk prediction model and its accuracy evaluation for type II diabetes across two patient cohorts in Mass General Brigham (MGB) collected using different sampling mechanisms and at different time points. △ Less

Submitted 8 November, 2022; v1 submitted 10 August, 2022; originally announced August 2022.

arXiv:2206.05581 [pdf, other]

Federated Offline Reinforcement Learning

Authors: Doudou Zhou, Yufeng Zhang, Aaron Sonabend-W, Zhaoran Wang, Junwei Lu, Tianxi Cai

Abstract: Evidence-based or data-driven dynamic treatment regimes are essential for personalized medicine, which can benefit from offline reinforcement learning (RL). Although massive healthcare data are available across medical institutions, they are prohibited from sharing due to privacy constraints. Besides, heterogeneity exists in different sites. As a result, federated offline RL algorithms are necessa… ▽ More Evidence-based or data-driven dynamic treatment regimes are essential for personalized medicine, which can benefit from offline reinforcement learning (RL). Although massive healthcare data are available across medical institutions, they are prohibited from sharing due to privacy constraints. Besides, heterogeneity exists in different sites. As a result, federated offline RL algorithms are necessary and promising to deal with the problems. In this paper, we propose a multi-site Markov decision process model that allows for both homogeneous and heterogeneous effects across sites. The proposed model makes the analysis of the site-level features possible. We design the first federated policy optimization algorithm for offline RL with sample complexity. The proposed algorithm is communication-efficient, which requires only a single round of communication interaction by exchanging summary statistics. We give a theoretical guarantee for the proposed algorithm, where the suboptimality for the learned policies is comparable to the rate as if data is not distributed. Extensive simulations demonstrate the effectiveness of the proposed algorithm. The method is applied to a sepsis dataset in multiple sites to illustrate its use in clinical settings. △ Less

Submitted 27 January, 2024; v1 submitted 11 June, 2022; originally announced June 2022.

Showing 1–50 of 149 results for author: Zhou, D