-
Transfer Learning for Matrix Completion
Authors:
Dali Liu,
Haolei Weng
Abstract:
In this paper, we explore knowledge transfer in the matrix completion setting, which aims to enhance the estimation of a low-rank target matrix with auxiliary data available. We propose a transfer learning procedure given prior information on which source datasets are favorable. We study its convergence rates and prove its minimax optimality. Our analysis reveals that with the source matrices close enough to the target matrix, our method outperforms the traditional method that uses only the target data. In particular, we leverage the advanced sharp concentration inequalities introduced in \cite{brailovskaya2024universality} to eliminate a logarithmic factor in the convergence rate, which is crucial for proving the minimax optimality. When the relevance of source datasets is unknown, we develop an efficient detection procedure to identify informative sources and establish its selection consistency. Simulations and real data analysis are conducted to support the validity of our methodology.
Submitted 2 July, 2025;
originally announced July 2025.
-
Private Geometric Median in Nearly-Linear Time
Authors:
Syamantak Kumar,
Daogao Liu,
Kevin Tian,
Chutong Yang
Abstract:
Estimating the geometric median of a dataset is a robust counterpart to mean estimation, and is a fundamental problem in computational geometry. Recently, [HSU24] gave an $(\varepsilon, δ)$-differentially private algorithm obtaining an $α$-multiplicative approximation to the geometric median objective, $\frac 1 n \sum_{i \in [n]} \|\cdot - \mathbf{x}_i\|$, given a dataset $\mathcal{D} := \{\mathbf{x}_i\}_{i \in [n]} \subset \mathbb{R}^d$. Their algorithm requires $n \gtrsim \sqrt d \cdot \frac 1 {α\varepsilon}$ samples, which they prove is information-theoretically optimal. This result is surprising because its error scales with the \emph{effective radius} of $\mathcal{D}$ (i.e., of a ball capturing most points), rather than the worst-case radius. We give an improved algorithm that obtains the same approximation quality, also using $n \gtrsim \sqrt d \cdot \frac 1 {αε}$ samples, but in time $\widetilde{O}(nd + \frac d {α^2})$. Our runtime is nearly-linear, plus the cost of the cheapest non-private first-order method due to [CLM+16]. To achieve our results, we use subsampling and geometric aggregation tools inspired by FriendlyCore [TCK+22] to speed up the "warm start" component of the [HSU24] algorithm, combined with a careful custom analysis of DP-SGD's sensitivity for the geometric median objective.
Submitted 26 May, 2025;
originally announced May 2025.
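As a point of reference for the objective in this abstract, the following is a minimal, non-private sketch of the classical Weiszfeld iteration for minimizing $\frac 1 n \sum_{i \in [n]} \|\cdot - \mathbf{x}_i\|$. It is not the algorithm of the paper or of [HSU24]; the function names, the iteration count, and the `eps` clamp are illustrative assumptions.

```python
import math


def geometric_median_objective(point, data):
    """Average Euclidean distance from `point` to the dataset."""
    return sum(math.dist(point, x) for x in data) / len(data)


def weiszfeld(data, iters=200, eps=1e-12):
    """Non-private Weiszfeld iteration (illustrative only, no DP guarantee).

    `iters` and `eps` are hypothetical defaults chosen for this sketch.
    """
    dim = len(data[0])
    # Start from the coordinate-wise mean.
    y = [sum(x[j] for x in data) / len(data) for j in range(dim)]
    for _ in range(iters):
        # Each point is weighted by the inverse of its distance to the
        # current iterate; the clamp avoids division by zero.
        weights = [1.0 / max(math.dist(y, x), eps) for x in data]
        total = sum(weights)
        y = [sum(w * x[j] for w, x in zip(weights, data)) / total
             for j in range(dim)]
    return y
```

On a dataset with one far outlier, the iterate stays near the bulk of the points while the mean is dragged toward the outlier, which is the robustness property the abstract refers to.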
-
Towards Understanding Multi-Round Large Language Model Reasoning: Approximability, Learnability and Generalizability
Authors:
Chenhui Xu,
Dancheng Liu,
Jiajie Li,
Amir Nassereldine,
Zhaohui Li,
Jinjun Xiong
Abstract:
Recent advancements in cognitive science and multi-round reasoning techniques for Large Language Models (LLMs) suggest that iterative thinking processes improve problem-solving performance in complex tasks. Inspired by this, approaches like Chain-of-Thought, debating, and self-refinement have been applied to auto-regressive LLMs, achieving significant successes in tasks such as mathematical reasoning, commonsense reasoning, and multi-hop question answering. Despite these successes, the theoretical basis for how multi-round reasoning enhances problem-solving abilities remains underexplored. In this work, we investigate the approximation, learnability, and generalization properties of multi-round auto-regressive models. We show that Transformers with finite context windows are universal approximators for steps of Turing-computable functions and can approximate any Turing-computable sequence-to-sequence function through multi-round reasoning. We extend PAC learning to sequence generation and demonstrate that multi-round generation is learnable even when the sequence length exceeds the model's context window. Finally, we examine how generalization error propagates across rounds, and show how the aforementioned approaches can help constrain this error, ensuring outputs stay within an expectation boundary. This work sheds light on the systemic theoretical foundations of multi-round sequence learning and reasoning, emphasizing its role in inference complexity.
Submitted 4 March, 2025;
originally announced March 2025.
-
Linear-Time User-Level DP-SCO via Robust Statistics
Authors:
Badih Ghazi,
Ravi Kumar,
Daogao Liu,
Pasin Manurangsi
Abstract:
User-level differentially private stochastic convex optimization (DP-SCO) has garnered significant attention due to the paramount importance of safeguarding user privacy in modern large-scale machine learning applications. Current methods, such as those based on differentially private stochastic gradient descent (DP-SGD), often struggle with high noise accumulation and suboptimal utility due to the need to privatize every intermediate iterate. In this work, we introduce a novel linear-time algorithm that leverages robust statistics, specifically the median and trimmed mean, to overcome these challenges. Our approach uniquely bounds the sensitivity of all intermediate iterates of SGD with gradient estimation based on robust statistics, thereby significantly reducing the gradient estimation noise for privacy purposes and enhancing the privacy-utility trade-off. By sidestepping the repeated privatization required by previous methods, our algorithm not only achieves an improved theoretical privacy-utility trade-off but also maintains computational efficiency. We complement our algorithm with an information-theoretic lower bound, showing that our upper bound is optimal up to logarithmic factors and the dependence on $ε$. This work sets the stage for more robust and efficient privacy-preserving techniques in machine learning, with implications for future research and application in the field.
Submitted 12 February, 2025;
originally announced February 2025.
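The robust statistics named in this abstract can be illustrated with a coordinate-wise trimmed mean over per-user gradients. This is a non-private sketch under assumed names (`trimmed_mean`, `coordinate_trimmed_mean`, `trim`); it adds no noise and is not the paper's algorithm.

```python
def trimmed_mean(values, trim):
    """Mean after dropping the `trim` smallest and `trim` largest values."""
    if 2 * trim >= len(values):
        raise ValueError("trim removes every value")
    kept = sorted(values)[trim:len(values) - trim]
    return sum(kept) / len(kept)


def coordinate_trimmed_mean(gradients, trim):
    """Apply the trimmed mean to each coordinate of a list of gradients.

    `gradients` is a list of equal-length lists, one per user
    (a hypothetical layout chosen for this sketch).
    """
    dim = len(gradients[0])
    return [trimmed_mean([g[j] for g in gradients], trim) for j in range(dim)]
```

Because the extreme values in each coordinate are discarded, a single user's gradient has limited influence on the estimate, which is the intuition behind bounding sensitivity with robust statistics.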
-
Distributed Primal-Dual Algorithms: Unification, Connections, and Insights
Authors:
Runxiong Wu,
Dong Liu,
Xueqin Wang,
Andi Wang
Abstract:
We study primal-dual algorithms for general empirical risk minimization problems in distributed settings, focusing on two prominent classes of algorithms. The first class is the communication-efficient distributed dual coordinate ascent (CoCoA), derived from the coordinate ascent method for solving the dual problem. The second class is the alternating direction method of multipliers (ADMM), including consensus ADMM, linearized ADMM, and proximal ADMM. We demonstrate that both classes of algorithms can be transformed into a unified update form that involves only primal and dual variables. This discovery reveals key connections between the two classes of algorithms: CoCoA can be interpreted as a special case of proximal ADMM for solving the dual problem, while consensus ADMM is closely related to a proximal ADMM algorithm. This discovery provides the insight that by adjusting the augmented Lagrangian parameter, we can easily enable the ADMM variants to outperform the CoCoA variants. We further explore linearized versions of ADMM and analyze the effects of tuning parameters on these ADMM variants in the distributed setting. Our theoretical findings are supported by extensive simulation studies and real-world data analysis.
Submitted 1 February, 2025;
originally announced February 2025.
-
Synthetic Data Generation for Augmenting Small Samples
Authors:
Dan Liu,
Samer El Kababji,
Nicholas Mitsakakis,
Lisa Pilgram,
Thomas Walters,
Mark Clemons,
Greg Pond,
Alaa El-Hussuna,
Khaled El Emam
Abstract:
Small datasets are common in health research. However, the generalization performance of machine learning models is suboptimal when the training datasets are small. Data augmentation is one solution. Augmentation increases sample size and is seen as a form of regularization that increases the diversity of small datasets, leading models trained on them to perform better on unseen data. We found that augmentation improves prognostic performance for datasets that have fewer observations, smaller baseline AUC, higher-cardinality categorical variables, and more balanced outcome variables. No specific generative model consistently outperformed the others. We developed a decision support model that can be used to inform analysts if augmentation would be useful. For seven small application datasets, augmenting the existing data results in an increase in AUC between 4.31% (AUC from 0.71 to 0.75) and 43.23% (AUC from 0.51 to 0.73), with an average 15.55% relative improvement, demonstrating the nontrivial impact of augmentation on small datasets (p=0.0078). Augmentation AUC was higher than resampling-only AUC (p=0.016). The diversity of augmented datasets was higher than the diversity of resampled datasets (p=0.046).
Submitted 30 January, 2025;
originally announced January 2025.
-
Representational Transfer Learning for Matrix Completion
Authors:
Yong He,
Zeyu Li,
Dong Liu,
Kangxiang Qin,
Jiahui Xie
Abstract:
We propose to transfer representational knowledge from multiple sources to a target noisy matrix completion task by aggregating singular subspace information. Under our representational similarity framework, we first integrate linear representation information by solving a two-way principal component analysis problem based on a properly debiased matrix-valued dataset. After acquiring better column and row representation estimators from the sources, the original high-dimensional target matrix completion problem is then transformed into a low-dimensional linear regression, of which the statistical efficiency is guaranteed. Several extensions, including post-transfer statistical inference and robustness against negative transfer, are also discussed. Finally, extensive simulation results and a number of real data cases are reported to support our claims.
Submitted 9 December, 2024;
originally announced December 2024.
-
Estimating journey time for two-point vehicle re-identification survey with limited observable scope using 2-dimensional truncated distributions
Authors:
Diyi Liu,
Yangsong Gu,
Lee D. Han
Abstract:
In transportation, Weigh-in-Motion (WIM) stations, Electronic Toll Collection (ETC) systems, and Closed-circuit Television (CCTV) are widely deployed to collect data at different locations. Vehicle re-identification, which matches the same vehicle at different locations, is helpful in understanding long-distance journey patterns. In this paper, the potential hazards of ignoring survivorship bias effects are first identified and analyzed using a truncated distribution over a 2-dimensional time-time domain. Given journey time modeled as an Exponential or Weibull distribution, Maximum Likelihood Estimation (MLE), Fisher Information (F.I.), and Bootstrap methods are formulated to estimate the parameters of interest and their confidence intervals. Besides formulating journey time distributions, an automated framework for querying the observable time-time scope is proposed. For complex distributions (e.g., three-parameter Weibull), distributions are modeled in PyTorch to automatically find first and second derivatives and the estimated results. Three experiments are designed to demonstrate the effectiveness of the proposed method. In conclusion, the paper describes a distinctive aspect of understanding and analyzing traffic status. Although survivorship bias effects have long gone unrecognized and ignored, by accurately describing travel time over the time-time domain, the proposed approach has potential in travel time reliability analysis, understanding logistics systems, modeling and predicting product lifespans, etc.
Submitted 4 November, 2024;
originally announced November 2024.
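To make the survivorship-bias point concrete, here is a simplified one-dimensional sketch: journey times are Exponential with rate `rate`, but only journeys no longer than an observation window are recorded. The truncated MLE solves $\bar t = 1/\lambda - T/(e^{\lambda T} - 1)$ by bisection. The paper works over a 2-dimensional time-time domain and uses PyTorch for harder cases; the single-window setup, function names, and defaults below are assumptions made for illustration.

```python
import math
import random


def truncated_mean(rate, window):
    """Mean of an Exponential(rate) variable conditioned on being <= window."""
    return 1.0 / rate + window * math.exp(-rate * window) / math.expm1(-rate * window)


def fit_truncated_exponential(times, window, steps=200):
    """MLE of the rate when only journeys with time <= window are observed."""
    sample_mean = sum(times) / len(times)
    if not 0.0 < sample_mean < window / 2.0:
        raise ValueError("sample mean must lie in (0, window / 2)")
    lo, hi = 1e-6, 1.0
    # The truncated mean decreases in the rate, so grow `hi` until it brackets.
    while truncated_mean(hi, window) > sample_mean:
        hi *= 2.0
    for _ in range(steps):  # plain bisection on the score equation
        mid = 0.5 * (lo + hi)
        if truncated_mean(mid, window) > sample_mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def sample_truncated_exponential(rate, window, n, seed=0):
    """Inverse-CDF sampling of journeys that finish inside the window."""
    rng = random.Random(seed)
    scale = -math.expm1(-rate * window)  # P(journey time <= window)
    return [-math.log1p(-rng.random() * scale) / rate for _ in range(n)]
```

The naive estimate `len(times) / sum(times)` ignores truncation and always exceeds the truncated MLE, because long journeys are missing from the sample; that gap is the bias the abstract warns about.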
-
Gaussian Mixture Vector Quantization with Aggregated Categorical Posterior
Authors:
Mingyuan Yan,
Jiawei Wu,
Rushi Shah,
Dianbo Liu
Abstract:
Vector quantization is a widely used method to map continuous representations to a discrete space and has important applications in tokenization for generative models, bottlenecking information, and many other tasks in machine learning. The Vector Quantized Variational Autoencoder (VQ-VAE) is a type of variational autoencoder that uses discrete embeddings as latents. We generalize the technique further, enriching the probabilistic framework with a Gaussian mixture as the underlying generative model. This framework leverages a codebook of latent means and adaptive variances to capture complex data distributions. This principled framework avoids various heuristics and strong assumptions that are needed with the VQ-VAE to address training instability and to improve codebook utilization. This approach integrates the benefits of both discrete and continuous representations within a variational Bayesian framework. Furthermore, by introducing the \textit{Aggregated Categorical Posterior Evidence Lower Bound} (ALBO), we offer a principled alternative optimization objective that aligns variational distributions with the generative model. Our experiments demonstrate that GM-VQ improves codebook utilization and reduces information loss without relying on handcrafted heuristics.
Submitted 14 October, 2024;
originally announced October 2024.
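For readers unfamiliar with the baseline operation, the sketch below shows plain hard vector quantization: each continuous vector is mapped to the index of its nearest codebook entry. This is the standard nearest-neighbor lookup, not the Gaussian mixture formulation (GM-VQ) proposed in the paper; the function name and data layout are assumptions.

```python
import math


def quantize(vectors, codebook):
    """Map each vector to the index and value of its nearest codebook entry.

    Plain hard assignment by Euclidean distance (illustrative only).
    """
    indices = [
        min(range(len(codebook)), key=lambda k: math.dist(v, codebook[k]))
        for v in vectors
    ]
    return indices, [codebook[i] for i in indices]
```

For instance, with the codebook `[(0.0, 0.0), (1.0, 1.0)]`, the vector `(0.1, -0.2)` is assigned index 0 and `(0.9, 1.2)` is assigned index 1.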
-
Adaptive Batch Size for Privately Finding Second-Order Stationary Points
Authors:
Daogao Liu,
Kunal Talwar
Abstract:
There is a gap between finding a first-order stationary point (FOSP) and a second-order stationary point (SOSP) under differential privacy constraints, and it remains unclear whether privately finding an SOSP is more challenging than finding an FOSP. Specifically, Ganesh et al. (2023) claimed that an $α$-SOSP can be found with $α=O(\frac{1}{n^{1/3}}+(\frac{\sqrt{d}}{nε})^{3/7})$, where $n$ is the dataset size, $d$ is the dimension, and $ε$ is the differential privacy parameter. However, a recent analysis revealed an issue in their saddle point escape procedure, leading to weaker guarantees. Building on the SpiderBoost algorithm framework, we propose a new approach that uses adaptive batch sizes and incorporates the binary tree mechanism. Our method not only corrects this issue but also improves the results for privately finding an SOSP, achieving $α=O(\frac{1}{n^{1/3}}+(\frac{\sqrt{d}}{nε})^{1/2})$.
This improved bound matches the state-of-the-art for finding an FOSP, suggesting that privately finding an SOSP may be achievable at no additional cost.
Submitted 26 February, 2025; v1 submitted 9 October, 2024;
originally announced October 2024.
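The two rates in this abstract differ only in the exponent on the privacy term $\frac{\sqrt{d}}{nε}$. A small numerical sketch, ignoring the constants hidden by $O(\cdot)$ and using hypothetical function names, shows how the exponent $1/2$ compares with $3/7$ when that term is below one.

```python
import math


def privacy_term(n, d, eps, exponent):
    """(sqrt(d) / (n * eps)) ** exponent, with all constants dropped."""
    return (math.sqrt(d) / (n * eps)) ** exponent


def sosp_rate(n, d, eps, exponent):
    """n^(-1/3) plus the privacy term; constants hidden by O(.) are ignored."""
    return n ** (-1.0 / 3.0) + privacy_term(n, d, eps, exponent)
```

With `n = 10**6`, `d = 100`, and `eps = 1.0`, `sosp_rate(n, d, eps, 1 / 2)` is smaller than `sosp_rate(n, d, eps, 3 / 7)`, since raising a quantity below one to a larger power shrinks it.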
-
Improved Sample Complexity for Private Nonsmooth Nonconvex Optimization
Authors:
Guy Kornowski,
Daogao Liu,
Kunal Talwar
Abstract:
We study differentially private (DP) optimization algorithms for stochastic and empirical objectives which are neither smooth nor convex, and propose methods that return a Goldstein-stationary point with sample complexity bounds that improve on existing works. We start by providing a single-pass $(ε,δ)$-DP algorithm that returns an $(α,β)$-stationary point as long as the dataset is of size $\widetilde{Ω}(\sqrt{d}/αβ^{3}+d/εαβ^{2})$, which is $Ω(\sqrt{d})$ times smaller than the algorithm of Zhang et al. [2024] for this task, where $d$ is the dimension. We then provide a multi-pass polynomial time algorithm which further improves the sample complexity to $\widetilde{Ω}\left(d/β^2+d^{3/4}/εα^{1/2}β^{3/2}\right)$, by designing a sample efficient ERM algorithm, and proving that Goldstein-stationary points generalize from the empirical loss to the population loss.
Submitted 7 June, 2025; v1 submitted 8 October, 2024;
originally announced October 2024.
-
Penalized Principal Component Analysis for Large-dimension Factor Model with Group Pursuit
Authors:
Yong He,
Dong Liu,
Guangming Pan,
Yiming Wang
Abstract:
This paper investigates the intrinsic group structures within the framework of large-dimensional approximate factor models, which portray homogeneous effects of the common factors on the individuals that fall into the same group. To this end, we propose a fusion Penalized Principal Component Analysis (PPCA) method and derive a closed-form solution for the $\ell_2$-norm optimization problem. We also show the asymptotic properties of our proposed PPCA estimates. With the PPCA estimates as an initialization, we identify the unknown group structure by a combination of the agglomerative hierarchical clustering algorithm and an information criterion. Then the factor loadings and factor scores are re-estimated conditional on the identified latent groups. Under some regularity conditions, we establish the consistency of the membership estimators as well as that of the group number estimator derived from the information criterion. Theoretically, we show that the post-clustering estimators for the factor loadings and factor scores with group pursuit achieve efficiency gains compared to the estimators from the conventional PCA method. Thorough numerical studies validate the established theory, and a real financial example illustrates the practical usefulness of the proposed method.
Submitted 15 March, 2025; v1 submitted 27 July, 2024;
originally announced July 2024.
-
Subgroup Identification with Latent Factor Structure
Authors:
Yong He,
Dong Liu,
Fuxin Wang,
Mingjuan Zhang,
Wen-Xin Zhou
Abstract:
Subgroup analysis has garnered increasing attention for its ability to identify meaningful subgroups within heterogeneous populations, thereby enhancing predictive power. However, in many fields such as social science and biology, covariates are often highly correlated due to common factors. This correlation poses significant challenges for subgroup identification, an issue that is often overlooked in existing literature. In this paper, we aim to address this gap in the ``diverging dimension'' regime by proposing a center-augmented subgroup identification method within the Factor Augmented (sparse) Linear Model framework. This method bridges dimension reduction and sparse regression. Our proposed approach is adaptable to the high cross-sectional dependence among covariates and offers computational advantages with a complexity of $O(nK)$, compared to the $O(n^2)$ complexity of the conventional pairwise fusion penalty method in the literature, where $n$ is the sample size and $K$ is the number of subgroups. We also investigate the asymptotic properties of the oracle estimators under conditions on the minimal distance between group centroids. To implement the proposed approach, we introduce a Difference of Convex functions-based Alternating Direction Method of Multipliers (DC-ADMM) algorithm and demonstrate its convergence to a local minimizer in a finite number of steps. We illustrate the superiority of the proposed method through extensive numerical experiments and a real macroeconomic data example. An \texttt{R} package, \texttt{SILFS}, implementing the method is also available on CRAN.
Submitted 17 July, 2024; v1 submitted 30 June, 2024;
originally announced July 2024.
-
Private Online Learning via Lazy Algorithms
Authors:
Hilal Asi,
Tomer Koren,
Daogao Liu,
Kunal Talwar
Abstract:
We study the problem of private online learning, specifically, online prediction from experts (OPE) and online convex optimization (OCO). We propose a new transformation that converts lazy online learning algorithms into private algorithms. We apply our transformation to differentially private OPE and OCO using existing lazy algorithms for these problems. Our final algorithms significantly improve the regret in the high-privacy regime $\varepsilon \ll 1$, obtaining $\sqrt{T \log d} + T^{1/3} \log(d)/\varepsilon^{2/3}$ for DP-OPE and $\sqrt{T} + T^{1/3} \sqrt{d}/\varepsilon^{2/3}$ for DP-OCO. We also complement our results with a lower bound for DP-OPE, showing that these rates are optimal for a natural family of low-switching private algorithms.
Submitted 21 February, 2025; v1 submitted 5 June, 2024;
originally announced June 2024.
-
Private Stochastic Convex Optimization with Heavy Tails: Near-Optimality from Simple Reductions
Authors:
Hilal Asi,
Daogao Liu,
Kevin Tian
Abstract:
We study the problem of differentially private stochastic convex optimization (DP-SCO) with heavy-tailed gradients, where we assume a $k^{\text{th}}$-moment bound on the Lipschitz constants of sample functions rather than a uniform bound. We propose a new reduction-based approach that enables us to obtain the first optimal rates (up to logarithmic factors) in the heavy-tailed setting, achieving error $G_2 \cdot \frac 1 {\sqrt n} + G_k \cdot (\frac{\sqrt d}{nε})^{1 - \frac 1 k}$ under $(ε, δ)$-approximate differential privacy, up to a mild $\textup{polylog}(\frac{1}δ)$ factor, where $G_2^2$ and $G_k^k$ are the $2^{\text{nd}}$ and $k^{\text{th}}$ moment bounds on sample Lipschitz constants, nearly-matching a lower bound of [Lowy and Razaviyayn 2023].
We further give a suite of private algorithms in the heavy-tailed setting which improve upon our basic result under additional assumptions, including an optimal algorithm under a known-Lipschitz constant assumption, a near-linear time algorithm for smooth functions, and an optimal linear time algorithm for smooth generalized linear models.
Submitted 4 June, 2024;
originally announced June 2024.
-
Bypassing Skip-Gram Negative Sampling: Dimension Regularization as a More Efficient Alternative for Graph Embeddings
Authors:
David Liu,
Arjun Seshadri,
Tina Eliassi-Rad,
Johan Ugander
Abstract:
A wide range of graph embedding objectives decompose into two components: one that enforces similarity, attracting the embeddings of nodes that are perceived as similar, and another that enforces dissimilarity, repelling the embeddings of nodes that are perceived as dissimilar. Without repulsion, the embeddings would collapse into trivial solutions. Skip-Gram Negative Sampling (SGNS) is a popular and efficient repulsion approach that prevents collapse by repelling each node from a sample of dissimilar nodes. In this work, we show that when repulsion is most needed and the embeddings approach collapse, SGNS node-wise repulsion is, in the aggregate, an approximate re-centering of the node embedding dimensions. Such dimension operations are more scalable than node operations and produce a simpler geometric interpretation of the repulsion. Our theoretical result establishes dimension regularization as an effective approach to enforcing dissimilarity among node embeddings that is more efficient than skip-gram node contrast. We use this result to propose a flexible algorithm augmentation framework that improves the scalability of any existing algorithm using SGNS. The framework prioritizes node attraction and replaces SGNS with dimension regularization. We instantiate this generic framework for LINE and node2vec and show that the augmented algorithms preserve downstream link-prediction performance while reducing GPU memory usage by up to 33.3% and training time by 23.4%. Moreover, we show that completely removing repulsion (a special case of our augmentation framework) in LINE reduces training time by 70.9% on average, while increasing link prediction performance, especially for graphs that are globally sparse but locally dense. In general, however, repulsion is needed, and dimension regularization provides an efficient alternative to SGNS.
Submitted 2 June, 2025; v1 submitted 30 April, 2024;
originally announced May 2024.
-
Protection of Guizhou Miao Batik Culture Based on Knowledge Graph and Deep Learning
Authors:
Huafeng Quan,
Yiting Li,
Dashuai Liu,
Yue Zhou
Abstract:
Amid globalization, China's cultural heritage is in danger of gradually disappearing. The protection and inheritance of these precious cultural resources has become a critical task. This paper focuses on the Miao batik culture in Guizhou Province, China, and explores the application of knowledge graphs, natural language processing, and deep learning techniques in the promotion and protection of batik culture. We propose a dual-channel mechanism that integrates semantic and visual information, aiming to connect batik pattern features with cultural connotations. First, we use natural language processing techniques to automatically extract batik-related entities and relationships from the literature, and construct and visualize a structured batik pattern knowledge graph. Based on this knowledge graph, users can search by text and learn about the images, meanings, taboos, and other cultural information of specific patterns. Second, for batik pattern classification, we propose an improved ResNet34 model. By embedding average pooling and convolutional operations into the residual blocks and introducing long-range residual connections, the classification performance is enhanced. By inputting pattern images into this model, their subjects can be accurately identified, and then the underlying cultural connotations can be understood. Experimental results show that our model outperforms other mainstream models in evaluation metrics such as accuracy, precision, recall, and F1-score, achieving 99.0%, 99.0%, 98.9%, and 99.0%, respectively. This research provides new ideas for the digital protection of batik culture and demonstrates the great potential of artificial intelligence technology in cultural heritage protection.
Submitted 9 April, 2024;
originally announced April 2024.
-
Q-learning in Dynamic Treatment Regimes with Misclassified Binary Outcome
Authors:
Dan Liu,
Wenqing He
Abstract:
The study of precision medicine involves dynamic treatment regimes (DTRs), which are sequences of treatment decision rules recommended by taking patient-level information as input. The primary goal of the DTR study is to identify an optimal DTR, a sequence of treatment decision rules that leads to the best expected clinical outcome. Statistical methods have been developed in recent years to estimate an optimal DTR, including Q-learning, a regression-based method in the DTR literature. Although there are many studies concerning Q-learning, little attention has been given to settings with noisy data, such as misclassified outcomes. In this paper, we investigate the effect of outcome misclassification on Q-learning and propose a correction method to accommodate the misclassification effect. Simulation studies are conducted to demonstrate the satisfactory performance of the proposed method. We illustrate the proposed method in two examples from the National Health and Nutrition Examination Survey Data I Epidemiologic Follow-up Study and the smoking cessation program.
Submitted 6 April, 2024;
originally announced April 2024.
-
Dynamic Treatment Regimes with Replicated Observations Available for Error-prone Covariates: a Q-learning Approach
Authors:
Dan Liu,
Wenqing He
Abstract:
Dynamic treatment regimes (DTRs) have received increasing interest in recent years. DTRs are sequences of treatment decision rules tailored to patient-level information. The main goal of the DTR study is to identify an optimal DTR, a sequence of treatment decision rules that yields the best expected clinical outcome. Q-learning has been considered one of the most popular regression-based methods to estimate the optimal DTR. However, it is rarely studied in an error-prone setting, where the patient information is contaminated with measurement error. In this paper, we study the effect of covariate measurement error on Q-learning and propose a method to correct for the measurement error in Q-learning. Simulation studies are conducted to assess the performance of the proposed method in Q-learning. We illustrate the use of the proposed method in an application to the sequenced treatment alternatives to relieve depression data.
Submitted 6 April, 2024;
originally announced April 2024.
-
Multiple Imputation of Hierarchical Nonlinear Time Series Data with an Application to School Enrollment Data
Authors:
Daphne H. Liu,
Adrian E. Raftery
Abstract:
International comparisons of hierarchical time series data sets based on survey data, such as annual country-level estimates of school enrollment rates, can suffer from large amounts of missing data due to differing coverage of surveys across countries and across times. A popular approach to handling missing data in these settings is through multiple imputation, which can be especially effective when there is an auxiliary variable that is strongly predictive of and has a smaller amount of missing data than the variable of interest. However, standard methods for multiple imputation of hierarchical time series data can perform poorly when the auxiliary variable and the variable of interest have a nonlinear relationship. Performance can also suffer if the multiple imputations are used to estimate an analysis model that makes different assumptions about the data compared to the imputation model, leading to uncongeniality between analysis and imputation models. We propose a Bayesian method for multiple imputation of hierarchical nonlinear time series data that uses a sequential decomposition of the joint distribution and incorporates smoothing splines to account for nonlinear relationships between variables. We compare the proposed method with existing multiple imputation methods through a simulation study and an application to secondary school enrollment data. We find that the proposed method can lead to substantial performance increases for estimation of parameters in uncongenial analysis models and for prediction of individual missing values.
Submitted 28 March, 2025; v1 submitted 3 January, 2024;
originally announced January 2024.
-
Fast Rerandomization via the BRAIN
Authors:
Jiuyao Lu,
Daogao Liu,
Zhanran Lin,
Xiaomeng Wang
Abstract:
Randomized experiments are a crucial tool for causal inference in many different fields. Rerandomization addresses any covariate imbalance in such experiments by resampling treatment assignments until certain balance criteria are satisfied. However, rerandomization based on naïve acceptance-rejection sampling is computationally inefficient, especially when numerous independent assignments are required to perform randomization-based statistical inference. Existing acceleration methods are suboptimal and not applicable in structured experiments, including stratified and clustered experiments. Based on metaheuristics in integer programming, we propose BRAIN -- a novel computationally-lightweight methodology that ensures covariate balance in randomized experiments while significantly accelerating the computation. Our BRAIN method provides unbiased treatment effect estimators with reduced variance compared to complete randomization, preserving the desirable statistical properties of traditional rerandomization. Simulation studies and a real data example demonstrate the benefits of our method in fast sampling while retaining the appealing statistical guarantees.
Submitted 25 May, 2025; v1 submitted 28 December, 2023;
originally announced December 2023.
-
A Dataset of Uniswap daily transaction indices by network
Authors:
Nir Chemaya,
Lin William Cong,
Emma Jorgensen,
Dingyue Liu,
Luyao Zhang
Abstract:
Decentralized Finance (DeFi) is reshaping traditional finance by enabling direct transactions without intermediaries, creating a rich source of open financial data. Layer 2 (L2) solutions are emerging to enhance the scalability and efficiency of the DeFi ecosystem, surpassing Layer 1 (L1) systems. However, the impact of L2 solutions is still underexplored, mainly due to the lack of comprehensive transaction data indices for economic analysis. This study bridges that gap by analyzing over 50 million transactions from Uniswap, a major decentralized exchange, across both L1 and L2 networks. We created a set of daily indices from blockchain data on Ethereum, Optimism, Arbitrum, and Polygon, offering insights into DeFi adoption, scalability, decentralization, and wealth distribution. Additionally, we developed an open-source Python framework for calculating decentralization indices, making this dataset highly useful for advanced machine learning research. Our work provides valuable resources for data scientists and contributes to the growth of the intelligent Web3 ecosystem.
Submitted 22 September, 2024; v1 submitted 5 December, 2023;
originally announced December 2023.
-
On long-term fatigue damage estimation for a floating offshore wind turbine using a surrogate model
Authors:
Ding Peng Liu,
Giulio Ferri,
Taemin Heo,
Enzo Marino,
Lance Manuel
Abstract:
This study is concerned with the estimation of long-term fatigue damage for a floating offshore wind turbine. With the ultimate goal of efficient evaluation of fatigue limit states for floating offshore wind turbine systems, a detailed computational framework is introduced and used to develop a surrogate model using Gaussian process regression. The surrogate model, at first, relies only on a small subset of representative sea states and, then, is supplemented by the evaluation of additional sea states that leads to efficient convergence and accurate prediction of fatigue damage. A 5-MW offshore wind turbine supported by a semi-submersible floating platform is selected to demonstrate the proposed framework. The fore-aft bending moment at the turbine tower base and the fairlead tension in the windward mooring line are used for evaluation. Metocean data provide information on joint statistics of the wind and wave along with their relative likelihoods for the installation site in the Mediterranean Sea, near the coast of Sicily. A coupled frequency-domain model provides needed power spectra for the desired response processes. The proposed approach offers an efficient and accurate alternative to the exhaustive evaluation of a larger number of sea states and, as such, avoids excessive response simulations.
Submitted 7 March, 2024; v1 submitted 26 November, 2023;
originally announced November 2023.
-
A Likelihood Approach to Incorporating Self-Report Data in HIV Recency Classification
Authors:
Wenlong Yang,
Danping Liu,
Le Bao,
Runze Li
Abstract:
Estimating new HIV infections is significant yet challenging due to the difficulty in distinguishing between recent and long-term infections. We demonstrate that HIV recency status (recent vs. long-term) could be determined from the combination of self-report testing history and biomarkers, which are increasingly available in bio-behavioral surveys. HIV recency status is partially observed, given the self-report testing history. For example, people who tested positive for HIV over one year ago should have a long-term infection. Based on the nationally representative samples collected by the Population-based HIV Impact Assessment (PHIA) Project, we propose a likelihood-based probabilistic model for HIV recency classification. The model incorporates both labeled and unlabeled data and integrates the mechanism of how HIV recency status depends on biomarkers and the mechanism of how HIV recency status, together with the self-report time of the most recent HIV test, impacts the test results, via a set of logistic regression models. We compare our method to logistic regression and the binary classification tree (current practice) on Malawi, Zimbabwe, and Zambia PHIA data, as well as on simulated data. Our model obtains more efficient and less biased parameter estimates and is relatively robust to potential reporting error and model misspecification.
Submitted 12 November, 2024; v1 submitted 5 September, 2023;
originally announced September 2023.
-
Surrogate method for partial association between mixed data with application to well-being survey analysis
Authors:
Shaobo Li,
Zhaohu Fan,
Ivy Liu,
Philip S. Morrison,
Dungang Liu
Abstract:
This paper is motivated by the analysis of a survey study of college student wellbeing before and after the outbreak of the COVID-19 pandemic. A statistical challenge in well-being survey studies lies in that outcome variables are often recorded in different scales, be it continuous, binary, or ordinal. The presence of mixed data complicates the assessment of the associations between them while adjusting for covariates. In our study, of particular interest are the associations between college students' wellbeing and other mental health measures and how other risk factors moderate these associations during the pandemic. To this end, we propose a unifying framework for studying partial association between mixed data. This is achieved by defining a unified residual using the surrogate method. The idea is to map the residual randomness to the same continuous scale, regardless of the original scales of outcome variables. It applies to virtually all commonly used models for covariate adjustments. We demonstrate the validity of using such defined residuals to assess partial association. In particular, we develop a measure that generalizes classical Kendall's tau in the sense that it can size both partial and marginal associations. More importantly, our development advances the theory of the surrogate method developed in recent years by showing that it can be used without requiring outcome variables having a latent variable structure. The use of our method in the well-being survey analysis reveals (i) significant moderation effects (i.e., the difference between partial and marginal associations) of some key risk factors; and (ii) an elevated moderation effect of physical health, loneliness, and accommodation after the onset of COVID-19.
Submitted 8 June, 2023;
originally announced June 2023.
-
Simultaneous Estimation and Dataset Selection for Transfer Learning in High Dimensions by a Non-convex Penalty
Authors:
Zeyu Li,
Dong Liu,
Yong He,
Xinsheng Zhang
Abstract:
In this paper, we propose to estimate model parameters and identify informative source datasets simultaneously for high-dimensional transfer learning problems with the aid of a non-convex penalty, in contrast to the separate useful dataset selection and transfer learning procedures in the existing literature. To numerically solve the non-convex problem with respect to two specific statistical models, namely the sparse linear regression and the generalized low-rank trace regression models, we adopt the difference of convex (DC) programming with the alternating direction method of multipliers (ADMM) procedures. We theoretically justify the proposed algorithm from both statistical and computational perspectives. Extensive numerical results are reported alongside to validate the theoretical assertions. An \texttt{R} package \texttt{MHDTL} is developed to implement the proposed methods.
Submitted 11 November, 2024; v1 submitted 7 June, 2023;
originally announced June 2023.
-
Robust Statistical Inference for Large-dimensional Matrix-valued Time Series via Iterative Huber Regression
Authors:
Yong He,
Xin-Bing Kong,
Dong Liu,
Ran Zhao
Abstract:
The matrix factor model is drawing growing attention for simultaneous two-way dimension reduction of well-structured matrix-valued observations. This paper focuses on robust statistical inference for the matrix factor model in the "diverging dimension" regime. We derive the convergence rates of the robust estimators for loadings, factors and common components under a finite second moment assumption on the idiosyncratic errors. In addition, the asymptotic distributions of the estimators are derived under mild conditions. We propose a rank minimization method and an eigenvalue-ratio method to estimate the pair of factor numbers consistently. Numerical studies confirm that the iterative Huber regression algorithm is a practical and reliable approach for the estimation of the matrix factor model, especially in cases with heavy-tailed idiosyncratic errors. We illustrate the practical usefulness of the proposed methods with two real datasets, one on financial portfolios and one on the macroeconomic indices of China.
Submitted 5 June, 2023;
originally announced June 2023.
-
Interpretable machine learning-accelerated seed treatment by nanomaterials for environmental stress alleviation
Authors:
Hengjie Yu,
Dan Luo,
Sam F. Y. Li,
Maozhen Qu,
Da Liu,
Yingchao He,
Fang Cheng
Abstract:
Crops are constantly challenged by different environmental conditions. Seed treatment by nanomaterials is a cost-effective and environmentally-friendly solution for environmental stress mitigation in crop plants. Here, 56 seed nanopriming treatments are used to alleviate environmental stresses in maize. Seven selected nanopriming treatments significantly increase the stress resistance index (SRI) by 13.9% and 12.6% under salinity stress and combined heat-drought stress, respectively. Metabolomics data reveals that ZnO nanopriming treatment, with the highest SRI value, mainly regulates the pathways of amino acid metabolism, secondary metabolite synthesis, carbohydrate metabolism, and translation. Understanding the mechanism of seed nanopriming is still difficult due to the variety of nanomaterials and the complexity of interactions between nanomaterials and plants. Using the nanopriming data, we present an interpretable structure-activity relationship (ISAR) approach based on interpretable machine learning for predicting and understanding its stress mitigation effects. The post hoc and model-based interpretation approaches of machine learning are combined to provide complementary benefits and give researchers or policymakers more illuminating or trustworthy results. The concentration, size, and zeta potential of nanoparticles are identified as dominant factors for correlating root dry weight under salinity stress, and their effects and interactions are explained. Additionally, a web-based interactive tool is developed for offering prediction-level interpretation and gathering more details about specific nanopriming treatments. This work offers a promising framework for accelerating the agricultural applications of nanomaterials and may profoundly contribute to nanosafety assessment.
Submitted 8 April, 2023;
originally announced April 2023.
-
Huber Principal Component Analysis for Large-dimensional Factor Models
Authors:
Yong He,
Lingxiao Li,
Dong Liu,
Wen-Xin Zhou
Abstract:
Factor models have been widely used in economics and finance. However, the heavy-tailed nature of macroeconomic and financial data is often neglected in the existing literature. To address this issue and achieve robustness, we propose an approach to estimate factor loadings and scores by minimizing the Huber loss function, which is motivated by the equivalence of conventional Principal Component Analysis (PCA) and the constrained least squares method in the factor model. We provide two algorithms that use different penalty forms. The first algorithm, which we refer to as Huber PCA, minimizes the $\ell_2$-norm-type Huber loss and performs PCA on the weighted sample covariance matrix. The second algorithm involves an element-wise type Huber loss minimization, which can be solved by an iterative Huber regression algorithm. Our study examines the theoretical minimizer of the element-wise Huber loss function and demonstrates that it has the same convergence rate as conventional PCA when the idiosyncratic errors have bounded second moments. We also derive their asymptotic distributions under mild conditions. Moreover, we suggest a consistent model selection criterion that relies on rank minimization to estimate the number of factors robustly. We showcase the benefits of Huber PCA through extensive numerical experiments and a real financial portfolio selection example. An R package named "HDRFA" has been developed to implement the proposed robust factor analysis.
Submitted 29 March, 2023; v1 submitted 5 March, 2023;
originally announced March 2023.
-
Private (Stochastic) Non-Convex Optimization Revisited: Second-Order Stationary Points and Excess Risks
Authors:
Arun Ganesh,
Daogao Liu,
Sewoong Oh,
Abhradeep Thakurta
Abstract:
We consider the problem of minimizing a non-convex objective while preserving the privacy of the examples in the training data. Building upon the previous variance-reduced algorithm SpiderBoost, we introduce a new framework that utilizes two different kinds of gradient oracles. The first kind of oracles can estimate the gradient of one point, and the second kind of oracles, less precise and more cost-effective, can estimate the gradient difference between two points. SpiderBoost uses the first kind periodically, once every few steps, while our framework proposes using the first oracle whenever the total drift has become large and relies on the second oracle otherwise. This new framework ensures the gradient estimations remain accurate all the time, resulting in improved rates for finding second-order stationary points.
Moreover, we address a more challenging task of finding the global minima of a non-convex objective using the exponential mechanism. Our findings indicate that the regularized exponential mechanism can closely match previous empirical and population risk bounds, without requiring smoothness assumptions for algorithms with polynomial running time. Furthermore, by disregarding running time considerations, we show that the exponential mechanism can achieve a good population risk bound and provide a nearly matching lower bound.
Submitted 19 February, 2023;
originally announced February 2023.
-
Algorithmic Aspects of the Log-Laplace Transform and a Non-Euclidean Proximal Sampler
Authors:
Sivakanth Gopi,
Yin Tat Lee,
Daogao Liu,
Ruoqi Shen,
Kevin Tian
Abstract:
The development of efficient sampling algorithms catering to non-Euclidean geometries has been a challenging endeavor, as discretization techniques which succeed in the Euclidean setting do not readily carry over to more general settings. We develop a non-Euclidean analog of the recent proximal sampler of [LST21], which naturally induces regularization by an object known as the log-Laplace transform (LLT) of a density. We prove new mathematical properties (with an algorithmic flavor) of the LLT, such as strong convexity-smoothness duality and an isoperimetric inequality, which are used to prove a mixing time on our proximal sampler matching [LST21] under a warm start. As our main application, we show our warm-started sampler improves the value oracle complexity of differentially private convex optimization in $\ell_p$ and Schatten-$p$ norms for $p \in [1, 2]$ to match the Euclidean setting [GLL22], while retaining state-of-the-art excess risk bounds [GLLST23]. We find our investigation of the LLT to be a promising proof-of-concept of its utility as a tool for designing samplers, and outline directions for future exploration.
Submitted 22 February, 2023; v1 submitted 12 February, 2023;
originally announced February 2023.
-
ReSQueing Parallel and Private Stochastic Convex Optimization
Authors:
Yair Carmon,
Arun Jambulapati,
Yujia Jin,
Yin Tat Lee,
Daogao Liu,
Aaron Sidford,
Kevin Tian
Abstract:
We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Stochastic Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density. Combining ReSQue with recent advances in ball oracle acceleration [CJJJLST20, ACJJS21], we develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings. For a SCO objective constrained to the unit ball in $\mathbb{R}^d$, we obtain the following results (up to polylogarithmic factors). We give a parallel algorithm obtaining optimization error $ε_{\text{opt}}$ with $d^{1/3}ε_{\text{opt}}^{-2/3}$ gradient oracle query depth and $d^{1/3}ε_{\text{opt}}^{-2/3} + ε_{\text{opt}}^{-2}$ gradient queries in total, assuming access to a bounded-variance stochastic gradient estimator. For $ε_{\text{opt}} \in [d^{-1}, d^{-1/4}]$, our algorithm matches the state-of-the-art oracle depth of [BJLLS19] while maintaining the optimal total work of stochastic gradient descent. Given $n$ samples of Lipschitz loss functions, prior works [BFTT19, BFGT20, AFKT21, KLL21] established that if $n \gtrsim d ε_{\text{dp}}^{-2}$, $(ε_{\text{dp}}, δ)$-differential privacy is attained at no asymptotic cost to the SCO utility. However, these prior works all required a superlinear number of gradient queries. We close this gap for sufficiently large $n \gtrsim d^2 ε_{\text{dp}}^{-3}$, by using ReSQue to design an algorithm with near-linear gradient query complexity in this regime.
Submitted 27 October, 2023; v1 submitted 1 January, 2023;
originally announced January 2023.
-
AER: Auto-Encoder with Regression for Time Series Anomaly Detection
Authors:
Lawrence Wong,
Dongyu Liu,
Laure Berti-Equille,
Sarah Alnegheimish,
Kalyan Veeramachaneni
Abstract:
Anomaly detection on time series data is increasingly common across various industrial domains that monitor metrics in order to prevent potential accidents and economic losses. However, a scarcity of labeled data and ambiguous definitions of anomalies can complicate these efforts. Recent unsupervised machine learning methods have made remarkable progress in tackling this problem using either single-timestamp predictions or time series reconstructions. While traditionally considered separately, these methods are not mutually exclusive and can offer complementary perspectives on anomaly detection. This paper first highlights the successes and limitations of prediction-based and reconstruction-based methods with visualized time series signals and anomaly scores. We then propose AER (Auto-encoder with Regression), a joint model that combines a vanilla auto-encoder and an LSTM regressor to incorporate the successes and address the limitations of each method. Our model can produce bi-directional predictions while simultaneously reconstructing the original time series by optimizing a joint objective function. Furthermore, we propose several ways of combining the prediction and reconstruction errors through a series of ablation studies. Finally, we compare the performance of the AER architecture against two prediction-based methods and three reconstruction-based methods on 12 well-known univariate time series datasets from NASA, Yahoo, Numenta, and UCR. The results show that AER has the highest averaged F1 score across all datasets (a 23.5% improvement compared to ARIMA) while retaining a runtime similar to its vanilla auto-encoder and regressor components. Our model is available in Orion, an open-source benchmarking tool for time series anomaly detection.
Submitted 27 December, 2022;
originally announced December 2022.
-
Bregman Divergence-Based Data Integration with Application to Polygenic Risk Score (PRS) Heterogeneity Adjustment
Authors:
Qinmengge Li,
Matthew T. Patrick,
Haihan Zhang,
Chachrit Khunsriraksakul,
Philip E. Stuart,
Johann E. Gudjonsson,
Rajan Nair,
James T. Elder,
Dajiang J. Liu,
Jian Kang,
Lam C. Tsoi,
Kevin He
Abstract:
Polygenic risk scores (PRS) have recently received much attention for genetic risk prediction. While successful for the Caucasian population, PRS based on minority populations suffer from small sample sizes, high dimensionality and low signal-to-noise ratios, exacerbating already severe health disparities. Due to population heterogeneity, direct trans-ethnic prediction by utilizing the Caucasian model for the minority population also has limited performance. In addition, due to data privacy, the individual genotype data is not accessible for either the Caucasian population or the minority population. To address these challenges, we propose a Bregman divergence-based estimation procedure to measure and optimally balance the information from different populations. The proposed method only requires the use of encrypted summary statistics and improves the PRS performance for ethnic minority groups by incorporating additional information. We provide the asymptotic consistency and weak oracle property for the proposed method. Simulations and real data analyses also show its advantages in prediction and variable selection.
Submitted 12 October, 2022;
originally announced October 2022.
-
Private Convex Optimization in General Norms
Authors:
Sivakanth Gopi,
Yin Tat Lee,
Daogao Liu,
Ruoqi Shen,
Kevin Tian
Abstract:
We propose a new framework for differentially private optimization of convex functions which are Lipschitz in an arbitrary norm $\|\cdot\|$. Our algorithms are based on a regularized exponential mechanism which samples from the density $\propto \exp(-k(F+μr))$ where $F$ is the empirical loss and $r$ is a regularizer which is strongly convex with respect to $\|\cdot\|$, generalizing a recent work of [Gopi, Lee, Liu '22] to non-Euclidean settings. We show that this mechanism satisfies Gaussian differential privacy and solves both DP-ERM (empirical risk minimization) and DP-SCO (stochastic convex optimization) by using localization tools from convex geometry. Our framework is the first to apply to private convex optimization in general normed spaces and directly recovers non-private SCO rates achieved by mirror descent as the privacy parameter $ε\to \infty$. As applications, for Lipschitz optimization in $\ell_p$ norms for all $p \in (1, 2)$, we obtain the first optimal privacy-utility tradeoffs; for $p = 1$, we improve tradeoffs obtained by the recent works [Asi, Feldman, Koren, Talwar '21, Bassily, Guzman, Nandi '21] by at least a logarithmic factor. Our $\ell_p$ norm and Schatten-$p$ norm optimization frameworks are complemented with polynomial-time samplers whose query complexity we explicitly bound.
Submitted 10 November, 2022; v1 submitted 17 July, 2022;
originally announced July 2022.
-
Model diagnostics of discrete data regression: a unifying framework using functional residuals
Authors:
Zewei Lin,
Dungang Liu
Abstract:
Model diagnostics is an indispensable component of regression analysis, yet it is not well addressed in standard textbooks on generalized linear models. The lack of exposition is attributed to the fact that when outcome data are discrete, classical methods (e.g., Pearson/deviance residual analysis and goodness-of-fit tests) have limited utility in model diagnostics and treatment. This paper establishes a novel framework for model diagnostics of discrete data regression. Unlike the literature defining a single-valued quantity as the residual, we propose to use a function as a vehicle to retain the residual information. In the presence of discreteness, we show that such a functional residual is appropriate for summarizing the residual randomness that cannot be captured by the structural part of the model. We establish its theoretical properties, which lead to new diagnostic tools including the functional-residual-vs-covariate plot and the Function-to-Function (Fn-Fn) plot. Our numerical studies demonstrate that the use of these tools can reveal a variety of model misspecifications, such as not properly including a higher-order term, an explanatory variable, an interaction effect, a dispersion parameter, or a zero-inflation component. The functional residual yields, as a byproduct, Liu-Zhang's surrogate residual mainly developed for cumulative link models for ordinal data (Liu and Zhang, 2018, JASA). As a general notion, it considerably broadens the diagnostic scope as it applies to virtually all parametric models for binary, ordinal and count data, all in a unified diagnostic scheme.
Submitted 9 July, 2022;
originally announced July 2022.
-
When Does Differentially Private Learning Not Suffer in High Dimensions?
Authors:
Xuechen Li,
Daogao Liu,
Tatsunori Hashimoto,
Huseyin A. Inan,
Janardhan Kulkarni,
Yin Tat Lee,
Abhradeep Guha Thakurta
Abstract:
Large pretrained models can be privately fine-tuned to achieve performance approaching that of non-private models. A common theme in these results is the surprising observation that high-dimensional models can achieve favorable privacy-utility trade-offs. This seemingly contradicts known results on the model-size dependence of differentially private convex learning and raises the following research question: When does the performance of differentially private learning not degrade with increasing model size? We identify that the magnitude of gradients projected onto subspaces is a key factor that determines performance. To precisely characterize this for private convex learning, we introduce a condition on the objective that we term \emph{restricted Lipschitz continuity} and derive improved bounds for the excess empirical and population risks that are dimension-independent under additional conditions. We empirically show that in private fine-tuning of large language models, gradients obtained during fine-tuning are mostly controlled by a few principal components. This behavior is similar to conditions under which we obtain dimension-independent bounds in convex settings. Our theoretical and empirical results together provide a possible explanation for recent successes in large-scale private fine-tuning. Code to reproduce our results can be found at \url{https://github.com/lxuechen/private-transformers/tree/main/examples/classification/spectral_analysis}.
Submitted 26 October, 2022; v1 submitted 30 June, 2022;
originally announced July 2022.
-
Incentive Compatible Pareto Alignment for Multi-Source Large Graphs
Authors:
Jian Liang,
Fangrui Lv,
Di Liu,
Zehui Dai,
Xu Tian,
Shuang Li,
Fei Wang,
Han Li
Abstract:
In this paper, we focus on learning effective entity matching models over multi-source large-scale data. For real applications, we relax the typical assumptions that data distributions/spaces or entity identities are shared between sources, and propose a Relaxed Multi-source Large-scale Entity-matching (RMLE) problem. Challenges of the problem include 1) how to align large-scale entities between sources to share information and 2) how to mitigate negative transfer from jointly learning multi-source data. Worse still, the two challenges are entangled in practice. Specifically, incorrect alignments may increase negative transfer, while mitigating negative transfer for one source may result in poorly learned representations for other sources and then decrease alignment accuracy. To handle the entangled challenges, we point out that the key is to optimize information sharing first based on Pareto front optimization, by showing that information sharing significantly influences the Pareto front, which depicts lower bounds of negative transfer. Consequently, we propose an Incentive Compatible Pareto Alignment (ICPA) method to first optimize cross-source alignments based on Pareto front optimization, then mitigate negative transfer constrained on the optimized alignments. This mechanism allows each source to learn based on its true preference without worrying about deteriorating the representations of other sources. Specifically, the Pareto front optimization encourages minimizing lower bounds of negative transfer, which optimizes whether and which to align. Comprehensive empirical evaluation results on four large-scale datasets are provided to demonstrate the effectiveness and superiority of ICPA. Online A/B test results at a search advertising platform also demonstrate the effectiveness of ICPA in production environments.
Submitted 6 December, 2021;
originally announced December 2021.
-
Quality control, data cleaning, imputation
Authors:
Dawei Liu,
Hanne I. Oberman,
Johanna Muñoz,
Jeroen Hoogland,
Thomas P. A. Debray
Abstract:
This chapter addresses important steps during the quality assurance and control of RWD, with particular emphasis on the identification and handling of missing values. A gentle introduction is provided on common statistical and machine learning methods for imputation. We discuss the main strengths and weaknesses of each method, and compare their performance in a literature review. We motivate why the imputation of RWD may require additional efforts to avoid bias, and highlight recent advances that account for informative missingness and repeated observations. Finally, we introduce alternative methods to address incomplete data without the need for imputation.
Submitted 29 October, 2021;
originally announced October 2021.
-
Simultaneous Cluster Structure Learning and Estimation of Heterogeneous Graphs for Matrix-variate fMRI Data
Authors:
Dong Liu,
Changwei Zhao,
Yong He,
Lei Liu,
Ying Guo,
Xinsheng Zhang
Abstract:
Graphical models play an important role in neuroscience studies, particularly in brain connectivity analysis. Typically, observations/samples are from several heterogeneous groups and the group membership of each observation/sample is unavailable, which poses a great challenge for graph structure learning. In this article, we propose a method which can achieve Simultaneous Clustering and Estimation of Heterogeneous Graphs (briefly denoted as SCEHG) for matrix-variate functional Magnetic Resonance Imaging (fMRI) data. Unlike the conventional clustering methods which rely on the mean differences of various groups, the proposed SCEHG method fully exploits the group differences of conditional dependence relationships among brain regions for learning cluster structure. In essence, by constructing individual-level between-region network measures, we formulate clustering as penalized regression with grouping and sparsity pursuit, which transforms the unsupervised learning into supervised learning. An ADMM algorithm is proposed to solve the corresponding optimization problem. We also propose a generalized criterion to specify the number of clusters. Extensive simulation studies illustrate the superiority of the SCEHG method over some state-of-the-art methods in terms of both clustering and graph recovery accuracy. We also apply the SCEHG procedure to analyze fMRI data associated with Attention Deficit Hyperactivity Disorder (ADHD), which illustrates its empirical usefulness. An R package "SCEHG" to implement the method is available at https://github.com/heyongstat/SCEHG.
Submitted 9 October, 2021;
originally announced October 2021.
-
Stochastic tensor space feature theory with applications to robust machine learning
Authors:
Julio Enrique Castrillon-Candas,
Dingning Liu,
Sicheng Yang,
Xiaoling Zhang,
Mark Kon
Abstract:
In this paper we develop a Multilevel Orthogonal Subspace (MOS) Karhunen-Loeve feature theory based on stochastic tensor spaces, for the construction of robust machine learning features. Training data is treated as instances of a random field within a relevant Bochner space. Our key observation is that separate machine learning classes can reside predominantly in mostly distinct subspaces. Using the Karhunen-Loeve expansion and a hierarchical expansion of the first (nominal) class, a MOS is constructed to detect anomalous signal components, treating the second class as an outlier of the first. The projection coefficients of the input data into these subspaces are then used to train a Machine Learning (ML) classifier. These coefficients become new features from which much clearer separation surfaces can arise for the underlying classes. Tests in the blood plasma dataset (Alzheimer's Disease Neuroimaging Initiative) show dramatic increases in accuracy. This is in contrast to popular ML methods such as Gradient Boosting, RUS Boost, Random Forest and (Convolutional) Neural Networks.
Submitted 20 March, 2025; v1 submitted 4 October, 2021;
originally announced October 2021.
-
The Convergence Rate of SGD's Final Iterate: Analysis on Dimension Dependence
Authors:
Daogao Liu,
Zhou Lu
Abstract:
Stochastic Gradient Descent (SGD) is among the simplest and most popular methods in optimization. The convergence rate for SGD has been extensively studied and tight analyses have been established for the running average scheme, but the sub-optimality of the final iterate is still not well-understood. shamir2013stochastic gave the best known upper bound for the final iterate of SGD minimizing non-smooth convex functions, which is $O(\log T/\sqrt{T})$ for Lipschitz convex functions and $O(\log T/ T)$ under an additional strong convexity assumption. The best known lower bounds, however, are worse than the upper bounds by a factor of $\log T$. harvey2019tight gave matching lower bounds but their construction requires dimension $d= T$. It was then asked by koren2020open how to characterize the final-iterate convergence of SGD in the constant dimension setting.
In this paper, we answer this question in the more general setting for any $d\leq T$, proving $Ω(\log d/\sqrt{T})$ and $Ω(\log d/T)$ lower bounds for the sub-optimality of the final iterate of SGD in minimizing non-smooth Lipschitz convex and strongly convex functions respectively with standard step size schedules. Our results provide the first general dimension dependent lower bound on the convergence of SGD's final iterate, partially resolving a COLT open question raised by koren2020open. We also present further evidence to show the correct rate in one dimension should be $Θ(1/\sqrt{T})$, such as a proof of a tight $O(1/\sqrt{T})$ upper bound for one-dimensional special cases in settings more general than koren2020open.
Submitted 28 June, 2021;
originally announced June 2021.
-
Joint Learning of Multiple Differential Networks with fMRI data for Brain Connectivity Alteration Detection
Authors:
Hao Chen,
Ying Guo,
Yong He,
Dong Liu,
Lei Liu,
Xiao-Hua Zhou
Abstract:
In this study we focus on the problem of joint learning of multiple differential networks with functional Magnetic Resonance Imaging (fMRI) data sets from multiple research centers. As the research centers may use different scanners and imaging parameters, joint learning of differential networks with fMRI data from different centers may reflect the underlying mechanism of neurological diseases from different perspectives while capturing the common structures. We transform the task into a penalized logistic regression problem, and exploit the sparse group Minimax Concave Penalty (gMCP) to induce common structures among multiple differential networks and the sparse structures of each differential network. To further enhance the empirical performance, we develop an ensemble-learning procedure. We conduct a thorough simulation study to assess the finite-sample performance of the proposed method and compare it with state-of-the-art alternatives. We apply the proposed method to analyze fMRI datasets related to Attention Deficit Hyperactivity Disorder from various research centers. The identified common hub nodes and differential interaction patterns coincide with existing experimental studies.
Submitted 7 June, 2021;
originally announced June 2021.
-
Understanding Neural Networks with Logarithm Determinant Entropy Estimator
Authors:
Zhanghao Zhouyin,
Ding Liu
Abstract:
Understanding the informative behaviour of deep neural networks is challenged by misused estimators and the complexity of network structure, which leads to inconsistent observations and diversified interpretation. Here we propose the LogDet estimator -- a reliable matrix-based entropy estimator that approximates Shannon differential entropy. We construct informative measurements based on the LogDet estimator, verify our method with comparable experiments and utilize it to analyse neural network behaviour. Our results demonstrate that the LogDet estimator overcomes the drawbacks that emerge from highly diverse and degenerate distributions and is thus reliable for estimating entropy in neural networks. The network analysis results also reveal a functional distinction between shallow and deeper layers, which can help understand the compression phenomenon in the Information Bottleneck theory of neural networks.
Submitted 8 May, 2021;
originally announced May 2021.
-
Multiple Sclerosis Lesion Analysis in Brain Magnetic Resonance Images: Techniques and Clinical Applications
Authors:
Yang Ma,
Chaoyi Zhang,
Mariano Cabezas,
Yang Song,
Zihao Tang,
Dongnan Liu,
Weidong Cai,
Michael Barnett,
Chenyu Wang
Abstract:
Multiple sclerosis (MS) is a chronic inflammatory and degenerative disease of the central nervous system, characterized by the appearance of focal lesions in the white and gray matter that topographically correlate with an individual patient's neurological symptoms and signs. Magnetic resonance imaging (MRI) provides detailed in-vivo structural information, permitting the quantification and categorization of MS lesions that critically inform disease management. Traditionally, MS lesions have been manually annotated on 2D MRI slices, a process that is inefficient and prone to inter-/intra-observer errors. Recently, automated statistical imaging analysis techniques have been proposed to detect and segment MS lesions based on MRI voxel intensity. However, their effectiveness is limited by the heterogeneity of both MRI data acquisition techniques and the appearance of MS lesions. By learning complex lesion representations directly from images, deep learning techniques have achieved remarkable breakthroughs in the MS lesion segmentation task. Here, we provide a comprehensive review of state-of-the-art automatic statistical and deep-learning MS segmentation methods and discuss current and future clinical applications. Further, we review technical strategies, such as domain adaptation, to enhance MS lesion segmentation in real-world clinical settings.
Submitted 27 January, 2022; v1 submitted 20 April, 2021;
originally announced April 2021.
-
Private Non-smooth Empirical Risk Minimization and Stochastic Convex Optimization in Subquadratic Steps
Authors:
Janardhan Kulkarni,
Yin Tat Lee,
Daogao Liu
Abstract:
We study the differentially private Empirical Risk Minimization (ERM) and Stochastic Convex Optimization (SCO) problems for non-smooth convex functions. We get a (nearly) optimal bound on the excess empirical risk and excess population loss with subquadratic gradient complexity. More precisely, our differentially private algorithm requires $O(\frac{N^{3/2}}{d^{1/8}}+ \frac{N^2}{d})$ gradient queries for optimal excess empirical risk, which is achieved with the help of subsampling and smoothing the function via convolution. This is the first subquadratic algorithm for the non-smooth case when $d$ is super constant. As a direct application, using the iterative localization approach of Feldman et al. \cite{fkt20}, we achieve the optimal excess population loss for the stochastic convex optimization problem, with $O(\min\{N^{5/4}d^{1/8},\frac{ N^{3/2}}{d^{1/8}}\})$ gradient queries. Our work makes progress towards resolving a question raised by Bassily et al. \cite{bfgt20}, giving the first algorithms for private ERM and SCO with subquadratic steps.
We note that independently Asi et al. \cite{afkt21} gave other algorithms for private ERM and SCO with subquadratic steps.
Submitted 29 March, 2021; v1 submitted 29 March, 2021;
originally announced March 2021.
-
Nonparametric fusion learning: synthesize inferences from diverse sources using depth confidence distribution
Authors:
Dungang Liu,
Regina Y. Liu,
Minge Xie
Abstract:
Fusion learning refers to synthesizing inferences from multiple sources or studies to provide more effective inference and prediction than from any individual source or study alone. Most existing methods for synthesizing inferences rely on parametric model assumptions, such as normality, which often do not hold in practice. In this paper, we propose a general nonparametric fusion learning framework for synthesizing inferences of the target parameter from multiple sources. The main tool underlying the proposed framework is the notion of depth confidence distribution (depth-CD), which is also developed in this paper. Broadly speaking, a depth-CD is a data-driven nonparametric summary distribution of inferential information for the target parameter. We show that a depth-CD is a useful inferential tool and, moreover, is an omnibus form of confidence regions (or p-values), whose contours of level sets shrink toward the true parameter value. The proposed fusion learning approach combines depth-CDs from the individual studies, with each depth-CD constructed by nonparametric bootstrap and data depth. This approach is shown to be efficient, general and robust. Specifically, it achieves high-order accuracy and Bahadur efficiency under suitably chosen combining elements. It allows the model or inference structure to be different among individual studies. It also readily adapts to heterogeneous studies with a broad range of complex and irregular settings. This property enables it to utilize indirect evidence from incomplete studies to gain efficiency in the overall inference. The advantages of the proposed approach are demonstrated in simulations and in a Federal Aviation Administration (FAA) study of aircraft landing performance.
Submitted 13 November, 2020;
originally announced November 2020.
-
Cardea: An Open Automated Machine Learning Framework for Electronic Health Records
Authors:
Sarah Alnegheimish,
Najat Alrashed,
Faisal Aleissa,
Shahad Althobaiti,
Dongyu Liu,
Mansour Alsaleh,
Kalyan Veeramachaneni
Abstract:
An estimated 180 papers focusing on deep learning and EHR were published between 2010 and 2018. Despite the common workflow structure appearing in these publications, no trusted and verified software framework exists, forcing researchers to arduously repeat previous work. In this paper, we propose Cardea, an extensible open-source automated machine learning framework that encapsulates common prediction problems in the health domain and allows users to build predictive models with their own data. This system relies on two components: Fast Healthcare Interoperability Resources (FHIR) -- a standardized data structure for electronic health systems -- and several AUTOML frameworks for automated feature engineering, model selection, and tuning. We augment these components with an adaptive data assembler and comprehensive data- and model-auditing capabilities. We demonstrate our framework via 5 prediction tasks on MIMIC-III and Kaggle datasets, which highlight Cardea's human competitiveness, flexibility in problem definition, extensive feature generation capability, adaptable automatic data assembler, and its usability.
Submitted 1 October, 2020;
originally announced October 2020.
-
TadGAN: Time Series Anomaly Detection Using Generative Adversarial Networks
Authors:
Alexander Geiger,
Dongyu Liu,
Sarah Alnegheimish,
Alfredo Cuesta-Infante,
Kalyan Veeramachaneni
Abstract:
Time series anomalies can offer information relevant to critical situations facing various fields, from finance and aerospace to the IT, security, and medical domains. However, detecting anomalies in time series data is particularly challenging due to the vague definition of anomalies and said data's frequent lack of labels and highly complex temporal correlations. Current state-of-the-art unsupervised machine learning methods for anomaly detection suffer from scalability and portability issues, and may have high false positive rates. In this paper, we propose TadGAN, an unsupervised anomaly detection approach built on Generative Adversarial Networks (GANs). To capture the temporal correlations of time series distributions, we use LSTM Recurrent Neural Networks as base models for Generators and Critics. TadGAN is trained with cycle consistency loss to allow for effective time-series data reconstruction. We further propose several novel methods to compute reconstruction errors, as well as different approaches to combine reconstruction errors and Critic outputs to compute anomaly scores. To demonstrate the performance and generalizability of our approach, we test several anomaly scoring techniques and report the best-suited one. We compare our approach to 8 baseline anomaly detection methods on 11 datasets from multiple reputable sources such as NASA, Yahoo, Numenta, Amazon, and Twitter. The results show that our approach can effectively detect anomalies and outperform baseline methods in most cases (6 out of 11). Notably, our method has the highest averaged F1 score across all the datasets. Our code is open source and is available as a benchmarking tool.
Submitted 14 November, 2020; v1 submitted 16 September, 2020;
originally announced September 2020.
-
An Optimal Hybrid Variance-Reduced Algorithm for Stochastic Composite Nonconvex Optimization
Authors:
Deyi Liu,
Lam M. Nguyen,
Quoc Tran-Dinh
Abstract:
In this note we propose a new variant of the hybrid variance-reduced proximal gradient method in [7] to solve a common stochastic composite nonconvex optimization problem under standard assumptions. We simply replace the independent unbiased estimator in our hybrid-SARAH estimator introduced in [7] by the stochastic gradient evaluated at the same sample, leading to the identical momentum-SARAH estimator introduced in [2]. This allows us to save one stochastic gradient per iteration compared to [7], and only requires two samples per iteration. Our algorithm is very simple and achieves optimal stochastic oracle complexity bound in terms of stochastic gradient evaluations (up to a constant factor). Our analysis is essentially inspired by [7], but we do not use two different step-sizes.
Submitted 20 August, 2020;
originally announced August 2020.