Showing 1–50 of 112 results for author: Liu, D

Searching in archive stat.

  1. arXiv:2507.02248  [pdf, ps, other]

    stat.ML cs.LG

    Transfer Learning for Matrix Completion

    Authors: Dali Liu, Haolei Weng

    Abstract: In this paper, we explore the knowledge transfer under the setting of matrix completion, which aims to enhance the estimation of a low-rank target matrix with auxiliary data available. We propose a transfer learning procedure given prior information on which source datasets are favorable. We study its convergence rates and prove its minimax optimality. Our analysis reveals that with the source mat…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: 37 pages, 1 figure

    MSC Class: 15A83 ACM Class: I.2.6; G.3

  2. arXiv:2505.20189  [pdf, ps, other]

    cs.DS cs.CR cs.LG stat.ML

    Private Geometric Median in Nearly-Linear Time

    Authors: Syamantak Kumar, Daogao Liu, Kevin Tian, Chutong Yang

    Abstract: Estimating the geometric median of a dataset is a robust counterpart to mean estimation, and is a fundamental problem in computational geometry. Recently, [HSU24] gave an $(\varepsilon, δ)$-differentially private algorithm obtaining an $α$-multiplicative approximation to the geometric median objective, $\frac 1 n \sum_{i \in [n]} \|\cdot - \mathbf{x}_i\|$, given a dataset…

    Submitted 26 May, 2025; originally announced May 2025.

  3. arXiv:2503.03128  [pdf, other]

    cs.AI cs.CL cs.LG stat.ML

    Towards Understanding Multi-Round Large Language Model Reasoning: Approximability, Learnability and Generalizability

    Authors: Chenhui Xu, Dancheng Liu, Jiajie Li, Amir Nassereldine, Zhaohui Li, Jinjun Xiong

    Abstract: Recent advancements in cognitive science and multi-round reasoning techniques for Large Language Models (LLMs) suggest that iterative thinking processes improve problem-solving performance in complex tasks. Inspired by this, approaches like Chain-of-Thought, debating, and self-refinement have been applied to auto-regressive LLMs, achieving significant successes in tasks such as mathematical reason…

    Submitted 4 March, 2025; originally announced March 2025.

  4. arXiv:2502.08889  [pdf, ps, other]

    cs.LG cs.CR cs.DS stat.ML

    Linear-Time User-Level DP-SCO via Robust Statistics

    Authors: Badih Ghazi, Ravi Kumar, Daogao Liu, Pasin Manurangsi

    Abstract: User-level differentially private stochastic convex optimization (DP-SCO) has garnered significant attention due to the paramount importance of safeguarding user privacy in modern large-scale machine learning applications. Current methods, such as those based on differentially private stochastic gradient descent (DP-SGD), often struggle with high noise accumulation and suboptimal utility due to th…

    Submitted 12 February, 2025; originally announced February 2025.

  5. arXiv:2502.00470  [pdf, other]

    math.OC cs.LG stat.ML

    Distributed Primal-Dual Algorithms: Unification, Connections, and Insights

    Authors: Runxiong Wu, Dong Liu, Xueqin Wang, Andi Wang

    Abstract: We study primal-dual algorithms for general empirical risk minimization problems in distributed settings, focusing on two prominent classes of algorithms. The first class is the communication-efficient distributed dual coordinate ascent (CoCoA), derived from the coordinate ascent method for solving the dual problem. The second class is the alternating direction method of multipliers (ADMM), includ…

    Submitted 1 February, 2025; originally announced February 2025.

    Comments: 15 pages, 4 figures, 1 table

  6. arXiv:2501.18741  [pdf]

    cs.LG cs.AI stat.ML

    Synthetic Data Generation for Augmenting Small Samples

    Authors: Dan Liu, Samer El Kababji, Nicholas Mitsakakis, Lisa Pilgram, Thomas Walters, Mark Clemons, Greg Pond, Alaa El-Hussuna, Khaled El Emam

    Abstract: Small datasets are common in health research. However, the generalization performance of machine learning models is suboptimal when the training datasets are small. To address this, data augmentation is one solution. Augmentation increases sample size and is seen as a form of regularization that increases the diversity of small datasets, leading them to perform better on unseen data. We found that…

    Submitted 30 January, 2025; originally announced January 2025.

  7. arXiv:2412.06233  [pdf, other]

    stat.ML cs.LG

    Representational Transfer Learning for Matrix Completion

    Authors: Yong He, Zeyu Li, Dong Liu, Kangxiang Qin, Jiahui Xie

    Abstract: We propose to transfer representational knowledge from multiple sources to a target noisy matrix completion task by aggregating singular subspaces information. Under our representational similarity framework, we first integrate linear representation information by solving a two-way principal component analysis problem based on a properly debiased matrix-valued dataset. After acquiring better colum…

    Submitted 9 December, 2024; originally announced December 2024.

  8. arXiv:2411.02539  [pdf, other]

    stat.AP

    Estimating journey time for two-point vehicle re-identification survey with limited observable scope using 2-dimensional truncated distributions

    Authors: Diyi Liu, Yangsong Gu, Lee D. Han

    Abstract: In transportation, Weigh-in motion (WIM) stations, Electronic Toll Collection (ETC) systems, Closed-circuit Television (CCTV) are widely deployed to collect data at different locations. Vehicle re-identification, by matching the same vehicle at different locations, is helpful in understanding the long-distance journey patterns. In this paper, the potential hazards of ignoring the survivorship bias…

    Submitted 4 November, 2024; originally announced November 2024.

    Comments: 22 pages, 13 figures

  9. arXiv:2410.10180  [pdf, other]

    cs.LG stat.ML

    Gaussian Mixture Vector Quantization with Aggregated Categorical Posterior

    Authors: Mingyuan Yan, Jiawei Wu, Rushi Shah, Dianbo Liu

    Abstract: The vector quantization is a widely used method to map continuous representation to discrete space and has important application in tokenization for generative mode, bottlenecking information and many other tasks in machine learning. Vector Quantized Variational Autoencoder (VQ-VAE) is a type of variational autoencoder using discrete embedding as latent. We generalize the technique further, enrich…

    Submitted 14 October, 2024; originally announced October 2024.

  10. arXiv:2410.07502  [pdf, ps, other]

    cs.LG cs.CR cs.DS stat.ML

    Adaptive Batch Size for Privately Finding Second-Order Stationary Points

    Authors: Daogao Liu, Kunal Talwar

    Abstract: There is a gap between finding a first-order stationary point (FOSP) and a second-order stationary point (SOSP) under differential privacy constraints, and it remains unclear whether privately finding an SOSP is more challenging than finding an FOSP. Specifically, Ganesh et al. (2023) claimed that an $α$-SOSP can be found with $α=O(\frac{1}{n^{1/3}}+(\frac{\sqrt{d}}{nε})^{3/7})$, where $n$ is the…

    Submitted 26 February, 2025; v1 submitted 9 October, 2024; originally announced October 2024.

    Comments: Accepted to ICLR 2025. This version corrects an error by introducing a new subprocedure for escaping saddle points and also addresses minor typos

  11. arXiv:2410.05880  [pdf, ps, other]

    cs.LG cs.CR math.OC stat.ML

    Improved Sample Complexity for Private Nonsmooth Nonconvex Optimization

    Authors: Guy Kornowski, Daogao Liu, Kunal Talwar

    Abstract: We study differentially private (DP) optimization algorithms for stochastic and empirical objectives which are neither smooth nor convex, and propose methods that return a Goldstein-stationary point with sample complexity bounds that improve on existing works. We start by providing a single-pass $(ε,δ)$-DP algorithm that returns an $(α,β)$-stationary point as long as the dataset is of size…

    Submitted 7 June, 2025; v1 submitted 8 October, 2024; originally announced October 2024.

    Comments: Accepted to ICML 2025; some fixes following reviews

  12. arXiv:2407.19378  [pdf, other]

    stat.ME

    Penalized Principal Component Analysis for Large-dimension Factor Model with Group Pursuit

    Authors: Yong He, Dong Liu, Guangming Pan, Yiming Wang

    Abstract: This paper investigates the intrinsic group structures within the framework of large-dimensional approximate factor models, which portrays homogeneous effects of the common factors on the individuals that fall into the same group. To this end, we propose a fusion Penalized Principal Component Analysis (PPCA) method and derive a closed-form solution for the $\ell_2$-norm optimization problem. We al…

    Submitted 15 March, 2025; v1 submitted 27 July, 2024; originally announced July 2024.

  13. arXiv:2407.00882  [pdf, other]

    stat.ME

    Subgroup Identification with Latent Factor Structure

    Authors: Yong He, Dong Liu, Fuxin Wang, Mingjuan Zhang, Wen-Xin Zhou

    Abstract: Subgroup analysis has garnered increasing attention for its ability to identify meaningful subgroups within heterogeneous populations, thereby enhancing predictive power. However, in many fields such as social science and biology, covariates are often highly correlated due to common factors. This correlation poses significant challenges for subgroup identification, an issue that is often overlooke…

    Submitted 17 July, 2024; v1 submitted 30 June, 2024; originally announced July 2024.

  14. arXiv:2406.03620  [pdf, ps, other]

    cs.LG cs.CR cs.DS math.OC stat.ML

    Private Online Learning via Lazy Algorithms

    Authors: Hilal Asi, Tomer Koren, Daogao Liu, Kunal Talwar

    Abstract: We study the problem of private online learning, specifically, online prediction from experts (OPE) and online convex optimization (OCO). We propose a new transformation that transforms lazy online learning algorithms into private algorithms. We apply our transformation for differentially private OPE and OCO using existing lazy algorithms for these problems. Our final algorithms obtain regret, whi…

    Submitted 21 February, 2025; v1 submitted 5 June, 2024; originally announced June 2024.

    Comments: Fix some small typos

  15. arXiv:2406.02789  [pdf, other]

    cs.DS cs.CR cs.LG stat.ML

    Private Stochastic Convex Optimization with Heavy Tails: Near-Optimality from Simple Reductions

    Authors: Hilal Asi, Daogao Liu, Kevin Tian

    Abstract: We study the problem of differentially private stochastic convex optimization (DP-SCO) with heavy-tailed gradients, where we assume a $k^{\text{th}}$-moment bound on the Lipschitz constants of sample functions rather than a uniform bound. We propose a new reduction-based approach that enables us to obtain the first optimal rates (up to logarithmic factors) in the heavy-tailed setting, achieving er…

    Submitted 4 June, 2024; originally announced June 2024.

  16. arXiv:2405.00172  [pdf, ps, other]

    cs.LG cs.SI stat.ML

    Bypassing Skip-Gram Negative Sampling: Dimension Regularization as a More Efficient Alternative for Graph Embeddings

    Authors: David Liu, Arjun Seshadri, Tina Eliassi-Rad, Johan Ugander

    Abstract: A wide range of graph embedding objectives decompose into two components: one that enforces similarity, attracting the embeddings of nodes that are perceived as similar, and another that enforces dissimilarity, repelling the embeddings of nodes that are perceived as dissimilar. Without repulsion, the embeddings would collapse into trivial solutions. Skip-Gram Negative Sampling (SGNS) is a popular…

    Submitted 2 June, 2025; v1 submitted 30 April, 2024; originally announced May 2024.

    Comments: Published in KDD'25

  17. arXiv:2404.06168  [pdf]

    stat.AP

    Protection of Guizhou Miao Batik Culture Based on Knowledge Graph and Deep Learning

    Authors: Huafeng Quan, Yiting Li, Dashuai Liu, Yue Zhou

    Abstract: In the globalization trend, China's cultural heritage is in danger of gradually disappearing. The protection and inheritance of these precious cultural resources has become a critical task. This paper focuses on the Miao batik culture in Guizhou Province, China, and explores the application of knowledge graphs, natural language processing, and deep learning techniques in the promotion and protecti…

    Submitted 9 April, 2024; originally announced April 2024.

  18. Q-learning in Dynamic Treatment Regimes with Misclassified Binary Outcome

    Authors: Dan Liu, Wenqing He

    Abstract: The study of precision medicine involves dynamic treatment regimes (DTRs), which are sequences of treatment decision rules recommended by taking patient-level information as input. The primary goal of the DTR study is to identify an optimal DTR, a sequence of treatment decision rules that leads to the best expected clinical outcome. Statistical methods have been developed in recent years to estima…

    Submitted 6 April, 2024; originally announced April 2024.

    Report number: sim.10223

  19. arXiv:2404.04696  [pdf, ps, other]

    stat.ME

    Dynamic Treatment Regimes with Replicated Observations Available for Error-prone Covariates: a Q-learning Approach

    Authors: Dan Liu, Wenqing He

    Abstract: Dynamic treatment regimes (DTRs) have received an increasing interest in recent years. DTRs are sequences of treatment decision rules tailored to patient-level information. The main goal of the DTR study is to identify an optimal DTR, a sequence of treatment decision rules that yields the best expected clinical outcome. Q-learning has been considered as one of the most popular regression-based met…

    Submitted 6 April, 2024; originally announced April 2024.

  20. arXiv:2401.01872  [pdf, other]

    stat.ME stat.AP

    Multiple Imputation of Hierarchical Nonlinear Time Series Data with an Application to School Enrollment Data

    Authors: Daphne H. Liu, Adrian E. Raftery

    Abstract: International comparisons of hierarchical time series data sets based on survey data, such as annual country-level estimates of school enrollment rates, can suffer from large amounts of missing data due to differing coverage of surveys across countries and across times. A popular approach to handling missing data in these settings is through multiple imputation, which can be especially effective w…

    Submitted 28 March, 2025; v1 submitted 3 January, 2024; originally announced January 2024.

    Comments: 34 pages, 5 figures

  21. arXiv:2312.17230  [pdf, other]

    stat.ME math.OC stat.CO

    Fast Rerandomization via the BRAIN

    Authors: Jiuyao Lu, Daogao Liu, Zhanran Lin, Xiaomeng Wang

    Abstract: Randomized experiments are a crucial tool for causal inference in many different fields. Rerandomization addresses any covariate imbalance in such experiments by resampling treatment assignments until certain balance criteria are satisfied. However, rerandomization based on naïve acceptance-rejection sampling is computationally inefficient, especially when numerous independent assignments are requ…

    Submitted 25 May, 2025; v1 submitted 28 December, 2023; originally announced December 2023.

  22. arXiv:2312.02660  [pdf, other]

    econ.GN cs.CE cs.CR cs.CY stat.AP

    A Dataset of Uniswap daily transaction indices by network

    Authors: Nir Chemaya, Lin William Cong, Emma Jorgensen, Dingyue Liu, Luyao Zhang

    Abstract: Decentralized Finance (DeFi) is reshaping traditional finance by enabling direct transactions without intermediaries, creating a rich source of open financial data. Layer 2 (L2) solutions are emerging to enhance the scalability and efficiency of the DeFi ecosystem, surpassing Layer 1 (L1) systems. However, the impact of L2 solutions is still underexplored, mainly due to the lack of comprehensive t…

    Submitted 22 September, 2024; v1 submitted 5 December, 2023; originally announced December 2023.

  23. On long-term fatigue damage estimation for a floating offshore wind turbine using a surrogate model

    Authors: Ding Peng Liu, Giulio Ferri, Taemin Heo, Enzo Marino, Lance Manuel

    Abstract: This study is concerned with the estimation of long-term fatigue damage for a floating offshore wind turbine. With the ultimate goal of efficient evaluation of fatigue limit states for floating offshore wind turbine systems, a detailed computational framework is introduced and used to develop a surrogate model using Gaussian process regression. The surrogate model, at first, relies only on a small…

    Submitted 7 March, 2024; v1 submitted 26 November, 2023; originally announced November 2023.

  24. arXiv:2309.02430  [pdf, other]

    stat.AP

    A Likelihood Approach to Incorporating Self-Report Data in HIV Recency Classification

    Authors: Wenlong Yang, Danping Liu, Le Bao, Runze Li

    Abstract: Estimating new HIV infections is significant yet challenging due to the difficulty in distinguishing between recent and long-term infections. We demonstrate that HIV recency status (recent v.s. long-term) could be determined from the combination of self-report testing history and biomarkers, which are increasingly available in bio-behavioral surveys. HIV recency status is partially observed, given…

    Submitted 12 November, 2024; v1 submitted 5 September, 2023; originally announced September 2023.

  25. arXiv:2306.05362  [pdf, other]

    stat.ME

    Surrogate method for partial association between mixed data with application to well-being survey analysis

    Authors: Shaobo Li, Zhaohu Fan, Ivy Liu, Philip S. Morrison, Dungang Liu

    Abstract: This paper is motivated by the analysis of a survey study of college student wellbeing before and after the outbreak of the COVID-19 pandemic. A statistical challenge in well-being survey studies lies in that outcome variables are often recorded in different scales, be it continuous, binary, or ordinal. The presence of mixed data complicates the assessment of the associations between them while ad…

    Submitted 8 June, 2023; originally announced June 2023.

    Comments: 38 pages

  26. arXiv:2306.04182  [pdf, other]

    stat.ME

    Simultaneous Estimation and Dataset Selection for Transfer Learning in High Dimensions by a Non-convex Penalty

    Authors: Zeyu Li, Dong Liu, Yong He, Xinsheng Zhang

    Abstract: In this paper, we propose to estimate model parameters and identify informative source datasets simultaneously for high-dimensional transfer learning problems with the aid of a non-convex penalty, in contrast to the separate useful dataset selection and transfer learning procedures in the existing literature. To numerically solve the non-convex problem with respect to two specific statistical mode…

    Submitted 11 November, 2024; v1 submitted 7 June, 2023; originally announced June 2023.

  27. arXiv:2306.03317  [pdf, other]

    stat.ME

    Robust Statistical Inference for Large-dimensional Matrix-valued Time Series via Iterative Huber Regression

    Authors: Yong He, Xin-Bing Kong, Dong Liu, Ran Zhao

    Abstract: Matrix factor model is drawing growing attention for simultaneous two-way dimension reduction of well-structured matrix-valued observations. This paper focuses on robust statistical inference for matrix factor model in the ``diverging dimension" regime. We derive the convergence rates of the robust estimators for loadings, factors and common components under finite second moment assumption of the…

    Submitted 5 June, 2023; originally announced June 2023.

  28. arXiv:2304.03928  [pdf]

    cs.LG stat.AP

    Interpretable machine learning-accelerated seed treatment by nanomaterials for environmental stress alleviation

    Authors: Hengjie Yu, Dan Luo, Sam F. Y. Li, Maozhen Qu, Da Liu, Yingchao He, Fang Cheng

    Abstract: Crops are constantly challenged by different environmental conditions. Seed treatment by nanomaterials is a cost-effective and environmentally-friendly solution for environmental stress mitigation in crop plants. Here, 56 seed nanopriming treatments are used to alleviate environmental stresses in maize. Seven selected nanopriming treatments significantly increase the stress resistance index (SRI)…

    Submitted 8 April, 2023; originally announced April 2023.

    Comments: 30 pages, 6 figures

  29. arXiv:2303.02817  [pdf, other]

    stat.ME

    Huber Principal Component Analysis for Large-dimensional Factor Models

    Authors: Yong He, Lingxiao Li, Dong Liu, Wen-Xin Zhou

    Abstract: Factor models have been widely used in economics and finance. However, the heavy-tailed nature of macroeconomic and financial data is often neglected in the existing literature. To address this issue and achieve robustness, we propose an approach to estimate factor loadings and scores by minimizing the Huber loss function, which is motivated by the equivalence of conventional Principal Component A…

    Submitted 29 March, 2023; v1 submitted 5 March, 2023; originally announced March 2023.

  30. arXiv:2302.09699  [pdf, ps, other]

    cs.LG cs.CR math.OC stat.ML

    Private (Stochastic) Non-Convex Optimization Revisited: Second-Order Stationary Points and Excess Risks

    Authors: Arun Ganesh, Daogao Liu, Sewoong Oh, Abhradeep Thakurta

    Abstract: We consider the problem of minimizing a non-convex objective while preserving the privacy of the examples in the training data. Building upon the previous variance-reduced algorithm SpiderBoost, we introduce a new framework that utilizes two different kinds of gradient oracles. The first kind of oracles can estimate the gradient of one point, and the second kind of oracles, less precise and more c…

    Submitted 19 February, 2023; originally announced February 2023.

  31. arXiv:2302.06085  [pdf, ps, other]

    cs.DS cs.CR cs.LG math.PR stat.CO

    Algorithmic Aspects of the Log-Laplace Transform and a Non-Euclidean Proximal Sampler

    Authors: Sivakanth Gopi, Yin Tat Lee, Daogao Liu, Ruoqi Shen, Kevin Tian

    Abstract: The development of efficient sampling algorithms catering to non-Euclidean geometries has been a challenging endeavor, as discretization techniques which succeed in the Euclidean setting do not readily carry over to more general settings. We develop a non-Euclidean analog of the recent proximal sampler of [LST21], which naturally induces regularization by an object known as the log-Laplace transfo…

    Submitted 22 February, 2023; v1 submitted 12 February, 2023; originally announced February 2023.

    Comments: Comments welcome! v2 improves constant in duality result, adds citations

  32. arXiv:2301.00457  [pdf, other]

    math.OC cs.CR cs.DS cs.LG stat.ML

    ReSQueing Parallel and Private Stochastic Convex Optimization

    Authors: Yair Carmon, Arun Jambulapati, Yujia Jin, Yin Tat Lee, Daogao Liu, Aaron Sidford, Kevin Tian

    Abstract: We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Stochastic Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density. Combining ReSQue with recent advances in ball oracle acceleration [CJJJLST20, ACJJS21], we develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings. For a SCO obj…

    Submitted 27 October, 2023; v1 submitted 1 January, 2023; originally announced January 2023.

  33. arXiv:2212.13558  [pdf, other]

    cs.LG stat.ML

    AER: Auto-Encoder with Regression for Time Series Anomaly Detection

    Authors: Lawrence Wong, Dongyu Liu, Laure Berti-Equille, Sarah Alnegheimish, Kalyan Veeramachaneni

    Abstract: Anomaly detection on time series data is increasingly common across various industrial domains that monitor metrics in order to prevent potential accidents and economic losses. However, a scarcity of labeled data and ambiguous definitions of anomalies can complicate these efforts. Recent unsupervised machine learning methods have made remarkable progress in tackling this problem using either singl…

    Submitted 27 December, 2022; originally announced December 2022.

    Comments: This work is accepted by IEEE BigData 2022. The paper contains 10 pages, 6 figures, and 4 tables

  34. arXiv:2210.06025  [pdf, other]

    stat.ME math.ST

    Bregman Divergence-Based Data Integration with Application to Polygenic Risk Score (PRS) Heterogeneity Adjustment

    Authors: Qinmengge Li, Matthew T. Patrick, Haihan Zhang, Chachrit Khunsriraksakul, Philip E. Stuart, Johann E. Gudjonsson, Rajan Nair, James T. Elder, Dajiang J. Liu, Jian Kang, Lam C. Tsoi, Kevin He

    Abstract: Polygenic risk scores (PRS) have recently received much attention for genetics risk prediction. While successful for the Caucasian population, the PRS based on the minority population suffer from small sample sizes, high dimensionality and low signal-to-noise ratios, exacerbating already severe health disparities. Due to population heterogeneity, direct trans-ethnic prediction by utilizing the Cau…

    Submitted 12 October, 2022; originally announced October 2022.

    Comments: 35 pages, 6 figures

  35. arXiv:2207.08347  [pdf, ps, other]

    cs.LG cs.CR math.OC stat.ML

    Private Convex Optimization in General Norms

    Authors: Sivakanth Gopi, Yin Tat Lee, Daogao Liu, Ruoqi Shen, Kevin Tian

    Abstract: We propose a new framework for differentially private optimization of convex functions which are Lipschitz in an arbitrary norm $\|\cdot\|$. Our algorithms are based on a regularized exponential mechanism which samples from the density $\propto \exp(-k(F+μr))$ where $F$ is the empirical loss and $r$ is a regularizer which is strongly convex with respect to $\|\cdot\|$, generalizing a recent work o…

    Submitted 10 November, 2022; v1 submitted 17 July, 2022; originally announced July 2022.

    Comments: SODA 2023

  36. arXiv:2207.04299  [pdf, other]

    stat.ME econ.EM

    Model diagnostics of discrete data regression: a unifying framework using functional residuals

    Authors: Zewei Lin, Dungang Liu

    Abstract: Model diagnostics is an indispensable component of regression analysis, yet it is not well addressed in standard textbooks on generalized linear models. The lack of exposition is attributed to the fact that when outcome data are discrete, classical methods (e.g., Pearson/deviance residual analysis and goodness-of-fit tests) have limited utility in model diagnostics and treatment. This paper establ…

    Submitted 9 July, 2022; originally announced July 2022.

    Comments: 54 pages, 22 figures

  37. arXiv:2207.00160  [pdf, other]

    cs.LG cs.CR stat.ML

    When Does Differentially Private Learning Not Suffer in High Dimensions?

    Authors: Xuechen Li, Daogao Liu, Tatsunori Hashimoto, Huseyin A. Inan, Janardhan Kulkarni, Yin Tat Lee, Abhradeep Guha Thakurta

    Abstract: Large pretrained models can be privately fine-tuned to achieve performance approaching that of non-private models. A common theme in these results is the surprising observation that high-dimensional models can achieve favorable privacy-utility trade-offs. This seemingly contradicts known results on the model-size dependence of differentially private convex learning and raises the following researc…

    Submitted 26 October, 2022; v1 submitted 30 June, 2022; originally announced July 2022.

    Comments: 26 pages; v3 includes additional experiments and clarification

  38. arXiv:2112.02792  [pdf, other]

    stat.ML cs.GT cs.LG

    Incentive Compatible Pareto Alignment for Multi-Source Large Graphs

    Authors: Jian Liang, Fangrui Lv, Di Liu, Zehui Dai, Xu Tian, Shuang Li, Fei Wang, Han Li

    Abstract: In this paper, we focus on learning effective entity matching models over multi-source large-scale data. For real applications, we relax typical assumptions that data distributions/spaces, or entity identities are shared between sources, and propose a Relaxed Multi-source Large-scale Entity-matching (RMLE) problem. Challenges of the problem include 1) how to align large-scale entities between sour…

    Submitted 6 December, 2021; originally announced December 2021.

  39. arXiv:2110.15877  [pdf]

    stat.ME

    Quality control, data cleaning, imputation

    Authors: Dawei Liu, Hanne I. Oberman, Johanna Muñoz, Jeroen Hoogland, Thomas P. A. Debray

    Abstract: This chapter addresses important steps during the quality assurance and control of RWD, with particular emphasis on the identification and handling of missing values. A gentle introduction is provided on common statistical and machine learning methods for imputation. We discuss the main strengths and weaknesses of each method, and compare their performance in a literature review. We motivate why t…

    Submitted 29 October, 2021; originally announced October 2021.

    Comments: This is a preprint of a book chapter for Springer

    MSC Class: 62D10 ACM Class: G.3; I.5.1; J.3

  40. arXiv:2110.04516  [pdf, other]

    stat.ME

    Simultaneous Cluster Structure Learning and Estimation of Heterogeneous Graphs for Matrix-variate fMRI Data

    Authors: Dong Liu, Changwei Zhao, Yong He, Lei Liu, Ying Guo, Xinsheng Zhang

    Abstract: Graphical models play an important role in neuroscience studies, particularly in brain connectivity analysis. Typically, observations/samples are from several heterogenous groups and the group membership of each observation/sample is unavailable, which poses a great challenge for graph structure learning. In this article, we propose a method which can achieve Simultaneous Clustering and Estimation…

    Submitted 9 October, 2021; originally announced October 2021.

  41. arXiv:2110.01729  [pdf, other]

    stat.ML cs.LG

    Stochastic tensor space feature theory with applications to robust machine learning

    Authors: Julio Enrique Castrillon-Candas, Dingning Liu, Sicheng Yang, Xiaoling Zhang, Mark Kon

    Abstract: In this paper we develop a Multilevel Orthogonal Subspace (MOS) Karhunen-Loeve feature theory based on stochastic tensor spaces, for the construction of robust machine learning features. Training data is treated as instances of a random field within a relevant Bochner space. Our key observation is that separate machine learning classes can reside predominantly in mostly distinct subspaces. Using t…

    Submitted 20 March, 2025; v1 submitted 4 October, 2021; originally announced October 2021.

    MSC Class: 62R10; 60G35; 62-08; 60G60; 65F25; 46B09

  42. arXiv:2106.14588  [pdf, other]

    math.OC cs.LG stat.ML

    The Convergence Rate of SGD's Final Iterate: Analysis on Dimension Dependence

    Authors: Daogao Liu, Zhou Lu

    Abstract: Stochastic Gradient Descent (SGD) is among the simplest and most popular methods in optimization. The convergence rate for SGD has been extensively studied and tight analyses have been established for the running average scheme, but the sub-optimality of the final iterate is still not well-understood. shamir2013stochastic gave the best known upper bound for the final iterate of SGD minimizing non-…

    Submitted 28 June, 2021; originally announced June 2021.

  43. arXiv:2106.03334  [pdf, other]

    stat.ME

    Joint Learning of Multiple Differential Networks with fMRI data for Brain Connectivity Alteration Detection

    Authors: Hao Chen, Ying Guo, Yong He, Dong Liu, Lei Liu, Xiao-Hua Zhou

    Abstract: In this study we focus on the problem of joint learning of multiple differential networks with function Magnetic Resonance Imaging (fMRI) data sets from multiple research centers. As the research centers may use different scanners and imaging parameters, joint learning of differential networks with fMRI data from different centers may reflect the underlying mechanism of neurological diseases from…

    Submitted 7 June, 2021; originally announced June 2021.

    Comments: arXiv admin note: text overlap with arXiv:2005.08457 by other authors

  44. arXiv:2105.03705  [pdf, other]

    cs.LG stat.ML

    Understanding Neural Networks with Logarithm Determinant Entropy Estimator

    Authors: Zhanghao Zhouyin, Ding Liu

    Abstract: Understanding the informative behaviour of deep neural networks is challenged by misused estimators and the complexity of network structure, which leads to inconsistent observations and diversified interpretation. Here we propose the LogDet estimator -- a reliable matrix-based entropy estimator that approximates Shannon differential entropy. We construct informative measurements based on LogDet es…

    Submitted 8 May, 2021; originally announced May 2021.

    Comments: 15 pages, 22 figures

  45. arXiv:2104.10029  [pdf, other]

    cs.CV eess.IV stat.AP

    Multiple Sclerosis Lesion Analysis in Brain Magnetic Resonance Images: Techniques and Clinical Applications

    Authors: Yang Ma, Chaoyi Zhang, Mariano Cabezas, Yang Song, Zihao Tang, Dongnan Liu, Weidong Cai, Michael Barnett, Chenyu Wang

    Abstract: Multiple sclerosis (MS) is a chronic inflammatory and degenerative disease of the central nervous system, characterized by the appearance of focal lesions in the white and gray matter that topographically correlate with an individual patient's neurological symptoms and signs. Magnetic resonance imaging (MRI) provides detailed in-vivo structural information, permitting the quantification and catego…

    Submitted 27 January, 2022; v1 submitted 20 April, 2021; originally announced April 2021.

    Comments: Accepted to appear in IEEE Journal of Biomedical And Health Informatics

  46. arXiv:2103.15352  [pdf, other]

    cs.LG cs.CR cs.DS stat.ML

    Private Non-smooth Empirical Risk Minimization and Stochastic Convex Optimization in Subquadratic Steps

    Authors: Janardhan Kulkarni, Yin Tat Lee, Daogao Liu

    Abstract: We study the differentially private Empirical Risk Minimization (ERM) and Stochastic Convex Optimization (SCO) problems for non-smooth convex functions. We get a (nearly) optimal bound on the excess empirical risk and excess population loss with subquadratic gradient complexity. More precisely, our differentially private algorithm requires $O(\frac{N^{3/2}}{d^{1/8}}+ \frac{N^2}{d})$ gradient queri…

    Submitted 29 March, 2021; v1 submitted 29 March, 2021; originally announced March 2021.

  47. arXiv:2011.07047  [pdf, other]

    stat.ME

    Nonparametric fusion learning: synthesize inferences from diverse sources using depth confidence distribution

    Authors: Dungang Liu, Regina Y. Liu, Minge Xie

    Abstract: Fusion learning refers to synthesizing inferences from multiple sources or studies to provide more effective inference and prediction than from any individual source or study alone. Most existing methods for synthesizing inferences rely on parametric model assumptions, such as normality, which often do not hold in practice. In this paper, we propose a general nonparametric fusion learning framewor…

    Submitted 13 November, 2020; originally announced November 2020.

    Comments: 47 pages, 10 figures

  48. arXiv:2010.00509  [pdf, other]

    cs.LG stat.ML

    Cardea: An Open Automated Machine Learning Framework for Electronic Health Records

    Authors: Sarah Alnegheimish, Najat Alrashed, Faisal Aleissa, Shahad Althobaiti, Dongyu Liu, Mansour Alsaleh, Kalyan Veeramachaneni

    Abstract: An estimated 180 papers focusing on deep learning and EHR were published between 2010 and 2018. Despite the common workflow structure appearing in these publications, no trusted and verified software framework exists, forcing researchers to arduously repeat previous work. In this paper, we propose Cardea, an extensible open-source automated machine learning framework encapsulating common predictio…

    Submitted 1 October, 2020; originally announced October 2020.

  49. arXiv:2009.07769  [pdf, other]

    cs.LG stat.ML

    TadGAN: Time Series Anomaly Detection Using Generative Adversarial Networks

    Authors: Alexander Geiger, Dongyu Liu, Sarah Alnegheimish, Alfredo Cuesta-Infante, Kalyan Veeramachaneni

    Abstract: Time series anomalies can offer information relevant to critical situations facing various fields, from finance and aerospace to the IT, security, and medical domains. However, detecting anomalies in time series data is particularly challenging due to the vague definition of anomalies and said data's frequent lack of labels and highly complex temporal correlations. Current state-of-the-art unsuper…

    Submitted 14 November, 2020; v1 submitted 16 September, 2020; originally announced September 2020.

    Comments: Alexander Geiger and Dongyu Liu contributed equally. To appear in the proceedings of IEEE International Conference on Big Data

  50. arXiv:2008.09055  [pdf, ps, other]

    math.OC stat.ML

    An Optimal Hybrid Variance-Reduced Algorithm for Stochastic Composite Nonconvex Optimization

    Authors: Deyi Liu, Lam M. Nguyen, Quoc Tran-Dinh

    Abstract: In this note we propose a new variant of the hybrid variance-reduced proximal gradient method in [7] to solve a common stochastic composite nonconvex optimization problem under standard assumptions. We simply replace the independent unbiased estimator in our hybrid-SARAH estimator introduced in [7] by the stochastic gradient evaluated at the same sample, leading to the identical momentum-SARAH es…

    Submitted 20 August, 2020; originally announced August 2020.

    Comments: 6 pages

    Report number: STOR-UNC-08-20-P4