
Showing 1–50 of 194 results for author: Lee, C

Searching in archive stat.
  1. arXiv:2504.16230  [pdf, other]

    stat.ME stat.AP

    Robust Causal Inference for EHR-based Studies of Point Exposures with Missingness in Eligibility Criteria

    Authors: Luke Benz, Rajarshi Mukherjee, Rui Wang, David Arterburn, Heidi Fischer, Catherine Lee, Susan M. Shortreed, Sebastien Haneuse, Alexander W. Levis

    Abstract: Missingness in variables that define study eligibility criteria is a seldom addressed challenge in electronic health record (EHR)-based settings. It is typically the case that patients with incomplete eligibility information are excluded from analysis without consideration of (implicit) assumptions that are being made, leaving study conclusions subject to potential selection bias. In an effort to…

    Submitted 22 April, 2025; originally announced April 2025.

  2. arXiv:2504.15269  [pdf, other]

    stat.ME

    Scalable and robust regression models for continuous proportional data

    Authors: Changwoo J. Lee, Benjamin K. Dahl, Otso Ovaskainen, David B. Dunson

    Abstract: Beta regression is used routinely for continuous proportional data, but it often encounters practical issues such as a lack of robustness of regression parameter estimates to misspecification of the beta distribution. We develop an improved class of generalized linear models starting with the continuous binomial (cobin) distribution and further extending to dispersion mixtures of cobin distributio…

    Submitted 21 April, 2025; originally announced April 2025.

  3. arXiv:2504.15251  [pdf, other]

    cs.LG cs.DS math.ST stat.ML

    On Learning Parallel Pancakes with Mostly Uniform Weights

    Authors: Ilias Diakonikolas, Daniel M. Kane, Sushrut Karmalkar, Jasper C. H. Lee, Thanasis Pittas

    Abstract: We study the complexity of learning $k$-mixtures of Gaussians ($k$-GMMs) on $\mathbb{R}^d$. This task is known to have complexity $d^{Ω(k)}$ in full generality. To circumvent this exponential lower bound on the number of components, research has focused on learning families of GMMs satisfying additional structural properties. A natural assumption posits that the component weights are not exponenti…

    Submitted 21 April, 2025; originally announced April 2025.

  4. arXiv:2504.12496  [pdf, ps, other]

    stat.ME

    Mean Independent Component Analysis for Multivariate Time Series

    Authors: Chung Eun Lee, Zeda Li

    Abstract: In this article, we introduce the mean independent component analysis for multivariate time series to reduce the parameter space. In particular, we seek for a contemporaneous linear transformation that detects univariate mean independent components so that each component can be modeled separately. The mean independent component analysis is flexible in the sense that no parametric model or distribu…

    Submitted 16 April, 2025; originally announced April 2025.

  5. arXiv:2503.05023  [pdf]

    stat.AP

    A Scorecard Model Using Survival Analysis Framework

    Authors: Cheng Lee, Hsi Lee

    Abstract: Credit risk assessment is a crucial aspect of financial decision-making, enabling institutions to predict the likelihood of default and make informed lending choices. Two prominent methodologies in risk modeling are logistic regression and survival analysis. Logistic regression is widely used for creating scorecard models due to its simplicity, interpretability, and effectiveness in estimating the…

    Submitted 6 March, 2025; originally announced March 2025.

  6. arXiv:2503.02645  [pdf, other]

    cs.LG stat.ML stat.OT

    A Generalized Theory of Mixup for Structure-Preserving Synthetic Data

    Authors: Chungpa Lee, Jongho Im, Joseph H. T. Kim

    Abstract: Mixup is a widely adopted data augmentation technique known for enhancing the generalization of machine learning models by interpolating between data points. Despite its success and popularity, limited attention has been given to understanding the statistical properties of the synthetic data it generates. In this paper, we delve into the theoretical underpinnings of mixup, specifically its effects…

    Submitted 3 March, 2025; originally announced March 2025.

    Journal ref: Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTATS) 2025

  7. CMHSU: An R Statistical Software Package to Detect Mental Health Status, Substance Use Status, and their Concurrent Status in the North American Healthcare Administrative Databases

    Authors: Mohsen Soltanifar, Chel Hee Lee

    Abstract: The concept of concurrent mental health and substance use (MHSU) and its detection in patients has garnered growing interest among psychiatrists and healthcare policymakers over the past four decades. Researchers have proposed various diagnostic methods, including the Data-Driven Diagnostic Method (DDDM), for the identification of MHSU. However, the absence of a standalone statistical software pac…

    Submitted 6 April, 2025; v1 submitted 10 January, 2025; originally announced January 2025.

    Comments: 21 pages; 7 figures; version 4.0

    Report number: 6(2), 50

    Journal ref: Psychiatry International, 2025

  8. arXiv:2501.04959  [pdf]

    econ.EM stat.CO

    DisSim-FinBERT: Text Simplification for Core Message Extraction in Complex Financial Texts

    Authors: Wonseong Kim, Christina Niklaus, Choong Lyol Lee, Siegfried Handschuh

    Abstract: This study proposes DisSim-FinBERT, a novel framework that integrates Discourse Simplification (DisSim) with Aspect-Based Sentiment Analysis (ABSA) to enhance sentiment prediction in complex financial texts. By simplifying intricate documents such as Federal Open Market Committee (FOMC) minutes, DisSim improves the precision of aspect identification, resulting in sentiment predictions that align m…

    Submitted 26 March, 2025; v1 submitted 8 January, 2025; originally announced January 2025.

    Comments: 28 pages, 5 figures, 2 tables

  9. arXiv:2412.04744  [pdf, other]

    stat.ME stat.AP

    Marginally interpretable spatial logistic regression with bridge processes

    Authors: Changwoo J. Lee, David B. Dunson

    Abstract: In including random effects to account for dependent observations, the odds ratio interpretation of logistic regression coefficients is changed from population-averaged to subject-specific. This is unappealing in many applications, motivating a rich literature on methods that maintain the marginal logistic regression structure without random effects, such as generalized estimating equations. Howev…

    Submitted 28 February, 2025; v1 submitted 5 December, 2024; originally announced December 2024.

  10. arXiv:2411.17013  [pdf, other]

    stat.ME math.ST

    Conditional Extremes with Graphical Models

    Authors: Aiden Farrell, Emma F. Eastoe, Clement Lee

    Abstract: Multivariate extreme value analysis quantifies the probability and magnitude of joint extreme events. River discharges from the upper Danube River basin provide a challenging dataset for such analysis because the data, which is measured on a spatial network, exhibits both asymptotic dependence and asymptotic independence. To account for both features, we extend the conditional multivariate extreme…

    Submitted 11 April, 2025; v1 submitted 25 November, 2024; originally announced November 2024.

  11. arXiv:2411.04286  [pdf]

    econ.EM stat.AP

    Bounded Rationality in Central Bank Communication

    Authors: Wonseong Kim, Choong Lyol Lee

    Abstract: This study explores the influence of FOMC sentiment on market expectations, focusing on cognitive differences between experts and non-experts. Using sentiment analysis of FOMC minutes, we integrate these insights into a bounded rationality model to examine the impact on inflation expectations. Results show that experts form more conservative expectations, anticipating FOMC stabilization actions, w…

    Submitted 6 November, 2024; originally announced November 2024.

    Comments: 72 pages, 5 figures, 8 tables

  12. arXiv:2410.21262  [pdf, other]

    cs.LG cs.AI stat.ML

    BLAST: Block-Level Adaptive Structured Matrices for Efficient Deep Neural Network Inference

    Authors: Changwoo Lee, Soo Min Kwon, Qing Qu, Hun-Seok Kim

    Abstract: Large-scale foundation models have demonstrated exceptional performance in language and vision tasks. However, the numerous dense matrix-vector operations involved in these large networks pose significant computational challenges during inference. To address these challenges, we introduce the Block-Level Adaptive STructured (BLAST) matrix, designed to learn and leverage efficient structures preval…

    Submitted 29 October, 2024; v1 submitted 28 October, 2024; originally announced October 2024.

  13. arXiv:2409.13190  [pdf, other]

    stat.ME

    Nonparametric Causal Survival Analysis with Clustered Interference

    Authors: Chanhwa Lee, Donglin Zeng, Michael Emch, John D. Clemens, Michael G. Hudgens

    Abstract: Inferring treatment effects on a survival time outcome based on data from an observational study is challenging due to the presence of censoring and possible confounding. An additional challenge occurs when a unit's treatment affects the outcome of other units, i.e., there is interference. In some settings, units may be grouped into clusters such that it is reasonable to assume interference only o…

    Submitted 19 September, 2024; originally announced September 2024.

  14. arXiv:2409.00291  [pdf, ps, other]

    stat.ME stat.AP

    Variable selection in the joint frailty model of recurrent and terminal events using Broken Adaptive Ridge regression

    Authors: Christian Chan, Fatemeh Mahmoudi, Chel Hee Lee, Quan Long, Xuewen Lu

    Abstract: We introduce a novel method to simultaneously perform variable selection and estimation in the joint frailty model of recurrent and terminal events using the Broken Adaptive Ridge Regression penalty. The BAR penalty can be summarized as an iteratively reweighted squared $L_2$-penalized regression, which approximates the $L_0$-regularization method. Our method allows for the number of covariates to…

    Submitted 30 August, 2024; originally announced September 2024.

  15. arXiv:2408.13642  [pdf, ps, other]

    stat.AP math.ST stat.ME

    Change Point Detection in Pairwise Comparison Data with Covariates

    Authors: Yi Han, Thomas C. M. Lee

    Abstract: This paper introduces the novel piecewise stationary covariate-assisted ranking estimation (PS-CARE) model for analyzing time-evolving pairwise comparison data, enhancing item ranking accuracy through the integration of covariate information. By partitioning the data into distinct, stationary segments, the PS-CARE model adeptly detects temporal shifts in item rankings, known as change points, whos…

    Submitted 24 August, 2024; originally announced August 2024.

  16. arXiv:2407.11342  [pdf, other]

    stat.ME

    GenTwoArmsTrialSize: An R Statistical Software Package to estimate Generalized Two Arms Randomized Clinical Trial Sample Size

    Authors: Mohsen Soltanifar, Chel Hee Lee, Amin Shirazi, Martha Behnke, Ilfra Raymond-Loher, Getachew A. Dagne

    Abstract: The precise calculation of sample sizes is a crucial aspect in the design of clinical trials particularly for pharmaceutical statisticians. While various R statistical software packages have been developed by researchers to estimate required sample sizes under different assumptions, there has been a notable absence of a standalone R statistical software package that allows researchers to comprehen…

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: 33 pages, 2 figures, 2 tables

    MSC Class: 62-04; 62-08; 62K05

  17. arXiv:2407.02681  [pdf, other]

    cs.LG eess.IV math.OC stat.ML

    Uniform Transformation: Refining Latent Representation in Variational Autoencoders

    Authors: Ye Shi, C. S. George Lee

    Abstract: Irregular distribution in latent space causes posterior collapse, misalignment between posterior and prior, and ill-sampling problem in Variational Autoencoders (VAEs). In this paper, we introduce a novel adaptable three-stage Uniform Transformation (UT) module -- Gaussian Kernel Density Estimation (G-KDE) clustering, non-parametric Gaussian Mixture (GM) Modeling, and Probability Integral Transfor…

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: Accepted by 2024 IEEE 20th International Conference on Automation Science and Engineering

  18. arXiv:2406.16830  [pdf, other]

    stat.ME stat.AP

    Adjusting for Selection Bias Due to Missing Eligibility Criteria in Emulated Target Trials

    Authors: Luke Benz, Rajarshi Mukherjee, Rui Wang, David Arterburn, Catherine Lee, Heidi Fischer, Susan Shortreed, Sebastien Haneuse

    Abstract: Target trial emulation (TTE) is a popular framework for observational studies based on electronic health records (EHR). A key component of this framework is determining the patient population eligible for inclusion in both a target trial of interest and its observational emulation. Missingness in variables that define eligibility criteria, however, presents a major challenge towards determining th…

    Submitted 4 October, 2024; v1 submitted 24 June, 2024; originally announced June 2024.

  19. arXiv:2406.10087  [pdf]

    cs.LG cs.AI stat.ML

    Biomarker based Cancer Classification using an Ensemble with Pre-trained Models

    Authors: Chongmin Lee, Jihie Kim

    Abstract: Certain cancer types, namely pancreatic cancer is difficult to detect at an early stage; sparking the importance of discovering the causal relationship between biomarkers and cancer to identify cancer efficiently. By allowing for the detection and monitoring of specific biomarkers through a non-invasive method, liquid biopsies enhance the precision and efficacy of medical interventions, advocating…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted to the AIAA Workshop at IJCAI 2024

  20. arXiv:2404.17709  [pdf, other]

    stat.ML cs.LG

    Low-rank Matrix Bandits with Heavy-tailed Rewards

    Authors: Yue Kang, Cho-Jui Hsieh, Thomas C. M. Lee

    Abstract: In stochastic low-rank matrix bandit, the expected reward of an arm is equal to the inner product between its feature matrix and some unknown $d_1$ by $d_2$ low-rank parameter matrix $Θ^*$ with rank $r \ll d_1\wedge d_2$. While all prior studies assume the payoffs are mixed with sub-Gaussian noises, in this work we loosen this strict assumption and consider the new problem of \underline{low}-rank…

    Submitted 26 April, 2024; originally announced April 2024.

    Comments: The 40th Conference on Uncertainty in Artificial Intelligence (UAI 2024)

  21. arXiv:2404.16166  [pdf, other]

    stat.ME stat.AP

    Double Robust Variance Estimation with Parametric Working Models

    Authors: Bonnie E. Shook-Sa, Paul N. Zivich, Chanhwa Lee, Keyi Xue, Rachael K. Ross, Jessie K. Edwards, Jeffrey S. A. Stringer, Stephen R. Cole

    Abstract: Doubly robust estimators have gained popularity in the field of causal inference due to their ability to provide consistent point estimates when either an outcome or exposure model is correctly specified. However, for nonrandomized exposures the influence function based variance estimator frequently used with doubly robust estimators of the average causal effect is only consistent when both workin…

    Submitted 4 November, 2024; v1 submitted 24 April, 2024; originally announced April 2024.

    Comments: 32 pages, 14 figures, 13 tables

  22. AutoGFI: Streamlined Generalized Fiducial Inference for Modern Inference Problems in Models with Additive Errors

    Authors: Wei Du, Jan Hannig, Thomas C. M. Lee, Yi Su, Chunzhe Zhang

    Abstract: The concept of fiducial inference was introduced by R. A. Fisher in the 1930s to address the perceived limitations of Bayesian inference, particularly the need for subjective prior distributions in cases with limited prior information. However, Fisher's fiducial approach lost favor due to complications, especially in multi-parameter problems. With renewed interest in fiducial inference in the 2000…

    Submitted 24 December, 2024; v1 submitted 11 April, 2024; originally announced April 2024.

  23. arXiv:2402.07048  [pdf, other]

    stat.ME stat.ML

    Logistic-beta processes for dependent random probabilities with beta marginals

    Authors: Changwoo J. Lee, Alessandro Zito, Huiyan Sang, David B. Dunson

    Abstract: The beta distribution serves as a canonical tool for modeling probabilities in statistics and machine learning. However, there is limited work on flexible and computationally convenient stochastic process extensions for modeling dependent random probabilities. We propose a novel stochastic process called the logistic-beta process, whose logistic transformation yields a stochastic process with comm…

    Submitted 16 March, 2025; v1 submitted 10 February, 2024; originally announced February 2024.

  24. arXiv:2401.07298  [pdf, other]

    stat.ML cs.LG

    Efficient Frameworks for Generalized Low-Rank Matrix Bandit Problems

    Authors: Yue Kang, Cho-Jui Hsieh, Thomas C. M. Lee

    Abstract: In the stochastic contextual low-rank matrix bandit problem, the expected reward of an action is given by the inner product between the action's feature matrix and some fixed, but initially unknown $d_1$ by $d_2$ matrix $Θ^*$ with rank $r \ll \{d_1, d_2\}$, and an agent sequentially takes actions based on past experience to maximize the cumulative reward. In this paper, we study the generalized lo…

    Submitted 14 January, 2024; originally announced January 2024.

    Comments: Revision of the paper accepted by NeurIPS 2022

  25. arXiv:2401.00634  [pdf, other]

    stat.ME stat.AP

    A scalable two-stage Bayesian approach accounting for exposure measurement error in environmental epidemiology

    Authors: Changwoo J. Lee, Elaine Symanski, Amal Rammah, Dong Hun Kang, Philip K. Hopke, Eun Sug Park

    Abstract: Accounting for exposure measurement errors has been recognized as a crucial problem in environmental epidemiology for over two decades. Bayesian hierarchical models offer a coherent probabilistic framework for evaluating associations between environmental exposures and health effects, which take into account exposure measurement errors introduced by uncertainty in the estimated exposure as well as…

    Submitted 13 January, 2024; v1 submitted 31 December, 2023; originally announced January 2024.

    Comments: 34 pages, 8 figures

  26. arXiv:2312.11769  [pdf, other]

    cs.LG cs.DS cs.IT math.ST stat.ML

    Clustering Mixtures of Bounded Covariance Distributions Under Optimal Separation

    Authors: Ilias Diakonikolas, Daniel M. Kane, Jasper C. H. Lee, Thanasis Pittas

    Abstract: We study the clustering problem for mixtures of bounded covariance distributions, under a fine-grained separation assumption. Specifically, given samples from a $k$-component mixture distribution $D = \sum_{i =1}^k w_i P_i$, where each $w_i \ge α$ for some known parameter $α$, and each $P_i$ has unknown covariance $Σ_i \preceq σ^2_i \cdot I_d$ for some unknown $σ_i$, the goal is to cluster the sam…

    Submitted 18 December, 2023; originally announced December 2023.

  27. arXiv:2311.13347  [pdf, other]

    stat.ME

    Loss-based Objective and Penalizing Priors for Model Selection Problems

    Authors: Changwoo J. Lee

    Abstract: Many Bayesian model selection problems, such as variable selection or cluster analysis, start by setting prior model probabilities on a structured model space. Based on a chosen loss function between models, model selection is often performed with a Bayes estimator that minimizes the posterior expected loss. The prior model probabilities and the choice of loss both highly affect the model selectio…

    Submitted 22 November, 2023; originally announced November 2023.

    Comments: 31 pages, 3 figures

  28. arXiv:2311.12784  [pdf, ps, other]

    math.ST cs.IT cs.LG stat.ML

    Optimality in Mean Estimation: Beyond Worst-Case, Beyond Sub-Gaussian, and Beyond $1+α$ Moments

    Authors: Trung Dang, Jasper C. H. Lee, Maoyuan Song, Paul Valiant

    Abstract: There is growing interest in improving our algorithmic understanding of fundamental statistical problems such as mean estimation, driven by the goal of understanding the limits of what we can extract from valuable data. The state of the art results for mean estimation in $\mathbb{R}$ are 1) the optimal sub-Gaussian mean estimator by [LV22], with the tight sub-Gaussian constant for all distribution…

    Submitted 21 November, 2023; originally announced November 2023.

    Comments: 27 pages, to appear in NeurIPS 2023. Abstract shortened to fit arXiv limit

  29. Causal Quantile Treatment Effects with missing data by double-sampling

    Authors: Shuo Sun, Sebastien Haneuse, Alexander W. Levis, Catherine Lee, David E Arterburn, Heidi Fischer, Susan Shortreed, Rajarshi Mukherjee

    Abstract: Causal weighted quantile treatment effects (WQTE) are a useful complement to standard causal contrasts that focus on the mean when interest lies at the tails of the counterfactual distribution. To-date, however, methods for estimation and inference regarding causal WQTEs have assumed complete data on all relevant factors. In most practical settings, however, data will be missing or incomplete data…

    Submitted 14 March, 2025; v1 submitted 13 October, 2023; originally announced October 2023.

  30. Development and validation of an interpretable machine learning-based calculator for predicting 5-year weight trajectories after bariatric surgery: a multinational retrospective cohort SOPHIA study

    Authors: Patrick Saux, Pierre Bauvin, Violeta Raverdy, Julien Teigny, Hélène Verkindt, Tomy Soumphonphakdy, Maxence Debert, Anne Jacobs, Daan Jacobs, Valerie Monpellier, Phong Ching Lee, Chin Hong Lim, Johanna C Andersson-Assarsson, Lena Carlsson, Per-Arne Svensson, Florence Galtier, Guelareh Dezfoulian, Mihaela Moldovanu, Severine Andrieux, Julien Couster, Marie Lepage, Erminia Lembo, Ornella Verrastro, Maud Robert, Paulina Salminen, et al. (9 additional authors not shown)

    Abstract: Background Weight loss trajectories after bariatric surgery vary widely between individuals, and predicting weight loss before the operation remains challenging. We aimed to develop a model using machine learning to provide individual preoperative prediction of 5-year weight loss trajectories after surgery. Methods In this multinational retrospective observational study we enrolled adult participa…

    Submitted 31 August, 2023; originally announced August 2023.

    Comments: The Lancet Digital Health, 2023

  31. arXiv:2306.16573  [pdf, other]

    math.ST cs.IT cs.LG math.PR stat.ML

    Finite-Sample Symmetric Mean Estimation with Fisher Information Rate

    Authors: Shivam Gupta, Jasper C. H. Lee, Eric Price

    Abstract: The mean of an unknown variance-$σ^2$ distribution $f$ can be estimated from $n$ samples with variance $\frac{σ^2}{n}$ and nearly corresponding subgaussian rate. When $f$ is known up to translation, this can be improved asymptotically to $\frac{1}{n\mathcal I}$, where $\mathcal I$ is the Fisher information of the distribution. Such an improvement is not possible for general unknown $f$, but [Stone…

    Submitted 28 June, 2023; originally announced June 2023.

    Comments: COLT 2023

  32. arXiv:2305.18543  [pdf, other]

    cs.LG stat.ML

    Robust Lipschitz Bandits to Adversarial Corruptions

    Authors: Yue Kang, Cho-Jui Hsieh, Thomas C. M. Lee

    Abstract: Lipschitz bandit is a variant of stochastic bandits that deals with a continuous arm set defined on a metric space, where the reward function is subject to a Lipschitz constraint. In this paper, we introduce a new problem of Lipschitz bandits in the presence of adversarial corruptions where an adaptive adversary corrupts the stochastic rewards up to a total budget $C$. The budget is measured by th…

    Submitted 8 October, 2023; v1 submitted 29 May, 2023; originally announced May 2023.

    Comments: Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023)

  33. arXiv:2305.15267  [pdf, other]

    cs.LG stat.ML

    Training Energy-Based Normalizing Flow with Score-Matching Objectives

    Authors: Chen-Hao Chao, Wei-Fang Sun, Yen-Chang Hsu, Zsolt Kira, Chun-Yi Lee

    Abstract: In this paper, we establish a connection between the parameterization of flow-based and energy-based generative models, and present a new flow-based modeling approach called energy-based normalizing flow (EBFlow). We demonstrate that by optimizing EBFlow with score-matching objectives, the computation of Jacobian determinants for linear transformations can be entirely bypassed. This feature enable…

    Submitted 28 October, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: Published at NeurIPS 2023. Code: https://github.com/chen-hao-chao/ebflow

  34. arXiv:2305.08942  [pdf, other]

    stat.ME physics.data-an stat.AP

    Probabilistic forecast of nonlinear dynamical systems with uncertainty quantification

    Authors: Mengyang Gu, Yizi Lin, Victor Chang Lee, Diana Qiu

    Abstract: Data-driven modeling is useful for reconstructing nonlinear dynamical systems when the underlying process is unknown or too expensive to compute. Having reliable uncertainty assessment of the forecast enables tools to be deployed to predict new scenarios unobserved before. In this work, we first extend parallel partial Gaussian processes for predicting the vector-valued transition function that li…

    Submitted 30 October, 2023; v1 submitted 15 May, 2023; originally announced May 2023.

    Journal ref: Physica D: Nonlinear Phenomena, 133938 (2023)

  35. arXiv:2305.00966  [pdf, other]

    cs.DS cs.LG math.ST stat.ML

    A Spectral Algorithm for List-Decodable Covariance Estimation in Relative Frobenius Norm

    Authors: Ilias Diakonikolas, Daniel M. Kane, Jasper C. H. Lee, Ankit Pensia, Thanasis Pittas

    Abstract: We study the problem of list-decodable Gaussian covariance estimation. Given a multiset $T$ of $n$ points in $\mathbb R^d$ such that an unknown $α<1/2$ fraction of points in $T$ are i.i.d. samples from an unknown Gaussian $\mathcal{N}(μ, Σ)$, the goal is to output a list of $O(1/α)$ hypotheses at least one of which is close to $Σ$ in relative Frobenius norm. Our main result is a…

    Submitted 1 May, 2023; originally announced May 2023.

  36. arXiv:2304.04043  [pdf, other]

    stat.ME math.ST stat.ML

    Statistical and computational rates in high rank tensor estimation

    Authors: Chanwoo Lee, Miaoyan Wang

    Abstract: Higher-order tensor datasets arise commonly in recommendation systems, neuroimaging, and social networks. Here we develop probable methods for estimating a possibly high rank signal tensor from noisy observations. We consider a generative latent variable tensor model that incorporates both high rank and low rank models, including but not limited to, simple hypergraphon models, single index models,…

    Submitted 8 April, 2023; originally announced April 2023.

    Comments: 38 pages, 8 figures

  37. arXiv:2303.04286  [pdf, other]

    stat.ME cs.LG stat.ML

    Sufficient dimension reduction for feature matrices

    Authors: Chanwoo Lee

    Abstract: We address the problem of sufficient dimension reduction for feature matrices, which arises often in sensor network localization, brain neuroimaging, and electroencephalography analysis. In general, feature matrices have both row- and column-wise interpretations and contain structural information that can be lost with naive vectorization approaches. To address this, we propose a method called prin…

    Submitted 7 March, 2023; originally announced March 2023.

    Comments: 30 pages, 3 figures

  38. arXiv:2302.09440  [pdf, other]

    cs.LG stat.ML

    Online Continuous Hyperparameter Optimization for Generalized Linear Contextual Bandits

    Authors: Yue Kang, Cho-Jui Hsieh, Thomas C. M. Lee

    Abstract: In stochastic contextual bandits, an agent sequentially makes actions from a time-dependent action set based on past experience to minimize the cumulative regret. Like many other machine learning algorithms, the performance of bandits heavily depends on the values of hyperparameters, and theoretically derived parameter values may lead to unsatisfactory results in practice. Moreover, it is infeasib…

    Submitted 8 April, 2024; v1 submitted 18 February, 2023; originally announced February 2023.

    Comments: Published in Transactions on Machine Learning Research (TMLR)

  39. arXiv:2302.02497  [pdf, other]

    math.ST cs.IT cs.LG math.PR stat.ML

    High-dimensional Location Estimation via Norm Concentration for Subgamma Vectors

    Authors: Shivam Gupta, Jasper C. H. Lee, Eric Price

    Abstract: In location estimation, we are given $n$ samples from a known distribution $f$ shifted by an unknown translation $λ$, and want to estimate $λ$ as precisely as possible. Asymptotically, the maximum likelihood estimate achieves the Cramér-Rao bound of error $\mathcal N(0, \frac{1}{n\mathcal I})$, where $\mathcal I$ is the Fisher information of $f$. However, the $n$ required for convergence depends o…

    Submitted 5 February, 2023; originally announced February 2023.

  40. arXiv:2302.00951  [pdf, other]

    stat.AP

    A Bayesian analysis of current duration data with reporting issues: an application to estimating the distribution of time-between-sex from time-since-last-sex data as collected in cross-sectional surveys in low- and middle-income countries

    Authors: Chi Hyun Lee, Herbert Susmann, Leontine Alkema

    Abstract: Aggregate measures of family planning are used to monitor demand for and usage of contraceptive methods in populations globally, for example as part of the FP2030 initiative. Family planning measures for low- and middle-income countries are typically based on data collected through cross-sectional household surveys. Recently proposed measures account for sexual activity through assessment of the d…

    Submitted 2 February, 2023; originally announced February 2023.

  41. arXiv:2301.07513  [pdf, other]

    stat.ME stat.CO

    A Bayesian Nonparametric Stochastic Block Model for Directed Acyclic Graphs

    Authors: Clement Lee, Marco Battiston

    Abstract: Random graphs have been widely used in statistics, for example in network and social interaction analysis. In some applications, data may contain an inherent hierarchical ordering among its vertices, which prevents any directed edge between pairs of vertices that do not respect this order. For example, in bibliometrics, older papers cannot cite newer ones. In such situations, the resulting graph f…

    Submitted 10 December, 2024; v1 submitted 18 January, 2023; originally announced January 2023.

    Comments: 27 pages, 10 figures, 3 tables

  42. arXiv:2301.04857  [pdf, other]

    cs.AI stat.ME

    Neural Spline Search for Quantile Probabilistic Modeling

    Authors: Ruoxi Sun, Chun-Liang Li, Sercan O. Arik, Michael W. Dusenberry, Chen-Yu Lee, Tomas Pfister

    Abstract: Accurate estimation of output quantiles is crucial in many use cases, where it is desired to model the range of possibility. Modeling target distribution at arbitrary quantile levels and at arbitrary input attribute levels are important to offer a comprehensive picture of the data, and requires the quantile function to be expressive enough. The quantile function describing the target distribution…

    Submitted 12 January, 2023; originally announced January 2023.

  43. arXiv:2212.10959  [pdf, other]

    stat.ME

    Efficient Nonparametric Estimation of Stochastic Policy Effects with Clustered Interference

    Authors: Chanhwa Lee, Donglin Zeng, Michael G. Hudgens

    Abstract: Interference occurs when a unit's treatment (or exposure) affects another unit's outcome. In some settings, units may be grouped into clusters such that it is reasonable to assume that interference, if present, only occurs between individuals in the same cluster, i.e., there is clustered interference. Various causal estimands have been proposed to quantify treatment effects under clustered interfe…

    Submitted 23 August, 2023; v1 submitted 21 December, 2022; originally announced December 2022.

  44. arXiv:2211.16333  [pdf, ps, other]

    cs.DS cs.LG math.ST stat.ML

    Outlier-Robust Sparse Mean Estimation for Heavy-Tailed Distributions

    Authors: Ilias Diakonikolas, Daniel M. Kane, Jasper C. H. Lee, Ankit Pensia

    Abstract: We study the fundamental task of outlier-robust mean estimation for heavy-tailed distributions in the presence of sparsity. Specifically, given a small number of corrupted samples from a high-dimensional heavy-tailed distribution whose mean $μ$ is guaranteed to be sparse, the goal is to efficiently compute a hypothesis that accurately approximates $μ$ with high probability. Prior work had obtained…

    Submitted 29 November, 2022; originally announced November 2022.

    Comments: To appear in NeurIPS 2022

  45. arXiv:2211.03763  [pdf, ps, other]

    stat.AP

    Spatial distribution and determinants of childhood vaccination refusal in the United States

    Authors: Bokgyeong Kang, Sandra Goldlust, Elizabeth C. Lee, John Hughes, Shweta Bansal, Murali Haran

    Abstract: Parental refusal and delay of childhood vaccination has increased in recent years in the United States. This phenomenon challenges maintenance of herd immunity and increases the risk of outbreaks of vaccine-preventable diseases. We examine US county-level vaccine refusal for patients under five years of age collected during the period 2012--2015 from an administrative healthcare dataset. We model…

    Submitted 15 March, 2023; v1 submitted 7 November, 2022; originally announced November 2022.

  46. arXiv:2207.13676  [pdf, other]

    cs.LG cs.DC stat.ML

    Open Source Vizier: Distributed Infrastructure and API for Reliable and Flexible Blackbox Optimization

    Authors: Xingyou Song, Sagi Perel, Chansoo Lee, Greg Kochanski, Daniel Golovin

    Abstract: Vizier is the de-facto blackbox and hyperparameter optimization service across Google, having optimized some of Google's largest products and research efforts. To operate at the scale of tuning thousands of users' critical systems, Google Vizier solved key design challenges in providing multiple different features, while remaining fully fault-tolerant. In this paper, we introduce Open Source (OSS)…

    Submitted 10 January, 2023; v1 submitted 27 July, 2022; originally announced July 2022.

    Comments: Published as a conference paper for the systems track at the 1st International Conference on Automated Machine Learning (AutoML-Conf 2022). Code can be found at https://github.com/google/vizier

  47. arXiv:2207.03084  [pdf, other]

    cs.LG cs.AI stat.ML

    Pre-training helps Bayesian optimization too

    Authors: Zi Wang, George E. Dahl, Kevin Swersky, Chansoo Lee, Zelda Mariet, Zachary Nado, Justin Gilmer, Jasper Snoek, Zoubin Ghahramani

    Abstract: Bayesian optimization (BO) has become a popular strategy for global optimization of many expensive real-world functions. Contrary to a common belief that BO is suited to optimizing black-box functions, it actually requires domain knowledge on characteristics of those functions to deploy BO successfully. Such domain knowledge often manifests in Gaussian process priors that specify initial beliefs o…

    Submitted 7 July, 2022; originally announced July 2022.

    Comments: ICML2022 Workshop on Adaptive Experimental Design and Active Learning in the Real World. arXiv admin note: substantial text overlap with arXiv:2109.08215

  48. arXiv:2207.00689  [pdf, other]

    stat.ME stat.CO

    Rapidly Mixing Multiple-try Metropolis Algorithms for Model Selection Problems

    Authors: Hyunwoong Chang, Changwoo J. Lee, Zhao Tang Luo, Huiyan Sang, Quan Zhou

    Abstract: The multiple-try Metropolis (MTM) algorithm is an extension of the Metropolis-Hastings (MH) algorithm by selecting the proposed state among multiple trials according to some weight function. Although MTM has gained great popularity owing to its faster empirical convergence and mixing than the standard MH algorithm, its theoretical mixing property is rarely studied in the literature due to its comp…

    Submitted 14 October, 2022; v1 submitted 1 July, 2022; originally announced July 2022.

    Comments: Accepted to Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS 2022)

  49. arXiv:2206.02348  [pdf, other]

    math.ST cs.DS cs.IT cs.LG stat.ML

    Finite-Sample Maximum Likelihood Estimation of Location

    Authors: Shivam Gupta, Jasper C. H. Lee, Eric Price, Paul Valiant

    Abstract: We consider 1-dimensional location estimation, where we estimate a parameter $λ$ from $n$ samples $λ + η_i$, with each $η_i$ drawn i.i.d. from a known distribution $f$. For fixed $f$ the maximum-likelihood estimate (MLE) is well-known to be optimal in the limit as $n \to \infty$: it is asymptotically normal with variance matching the Cramér-Rao lower bound of $\frac{1}{n\mathcal{I}}$, where…

    Submitted 18 July, 2022; v1 submitted 6 June, 2022; originally announced June 2022.

    Comments: Corrected an inaccuracy in the description of the experimental setup. Also updated funding acknowledgements

  50. arXiv:2205.13320  [pdf, other]

    cs.LG cs.AI stat.ML

    Towards Learning Universal Hyperparameter Optimizers with Transformers

    Authors: Yutian Chen, Xingyou Song, Chansoo Lee, Zi Wang, Qiuyi Zhang, David Dohan, Kazuya Kawakami, Greg Kochanski, Arnaud Doucet, Marc'aurelio Ranzato, Sagi Perel, Nando de Freitas

    Abstract: Meta-learning hyperparameter optimization (HPO) algorithms from prior experiments is a promising approach to improve optimization efficiency over objective functions from a similar distribution. However, existing methods are restricted to learning from experiments sharing the same set of hyperparameters. In this paper, we introduce the OptFormer, the first text-based Transformer HPO framework that…

    Submitted 13 October, 2022; v1 submitted 26 May, 2022; originally announced May 2022.

    Comments: Published as a conference paper in Neural Information Processing Systems (NeurIPS) 2022. Code can be found in https://github.com/google-research/optformer and Google AI Blog can be found in https://ai.googleblog.com/2022/08/optformer-towards-universal.html