Showing 1–14 of 14 results for author: Gao, L L

Searching in archive stat.
  1. arXiv:2502.09957  [pdf, other]

    stat.ME stat.ML

    Thinning a Wishart Random Matrix

    Authors: Ameer Dharamshi, Anna Neufeld, Lucy L. Gao, Daniela Witten, Jacob Bien

    Abstract: Recent work has explored data thinning, a generalization of sample splitting that involves decomposing a (possibly matrix-valued) random variable into independent components. In the special case of an $n \times p$ random matrix with independent and identically distributed $N_p(μ, Σ)$ rows, Dharamshi et al. (2024a) provides a comprehensive analysis of the settings in which thinning is or is not poss…

    Submitted 14 February, 2025; originally announced February 2025.

  2. arXiv:2409.11497  [pdf, other]

    stat.ME stat.ML

    Decomposing Gaussians with Unknown Covariance

    Authors: Ameer Dharamshi, Anna Neufeld, Lucy L. Gao, Jacob Bien, Daniela Witten

    Abstract: Common workflows in machine learning and statistics rely on the ability to partition the information in a data set into independent portions. Recent work has shown that this may be possible even when conventional sample splitting is not (e.g., when the number of samples $n=1$, or when observations are not independent and identically distributed). However, the approaches that are currently availabl…

    Submitted 17 September, 2024; originally announced September 2024.

  3. arXiv:2409.03069  [pdf, other]

    stat.ME

    Discussion of "Data fission: splitting a single data point"

    Authors: Anna Neufeld, Ameer Dharamshi, Lucy L. Gao, Daniela Witten, Jacob Bien

    Abstract: Leiner et al. [2023] introduce an important generalization of sample splitting, which they call data fission. They consider two cases of data fission: P1 fission and P2 fission. While P1 fission is extremely useful and easy to use, Leiner et al. [2023] provide P1 fission operations only for the Gaussian and the Poisson distributions. They provide little guidance on how to apply P2 fission operatio…

    Submitted 4 September, 2024; originally announced September 2024.

    Comments: 18 pages, 1 figure

  4. arXiv:2311.16375  [pdf, other]

    stat.ME q-bio.QM stat.AP

    Testing for a difference in means of a single feature after clustering

    Authors: Yiqun T. Chen, Lucy L. Gao

    Abstract: For many applications, it is critical to interpret and validate groups of observations obtained via clustering. A common validation approach involves testing differences in feature means between observations in two estimated clusters. In this setting, classical hypothesis tests lead to an inflated Type I error rate. To overcome this problem, we propose a new test for the difference in means in a s…

    Submitted 27 November, 2023; originally announced November 2023.

    MSC Class: 62H30; 62H15; 62P10

  5. arXiv:2307.12985  [pdf, other]

    stat.ME stat.AP

    Negative binomial count splitting for single-cell RNA sequencing data

    Authors: Anna Neufeld, Joshua Popp, Lucy L. Gao, Alexis Battle, Daniela Witten

    Abstract: The analysis of single-cell RNA sequencing (scRNA-seq) data often involves fitting a latent variable model to learn a low-dimensional representation for the cells. Validating such a model poses a major challenge. If we could sequence the same set of cells twice, we could use one dataset to fit a latent variable model and the other to validate it. In reality, we cannot sequence the same set of cell…

    Submitted 24 July, 2023; originally announced July 2023.

  6. arXiv:2303.12931  [pdf, other]

    stat.ME math.ST stat.ML

    Generalized Data Thinning Using Sufficient Statistics

    Authors: Ameer Dharamshi, Anna Neufeld, Keshav Motwani, Lucy L. Gao, Daniela Witten, Jacob Bien

    Abstract: Our goal is to develop a general strategy to decompose a random variable $X$ into multiple independent random variables, without sacrificing any information about unknown parameters. A recent paper showed that for some well-known natural exponential families, $X$ can be "thinned" into independent random variables $X^{(1)}, \ldots, X^{(K)}$, such that $X = \sum_{k=1}^K X^{(k)}$. These independent r…

    Submitted 11 June, 2023; v1 submitted 22 March, 2023; originally announced March 2023.

  7. arXiv:2303.04746  [pdf, other]

    stat.ME

    Necessary and sufficient conditions for multiple objective optimal regression designs

    Authors: Lucy L. Gao, Jane J. Ye, Shangzhi Zeng, Julie Zhou

    Abstract: We typically construct optimal designs based on a single objective function. To better capture the breadth of an experiment's goals, we could instead construct a multiple objective optimal design based on multiple objective functions. While algorithms have been developed to find multi-objective optimal designs (e.g. efficiency-constrained and maximin optimal designs), it is far less clear how to v…

    Submitted 8 March, 2023; originally announced March 2023.

  8. arXiv:2301.07276  [pdf, other]

    stat.ME stat.ML

    Data thinning for convolution-closed distributions

    Authors: Anna Neufeld, Ameer Dharamshi, Lucy L. Gao, Daniela Witten

    Abstract: We propose data thinning, an approach for splitting an observation into two or more independent parts that sum to the original observation, and that follow the same distribution as the original observation, up to a (known) scaling of a parameter. This very general proposal is applicable to any convolution-closed distribution, a class that includes the Gaussian, Poisson, negative binomial, gamma, a…

    Submitted 20 November, 2023; v1 submitted 17 January, 2023; originally announced January 2023.

  9. arXiv:2207.00554  [pdf, ps, other]

    stat.ME stat.AP

    Inference after latent variable estimation for single-cell RNA sequencing data

    Authors: Anna Neufeld, Lucy L. Gao, Joshua Popp, Alexis Battle, Daniela Witten

    Abstract: In the analysis of single-cell RNA sequencing data, researchers often characterize the variation between cells by estimating a latent variable, such as cell type or pseudotime, representing some aspect of the individual cell's state. They then test each gene for association with the estimated latent variable. If the same data are used for both of these steps, then standard methods for computing p-…

    Submitted 18 October, 2022; v1 submitted 1 July, 2022; originally announced July 2022.

    Comments: 43 pages, 7 figures

  10. arXiv:2106.07816  [pdf, other]

    stat.ME stat.ML

    Tree-Values: selective inference for regression trees

    Authors: Anna C. Neufeld, Lucy L. Gao, Daniela M. Witten

    Abstract: We consider conducting inference on the output of the Classification and Regression Tree (CART) [Breiman et al., 1984] algorithm. A naive approach to inference that does not account for the fact that the tree was estimated from the data will not achieve standard guarantees, such as Type 1 error rate control and nominal coverage. Thus, we propose a selective inference framework for conducting infer…

    Submitted 17 October, 2022; v1 submitted 14 June, 2021; originally announced June 2021.

  11. arXiv:2012.02936  [pdf, other]

    stat.ME stat.ML

    Selective Inference for Hierarchical Clustering

    Authors: Lucy L. Gao, Jacob Bien, Daniela Witten

    Abstract: Classical tests for a difference in means control the type I error rate when the groups are defined a priori. However, when the groups are instead defined via clustering, then applying a classical test yields an extremely inflated type I error rate. Notably, this problem persists even if two separate and independent data sets are used to define the groups and to test for a difference in their mean…

    Submitted 31 October, 2022; v1 submitted 4 December, 2020; originally announced December 2020.

    Comments: Final accepted version

  12. arXiv:1910.00745  [pdf, other]

    stat.ME

    Minimax D-optimal designs for multivariate regression models with multi-factors

    Authors: Lucy L. Gao, Julie Zhou

    Abstract: In multi-response regression models, the error covariance matrix is never known in practice. Thus, there is a need for optimal designs which are robust against possible misspecification of the error covariance matrix. In this paper, we approximate the error covariance matrix with a neighbourhood of covariance matrices, in order to define minimax D-optimal designs which are robust against small dep…

    Submitted 1 October, 2019; originally announced October 2019.

  13. arXiv:1909.11640  [pdf, other]

    stat.ME stat.ML

    Testing for Association in Multi-View Network Data

    Authors: Lucy L. Gao, Daniela Witten, Jacob Bien

    Abstract: In this paper, we consider data consisting of multiple networks, each comprised of a different edge set on a common set of nodes. Many models have been proposed for the analysis of such multi-view network data under the assumption that the data views are closely related. In this paper, we provide tools for evaluating this assumption. In particular, we ask: given two networks that each follow a sto…

    Submitted 22 March, 2021; v1 submitted 25 September, 2019; originally announced September 2019.

  14. arXiv:1901.03905  [pdf, other]

    stat.ME stat.ML

    Are Clusterings of Multiple Data Views Independent?

    Authors: Lucy L. Gao, Jacob Bien, Daniela Witten

    Abstract: In the Pioneer 100 (P100) Wellness Project (Price and others, 2017), multiple types of data are collected on a single set of healthy participants at multiple timepoints in order to characterize and optimize wellness. One way to do this is to identify clusters, or subgroups, among the participants, and then to tailor personalized health recommendations to each subgroup. It is tempting to cluster th…

    Submitted 12 January, 2019; originally announced January 2019.

    Comments: 20 pages, 4 figures, 1 table (main text); 15 pages, 9 figures (supplement)