
arXiv:2502.09574 [pdf, other]

Spatial Transcriptomics Iterative Hierarchical Clustering (stIHC): A Novel Method for Identifying Spatial Gene Co-Expression Modules

Authors: Catherine Higgins, Jingyi Jessica Li, Michelle Carey

Abstract: Recent advancements in spatial transcriptomics technologies allow researchers to simultaneously measure RNA expression levels for hundreds to thousands of genes while preserving spatial information within tissues, providing critical insights into spatial gene expression patterns, tissue organization, and gene functionality. However, existing methods for clustering spatially variable genes (SVGs) into co-expression modules often fail to detect rare or unique spatial expression patterns. To address this, we present spatial transcriptomics iterative hierarchical clustering (stIHC), a novel method for clustering SVGs into co-expression modules, representing groups of genes with shared spatial expression patterns. Through three simulations and applications to spatial transcriptomics datasets from technologies such as 10x Visium, 10x Xenium, and Spatial Transcriptomics, stIHC outperforms clustering approaches used by popular SVG detection methods, including SPARK, SPARK-X, MERINGUE, and SpatialDE. Gene Ontology enrichment analysis confirms that genes within each module share consistent biological functions, supporting the functional relevance of spatial co-expression. Robust across technologies with varying gene numbers and spatial resolution, stIHC provides a powerful tool for decoding the spatial organization of gene expression and the functional structure of complex tissues.

Submitted 13 February, 2025; originally announced February 2025.

arXiv:2501.05012 [pdf, other]

SyNPar: Synthetic Null Data Parallelism for High-Power False Discovery Rate Control in High-Dimensional Variable Selection

Authors: Changhu Wang, Ziheng Zhang, Jingyi Jessica Li

Abstract: Balancing false discovery rate (FDR) and statistical power to ensure reliable discoveries is a key challenge in high-dimensional variable selection. Although several FDR control methods have been proposed, most involve perturbing the original data, either by concatenating knockoff variables or splitting the data into two halves, both of which can lead to a loss of power. In this paper, we introduce a novel approach called Synthetic Null Parallelism (SyNPar), which controls the FDR in high-dimensional variable selection while preserving the original data. SyNPar generates synthetic null data from a model fitted to the original data and modified to reflect the null hypothesis. It then applies the same estimation procedure in parallel to both the original and synthetic null data to estimate coefficients that indicate feature importance. By comparing the coefficients estimated from the null data with those from the original data, SyNPar effectively identifies false positives, functioning as a numerical analog of a likelihood ratio test. We provide theoretical guarantees for FDR control at any desired level while ensuring that the power approaches one with high probability asymptotically. SyNPar is straightforward to implement and can be applied to a wide range of statistical models, including high-dimensional linear regression, generalized linear models, Cox models, and Gaussian graphical models. Through extensive simulations and real data applications, we demonstrate that SyNPar outperforms state-of-the-art methods, including knockoffs and data-splitting methods, in terms of FDR control, power, and computational efficiency.

Submitted 9 January, 2025; originally announced January 2025.

arXiv:2405.18779 [pdf, other]

Categorization of 33 computational methods to detect spatially variable genes from spatially resolved transcriptomics data

Authors: Guanao Yan, Shuo Harper Hua, Jingyi Jessica Li

Abstract: In the analysis of spatially resolved transcriptomics data, detecting spatially variable genes (SVGs) is crucial. Numerous computational methods exist, but varying SVG definitions and methodologies lead to incomparable results. We review 33 state-of-the-art methods, categorizing SVGs into three types: overall, cell-type-specific, and spatial-domain-marker SVGs. Our review explains the intuitions underlying these methods, summarizes their applications, and categorizes the hypothesis tests they use in the trade-off between generality and specificity for SVG detection. We discuss challenges in SVG detection and propose future directions for improvement. Our review offers insights for method developers and users, advocating for category-specific benchmarking.

Submitted 3 October, 2024; v1 submitted 29 May, 2024; originally announced May 2024.

arXiv:2309.13518 [pdf, other]

Categorization and analysis of 14 computational methods for estimating cell potency from single-cell RNA-seq data

Authors: Qingyang Wang, Zhiqian Zhai, Qiuyu Lian, Dongyuan Song, Jingyi Jessica Li

Abstract: In single-cell RNA sequencing (scRNA-seq) analysis, a key challenge is inferring hidden cellular dynamics from static cell snapshots. Various computational methods have been developed to address this, focusing on perspectives like pseudotime trajectories, RNA velocities, and estimating the differentiation potential of cells, often referred to as "cell potency." This review summarizes 14 methods for defining cell potency from scRNA-seq data, categorizing them into average-based, entropy-based, and correlation-based methods based on how they summarize gene expression levels into a potency measure. We highlight the key similarities and differences within and between these categories, offering a high-level intuition for each method. Additionally, we use unified mathematical notations to detail each method's methodology and summarize their usage complexities, including parameters, required inputs, and differences between published descriptions and software implementations. We conclude that cell potency estimation remains an open question without a consensus on the optimal approach, emphasizing the need for benchmark datasets and studies. This review aims to provide a foundation for future benchmark studies, while also addressing the broader challenge of comparing methods that infer cellular dynamics from scRNA-seq data through various perspectives, including pseudotime trajectories, RNA velocities, and cell potency.

Submitted 30 August, 2024; v1 submitted 23 September, 2023; originally announced September 2023.

arXiv:2101.08860 [pdf]

doi 10.1016/j.xpro.2021.100699

Protocol for Executing and Benchmarking Eight Computational Doublet-Detection Methods in Single-Cell RNA Sequencing Data Analysis

Authors: Nan Miles Xi, Jingyi Jessica Li

Abstract: The existence of doublets is a key confounder in single-cell RNA sequencing (scRNA-seq) data analysis. Computational methods have been developed for detecting doublets from scRNA-seq data. We developed an R package DoubletCollection to integrate the installation and execution of eight doublet-detection methods. DoubletCollection also provides a unified interface to perform and visualize downstream analysis after doublet detection. Here, we present a protocol of using DoubletCollection to benchmark doublet-detection methods. This protocol can automatically accommodate new doublet-detection methods in the fast-growing scRNA-seq field.

Submitted 25 June, 2021; v1 submitted 21 January, 2021; originally announced January 2021.

Journal ref: STAR Protocols 2(3) (2021) 100699

arXiv:2007.01935 [pdf, other]

doi 10.1016/j.patter.2020.100115

Statistical hypothesis testing versus machine-learning binary classification: distinctions and guidelines

Authors: Jingyi Jessica Li, Xin Tong

Abstract: Making binary decisions is a common data analytical task in scientific research and industrial applications. In data sciences, there are two related but distinct strategies: hypothesis testing and binary classification. In practice, how to choose between these two strategies can be unclear and rather confusing. Here we summarize key distinctions between these two strategies in three aspects and list five practical guidelines for data analysts to choose the appropriate strategy for specific analysis needs. We demonstrate the use of those guidelines in a cancer driver gene prediction example.

Submitted 22 August, 2020; v1 submitted 3 July, 2020; originally announced July 2020.

Journal ref: Patterns 1(7) (2020) 100115

arXiv:1908.07084 [pdf, other]

Issues arising from benchmarking single-cell RNA sequencing imputation methods

Authors: Wei Vivian Li, Jingyi Jessica Li

Abstract: On June 25th, 2018, Huang et al. published a computational method SAVER in Nature Methods for imputing dropout gene expression levels in single cell RNA sequencing (scRNA-seq) data. Huang et al. performed a set of comprehensive benchmarking analyses, including comparison with the data from RNA fluorescence in situ hybridization, to demonstrate that SAVER outperformed two existing scRNA-seq imputation methods, scImpute and MAGIC. However, their computational analyses were based on semi-synthetic data that the authors had generated following the Poisson-Gamma model used in the SAVER method. We have therefore re-examined Huang et al.'s study. We find that the semi-synthetic data have very different properties from those of real scRNA-seq data and that the cell clusters used for benchmarking are inconsistent with the cell types labeled by biologists. We show that a reanalysis based on real scRNA-seq data and grounded on biological knowledge of cell types leads to different results and conclusions from those of Huang et al.

Submitted 19 August, 2019; originally announced August 2019.

Comments: 5 pages

arXiv:1804.06050 [pdf, other]

doi 10.1007/s40484-018-0144-7

Modeling and analysis of RNA-seq data: a review from a statistical perspective

Authors: Wei Vivian Li, Jingyi Jessica Li

Abstract: Background: Since the invention of next-generation RNA sequencing (RNA-seq) technologies, they have become a powerful tool to study the presence and quantity of RNA molecules in biological samples and have revolutionized transcriptomic studies. The analysis of RNA-seq data at four different levels (samples, genes, transcripts, and exons) involves multiple statistical and computational questions, some of which remain challenging to date. Results: We review RNA-seq analysis tools at the sample, gene, transcript, and exon levels from a statistical perspective. We also highlight the biological and statistical questions of most practical considerations. Conclusion: The development of statistical and computational methods for analyzing RNA-seq data has made significant advances in the past decade. However, methods developed to answer the same biological question often rely on diverse statistical models and exhibit different performance under different scenarios. This review discusses and compares multiple commonly used statistical models regarding their assumptions, in the hope of helping users select appropriate methods as needed, as well as assisting developers for future method development.

Submitted 1 May, 2018; v1 submitted 17 April, 2018; originally announced April 2018.

Journal ref: Quantitative Biology 6 (2018) 195-209

arXiv:1706.02366 [pdf, other]

doi 10.1089/cmb.2017.0059

Hybrid statistical and mechanistic mathematical model guides mobile health intervention for chronic pain

Authors: Sara M. Clifton, Chaeryon Kang, Jingyi Jessica Li, Qi Long, Nirmish Shah, Daniel M. Abrams

Abstract: Nearly a quarter of visits to the Emergency Department are for conditions that could have been managed via outpatient treatment; improvements that allow patients to quickly recognize and receive appropriate treatment are crucial. The growing popularity of mobile technology creates new opportunities for real-time adaptive medical intervention, and the simultaneous growth of big data sources allows for preparation of personalized recommendations. Here we focus on the reduction of chronic suffering in the sickle cell disease community. Sickle cell disease is a chronic blood disorder in which pain is the most frequent complication. There currently is no standard algorithm or analytical method for real-time adaptive treatment recommendations for pain. Furthermore, current state-of-the-art methods have difficulty in handling continuous-time decision optimization using big data. Facing these challenges, in this study we aim to develop new mathematical tools for incorporating mobile technology into personalized treatment plans for pain. We present a new hybrid model for the dynamics of subjective pain that consists of a dynamical systems approach using differential equations to predict future pain levels, as well as a statistical approach tying system parameters to patient data (both personal characteristics and medication response history). Pilot testing of our approach suggests that it has significant potential to predict pain dynamics given patients' reported pain levels and medication usages. With more abundant data, our hybrid approach should allow physicians to make personalized, data-driven recommendations for treating chronic pain.

Submitted 7 June, 2017; originally announced June 2017.

Comments: 13 pages, 15 figures, 5 tables

Journal ref: J Comput Biol. 24(7) (2017) 675-688

arXiv:1603.05915 [pdf, other]

doi 10.1214/17-AOAS1100

MSIQ: Joint Modeling of Multiple RNA-seq Samples for Accurate Isoform Quantification

Authors: Wei Vivian Li, Anqi Zhao, Shihua Zhang, Jingyi Jessica Li

Abstract: Next-generation RNA sequencing (RNA-seq) technology has been widely used to assess full-length RNA isoform abundance in a high-throughput manner. RNA-seq data offer insight into gene expression levels and transcriptome structures, enabling us to better understand the regulation of gene expression and fundamental biological processes. Accurate isoform quantification from RNA-seq data is challenging due to the information loss in sequencing experiments. A recent accumulation of multiple RNA-seq data sets from the same tissue or cell type provides new opportunities to improve the accuracy of isoform quantification. However, existing statistical or computational methods for multiple RNA-seq samples either pool the samples into one sample or assign equal weights to the samples when estimating isoform abundance. These methods ignore the possible heterogeneity in the quality of different samples and could result in biased and unrobust estimates. In this article, we develop a method, which we call "joint modeling of multiple RNA-seq samples for accurate isoform quantification" (MSIQ), for more accurate and robust isoform quantification by integrating multiple RNA-seq samples under a Bayesian framework. Our method aims to (1) identify a consistent group of samples with homogeneous quality and (2) improve isoform quantification accuracy by jointly modeling multiple RNA-seq samples by allowing for higher weights on the consistent group. We show that MSIQ provides a consistent estimator of isoform abundance, and we demonstrate the accuracy and effectiveness of MSIQ compared with alternative methods through simulation studies on D. melanogaster genes. We justify MSIQ's advantages over existing approaches via application studies on real RNA-seq data from human embryonic stem cells, brain tissues, and the HepG2 immortalized cell line.

Submitted 2 December, 2017; v1 submitted 18 March, 2016; originally announced March 2016.

MSC Class: 97K80; 47N30

Journal ref: Ann. Appl. Stat. 12(1) (2018) 510-539

arXiv:1601.05158 [pdf, other]

doi 10.1007/s12561-016-9163-y

TROM: A Testing-based Method for Finding Transcriptomic Similarity of Biological Samples

Authors: Wei Vivian Li, Yiling Chen, Jingyi Jessica Li

Abstract: Comparative transcriptomics has gained increasing popularity in genomic research thanks to the development of high-throughput technologies including microarray and next-generation RNA sequencing that have generated numerous transcriptomic data. An important question is to understand the conservation and differentiation of biological processes in different species. We propose a testing-based method TROM (Transcriptome Overlap Measure) for comparing transcriptomes within or between different species, and provide a different perspective to interpret transcriptomic similarity in contrast to traditional correlation analyses. Specifically, the TROM method focuses on identifying associated genes that capture molecular characteristics of biological samples, and subsequently comparing the biological samples by testing the overlap of their associated genes. We use simulation and real data studies to demonstrate that TROM is more powerful in identifying similar transcriptomes and more robust to stochastic gene expression noise than Pearson and Spearman correlations. We apply TROM to compare the developmental stages of six Drosophila species, C. elegans, S. purpuratus, D. rerio and mouse liver, and find interesting correspondence patterns that imply conserved gene expression programs in the development of these species. The TROM method is available as an R package on CRAN (http://cran.r-project.org/) with manuals and source codes available at http://www.stat.ucla.edu/ jingyi.li/software-and-data/trom.html.

Submitted 30 August, 2016; v1 submitted 19 January, 2016; originally announced January 2016.

Journal ref: Statistics in Biosciences 9 (2017) 105-136

arXiv:1212.0587 [pdf]

doi 10.7717/peerj.270

System Wide Analyses have Underestimated Protein Abundances and the Importance of Transcription in Mammals

Authors: Jingyi Jessica Li, Peter J. Bickel, Mark D. Biggin

Abstract: Large scale surveys in mammalian tissue culture cells suggest that the protein expressed at the median abundance is present at 8,000 - 16,000 molecules per cell and that differences in mRNA expression between genes explain only 10-40% of the differences in protein levels. We find, however, that these surveys have significantly underestimated protein abundances and the relative importance of transcription. Using individual measurements for 61 housekeeping proteins to rescale whole proteome data from Schwanhausser et al., we find that the median protein detected is expressed at 170,000 molecules per cell and that our corrected protein abundance estimates show a higher correlation with mRNA abundances than do the uncorrected protein data. In addition, we estimated the impact of further errors in mRNA and protein abundances, showing that mRNA levels explain at least 56% of the differences in protein abundance for the genes detected by Schwanhausser et al., though because one major source of error could not be estimated the true percent contribution could be higher. We also employed a second, independent strategy to determine the contribution of mRNA levels to protein expression. We show that the variance in translation rates directly measured by ribosome profiling is only 12% of that inferred by Schwanhausser et al. and that the measured and inferred translation rates correlate only poorly (R²=0.13). Based on this, our second strategy suggests that mRNA levels explain ~81% of the variance in protein levels. We also determined the percent contributions of transcription, RNA degradation, translation and protein degradation to the variance in protein abundances using both of our strategies. While the magnitudes of the two estimates vary, they both suggest that transcription plays a more important role than the earlier studies implied and translation a much smaller role.

Submitted 30 January, 2014; v1 submitted 3 December, 2012; originally announced December 2012.

Comments: v2 adds a model of all gene's protein and mRNA expression. v3 corrects the omission of dataset files. v4 and 5 extends models of all gene's expression. v6 adds two new ribosome footprint datasets. v7 The final version accepted at PeerJ. Adds NIH3T3 ribosome footprint data and removes modeling of all genes protein expression levels

Journal ref: PeerJ (2014) e270

Showing 1–12 of 12 results for author: Li, J J