
scGHSOM: Hierarchical clustering and visualization of single-cell and CRISPR data using growing hierarchical SOM

Authors: Shang-Jung Wen, Jia-Ming Chang, Fang Yu

Abstract: High-dimensional single-cell data poses significant challenges in identifying underlying biological patterns due to the complexity and heterogeneity of cellular states. We propose a comprehensive gene-cell dependency visualization via unsupervised clustering, Growing Hierarchical Self-Organizing Map (GHSOM), specifically designed for analyzing high-dimensional single-cell data like single-cell sequencing and CRISPR screens. GHSOM is applied to cluster samples in a hierarchical structure such that the self-growth structure of clusters satisfies the required variations between and within clusters. We propose a novel Significant Attributes Identification Algorithm to identify features that distinguish clusters. This algorithm pinpoints attributes with minimal variation within a cluster but substantial variation between clusters. These key attributes can then be used for targeted data retrieval and downstream analysis. Furthermore, we present two innovative visualization tools: Cluster Feature Map and Cluster Distribution Map. The Cluster Feature Map highlights the distribution of specific features across the hierarchical structure of GHSOM clusters. This allows for rapid visual assessment of cluster uniqueness based on chosen features. The Cluster Distribution Map depicts leaf clusters as circles on the GHSOM grid, with circle size reflecting cluster data size and color customizable to visualize features like cell type or other attributes.
We apply our analysis to three single-cell datasets and one CRISPR dataset (cell-gene database) and evaluate clustering methods with internal (CH) and external (ARI) scores. GHSOM performs well, being the best performer in internal evaluation (CH=4.2). In external evaluation, GHSOM has the third-best performance of all methods.
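The idea of an attribute with "minimal variation within a cluster but substantial variation between clusters" can be illustrated with a simple variance ratio. The sketch below is not the authors' Significant Attributes Identification Algorithm; the function name `attribute_scores`, the ratio used as a score, and the toy data are all assumptions made for illustration.

```python
from statistics import mean, pvariance


def attribute_scores(data, labels):
    """Score each attribute as between-cluster variance / mean within-cluster variance.

    Hypothetical illustration only: a high score means the attribute varies
    little inside clusters but differs strongly across cluster means.
    """
    # Group rows by their cluster label.
    clusters = {}
    for row, label in zip(data, labels):
        clusters.setdefault(label, []).append(row)

    eps = 1e-12  # avoids division by zero when within-cluster variance is 0
    scores = []
    for j in range(len(data[0])):
        columns = [[row[j] for row in rows] for rows in clusters.values()]
        within = mean(pvariance(col) for col in columns)
        between = pvariance([mean(col) for col in columns])
        scores.append(between / (within + eps))
    return scores


# Toy data (assumed): attribute 0 separates the clusters, attribute 1 does not.
data = [[1.0, 5.0], [1.1, 9.0], [3.0, 5.0], [3.1, 9.0]]
labels = ["a", "a", "b", "b"]
scores = attribute_scores(data, labels)
```

In this toy case the first attribute scores far higher than the second, so it would be the one flagged as distinguishing the two clusters.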

Submitted 24 July, 2024; originally announced July 2024.

Comments: Abstract presentation at BIOKDD@ACM KDD 2024

arXiv:2312.16097 [pdf]

Expanding to Arbitrary Study Designs: ANOVA to Estimate Limits of Agreement for MRMC Studies

Authors: Si Wen, Brandon D. Gallas

Abstract: A multi-reader multi-case (MRMC) analysis is applied to account for both reader and case variability when evaluating the clinical performance of a medical imaging device or reader performance under different reading modalities. For a clinical task that measures a quantitative biomarker, an agreement analysis, such as limits of agreement (LOA), can be used. In this work, we decompose the total variation in the data using a three-way mixed effect ANOVA model to estimate the MRMC variance of individual differences and the LOA between different reading modalities. There are rules for writing down the expectation of the mean squares in terms of the variance components for fully-crossed data, i.e. data where all the readers read all the cases in all modalities being studied. Sometimes the annotation task is labor-intensive and time-consuming or distributed across sites, so that a fully-crossed study is not practical. In this work, we focus on estimating the MRMC variance in the within- and between-readers and within- and between-modalities LOA for an arbitrary study design. Simulation studies were conducted to validate the LOA variance estimates. The method was also applied to a dataset to compare pathologist performance for assessing the density of stromal tumor infiltrating lymphocytes on different platforms.
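For readers unfamiliar with limits of agreement, the classic Bland-Altman form for one pair of measurement series is the mean difference plus or minus 1.96 standard deviations of the differences. The sketch below shows only that simple case; it does not implement the three-way mixed effect ANOVA or the MRMC variance estimate described in the abstract, and the function name and sample values are assumptions.

```python
from statistics import mean, stdev


def limits_of_agreement(x, y, z=1.96):
    """Return (lower, upper) Bland-Altman limits of agreement for paired data.

    Simplified single-pair case: ignores reader and case variance components.
    """
    diffs = [a - b for a, b in zip(x, y)]
    d_bar = mean(diffs)   # mean difference (bias)
    s = stdev(diffs)      # sample standard deviation of the differences
    return d_bar - z * s, d_bar + z * s


# Assumed example values: the same four cases measured under two modalities.
modality_a = [10.0, 12.0, 14.0, 16.0]
modality_b = [9.0, 12.0, 13.0, 17.0]
lower, upper = limits_of_agreement(modality_a, modality_b)
```

In an MRMC setting the standard deviation of the differences would instead come from the estimated variance components, which is the step the paper addresses.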

Submitted 26 December, 2023; originally announced December 2023.

arXiv:2107.08891 [pdf]

Three-way Mixed Effect ANOVA to Estimate MRMC Limits of Agreement

Authors: Si Wen, Brandon D. Gallas

Abstract: When evaluating the clinical performance of a medical imaging device, a multi-reader multi-case (MRMC) analysis is usually applied to account for both case and reader variability. For a clinical task that equates to a quantitative measurement, an agreement analysis such as a limits of agreement (LOA) method can be used to compare different measurement methods. In this work, we introduce four types of comparisons; these types differ depending on whether the measurements are within or between readers and within or between modalities. A three-way mixed effect ANOVA model is applied to estimate the variances of individual differences, which is an essential step for estimating LOA. To verify the estimates of LOA, we propose a hierarchical model to simulate quantitative MRMC data. Two simulation studies were conducted to validate both the simulation and the LOA variance estimates. From the simulation results, we can conclude that our estimate of variance is unbiased, and the uncertainty of the estimation drops as the number of readers and cases increases and rises as the value of true variance increases.

Submitted 19 July, 2021; originally announced July 2021.

Comments: 31 pages, 5 figures, 2 tables. Submitted to Statistics in Biopharmaceutical Research

arXiv:2103.15142 [pdf]

COSINE: A Web Server for Clonal and Subclonal Structure Inference and Evolution in Cancer Genomics

Authors: Xiguo Yuan, Yuan Zhao, Yang Guo, Linmei Ge, Wei Liu, Shiyu Wen, Qi Li, Zhangbo Wan, Peina Zheng, Tao Guo, Zhida Li, Martin Peifer, Yupeng Cun

Abstract: Cancers evolve from mutation of a single cell with sequential clonal and subclonal expansion of somatic mutation acquisition. Inferring clonal and subclonal structures from bulk or single cell tumor genomic sequencing data has a huge impact on cancer evolution studies. Clonal state and mutational order can provide detailed insight into tumor origin and its future development. In the past decade, a variety of methods have been developed for subclonal reconstruction using bulk tumor sequencing data. As these methods have been developed in different programming languages and using different input data formats, their use and comparison can be problematic. Therefore, we established a web server for clonal and subclonal structure inference and evolution of cancer genomic data (COSINE), which included 12 popular subclonal reconstruction methods. We decomposed each method via a detailed workflow of single processing steps with a user-friendly interface. To the best of our knowledge, this is the first web server providing online subclonal inference, including the most popular subclonal reconstruction methods. COSINE is freely accessible at www.clab-cosine.net or http://bio.rj.run:48996/cun-web.

Submitted 28 March, 2021; originally announced March 2021.

arXiv:2010.06995 [pdf]

doi 10.4103/jpi.jpi_83_20

A Pathologist-Annotated Dataset for Validating Artificial Intelligence: A Project Description and Pilot Study

Authors: Sarah N Dudgeon, Si Wen, Matthew G Hanna, Rajarsi Gupta, Mohamed Amgad, Manasi Sheth, Hetal Marble, Richard Huang, Markus D Herrmann, Clifford H. Szu, Darick Tong, Bruce Werness, Evan Szu, Denis Larsimont, Anant Madabhushi, Evangelos Hytopoulos, Weijie Chen, Rajendra Singh, Steven N. Hart, Joel Saltz, Roberto Salgado, Brandon D Gallas

Abstract: Purpose: In this work, we present a collaboration to create a validation dataset of pathologist annotations for algorithms that process whole slide images (WSIs). We focus on data collection and evaluation of algorithm performance in the context of estimating the density of stromal tumor infiltrating lymphocytes (sTILs) in breast cancer. Methods: We digitized 64 glass slides of hematoxylin- and eosin-stained ductal carcinoma core biopsies prepared at a single clinical site. We created training materials and workflows to crowdsource pathologist image annotations on two modes: an optical microscope and two digital platforms. The workflows collect the ROI type, a decision on whether the ROI is appropriate for estimating the density of sTILs, and if appropriate, the sTIL density value for that ROI. Results: The pilot study yielded an abundant number of cases with nominal sTIL infiltration. Furthermore, we found that the sTIL densities are correlated within a case, and there is notable pathologist variability. Consequently, we outline plans to improve our ROI and case sampling methods. We also outline statistical methods to account for ROI correlations within a case and pathologist variability when validating an algorithm. Conclusion: We have built workflows for efficient data collection and tested them in a pilot study. As we prepare for pivotal studies, we will consider what it will take for the dataset to be fit for a regulatory purpose: study size, patient population, and pathologist training and qualifications.
To this end, we will elicit feedback from the FDA via the Medical Device Development Tool program and from the broader digital pathology and AI community. Ultimately, we intend to share the dataset, statistical methods, and lessons learned.

Submitted 14 October, 2020; originally announced October 2020.

Comments: 26 pages, 4 figures, 2 tables. Submitted to the Journal of Pathology Informatics. Project web page: https://ncihub.org/groups/eedapstudies

arXiv:1503.01880 [pdf]

Genetic structure of Sino-Tibetan populations revealed by forensic STR loci

Authors: Hong-Bing Yao, Chuan-Chao Wang, Jiang Wang, Xiaolan Tao, Shao-Qing Wen, Qiajun Du, Qiongying Deng, Bingying Xu, Ying Huang, Hong-Dan Wang, Shujin Li, Bin Cong, Liying Ma, Li Jin, Johannes Krause, Hui Li

Abstract: The origin and diversification of Sino-Tibetan populations have long been debated. However, the limited genetic information on Tibetan populations keeps this topic far from clear. In the present study, we genotyped 15 forensic autosomal STRs from 803 unrelated Tibetan individuals from Gansu Province (635 from Gannan and 168 from Tianzhu). We combined these data with published datasets to infer detailed population affinities and admixture of Sino-Tibetan populations. Our results revealed that the genetic structure of Sino-Tibetan populations was strongly correlated with linguistic affiliations. Although the among-population variances are relatively small, the genetic components for Tibetan, Lolo-Burmese, and Han Chinese were quite distinctive, especially for the Deng, Nu, and Derung of Lolo-Burmese. Southern indigenous populations, such as Tai-Kadai and Hmong-Mien populations, might have made substantial genetic contributions to Han Chinese and Altaic populations, but not to Tibetans. Likewise, Han Chinese but not Tibetans shared very similar genetic makeups with Altaic populations, which did not support the North Asian origin of Tibetan populations. The dataset generated here is also valuable for forensic identification.

Submitted 6 March, 2015; originally announced March 2015.

Comments: 11 pages, 2 figures

arXiv:1412.6274 [pdf]

doi 10.1038/jhg.2015.28

Y Chromosome of Aisin Gioro, the Imperial House of Qing Dynasty

Authors: Shi Yan, Harumasa Tachibana, Lan-Hai Wei, Ge Yu, Shao-Qing Wen, Chuan-Chao Wang

Abstract: The House of Aisin Gioro is the imperial family of the last dynasty in Chinese history - the Qing Dynasty (1644 - 1911). The Aisin Gioro family originated from Jurchen tribes and developed the Manchu people before they conquered China. By investigating the Y chromosomal short tandem repeats (STRs) of 7 modern male individuals who claim to belong to the Aisin Gioro family (of whom 3 have full records of pedigree), we found that 3 of them (of whom 2 keep a full pedigree, and whose most recent common ancestor is Nurgaci) show a very close relationship (1 - 2 steps of difference in 17 STRs) and that the haplotype is rare. We therefore conclude that this haplotype is the Y chromosome of the House of Aisin Gioro. Further tests of single nucleotide polymorphisms (SNPs) indicate that they belong to Haplogroup C3b2b1*-M401(xF5483), although their Y-STR results are distant from the "star cluster", which also belongs to the same haplogroup. This study forms the basis for the pedigree research of the imperial family of the Qing Dynasty by means of genetics.

Submitted 19 December, 2014; originally announced December 2014.

Comments: 11 pages, 3 figures, 2 tables. Submitted on 2014.12.19, in advance of acceptance by a journal. A parallel Chinese version is also available

Journal ref: Journal of Human Genetics (2 April 2015)

arXiv:1311.6857 [pdf]

Agriculture driving male expansion in Neolithic Time

Authors: Chuan-Chao Wang, Yunzhi Huang, Shao-Qing Wen, Chun Chen, Li Jin, Hui Li

Abstract: The emergence of agriculture is suggested to have driven extensive human population growth. However, genetic evidence from maternal mitochondrial genomes suggests major population expansions began before the emergence of agriculture. Therefore, the role that agriculture played in initial population expansions remains controversial. Here, we analyzed a set of globally distributed whole Y chromosome and mitochondrial genomes of 526 male samples from the 1000 Genomes Project. We found that most major paternal lineage expansions coalesced in Neolithic Time. The estimated effective population sizes through time revealed strong evidence for a 10- to 100-fold increase in population growth of males with the advent of agriculture. This sex-biased Neolithic expansion might result from the reduction in hunting-related mortality of males.

Submitted 26 November, 2013; originally announced November 2013.

Comments: 9 pages, 2 figures

arXiv:1310.5413 [pdf]

Convergence of Y chromosome STR haplotypes from different SNP haplogroups compromises accuracy of haplogroup prediction

Authors: Chuan-Chao Wang, Ling-Xiang Wang, Rukesh Shrestha, Shaoqing Wen, Manfei Zhang, Xinzhu Tong, Li Jin, Hui Li

Abstract: Short tandem repeats (STRs) and single nucleotide polymorphisms (SNPs) are two kinds of commonly used markers in Y chromosome studies of forensic and population genetics. There has been increasing interest in the cost-saving strategy of using STR haplotypes to predict SNP haplogroups. However, the convergence of Y chromosome STR haplotypes from different haplogroups might compromise the accuracy of haplogroup prediction. Here, we compared the worldwide Y chromosome lineages at both haplogroup level and haplotype level to search for possible haplotype similarities among haplogroups. Similar haplotypes were found between haplogroups B and I2, C1 and E1b1b1, C2 and E1b1a1, H1 and J, L and O3a2c1, O1a and N, O3a1c and O3a2b, and M1 and O3a2, and those similarities reduce the accuracy of prediction.

Submitted 20 October, 2013; originally announced October 2013.

Comments: 13 pages, 2 figures

Showing 1–9 of 9 results for author: Wen, S