-
Hypergraph Representations of scRNA-seq Data for Improved Clustering with Random Walks
Authors:
Wan He,
Daniel I. Bolnick,
Samuel V. Scarpino,
Tina Eliassi-Rad
Abstract:
Analysis of single-cell RNA sequencing data is often conducted through network projections such as coexpression networks, primarily due to the abundant availability of network analysis tools for downstream tasks. However, this approach has several limitations: loss of higher-order information, inefficient data representation caused by converting a sparse dataset to a fully connected network, and overestimation of coexpression due to zero-inflation. To address these limitations, we propose conceptualizing scRNA-seq expression data as hypergraphs, which are generalized graphs in which the hyperedges can connect more than two vertices. In the context of scRNA-seq data, the hypergraph nodes represent cells and the hyperedges represent genes. Each hyperedge connects all cells where its corresponding gene is actively expressed and records the expression of the gene across different cells. This hypergraph conceptualization enables us to explore multi-way relationships beyond the pairwise interactions in coexpression networks without loss of information. We propose two novel clustering methods: (1) the Dual-Importance Preference Hypergraph Walk (DIPHW) and (2) the Coexpression and Memory-Integrated Dual-Importance Preference Hypergraph Walk (CoMem-DIPHW). They outperform established methods on both simulated and real scRNA-seq datasets. The improvement brought by our proposed methods is especially significant when data modularity is weak. Furthermore, CoMem-DIPHW incorporates the gene coexpression network, the cell coexpression network, and the cell-gene expression hypergraph from the single-cell abundance counts data together for embedding computation. This approach accounts for both the local-level information from single-cell level gene expression and the global-level information from the pairwise similarity in the two coexpression networks.
Submitted 2 April, 2025; v1 submitted 20 January, 2025;
originally announced January 2025.
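The hypergraph construction described in this abstract (cells as nodes, one hyperedge per gene, each hyperedge joining the cells that express the gene) can be illustrated with a small sketch. This is not the authors' DIPHW or CoMem-DIPHW method: the function names, the expression threshold, and the plain two-step walk (cell to gene, then gene to cell, each step proportional to expression) are assumptions chosen only to show the data structure.

```python
import numpy as np


def build_incidence(counts, threshold=0.0):
    """Weighted cell-by-gene incidence matrix (hypothetical helper).

    counts: array of shape (n_cells, n_genes) with expression values.
    An entry is kept only if it exceeds `threshold` (an assumed cutoff
    for "actively expressed"); otherwise it is set to zero.
    """
    counts = np.asarray(counts, dtype=float)
    return np.where(counts > threshold, counts, 0.0)


def hyperedges(H):
    """Map each gene index to the list of cells its hyperedge connects."""
    return {g: np.flatnonzero(H[:, g]).tolist() for g in range(H.shape[1])}


def _row_normalize(M):
    """Divide each row by its sum, leaving all-zero rows as zeros."""
    sums = M.sum(axis=1, keepdims=True)
    return np.divide(M, sums, out=np.zeros_like(M), where=sums > 0)


def walk_transition_matrix(H):
    """Cell-to-cell transition matrix of a plain two-step hypergraph walk.

    Step 1: from a cell, pick a gene with probability proportional to
            that gene's expression in the cell.
    Step 2: from the gene, pick a cell with probability proportional to
            the gene's expression in that cell.
    This is a generic walk, assumed here for illustration only.
    """
    cell_to_gene = _row_normalize(H)
    gene_to_cell = _row_normalize(H.T.copy())
    return cell_to_gene @ gene_to_cell


# Toy data: 3 cells (rows) by 3 genes (columns).
toy_counts = [[2, 0, 1],
              [0, 3, 1],
              [0, 0, 4]]
H = build_incidence(toy_counts)
P = walk_transition_matrix(H)
print(hyperedges(H))
print(P.sum(axis=1))
```

In the toy matrix, the third gene is expressed in every cell, so its hyperedge joins all three cells in a single multi-way relation rather than three separate pairwise edges.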
-
A Misclassification Network-Based Method for Comparative Genomic Analysis
Authors:
Wan He,
Tina Eliassi-Rad,
Samuel V. Scarpino
Abstract:
Classifying genome sequences based on metadata has been an active area of research in comparative genomics for decades with many important applications across the life sciences. Established methods for classifying genomes can be broadly grouped into sequence alignment-based and alignment-free models. Conventional alignment-based models rely on genome similarity measures calculated based on local sequence alignments or consistent ordering among sequences. However, such methods are computationally expensive when dealing with large ensembles of even moderately sized genomes. In contrast, alignment-free (AF) approaches measure genome similarity based on summary statistics in an unsupervised setting and are efficient enough to analyze large datasets. However, both alignment-based and AF methods typically assume fixed scoring rubrics that lack the flexibility to assign varying importance to different parts of the sequences based on prior knowledge. In this study, we integrate AI and network science approaches to develop a comparative genomic analysis framework that addresses these limitations. Our approach, termed the Genome Misclassification Network Analysis (GMNA), simultaneously leverages misclassified instances, a learned scoring rubric, and label information to classify genomes based on associated metadata and better understand potential drivers of misclassification. We evaluate the utility of the GMNA using Naive Bayes and convolutional neural network models, supplemented by additional experiments with transformer-based models, to construct SARS-CoV-2 sampling location classifiers using over 500,000 viral genome sequences and study the resulting network of misclassifications. We demonstrate the global health potential of the GMNA by leveraging the SARS-CoV-2 genome misclassification networks to investigate the role human mobility played in structuring geographic clustering of SARS-CoV-2.
Submitted 15 January, 2025; v1 submitted 9 December, 2024;
originally announced December 2024.
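The abstract does not spell out how the misclassification network is built, so the following is only a generic sketch of the idea of turning classifier errors into a weighted directed network: an edge from label A to label B counts the samples whose true label is A but which the classifier assigned to B. The function names, the per-source normalization, and the toy location labels are all assumptions, not the GMNA procedure.

```python
from collections import Counter


def misclassification_edges(true_labels, predicted_labels):
    """Count directed (true_label, predicted_label) pairs for errors only."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    edges = Counter()
    for true, pred in zip(true_labels, predicted_labels):
        if true != pred:  # correct predictions contribute no edge
            edges[(true, pred)] += 1
    return edges


def normalize_by_source(edges, true_labels):
    """Convert edge counts to rates: the fraction of each true class
    that was misclassified into each other class (assumed weighting)."""
    totals = Counter(true_labels)
    return {(t, p): n / totals[t] for (t, p), n in edges.items()}


# Hypothetical sampling-location labels, for illustration only.
true = ["US", "US", "UK", "UK", "UK", "IN"]
pred = ["US", "UK", "UK", "US", "US", "IN"]

edges = misclassification_edges(true, pred)
rates = normalize_by_source(edges, true)
print(dict(edges))
print(rates)
```

Normalizing by the size of the source class keeps heavily sampled labels from dominating the edge weights; other weightings are equally plausible.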
-
Neuroscience needs Network Science
Authors:
Dániel L Barabási,
Ginestra Bianconi,
Ed Bullmore,
Mark Burgess,
SueYeon Chung,
Tina Eliassi-Rad,
Dileep George,
István A. Kovács,
Hernán Makse,
Christos Papadimitriou,
Thomas E. Nichols,
Olaf Sporns,
Kim Stachenfeld,
Zoltán Toroczkai,
Emma K. Towlson,
Anthony M Zador,
Hongkui Zeng,
Albert-László Barabási,
Amy Bernard,
György Buzsáki
Abstract:
The brain is a complex system comprising a myriad of interacting elements, posing significant challenges in understanding its structure, function, and dynamics. Network science has emerged as a powerful tool for studying such intricate systems, offering a framework for integrating multiscale data and complexity. Here, we discuss the application of network science in the study of the brain, addressing topics such as network models and metrics, the connectome, and the role of dynamics in neural networks. We explore the challenges and opportunities in integrating multiple data streams for understanding the neural transitions from development to healthy function to disease, and discuss the potential for collaboration between network science and neuroscience communities. We underscore the importance of fostering interdisciplinary opportunities through funding initiatives, workshops, and conferences, as well as supporting students and postdoctoral fellows with interests in both disciplines. By uniting the network science and neuroscience communities, we can develop novel network-based methods tailored to neural circuits, paving the way towards a deeper understanding of the brain and its functions.
Submitted 11 May, 2023; v1 submitted 10 May, 2023;
originally announced May 2023.
-
AI-Bind: Improving Binding Predictions for Novel Protein Targets and Ligands
Authors:
Ayan Chatterjee,
Robin Walters,
Zohair Shafi,
Omair Shafi Ahmed,
Michael Sebek,
Deisy Gysi,
Rose Yu,
Tina Eliassi-Rad,
Albert-László Barabási,
Giulia Menichetti
Abstract:
Identifying novel drug-target interactions (DTI) is a critical and rate-limiting step in drug discovery. While deep learning models have been proposed to accelerate the identification process, we show that state-of-the-art models fail to generalize to novel (i.e., never-before-seen) structures. We first unveil the mechanisms responsible for this shortcoming, demonstrating how models rely on shortcuts that leverage the topology of the protein-ligand bipartite network, rather than learning the node features. Then, we introduce AI-Bind, a pipeline that combines network-based sampling strategies with unsupervised pre-training, allowing us to limit the annotation imbalance and improve binding predictions for novel proteins and ligands. We illustrate the value of AI-Bind by predicting drugs and natural compounds with binding affinity to SARS-CoV-2 viral proteins and the associated human proteins. We also validate these predictions via docking simulations and comparison with recent experimental evidence, and step up the process of interpreting machine learning predictions of protein-ligand binding by identifying potential active binding sites on the amino acid sequence. Overall, AI-Bind offers a powerful high-throughput approach to identify drug-target combinations, with the potential of becoming a powerful tool in drug discovery.
Submitted 9 November, 2022; v1 submitted 24 December, 2021;
originally announced December 2021.
-
The why, how, and when of representations for complex systems
Authors:
Leo Torres,
Ann S. Blevins,
Danielle S. Bassett,
Tina Eliassi-Rad
Abstract:
Complex systems thinking is applied to a wide variety of domains, from neuroscience to computer science and economics. The wide variety of implementations has resulted in two key challenges: the proliferation of many domain-specific strategies that are seldom revisited or questioned, and the siloing of ideas within a domain due to inconsistency of complex systems language. In this work we offer basic, domain-agnostic language in order to advance towards a more cohesive vocabulary. We use this language to evaluate each step of the complex systems analysis pipeline, beginning with the system and data collected, then moving through different mathematical formalisms for encoding the observed data (i.e., graphs, simplicial complexes, and hypergraphs), and relevant computational methods for each formalism. At each step we consider different types of dependencies; these are properties of the system that describe how the existence of one relation among the parts of a system may influence the existence of another relation. We discuss how dependencies may arise and how they may alter interpretation of results or the entirety of the analysis pipeline. We close with two real-world examples using coauthorship data and email communications data that illustrate how the system under study, the dependencies therein, the research question, and choice of mathematical representation influence the results. We hope this work can serve as an opportunity for reflection for experienced complexity scientists, as well as an introductory resource for new researchers.
Submitted 4 June, 2020;
originally announced June 2020.