-
HelixDesign-Antibody: A Scalable Production-Grade Platform for Antibody Design Built on HelixFold3
Authors:
Jie Gao,
Jing Hu,
Shanzhuo Zhang,
Kunrui Zhu,
Sheng Qian,
Yueyang Huang,
Xiaonan Zhang,
Xiaomin Fang
Abstract:
Antibody engineering is essential for developing therapeutics and advancing biomedical research. Traditional discovery methods often rely on time-consuming and resource-intensive experimental screening. To enhance and streamline this process, we introduce HelixDesign-Antibody, a production-grade, high-throughput platform built on the high-accuracy structure prediction model HelixFold3. The platform facilitates the large-scale generation of antibody candidate sequences and evaluates their interaction with antigens. Integrated high-performance computing (HPC) support enables high-throughput screening, addressing challenges such as fragmented toolchains and high computational demands. Validation on multiple antigens showcases the platform's ability to generate diverse and high-quality antibodies, confirming a scaling law where exploring larger sequence spaces increases the likelihood of identifying optimal binders. This platform provides a seamless, accessible solution for large-scale antibody design and is available via the antibody design page of the PaddleHelix platform.
Submitted 3 July, 2025;
originally announced July 2025.
-
HelixDesign-Binder: A Scalable Production-Grade Platform for Binder Design Built on HelixFold3
Authors:
Jie Gao,
Jun Li,
Jing Hu,
Shanzhuo Zhang,
Kunrui Zhu,
Yueyang Huang,
Xiaonan Zhang,
Xiaomin Fang
Abstract:
Protein binder design is central to therapeutics, diagnostics, and synthetic biology, yet practical deployment remains challenging due to fragmented workflows, high computational costs, and complex tool integration. We present HelixDesign-Binder, a production-grade, high-throughput platform built on HelixFold3 that automates the full binder design pipeline, from backbone generation and sequence design to structural evaluation and multi-dimensional scoring. By unifying these stages into a scalable and user-friendly system, HelixDesign-Binder enables efficient exploration of binder candidates with favorable structural, energetic, and physicochemical properties. The platform leverages Baidu Cloud's high-performance infrastructure to support large-scale design and incorporates advanced scoring metrics, including ipTM, predicted binding free energy, and interface hydrophobicity. Benchmarking across six protein targets demonstrates that HelixDesign-Binder reliably produces diverse and high-quality binders, some of which match or exceed validated designs in predicted binding affinity. HelixDesign-Binder is accessible via an interactive web interface on the PaddleHelix platform, supporting both academic research and industrial applications in antibody and protein binder development.
Submitted 27 May, 2025;
originally announced May 2025.
-
Efficient parameter inference in networked dynamical systems via steady states: A surrogate objective function approach integrating mean-field and nonlinear least squares
Authors:
Yanna Ding,
Malik Magdon-Ismail,
Jianxi Gao
Abstract:
In networked dynamical systems, inferring governing parameters is crucial for predicting nodal dynamics, such as gene expression levels, species abundance, or population density. Many parameter estimation techniques rely on time-series data. For some systems, however, particularly those that converge over extreme time ranges, only noisy steady-state data is available, which calls for a new approach to infer dynamical parameters from noisy observations of steady states. The traditional optimization process for this setting is computationally demanding, requiring repeated simulation of coupled ordinary differential equations (ODEs). To overcome these limitations, we introduce a surrogate objective function that leverages decoupled equations to compute steady states, significantly reducing computational complexity. Furthermore, by optimizing the surrogate objective function, we obtain steady states that approximate the ground truth more accurately than the noisy observations do, and we can predict future equilibria when the topology changes. We empirically demonstrate the effectiveness of the proposed method across ecological, gene regulatory, and epidemic networks. Our approach provides an efficient and effective way to estimate parameters from steady-state data and has the potential to improve predictions in networked dynamical systems.
Submitted 24 March, 2025;
originally announced March 2025.
-
VenusFactory: A Unified Platform for Protein Engineering Data Retrieval and Language Model Fine-Tuning
Authors:
Yang Tan,
Chen Liu,
Jingyuan Gao,
Banghao Wu,
Mingchen Li,
Ruilin Wang,
Lingrong Zhang,
Huiqun Yu,
Guisheng Fan,
Liang Hong,
Bingxin Zhou
Abstract:
Natural language processing (NLP) has significantly influenced scientific domains beyond human language, including protein engineering, where pre-trained protein language models (PLMs) have demonstrated remarkable success. However, interdisciplinary adoption remains limited due to challenges in data collection, task benchmarking, and application. This work presents VenusFactory, a versatile engine that integrates biological data retrieval, standardized task benchmarking, and modular fine-tuning of PLMs. VenusFactory supports both the computer science and biology communities by offering command-line execution as well as a Gradio-based no-code interface, and it integrates more than 40 protein-related datasets and more than 40 popular PLMs. All implementations are open-sourced at https://github.com/tyang816/VenusFactory.
Submitted 19 March, 2025;
originally announced March 2025.
-
PhyloVAE: Unsupervised Learning of Phylogenetic Trees via Variational Autoencoders
Authors:
Tianyu Xie,
Harry Richman,
Jiansi Gao,
Frederick A. Matsen IV,
Cheng Zhang
Abstract:
Learning informative representations of phylogenetic tree structures is essential for analyzing evolutionary relationships. Classical distance-based methods have been widely used to project phylogenetic trees into Euclidean space, but they are often sensitive to the choice of distance metric and may lack sufficient resolution. In this paper, we introduce phylogenetic variational autoencoders (PhyloVAEs), an unsupervised learning framework designed for representation learning and generative modeling of tree topologies. Leveraging an efficient encoding mechanism inspired by autoregressive tree topology generation, we develop a deep latent-variable generative model that facilitates fast, parallelized topology generation. PhyloVAE combines this generative model with a collaborative inference model based on learnable topological features, allowing for high-resolution representations of phylogenetic tree samples. Extensive experiments demonstrate PhyloVAE's robust representation learning capabilities and fast generation of phylogenetic tree topologies.
Submitted 7 February, 2025;
originally announced February 2025.
-
Precise Antigen-Antibody Structure Predictions Enhance Antibody Development with HelixFold-Multimer
Authors:
Jie Gao,
Jing Hu,
Lihang Liu,
Yang Xue,
Kunrui Zhu,
Xiaonan Zhang,
Xiaomin Fang
Abstract:
The accurate prediction of antigen-antibody structures is essential for advancing immunology and therapeutic development, as it helps elucidate molecular interactions that underlie immune responses. Despite recent progress with deep learning models like AlphaFold and RoseTTAFold, accurately modeling antigen-antibody complexes remains a challenge due to their unique evolutionary characteristics. HelixFold-Multimer, a specialized model developed for this purpose, builds on the framework of AlphaFold-Multimer and demonstrates improved precision for antigen-antibody structures. HelixFold-Multimer not only surpasses other models in accuracy but also provides essential insights into antibody development, enabling more precise identification of binding sites, improved interaction prediction, and enhanced design of therapeutic antibodies. These advances underscore HelixFold-Multimer's potential in supporting antibody research and therapeutic innovation.
Submitted 12 December, 2024;
originally announced December 2024.
-
Joint Design of 5' Untranslated Region and Coding Sequence of mRNA
Authors:
Yang Liu,
Jie Gao,
Xiaonan Zhang,
Xiaomin Fang
Abstract:
Messenger RNA (mRNA) vaccines and therapeutics are emerging as powerful tools against a variety of diseases, including infectious diseases and cancer. The design of mRNA molecules, particularly the untranslated region (UTR) and coding sequence (CDS), is crucial for optimizing translation efficiency and stability. Current design approaches generally focus solely on either the 5' UTR or the CDS, which limits their ability to comprehensively enhance translation efficiency and stability. To address this, we introduce LinearDesign2, an algorithm that enables the co-design of the 5' UTR and CDS. This integrated approach optimizes translation initiation efficiency (TIE), codon adaptation index (CAI), and minimum free energy (MFE) simultaneously. Comparative analyses reveal that sequences designed by LinearDesign2 exhibit significantly higher TIE than those designed by LinearDesign, with only a slight increase in MFE. Further, we validate the accuracy of the computational TIE metric using large-scale parallel translation experimental data. This study highlights the importance of a joint design strategy for the 5' UTR and CDS in optimizing mRNA performance, paving the way for more efficient mRNA vaccines and therapeutics.
Submitted 28 October, 2024;
originally announced October 2024.
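As an illustration of one of the metrics named in the abstract above, the codon adaptation index (CAI) is conventionally defined as the geometric mean of per-codon relative-adaptiveness weights. The sketch below shows that standard definition only. The weight table `WEIGHTS` and the function name `cai` are hypothetical, the weights are made-up values rather than measured codon usage, and nothing here is taken from LinearDesign2 itself.

```python
import math
from typing import Dict

# Hypothetical relative-adaptiveness weights (illustrative values only).
# In practice each weight is a codon's usage frequency divided by the
# frequency of the most-used synonymous codon in a reference gene set.
WEIGHTS: Dict[str, float] = {"GCC": 1.0, "GCU": 0.5, "AAG": 1.0, "AAA": 0.25}


def cai(cds: str, weights: Dict[str, float]) -> float:
    """Geometric mean of the weights of the codons in `cds`.

    Codons missing from `weights` are skipped; trailing bases that do
    not form a full codon are ignored.
    """
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


best = cai("GCCAAG", WEIGHTS)   # only top-weighted codons
worse = cai("GCUAAA", WEIGHTS)  # synonymous but lower-weighted codons
```

A sequence built only from top-weighted codons scores 1.0, and swapping in lower-weighted synonymous codons lowers the score, which is why CAI can trade off against other objectives in a joint design.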
-
Holistic structure of neural pathways underlies brain perceptual rivalry: Physical mechanism of auditory stream segregation
Authors:
Yuxuan Wu,
Jinling Gao,
Xiaona Fang,
Jin Wang
Abstract:
Brain perceptual rivalry, exemplified by auditory stream segregation of competing tones (A_, B__, ABA_), serves as a core mechanism of brain perception formation. While increasingly recognized as determined by neural connections rather than specific neural groups, the mechanism of brain perception remains uncertain. We demonstrate that auditory stream segregation arises from the topological structure of holistic neural pathways. By constructing a holistic pathway model using existing neurophysiological data, combining nonlinear neural dynamics and nonequilibrium physics, we uncover the biophysical mechanism of perceptual phase transitions from integrated (ABA_) to segregated streams (A_ or B_), as well as the mechanism of temporal dynamics, perceptual switching path, and attention regulation underlying these transitions. Further, we demonstrate how our framework reveals energy consumption of the auditory system and combines it with neuroelectrophysiology. Two psycho-acoustic experiments validate our predictions of perception alternation and attention modulation. Our framework provides a transformative perspective on how brain networks generate complex perceptual experiences, emphasizing the significance of neural pathway structure in the process of brain function realization.
Submitted 7 March, 2025; v1 submitted 23 October, 2024;
originally announced October 2024.
-
Generative causal testing to bridge data-driven models and scientific theories in language neuroscience
Authors:
Richard Antonello,
Chandan Singh,
Shailee Jain,
Aliyah Hsu,
Sihang Guo,
Jianfeng Gao,
Bin Yu,
Alexander Huth
Abstract:
Representations from large language models are highly effective at predicting BOLD fMRI responses to language stimuli. However, these representations are largely opaque: it is unclear what features of the language stimulus drive the response in each brain area. We present generative causal testing (GCT), a framework for generating concise explanations of language selectivity in the brain from predictive models and then testing those explanations in follow-up experiments using LLM-generated stimuli. This approach is successful at explaining selectivity both in individual voxels and cortical regions of interest (ROIs), including newly identified microROIs in prefrontal cortex. We show that explanatory accuracy is closely related to the predictive power and stability of the underlying predictive models. Finally, we show that GCT can dissect fine-grained differences between brain areas with similar functional selectivity. These results demonstrate that LLMs can be used to bridge the widening gap between data-driven models and formal scientific theories.
Submitted 2 March, 2025; v1 submitted 1 October, 2024;
originally announced October 2024.
-
Machine Learning-Based Prediction of Key Genes Correlated to the Subretinal Lesion Severity in a Mouse Model of Age-Related Macular Degeneration
Authors:
Kuan Yan,
Yue Zeng,
Dai Shi,
Ting Zhang,
Dmytro Matsypura,
Mark C. Gillies,
Ling Zhu,
Junbin Gao
Abstract:
Age-related macular degeneration (AMD) is a major cause of blindness in older adults, severely affecting vision and quality of life. Despite advances in understanding AMD, the molecular factors driving the severity of subretinal scarring (fibrosis) remain elusive, hampering the development of effective therapies. This study introduces a machine learning-based framework to predict key genes that are strongly correlated with lesion severity and to identify potential therapeutic targets to prevent subretinal fibrosis in AMD. Using an original RNA sequencing (RNA-seq) dataset from the diseased retinas of JR5558 mice, we developed a novel and specific feature engineering technique, including pathway-based dimensionality reduction and gene-based feature expansion, to enhance prediction accuracy. Two iterative experiments were conducted by leveraging Ridge and ElasticNet regression models to assess biological relevance and gene impact. The results highlight the biological significance of several key genes and demonstrate the framework's effectiveness in identifying novel therapeutic targets. The key findings provide valuable insights for advancing drug discovery efforts and improving treatment strategies for AMD, with the potential to enhance patient outcomes by targeting the underlying genetic mechanisms of subretinal lesion development.
Submitted 8 September, 2024;
originally announced September 2024.
-
Technical Report of HelixFold3 for Biomolecular Structure Prediction
Authors:
Lihang Liu,
Shanzhuo Zhang,
Yang Xue,
Xianbin Ye,
Kunrui Zhu,
Yuxin Li,
Yang Liu,
Jie Gao,
Wenlai Zhao,
Hongkun Yu,
Zhihua Wu,
Xiaonan Zhang,
Xiaomin Fang
Abstract:
The AlphaFold series has transformed protein structure prediction with remarkable accuracy, often matching experimental methods. AlphaFold2, AlphaFold-Multimer, and the latest AlphaFold3 represent significant strides in predicting single protein chains, protein complexes, and biomolecular structures. While AlphaFold2 and AlphaFold-Multimer are open-sourced, facilitating rapid and reliable predictions, AlphaFold3 remains only partially accessible through a limited online server and has not been open-sourced, restricting further development. To address these challenges, the PaddleHelix team is developing HelixFold3, aiming to replicate AlphaFold3's capabilities. Leveraging insights from previous models and extensive datasets, HelixFold3 achieves accuracy comparable to AlphaFold3 in predicting the structures of conventional ligands, nucleic acids, and proteins. The initial release of HelixFold3 is available as open source on GitHub for academic research, promising to advance biomolecular research and accelerate discoveries. The latest version will be continuously updated on the HelixFold3 web server, providing both interactive visualization and API access.
Submitted 22 December, 2024; v1 submitted 29 August, 2024;
originally announced August 2024.
-
The need to implement FAIR principles in biomolecular simulations
Authors:
Rommie Amaro,
Johan Åqvist,
Ivet Bahar,
Federica Battistini,
Adam Bellaiche,
Daniel Beltran,
Philip C. Biggin,
Massimiliano Bonomi,
Gregory R. Bowman,
Richard Bryce,
Giovanni Bussi,
Paolo Carloni,
David Case,
Andrea Cavalli,
Chie-En A. Chang,
Thomas E. Cheatham III,
Margaret S. Cheung,
Cris Chipot,
Lillian T. Chong,
Preeti Choudhary,
Gerardo Andres Cisneros,
Cecilia Clementi,
Rosana Collepardo-Guevara,
Peter Coveney,
Roberto Covino
, et al. (103 additional authors not shown)
Abstract:
This letter illustrates the opinion of the molecular dynamics (MD) community on the need to adopt a new FAIR paradigm for the use of molecular simulations. It highlights the necessity of a collaborative effort to create, establish, and sustain a database that allows findability, accessibility, interoperability, and reusability of molecular dynamics simulation data. Such a development would democratize the field and significantly improve the impact of MD simulations on life science research. This will transform our working paradigm, pushing the field to a new frontier. We invite you to support our initiative at the MDDB community (https://mddbr.eu/community/). Now published as: Amaro, R.E., et al. The need to implement FAIR principles in biomolecular simulations. Nat Methods (2025). https://doi.org/10.1038/s41592-025-02635-0
Submitted 3 April, 2025; v1 submitted 23 July, 2024;
originally announced July 2024.
-
Crafting Interpretable Embeddings by Asking LLMs Questions
Authors:
Vinamra Benara,
Chandan Singh,
John X. Morris,
Richard Antonello,
Ion Stoica,
Alexander G. Huth,
Jianfeng Gao
Abstract:
Large language models (LLMs) have rapidly improved text embeddings for a growing array of natural-language processing tasks. However, their opaqueness and proliferation into scientific domains such as neuroscience have created a growing need for interpretability. Here, we ask whether we can obtain interpretable embeddings through LLM prompting. We introduce question-answering embeddings (QA-Emb), embeddings where each feature represents an answer to a yes/no question asked to an LLM. Training QA-Emb reduces to selecting a set of underlying questions rather than learning model weights.
We use QA-Emb to flexibly generate interpretable models for predicting fMRI voxel responses to language stimuli. QA-Emb significantly outperforms an established interpretable baseline, and does so while requiring very few questions. This paves the way towards building flexible feature spaces that can concretize and evaluate our understanding of semantic brain representations. We additionally find that QA-Emb can be effectively approximated with an efficient model, and we explore broader applications in simple NLP tasks.
Submitted 26 May, 2024;
originally announced May 2024.
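The QA-Emb idea described in the abstract above, one embedding feature per yes/no question, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `qa_embed`, `keyword_answerer`, and the `KEYWORDS` table are hypothetical, and a simple keyword matcher stands in for the LLM that would answer each question in practice.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical question set; in QA-Emb, "training" amounts to choosing
# which questions to include rather than learning weights.
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "Does the text mention a person?": ("she", "he", "doctor"),
    "Does the text mention food?": ("bread", "soup", "apples"),
    "Does the text mention a number?": ("two", "three", "ten"),
}
QUESTIONS: List[str] = list(KEYWORDS)


def keyword_answerer(question: str, text: str) -> bool:
    """Stand-in for an LLM call: answer yes if any keyword appears."""
    words = text.lower().split()
    return any(k in words for k in KEYWORDS[question])


def qa_embed(
    text: str,
    questions: List[str],
    answer: Callable[[str, str], bool],
) -> List[int]:
    """Return one 0/1 feature per question (1 means the answer is yes)."""
    return [1 if answer(q, text) else 0 for q in questions]


emb_a = qa_embed("the doctor ate two apples", QUESTIONS, keyword_answerer)
emb_b = qa_embed("she baked bread", QUESTIONS, keyword_answerer)
emb_c = qa_embed("rain fell all night", QUESTIONS, keyword_answerer)
```

Because every dimension corresponds to a readable question, a downstream linear model fit on these features can be inspected question by question, which is the interpretability property the abstract emphasizes.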
-
HelixFold-Multimer: Elevating Protein Complex Structure Prediction to New Heights
Authors:
Xiaomin Fang,
Jie Gao,
Jing Hu,
Lihang Liu,
Yang Xue,
Xiaonan Zhang,
Kunrui Zhu
Abstract:
While monomer protein structure prediction tools boast impressive accuracy, the prediction of protein complex structures remains a daunting challenge in the field. This challenge is particularly pronounced in scenarios involving complexes with protein chains from different species, such as antigen-antibody interactions, where accuracy often falls short. Limited by the accuracy of complex prediction, tasks based on precise protein-protein interaction analysis also face obstacles. In this report, we highlight the ongoing advancements of our protein complex structure prediction model, HelixFold-Multimer, underscoring its enhanced performance. HelixFold-Multimer provides precise predictions for diverse protein complex structures, especially in therapeutic protein interactions. Notably, HelixFold-Multimer achieves remarkable success in antigen-antibody and peptide-protein structure prediction, greatly surpassing AlphaFold 3. HelixFold-Multimer is now available for public use on the PaddleHelix platform, offering both a general version and an antigen-antibody version. Researchers can conveniently access and utilize this service for their development needs.
Submitted 17 May, 2024; v1 submitted 15 April, 2024;
originally announced April 2024.
-
On the importance of assessing topological convergence in Bayesian phylogenetic inference
Authors:
Marius Brusselmans,
Luiz Max Carvalho,
Samuel L. Hong,
Jiansi Gao,
Frederick A. Matsen IV,
Andrew Rambaut,
Philippe Lemey,
Marc A. Suchard,
Gytis Dudas,
Guy Baele
Abstract:
Modern phylogenetics research is often performed within a Bayesian framework, using sampling algorithms such as Markov chain Monte Carlo (MCMC) to approximate the posterior distribution. These algorithms require careful evaluation of the quality of the generated samples. Within the field of phylogenetics, one frequently adopted diagnostic approach is to evaluate the effective sample size (ESS) and to investigate trace graphs of the sampled parameters. A major limitation of these approaches is that they are developed for continuous parameters and therefore incompatible with a crucial parameter in these inferences: the tree topology. Several recent advancements have aimed at extending these diagnostics to topological space. In this reflection paper, we present two case studies - one on Ebola virus and one on HIV - illustrating how these topological diagnostics can contain information not found in standard diagnostics, and how decisions regarding which of these diagnostics to compute can impact inferences regarding MCMC convergence and mixing. Our results show the importance of running multiple replicate analyses and of carefully assessing topological convergence using the output of these replicate analyses. To this end, we illustrate different ways of assessing and visualizing the topological convergence of these replicates. Given the major importance of detecting convergence and mixing issues in Bayesian phylogenetic analyses, the lack of a unified approach to this problem warrants further action, especially now that additional tools are becoming available to researchers.
Submitted 19 August, 2024; v1 submitted 18 February, 2024;
originally announced February 2024.
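To make the standard diagnostic mentioned in the abstract above concrete, the sketch below estimates the effective sample size (ESS) of a continuous MCMC trace as N / (1 + 2 Σ ρ_k), summing autocorrelations until the first non-positive lag. That truncation rule is a simplifying assumption, and phylogenetics software may use different estimators. The function names and the synthetic AR(1) chains are hypothetical. As the abstract notes, this kind of estimator applies to continuous parameters and does not carry over directly to tree topologies.

```python
import random
from typing import List


def autocorrelation(x: List[float], lag: int) -> float:
    """Sample autocorrelation of `x` at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    if var == 0.0:
        return 0.0
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag)) / n
    return cov / var


def effective_sample_size(x: List[float]) -> float:
    """ESS = N / (1 + 2 * sum of positive-lag autocorrelations)."""
    n = len(x)
    rho_sum = 0.0
    for lag in range(1, n):
        rho = autocorrelation(x, lag)
        if rho <= 0.0:  # assumed truncation rule: stop at first non-positive lag
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)


def ar1_chain(n: int, phi: float, seed: int) -> List[float]:
    """Synthetic trace: x[t] = phi * x[t-1] + standard normal noise."""
    rng = random.Random(seed)
    x = [0.0]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, 1.0))
    return x


ess_independent = effective_sample_size(ar1_chain(2000, 0.0, seed=1))
ess_sticky = effective_sample_size(ar1_chain(2000, 0.9, seed=1))
```

The strongly autocorrelated chain yields a much smaller ESS than the independent one despite having the same number of samples, which is the signal practitioners look for when judging mixing.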
-
Generalized dimension reduction approach for heterogeneous networked systems with time-delay
Authors:
Cheng Ma,
Gyorgy Korniss,
Boleslaw K. Szymanski,
Jianxi Gao
Abstract:
Networks of interconnected agents are essential for studying the state evolution, stability, resilience, and control of complex networked systems. Nevertheless, high dimensionality and nonlinear dynamics are key factors that prevent us from analyzing them theoretically. Recently, dimension-reduction approaches have reduced the system's size by mapping the original system to a one-dimensional system, such that a single effective representative captures its macroscopic dynamics. However, these approaches fail dramatically when the network becomes heterogeneous and has multiple community structures. Here, we bridge the gap by developing a generalized dimension reduction approach, which enables us to map the original system to an $m$-dimensional system that consists of $m$ interacting components. Notably, validation on various dynamical models shows that this approach accurately predicts the original system state and the tipping point, if any. Furthermore, the numerical results demonstrate that this approach approximates the system evolution and identifies the critical points for complex networks with time delay.
Submitted 22 August, 2023;
originally announced August 2023.
-
Explaining black box text modules in natural language with language models
Authors:
Chandan Singh,
Aliyah R. Hsu,
Richard Antonello,
Shailee Jain,
Alexander G. Huth,
Bin Yu,
Jianfeng Gao
Abstract:
Large language models (LLMs) have demonstrated remarkable prediction performance for a growing array of tasks. However, their rapid proliferation and increasing opaqueness have created a growing need for interpretability. Here, we ask whether we can automatically obtain natural language explanations for black box text modules. A "text module" is any function that maps text to a scalar continuous value, such as a submodule within an LLM or a fitted model of a brain region. "Black box" indicates that we only have access to the module's inputs/outputs.
We introduce Summarize and Score (SASC), a method that takes in a text module and returns a natural language explanation of the module's selectivity along with a score for how reliable the explanation is. We study SASC in 3 contexts. First, we evaluate SASC on synthetic modules and find that it often recovers ground truth explanations. Second, we use SASC to explain modules found within a pre-trained BERT model, enabling inspection of the model's internals. Finally, we show that SASC can generate explanations for the response of individual fMRI voxels to language stimuli, with potential applications to fine-grained brain mapping. All code for using SASC and reproducing results is made available on GitHub.
Submitted 15 November, 2023; v1 submitted 16 May, 2023;
originally announced May 2023.
-
STGIC: a graph and image convolution-based method for spatial transcriptomic clustering
Authors:
Chen Zhang,
Junhui Gao,
Lingxin Kong,
Guangshuo cao,
Xiangyu Guo,
Wei Liu
Abstract:
Spatial transcriptomic (ST) clustering employs spatial and transcription information to group spatially coherent and transcriptionally similar spots into the same spatial domain. Graph convolution networks (GCN) and graph attention networks (GAT), fed with an adjacency matrix derived from spatial coordinates and a feature matrix derived from transcription profiles, are often used to solve the problem. Our proposed method STGIC (spatial transcriptomic clustering with graph and image convolution) uses an adaptive graph convolution (AGC) to obtain high-quality pseudo-labels and then applies a dilated convolution framework (DCF) to a virtual image converted from the gene expression information and spatial coordinates of spots. The dilation rates and kernel sizes are set appropriately, and the updating of weight values in the kernels is made subject to the spatial distance from the position of the corresponding elements to the kernel centers, so that feature extraction for each spot is better guided by its spatial distance to neighboring spots. The training objective of DCF consists of self-supervision realized by KL-divergence, a spatial continuity loss, and cross entropy calculated among spots with high-confidence pseudo-labels. STGIC attains state-of-the-art (SOTA) clustering performance on the benchmark dataset of human dorsolateral prefrontal cortex (DLPFC). It is also capable of depicting fine structures of other tissues from other species and of guiding the identification of marker genes. In addition, STGIC is extensible to Stereo-seq data with high spatial resolution.
Submitted 23 October, 2023; v1 submitted 19 March, 2023;
originally announced March 2023.
-
AI of Brain and Cognitive Sciences: From the Perspective of First Principles
Authors:
Luyao Chen,
Zhiqiang Chen,
Longsheng Jiang,
Xiang Liu,
Linlu Xu,
Bo Zhang,
Xiaolong Zou,
Jinying Gao,
Yu Zhu,
Xizi Gong,
Shan Yu,
Sen Song,
Liangyi Chen,
Fang Fang,
Si Wu,
Jia Liu
Abstract:
We have witnessed the great success of AI in various applications, including image classification, game playing, protein structure analysis, language translation, and content generation. Despite these powerful applications, there are still many tasks in our daily life that are rather simple for humans but pose great challenges to AI. These include image and language understanding, few-shot learning, abstract concepts, and low-energy-cost computing. Thus, learning from the brain is still a promising way to shed light on the development of next-generation AI. The brain is arguably the only known intelligent machine in the universe, and it is the product of evolution for animals surviving in the natural environment. At the behavior level, psychology and cognitive sciences have demonstrated that human and animal brains can execute very intelligent high-level cognitive functions. At the structure level, cognitive and computational neurosciences have unveiled that the brain has extremely complicated but elegant network forms to support its functions. Over the years, researchers have been gathering knowledge about the structure and functions of the brain, and this process has recently accelerated with the initiation of giant brain projects worldwide. Here, we argue that the general principles of brain functions are the most valuable things to inspire the development of AI. These general principles are the standard rules by which the brain extracts, represents, manipulates, and retrieves information, and here we call them the first principles of the brain. This paper collects six such first principles. They are attractor network, criticality, random network, sparse coding, relational memory, and perceptual learning. On each topic, we review its biological background, fundamental property, potential application to AI, and future development.
Submitted 19 January, 2023;
originally announced January 2023.
-
D-CryptO: Deep learning-based analysis of colon organoid morphology from brightfield images
Authors:
Lyan Abdul,
Jocelyn Xu,
Alexander Sotra,
Abbas Chaudary,
Jerry Gao,
Shravanthi Rajasekar,
Nicky Anvari,
Hamidreza Mahyar,
Boyang Zhang
Abstract:
Stem cell-derived organoids are a promising tool to model native human tissues as they resemble human organs functionally and structurally more closely than traditional monolayer cell-based assays. For instance, colon organoids can spontaneously develop crypt-like structures similar to those found in the native colon. While analyzing the structural development of organoids can be a valuable readout, doing so with traditional image analysis tools is challenging because of the heterogeneity and abstract nature of organoid morphologies. To address this limitation, we developed and validated a deep learning-based image analysis tool, named D-CryptO, for the classification of organoid morphology. D-CryptO can automatically assess the crypt formation and opacity of colorectal organoids from brightfield images to determine the extent of organoid structural maturity. To validate this tool, changes in organoid morphology were analyzed during organoid passaging and short-term forskolin stimulation. To further demonstrate the potential of D-CryptO for drug testing, organoid structures were analyzed following treatments with a panel of chemotherapeutic drugs. With D-CryptO, subtle variations in how colon organoids responded to the different chemotherapeutic drugs were detected, which suggests potentially distinct mechanisms of action. This tool could be expanded to other organoid types, like intestinal organoids, to facilitate 3D tissue morphological analysis.
Submitted 12 October, 2022;
originally announced October 2022.
-
Explaining Patterns in Data with Language Models via Interpretable Autoprompting
Authors:
Chandan Singh,
John X. Morris,
Jyoti Aneja,
Alexander M. Rush,
Jianfeng Gao
Abstract:
Large language models (LLMs) have displayed an impressive ability to harness natural language to perform complex tasks. In this work, we explore whether we can leverage this learned ability to find and explain patterns in data. Specifically, given a pre-trained LLM and data examples, we introduce interpretable autoprompting (iPrompt), an algorithm that generates a natural-language string explaining the data. iPrompt iteratively alternates between generating explanations with an LLM and reranking them based on their performance when used as a prompt. Experiments on a wide range of datasets, from synthetic mathematics to natural-language understanding, show that iPrompt can yield meaningful insights by accurately finding ground-truth dataset descriptions. Moreover, the prompts produced by iPrompt are simultaneously human-interpretable and highly effective for generalization: on real-world sentiment classification datasets, iPrompt produces prompts that match or even improve upon human-written prompts for GPT-3. Finally, experiments with an fMRI dataset show the potential for iPrompt to aid in scientific discovery. All code for using the methods and data here is made available on GitHub.
Submitted 26 January, 2023; v1 submitted 4 October, 2022;
originally announced October 2022.
-
TCR: A Transformer Based Deep Network for Predicting Cancer Drugs Response
Authors:
Jie Gao,
Jing Hu,
Wanqing Sun,
Yili Shen,
Xiaonan Zhang,
Xiaomin Fang,
Fan Wang,
Guodong Zhao
Abstract:
Predicting clinical outcomes of anti-cancer drugs on a personalized basis is challenging in cancer treatment due to the heterogeneity of tumors. Traditional computational efforts have been made to model drug response in individual samples described by their molecular profiles, yet overfitting occurs because of the high dimensionality of omics data, hindering the clinical application of these models. Recent research shows that deep learning is a promising approach to building drug response models by learning alignment patterns between drugs and samples. However, existing studies employed a simple feature fusion strategy and considered the drug features only as a whole representation, ignoring the substructure information that may play a vital role when aligning drugs and genes. In this paper, we propose TCR (Transformer based network for Cancer drug Response) to predict anti-cancer drug response. By utilizing an attention mechanism, TCR is able to efficiently learn the interactions between drug atoms/substructures and molecular signatures. Furthermore, a dual loss function and a cross sampling strategy were designed to improve the prediction power of TCR. We show that TCR outperformed all other methods under various data splitting strategies on all evaluation metrics (some with significant improvement). Extensive experiments demonstrate that TCR shows significantly improved generalization ability on independent in-vitro experiments and in-vivo real patient data. Our study highlights the prediction power of TCR and its potential value for cancer drug repurposing and precision oncology treatment.
Submitted 10 July, 2022;
originally announced July 2022.
-
Deep learning models for predicting RNA degradation via dual crowdsourcing
Authors:
Hannah K. Wayment-Steele,
Wipapat Kladwang,
Andrew M. Watkins,
Do Soon Kim,
Bojan Tunguz,
Walter Reade,
Maggie Demkin,
Jonathan Romano,
Roger Wellington-Oguri,
John J. Nicol,
Jiayang Gao,
Kazuki Onodera,
Kazuki Fujikawa,
Hanfei Mao,
Gilles Vandewiele,
Michele Tinti,
Bram Steenwinckel,
Takuya Ito,
Taiga Noumi,
Shujun He,
Keiichiro Ishi,
Youhan Lee,
Fatih Öztürk,
Anthony Chiu,
Emin Öztürk
, et al. (4 additional authors not shown)
Abstract:
Messenger RNA-based medicines hold immense potential, as evidenced by their rapid deployment as COVID-19 vaccines. However, worldwide distribution of mRNA molecules has been limited by their thermostability, which is fundamentally limited by the intrinsic instability of RNA molecules to a chemical degradation reaction called in-line hydrolysis. Predicting the degradation of an RNA molecule is a key task in designing more stable RNA-based therapeutics. Here, we describe a crowdsourced machine learning competition ("Stanford OpenVaccine") on Kaggle, involving single-nucleotide resolution measurements on 6043 102-130-nucleotide diverse RNA constructs that were themselves solicited through crowdsourcing on the RNA design platform Eterna. The entire experiment was completed in less than 6 months, and 41% of nucleotide-level predictions from the winning model were within experimental error of the ground truth measurement. Furthermore, these models generalized to blindly predicting orthogonal degradation data on much longer mRNA molecules (504-1588 nucleotides) with improved accuracy compared to previously published models. Top teams integrated natural language processing architectures and data augmentation techniques with predictions from previous dynamic programming models for RNA secondary structure. These results indicate that such models are capable of representing in-line hydrolysis with excellent accuracy, supporting their use for designing stabilized messenger RNAs. The integration of two crowdsourcing platforms, one for data set creation and another for machine learning, may be fruitful for other urgent problems that demand scientific discovery on rapid timescales.
Submitted 22 April, 2022; v1 submitted 14 October, 2021;
originally announced October 2021.
-
A new parsimonious method for classifying Cancer Tissue-of-Origin Based on DNA Methylation 450K data
Authors:
Shen Jia,
Yulin Zhang,
Yiming Mao,
Jiawei Gao,
Yixuan Chen,
Yuxuan Jiang,
Haochen Luo,
Kebo Lv,
Jionglong Su
Abstract:
DNA methylation is a well-studied genetic modification that regulates gene transcription of Eukaryotes. Its alterations have been recognized as a significant component of cancer development. In this study, we use the DNA methylation 450k data from The Cancer Genome Atlas to evaluate the efficacy of DNA methylation data on cancer classification for 30 cancer types. We propose a new method for gene selection in high dimensional data (over 450 thousand). Variance filtering is first introduced for dimension reduction and Recursive feature elimination (RFE) is then used for feature selection. We address the problem of selecting a small subset of genes from a large number of methylated sites, and our parsimonious model is demonstrated to be efficient, achieving an accuracy over 91%, outperforming other studies which use DNA micro-arrays and RNA-seq data. The performance of 20 models, which are based on 4 estimators (Random Forest, Decision Tree, Extra Tree and Support Vector Machine) and 5 classifiers (k-Nearest Neighbours, Support Vector Machine, XGboost, Light GBM and Multi-Layer Perceptron), is compared and the robustness of the RFE algorithm is examined. Results suggest that the combined model of extra tree plus catboost classifier offers the best performance in cancer identification, with an overall validation accuracy of 91%, 92.3%, 93.3% and 93.5% for 20, 30, 40 and 50 features respectively. The biological functions of the 50 selected genes in cancer development are also explored through enrichment analysis, and the results show that 12 out of 16 of our top features have already been identified as cancer-specific; we also propose some more genes to be tested in future studies. Therefore, our method may be utilized as an auxiliary diagnostic method to determine the actual clinicopathological status of a specific cancer.
Submitted 3 January, 2021;
originally announced January 2021.
-
Universality of noise-induced resilience restoration in spatially-extended ecological systems
Authors:
Cheng Ma,
Gyorgy Korniss,
Boleslaw K. Szymanski,
Jianxi Gao
Abstract:
Many systems may switch to an undesired state due to internal failures or external perturbations, of which critical transitions toward degraded ecosystem states are a prominent example. Resilience restoration focuses on the ability of spatially-extended systems to recover to their desired states under stochastic environmental conditions, and on the time required to do so. While mean-field approaches may guide recovery strategies by indicating the conditions needed to destabilize undesired states, these approaches do not accurately capture the transition process toward the desired state of spatially-extended systems in stochastic environments. The difficulty is rooted in the lack of mathematical tools to analyze systems with high dimensionality, nonlinearity, and stochastic effects. We bridge this gap by developing new mathematical tools that employ nucleation theory in spatially-embedded systems to advance resilience restoration. We examine our approach on systems following mutualistic dynamics and diffusion models, finding that systems may exhibit single-cluster or multi-cluster phases depending on their sizes and noise strengths, and also construct a new scaling law governing the restoration time for arbitrary system size and noise strength in two-dimensional systems. This approach is not limited to ecosystems and has applications in various dynamical systems, from biology to infrastructural systems.
Submitted 9 September, 2021; v1 submitted 23 November, 2020;
originally announced November 2020.
-
STAN: Spatio-Temporal Attention Network for Pandemic Prediction Using Real World Evidence
Authors:
Junyi Gao,
Rakshith Sharma,
Cheng Qian,
Lucas M. Glass,
Jeffrey Spaeder,
Justin Romberg,
Jimeng Sun,
Cao Xiao
Abstract:
Objective: The COVID-19 pandemic has created many challenges that need immediate attention. Various epidemiological and deep learning models have been developed to predict the COVID-19 outbreak, but all have limitations that affect the accuracy and robustness of the predictions. Our method aims at addressing these limitations and making earlier and more accurate pandemic outbreak predictions by (1) using patients' EHR data from different counties and states that encode local disease status and medical resource utilization conditions; (2) considering demographic similarity and geographical proximity between locations; and (3) integrating pandemic transmission dynamics into deep learning models. Materials and Methods: We proposed a spatio-temporal attention network (STAN) for pandemic prediction. It uses an attention-based graph convolutional network to capture geographical and temporal trends and predict the number of cases for a fixed number of days into the future. We also designed a physical law-based loss term for enhancing long-term prediction. STAN was tested using both massive real-world patient data and open source COVID-19 statistics provided by Johns Hopkins University across all U.S. counties. Results: STAN outperforms epidemiological modeling methods such as SIR and SEIR and deep learning models on both long-term and short-term predictions, achieving up to 87% lower mean squared error compared to the best baseline prediction model. Conclusions: By using information from real-world patient data and geographical data, STAN can better capture the disease status and medical resource utilization information and thus provides more accurate pandemic modeling. With pandemic transmission law based regularization, STAN also achieves good long-term prediction performance.
Submitted 7 December, 2020; v1 submitted 23 July, 2020;
originally announced August 2020.
-
Network resilience
Authors:
Xueming Liu,
Daqing Li,
Manqing Ma,
Boleslaw K. Szymanski,
H Eugene Stanley,
Jianxi Gao
Abstract:
Many systems on our planet are known to shift abruptly and irreversibly from one state to another when they are forced across a "tipping point," such as mass extinctions in ecological networks, cascading failures in infrastructure systems, and social convention changes in human and animal networks. Such a regime shift demonstrates a system's resilience, which characterizes the ability of a system to adjust its activity to retain its basic functionality in the face of internal disturbances or external environmental changes. In the past 50 years, attention was almost exclusively given to low dimensional systems and the calibration of their resilience functions and indicators of early warning signals, without consideration of the interactions between the components. Only in recent years, taking advantage of network theory and abundant real data sets, have network scientists directed their interest to real-world complex networked multidimensional systems and their resilience functions and early warning indicators. This report is devoted to a comprehensive review of resilience functions and regime shifts of complex systems in different domains, such as ecology, biology, social systems and infrastructure. We cover the related research on empirical observations, experimental studies, mathematical modeling, and theoretical analysis. We also discuss some ambiguous definitions, such as robustness, resilience, and stability.
Submitted 9 April, 2022; v1 submitted 26 July, 2020;
originally announced July 2020.
-
Network Representation of Large-Scale Heterogeneous RNA Sequences with Integration of Diverse Multi-omics, Interactions, and Annotations Data
Authors:
Nhat Tran,
Jean Gao
Abstract:
Long non-coding RNA, microRNA, and messenger RNA enable key regulations of various biological processes through a variety of interaction mechanisms. Identifying the interactions and cross-talk between these heterogeneous RNA classes is essential in order to uncover the functional role of individual RNA transcripts, especially for unannotated and newly-discovered RNA sequences with no known interactions. Recently, sequence-based deep learning and network embedding methods have become promising approaches that can either predict RNA-RNA interactions from a sequence or infer missing interactions from patterns that may exist in the network topology. However, the majority of these methods have several limitations, e.g., the inability to perform inductive predictions, to distinguish the directionality of interactions, or to integrate various sequence, interaction, and annotation biological datasets. We proposed a novel deep learning-based framework, rna2rna, which learns from RNA sequences to produce a low-dimensional embedding that preserves the proximities in both the interactions topology and the functional affinity topology. In this proposed embedding space, we have designated two-part "source and target contexts" to capture the targeting and receptive fields of each RNA transcript, while encapsulating the heterogeneous cross-talk interactions between lncRNAs and miRNAs. In experimental results, our method exhibits superior performance in AUPR rates compared to state-of-the-art approaches at predicting missing interactions in different RNA-RNA interaction databases and was shown to accurately perform link predictions to novel RNA sequences not seen at training time, even without any prior information. Additional results suggest that our proposed framework can capture a manifold for heterogeneous RNA sequences to discover novel functional annotations.
Submitted 8 December, 2020; v1 submitted 17 June, 2019;
originally announced June 2019.
-
Evolutionary Game Dynamics for Two Interacting Populations under Environmental Feedback
Authors:
Lulu Gong,
Jian Gao,
Ming Cao
Abstract:
We study the evolutionary dynamics of games under environmental feedback using replicator equations for two interacting populations. One key feature is to consider jointly the co-evolution of the dynamic payoff matrices and the state of the environment: the payoff matrix varies with the changing environment and at the same time, the state of the environment is affected indirectly by the changing payoff matrix through the evolving population profiles. For such co-evolutionary dynamics, we investigate whether convergence will take place, and if so, how. In particular, we identify the scenarios where oscillation offers the best predictions of long-run behavior by using reversible system theory. The obtained results are useful to describe the evolution of multi-community societies in which individuals' payoffs and societal feedback interact.
Submitted 8 June, 2018;
originally announced June 2018.
-
Decoding and mapping task states of the human brain via deep learning
Authors:
Xiaoxiao Wang,
Xiao Liang,
Zhoufan Jiang,
Benedictor Alexander Nguchu,
Yawen Zhou,
Yanming Wang,
Huijuan Wang,
Yu Li,
Yuying Zhu,
Feng Wu,
Jia-Hong Gao,
Benching Qiu
Abstract:
Support vector machine (SVM) based multivariate pattern analysis (MVPA) has delivered promising performance in decoding specific task states based on functional magnetic resonance imaging (fMRI) of the human brain. Conventionally, the SVM-MVPA requires careful feature selection/extraction according to expert knowledge. In this study, we propose a deep neural network (DNN) for directly decoding multiple brain task states from fMRI signals of the brain without the burden of handcrafting features. We trained and tested the DNN classifier using task fMRI data from the Human Connectome Project's S1200 dataset (N=1034). In tests to verify its performance, the proposed classification method identified seven tasks with an average accuracy of 93.7%. We also showed the general applicability of the DNN for transfer learning to small datasets (N=43), a situation encountered in typical neuroscience research. The proposed method achieved an average accuracy of 89.0% and 94.7% on a working memory task and a motor classification task, respectively, higher than the accuracy of 69.2% and 68.6% obtained by the SVM-MVPA. A network visualization analysis showed that the DNN automatically detected features from areas of the brain related to each task. Without incurring the burden of handcrafting the features, the proposed deep decoding method can classify brain task states highly accurately, and is a powerful tool for fMRI researchers.
Submitted 4 December, 2019; v1 submitted 30 January, 2018;
originally announced January 2018.
-
Embryo as an active granular fluid: stress-coordinated cellular constriction chains
Authors:
Guo-Jie Jason Gao,
Michael C. Holcomb,
Jeffrey H. Thomas,
Jerzy Blawzdziewicz
Abstract:
Mechanical stress plays an intricate role in gene expression in individual cells and sculpting of developing tissues. However, systematic methods of studying how mechanical stress and feedback help to harmonize cellular activities within a tissue have yet to be developed. Motivated by our observation of the cellular constriction chains (CCCs) during the initial phase of ventral furrow formation in the Drosophila melanogaster embryo, we propose an active granular fluid (AGF) model that provides valuable insights into cellular coordination in the apical constriction process. In our model, cells are treated as circular particles connected by a predefined force network, and they undergo a random constriction process in which the particle constriction probability P is a function of the stress exerted on the particle by its neighbors. We find that when P favors tensile stress, constricted particles tend to form chain-like structures. In contrast, constricted particles tend to form compact clusters when P favors compression. A remarkable similarity of constricted-particle chains and CCCs observed in vivo provides indirect evidence that tensile-stress feedback coordinates the apical constriction activity. We expect that our particle-based AGF model will be useful in analyzing mechanical feedback effects in a wide variety of morphogenesis and organogenesis phenomena.
Submitted 4 March, 2016; v1 submitted 11 January, 2016;
originally announced January 2016.
-
Inverse Folding of RNA Pseudoknot Structures
Authors:
James Z. M. Gao,
Linda Y. M. Li,
Christian M. Reidys
Abstract:
Background: RNA exhibits a variety of structural configurations. Here we consider a structure to be tantamount to the noncrossing Watson-Crick and G-U base pairings (secondary structure) and additional cross-serial base pairs. These interactions are called pseudoknots and are observed across the whole spectrum of RNA functionalities. In the context of studying natural RNA structures, searching for new ribozymes and designing artificial RNA, it is of interest to find RNA sequences folding into a specific structure and to analyze their induced neutral networks. Since the established inverse folding algorithms RNAinverse, RNA-SSD and INFO-RNA are limited to RNA secondary structures, we present in this paper the inverse folding algorithm Inv, which can deal with 3-noncrossing, canonical pseudoknot structures.
Results: In this paper we present the inverse folding algorithm Inv. We give a detailed analysis of Inv, including pseudocode. We show that Inv allows the design of, in particular, 3-noncrossing nonplanar RNA pseudoknot structures, a class which is difficult to construct via dynamic programming routines. Inv is freely available at http://www.combinatorics.cn/cbpc/inv.html.
Conclusions: The algorithm Inv extends inverse folding capabilities to RNA pseudoknot structures. In comparison with RNAinverse it uses new ideas, for instance by considering sets of competing structures. As a result, Inv is not only able to find novel sequences even for RNA secondary structures, but also does so in the context of competing structures that potentially exhibit cross-serial interactions.
Submitted 9 March, 2010;
originally announced March 2010.
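The abstract above restricts Inv to 3-noncrossing structures, meaning structures with no three mutually crossing base pairs. The following is a minimal sketch of that property check only, not of the Inv algorithm itself. It assumes a structure is given as a list of base pairs (i, j) of sequence positions, with each position in at most one pair; the function names `crosses` and `is_k_noncrossing` are illustrative and do not come from the paper.

```python
from itertools import combinations


def crosses(a, b):
    """Return True if arcs a and b cross, i.e. i < k < j < l after ordering."""
    (i, j), (k, l) = sorted((a, b))
    return i < k < j < l


def is_k_noncrossing(pairs, k=3):
    """Return True if no k arcs in `pairs` are mutually crossing.

    Brute force over all k-subsets of arcs; fine for small examples,
    not meant for long sequences.
    """
    arcs = [tuple(sorted(p)) for p in pairs]
    return not any(
        all(crosses(a, b) for a, b in combinations(group, 2))
        for group in combinations(arcs, k)
    )


# Nested helix: an ordinary secondary structure (no crossings at all).
hairpin = [(1, 10), (2, 9)]
# Two helices whose arcs cross: a pseudoknot, but still 3-noncrossing.
pseudoknot = [(1, 6), (2, 5), (4, 9)]
# Three mutually crossing arcs: outside the 3-noncrossing class.
three_crossing = [(1, 4), (2, 5), (3, 6)]
```

With k=2 the same function tests for a plain secondary structure, so `hairpin` passes for k=2, `pseudoknot` fails for k=2 but passes for k=3, and `three_crossing` fails for k=3. The canonical condition mentioned in the abstract is a separate requirement and is not checked here.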
-
Inverse folding of RNA pseudoknot structures
Authors:
James Z. M. Gao,
Linda Y. M. Li,
Christian M. Reidys
Abstract:
Background:
RNA exhibits a variety of structural configurations. Here we consider a structure to be tantamount to the noncrossing Watson-Crick and G-U base pairings (secondary structure) and additional cross-serial base pairs. These interactions are called pseudoknots and are observed across the whole spectrum of RNA functionalities. In the context of studying natural RNA structures, searching for new ribozymes and designing artificial RNA, it is of interest to find RNA sequences folding into a specific structure and to analyze their induced neutral networks. Since the established inverse folding algorithms RNAinverse, RNA-SSD and INFO-RNA are limited to RNA secondary structures, we present in this paper the inverse folding algorithm Inv, which can deal with 3-noncrossing, canonical pseudoknot structures.
Results:
In this paper we present the inverse folding algorithm Inv. We give a detailed analysis of Inv, including pseudocode. We show that Inv allows the design of, in particular, 3-noncrossing nonplanar RNA pseudoknot structures, a class which is difficult to construct via dynamic programming routines. Inv is freely available at http://www.combinatorics.cn/cbpc/inv.html.
Conclusions:
The algorithm Inv extends inverse folding capabilities to RNA pseudoknot structures. In comparison with RNAinverse it uses new ideas, for instance by considering sets of competing structures. As a result, Inv is not only able to find novel sequences even for RNA secondary structures, but also does so in the context of competing structures that potentially exhibit cross-serial interactions.
Submitted 10 March, 2010; v1 submitted 5 May, 2009;
originally announced May 2009.
-
KiWi: A Scalable Subspace Clustering Algorithm for Gene Expression Analysis
Authors:
Obi L. Griffith,
Byron J. Gao,
Mikhail Bilenky,
Yuliya Prichyna,
Martin Ester,
Steven J. M. Jones
Abstract:
Subspace clustering has gained increasing popularity in the analysis of gene expression data. Among subspace cluster models, the recently introduced order-preserving sub-matrix (OPSM) has demonstrated high promise. An OPSM, essentially a pattern-based subspace cluster, is a subset of rows and columns in a data matrix for which all the rows induce the same linear ordering of columns. Existing OPSM discovery methods do not scale well to increasingly large expression datasets. In particular, twig clusters having few genes and many experiments incur explosive computational costs and are completely pruned off by existing methods. However, it is of particular interest to determine small groups of genes that are tightly coregulated across many conditions. In this paper, we present KiWi, an OPSM subspace clustering algorithm that is scalable to massive datasets, capable of discovering twig clusters and identifying negative as well as positive correlations. We extensively validate KiWi using relevant biological datasets and show that KiWi correctly assigns redundant probes to the same cluster, groups experiments with common clinical annotations, differentiates real promoter sequences from negative control sequences, and shows good association with cis-regulatory motif predictions.
Submitted 13 April, 2009;
originally announced April 2009.
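The KiWi abstract above defines an order-preserving sub-matrix (OPSM) as a subset of rows and columns for which all the rows induce the same linear ordering of columns. The sketch below illustrates that definition only; it is not the KiWi discovery algorithm. It assumes the expression matrix is a list of rows of numbers and that the ordering must be strict (a tie within a row counts as a violation); the function name `is_opsm` is illustrative.

```python
def is_opsm(matrix, rows, cols):
    """Check whether the selected rows all rank the selected columns identically.

    The column order is taken from the first selected row; every selected row
    must then be strictly increasing along that order.
    """
    if not rows or not cols:
        return True
    first = matrix[rows[0]]
    order = sorted(cols, key=lambda c: first[c])
    return all(
        all(
            matrix[r][order[k]] < matrix[r][order[k + 1]]
            for k in range(len(order) - 1)
        )
        for r in rows
    )


# Toy expression matrix: 3 genes (rows) by 4 experiments (columns).
expression = [
    [1, 5, 3, 9],
    [2, 8, 4, 7],
    [0, 6, 1, 2],
]

# All three genes rank experiments 0, 2, 1 in the same ascending order.
consistent = is_opsm(expression, rows=[0, 1, 2], cols=[0, 1, 2])
# Adding experiment 3 breaks the shared ordering between genes 0 and 1.
inconsistent = is_opsm(expression, rows=[0, 1], cols=[0, 1, 2, 3])
```

Verifying a given candidate this way is cheap; the difficulty the abstract describes lies in searching the space of row and column subsets, which this sketch does not attempt.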