-
Dy-mer: An Explainable DNA Sequence Representation Scheme using Sparse Recovery
Authors:
Zhiyuan Peng,
Yuanbo Tang,
Yang Li
Abstract:
DNA sequences encode vital genetic and biological information, yet these variable-length sequences cannot serve as the input of common data mining algorithms. Hence, various representation schemes have been developed to transform DNA sequences into fixed-length numerical representations. However, these schemes face difficulties in learning high-quality representations due to the complexity and sparsity of DNA data. Additionally, DNA sequences are inherently noisy because of mutations. While several schemes have proven effective, they often lack semantic structure, making it difficult for biologists to validate and leverage the results. To address these challenges, we propose Dy-mer, an explainable and robust DNA representation scheme based on sparse recovery. Leveraging the underlying semantic structure of DNA, we modify the traditional sparse recovery to capture recurring patterns indicative of biological functions by representing frequent K-mers as basis vectors and reconstructing each DNA sequence through simple concatenation. Experimental results demonstrate that Dy-mer achieves state-of-the-art performance in DNA promoter classification, yielding a 13% increase in accuracy. Moreover, its inherent explainability facilitates DNA clustering and motif detection, enhancing its utility in biological research.
Submitted 6 July, 2024;
originally announced July 2024.
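The idea of using frequent K-mers as a basis for a fixed-length representation can be illustrated with a minimal sketch. This is not the Dy-mer algorithm itself (the abstract does not give its sparse-recovery formulation); it only shows how a dictionary of frequent K-mers turns variable-length sequences into vectors of equal length. The function names `kmers`, `build_dictionary`, and `encode`, and the toy sequences, are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return every overlapping substring of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_dictionary(seqs, k, size):
    """Pick the `size` most frequent k-mers across all sequences.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter()
    for s in seqs:
        counts.update(kmers(s, k))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [km for km, _ in ranked[:size]]


def encode(seq, dictionary, k):
    """Map a sequence of any length to a vector with one count per dictionary k-mer."""
    counts = Counter(kmers(seq, k))
    return [counts[km] for km in dictionary]


# Toy data (assumed for illustration only).
sequences = ["ACGTACGT", "TTACGTTT"]
dictionary = build_dictionary(sequences, k=3, size=2)
vectors = [encode(s, dictionary, k=3) for s in sequences]
```

Each entry of the resulting vector corresponds to a named K-mer, which is what makes a representation of this kind inspectable; a sparse-recovery scheme would additionally select only a few dictionary entries per sequence.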
-
Accurate Prediction of Antibody Function and Structure Using Bio-Inspired Antibody Language Model
Authors:
Hongtai Jing,
Zhengtao Gao,
Sheng Xu,
Tao Shen,
Zhangzhi Peng,
Shwai He,
Tao You,
Shuang Ye,
Wei Lin,
Siqi Sun
Abstract:
In recent decades, antibodies have emerged as indispensable therapeutics for combating diseases, particularly viral infections. However, their development has been hindered by limited structural information and labor-intensive engineering processes. Fortunately, significant advancements in deep learning methods have facilitated the precise prediction of protein structure and function by leveraging co-evolution information from homologous proteins. Despite these advances, predicting the conformation of antibodies remains challenging due to their unique evolution and the high flexibility of their antigen-binding regions. Here, to address this challenge, we present the Bio-inspired Antibody Language Model (BALM). This model is trained on a vast dataset comprising 336 million 40% non-redundant unlabeled antibody sequences, capturing both unique and conserved properties specific to antibodies. Notably, BALM showcases exceptional performance across four antigen-binding prediction tasks. Moreover, we introduce BALMFold, an end-to-end method derived from BALM, capable of swiftly predicting full atomic antibody structures from individual sequences. Remarkably, BALMFold outperforms well-established methods such as AlphaFold2, IgFold, ESMFold, and OmegaFold in the antibody benchmark, demonstrating significant potential to advance innovative engineering and streamline therapeutic antibody development by reducing the need for unnecessary trials.
Submitted 31 August, 2023;
originally announced August 2023.
-
What it takes to solve the Origin(s) of Life: An integrated review of techniques
Authors:
OoLEN,
Silke Asche,
Carla Bautista,
David Boulesteix,
Alexandre Champagne-Ruel,
Cole Mathis,
Omer Markovitch,
Zhen Peng,
Alyssa Adams,
Avinash Vicholous Dass,
Arnaud Buch,
Eloi Camprubi,
Enrico Sandro Colizzi,
Stephanie Colón-Santos,
Hannah Dromiack,
Valentina Erastova,
Amanda Garcia,
Ghjuvan Grimaud,
Aaron Halpern,
Stuart A Harrison,
Seán F. Jordan,
Tony Z Jia,
Amit Kahana,
Artemy Kolchinsky,
Odin Moron-Garcia, et al. (13 additional authors not shown)
Abstract:
Understanding the origin(s) of life (OoL) is a fundamental challenge for science in the 21st century. Research on OoL spans many disciplines, including chemistry, physics, biology, planetary sciences, computer science, mathematics and philosophy. The sheer number of different scientific perspectives relevant to the problem has resulted in the coexistence of diverse tools, techniques, data, and software in OoL studies. This has made communication between the disciplines relevant to the OoL extremely difficult because the interpretation of data, analyses, or standards of evidence can vary dramatically. Here, we hope to bridge this wide field of study by providing common ground via the consolidation of tools and techniques rather than positing a unifying view on how life emerges. We review the common tools and techniques that have been used significantly in OoL studies in recent years. In particular, we aim to identify which information is most relevant for comparing and integrating the results of experimental analyses into mathematical and computational models. This review aims to provide a baseline expectation and understanding of technical aspects of origins research, rather than being a primer on any particular topic. As such, it spans broadly -- from analytical chemistry to mathematical models -- and highlights areas of future work that will benefit from a multidisciplinary approach to tackling the mystery of life's origin. Ultimately, we hope to empower a new generation of OoL scientists by reviewing how they can investigate life's origin, rather than dictating how to think about the problem.
Submitted 24 August, 2023; v1 submitted 22 August, 2023;
originally announced August 2023.
-
Spatial Structure Supports Diversity in Prebiotic Autocatalytic Chemical Ecosystems
Authors:
Alex M. Plum,
Christopher P. Kempes,
Zhen Peng,
David A. Baum
Abstract:
Autocatalysis is thought to have played an important role in the earliest stages of the origin of life. An autocatalytic cycle (AC) is a set of reactions that results in a stoichiometric increase in its constituent chemicals. When the reactions of multiple interacting ACs are active in a region of space, they can have interactions analogous to those between species in biological ecosystems. Prior studies of autocatalytic chemical ecosystems (ACEs) have suggested avenues for accumulating complexity, such as ecological succession, as well as obstacles such as competitive exclusion. We extend this ecological framework to investigate the effects of surface adsorption, desorption, and diffusion on ACE ecology. Simulating ACEs as particle-based stochastic reaction-diffusion systems in spatial environments (including open, two-dimensional reaction-diffusion systems and adsorptive mineral surfaces), we demonstrate that spatial structure can enhance ACE diversity by i) permitting otherwise mutually exclusive ACs to coexist and ii) subjecting new AC traits to selection.
Submitted 25 July, 2024; v1 submitted 29 December, 2022;
originally announced December 2022.
-
Exploiting Pre-Trained ASR Models for Alzheimer's Disease Recognition Through Spontaneous Speech
Authors:
Ying Qin,
Wei Liu,
Zhiyuan Peng,
Si-Ioi Ng,
Jingyu Li,
Haibo Hu,
Tan Lee
Abstract:
Alzheimer's disease (AD) is a progressive neurodegenerative disease that has recently attracted extensive attention worldwide. Speech technology is considered a promising solution for the early diagnosis of AD and has been enthusiastically studied. Most recent works concentrate on the use of advanced BERT-like classifiers for AD detection. The inputs to these classifiers are speech transcripts produced by automatic speech recognition (ASR) models. The major challenge is that the quality of transcription can degrade significantly under complex acoustic conditions in the real world. The detection performance is, in consequence, largely limited. This paper tackles the problem by tailoring and adapting a pre-trained neural-network-based ASR model for the downstream AD recognition task. Only the bottom layers of the ASR model are retained. A simple fully-connected neural network is added on top of the tailored ASR model for classification. The heavy BERT classifier is discarded. The resulting model is light-weight and can be fine-tuned in an end-to-end manner for AD recognition. Our proposed approach takes only raw speech as input, and no extra transcription process is required. The linguistic information of speech is implicitly encoded in the tailored ASR model and contributes to boosting the performance. Experiments show that our proposed approach outperforms the best manual transcript-based RoBERTa by an absolute margin of 4.6% in terms of accuracy. Our best-performing models achieve accuracies of 83.2% and 78.0% in the long-audio and short-audio competition tracks of the 2021 NCMMSC Alzheimer's Disease Recognition Challenge, respectively.
Submitted 4 October, 2021;
originally announced October 2021.
-
An ecological framework for the analysis of prebiotic chemical reaction networks and their dynamical behavior
Authors:
Zhen Peng,
Alex Plum,
Praful Gagrani,
David A. Baum
Abstract:
It is becoming widely accepted that very early in the origin of life, even before the emergence of genetic encoding, reaction networks of diverse small chemicals might have manifested key properties of life, namely self-propagation and adaptive evolution. To explore this possibility, we formalize the dynamics of chemical reaction networks within the framework of chemical ecosystem ecology. To capture the idea that life-like chemical systems are maintained out of equilibrium by fluxes of energy-rich food chemicals, we model chemical ecosystems in well-mixed containers that are subject to constant dilution by a solution with a fixed concentration of food chemicals. Modelling all chemical reactions as fully reversible, we show that seeding an autocatalytic cycle (AC) with tiny amounts of one or more of its member chemicals results in logistic growth of all member chemicals in the cycle. This finding justifies drawing an instructive analogy between an AC and the population of a biological species. We extend this finding to show that pairs of ACs can show competitive, predator-prey, or mutualistic associations just like biological species. Furthermore, when there is stochasticity in the environment, particularly in the seeding of ACs, chemical ecosystems can show complex dynamics that can resemble evolution. The evolutionary character is especially clear when the network architecture results in ecological precedence (survival of the first), which makes the path of succession historically contingent on the order in which cycles are seeded. For all its simplicity, the framework developed here is helpful for visualizing how autocatalysis in prebiotic chemical reaction networks can yield life-like properties. Furthermore, chemical ecosystem ecology could provide a useful foundation for exploring the emergence of adaptive dynamics and the origins of polymer-based genetic systems.
Submitted 9 January, 2020; v1 submitted 8 January, 2020;
originally announced January 2020.
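The logistic-growth claim in the abstract above can be illustrated with a toy simulation. The sketch below is not the paper's model: it assumes a single reversible autocatalytic reaction, F + X <-> 2X, in a well-mixed container diluted at a constant rate by a solution with a fixed food concentration, and integrates it with forward Euler. The function name `simulate_ac` and all rate constants are hypothetical choices for illustration.

```python
def simulate_ac(kf=1.0, kr=0.1, dilution=0.1, food_in=1.0,
                x0=1e-4, dt=0.01, steps=10_000):
    """Forward-Euler integration of F + X <-> 2X under constant dilution.

    dF/dt = dilution * (food_in - F) - kf*F*X + kr*X^2
    dX/dt = kf*F*X - kr*X^2 - dilution*X
    Returns the trajectory of X, starting from a tiny seed x0.
    """
    food, x = food_in, x0
    trajectory = [x]
    for _ in range(steps):
        forward = kf * food * x    # F + X -> 2X
        backward = kr * x * x      # 2X -> F + X
        d_food = dilution * (food_in - food) - forward + backward
        d_x = forward - backward - dilution * x
        food += dt * d_food
        x += dt * d_x
        trajectory.append(x)
    return trajectory


trajectory = simulate_ac()

# With F + X relaxing to food_in, the non-trivial steady state of this toy
# model is X* = (kf * food_in - dilution) / (kf + kr).
plateau = (1.0 * 1.0 - 0.1) / (1.0 + 0.1)
```

In this toy setting the seeded chemical first grows roughly exponentially and then saturates near `plateau`, which plays the role of a carrying capacity; this is the sense in which an AC can be compared to the population of a biological species.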