-
ChemAU: Harness the Reasoning of LLMs in Chemical Research with Adaptive Uncertainty Estimation
Authors:
Xinyi Liu,
Lipeng Ma,
Yixuan Li,
Weidong Yang,
Qingyuan Zhou,
Jiayi Song,
Shuhao Li,
Ben Fei
Abstract:
Large Language Models (LLMs) are widely used across various scenarios due to their exceptional reasoning capabilities and natural language understanding. While LLMs demonstrate strong performance in tasks involving mathematics and coding, their effectiveness diminishes significantly when applied to chemistry-related problems. Chemistry problems typically involve long and complex reasoning steps, which contain specific terminology, including specialized symbol systems and complex nomenclature conventions. These characteristics often cause general LLMs to experience hallucinations during the reasoning process due to their lack of specific knowledge. However, existing methods are struggling to effectively leverage chemical expertise and formulas. Moreover, current uncertainty estimation methods, designed to mitigate potential reasoning errors, are unable to precisely identify specific steps or key knowledge. In this work, we propose a novel framework called ChemAU, which incorporates our adaptive uncertainty estimation method that applies different uncertainty values based on the position of reasoning steps within the whole reasoning chain. Leveraging this method, ChemAU identifies gaps in chemistry knowledge and precisely supplements chemical expertise with the specialized domain model, thereby correcting and updating the previously flawed reasoning chain. Our experiments with three popular LLMs across three chemistry datasets demonstrate that ChemAU significantly enhances both reasoning accuracy and uncertainty estimation.
Submitted 1 June, 2025;
originally announced June 2025.
-
Association in Facial Phenotype, Gene, Disease: A Dataset for Explainable Rare Genetic Diseases Diagnosis
Authors:
Jie Song,
Mengqiao He,
Shumin Ren,
Bairong Shen
Abstract:
Many rare genetic diseases exhibit recognizable facial phenotypes, which are often used as diagnostic clues. However, current facial phenotype diagnostic models, which are trained on image datasets, have high accuracy but often suffer from an inability to explain their predictions, which reduces physicians' confidence in the model output. In this paper, we constructed a dataset, called FGDD, which was collected from 509 publications and contains 1147 data records, in which each data record represents a patient group and contains patient information, variation information, and facial phenotype information. To verify the availability of the dataset, we evaluated the performance of commonly used classification algorithms on the dataset and analyzed the explainability from global and local perspectives. FGDD aims to support the training of disease diagnostic models, provide explainable results, and increase physicians' confidence with solid evidence. It also allows us to explore the complex relationship between genes, diseases, and facial phenotypes, to gain a deeper understanding of the pathogenesis and clinical manifestations of rare genetic diseases.
Submitted 25 March, 2025;
originally announced March 2025.
-
CMADiff: Cross-Modal Aligned Diffusion for Controllable Protein Generation
Authors:
Changjian Zhou,
Yuexi Qiu,
Tongtong Ling,
Jiafeng Li,
Shuanghe Liu,
Xiangjing Wang,
Jia Song,
Wensheng Xiang
Abstract:
AI-assisted protein design has emerged as a critical tool for advancing biotechnology, as deep generative models have demonstrated their reliability in this domain. However, most existing models primarily utilize protein sequence or structural data for training, neglecting the physicochemical properties of proteins. Moreover, they offer limited control over protein generation under intuitive conditions. To address these limitations, we propose CMADiff, a novel framework that enables controllable protein generation by aligning the physicochemical properties of protein sequences with text-based descriptions through a latent diffusion process. Specifically, CMADiff employs a Conditional Variational Autoencoder (CVAE) to integrate physicochemical features as conditional input, forming a robust latent space that captures biological traits. In this latent space, we apply a conditional diffusion process, which is guided by BioAligner, a contrastive learning-based module that aligns text descriptions with protein features, enabling text-driven control over protein sequence generation. Validated by a series of evaluations including AlphaFold3, the experimental results indicate that CMADiff outperforms protein sequence generation benchmarks and holds strong potential for future applications. The implementation and code are available at https://github.com/HPC-NEAU/PhysChemDiff.
Submitted 27 March, 2025;
originally announced March 2025.
-
Advanced Deep Learning Methods for Protein Structure Prediction and Design
Authors:
Yichao Zhang,
Ningyuan Deng,
Xinyuan Song,
Ziqian Bi,
Tianyang Wang,
Zheyu Yao,
Keyu Chen,
Ming Li,
Qian Niu,
Junyu Liu,
Benji Peng,
Sen Zhang,
Ming Liu,
Li Zhang,
Xuanhe Pan,
Jinlang Wang,
Pohsun Feng,
Yizhu Wen,
Lawrence KQ Yan,
Hongming Tseng,
Yan Zhong,
Yunze Wang,
Ziyuan Qin,
Bowen Jing,
Junjie Yang
, et al. (3 additional authors not shown)
Abstract:
After AlphaFold won the Nobel Prize, protein prediction with deep learning once again became a hot topic. We comprehensively explore advanced deep learning methods applied to protein structure prediction and design. It begins by examining recent innovations in prediction architectures, with detailed discussions on improvements such as diffusion based frameworks and novel pairwise attention modules. The text analyses key components including structure generation, evaluation metrics, multiple sequence alignment processing, and network architecture, thereby illustrating the current state of the art in computational protein modelling. Subsequent chapters focus on practical applications, presenting case studies that range from individual protein predictions to complex biomolecular interactions. Strategies for enhancing prediction accuracy and integrating deep learning techniques with experimental validation are thoroughly explored. The later sections review the industry landscape of protein design, highlighting the transformative role of artificial intelligence in biotechnology and discussing emerging market trends and future challenges. Supplementary appendices provide essential resources such as databases and open source tools, making this volume a valuable reference for researchers and students.
Submitted 29 March, 2025; v1 submitted 14 March, 2025;
originally announced March 2025.
-
Strategic priorities for transformative progress in advancing biology with proteomics and artificial intelligence
Authors:
Yingying Sun,
Jun A,
Zhiwei Liu,
Rui Sun,
Liujia Qian,
Samuel H. Payne,
Wout Bittremieux,
Markus Ralser,
Chen Li,
Yi Chen,
Zhen Dong,
Yasset Perez-Riverol,
Asif Khan,
Chris Sander,
Ruedi Aebersold,
Juan Antonio Vizcaíno,
Jonathan R Krieger,
Jianhua Yao,
Han Wen,
Linfeng Zhang,
Yunping Zhu,
Yue Xuan,
Benjamin Boyang Sun,
Liang Qiao,
Henning Hermjakob
, et al. (37 additional authors not shown)
Abstract:
Artificial intelligence (AI) is transforming scientific research, including proteomics. Advances in mass spectrometry (MS)-based proteomics data quality, diversity, and scale, combined with groundbreaking AI techniques, are unlocking new challenges and opportunities in biological discovery. Here, we highlight key areas where AI is driving innovation, from data analysis to new biological insights. These include developing an AI-friendly ecosystem for proteomics data generation, sharing, and analysis; improving peptide and protein identification and quantification; characterizing protein-protein interactions and protein complexes; advancing spatial and perturbation proteomics; integrating multi-omics data; and ultimately enabling AI-empowered virtual cells.
Submitted 21 February, 2025;
originally announced February 2025.
-
Exploring Optimal Transport-Based Multi-Grained Alignments for Text-Molecule Retrieval
Authors:
Zijun Min,
Bingshuai Liu,
Liang Zhang,
Jia Song,
Jinsong Su,
Song He,
Xiaochen Bo
Abstract:
The field of bioinformatics has seen significant progress, making the cross-modal text-molecule retrieval task increasingly vital. This task focuses on accurately retrieving molecule structures based on textual descriptions, by effectively aligning textual descriptions and molecules to assist researchers in identifying suitable molecular candidates. However, many existing approaches overlook the details inherent in molecule sub-structures. In this work, we introduce the Optimal TRansport-based Multi-grained Alignments model (ORMA), a novel approach that facilitates multi-grained alignments between textual descriptions and molecules. Our model features a text encoder and a molecule encoder. The text encoder processes textual descriptions to generate both token-level and sentence-level representations, while molecules are modeled as hierarchical heterogeneous graphs, encompassing atom, motif, and molecule nodes to extract representations at these three levels. A key innovation in ORMA is the application of Optimal Transport (OT) to align tokens with motifs, creating multi-token representations that integrate multiple token alignments with their corresponding motifs. Additionally, we employ contrastive learning to refine cross-modal alignments at three distinct scales: token-atom, multitoken-motif, and sentence-molecule, ensuring that the similarities between correctly matched text-molecule pairs are maximized while those of unmatched pairs are minimized. To our knowledge, this is the first attempt to explore alignments at both the motif and multi-token levels. Experimental results on the ChEBI-20 and PCdes datasets demonstrate that ORMA significantly outperforms existing state-of-the-art (SOTA) models.
Submitted 4 November, 2024;
originally announced November 2024.
-
CoPRA: Bridging Cross-domain Pretrained Sequence Models with Complex Structures for Protein-RNA Binding Affinity Prediction
Authors:
Rong Han,
Xiaohong Liu,
Tong Pan,
Jing Xu,
Xiaoyu Wang,
Wuyang Lan,
Zhenyu Li,
Zixuan Wang,
Jiangning Song,
Guangyu Wang,
Ting Chen
Abstract:
Accurately measuring protein-RNA binding affinity is crucial in many biological processes and drug design. Previous computational methods for protein-RNA binding affinity prediction rely on either sequence or structure features, unable to capture the binding mechanisms comprehensively. The recent emerging pre-trained language models trained on massive unsupervised sequences of protein and RNA have shown strong representation ability for various in-domain downstream tasks, including binding site prediction. However, applying different-domain language models collaboratively for complex-level tasks remains unexplored. In this paper, we propose CoPRA to bridge pre-trained language models from different biological domains via Complex structure for Protein-RNA binding Affinity prediction. We demonstrate for the first time that cross-biological modal language models can collaborate to improve binding affinity prediction. We propose a Co-Former to combine the cross-modal sequence and structure information and a bi-scope pre-training strategy for improving Co-Former's interaction understanding. Meanwhile, we build the largest protein-RNA binding affinity dataset PRA310 for performance evaluation. We also test our model on a public dataset for mutation effect prediction. CoPRA reaches state-of-the-art performance on all the datasets. We provide extensive analyses and verify that CoPRA can (1) accurately predict the protein-RNA binding affinity; (2) understand the binding affinity change caused by mutations; and (3) benefit from scaling data and model size.
Submitted 3 January, 2025; v1 submitted 21 August, 2024;
originally announced September 2024.
-
RGDA-DDI: Residual graph attention network and dual-attention based framework for drug-drug interaction prediction
Authors:
Changjian Zhou,
Xin Zhang,
Jiafeng Li,
Jia Song,
Wensheng Xiang
Abstract:
Recent studies suggest that drug-drug interaction (DDI) prediction via computational approaches has significant importance for understanding the functions and co-prescriptions of multiple drugs. However, existing in silico DDI prediction methods either ignore the potential interactions among drug-drug pairs (DDPs), or fail to explicitly model and fuse the multi-scale drug feature representations for better prediction. In this study, we propose RGDA-DDI, a residual graph attention network (residual-GAT) and dual-attention based framework for drug-drug interaction prediction. A residual-GAT module is introduced to simultaneously learn multi-scale feature representations from drugs and DDPs. In addition, a dual-attention based feature fusion block is constructed to learn local joint interaction representations. A series of evaluation metrics demonstrate that RGDA-DDI significantly improved DDI prediction performance on two public benchmark datasets, which provides new insight into drug development.
Submitted 27 August, 2024;
originally announced August 2024.
-
pVACview: an interactive visualization tool for efficient neoantigen prioritization and selection
Authors:
Huiming Xia,
My Hoang,
Evelyn Schmidt,
Susanna Kiwala,
Joshua McMichael,
Zachary L. Skidmore,
Bryan Fisk,
Jonathan J. Song,
Jasreet Hundal,
Thomas Mooney,
Jason R. Walker,
S. Peter Goedegebuure,
Christopher A. Miller,
William E. Gillanders,
Obi L. Griffith,
Malachi Griffith
Abstract:
Neoantigen targeting therapies including personalized vaccines have shown promise in the treatment of cancers. Accurate identification/prioritization of neoantigens is highly relevant to designing clinical trials, predicting treatment response, and understanding mechanisms of resistance. With the advent of massively parallel sequencing technologies, it is now possible to predict neoantigens based on patient-specific variant information. However, numerous factors must be considered when prioritizing neoantigens for use in personalized therapies. Complexities such as alternative transcript annotations, various binding, presentation and immunogenicity prediction algorithms, and variable peptide lengths/registers all potentially impact the neoantigen selection process. While computational tools generate numerous algorithmic predictions for neoantigen characterization, results from these pipelines are difficult to navigate and require extensive knowledge of the underlying tools for accurate interpretation. Due to the intricate nature and number of salient neoantigen features, presenting all relevant information to facilitate candidate selection for downstream applications is a difficult challenge that current tools fail to address. We have created pVACview, the first interactive tool designed to aid in the prioritization and selection of neoantigen candidates for personalized neoantigen therapies. pVACview has a user-friendly and intuitive interface where users can upload, explore, select and export their neoantigen candidates. The tool allows users to visualize candidates using variant, transcript and peptide information. pVACview will allow researchers to analyze and prioritize neoantigen candidates with greater efficiency and accuracy in basic and translational settings. The application is available as part of the pVACtools pipeline at pvactools.org and as an online server at pvacview.org.
Submitted 11 June, 2024;
originally announced June 2024.
-
Combining Radiomics and Machine Learning Approaches for Objective ASD Diagnosis: Verifying White Matter Associations with ASD
Authors:
Junlin Song,
Yuzhuo Chen,
Yuan Yao,
Zetong Chen,
Renhao Guo,
Lida Yang,
Xinyi Sui,
Qihang Wang,
Xijiao Li,
Aihua Cao,
Wei Li
Abstract:
Autism Spectrum Disorder is a condition characterized by atypical brain development leading to impairments in social skills, communication abilities, repetitive behaviors, and sensory processing. There have been many studies combining brain MRI images with machine learning algorithms to achieve objective diagnosis of autism, but the correlation between white matter and autism has not been fully utilized. To address this gap, we develop a computer-aided diagnostic model focusing on white matter regions in brain MRI by employing radiomics and machine learning methods. This study introduced a MultiUNet model for segmenting white matter, leveraging the UNet architecture and utilizing manually segmented MRI images as the training data. Subsequently, we extracted white matter features using the Pyradiomics toolkit and applied different machine learning models such as Support Vector Machine, Random Forest, Logistic Regression, and K-Nearest Neighbors to predict autism. The prediction sets all exceeded 80% accuracy. Additionally, we employed a Convolutional Neural Network to analyze segmented white matter images, achieving a prediction accuracy of 86.84%. Notably, Support Vector Machine demonstrated the highest prediction accuracy at 89.47%. These findings not only underscore the efficacy of the models but also establish a link between white matter abnormalities and autism. Our study contributes to a comprehensive evaluation of various diagnostic models for autism and introduces a computer-aided diagnostic algorithm for early and objective autism diagnosis based on MRI white matter regions.
Submitted 25 May, 2024;
originally announced May 2024.
-
AAVDiff: Experimental Validation of Enhanced Viability and Diversity in Recombinant Adeno-Associated Virus (AAV) Capsids through Diffusion Generation
Authors:
Lijun Liu,
Jiali Yang,
Jianfei Song,
Xinglin Yang,
Lele Niu,
Zeqi Cai,
Hui Shi,
Tingjun Hou,
Chang-yu Hsieh,
Weiran Shen,
Yafeng Deng
Abstract:
Recombinant adeno-associated virus (rAAV) vectors have revolutionized gene therapy, but their broad tropism and suboptimal transduction efficiency limit their clinical applications. To overcome these limitations, researchers have focused on designing and screening capsid libraries to identify improved vectors. However, the large sequence space and limited resources present challenges in identifying viable capsid variants. In this study, we propose an end-to-end diffusion model to generate capsid sequences with enhanced viability. Using publicly available AAV2 data, we generated 38,000 diverse AAV2 viral protein (VP) sequences, and evaluated 8,000 for viral selection. The results attested to the superiority of our model compared to traditional methods. Additionally, in the absence of AAV9 capsid data, apart from one wild-type sequence, we used the same model to directly generate a number of viable sequences with up to 9 mutations. We transferred the remaining 30,000 samples to the AAV9 domain. Furthermore, we conducted mutagenesis on AAV9 VP hypervariable regions VI and V, contributing to the continuous improvement of the AAV9 VP sequence. This research represents a significant advancement in the design and functional validation of rAAV vectors, offering innovative solutions to enhance specificity and transduction efficiency in gene therapy applications.
Submitted 17 April, 2024; v1 submitted 16 April, 2024;
originally announced April 2024.
-
Pre-Training Protein Bi-level Representation Through Span Mask Strategy On 3D Protein Chains
Authors:
Jiale Zhao,
Wanru Zhuang,
Jia Song,
Yaqi Li,
Shuqi Lu
Abstract:
In recent years, there has been a surge in the development of 3D structure-based pre-trained protein models, representing a significant advancement over pre-trained protein language models in various downstream tasks. However, most existing structure-based pre-trained models primarily focus on the residue level, i.e., alpha carbon atoms, while ignoring other atoms like side chain atoms. We argue that modeling proteins at both residue and atom levels is important since the side chain atoms can also be crucial for numerous downstream tasks, for example, molecular docking. Nevertheless, we find that naively combining residue and atom information during pre-training typically fails. We identify that a key reason is the information leakage caused by the inclusion of atom structure in the input, which renders residue-level pre-training tasks trivial and results in insufficiently expressive residue representations. To address this issue, we introduce a span mask pre-training strategy on 3D protein chains to learn meaningful representations of both residues and atoms. This leads to a simple yet effective approach to learning protein representation suitable for diverse downstream tasks. Extensive experimental results on binding site prediction and function prediction tasks demonstrate that our proposed pre-training approach significantly outperforms other methods. Our code will be made public.
Submitted 2 June, 2024; v1 submitted 2 February, 2024;
originally announced February 2024.
-
Pro-PRIME: A general Temperature-Guided Language model to engineer enhanced Stability and Activity in Proteins
Authors:
Fan Jiang,
Mingchen Li,
Jiajun Dong,
Yuanxi Yu,
Xinyu Sun,
Banghao Wu,
Jin Huang,
Liqi Kang,
Yufeng Pei,
Liang Zhang,
Shaojie Wang,
Wenxue Xu,
Jingyao Xin,
Wanli Ouyang,
Guisheng Fan,
Lirong Zheng,
Yang Tan,
Zhiqiang Hu,
Yi Xiong,
Yan Feng,
Guangyu Yang,
Qian Liu,
Jie Song,
Jia Liu,
Liang Hong
, et al. (1 additional authors not shown)
Abstract:
Designing protein mutants of both high stability and activity is a critical yet challenging task in protein engineering. Here, we introduce PRIME, a deep learning model, which can suggest protein mutants of improved stability and activity without any prior experimental mutagenesis data of the specified protein. Leveraging temperature-aware language modeling, PRIME demonstrated superior predictive power compared to current state-of-the-art models on the public mutagenesis dataset over 283 protein assays. Furthermore, we validated PRIME's predictions on five proteins, examining the top 30-45 single-site mutations' impact on various protein properties, including thermal stability, antigen-antibody binding affinity, and the ability to polymerize non-natural nucleic acid or resilience to extreme alkaline conditions. Remarkably, over 30% of the AI-recommended mutants exhibited superior performance compared to their pre-mutation counterparts across all proteins and desired properties. Moreover, we have developed an efficient and successful method based on PRIME to rapidly obtain multi-site mutants with enhanced activity and stability. Hence, PRIME demonstrates general applicability in protein engineering.
Submitted 27 October, 2024; v1 submitted 24 July, 2023;
originally announced July 2023.
-
Recent advances in artificial intelligence for retrosynthesis
Authors:
Zipeng Zhong,
Jie Song,
Zunlei Feng,
Tiantao Liu,
Lingxiang Jia,
Shaolun Yao,
Tingjun Hou,
Mingli Song
Abstract:
Retrosynthesis is the cornerstone of organic chemistry, providing chemists in material and drug manufacturing access to poorly available and brand-new molecules. Conventional rule-based or expert-based computer-aided synthesis has obvious limitations, such as high labor costs and limited search space. In recent years, dramatic breakthroughs driven by artificial intelligence have revolutionized retrosynthesis. Here we aim to present a comprehensive review of recent advances in AI-based retrosynthesis. For both single-step and multi-step retrosynthesis, we first list their goals and provide a thorough taxonomy of existing methods. Afterwards, we analyze these methods in terms of their mechanism and performance, and introduce popular evaluation metrics for them, together with a detailed comparison among representative methods on several public datasets. In the next part we introduce popular databases and established platforms for retrosynthesis. Finally, this review concludes with a discussion about promising research directions in this field.
Submitted 14 January, 2023;
originally announced January 2023.
-
Root-aligned SMILES: A Tight Representation for Chemical Reaction Prediction
Authors:
Zipeng Zhong,
Jie Song,
Zunlei Feng,
Tiantao Liu,
Lingxiang Jia,
Shaolun Yao,
Min Wu,
Tingjun Hou,
Mingli Song
Abstract:
Chemical reaction prediction, involving forward synthesis and retrosynthesis prediction, is a fundamental problem in organic synthesis. A popular computational paradigm formulates synthesis prediction as a sequence-to-sequence translation problem, where the typical SMILES is adopted for molecule representations. However, the general-purpose SMILES neglects the characteristics of chemical reactions, where the molecular graph topology is largely unaltered from reactants to products, resulting in the suboptimal performance of SMILES if straightforwardly applied. In this article, we propose the root-aligned SMILES (R-SMILES), which specifies a tightly aligned one-to-one mapping between the product and the reactant SMILES for more efficient synthesis prediction. Due to the strict one-to-one mapping and reduced edit distance, the computational model is largely relieved from learning the complex syntax and dedicated to learning the chemical knowledge for reactions. We compare the proposed R-SMILES with various state-of-the-art baselines and show that it significantly outperforms them all, demonstrating the superiority of the proposed method.
Submitted 12 August, 2022; v1 submitted 21 March, 2022;
originally announced March 2022.
-
Multivariate functional group sparse regression: functional predictor selection
Authors:
Ali Mahzarnia,
Jun Song
Abstract:
In this paper, we propose methods for functional predictor selection and the estimation of smooth functional coefficients simultaneously in a scalar-on-function regression problem under a high-dimensional multivariate functional data setting. In particular, we develop two methods for functional group-sparse regression under a generic Hilbert space of infinite dimension. We show the convergence of the algorithms and the consistency of the estimation and the selection (oracle property) under infinite-dimensional Hilbert spaces. Simulation studies show the effectiveness of the methods in both the selection and the estimation of functional coefficients. Applications to functional magnetic resonance imaging (fMRI) reveal the regions of the human brain related to ADHD and IQ.
Submitted 8 July, 2021; v1 submitted 5 July, 2021;
originally announced July 2021.
-
Transition behavior of the seizure dynamics modulated by the astrocyte inositol triphosphate noise
Authors:
JiaJia Li,
Peihua Feng,
Liang Zhao,
Junying Chen,
Mengmeng Du,
Yangyang Yu,
Jian Song,
Ying Wu
Abstract:
Epilepsy is a neurological disorder with recurrent seizures of complexity and randomness. Until now, the mechanism of epileptic randomness has not been fully elucidated. Inspired by the recent finding that astrocyte GTPase-activating protein (G-protein)-coupled receptors could be involved in stochastic epileptic seizures, we proposed a neuron-astrocyte network model, incorporating the noise of the astrocytic second messenger, inositol triphosphate (IP3), which is modulated by G-protein-coupled receptor activation. Based on this model, we have statistically analysed the transitions of epileptic seizures by performing tens of simulation trials. Our simulation results show that an increase in the IP3 noise intensity induces depolarization-block epileptic seizures together with an increase in neuronal firing frequency. Meanwhile, a bistable state of neuronal firing emerges under certain noise intensities, during which the neuronal firing pattern switches between regular sparse spiking and epileptic seizure states. This random presence of epileptic seizures is absent when the noise intensity continues to increase, accompanied by an increase in the epileptic depolarization block duration. The simulation results also shed light on the fact that calcium signals in astrocytes play significant roles in the pattern formation of epileptic seizures. Our results provide a potential pathway for understanding epileptic randomness.
Submitted 31 October, 2022; v1 submitted 26 May, 2021;
originally announced June 2021.
-
Small-Angle X-Ray Scattering Signatures of Conformational Heterogeneity and Homogeneity of Disordered Protein Ensembles
Authors:
Jianhui Song,
Jichen Li,
Hue Sun Chan
Abstract:
Physically, disordered ensembles of non-homopolymeric polypeptides are expected to be heterogeneous; i.e., they should differ from those homogeneous ensembles of homopolymers that harbor an essentially unique relationship between average values of end-to-end distance $R_{\rm EE}$ and radius of gyration $R_{\rm g}$. It was posited recently, however, that small-angle X-ray scattering (SAXS) data on conformational dimensions of disordered proteins can be rationalized almost exclusively by homopolymer ensembles. Assessing this perspective, chain-model simulations are used to evaluate the discriminatory power of SAXS-determined molecular form factors (MFFs) with regard to homogeneous versus heterogeneous ensembles. The general approach adopted here is not bound by any assumption about ensemble encodability, in that the postulated heterogeneous ensembles we evaluated are not restricted to those entailed by simple interaction schemes. Our analysis of MFFs for certain heterogeneous ensembles with more narrowly distributed $R_{\rm EE}$ and $R_{\rm g}$ indicates that while they deviate from MFFs of homogeneous ensembles, the differences can be rather small. Remarkably, some heterogeneous ensembles with asphericity and $R_{\rm EE}$ drastically different from those of homogeneous ensembles can nonetheless exhibit practically identical MFFs, demonstrating that SAXS MFFs do not afford unique characterizations of basic properties of conformational ensembles in general. In other words, the ensemble-to-MFF mapping is practically many-to-one and likely non-smooth. Heteropolymeric variations of the $R_{\rm EE}$--$R_{\rm g}$ relationship were further showcased using an analytical perturbation theory developed here for flexible heteropolymers. Ramifications of our findings for interpretation of experimental data are discussed.
Submitted 9 June, 2021; v1 submitted 27 May, 2021;
originally announced May 2021.
-
Pairwise heuristic sequence alignment algorithm based on deep reinforcement learning
Authors:
Yong Joon Song,
Dong Jin Ji,
Hye In Seo,
Gyu Bum Han,
Dong Ho Cho
Abstract:
Various methods have been developed to analyze the association between organisms and their genomic sequences. Among them, sequence alignment is the most frequently used for comparative analysis of biological genomes. However, the complexity of traditional sequence alignment methods grows considerably with the length of the sequences, and it is significantly challenging to align long sequences such as a human genome. Currently, several multiple sequence alignment algorithms are available that can reduce the complexity and improve the alignment performance for various genomes. However, there have been relatively few attempts to improve the alignment performance of pairwise alignment algorithms. To address these problems, we propose a new sequence alignment method using deep reinforcement learning. This research shows how deep reinforcement learning can be applied to a sequence alignment system and how it can improve the conventional sequence alignment method.
Submitted 26 October, 2020;
originally announced October 2020.
-
Robust Nucleus Detection with Partially Labeled Exemplars
Authors:
Linqing Feng,
Jun Ho Song,
Jiwon Kim,
Soomin Jeong,
Jin Sung Park,
Jinhyun Kim
Abstract:
Quantitative analysis of cell nuclei in microscopic images is an essential yet challenging source of biological and pathological information. The major challenge is accurate detection and segmentation of densely packed nuclei in images acquired under a variety of conditions. Mask R-CNN-based methods have achieved state-of-the-art nucleus segmentation. However, the current pipeline requires fully annotated training images, which are time-consuming to create and sometimes noisy. Importantly, nuclei often appear similar within the same image. This similarity could be utilized to segment nuclei with only partially labeled training examples. We propose a simple yet effective region-proposal module for the current Mask R-CNN pipeline to perform few-exemplar learning. To capture the similarities between unlabeled regions and labeled nuclei, we apply decomposed self-attention to learned features. On the self-attention map, we observe strong activation at the centers and edges of all nuclei, including unlabeled nuclei. On this basis, our region-proposal module propagates partial annotations to the whole image and proposes effective bounding boxes for the bounding box-regression and binary mask-generation modules. Our method effectively learns from unlabeled regions, thereby improving detection performance. We test our method with various nuclear images. When trained with only 1/4 of the nuclei annotated, our approach retains a detection accuracy comparable to that from training with fully annotated data. Moreover, our method can serve as a bootstrapping step to create full annotations of datasets, iteratively generating and correcting annotations until a predetermined coverage and accuracy are reached. The source code is available at https://github.com/feng-lab/nuclei.
Submitted 13 November, 2019; v1 submitted 23 July, 2019;
originally announced July 2019.
-
The lubricity of mucin solutions is robust toward changes in physiological conditions
Authors:
Jian Song,
Benjamin Winkeljann,
Oliver Lieleg
Abstract:
Solutions of manually purified gastric mucins have been shown to be promising lubricants for biomedical purposes, where they can efficiently reduce friction and wear. However, so far, such mucin solutions have been mostly tested in specific settings, and variations in the composition of the lubricating fluid have not been systematically explored. We here fill this gap and determine the viscosity, adsorption behavior, and lubricity of porcine gastric mucin solutions on hydrophobic surfaces at different pH levels, mucin and salt concentrations and in the presence of other proteins. We demonstrate that mucin solutions provide excellent lubricity even at very low concentrations of 0.01 % (w/v), over a broad range of pH levels and even at elevated ionic strength. Furthermore, we provide mechanistic insights into mucin lubricity, which help explain how certain variations in physiologically relevant parameters can limit the lubricating potential of mucin solutions. Our results suggest that solutions of manually purified mucins can be powerful biomedical lubricants, e.g. serving as eye drops, mouth sprays or as a personal lubricant for intercourse.
Submitted 18 July, 2019; v1 submitted 18 April, 2019;
originally announced April 2019.
-
Pro-arrhythmogenic effects of heterogeneous tissue curvature: A suggestion for role of left atrial appendage in atrial fibrillation
Authors:
Jun-Seop Song,
Jaehyeok Kim,
Byounghyun Lim,
Young-Seon Lee,
Minki Hwang,
Boyoung Joung,
Eun Bo Shim,
Hui-Nam Pak
Abstract:
Background: The arrhythmogenic role of atrial complex morphology has not yet been clearly elucidated. We hypothesized that bumpy tissue geometry can induce action potential duration (APD) dispersion and wavebreak in atrial fibrillation (AF).
Methods and Results: We simulated a 2D-bumpy atrial model by varying the degree of bumpiness, and 3D-left atrial (LA) models integrated with LA computed tomographic (CT) images taken from 14 patients with persistent AF. We also analyzed wave-dynamic parameters with bipolar electrograms during AF and compared them with LA-CT geometry in 30 patients with persistent AF. In the 2D-bumpy model, APD dispersion increased (p<0.001) and wavebreak occurred spontaneously when the surface bumpiness was higher, showing phase transition-like behavior (p<0.001). The bumpiness-gradient 2D model showed that the spiral wave drifted in the direction of higher bumpiness, and phase singularity (PS) points were mostly located in areas with higher bumpiness. In the 3D-LA model, PS density was higher in the LA appendage (LAA) than in other LA parts (p<0.05). In 30 persistent AF patients, the surface bumpiness of the LAA was 5.8 times that of other LA parts (p<0.001), and exceeded the critical bumpiness needed to induce wavebreak. Wave dynamics complexity parameters were consistently dominant in the LAA (p<0.001).
Conclusion: The bumpy tissue geometry promotes APD dispersion, wavebreak, and spiral wave drift in in silico human atrial tissue, and corresponds to clinical electro-anatomical maps.
Submitted 14 September, 2018; v1 submitted 5 March, 2018;
originally announced March 2018.
-
Quantum transport senses community structure in networks
Authors:
Chenchao Zhao,
Jun S. Song
Abstract:
Quantum time evolution exhibits rich physics, attributable to the interplay between the density and phase of a wave function. However, unlike classical heat diffusion, the wave nature of quantum mechanics has not yet been extensively explored in modern data analysis. We propose that the Laplace transform of quantum transport (QT) can be used to construct an ensemble of maps from a given complex network to a circle $S^1$, such that closely-related nodes on the network are grouped into sharply concentrated clusters on $S^1$. The resulting QT clustering (QTC) algorithm is as powerful as the state-of-the-art spectral clustering in discerning complex geometric patterns and more robust when clusters show strong density variations or heterogeneity in size. The observed phenomenon of QTC can be interpreted as a collective behavior of the microscopic nodes that evolve as macroscopic cluster orbitals in an effective tight-binding model recapitulating the network. Python source code implementing the algorithm and examples are available at https://github.com/jssong-lab/QTC.
Submitted 12 January, 2018; v1 submitted 14 November, 2017;
originally announced November 2017.
-
Causality inference in stochastic systems from neurons to currencies: Profiting from small sample size
Authors:
Danh-Tai Hoang,
Juyong Song,
Vipul Periwal,
Junghyo Jo
Abstract:
Success in modeling complex phenomena such as human perception hinges critically on the availability of data and computational power. Significant progress has been made in modeling such phenomena using probabilistic methods, particularly in image analysis and speech recognition. Maximum Likelihood Estimation (MLE) combined with Bayesian model selection is the basis of much of this progress, as MLE converges to the true model with copious data. In the sciences, large enough datasets are rarae aves, so alternatives to MLE must be developed for small sample size. We introduce a data-driven statistical physics approach to model inference based on minimizing a free energy of data and show superior model recovery for small sample sizes. We demonstrate coupling strength inference in non-equilibrium kinetic Ising models, including in the difficult large coupling variability regime, and show scaling to systems of arbitrary size. As applications, we infer a functional connectivity network in the salamander retina and a currency exchange rate network from time-series data of neuronal spiking and currency exchange rates, respectively. Accurate small sample size inference is critical for devising a profitable currency hedging strategy.
Submitted 8 May, 2018; v1 submitted 17 May, 2017;
originally announced May 2017.
-
Conformational Heterogeneity and FRET Data Interpretation for Dimensions of Unfolded Proteins
Authors:
Jianhui Song,
Gregory-Neal Gomes,
Tongfei Shi,
Claudiu C. Gradinaru,
Hue Sun Chan
Abstract:
A mathematico-physically valid formulation is required to infer properties of disordered protein conformations from single-molecule Förster resonance energy transfer (smFRET). Conformational dimensions inferred by conventional approaches that presume a homogeneous conformational ensemble can be unphysical. When all possible---heterogeneous as well as homogeneous---conformational distributions are taken into account without prejudgement, a single value of average transfer efficiency $\langle E\rangle$ between dyes at two chain ends is generally consistent with highly diverse, multiple values of the average radius of gyration $\langle R_{\rm g}\rangle$. Here we utilize unbiased conformational statistics from a coarse-grained explicit-chain model to establish a general logical framework to quantify this fundamental ambiguity in smFRET inference. As an application, we address the long-standing controversy regarding the denaturant dependence of $\langle R_{\rm g}\rangle$ of unfolded proteins, focusing on Protein L as an example. Conventional smFRET inference concluded that $\langle R_{\rm g}\rangle$ of unfolded Protein L is highly sensitive to [GuHCl], but data from small-angle X-ray scattering (SAXS) suggested a near-constant $\langle R_{\rm g}\rangle$ irrespective of [GuHCl]. Strikingly, the present analysis indicates that although the reported $\langle E\rangle$ values for Protein L at [GuHCl] = 1 M and 7 M are very different at 0.75 and 0.45, respectively, the Bayesian $R^2_{\rm g}$ distributions consistent with these two $\langle E\rangle$ values overlap by as much as $75\%$. Our findings suggest, in general, that the smFRET-SAXS discrepancy regarding unfolded protein dimensions likely arises from highly heterogeneous conformational ensembles at low or zero denaturant, and that additional experimental probes are needed to ascertain the nature of this heterogeneity.
Submitted 31 July, 2017; v1 submitted 17 May, 2017;
originally announced May 2017.
-
Exact heat kernel on a hypersphere and its applications in kernel SVM
Authors:
Chenchao Zhao,
Jun S. Song
Abstract:
Many contemporary statistical learning methods assume a Euclidean feature space. This paper presents a method for defining similarity based on hyperspherical geometry and shows that it often improves the performance of support vector machines compared to other competing similarity measures. Specifically, the idea of using heat diffusion on a hypersphere to measure similarity has been previously proposed, demonstrating promising results based on a heuristic heat kernel obtained from the zeroth order parametrix expansion; however, how well this heuristic kernel agrees with the exact hyperspherical heat kernel remains unknown. This paper presents a higher order parametrix expansion of the heat kernel on a unit hypersphere and discusses several problems associated with this expansion method. We then compare the heuristic kernel with an exact form of the heat kernel expressed in terms of a uniformly and absolutely convergent series in high-dimensional angular momentum eigenmodes. Being a natural measure of similarity between sample points dwelling on a hypersphere, the exact kernel often shows superior performance in kernel SVM classifications applied to text mining, tumor somatic mutation imputation, and stock market analysis.
Submitted 19 November, 2017; v1 submitted 4 February, 2017;
originally announced February 2017.
-
Random-phase-approximation theory for sequence-dependent, biologically functional liquid-liquid phase separation of intrinsically disordered proteins
Authors:
Yi-Hsuan Lin,
Jianhui Song,
Julie D. Forman-Kay,
Hue Sun Chan
Abstract:
Intrinsically disordered proteins (IDPs) are typically low in nonpolar/hydrophobic but relatively high in polar, charged, and aromatic amino acid compositions. Some IDPs undergo liquid-liquid phase separation in the aqueous milieu of the living cell. The resulting phase with enhanced IDP concentration can function as a major component of membraneless organelles that, by creating their own IDP-rich microenvironments, stimulate critical biological functions. IDP phase behaviors are governed by their amino acid sequences. To make progress in understanding this sequence-phase relationship, we report further advances in a recently introduced application of random-phase-approximation (RPA) heteropolymer theory to account for sequence-specific electrostatics in IDP phase separation. Here we examine computed variations in phase behavior with respect to block length and charge density of model polyampholytes of alternating equal-length charge blocks to gain insight into trends observed in IDP phase separation. As a real-life example, the theory is applied to rationalize/predict binodal and spinodal phase behaviors of the 236-residue N-terminal disordered region of RNA helicase Ddx4 and its charge-scrambled mutant for which experimental data are available. Fundamental differences are noted between the phase diagrams predicted by RPA and those predicted by mean-field Flory-Huggins and Overbeek-Voorn/Debye-Hückel theories. In the RPA context, a physically plausible dependence of relative permittivity on protein concentration can produce a cooperative effect in favor of IDP-IDP attraction and thus a significantly increased tendency to phase separate. Ramifications of these findings for future development of IDP phase separation theory are discussed.
Submitted 26 September, 2016;
originally announced September 2016.
-
Minimal Perceptrons for Memorizing Complex Patterns
Authors:
Marissa Pastor,
Juyong Song,
Danh-Tai Hoang,
Junghyo Jo
Abstract:
Feedforward neural networks have been investigated to understand learning and memory, as well as applied to numerous practical problems in pattern classification. It is a rule of thumb that more complex tasks require larger networks. However, the design of optimal network architectures for specific tasks is still an unsolved fundamental problem. In this study, we consider three-layered neural networks for memorizing binary patterns. We developed a new complexity measure of binary patterns, and estimated the minimal network size for memorizing them as a function of their complexity. We formulated the minimal network size for regular, random, and complex patterns. In particular, the minimal size for complex patterns, which are neither ordered nor disordered, was predicted by measuring their Hamming distances from known ordered patterns. Our predictions agreed with simulations based on the back-propagation algorithm.
Submitted 11 December, 2015;
originally announced December 2015.
-
Spectral Learning of Large Structured HMMs for Comparative Epigenomics
Authors:
Chicheng Zhang,
Jimin Song,
Kevin C Chen,
Kamalika Chaudhuri
Abstract:
We develop a latent variable model and an efficient spectral algorithm motivated by the recent emergence of very large data sets of chromatin marks from multiple human cell types. A natural model for chromatin data in one cell type is a Hidden Markov Model (HMM); we model the relationship between multiple cell types by connecting their hidden states by a fixed tree of known structure. The main challenge with learning parameters of such models is that iterative methods such as EM are very slow, while naive spectral methods result in time and space complexity exponential in the number of cell types. We exploit properties of the tree structure of the hidden states to provide spectral algorithms that are more computationally efficient for current biological datasets. We provide sample complexity bounds for our algorithm and evaluate it experimentally on biological data from nine human cell types. Finally, we show that beyond our specific model, some of our algorithmic ideas can be applied to other graphical models.
Submitted 4 June, 2015;
originally announced June 2015.
-
BayMeth: Improved DNA methylation quantification for affinity capture sequencing data using a flexible Bayesian approach
Authors:
Andrea Riebler,
Mirco Menigatti,
Jenny Z. Song,
Aaron L. Statham,
Clare Stirzaker,
Nadiya Mahmud,
Charles A. Mein,
Susan J. Clark,
Mark D. Robinson
Abstract:
DNA methylation (DNAme) is a critical component of the epigenetic regulatory machinery and aberrations in DNAme patterns occur in many diseases, such as cancer. Mapping and understanding DNAme profiles offers considerable promise for reversing the aberrant states. There are several approaches to analyze DNAme, which vary widely in cost, resolution and coverage. Affinity capture and high-throughput sequencing of methylated DNA strike a good balance between the high cost of whole genome bisulphite sequencing (WGBS) and the low coverage of methylation arrays. However, existing methods cannot adequately differentiate between hypomethylation patterns and low capture efficiency, and do not offer flexibility to integrate copy number variation (CNV). Furthermore, no uncertainty estimates are provided, which may prove useful for combining data from multiple protocols or propagating into downstream analysis. We propose an empirical Bayes framework that uses a fully methylated (i.e. SssI treated) control sample to transform observed read densities into regional methylation estimates. In our model, inefficient capture can be distinguished from low methylation levels by means of larger posterior variances. Furthermore, we can integrate CNV by introducing a multiplicative offset into our Poisson model framework. Notably, our model offers analytic expressions for the mean and variance of the methylation level and thus is fast to compute. Our algorithm outperforms existing approaches in terms of bias, mean-squared error and coverage probabilities as illustrated on multiple reference datasets. Although our method provides advantages even without the SssI-control, considerable improvement is achieved by its incorporation. Our method can be applied to methylated DNA affinity enrichment assays (e.g. MBD-seq, MeDIP-seq) and a software implementation is available in the Bioconductor Repitools package.
Submitted 11 December, 2013;
originally announced December 2013.
-
Improved annotation of 3-prime untranslated regions and complex loci by combination of strand-specific Direct RNA Sequencing, RNA-seq and ESTs
Authors:
Nick Schurch,
Christian Cole,
Alexander Sherstnev,
Junfang Song,
Céline Duc,
Kate G. Storey,
W. H. Irwin McLean,
Sara J. Brown,
Gordon G. Simpson,
Geoffrey J. Barton
Abstract:
The reference annotations made for a genome sequence provide the framework for all subsequent analyses of the genome. Correct annotation is particularly important when interpreting the results of RNA-seq experiments where short sequence reads are mapped against the genome and assigned to genes according to the annotation. Inconsistencies in annotations between the reference and the experimental system can lead to incorrect interpretation of the effect on RNA expression of an experimental treatment or mutation in the system under study. Until recently, the genome-wide annotation of 3-prime untranslated regions received less attention than coding regions and the delineation of intron/exon boundaries. In this paper, data produced for samples in Human, Chicken and A. thaliana by the novel single-molecule, strand-specific, Direct RNA Sequencing technology from Helicos Biosciences, which locates 3-prime polyadenylation sites to within +/- 2 nt, were combined with archival EST and RNA-Seq data. Nine examples are illustrated where this combination of data allowed: (1) gene and 3-prime UTR re-annotation (including extension of one 3-prime UTR by 5.9 kb); (2) disentangling of gene expression in complex regions; (3) clearer interpretation of small RNA expression and (4) identification of novel genes. While the specific examples displayed here may become obsolete as genome sequences and their annotations are refined, the principles laid out in this paper will be of general use both to those annotating genomes and those seeking to interpret existing publicly available annotations in the context of their own experimental data.
Submitted 11 November, 2013;
originally announced November 2013.