-
Strategic priorities for transformative progress in advancing biology with proteomics and artificial intelligence
Authors:
Yingying Sun,
Jun A,
Zhiwei Liu,
Rui Sun,
Liujia Qian,
Samuel H. Payne,
Wout Bittremieux,
Markus Ralser,
Chen Li,
Yi Chen,
Zhen Dong,
Yasset Perez-Riverol,
Asif Khan,
Chris Sander,
Ruedi Aebersold,
Juan Antonio Vizcaíno,
Jonathan R Krieger,
Jianhua Yao,
Han Wen,
Linfeng Zhang,
Yunping Zhu,
Yue Xuan,
Benjamin Boyang Sun,
Liang Qiao,
Henning Hermjakob
, et al. (37 additional authors not shown)
Abstract:
Artificial intelligence (AI) is transforming scientific research, including proteomics. Advances in mass spectrometry (MS)-based proteomics data quality, diversity, and scale, combined with groundbreaking AI techniques, are unlocking new challenges and opportunities in biological discovery. Here, we highlight key areas where AI is driving innovation, from data analysis to new biological insights. These include developing an AI-friendly ecosystem for proteomics data generation, sharing, and analysis; improving peptide and protein identification and quantification; characterizing protein-protein interactions and protein complexes; advancing spatial and perturbation proteomics; integrating multi-omics data; and ultimately enabling AI-empowered virtual cells.
Submitted 21 February, 2025;
originally announced February 2025.
-
Double-Ended Synthesis Planning with Goal-Constrained Bidirectional Search
Authors:
Kevin Yu,
Jihye Roh,
Ziang Li,
Wenhao Gao,
Runzhong Wang,
Connor W. Coley
Abstract:
Computer-aided synthesis planning (CASP) algorithms have demonstrated expert-level abilities in planning retrosynthetic routes to molecules of low to moderate complexity. However, current search methods assume the sufficiency of reaching arbitrary building blocks, failing to address the common real-world constraint where using specific molecules is desired. To this end, we present a formulation of synthesis planning with starting material constraints. Under this formulation, we propose Double-Ended Synthesis Planning (DESP), a novel CASP algorithm under a bidirectional graph search scheme that interleaves expansions from the target and from the goal starting materials to ensure constraint satisfiability. The search algorithm is guided by a goal-conditioned cost network learned offline from a partially observed hypergraph of valid chemical reactions. We demonstrate the utility of DESP in improving solve rates and reducing the number of search expansions by biasing synthesis planning towards expert goals on multiple new benchmarks. DESP can make use of existing one-step retrosynthesis models, and we anticipate its performance to scale as these one-step model capabilities improve.
Submitted 1 November, 2024; v1 submitted 8 July, 2024;
originally announced July 2024.
-
Construction of extra-large scale screening tools for risks of severe mental illnesses using real world healthcare data
Authors:
Dianbo Liu,
Karmel W. Choi,
Paulo Lizano,
William Yuan,
Kun-Hsing Yu,
Jordan W. Smoller,
Isaac Kohane
Abstract:
Importance: The prevalence of severe mental illnesses (SMIs) in the United States is approximately 3% of the whole population. The ability to conduct risk screening of SMIs at large scale could inform early prevention and treatment.
Objective: A scalable machine learning-based tool was developed to conduct population-level risk screening for SMIs, including schizophrenia, schizoaffective disorders, psychosis, and bipolar disorders, using 1) healthcare insurance claims and 2) electronic health records (EHRs).
Design, setting and participants: Data from beneficiaries from a nationwide commercial healthcare insurer with 77.4 million members and data from patients from EHRs from eight academic hospitals based in the U.S. were used. First, the predictive models were constructed and tested using data in case-control cohorts from insurance claims or EHR data. Second, performance of the predictive models across data sources was analyzed. Third, as an illustrative application, the models were further trained to predict risks of SMIs among 18-year-old young adults and individuals with substance-associated conditions.
Main outcomes and measures: Machine learning-based predictive models for SMIs in the general population were built based on insurance claims and EHR.
Submitted 12 January, 2023; v1 submitted 20 December, 2022;
originally announced December 2022.
-
Hyperbolic Molecular Representation Learning for Drug Repositioning
Authors:
Ke Yu,
Shyam Visweswaran,
Kayhan Batmanghelich
Abstract:
Learning accurate drug representations is essential for tasks such as computational drug repositioning. A drug hierarchy is a valuable source that encodes knowledge of relations among drugs in a tree-like structure where drugs that act on the same organs, treat the same disease, or bind to the same biological target are grouped together. However, its utility in learning drug representations has not yet been explored, and currently described drug representations cannot place novel molecules in a drug hierarchy. Here, we develop a semi-supervised drug embedding that incorporates two sources of information: (1) underlying chemical grammar that is inferred from chemical structures of drugs and drug-like molecules (unsupervised), and (2) hierarchical relations that are encoded in an expert-crafted hierarchy of approved drugs (supervised). We use the Variational Auto-Encoder (VAE) framework to encode the chemical structures of molecules and use the drug-drug similarity information obtained from the hierarchy to induce the clustering of drugs in hyperbolic space. The hyperbolic space is amenable to encoding hierarchical relations. Our qualitative results support that the learned drug embedding can induce the hierarchical relations among drugs. We demonstrate that the learned drug embedding can be used for drug repositioning.
Submitted 6 July, 2022;
originally announced August 2022.
-
Ten Quick Tips for Deep Learning in Biology
Authors:
Benjamin D. Lee,
Anthony Gitter,
Casey S. Greene,
Sebastian Raschka,
Finlay Maguire,
Alexander J. Titus,
Michael D. Kessler,
Alexandra J. Lee,
Marc G. Chevrette,
Paul Allen Stewart,
Thiago Britto-Borges,
Evan M. Cofer,
Kun-Hsing Yu,
Juan Jose Carmona,
Elana J. Fertig,
Alexandr A. Kalinin,
Beth Signal,
Benjamin J. Lengerich,
Timothy J. Triche Jr,
Simina M. Boca
Abstract:
Machine learning is a modern approach to problem-solving and task automation. In particular, machine learning is concerned with the development and applications of algorithms that can recognize patterns in data and use them for predictive modeling. Artificial neural networks are a particular class of machine learning algorithms and models that evolved into what is now described as deep learning. Given the computational advances made in the last decade, deep learning can now be applied to massive data sets and in innumerable contexts. Therefore, deep learning has become its own subfield of machine learning. In the context of biological research, it has been increasingly used to derive novel insights from high-dimensional biological data. To make the biological applications of deep learning more accessible to scientists who have some experience with machine learning, we solicited input from a community of researchers with varied biological and deep learning interests. These individuals collaboratively contributed to this manuscript's writing using the GitHub version control platform and the Manubot manuscript generation toolset. The goal was to articulate a practical, accessible, and concise set of guidelines and suggestions to follow when using deep learning. In the course of our discussions, several themes became clear: the importance of understanding and applying machine learning fundamentals as a baseline for utilizing deep learning, the necessity for extensive model comparisons with careful evaluation, and the need for critical thought in interpreting results generated by deep learning, among others.
Submitted 29 May, 2021;
originally announced May 2021.
-
Semi-Supervised Hierarchical Drug Embedding in Hyperbolic Space
Authors:
Ke Yu,
Shyam Visweswaran,
Kayhan Batmanghelich
Abstract:
Learning accurate drug representation is essential for tasks such as computational drug repositioning and prediction of drug side-effects. A drug hierarchy is a valuable source that encodes human knowledge of drug relations in a tree-like structure where drugs that act on the same organs, treat the same disease, or bind to the same biological target are grouped together. However, its utility in learning drug representations has not yet been explored, and currently described drug representations cannot place novel molecules in a drug hierarchy. Here, we develop a semi-supervised drug embedding that incorporates two sources of information: (1) underlying chemical grammar that is inferred from molecular structures of drugs and drug-like molecules (unsupervised), and (2) hierarchical relations that are encoded in an expert-crafted hierarchy of approved drugs (supervised). We use the Variational Auto-Encoder (VAE) framework to encode the chemical structures of molecules and use the knowledge-based drug-drug similarity to induce the clustering of drugs in hyperbolic space. The hyperbolic space is amenable for encoding hierarchical concepts. Both quantitative and qualitative results support that the learned drug embedding can accurately reproduce the chemical structure and induce the hierarchical relations among drugs. Furthermore, our approach can infer the pharmacological properties of novel molecules by retrieving similar drugs from the embedding space. We demonstrate that the learned drug embedding can be used to find new uses for existing drugs and to discover side-effects. We show that it significantly outperforms baselines in both tasks.
Submitted 1 June, 2020;
originally announced June 2020.
-
The Alzheimer's Disease Prediction Of Longitudinal Evolution (TADPOLE) Challenge: Results after 1 Year Follow-up
Authors:
Razvan V. Marinescu,
Neil P. Oxtoby,
Alexandra L. Young,
Esther E. Bron,
Arthur W. Toga,
Michael W. Weiner,
Frederik Barkhof,
Nick C. Fox,
Arman Eshaghi,
Tina Toni,
Marcin Salaterski,
Veronika Lunina,
Manon Ansart,
Stanley Durrleman,
Pascal Lu,
Samuel Iddi,
Dan Li,
Wesley K. Thompson,
Michael C. Donohue,
Aviv Nahon,
Yarden Levy,
Dan Halbersberg,
Mariya Cohen,
Huiling Liao,
Tengfei Li
, et al. (71 additional authors not shown)
Abstract:
We present the findings of "The Alzheimer's Disease Prediction Of Longitudinal Evolution" (TADPOLE) Challenge, which compared the performance of 92 algorithms from 33 international teams at predicting the future trajectory of 219 individuals at risk of Alzheimer's disease. Challenge participants were required to make a prediction, for each month of a 5-year future time period, of three key outcomes: clinical diagnosis, Alzheimer's Disease Assessment Scale Cognitive Subdomain (ADAS-Cog13), and total volume of the ventricles. The methods used by challenge participants included multivariate linear regression, machine learning methods such as support vector machines and deep neural networks, as well as disease progression models. No single submission was best at predicting all three outcomes. For clinical diagnosis and ventricle volume prediction, the best algorithms strongly outperform simple baselines in predictive ability. However, for ADAS-Cog13 no single submitted prediction method was significantly better than random guesswork. Two ensemble methods, based on taking the mean and median over all predictions, obtained top scores on almost all tasks. Better-than-average performance at diagnosis prediction was generally associated with the additional inclusion of features from cerebrospinal fluid (CSF) samples and diffusion tensor imaging (DTI). On the other hand, better performance at ventricle volume prediction was associated with inclusion of summary statistics, such as the slope or maxima/minima of biomarkers. TADPOLE's unique results suggest that current prediction algorithms provide sufficient accuracy to exploit biomarkers related to clinical diagnosis and ventricle volume, for cohort refinement in clinical trials for Alzheimer's disease. However, results call into question the usage of cognitive test scores for patient selection and as a primary endpoint in clinical trials.
Submitted 27 December, 2021; v1 submitted 9 February, 2020;
originally announced February 2020.
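The TADPOLE abstract above reports that two ensembles, formed by taking the mean and the median over all predictions, scored near the top on almost all tasks. A minimal sketch of that kind of element-wise combination follows. It is an illustration only, not the challenge's actual code: the function name `ensemble_forecasts`, the list-of-lists input layout, and the numbers are all hypothetical.

```python
from statistics import mean, median


def ensemble_forecasts(predictions):
    """Combine forecasts from several submissions, time point by time point.

    `predictions` is a list with one entry per submission; each entry is a
    list of numeric forecasts of the same length (one value per time point).
    Returns a (mean_forecast, median_forecast) pair of lists.
    """
    if not predictions:
        raise ValueError("need at least one submission")
    horizon = len(predictions[0])
    if any(len(p) != horizon for p in predictions):
        raise ValueError("all submissions must cover the same time points")

    # zip(*predictions) regroups the values so that each tuple holds every
    # submission's forecast for a single time point.
    per_time_point = list(zip(*predictions))
    mean_forecast = [mean(values) for values in per_time_point]
    median_forecast = [median(values) for values in per_time_point]
    return mean_forecast, median_forecast


# Hypothetical forecasts from three submissions over three time points.
submissions = [
    [1.0, 2.0, 3.0],
    [2.0, 2.0, 5.0],
    [6.0, 5.0, 4.0],
]
mean_forecast, median_forecast = ensemble_forecasts(submissions)
```

In this made-up example the third submission is an outlier at the first time point; the median forecast is less affected by it than the mean, which is one common reason to report both.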
-
Electrokinetic behavior of two touching inhomogeneous biological cells and colloidal particles: Effects of multipolar interactions
Authors:
J. P. Huang,
Mikko Karttunen,
K. W. Yu,
L. Dong,
G. Q. Gu
Abstract:
We present a theory to investigate electro-kinetic behavior, namely, electrorotation and dielectrophoresis under alternating current (AC) applied fields for a pair of touching inhomogeneous colloidal particles and biological cells. These inhomogeneous particles are treated as graded ones with physically motivated model dielectric and conductivity profiles. The mutual polarization interaction between the particles yields a change in their respective dipole moments, and hence in the AC electrokinetic spectra. The multipolar interactions between polarized particles are accurately captured by the multiple images method. In the point-dipole limit, our theory reproduces the known results. We find that the multipolar interactions as well as the spatial fluctuations inside the particles can affect the AC electrokinetic spectra significantly.
Submitted 7 November, 2003; v1 submitted 11 June, 2003;
originally announced June 2003.
-
Dielectric behavior of oblate spheroidal particles: Application to erythrocytes suspensions
Authors:
J. P. Huang,
K. W. Yu
Abstract:
We have investigated the effect of particle shape on the electrorotation (ER) spectrum of living cell suspensions. In particular, we consider coated oblate spheroidal particles and present a theoretical study of ER based on the spectral representation theory. Analytic expressions for the characteristic frequency as well as the dispersion strength can be obtained, thus simplifying the fitting of experimental data on oblate spheroidal cells that abound in the literature. From the theoretical analysis, we find that the cell shape, coating as well as material parameters can change the ER spectrum. We demonstrate good agreement between our theoretical predictions and experimental data on human erythrocyte suspensions.
Submitted 26 February, 2002;
originally announced February 2002.
-
Spectral Representation Theory for Dielectric Behavior of Nonspherical Cell Suspensions
Authors:
J. P. Huang,
K. W. Yu,
Jun Lei,
Hong Sun
Abstract:
Recent experiments revealed that the dielectric dispersion spectrum of fission yeast cells in a suspension was mainly composed of two sub-dispersions. The low-frequency sub-dispersion depended on the cell length, while the high-frequency one was independent of it. The cell shape effect was simulated by an ellipsoidal cell model but the comparison between theory and experiment was far from being satisfactory. Prompted by the discrepancy, we proposed the use of spectral representation to analyze more realistic cell models. We adopted a shell-spheroidal model to analyze the effects of the cell membrane. It is found that the dielectric property of the cell membrane has only a minor effect on the dispersion magnitude ratio and the characteristic frequency ratio. We further included the effect of rotation of dipole induced by an external electric field, and solved the dipole-rotation spheroidal model in the spectral representation. Good agreement between theory and experiment has been obtained.
Submitted 23 April, 2001;
originally announced April 2001.
-
Dielectric Behavior of Nonspherical Cell Suspensions
Authors:
Jun Lei,
Jones T. K. Wan,
K. W. Yu,
Hong Sun
Abstract:
Recent experiments revealed that the dielectric dispersion spectrum of fission yeast cells in a suspension was mainly composed of two sub-dispersions. The low-frequency sub-dispersion depended on the cell length, whereas the high-frequency one was independent of it. The cell shape effect was qualitatively simulated by an ellipsoidal cell model. However, the comparison between theory and experiment was far from being satisfactory. In an attempt to close the gap between theory and experiment, we considered the more realistic cells of spherocylinders, i.e., circular cylinders with two hemispherical caps at both ends. We have formulated a Green function formalism for calculating the spectral representation of cells of finite length. The Green function can be reduced because of the azimuthal symmetry of the cell. This simplification enables us to calculate the dispersion spectrum and hence assess the effect of cell structure on the dielectric behavior of cell suspensions.
Submitted 23 March, 2001;
originally announced March 2001.