-
RiboFlow: Conditional De Novo RNA Sequence-Structure Co-Design via Synergistic Flow Matching
Authors:
Runze Ma,
Zhongyue Zhang,
Zichen Wang,
Chenqing Hua,
Zhuomin Zhou,
Fenglei Cao,
Jiahua Rao,
Shuangjia Zheng
Abstract:
Ribonucleic acid (RNA) binds to molecules to achieve specific biological functions. While generative models are advancing biomolecule design, existing methods for designing RNA that target specific ligands face limitations in capturing RNA's conformational flexibility, ensuring structural validity, and overcoming data scarcity. To address these challenges, we introduce RiboFlow, a synergistic flow matching model to co-design RNA structures and sequences based on target molecules. By integrating RNA backbone frames, torsion angles, and sequence features in a unified architecture, RiboFlow explicitly models RNA's dynamic conformations while enforcing sequence-structure consistency to improve validity. Additionally, we curate RiboBind, a large-scale dataset of RNA-molecule interactions, to resolve the scarcity of high-quality structural data. Extensive experiments reveal that RiboFlow not only outperforms state-of-the-art RNA design methods by a large margin but also showcases controllable capabilities for achieving high binding affinity to target ligands. Our work bridges critical gaps in controllable RNA design, offering a framework for structure-aware, data-efficient generation.
Submitted 23 March, 2025; v1 submitted 21 March, 2025;
originally announced March 2025.
-
Active information, missing data and prevalence estimation
Authors:
Ola Hössjer,
Daniel Andrés Díaz-Pachón,
Chen Zhao,
J. Sunil Rao
Abstract:
The topic of this paper is prevalence estimation from the perspective of active information. Prevalence among tested individuals has an upward bias under the assumption that individuals' willingness to be tested for the disease increases with the strength of their symptoms. Active information due to testing bias quantifies the degree to which the willingness to be tested correlates with infection status. Interpreting incomplete testing as a missing data problem, the missingness mechanism impacts the degree to which the bias of the original prevalence estimate can be removed. The reduction in prevalence, when testing bias is adjusted for, translates into an active information due to bias correction, with opposite sign to active information due to testing bias. Prevalence and active information estimates are asymptotically normal, a behavior also illustrated through simulations.
Submitted 10 June, 2022;
originally announced June 2022.
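The upward bias described in this abstract can be illustrated with a small calculation. The sketch below is not the paper's estimator; it only applies Bayes' rule under the assumption that infected and non-infected individuals are tested with different probabilities. The function and parameter names are hypothetical.

```python
def prevalence_among_tested(prevalence, p_test_infected, p_test_healthy):
    """Share of tested individuals who are infected, by Bayes' rule.

    prevalence       -- true population prevalence (assumed known here)
    p_test_infected  -- probability that an infected person gets tested
    p_test_healthy   -- probability that a non-infected person gets tested
    """
    tested_infected = prevalence * p_test_infected
    tested_healthy = (1.0 - prevalence) * p_test_healthy
    return tested_infected / (tested_infected + tested_healthy)


# Illustrative numbers only: 10% true prevalence, infected people five
# times as likely to be tested as non-infected people.
naive_estimate = prevalence_among_tested(0.10, 0.50, 0.10)
```

With these illustrative inputs the prevalence among the tested is about 0.357, well above the true value of 0.10. When both testing probabilities are equal, the function returns the true prevalence.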
-
"Back to the future" projections for COVID-19 surges
Authors:
J. Sunil Rao,
Tianhao Liu,
Daniel Andrés Díaz-Pachón
Abstract:
We argue that information from countries that had earlier COVID-19 surges can be used to inform another country's current model, thereby generating what we call back-to-the-future (BTF) projections. We show that these projections can be used to accurately predict future COVID-19 surges prior to an inflection point of the daily infection curve. We show, across 12 different countries from all populated continents around the world, that our method can often predict future surges in scenarios where the traditional approaches would always predict no future surges. However, as expected, BTF projections cannot accurately predict a surge due to the emergence of a new variant. To generate BTF projections, we make use of a matching scheme for asynchronous time series combined with a response coaching SIR model.
Submitted 17 February, 2022;
originally announced February 2022.
-
Quantitative Evaluation of Explainable Graph Neural Networks for Molecular Property Prediction
Authors:
Jiahua Rao,
Shuangjia Zheng,
Yuedong Yang
Abstract:
Advances in machine learning have led to graph neural network-based methods for drug discovery, yielding promising results in molecular design, chemical synthesis planning, and molecular property prediction. However, current graph neural networks (GNNs) have gained only limited acceptance in drug discovery due to their lack of interpretability. Although this major weakness has been mitigated by the development of explainable artificial intelligence (XAI) techniques, the "ground truth" assignment in most explainable tasks ultimately rests with subjective human judgment, so the quality of model interpretation is hard to evaluate quantitatively. In this work, we first build three levels of benchmark datasets to quantitatively assess the interpretability of state-of-the-art GNN models. We then implement recent XAI methods in combination with different GNN algorithms to highlight the benefits, limitations, and future opportunities for drug discovery. As a result, GradInput and IG generally provide the best model interpretability for GNNs, especially when combined with GraphNet and CMPNN. The integrated and developed XAI package is fully open-sourced and can be used by practitioners to train new models on other drug discovery tasks.
Submitted 12 July, 2021; v1 submitted 1 July, 2021;
originally announced July 2021.
-
A simple correction for COVID-19 sampling bias
Authors:
Daniel Andrés Díaz-Pachón,
J Sunil Rao
Abstract:
COVID-19 testing has become a standard approach for estimating prevalence, which then assists in public health decision making to contain and mitigate the spread of the disease. The sampling designs used are often biased in that they do not reflect the true underlying populations. For instance, individuals with strong symptoms are more likely to be tested than those with no symptoms. This results in biased estimates of prevalence (too high). Typical post-sampling corrections are not always possible. Here we present a simple bias correction methodology derived and adapted from a correction for publication bias in meta-analysis studies. The methodology is general enough to allow a wide variety of customization, making it more useful in practice. Implementation is easily done using already collected information. Via a simulation and two real datasets, we show that the bias corrections can provide dramatic reductions in estimation error.
Submitted 11 January, 2021; v1 submitted 14 July, 2020;
originally announced July 2020.
-
Mobile phone location data reveal the effect and geographic variation of social distancing on the spread of the COVID-19 epidemic
Authors:
Song Gao,
Jinmeng Rao,
Yuhao Kang,
Yunlei Liang,
Jake Kruse,
Doerte Doepfer,
Ajay K. Sethi,
Juan Francisco Mandujano Reyes,
Jonathan Patz,
Brian S. Yandell
Abstract:
The emergence of SARS-CoV-2 and the coronavirus infectious disease (COVID-19) has become a pandemic. Social (physical) distancing is a key non-pharmacologic control measure to reduce the transmission rate of SARS-CoV-2, but high-level adherence is needed. Using daily travel distance and stay-at-home time derived from large-scale anonymous mobile phone location data provided by Descartes Labs and SafeGraph, we quantify the degree to which social distancing mandates have been followed in the U.S. and their effect on the growth of COVID-19 cases. The correlation between the COVID-19 growth rate and travel distance decay rate and dwell time at home change rate was -0.586 (95% CI: -0.742 ~ -0.370) and 0.526 (95% CI: 0.293 ~ 0.700), respectively. Increases in state-specific doubling time of total cases ranged from 1.04 ~ 6.86 days to 3.66 ~ 30.29 days after social distancing orders were put in place, consistent with mechanistic epidemic prediction models. Social distancing mandates reduce the spread of COVID-19 when they are followed.
Submitted 23 April, 2020;
originally announced April 2020.
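The doubling times quoted in this abstract relate to growth rates through a standard identity. The sketch below assumes cases grow by a constant fraction each day (pure exponential growth), which is a simplification and not necessarily how the authors computed their figures. The function name is hypothetical.

```python
import math


def doubling_time(daily_growth_rate):
    """Days for cumulative cases to double under constant daily growth.

    daily_growth_rate -- fractional increase per day, e.g. 0.10 for 10%.
    Uses T = ln(2) / ln(1 + r).
    """
    if daily_growth_rate <= 0:
        raise ValueError("growth rate must be positive for cases to double")
    return math.log(2) / math.log1p(daily_growth_rate)


# Illustrative rates only: slower daily growth means a longer doubling time.
fast = doubling_time(0.25)
slow = doubling_time(0.05)
```

Under this assumption, lowering the daily growth rate from 25% to 5% lengthens the doubling time from roughly 3 days to roughly 14 days.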
-
Mapping county-level mobility pattern changes in the United States in response to COVID-19
Authors:
Song Gao,
Jinmeng Rao,
Yuhao Kang,
Yunlei Liang,
Jake Kruse
Abstract:
To contain the coronavirus disease (COVID-19) pandemic, one of the non-pharmacological epidemic control measures is reducing the transmission rate of SARS-CoV-2 in the population through (physical) social distancing. An interactive web-based mapping platform that provides timely quantitative information on how people in different counties and states reacted to the social distancing guidelines was developed with the support of the National Science Foundation (NSF). It integrates geographic information systems (GIS) and daily updated human mobility statistical patterns derived from large-scale anonymized and aggregated smartphone location big data at the county level in the United States, and aims to increase risk awareness of the public, support governmental decision-making, and help enhance community responses to the COVID-19 outbreak.
Submitted 16 May, 2020; v1 submitted 9 April, 2020;
originally announced April 2020.
-
Model-Free Episodic Control
Authors:
Charles Blundell,
Benigno Uria,
Alexander Pritzel,
Yazhe Li,
Avraham Ruderman,
Joel Z Leibo,
Jack Rae,
Daan Wierstra,
Demis Hassabis
Abstract:
State-of-the-art deep reinforcement learning algorithms take many millions of interactions to attain human-level performance. Humans, on the other hand, can very quickly exploit highly rewarding nuances of an environment upon first discovery. In the brain, such rapid learning is thought to depend on the hippocampus and its capacity for episodic memory. Here we investigate whether a simple model of hippocampal episodic control can learn to solve difficult sequential decision-making tasks. We demonstrate that it not only attains a highly rewarding strategy significantly faster than state-of-the-art deep reinforcement learning algorithms, but also achieves a higher overall reward on some of the more challenging domains.
Submitted 14 June, 2016;
originally announced June 2016.
-
Unified theory of human genome reveals a constrained spatial chromosomal arrangement in interphase nuclei
Authors:
Sarosh N. Fatakia,
Ishita S. Mehta,
Basuthkar J. Rao
Abstract:
We investigate a densely packed, non-random arrangement of forty-six chromosomes (46,XY) in human nuclei. Here, we model systems-level chromosomal crosstalk by unifying intrinsic parameters (chromosomal length and number of genes) across all pairs of chromosomes in the genome to derive an extrinsic parameter called effective gene density. The hierarchical clustering and underlying degeneracy in the effective gene density space reveal previously unknown systems-level constraints on the spatial arrangement of clusters of chromosomes. Our findings corroborate experimental data on spatial chromosomal arrangement in human nuclei, from fibroblast and lymphocyte cell lines, thereby establishing that the human genome constrains chromosomal arrangement. We propose that this unified theory, which requires no additional experimental input, may be extended to other eukaryotic species with annotated genomes to infer their constrained self-organized spatial arrangement of chromosomes.
Submitted 19 October, 2015; v1 submitted 27 September, 2015;
originally announced September 2015.
-
Simplifying the mosaic description of DNA sequences
Authors:
Rajeev K. Azad,
J. Subba Rao,
Wentian Li,
Ramakrishna Ramaswamy
Abstract:
By using the Jensen-Shannon divergence, genomic DNA can be divided into compositionally distinct domains through a standard recursive segmentation procedure. Each domain, while significantly different from its neighbours, may however share compositional similarity with one or more distant (non-neighbouring) domains. We thus obtain a coarse-grained description of the given DNA string in terms of a smaller set of distinct domain labels. This yields a minimal domain description of a given DNA sequence, significantly reducing its organizational complexity. This procedure gives a new means of evaluating genomic complexity as one examines organisms ranging from bacteria to human. The mosaic organization of DNA sequences could have originated from the insertion of fragments of one genome (the parasite) inside another (the host), and we present numerical experiments that are suggestive of this scenario.
Submitted 27 July, 2002;
originally announced July 2002.
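The segmentation step named in this abstract can be sketched in a few lines. This is a minimal illustration, assuming single-nucleotide composition and length-proportional weights; it finds one split point only and omits the significance test and the recursion that a full procedure would need. All function names are hypothetical.

```python
import math
from collections import Counter


def entropy(seq):
    """Shannon entropy (bits) of the symbol composition of seq."""
    n = len(seq)
    if n == 0:
        return 0.0
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def js_divergence(left, right):
    """Jensen-Shannon divergence between two segments, weighted by length."""
    n = len(left) + len(right)
    w_left = len(left) / n
    w_right = len(right) / n
    return entropy(left + right) - w_left * entropy(left) - w_right * entropy(right)


def best_split(seq):
    """Return (position, divergence) of the split maximizing the divergence."""
    best_pos, best_js = None, 0.0
    for i in range(1, len(seq)):
        js = js_divergence(seq[:i], seq[i:])
        if js > best_js:
            best_pos, best_js = i, js
    return best_pos, best_js


# Toy sequence with an obvious compositional boundary in the middle.
position, divergence = best_split("AAAAAAAACCCCCCCC")
```

On the toy sequence the maximum falls at position 8, the boundary between the A-rich and C-rich halves. A recursive version would apply `best_split` again to each resulting segment until the divergence is no longer statistically significant.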