-
Contributions of the Petabyte Scale Sequence Search Codeathon toward efforts to scale sequence-based searches on SRA
Authors:
Priyanka Ghosh,
Kjiersten Fagnan,
Ryan Connor,
Ravinder Pannu,
Travis J. Wheeler,
Mihai Pop,
C. Titus Brown,
Tessa Pierce-Ward,
Rob Patro,
Jacquelyn S. Michaelis,
Thomas L. Madden,
Christiam Camacho,
Olaitan I. Awe,
Arianna I. Krinos,
René KM Xavier,
Rodrigo Ortega Polo,
Jack W. Roddy,
Adelaide Rhodes,
Alexander Sweeten,
Adrian Viehweger,
Bariş Ekim,
Harihara Subrahmaniam Muralidharan,
Amatur Rahman,
Vinícius W. Salazar,
Andrew Tritt
, et al. (13 additional authors not shown)
Abstract:
The volume of biological data being generated by the scientific community is growing exponentially, reflecting technological advances and research activities. The National Institutes of Health's (NIH) Sequence Read Archive (SRA), which is maintained by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), is a rapidly growing public database that researchers use to drive scientific discovery across all domains of life. This increase in available data has great promise for pushing scientific discovery but also introduces new challenges that scientific communities need to address. As genomic datasets have grown in scale and diversity, a parade of new methods and associated software have been developed to address the challenges posed by this growth. These methodological advances are vital for maximally leveraging the power of next-generation sequencing (NGS) technologies. With the goal of laying a foundation for evaluation of methods for petabyte-scale sequence search, the Department of Energy (DOE) Office of Biological and Environmental Research (BER), the NIH Office of Data Science Strategy (ODSS), and NCBI held a virtual codeathon, 'Petabyte Scale Sequence Search: Metagenomics Benchmarking Codeathon', from September 27 to October 1, 2021, to evaluate emerging solutions in petabyte-scale sequence search. The codeathon attracted experts from national laboratories, research institutions, and universities across the world to (a) develop benchmarking approaches to address challenges in conducting large-scale analyses of metagenomic data (which comprise approximately 20% of SRA), (b) identify potential applications that benefit from SRA-wide searches and the tools required to execute the search, and (c) produce community resources, i.e., a public-facing repository with information to rebuild and reproduce the problems addressed by each team challenge.
Submitted 9 May, 2025;
originally announced May 2025.
-
GramSeq-DTA: A grammar-based drug-target affinity prediction approach fusing gene expression information
Authors:
Kusal Debnath,
Pratip Rana,
Preetam Ghosh
Abstract:
Drug-target affinity (DTA) prediction is a critical aspect of drug discovery. The meaningful representation of drugs and targets is crucial for accurate prediction. Using 1D string-based representations for drugs and targets is a common approach that has demonstrated good results in drug-target affinity prediction. However, this approach lacks information on the relative position of the atoms and bonds. To address this limitation, graph-based representations have been used to some extent. However, solely considering the structural aspect of drugs and targets may be insufficient for accurate DTA prediction. Integrating the functional aspect of these drugs at the genetic level can enhance the prediction capability of the models. To fill this gap, we propose GramSeq-DTA, which integrates chemical perturbation information with the structural information of drugs and targets. We applied a Grammar Variational Autoencoder (GVAE) for drug feature extraction and utilized two different approaches for protein feature extraction: Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN). The chemical perturbation data is obtained from the L1000 project, which provides information on the upregulation and downregulation of genes caused by selected drugs. This chemical perturbation information is processed, and a compact dataset is prepared, serving as the functional feature set of the drugs. By integrating the drug, gene, and target features in the model, our approach outperforms the current state-of-the-art DTA prediction models when validated on widely used DTA datasets (BindingDB, Davis, and KIBA). This work provides a novel and practical approach to DTA prediction by merging the structural and functional aspects of biological entities, and it encourages further research in multi-modal DTA prediction.
Submitted 2 November, 2024;
originally announced November 2024.
-
A Review of Link Prediction Applications in Network Biology
Authors:
Ahmad F. Al Musawi,
Satyaki Roy,
Preetam Ghosh
Abstract:
In the domain of network biology, the interactions among heterogeneous genomic and molecular entities are represented through networks. Link prediction (LP) methodologies are instrumental in inferring missing or prospective associations within these biological networks. In this review, we systematically dissect the attributes of local, centrality, and embedding-based LP approaches, applied to static and dynamic biological networks. We undertake an examination of the current applications of LP metrics for predicting links between diseases, genes, proteins, RNA, microbiomes, drugs, and neurons. We carry out comprehensive performance evaluations on established biological network datasets to show the practical applications of standard LP models. Moreover, we compare the similarity in prediction trends among the models and the specific network attributes that contribute to effective link prediction, before underscoring the role of LP in addressing the formidable challenges prevalent in biological systems, ranging from noise, bias, and data sparseness to interpretability. We conclude the review with an exploration of the essential characteristics expected from future LP models, poised to advance our comprehension of the intricate interactions governing biological systems.
Submitted 2 December, 2023;
originally announced December 2023.
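As an illustration of the "local" link prediction approaches mentioned in the abstract above, the following is a minimal sketch of one widely used local score, the Jaccard similarity of two nodes' neighbor sets. The abstract does not list the specific metrics the review evaluates, so the choice of Jaccard, the function name `jaccard_scores`, and the toy network are assumptions made for illustration only.

```python
def jaccard_scores(adjacency):
    """Score every non-adjacent node pair by the Jaccard similarity
    of their neighbor sets (a local link prediction heuristic).

    `adjacency` maps each node to the set of its neighbors
    (undirected graph, so the mapping is assumed symmetric).
    """
    nodes = sorted(adjacency)
    scores = {}
    for idx, u in enumerate(nodes):
        for v in nodes[idx + 1:]:
            if v in adjacency[u]:
                continue  # already linked, nothing to predict
            union = adjacency[u] | adjacency[v]
            if not union:
                continue  # two isolated nodes carry no evidence
            shared = adjacency[u] & adjacency[v]
            scores[(u, v)] = len(shared) / len(union)
    return scores


# Hypothetical toy interaction network: a triangle A-B-C plus a pendant node D.
toy_network = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
candidate_links = jaccard_scores(toy_network)
```

Higher scores mark node pairs that share a larger fraction of their neighbors and are therefore ranked as more likely missing links.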
-
ResDTA: Predicting Drug-Target Binding Affinity Using Residual Skip Connections
Authors:
Partho Ghosh,
Md. Aynal Haque
Abstract:
The discovery of novel drug target (DT) interactions is an important step in the drug development process. The majority of computational techniques for predicting DT interactions have focused on binary classification, with the goal of determining whether or not a DT pair interacts. Protein ligand interactions, on the other hand, assume a continuous range of binding strength values, also known as binding affinity, and forecasting this value remains a challenge. As the amount of affinity data in DT knowledge-bases grows, advanced learning techniques such as deep learning architectures can be used to predict binding affinities. In this paper, we present a deep-learning-based methodology for predicting DT binding affinities using just sequence information from both targets and drugs. The results show that the proposed deep-learning-based model, which uses the 1D representations of targets and drugs, is an effective approach for drug target binding affinity prediction and does not require additional chemical domain knowledge. The best model constructs high-level representations of a drug and a target via CNNs that use residual skip connections, with an additional stream that creates a high-level combined representation of the drug-target pair. It achieved the best Concordance Index (CI) performance on one of the largest benchmark datasets, outperforming the recent state-of-the-art method AttentionDTA and many other machine-learning and deep-learning baseline methods for DT binding affinity prediction that use the 1D representations of targets and drugs.
Submitted 20 March, 2023;
originally announced March 2023.
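The Concordance Index (CI) named in the abstract above measures how often a model ranks a pair of samples in the same order as their true affinities. A minimal sketch follows, assuming the common pairwise definition in which prediction ties count as half-concordant and pairs with equal true values are skipped; the function name `concordance_index` is hypothetical and the code is not taken from the paper.

```python
from itertools import combinations


def concordance_index(y_true, y_pred):
    """Pairwise concordance index.

    Counts, over all pairs with different true values, how often the
    predictions preserve the true ordering. Prediction ties count 0.5.
    """
    concordant = 0.0
    comparable = 0
    for (t_i, p_i), (t_j, p_j) in combinations(zip(y_true, y_pred), 2):
        if t_i == t_j:
            continue  # equal true affinities give no ordering to check
        comparable += 1
        # Orient the pair so p_i belongs to the sample with the larger true value.
        if t_i < t_j:
            p_i, p_j = p_j, p_i
        if p_i > p_j:
            concordant += 1.0
        elif p_i == p_j:
            concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A perfectly ordered prediction yields 1.0, a fully reversed one yields 0.0, and a constant prediction yields 0.5. This quadratic-time loop is meant for clarity rather than for large benchmark sets.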
-
A Flexible Agent-Based Model to Study COVID-19 Outbreak -- A Generic Approach
Authors:
Anik Burman,
Sayak Chatterjee,
Pramit Ghosh,
Indranil Mukhokadhyay
Abstract:
Understanding the dynamics of an outbreak like that of COVID-19 is important in designing effective control measures. This study aims to develop an agent-based model that compares changes in infection progression by manipulating different parameters in a synthetic population. Model input includes population characteristics such as age, sex, and working status of each individual, and other factors influencing disease dynamics. Depending on the number of epicentres of infection, location of primary cases, sensitivity, proportion of asymptomatic cases, and frequency or duration of lockdown, our simulator tracks every individual and hence infection progression through the community over time.
In a closed community of 10000 people, it is seen that without any lockdown, the number of cases peaks around the 6th week and wanes around the 15th week. If the primary case is located inside a dense population cluster such as a slum, cases peak early and wane slowly. With the introduction of lockdown, cases peak at a slower rate. If the sensitivity of identifying infection decreases, cases and deaths increase. The number of cases declines with an increase in the proportion of asymptomatic cases. The model is robust and provides reproducible estimates with realistic parameter values. It also guides in identifying measures to control an outbreak in a community. It is flexible in accommodating different parameters such as infectivity period, yield of testing, socio-economic strata, daily travel, awareness level, population density, social distancing, and lockdown, and can be tailored to study other infections with a similar transmission pattern.
Submitted 23 June, 2021; v1 submitted 16 June, 2021;
originally announced June 2021.
-
Stability analysis of a signaling circuit with dual species of GTPase switches
Authors:
Lucas M. Stolerman,
Pradipta Ghosh,
Padmini Rangamani
Abstract:
GTPases are molecular switches that regulate a wide range of cellular processes, such as organelle biogenesis, position, shape, and signal transduction. These enzymes operate by toggling between an active ("ON") guanosine triphosphate (GTP)-bound state and an inactive ("OFF") guanosine diphosphate (GDP)-bound state; such a toggle is regulated by GEFs (guanine nucleotide exchange factors) and GAPs (GTPase activating proteins). Here we dissect a network motif between monomeric (m) and trimeric (t) GTPases assembled exclusively in eukaryotic cells of multicellular organisms. To this end, we develop a system of ordinary differential equations in which these two classes of GTPases are interlinked conditional to their ON/OFF states within a motif through feedforward and feedback loops. We provide formulas for the steady states of the system and perform local stability analysis to investigate the role of the different connections between the GTPase switches. A feedforward from the active mGTPase to the GEF of the tGTPase was sufficient to provide two locally stable states: one where both active/inactive forms of the mGTPase can be interpreted as having low concentrations and the other where both m- and tGTPase have high concentrations. When a feedback loop from the GEF of the tGTPase to the GAP of the mGTPase was added to the feedforward system, two other locally stable states emerged, both having the tGTPase inactivated and being interpreted as having low active tGTPase concentrations. Finally, the addition of a second feedback loop, from the active tGTPase to the GAP of the mGTPase, gives rise to a family of steady states parametrized by the inactive tGTPase concentrations. Our findings reveal that the coupling of these two different GTPase motifs can dramatically change their steady state behaviors and shed light on how such coupling may impact information processing in eukaryotic cells.
Submitted 17 September, 2020;
originally announced September 2020.
-
Critical community size for COVID-19 -- a model based approach to provide a rationale behind the lockdown
Authors:
Sarmistha Das,
Pramit Ghosh,
Bandana Sen,
Indranil Mukhopadhyay
Abstract:
Background: Restrictive mass quarantine or lockdown has been implemented as the most important controlling measure to fight against COVID-19. Many countries have enforced lockdowns of 2 to 4 weeks and are extending the period depending on their current disease scenario. Most probably the 14-day period of estimated communicability of COVID-19 prompted such decisions. But the idea that, if the susceptible population drops below a certain threshold, the infection would naturally die out in small communities after a fixed time (following the outbreak), unless the disease is reintroduced from outside, was proposed by Bartlett in 1957. This threshold was termed the Critical Community Size (CCS). Methods: We propose an SEIR model that explains COVID-19 disease dynamics. Using our model, we have calculated the country-specific expected time to extinction (TTE) and CCS that would essentially determine the ideal number of lockdown days required and the size of the quarantined population. Findings: With the given country-wise rates of death, recovery and other parameters, we have identified that, if at a place the size of the susceptible population drops below the CCS, infection will cease to exist after a period of TTE days, unless it is introduced from outside. But the disease will almost die out much sooner. We have calculated the country-specific estimate of the ideal number of lockdown days. Thus, a shorter lockdown phase is sufficient to contain COVID-19. On a cautionary note, our model indicates another rise in infection almost a year later but of a lesser magnitude.
Submitted 7 April, 2020;
originally announced April 2020.
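For readers unfamiliar with the SEIR framework referred to in the abstract above, the following is a minimal sketch of a generic deterministic SEIR model integrated with forward Euler steps. It is not the authors' model: the abstract does not give their equations or parameter values, and their TTE and CCS calculations are not reproduced here. The function name `simulate_seir`, the parameter names (`beta` for transmission rate, `sigma` for the rate of leaving the exposed state, `gamma` for the recovery rate) and all numeric values are assumptions chosen for illustration.

```python
def simulate_seir(s0, e0, i0, r0, beta, sigma, gamma, days, dt=0.1):
    """Generic deterministic SEIR model, forward Euler integration.

    Returns a list of (S, E, I, R) tuples, one per day, including day 0.
    The total population is closed (no births, deaths or migration).
    """
    n = s0 + e0 + i0 + r0
    s, e, i, r = float(s0), float(e0), float(i0), float(r0)
    history = [(s, e, i, r)]
    steps_per_day = int(round(1.0 / dt))
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / n * dt   # S -> E
            new_infectious = sigma * e * dt       # E -> I
            new_recovered = gamma * i * dt        # I -> R
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_recovered
            r += new_recovered
        history.append((s, e, i, r))
    return history


# Hypothetical parameters: 10000 people, one initial infectious case.
trajectory = simulate_seir(
    s0=9999, e0=0, i0=1, r0=0,
    beta=0.5, sigma=0.2, gamma=0.1,
    days=200,
)
peak_day = max(range(len(trajectory)), key=lambda d: trajectory[d][2])
```

A deterministic model like this never reaches exactly zero infections, which is one reason extinction-time questions such as TTE are usually studied with stochastic formulations.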
-
Mechanisms for bacterial gliding motility on soft substrates
Authors:
Joël Tchoufag,
Pushpita Ghosh,
Connor B. Pogue,
Beiyan Nan,
Kranthi K. Mandadapu
Abstract:
The motility mechanism of certain rod-shaped bacteria has long been a mystery, since no external appendages are involved in their motion which is known as gliding. However, the physical principles behind gliding motility still remain poorly understood. Using myxobacteria as a canonical example of such organisms, we identify here the physical principles behind gliding motility, and develop a theoretical model that predicts a two-regime behavior of the gliding speed as a function of the substrate stiffness. Our theory describes the elastic, viscous, and capillary interactions between the bacterial membrane carrying a traveling wave, the secreted slime acting as a lubricating film, and the substrate which we model as a soft solid. Defining the myxobacterial gliding as the horizontal motion on the substrate under zero net force, we find the two-regime behavior is due to two different mechanisms of motility thrust. On stiff substrates, the thrust arises from the bacterial shape deformations creating a flow of slime that exerts a pressure along the bacterial length. This pressure in conjunction with the bacterial shape provides the necessary thrust for propulsion. However, we show that such a mechanism cannot lead to gliding on very soft substrates. Instead, we show that capillary effects lead to the formation of a ridge at the slime-substrate-air interface, which creates a thrust in the form of a localized pressure gradient at the tip of the bacteria. To test our theory, we perform experiments with isolated cells on agar substrates of varying stiffness and find the measured gliding speeds to be in good agreement with the predictions from our elasto-capillary-hydrodynamic model. The physical mechanisms reported here serve as an important step towards an accurate theory of friction and substrate-mediated interaction between bacteria in a swarm of cells proliferating in soft media.
Submitted 20 July, 2018; v1 submitted 19 July, 2018;
originally announced July 2018.
-
Spreading of non-motile bacteria on a hard agar plate: Comparison between agent-based and stochastic simulations
Authors:
Navdeep Rana,
Pushpita Ghosh,
Prasad Perlekar
Abstract:
We study the spreading of a non-motile bacterial colony on a hard agar plate using agent-based and continuum models. We show that the spreading dynamics depends on the initial nutrient concentration, the motility and the inherent demographic noise. Population fluctuations are inherent in an agent-based model, whereas for the continuum model we model them using a stochastic Langevin equation. We show that the intrinsic population fluctuations coupled with non-linear diffusivity lead to a transition from Diffusion Limited Aggregation (DLA) type morphology to an Eden-like morphology on decreasing the initial nutrient concentration.
Submitted 21 October, 2017; v1 submitted 16 June, 2017;
originally announced June 2017.
-
Periodic force induced stabilization or destabilization of the denatured state of a protein
Authors:
Pulak Kumar Ghosh,
Mai Suan Li,
Bidhan Chandra Bag
Abstract:
We have studied the effects of an external sinusoidal force on protein folding kinetics. The externally applied force field acts on each amino acid residue of the polypeptide chains. Our simulation results show that the mean protein folding time first increases with driving frequency and then decreases, passing through a maximum. With further increase of the driving frequency, the mean folding time starts increasing as the noise-induced hopping event (from the denatured state to the native state) begins to experience many oscillations over the mean barrier crossing time period. Thus, unlike one-dimensional barrier crossing problems, the external oscillating force field induces either stabilization or destabilization of the denatured state of a protein. We have also studied the parametric dependence of the folding dynamics on temperature, viscosity, and the non-Markovian character of the bath in the presence of the external field.
Submitted 30 May, 2012;
originally announced May 2012.