-
SeqMate: A Novel Large Language Model Pipeline for Automating RNA Sequencing
Authors:
Devam Mondal,
Atharva Inamdar
Abstract:
RNA sequencing techniques, like bulk RNA-seq and single-cell (sc) RNA-seq, are critical tools for the biologist looking to analyze the genetic activity/transcriptome of a tissue or cell during an experimental procedure. Platforms like Illumina's next-generation sequencing (NGS) are used to produce the raw data for this experimental procedure. This raw FASTQ data must then be prepared by bioinformaticians via a complex series of data manipulations. This process currently takes place in an unwieldy textual user interface, such as a terminal/command line, that requires the user to install and import multiple program packages, preventing the untrained biologist from initiating data analysis. Open-source platforms like Galaxy have produced a more user-friendly pipeline, yet the visual interface remains cluttered and highly technical, and therefore uninviting for the natural scientist. To address this, SeqMate is a user-friendly tool that allows for one-click analytics by utilizing the power of a large language model (LLM) to automate both data preparation and analysis (differential expression, trajectory analysis, etc.). Furthermore, by utilizing the power of generative AI, SeqMate is also capable of analyzing such findings and producing written reports of upregulated/downregulated/user-prompted genes with sources cited from known repositories like PubMed, PDB, and UniProt.
Submitted 2 July, 2024;
originally announced July 2024.
-
Did the lockdown curb the spread of COVID-19 infection rate in India: A data-driven analysis
Authors:
Dipankar Mondal,
Siddhartha P. Chakrabarty
Abstract:
In order to analyze the effectiveness of the three successive nationwide lockdowns enforced in India, we present a data-driven analysis of four key parameters: reducing the transmission rate, restraining the growth rate, flattening the epidemic curve, and improving the health care system. These were quantified using four different metrics, namely reproduction rate, growth rate, doubling time, and death-to-recovery ratio. The incidence data of the COVID-19 outbreak in India (during the period of 2nd March 2020 to 31st May 2020) was analyzed for the best fit to the epidemic curve, making use of exponential growth, maximum likelihood estimation, the sequential Bayesian method, and estimation of time-dependent reproduction. The best fit (based on the data considered) was for the time-dependent approach. Accordingly, this approach was used to assess the impact on the effective reproduction rate. The period from pre-lockdown to the end of lockdown 3 saw a $45\%$ reduction in the effective reproduction rate. During the same period, the growth rate fell from $393\%$ during the pre-lockdown to $33\%$ after lockdown 3, accompanied by the average doubling time increasing from $4$-$6$ days to $12$-$14$ days. Finally, the death-to-recovery ratio dropped from $0.28$ (pre-lockdown) to $0.08$ after lockdown 3. In conclusion, all four metrics considered to assess the effectiveness of the lockdown exhibited significant favourable changes from the pre-lockdown period to the end of lockdown 3. Analysis of the data in the post-lockdown period with these metrics will provide greater clarity with regard to the extent of the success of the lockdown.
Submitted 22 June, 2020;
originally announced June 2020.
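The abstract above reports doubling times but does not say how they were computed. As a rough illustration only, the sketch below uses the textbook relation for pure exponential growth, where a daily rate r gives a doubling time of ln(2)/r. The function names and the case counts in the usage lines are hypothetical and are not taken from the paper.

```python
import math


def exponential_growth_rate(count_start, count_end, days):
    """Daily exponential growth rate r, assuming count(t) = count_start * exp(r * t).

    Assumption: growth between the two observations is purely exponential.
    """
    if count_start <= 0 or count_end <= 0 or days <= 0:
        raise ValueError("counts and days must be positive")
    return math.log(count_end / count_start) / days


def doubling_time(rate_per_day):
    """Days needed for the count to double at a constant exponential rate."""
    if rate_per_day <= 0:
        raise ValueError("doubling time is only defined for a positive growth rate")
    return math.log(2) / rate_per_day


# Hypothetical counts: cases rise from 1000 to 2000 over 5 days.
rate = exponential_growth_rate(1000, 2000, 5)
days_to_double = doubling_time(rate)
```

A longer doubling time corresponds to a smaller r, which is the direction of change the abstract describes across the lockdown phases.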
-
The C-SHIFT algorithm for normalizing covariances
Authors:
Evgenia Chunikhina,
Paul Logan,
Yevgeniy Kovchegov,
Anatoly Yambartsev,
Debashis Mondal,
Andrey Morgun
Abstract:
Omics technologies are powerful tools for analyzing patterns in gene expression data for thousands of genes. Due to a number of systematic variations in experiments, the raw gene expression data is often obscured by undesirable technical noise. Various normalization techniques were designed in an attempt to remove these non-biological errors prior to any statistical analysis. One of the reasons for normalizing data is the need to recover the covariance matrix used in gene network analysis. In this paper, we introduce a novel normalization technique, called the covariance shift (C-SHIFT) method. This normalization algorithm uses optimization techniques together with the blessing-of-dimensionality philosophy and an energy minimization hypothesis for covariance matrix recovery under additive noise (in biology, known as the bias). Thus, it is perfectly suited for the analysis of logarithmic gene expression data. Numerical experiments on synthetic data demonstrate the method's advantage over classical normalization techniques. Namely, the comparison is made with Rank, Quantile, cyclic LOESS (locally estimated scatterplot smoothing), and MAD (median absolute deviation) normalization methods. We also evaluate the performance of the C-SHIFT algorithm on real biological data.
Submitted 5 August, 2021; v1 submitted 28 March, 2020;
originally announced March 2020.
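For context on the classical baselines named in the abstract above, here is a minimal sketch of quantile normalization, one of the comparison methods. It is not the C-SHIFT algorithm, whose details are not given in this listing. The function name, the column-per-sample layout, and the example values are assumptions made for illustration, and ties are broken by position rather than averaged.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of samples (one list of values per sample).

    Each value is replaced by the mean, across samples, of the values that
    share its within-sample rank, so all samples end up with the same
    empirical distribution. Ties are broken by position (an assumption).
    """
    if not samples:
        raise ValueError("at least one sample is required")
    n = len(samples[0])
    if any(len(sample) != n for sample in samples):
        raise ValueError("all samples must have the same number of values")

    # Reference distribution: mean of the k-th smallest value over all samples.
    sorted_samples = [sorted(sample) for sample in samples]
    reference = [
        sum(s[k] for s in sorted_samples) / len(samples) for k in range(n)
    ]

    normalized = []
    for sample in samples:
        # Indices of this sample's values, ordered from smallest to largest.
        order = sorted(range(n), key=lambda i: sample[i])
        out = [0.0] * n
        for rank, index in enumerate(order):
            out[index] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical expression values for four genes measured in three samples.
result = quantile_normalize([[5, 2, 3, 4], [4, 1, 4, 2], [3, 4, 6, 8]])
```

After normalization every sample contains the same set of values, only assigned to different genes according to each gene's original rank within that sample.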