Search | arXiv e-print repository

Packaging, containerization, and virtualization of computational omics methods: Advances, challenges, and opportunities

Authors: Mohammed Alser, Sharon Waymost, Ram Ayyala, Brendan Lawlor, Richard J. Abdill, Neha Rajkumar, Nathan LaPierre, Jaqueline Brito, Andre M. Ribeiro-dos-Santos, Can Firtina, Nour Almadhoun, Varuni Sarwal, Eleazar Eskin, Qiyang Hu, Derek Strong, Byoung-Do Kim, Malak S. Abedalthagafi, Onur Mutlu, Serghei Mangul

Abstract: Omics software tools have reshaped the landscape of modern biology and become an essential component of biomedical research. The increasing dependence of biomedical scientists on these powerful tools creates a need for easier installation and greater usability. Packaging, virtualization, and containerization are different approaches to satisfy this need by wrapping omics tools in additional software that makes the omics tools easier to install and use. Here, we systematically review practices across prominent packaging, virtualization, and containerization platforms. We outline the challenges, advantages, and limitations of each approach and some of the most widely used platforms from the perspectives of users, software developers, and system administrators. We also propose principles to make packaging, virtualization, and containerization of omics software more sustainable and robust to increase the reproducibility of biomedical and life science research.

Submitted 30 March, 2022; originally announced March 2022.

arXiv:2010.02391 [pdf]

RNA-seq data science: From raw data to effective interpretation

Authors: Dhrithi Deshpande, Karishma Chhugani, Yutong Chang, Aaron Karlsberg, Caitlin Loeffler, Jinyang Zhang, Agata Muszynska, Jeremy Rotman, Laura Tao, Brunilda Balliu, Elizabeth Tseng, Eleazar Eskin, Fangqing Zhao, Pejman Mohammadi, Pawel P Labaj, Serghei Mangul

Abstract: RNA-sequencing (RNA-seq) has become an exemplar technology in modern biology and clinical applications over the past decade. It has gained immense popularity in the recent years driven by continuous efforts of the bioinformatics community to develop accurate and scalable computational tools. RNA-seq is a method of analyzing the RNA content of a sample using the modern sequencing platforms. It generates enormous amounts of transcriptomic data in the form of nucleotide sequences, known as reads. RNA-seq analysis enables the probing of genes and corresponding transcripts which is essential for answering important biological questions, such as detecting novel exons, transcripts, gene expressions, and studying alternative splicing structure. However, obtaining meaningful biological signals from raw data using computational methods is challenging due to the limitations of modern sequencing technologies. The need to leverage these technological challenges have pushed the rapid development of many novel computational tools which have evolved and diversified in accordance with technological advancements, leading to the current myriad population of RNA-seq tools. Our review provides a systemic overview of RNA-seq technology and 235 available RNA-seq tools across various domains published from 2008 to 2020, discussing the interdisciplinary nature of bioinformatics involved in RNA sequencing, analysis, and software development.

Submitted 16 February, 2021; v1 submitted 5 October, 2020; originally announced October 2020.

arXiv:1911.11304 [pdf]

Metagenomics for clinical diagnostics: technologies and informatics

Authors: Caitlin Loeffler, Keylie M. Gibson, Lana Martin, Liz Chang, Jeremy Rotman, Ian V. Toma, Christopher E. Mason, Eleazar Eskin, Joseph P. Zackular, Keith A. Crandall, David Koslicki, Serghei Mangul

Abstract: The human-associated microbiome is closely tied to human health and is of substantial clinical interest. Metagenomics-based tools are emerging for clinical diagnostics, tracking the spread of diseases, and surveillance of potential pathogens. In some cases, these tools are overcoming limitations of traditional clinical approaches. Metagenomics has limitations barring the tools from clinical validation. Once these hurdles are overcome, clinical metagenomics will inform doctors of the best, targeted treatment for their patients and provide early detection of disease. Here we present an overview of metagenomics methods with a discussion of computational challenges and limitations.

Submitted 7 August, 2020; v1 submitted 25 November, 2019; originally announced November 2019.

Comments: 75 pages, 7 figures, 2 tables, 4 supplementary tables, review paper

arXiv:1408.5530 [pdf, other]

IPED2: Inheritance Path based Pedigree Reconstruction Algorithm for Complicated Pedigrees

Authors: Dan He, Zhanyong Wang, Laxmi Parida, Eleazar Eskin

Abstract: Reconstruction of family trees, or pedigree reconstruction, for a group of individuals is a fundamental problem in genetics. The problem is known to be NP-hard even for datasets known to only contain siblings. Some recent methods have been developed to accurately and efficiently reconstruct pedigrees. These methods, however, still consider relatively simple pedigrees, for example, they are not able to handle half-sibling situations where a pair of individuals only share one parent. In this work, we propose an efficient method, IPED2, based on our previous work, which specifically targets reconstruction of complicated pedigrees that include half-siblings. We note that the presence of half-siblings makes the reconstruction problem significantly more challenging which is why previous methods exclude the possibility of half-siblings. We proposed a novel model as well as an efficient graph algorithm and experiments show that our algorithm achieves relatively accurate reconstruction. To our knowledge, this is the first method that is able to handle pedigree reconstruction based on genotype data only when half-sibling exists in any generation of the pedigree.

Submitted 23 August, 2014; originally announced August 2014.

Comments: 9 pages

arXiv:1304.8045 [pdf, other]

A general framework for meta-analyzing dependent studies with overlapping subjects in association mapping

Authors: Buhm Han, Jae Hoon Sul, Eleazar Eskin, Paul I. W. de Bakker, Soumya Raychaudhuri

Abstract: Meta-analysis of genome-wide association studies is increasingly popular and many meta-analytic methods have been recently proposed. A majority of meta-analytic methods combine information from multiple studies by assuming that studies are independent since individuals collected in one study are unlikely to be collected again by another study. However, it has become increasingly common to utilize the same control individuals among multiple studies to reduce genotyping or sequencing cost. This causes those studies that share the same individuals to be dependent, and spurious associations may arise if overlapping subjects are not taken into account in a meta-analysis. In this paper, we propose a general framework for meta-analyzing dependent studies with overlapping subjects. Given dependent studies, our approach "decouples" the studies into independent studies such that meta-analysis methods assuming independent studies can be applied. This enables many meta-analysis methods, such as the random effects model, to account for overlapping subjects. Another advantage is that one can continue to use preferred software in the analysis pipeline which may not support overlapping subjects. Using simulations and the Wellcome Trust Case Control Consortium data, we show that our decoupling approach allows both the fixed and the random effects models to account for overlapping subjects while retaining desirable false positive rate and power.

Submitted 17 January, 2014; v1 submitted 30 April, 2013; originally announced April 2013.

Comments: 1/17/14: Minor text changes

Showing 1–5 of 5 results for author: Eskin, E