-
Contributions of the Petabyte Scale Sequence Search Codeathon toward efforts to scale sequence-based searches on SRA
Authors:
Priyanka Ghosh,
Kjiersten Fagnan,
Ryan Connor,
Ravinder Pannu,
Travis J. Wheeler,
Mihai Pop,
C. Titus Brown,
Tessa Pierce-Ward,
Rob Patro,
Jacquelyn S. Michaelis,
Thomas L. Madden,
Christiam Camacho,
Olaitan I. Awe,
Arianna I. Krinos,
René KM Xavier,
Rodrigo Ortega Polo,
Jack W. Roddy,
Adelaide Rhodes,
Alexander Sweeten,
Adrian Viehweger,
Bariş Ekim,
Harihara Subrahmaniam Muralidharan,
Amatur Rahman,
Vinícius W. Salazar,
Andrew Tritt, et al. (13 additional authors not shown)
Abstract:
The volume of biological data being generated by the scientific community is growing exponentially, reflecting technological advances and research activities. The National Institutes of Health's (NIH) Sequence Read Archive (SRA), which is maintained by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), is a rapidly growing public database that researchers use to drive scientific discovery across all domains of life. This increase in available data has great promise for pushing scientific discovery but also introduces new challenges that scientific communities need to address. As genomic datasets have grown in scale and diversity, a parade of new methods and associated software have been developed to address the challenges posed by this growth. These methodological advances are vital for maximally leveraging the power of next-generation sequencing (NGS) technologies. With the goal of laying a foundation for evaluation of methods for petabyte-scale sequence search, the Department of Energy (DOE) Office of Biological and Environmental Research (BER), the NIH Office of Data Science Strategy (ODSS), and NCBI held a virtual codeathon, 'Petabyte Scale Sequence Search: Metagenomics Benchmarking Codeathon', from September 27 to October 1, 2021, to evaluate emerging solutions in petabyte-scale sequence search. The codeathon attracted experts from national laboratories, research institutions, and universities across the world to (a) develop benchmarking approaches to address challenges in conducting large-scale analyses of metagenomic data (which comprises approximately 20% of SRA), (b) identify potential applications that benefit from SRA-wide searches and the tools required to execute the search, and (c) produce community resources, i.e., a public-facing repository with information to rebuild and reproduce the problems addressed by each team challenge.
Submitted 9 May, 2025;
originally announced May 2025.
-
Describing the Persistence Landscape for Introducing Microbes into Complex Communities
Authors:
Jason E. McDermott,
William C. Nelson,
Amy E. Zimmerman,
Winston Anthony,
Devin Coleman-Derr,
Joshua Elmore,
Tara Nitka,
Ryan S. McClure,
Pubudu P. Handakumbura,
Adam Guss,
Travis J. Wheeler,
Robert G. Egbert
Abstract:
The introduction of non-native organisms into complex microbiome communities holds enormous potential to benefit society. However, microbiome engineering faces several challenges including successful establishment of the organism into the community, its persistence in the microbiome to serve a specified purpose, and constraint of the organism and its activity to the intended environment. A theoretical framework is needed to represent the complex interactions that drive these dynamics. Building on the concept of the community functional landscape, we define the persistence landscape as the metabolic, genetic, and broader functional composition and ecological context of the target microbiome that can be used to predict the environmental fitness of an introduced organism. Here, we discuss critical aspects of persistence landscapes that impact interactions between an introduced organism and the target microbiome, including the community's genetic and metabolic complementation potential, cellular defense strategies, spatial and temporal dynamics, and the introduced organism's ability to compete for resources to survive. Finally, we highlight important knowledge gaps in the fields of microbial ecology and microbiome engineering that limit characterization and engineering of persistence landscapes. As a model for understanding microbiome structure and interaction in the context of microbiome engineering, the persistence landscape model should enable development of novel containment approaches while improving controlled colonization of a complex microbiome community to address pressing challenges in human health, agronomy, and biomanufacturing.
Submitted 28 March, 2025;
originally announced March 2025.
-
The need to implement FAIR principles in biomolecular simulations
Authors:
Rommie Amaro,
Johan Åqvist,
Ivet Bahar,
Federica Battistini,
Adam Bellaiche,
Daniel Beltran,
Philip C. Biggin,
Massimiliano Bonomi,
Gregory R. Bowman,
Richard Bryce,
Giovanni Bussi,
Paolo Carloni,
David Case,
Andrea Cavalli,
Chie-En A. Chang,
Thomas E. Cheatham III,
Margaret S. Cheung,
Cris Chipot,
Lillian T. Chong,
Preeti Choudhary,
Gerardo Andres Cisneros,
Cecilia Clementi,
Rosana Collepardo-Guevara,
Peter Coveney,
Roberto Covino, et al. (103 additional authors not shown)
Abstract:
This letter illustrates the opinion of the molecular dynamics (MD) community on the need to adopt a new FAIR paradigm for the use of molecular simulations. It highlights the necessity of a collaborative effort to create, establish, and sustain a database that allows findability, accessibility, interoperability, and reusability of molecular dynamics simulation data. Such a development would democratize the field and significantly improve the impact of MD simulations on life science research. This will transform our working paradigm, pushing the field to a new frontier. We invite you to support our initiative at the MDDB community (https://mddbr.eu/community/). Now published as: Amaro, R.E., et al. The need to implement FAIR principles in biomolecular simulations. Nat Methods (2025). https://doi.org/10.1038/s41592-025-02635-0
Submitted 3 April, 2025; v1 submitted 23 July, 2024;
originally announced July 2024.
-
Exploring the Impact of a Transformer's Latent Space Geometry on Downstream Task Performance
Authors:
Anna C. Marbut,
John W. Chandler,
Travis J. Wheeler
Abstract:
It is generally thought that transformer-based large language models benefit from pre-training by learning generic linguistic knowledge that can be focused on a specific task during fine-tuning. However, we propose that much of the benefit from pre-training may be captured by geometric characteristics of the latent space representations, divorced from any specific linguistic knowledge. In this work we explore the relationship between GLUE benchmarking task performance and a variety of measures applied to the latent space resulting from BERT-type contextual language models. We find that there is a strong linear relationship between a measure of quantized cell density and average GLUE performance and that these measures may be predictive of otherwise surprising GLUE performance for several non-standard BERT-type models from the literature. These results may be suggestive of a strategy for decreasing pre-training requirements, wherein model initialization can be informed by the geometric characteristics of the model's latent space.
Submitted 17 June, 2024;
originally announced June 2024.
-
Reliable Measures of Spread in High Dimensional Latent Spaces
Authors:
Anna C. Marbut,
Katy McKinney-Bock,
Travis J. Wheeler
Abstract:
Understanding geometric properties of natural language processing models' latent spaces allows the manipulation of these properties for improved performance on downstream tasks. One such property is the amount of data spread in a model's latent space, or how fully the available latent space is being used. In this work, we define data spread and demonstrate that the commonly used measures of data spread, Average Cosine Similarity and a partition function min/max ratio I(V), do not provide reliable metrics to compare the use of latent space across models. We propose and examine eight alternative measures of data spread, all but one of which improve over these current metrics when applied to seven synthetic data distributions. Of our proposed measures, we recommend one principal component-based measure and one entropy-based measure that provide reliable, relative measures of spread and can be used to compare models of different sizes and dimensionalities.
Submitted 31 July, 2023; v1 submitted 15 December, 2022;
originally announced December 2022.
-
SODA: a TypeScript/JavaScript Library for Visualizing Biological Sequence Annotation
Authors:
Jack W. Roddy,
George T. Lesica,
Travis J. Wheeler
Abstract:
We present SODA, a lightweight and open-source visualization library for biological sequence annotations that enables straightforward development of flexible, dynamic, and interactive web graphics. SODA is implemented in TypeScript and can be used as a library within TypeScript and JavaScript.
Submitted 12 May, 2022;
originally announced May 2022.