-
Upcycling Human Excrement: The Gut Microbiome to Soil Microbiome Axis
Authors:
Jeff Meilander,
Chloe Herman,
Andrew Manley,
Georgia Augustine,
Dawn Birdsell,
Evan Bolyen,
Kimberly R. Celona,
Hayden Coffey,
Jill Cocking,
Teddy Donoghue,
Alexis Draves,
Daryn Erickson,
Marissa Foley,
Liz Gehret,
Johannah Hagen,
Crystal Hepp,
Parker Ingram,
David John,
Katarina Kadar,
Paul Keim,
Victoria Lloyd,
Christina Osterink,
Victoria Queeney,
Diego Ramirez,
Antonio Romero, et al. (12 additional authors not shown)
Abstract:
Human excrement composting (HEC) is a sustainable strategy for human excrement (HE) management that recycles nutrients and mitigates health risks while reducing reliance on freshwater, fossil fuels, and fertilizers. We present a comprehensive microbial time series analysis of HEC and show that the initial gut-like microbiome of HEC systems transitions to a microbiome similar to soil and traditional compost in fifteen biological replicates tracked weekly for one year.
Submitted 5 November, 2024;
originally announced November 2024.
-
Facilitating Bioinformatics Reproducibility
Authors:
Christopher R. Keefe,
Matthew R. Dillon,
Chloe Herman,
Mary Jewell,
Colin V. Wood,
Evan Bolyen,
J. Gregory Caporaso
Abstract:
Study reproducibility is essential to corroborate, build on, and learn from the results of scientific research but is notoriously challenging in bioinformatics, which often involves large data sets and complex analytic workflows that use many different tools. Additionally, many biologists aren't trained in how to effectively record their bioinformatics analysis steps to ensure reproducibility, so critical information is often missing. Software tools used in bioinformatics can automate provenance tracking of the results they generate, removing most barriers to bioinformatics reproducibility. Here we present an implementation of that idea, Provenance Replay, a tool for generating new executable code from results generated with the QIIME 2 bioinformatics platform, and discuss considerations for bioinformatics developers who wish to implement similar functionality in their software.
Submitted 18 May, 2023;
originally announced May 2023.
-
A Double Machine Learning Trend Model for Citizen Science Data
Authors:
Daniel Fink,
Alison Johnston,
Matt Strimas-Mackey,
Tom Auer,
Wesley M. Hochachka,
Shawn Ligocki,
Lauren Oldham Jaromczyk,
Orin Robinson,
Chris Wood,
Steve Kelling,
Amanda D. Rodewald
Abstract:
1. Citizen and community-science (CS) datasets have great potential for estimating interannual patterns of population change given the large volumes of data collected globally every year. Yet, the flexible protocols that enable many CS projects to collect large volumes of data typically lack the structure necessary to keep consistent sampling across years. This leads to interannual confounding, as changes to the observation process over time are confounded with changes in species population sizes.
2. Here we describe a novel modeling approach designed to estimate species population trends while controlling for the interannual confounding common in citizen science data. The approach is based on Double Machine Learning, a statistical framework that uses machine learning methods to estimate population change and the propensity scores used to adjust for confounding discovered in the data. Additionally, we develop a simulation method to identify and adjust for residual confounding missed by the propensity scores. Using this new method, we can produce spatially detailed trend estimates from citizen science data.
3. To illustrate the approach, we estimated species trends using data from the CS project eBird. We used a simulation study to assess the ability of the method to estimate spatially varying trends in the face of real-world confounding. Results showed that the trend estimates distinguished between spatially constant and spatially varying trends at a 27 km resolution. There were low error rates on the estimated direction of population change (increasing/decreasing) and high correlations on the estimated magnitude.
4. The ability to estimate spatially explicit trends while accounting for confounding in citizen science data has the potential to fill important information gaps, helping to estimate population trends for species, regions, or seasons without rigorous monitoring data.
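As a rough illustration of the Double Machine Learning idea named in point 2, the sketch below shows the generic cross-fitted "partialling-out" estimator for a partially linear model. It is not the authors' trend model: the function names `fit_line` and `dml_effect`, the single confounder `x`, and the use of ordinary least squares as a stand-in for the machine-learning nuisance models are all assumptions made for the example.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns a prediction function.

    Stand-in for an arbitrary machine-learning regressor (assumption).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def dml_effect(x, t, y, n_folds=5, seed=0):
    """Cross-fitted partialling-out estimate of the effect of t on y.

    x: confounder values, t: variable of interest, y: outcome.
    Nuisance models E[y|x] and E[t|x] are fitted on the other folds,
    then residuals on the held-out fold are regressed on each other.
    """
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]

    numerator = 0.0
    denominator = 0.0
    for k, held_out in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        predict_y = fit_line([x[i] for i in train], [y[i] for i in train])
        predict_t = fit_line([x[i] for i in train], [t[i] for i in train])
        for i in held_out:
            res_y = y[i] - predict_y(x[i])  # outcome with confounder removed
            res_t = t[i] - predict_t(x[i])  # "propensity" residual
            numerator += res_t * res_y
            denominator += res_t * res_t
    return numerator / denominator


if __name__ == "__main__":
    # Simulated data with a known effect of 2.0 and confounding through x.
    rng = random.Random(1)
    x = [rng.gauss(0, 1) for _ in range(2000)]
    t = [0.8 * xi + rng.gauss(0, 1) for xi in x]
    y = [2.0 * ti + 1.5 * xi + rng.gauss(0, 1) for ti, xi in zip(t, x)]
    print("estimated effect:", dml_effect(x, t, y))
```

Fitting the nuisance models on held-out folds keeps their errors from leaking into the final residual regression, which is the main design point of the framework.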
Submitted 10 May, 2023; v1 submitted 27 October, 2022;
originally announced October 2022.
-
PDBench: Evaluating Computational Methods for Protein Sequence Design
Authors:
Leonardo V. Castorina,
Rokas Petrenas,
Kartic Subr,
Christopher W. Wood
Abstract:
Proteins perform critical processes in all living systems: converting solar energy into chemical energy, replicating DNA, forming the basis of high-performance materials, sensing, and much more. While an incredible range of functionality has been sampled in nature, it accounts for a tiny fraction of the possible protein universe. If we could tap into this pool of unexplored protein structures, we could search for novel proteins with useful properties that we could apply to tackle the environmental and medical challenges facing humanity. This is the purpose of protein design.
Sequence design is an important aspect of protein design, and many successful methods for it have been developed. Recently, deep-learning methods that frame it as a classification problem have emerged as a powerful approach. Beyond their reported improvement in performance, their primary advantage over physics-based methods is that the computational burden is shifted from the user to the developers, thereby increasing accessibility to the design method. Despite this trend, the tools for assessment and comparison of such models remain quite generic. The goal of this paper is both to address the timely problem of evaluation and to shine a spotlight, within the Machine Learning community, on specific assessment criteria that will accelerate impact.
We present a carefully curated benchmark set of proteins and propose a number of standard tests to assess the performance of deep learning based methods. Our robust benchmark provides biological insight into the behaviour of design methods, which is essential for evaluating their performance and utility. We compare five existing models with two novel models for sequence prediction. Finally, we test the designs produced by these models with AlphaFold2, a state-of-the-art structure-prediction algorithm, to determine if they are likely to fold into the intended 3D shapes.
Submitted 28 September, 2021; v1 submitted 16 September, 2021;
originally announced September 2021.
-
A generative adversarial approach to facilitate archival-quality histopathologic diagnoses from frozen tissue sections
Authors:
Kianoush Falahkheirkhah,
Tao Guo,
Michael Hwang,
Pheroze Tamboli,
Christopher G Wood,
Jose A Karam,
Kanishka Sircar,
Rohit Bhargava
Abstract:
In clinical diagnostics and research involving histopathology, formalin-fixed paraffin-embedded (FFPE) tissue is almost universally favored for its superb image quality. However, tissue processing time (more than 24 hours) can slow decision-making. In contrast, fresh frozen (FF) processing (less than 1 hour) can yield rapid information, but diagnostic accuracy is suboptimal due to lack of clearing, morphologic deformation and more frequent artifacts. Here, we bridge this gap using artificial intelligence. We synthesize FFPE-like images (virtual FFPE) from FF images using a generative adversarial network (GAN) from 98 paired kidney samples derived from 40 patients. Five board-certified pathologists evaluated the results in a blinded test. Image quality of the virtual FFPE data was assessed to be high and showed a close resemblance to real FFPE images. Clinical assessments of disease on the virtual FFPE images showed a higher inter-observer agreement compared to FF images. The nearly instantaneously generated virtual FFPE images can not only reduce time to information but also facilitate more precise diagnosis from routine FF images without extraneous costs and effort.
Submitted 24 August, 2021;
originally announced August 2021.
-
A Global View of Standards for Open Image Data Formats and Repositories
Authors:
Jason R. Swedlow,
Pasi Kankaanpää,
Ugis Sarkans,
Wojtek Goscinski,
Graham Galloway,
Ryan P. Sullivan,
Claire M. Brown,
Chris Wood,
Antje Keppler,
Ben Loos,
Sara Zullino,
Dario Livio Longo,
Silvio Aime,
Shuichi Onami
Abstract:
Biological and biomedical imaging datasets record the constitution, architecture and dynamics of living organisms across several orders of magnitude of space and time. Imaging technologies are now used throughout the life and biomedical sciences to achieve discovery and understanding of biological mechanisms in the basic sciences as well as assessment, diagnosis and therapeutic intervention in clinical trials and animal and human medicine. The universal application and use of imaging raises an important question and opportunity: what is the value and ultimate destination of biological and medical imaging data? In the last few years, several informatics and data science technologies have matured sufficiently so that routine publication of these datasets is now possible. Participants in Global BioImaging from 15 countries and all populated continents have agreed on the need for recommendations and guidelines for the establishment of image data repositories and the formats they use for delivering data to the global scientific community. This white paper presents a shared, global view of criteria for these common, globally applicable guidelines and provisional proposals for open tools and resources that are available now and can provide a foundation for future development.
Submitted 20 October, 2020;
originally announced October 2020.
-
Stability of spontaneous, correlated activity in mouse auditory cortex
Authors:
Richard F. Betzel,
Katherine C. Wood,
Christopher Angeloni,
Maria Neimark Geffen,
Danielle S. Bassett
Abstract:
Neural systems can be modeled as networks of functionally connected neural elements. The resulting network can be analyzed using mathematical tools from network science and graph theory to quantify the system's topological organization and to better understand its function. While the network-based approach is common in the analysis of large-scale neural systems probed by non-invasive neuroimaging, few studies have used network science to study the organization of networks reconstructed at the cellular level, and thus many very basic and fundamental questions remain unanswered. Here, we used two-photon calcium imaging to record spontaneous activity from the same set of cells in mouse auditory cortex over the course of several weeks. We reconstruct functional networks in which cells are linked to one another by edges weighted according to the correlation of their fluorescence traces. We show that the networks exhibit modular structure across multiple topological scales and that these multi-scale modules unfold as part of a hierarchy. We also show that, on average, network architecture becomes increasingly dissimilar over time, with similarity decaying monotonically with the distance (in time) between sessions. Finally, we show that a small fraction of cells maintain strongly-correlated activity over multiple days, forming a stable temporal core surrounded by a fluctuating and variable periphery. Our work provides a careful methodological blueprint for future studies of spontaneous activity measured by two-photon calcium imaging using cutting-edge computational methods and machine learning algorithms informed by explicit graphical models from network science. The methods are easily extended to additional datasets, opening the possibility of studying cellular level network organization of neural systems and how that organization is modulated by stimuli or altered in models of disease.
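The network construction described above, in which cells are linked by edges weighted by the correlation of their fluorescence traces, can be sketched as follows. This is a minimal illustration that assumes Pearson correlation and a dense weight matrix; the function names `pearson` and `functional_network` and the toy traces are hypothetical and are not taken from the paper.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length traces (0.0 if either is constant)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def functional_network(traces):
    """Symmetric weight matrix: entry [i][j] is the correlation of cells i and j.

    traces: one list of fluorescence samples per cell.
    The diagonal is left at 0.0 so that there are no self-edges.
    """
    n_cells = len(traces)
    weights = [[0.0] * n_cells for _ in range(n_cells)]
    for i in range(n_cells):
        for j in range(i + 1, n_cells):
            w = pearson(traces[i], traces[j])
            weights[i][j] = w
            weights[j][i] = w
    return weights


if __name__ == "__main__":
    # Three toy "cells": the second tracks the first, the third opposes it.
    toy_traces = [
        [1.0, 2.0, 3.0, 4.0],
        [2.0, 4.0, 6.0, 8.0],
        [4.0, 3.0, 2.0, 1.0],
    ]
    for row in functional_network(toy_traces):
        print(row)
```

Building one such matrix per imaging session would give the sequence of networks whose similarity over time the abstract discusses.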
Submitted 10 December, 2018;
originally announced December 2018.
-
Bayesian Approximate Kernel Regression with Variable Selection
Authors:
Lorin Crawford,
Kris C. Wood,
Xiang Zhou,
Sayan Mukherjee
Abstract:
Nonlinear kernel regression models are often used in statistics and machine learning because they are more accurate than linear models. Variable selection for kernel regression models is a challenge partly because, unlike the linear regression setting, there is no clear concept of an effect size for regression coefficients. In this paper, we propose a novel framework that provides an effect size analog of each explanatory variable for Bayesian kernel regression models when the kernel is shift-invariant (for example, the Gaussian kernel). We use function analytic properties of shift-invariant reproducing kernel Hilbert spaces (RKHS) to define a linear vector space that: (i) captures nonlinear structure, and (ii) can be projected onto the original explanatory variables. The projection onto the original explanatory variables serves as an analog of effect sizes. The specific function analytic property we use is that shift-invariant kernel functions can be approximated via random Fourier bases. Based on the random Fourier expansion we propose a computationally efficient class of Bayesian approximate kernel regression (BAKR) models for both nonlinear regression and binary classification for which one can compute an analog of effect sizes. We illustrate the utility of BAKR by examining two important problems in statistical genetics: genomic selection (i.e. phenotypic prediction) and association mapping (i.e. inference of significant variants or loci). State-of-the-art methods for genomic selection and association mapping are based on kernel regression and linear models, respectively. BAKR is the first method that is competitive in both settings.
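The property the abstract relies on, that a shift-invariant kernel can be approximated via random Fourier bases, can be illustrated for the Gaussian kernel with the sketch below. It covers only the kernel approximation step and is not the BAKR model itself; the function names, the bandwidth parameter `sigma`, and the feature count are assumptions chosen for the example.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Exact Gaussian kernel exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def make_random_fourier_map(dim, n_features, sigma=1.0, seed=0):
    """Return a feature map z with z(x) . z(y) approximating the Gaussian kernel.

    Frequencies are drawn from N(0, 1/sigma^2) per coordinate and phases
    from Uniform(0, 2*pi), the standard random Fourier feature construction.
    """
    rng = random.Random(seed)
    freqs = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
             for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def feature_map(x):
        return [
            scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(freqs, phases)
        ]

    return feature_map


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    z = make_random_fourier_map(dim=3, n_features=5000, sigma=1.0, seed=0)
    x = [0.1, -0.4, 0.3]
    y = [0.5, 0.2, -0.1]
    print("exact kernel: ", gaussian_kernel(x, y))
    print("approximation:", dot(z(x), z(y)))
```

Because the approximation is an ordinary inner product of finite feature vectors, a model built on those features is linear in them, which is what makes a projection back onto the original explanatory variables possible.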
Submitted 9 June, 2017; v1 submitted 5 August, 2015;
originally announced August 2015.
-
Bayesian Inference Applied to the Electromagnetic Inverse Problem
Authors:
David M. Schmidt,
John S. George,
C. C. Wood
Abstract:
We present a new approach to the electromagnetic inverse problem that explicitly addresses the ambiguity associated with its ill-posed character. Rather than calculating a single "best" solution according to some criterion, our approach produces a large number of likely solutions that fit both the data and any prior information that is used. While the range of the different likely results is representative of the ambiguity in the inverse problem even with prior information present, features that are common across a large number of the different solutions can be identified and are associated with a high degree of probability. This approach is implemented and quantified within the formalism of Bayesian inference, which combines prior information with that from measurement in a common framework using a single measure. To demonstrate this approach, a general neural activation model is constructed that includes a variable number of extended regions of activation and can incorporate a great deal of prior information on neural current such as information on location, orientation, strength and spatial smoothness. Taken together, this activation model and the Bayesian inferential approach yield estimates of the probability distributions for the number, location, and extent of active regions. Both simulated MEG data and data from a visual evoked response experiment are used to demonstrate the capabilities of this approach.
Submitted 16 June, 1998;
originally announced September 2003.