-
BioAgents: Democratizing Bioinformatics Analysis with Multi-Agent Systems
Authors:
Nikita Mehandru,
Amanda K. Hall,
Olesya Melnichenko,
Yulia Dubinina,
Daniel Tsirulnikov,
David Bamman,
Ahmed Alaa,
Scott Saponas,
Venkat S. Malladi
Abstract:
Creating end-to-end bioinformatics workflows requires diverse domain expertise, which poses challenges for both junior and senior researchers as it demands a deep understanding of both genomics concepts and computational techniques. While large language models (LLMs) provide some assistance, they often fall short in providing the nuanced guidance needed to execute complex bioinformatics tasks, and require expensive computing resources to achieve high performance. We thus propose a multi-agent system built on small language models, fine-tuned on bioinformatics data, and enhanced with retrieval-augmented generation (RAG). Our system, BioAgents, enables local operation and personalization using proprietary data. We observe performance comparable to human experts on conceptual genomics tasks, and suggest next steps to enhance code generation capabilities.
Submitted 10 January, 2025;
originally announced January 2025.
-
Continuous Analysis: Evolution of Software Engineering and Reproducibility for Science
Authors:
Venkat S. Malladi,
Maria Yazykova,
Olesya Melnichenko,
Yulia Dubinina
Abstract:
Reproducibility in research remains hindered by complex systems involving data, models, tools, and algorithms. Studies highlight a reproducibility crisis due to a lack of standardized reporting, code and data sharing, and rigorous evaluation. This paper introduces the concept of Continuous Analysis (CA) to address the reproducibility challenges in scientific research, extending the DevOps lifecycle. Continuous Analysis proposes solutions through version control, analysis orchestration, and feedback mechanisms, enhancing the reliability of scientific results. By adopting CA, the scientific community can ensure the validity and generalizability of research outcomes, fostering transparency and collaboration and ultimately advancing the field.
Submitted 4 November, 2024;
originally announced November 2024.
-
The GA4GH Task Execution API: Enabling Easy Multi Cloud Task Execution
Authors:
Alexander Kanitz,
Matthew H. McLoughlin,
Liam Beckman,
Venkat S. Malladi,
Kyle P. Ellrott
Abstract:
The Global Alliance for Genomics and Health (GA4GH) Task Execution Service (TES) API is a standardized schema and API for describing and executing batch execution tasks. It provides a common way to submit and manage tasks to a variety of compute environments, including on-premises High Performance Computing and High Throughput Computing (HPC/HTC) systems, cloud computing platforms, and hybrid environments. The TES API is designed to be flexible and extensible, allowing it to be adapted to a wide range of use cases, such as "bringing compute to the data" solutions for federated and distributed data analysis or load balancing across multi-cloud infrastructures. This API has been adopted by a number of different service providers and utilized by several workflow engines. Using its capabilities, genomics research institutes are building hybrid compute systems to study life science.
Submitted 8 February, 2024;
originally announced May 2024.
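
To illustrate what "a standardized schema for describing batch execution tasks" in the TES abstract above might look like in practice, here is a minimal sketch that builds a task document as JSON. The field names (name, executors, image, command, resources, cpu_cores, ram_gb) reflect our understanding of the TES v1 task schema and should be checked against the published specification. The helper build_tes_task, the container image, and the command are hypothetical and not taken from the paper. No network call is made.

```python
import json


def build_tes_task(name, image, command, cpu_cores=1, ram_gb=1.0):
    """Build a minimal TES-style task document as a plain dict.

    Assumption: field names mirror the TES v1 task schema; verify
    them against the official specification before relying on them.
    """
    return {
        "name": name,
        # Each executor runs one command inside one container image.
        "executors": [{"image": image, "command": list(command)}],
        # Requested compute resources for the whole task.
        "resources": {"cpu_cores": cpu_cores, "ram_gb": ram_gb},
    }


# Hypothetical example task: run a single shell command in a small image.
task = build_tes_task("hello-tes", "alpine:3.19", ["echo", "hello"])

# Serialize to the JSON body that a client would send to a TES server.
payload = json.dumps(task, indent=2)
print(payload)
```

A client would typically submit such a payload with an HTTP POST to the task-creation endpoint of a TES server and then poll the returned task for its state. The base URL and authentication depend on the deployment, so they are left out of this sketch.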