-
Biology Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models
Authors:
Haonan He,
Yuchen Ren,
Yining Tang,
Ziyang Xu,
Junxian Li,
Minghao Yang,
Di Zhang,
Dong Yuan,
Tao Chen,
Shufei Zhang,
Yuqiang Li,
Nanqing Dong,
Wanli Ouyang,
Dongzhan Zhou,
Peng Ye
Abstract:
Large language models have already demonstrated their formidable capabilities in general domains, ushering in a revolutionary transformation. However, exploring and exploiting the extensive knowledge of these models to comprehend multi-omics biology remains underexplored. To fill this research gap, we first introduce Biology-Instructions, the first large-scale multi-omics biological sequence-related instruction-tuning dataset including DNA, RNA, proteins, and multi-molecules, designed to bridge the gap between large language models (LLMs) and complex biological sequence-related tasks. This dataset can enhance the versatility of LLMs by integrating diverse biological sequence-based prediction tasks with advanced reasoning capabilities, while maintaining conversational fluency. Additionally, we reveal significant performance limitations in even state-of-the-art LLMs on biological sequence-related multi-omics tasks without specialized pre-training and instruction-tuning. We further develop a strong baseline called ChatMultiOmics with a novel three-stage training pipeline, demonstrating the powerful ability to understand biology by using Biology-Instructions. Biology-Instructions and ChatMultiOmics are publicly available and are crucial resources for enabling more effective integration of LLMs with multi-omics sequence analysis.
Submitted 26 December, 2024;
originally announced December 2024.
-
COMET: Benchmark for Comprehensive Biological Multi-omics Evaluation Tasks and Language Models
Authors:
Yuchen Ren,
Wenwei Han,
Qianyuan Zhang,
Yining Tang,
Weiqiang Bai,
Yuchen Cai,
Lifeng Qiao,
Hao Jiang,
Dong Yuan,
Tao Chen,
Siqi Sun,
Pan Tan,
Wanli Ouyang,
Nanqing Dong,
Xinzhu Ma,
Peng Ye
Abstract:
As key elements within the central dogma, DNA, RNA, and proteins play crucial roles in maintaining life by guaranteeing accurate genetic expression and implementation. Although research on these molecules has profoundly impacted fields like medicine, agriculture, and industry, the diversity of machine learning approaches (from traditional statistical methods to deep learning models and large language models) makes it hard for researchers to choose the most suitable model for a specific task, especially for cross-omics and multi-omics tasks, where comprehensive benchmarks are lacking. To address this, we introduce the first comprehensive multi-omics benchmark COMET (Benchmark for Biological COmprehensive Multi-omics Evaluation Tasks and Language Models), designed to evaluate models across single-omics, cross-omics, and multi-omics tasks. First, we curate and develop a diverse collection of downstream tasks and datasets covering key structural and functional aspects in DNA, RNA, and proteins, including tasks that span multiple omics levels. Then, we evaluate existing foundational language models for DNA, RNA, and proteins, as well as the newly proposed multi-omics method, offering valuable insights into their performance in integrating and analyzing data from different biological modalities. This benchmark aims to define critical issues in multi-omics research and guide future directions, ultimately promoting advancements in understanding biological processes through integrated analysis of data from different omics.
Submitted 13 December, 2024;
originally announced December 2024.
-
BEACON: Benchmark for Comprehensive RNA Tasks and Language Models
Authors:
Yuchen Ren,
Zhiyuan Chen,
Lifeng Qiao,
Hongtai Jing,
Yuchen Cai,
Sheng Xu,
Peng Ye,
Xinzhu Ma,
Siqi Sun,
Hongliang Yan,
Dong Yuan,
Wanli Ouyang,
Xihui Liu
Abstract:
RNA plays a pivotal role in translating genetic instructions into functional outcomes, underscoring its importance in biological processes and disease mechanisms. Despite the emergence of numerous deep learning approaches for RNA, particularly universal RNA language models, there remains a significant lack of standardized benchmarks to assess the effectiveness of these methods. In this study, we introduce the first comprehensive RNA benchmark BEACON (BEnchmArk for COmprehensive RNA Task and Language Models). First, BEACON comprises 13 distinct tasks derived from extensive previous work covering structural analysis, functional studies, and engineering applications, enabling a comprehensive assessment of the performance of methods on various RNA understanding tasks. Second, we examine a range of models, including traditional approaches like CNNs, as well as advanced RNA foundation models based on language models, offering valuable insights into the task-specific performances of these models. Third, we investigate the vital RNA language model components from the tokenizer and positional encoding aspects. Notably, our findings emphasize the superiority of single nucleotide tokenization and the effectiveness of Attention with Linear Biases (ALiBi) over traditional positional encoding methods. Based on these insights, a simple yet strong baseline called BEACON-B is proposed, which can achieve outstanding performance with limited data and computational resources. The datasets and source code of our benchmark are available at https://github.com/terry-r123/RNABenchmark.
Submitted 12 December, 2024; v1 submitted 14 June, 2024;
originally announced June 2024.
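The BEACON abstract above credits two design choices: single nucleotide tokenization and Attention with Linear Biases (ALiBi). The following is a minimal sketch of both ideas and is not taken from the BEACON code base. The function names, the vocabulary, and the symmetric `|i - j|` distance (a common adaptation of ALiBi for bidirectional encoders) are assumptions made for illustration; the slope schedule is the standard ALiBi geometric sequence for a power-of-two number of heads.

```python
from typing import Dict, List

# Hypothetical vocabulary for illustration; real models define their own.
VOCAB: Dict[str, int] = {"A": 0, "C": 1, "G": 2, "U": 3, "N": 4}


def tokenize_single_nucleotide(seq: str) -> List[int]:
    """Map each nucleotide to one token id; unknown symbols map to 'N'."""
    return [VOCAB.get(base, VOCAB["N"]) for base in seq.upper()]


def alibi_slopes(num_heads: int) -> List[float]:
    """Per-head slopes 2^(-8/n), 2^(-16/n), ... (assumes n is a power of two)."""
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(seq_len: int, slope: float) -> List[List[float]]:
    """Bias added to attention logits: a linear penalty on token distance.

    Assumption: symmetric distance |i - j|, as often used for
    bidirectional (encoder-style) attention.
    """
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]


tokens = tokenize_single_nucleotide("AUGGC")
slopes = alibi_slopes(8)
bias = alibi_bias(len(tokens), slopes[0])
```

In an attention layer, `bias` for a given head would be added to the query-key score matrix before the softmax, so no learned or sinusoidal position embeddings are required.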
-
A dynamic model to study the potential TB infections and assessment of control strategies in China
Authors:
Chuanqing Xu,
Kedeng Cheng,
Songbai Guo,
Dehui Yuan,
Xiaoyu Zhao
Abstract:
China is one of the countries with a high burden of tuberculosis, and although the number of new cases of tuberculosis has been decreasing year by year, the number of new infections per year has remained high and the diagnosis rate of tuberculosis-infected patients has remained low. Based on the analysis of TB infection data, we develop a model of TB transmission dynamics that includes potentially infected individuals and BCG vaccination, fit the model parameters to the data on new TB cases, and calculate the basic reproduction number $\mathcal{R}_v = 0.4442$. A parametric sensitivity analysis of $\mathcal{R}_v$ is performed, and we obtain the correlation coefficients of BCG vaccination rate and effectiveness rate with $\mathcal{R}_v$ as -0.810 and -0.825, respectively. According to the model, we estimate that there are 614,186 (95% CI [562,631, 665,741]) potentially infected TB cases in China, accounting for about 39.5% of the total number of TB cases. We assess the feasibility of achieving the goals of the WHO strategy to end tuberculosis in China and find that reducing the number of new cases by 90 per cent by 2035 is very difficult with the current tuberculosis control measures. However, with an effective combination of control measures such as increased detection of potentially infected persons, improved drug treatment, and reduction of overall exposure to tuberculosis patients, it is feasible to reach the WHO strategic goal of ending tuberculosis by 2035.
Submitted 25 January, 2024; v1 submitted 22 January, 2024;
originally announced January 2024.
-
Role of genetic polymorphisms in transgenerational inheritance in budding yeast
Authors:
Zuobin Zhu,
Qing Lu,
Dejian Yuan,
Yanke Li,
Xian Man,
Yueran Zhu,
Shi Huang
Abstract:
Transgenerational inheritance of a trait is presumably affected by both genetic and environmental factors but remains poorly understood. We studied the effect of genetic polymorphisms on transgenerational inheritance of yeast segregants that were derived from a cross between a laboratory strain and a wild strain of Saccharomyces cerevisiae. For each SNP analyzed, the parental allele present in less than half of the segregant panel was called the minor allele (MA). We found a nonrandom distribution of MAs in the segregants, indicating natural selection. We compared segregants with high MA content (MAC) to those with lower MAC and found a more dramatic shortening of the lag phase for the high MAC group in response to 14 days of ethanol training. Also, the short lag phase acquired and epigenetically memorized through ethanol training was more dramatically lost after 7 days of recovery in ethanol-free medium for the high MAC group. Sodium chloride treatment produced similar observations. Using public datasets, we found MAC linkage to mRNA expression of hundreds of genes. Finally, we found a preferential effect of MAC on traits with a high number of known additive quantitative trait loci (QTLs). These results provide evidence for the slightly deleterious nature of most MAs and a lower capacity to maintain inheritance of traits in individuals or cells with greater MAC, which have implications for disease prevention and treatment and the "missing heritability" problem in complex traits and diseases.
Submitted 12 July, 2013; v1 submitted 27 February, 2013;
originally announced February 2013.
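The abstract above defines the minor allele (MA) at each SNP as the parental allele present in less than half of the segregant panel, and compares segregants by their MA content (MAC). The sketch below illustrates that definition on a toy panel. It is not the authors' pipeline: the 0/1 encoding of the two parental alleles, the function names, the choice to skip SNPs with an exact 50/50 split, and the reporting of MAC as a fraction rather than a count are all assumptions.

```python
from typing import List, Optional


def minor_alleles(panel: List[List[int]]) -> List[Optional[int]]:
    """Return the minor allele (0 or 1) at each SNP, or None on an exact tie.

    panel[i][j] is the parental allele (encoded 0 or 1) carried by
    segregant i at SNP j.
    """
    n = len(panel)
    result: List[Optional[int]] = []
    for j in range(len(panel[0])):
        ones = sum(row[j] for row in panel)
        if 2 * ones < n:
            result.append(1)      # allele 1 is in fewer than half
        elif 2 * ones > n:
            result.append(0)      # allele 0 is in fewer than half
        else:
            result.append(None)   # no minor allele at a 50/50 split
    return result


def minor_allele_content(panel: List[List[int]]) -> List[float]:
    """Fraction of informative SNPs at which each segregant carries the MA."""
    minors = minor_alleles(panel)
    informative = [j for j, m in enumerate(minors) if m is not None]
    if not informative:
        return [0.0 for _ in panel]
    return [
        sum(1 for j in informative if row[j] == minors[j]) / len(informative)
        for row in panel
    ]


# Toy panel: 4 segregants (rows) by 3 SNPs (columns).
toy_panel = [
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 1],
    [0, 0, 0],
]
toy_minors = minor_alleles(toy_panel)
toy_mac = minor_allele_content(toy_panel)
```

Segregants could then be ranked by `toy_mac` and split into high and low MAC groups for a comparison of the kind the abstract describes.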
-
Methods for scoring the collective effect of SNPs: Minor alleles of common SNPs quantitatively affect traits/diseases and are under both positive and negative selection
Authors:
Dejian Yuan,
Zuobin Zhu,
Xiaohua Tan,
Jie Liang,
Ceng Zeng,
Jiegen Zhang,
Jun Chen,
Long Ma,
Ayca Dogan,
Gudrun Brockmann,
Oliver Goldmann,
Eva Medina,
Amanda D. Rice,
Richard W. Moyer,
Xian Man,
Ke Yi,
Yanke Li,
Qing Lu,
Yimin Huang,
Dapeng Wang,
Jun Yu,
Hui Guo,
Kun Xia,
Shi Huang
Abstract:
Most common SNPs are popularly assumed to be neutral. Here we developed novel methods to examine, in animal models and humans, whether extreme amounts of minor alleles (MAs) carried by an individual may represent extreme trait values and common diseases. We analyzed panels of genetic reference populations and identified the MAs in each panel and the MA content (MAC) that each strain carried. We also analyzed 21 published GWAS datasets of human diseases and identified the MAC of each case or control. MAC was nearly linearly linked to quantitative variations in numerous traits in model organisms, including life span, tumor susceptibility, learning and memory, sensitivity to alcohol and anti-psychotic drugs, and two correlated traits, poor reproductive fitness and strong immunity. Similarly, in Europeans or European Americans, enrichment of MAs of fast but not slow evolutionary rate was linked to autoimmune and numerous other diseases, including type 2 diabetes, Parkinson's disease, psychiatric disorders, alcohol and cocaine addictions, cancer, and shorter life span. Therefore, both high and low MAC correlated with extreme values in many traits, indicating stabilizing selection on most MAs. The methods here are broadly applicable and may help solve the missing heritability problem in complex traits and diseases.
Submitted 15 July, 2013; v1 submitted 12 September, 2012;
originally announced September 2012.