-
Incorporating LLM Embeddings for Variation Across the Human Genome
Authors:
Hongqian Niu,
Jordan Bryan,
Xihao Li,
Didong Li
Abstract:
Recent advances in large language model (LLM) embeddings have enabled powerful representations for biological data, but most applications to date focus only on gene-level information. We present one of the first systematic frameworks to generate variant-level embeddings across the entire human genome. Using curated annotations from FAVOR, ClinVar, and the GWAS Catalog, we constructed semantic text descriptions for 8.9 billion possible variants and generated embeddings at three scales: 1.5 million HapMap3+MEGA variants, ~90 million imputed UK Biobank variants, and ~9 billion all possible variants. Embeddings were produced with both OpenAI's text-embedding-3-large and the open-source Qwen3-Embedding-0.6B models. Baseline experiments demonstrate high predictive accuracy for variant properties, validating the embeddings as structured representations of genomic variation. We outline two downstream applications: embedding-informed hypothesis testing by extending the Frequentist And Bayesian framework to genome-wide association studies, and embedding-augmented genetic risk prediction that enhances standard polygenic risk scores. These resources, publicly available on Hugging Face, provide a foundation for advancing large-scale genomic discovery and precision medicine.
Submitted 24 September, 2025;
originally announced September 2025.
-
Subscedastic weighted least squares estimates
Authors:
Jordan Bryan,
Haibo Zhou,
Didong Li
Abstract:
In the heteroscedastic linear model, the weighted least squares (WLS) estimate of the model coefficients is more efficient than the ordinary least squares (OLS) estimate. However, the practical application of WLS is challenging because it requires knowledge of the error variances. Feasible weighted least squares (FLS) estimates, which use approximations of the variances when they are unknown, may either be more or less efficient than the OLS estimate depending on the quality of the approximation. A direct comparison between FLS and OLS has significant implications for the application of regression analysis in varied fields, yet such a comparison remains an unresolved challenge. In this study, we address this challenge by identifying the conditions under which FLS estimates using fixed weights demonstrate greater efficiency than the OLS estimate. These conditions provide guidance for the design of feasible estimates using random weights. They also shed light on how certain robust regression estimates behave with respect to the linear model with normal errors of unequal variance.
Submitted 27 May, 2025; v1 submitted 31 March, 2024;
originally announced April 2024.
-
Linear Source Apportionment using Generalized Least Squares
Authors:
Jordan Bryan,
Peter Hoff
Abstract:
Motivated by applications to water quality monitoring using fluorescence spectroscopy, we develop the source apportionment model for high dimensional profiles of dissolved organic matter (DOM). We describe simple methods to estimate the parameters of a linear source apportionment model, and show how the estimates are related to those of ordinary and generalized least squares. Using this least squares framework, we analyze the variability of the estimates, and we propose predictors for missing elements of a DOM profile. We demonstrate the practical utility of our results on fluorescence spectroscopy data collected from the Neuse River in North Carolina.
Submitted 19 October, 2023;
originally announced October 2023.
-
The multirank likelihood for semiparametric canonical correlation analysis
Authors:
Jordan G. Bryan,
Jonathan Niles-Weed,
Peter D. Hoff
Abstract:
Many analyses of multivariate data focus on evaluating the dependence between two sets of variables, rather than the dependence among individual variables within each set. Canonical correlation analysis (CCA) is a classical data analysis technique that estimates parameters describing the dependence between such sets. However, inference procedures based on traditional CCA rely on the assumption that all variables are jointly normally distributed. We present a semiparametric approach to CCA in which the multivariate margins of each variable set may be arbitrary, but the dependence between variable sets is described by a parametric model that provides low-dimensional summaries of dependence. While maximum likelihood estimation in the proposed model is intractable, we propose two estimation strategies: one using a pseudolikelihood for the model and one using a Markov chain Monte Carlo (MCMC) algorithm that provides Bayesian estimates and confidence regions for the between-set dependence parameters. The MCMC algorithm is derived from a multirank likelihood function, which uses only part of the information in the observed data in exchange for being free of assumptions about the multivariate margins. We apply the proposed Bayesian inference procedure to Brazilian climate data and monthly stock returns from the materials and communications market sectors.
Submitted 22 April, 2024; v1 submitted 14 December, 2021;
originally announced December 2021.
-
Graph-Based Machine Learning Improves Just-in-Time Defect Prediction
Authors:
Jonathan Bryan,
Pablo Moriano
Abstract:
The increasing complexity of today's software requires the contribution of thousands of developers. This complex collaboration structure makes developers more likely to introduce defect-prone changes that lead to software faults. Determining when these defect-prone changes are introduced has proven challenging, and using traditional machine learning (ML) methods to make these determinations seems to have reached a plateau. In this work, we build contribution graphs consisting of developers and source files to capture the nuanced complexity of changes required to build software. By leveraging these contribution graphs, our research shows the potential of using graph-based ML to improve Just-In-Time (JIT) defect prediction. We hypothesize that features extracted from the contribution graphs may be better predictors of defect-prone changes than intrinsic features derived from software characteristics. We corroborate our hypothesis using graph-based ML for classifying edges that represent defect-prone changes. This new framing of the JIT defect prediction problem leads to remarkably better results. We test our approach on 14 open-source projects and show that our best model can predict whether or not a code change will lead to a defect with an F1 score as high as 77.55% and a Matthews correlation coefficient (MCC) as high as 53.16%. This represents a 152% higher F1 score and a 3% higher MCC over the state-of-the-art JIT defect prediction. We describe limitations, open challenges, and how this method can be used for operational JIT defect prediction.
Submitted 14 April, 2023; v1 submitted 11 October, 2021;
originally announced October 2021.
-
Smaller $p$-values in genomics studies using distilled historical information
Authors:
Jordan G. Bryan,
Peter D. Hoff
Abstract:
Medical research institutions have generated massive amounts of biological data by genetically profiling hundreds of cancer cell lines. In parallel, academic biology labs have conducted genetic screens on small numbers of cancer cell lines under custom experimental conditions. In order to share information between these two approaches to scientific discovery, this article proposes a "frequentist assisted by Bayes" (FAB) procedure for hypothesis testing that allows historical information from massive genomics datasets to increase the power of hypothesis tests in specialized studies. The exchange of information takes place through a novel probability model for multimodal genomics data, which distills historical information pertaining to cancer cell lines and genes across a wide variety of experimental contexts. If the relevance of the historical information for a given study is high, then the resulting FAB tests can be more powerful than the corresponding classical tests. If the relevance is low, then the FAB tests yield as many discoveries as the classical tests. Simulations and practical investigations demonstrate that the FAB testing procedure can increase the number of effects discovered in genomics studies while still maintaining strict control of type I error and false discovery rates.
Submitted 16 April, 2020;
originally announced April 2020.
-
Data Science: A Three Ring Circus or a Big Tent?
Authors:
Jennifer Bryan,
Hadley Wickham
Abstract:
This is part of a collection of discussion pieces on David Donoho's paper 50 Years of Data Science, appearing in Volume 26, Issue 4 of the Journal of Computational and Graphical Statistics (2017).
Submitted 20 December, 2017;
originally announced December 2017.
-
Weighted Classification Cascades for Optimizing Discovery Significance in the HiggsML Challenge
Authors:
Lester Mackey,
Jordan Bryan,
Man Yue Mo
Abstract:
We introduce a minorization-maximization approach to optimizing common measures of discovery significance in high energy physics. The approach alternates between solving a weighted binary classification problem and updating class weights in a simple, closed-form manner. Moreover, an argument based on convex duality shows that an improvement in weighted classification error on any round yields a commensurate improvement in discovery significance. We complement our derivation with experimental results from the 2014 Higgs boson machine learning challenge.
Submitted 10 September, 2015; v1 submitted 9 September, 2014;
originally announced September 2014.