-
Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations
Authors:
Evan Miller
Abstract:
Evaluations are critical for understanding the capabilities of large language models (LLMs). Fundamentally, evaluations are experiments; but the literature on evaluations has largely ignored the literature from other sciences on experiment analysis and planning. This article shows researchers with some training in statistics how to think about and analyze data from language model evaluations. Conceptualizing evaluation questions as having been drawn from an unseen super-population, we present formulas for analyzing evaluation data, measuring differences between two models, and planning an evaluation experiment. We make a number of specific recommendations for running language model evaluations and reporting experiment results in a way that minimizes statistical noise and maximizes informativeness.
Submitted 1 November, 2024;
originally announced November 2024.
-
Discovering group dynamics in coordinated time series via hierarchical recurrent switching-state models
Authors:
Michael T. Wojnowicz,
Kaitlin Gili,
Preetish Rath,
Eric Miller,
Jeffrey Miller,
Clifford Hancock,
Meghan O'Donovan,
Seth Elkin-Frankston,
Tad T. Brunyé,
Michael C. Hughes
Abstract:
We seek a computationally efficient model for a collection of time series arising from multiple interacting entities (a.k.a. "agents"). Recent models of spatiotemporal patterns across individuals fail to incorporate explicit system-level collective behavior that can influence the trajectories of individual entities. To address this gap in the literature, we present a new hierarchical switching-state model that can be trained in an unsupervised fashion to simultaneously learn both system-level and individual-level dynamics. We employ a latent system-level discrete state Markov chain that provides top-down influence on latent entity-level chains which in turn govern the emission of each observed time series. Recurrent feedback from the observations to the latent chains at both entity and system levels allows recent situational context to inform how dynamics unfold at all levels in bottom-up fashion. We hypothesize that including both top-down and bottom-up influences on group dynamics will improve interpretability of the learned dynamics and reduce error when forecasting. Our hierarchical switching recurrent dynamical model can be learned via closed-form variational coordinate ascent updates to all latent chains that scale linearly in the number of entities. This is asymptotically no more costly than fitting a separate model for each entity. Analysis of both synthetic data and real basketball team movements suggests our lean parametric model can achieve competitive forecasts compared to larger neural network models that require far more computational resources. Further experiments on soldier data as well as a synthetic task with 64 cooperating entities show how our approach can yield interpretable insights about team dynamics over time.
Submitted 2 December, 2024; v1 submitted 26 January, 2024;
originally announced January 2024.
-
Likelihood-ratio inference on differences in quantiles
Authors:
Evan Miller
Abstract:
Quantiles can represent key operational and business metrics, but the computational challenges associated with inference have hampered their adoption in online experimentation. One-sample confidence intervals are trivial to construct; however, two-sample inference has traditionally required bootstrapping or a density estimator. This paper presents a new two-sample difference-in-quantile hypothesis test and confidence interval based on a likelihood-ratio test statistic. A conservative version of the test does not involve a density estimator; a second version of the test, which uses a density estimator, yields confidence intervals very close to the nominal coverage level. It can be computed using only four order statistics from each sample.
Submitted 31 July, 2024; v1 submitted 15 September, 2023;
originally announced January 2024.
-
Easy Variational Inference for Categorical Models via an Independent Binary Approximation
Authors:
Michael T. Wojnowicz,
Shuchin Aeron,
Eric L. Miller,
Michael C. Hughes
Abstract:
We pursue tractable Bayesian analysis of generalized linear models (GLMs) for categorical data. Thus far, GLMs are difficult to scale to more than a few dozen categories due to non-conjugacy or strong posterior dependencies when using conjugate auxiliary variable methods. We define a new class of GLMs for categorical data called categorical-from-binary (CB) models. Each CB model has a likelihood that is bounded by the product of binary likelihoods, suggesting a natural posterior approximation. This approximation makes inference straightforward and fast; using well-known auxiliary variables for probit or logistic regression, the product of binary models admits conjugate closed-form variational inference that is embarrassingly parallel across categories and invariant to category ordering. Moreover, an independent binary model simultaneously approximates multiple CB models. Bayesian model averaging over these can improve the quality of the approximation for any given dataset. We show that our approach scales to thousands of categories, outperforming posterior estimation competitors like Automatic Differentiation Variational Inference (ADVI) and No U-Turn Sampling (NUTS) in the time required to achieve fixed prediction quality.
Submitted 31 May, 2022;
originally announced June 2022.
-
Dynamical Wasserstein Barycenters for Time-series Modeling
Authors:
Kevin C. Cheng,
Shuchin Aeron,
Michael C. Hughes,
Eric L. Miller
Abstract:
Many time series can be modeled as a sequence of segments representing high-level discrete states, such as running and walking in a human activity application. Flexible models should describe the system state and observations in stationary "pure-state" periods as well as transition periods between adjacent segments, such as a gradual slowdown between running and walking. However, most prior work assumes instantaneous transitions between pure discrete states. We propose a dynamical Wasserstein barycentric (DWB) model that estimates the system state over time as well as the data-generating distributions of pure states in an unsupervised manner. Our model assumes each pure state generates data from a multivariate normal distribution, and characterizes transitions between states via displacement-interpolation specified by the Wasserstein barycenter. The system state is represented by a barycentric weight vector which evolves over time via a random walk on the simplex. Parameter learning leverages the natural Riemannian geometry of Gaussian distributions under the Wasserstein distance, which leads to improved convergence speeds. Experiments on several human activity datasets show that our proposed DWB model accurately learns the generating distribution of pure states while improving state estimation for transition periods compared to the commonly used linear interpolation mixture models.
Submitted 31 October, 2022; v1 submitted 13 October, 2021;
originally announced October 2021.
-
mlf-core: a framework for deterministic machine learning
Authors:
Lukas Heumos,
Philipp Ehmele,
Luis Kuhn Cuellar,
Kevin Menden,
Edmund Miller,
Steffen Lemke,
Gisela Gabernet,
Sven Nahnsen
Abstract:
Machine learning has shown extensive growth in recent years and is now routinely applied to sensitive areas. To allow appropriate verification of predictive models before deployment, models must be deterministic. However, major machine learning libraries default to the usage of non-deterministic algorithms based on atomic operations. Solely fixing all random seeds is not sufficient for deterministic machine learning. To overcome this shortcoming, various machine learning libraries released deterministic counterparts to the non-deterministic algorithms. We evaluated the effect of these algorithms on determinism and runtime. Based on these results, we formulated a set of requirements for deterministic machine learning and developed a new software solution, the mlf-core ecosystem, which aids machine learning projects to meet and keep these requirements. We applied mlf-core to develop deterministic models in various biomedical fields including a single cell autoencoder with TensorFlow, a PyTorch-based U-Net model for liver-tumor segmentation in CT scans, and a liver cancer classifier based on gene expression profiles with XGBoost.
Submitted 16 June, 2022; v1 submitted 15 April, 2021;
originally announced April 2021.
-
A Fully Tensorized Recurrent Neural Network
Authors:
Charles C. Onu,
Jacob E. Miller,
Doina Precup
Abstract:
Recurrent neural networks (RNNs) are powerful tools for sequential modeling, but typically require significant overparameterization and regularization to achieve optimal performance. This leads to difficulties in the deployment of large RNNs in resource-limited settings, while also introducing complications in hyperparameter selection and training. To address these issues, we introduce a "fully tensorized" RNN architecture which jointly encodes the separate weight matrices within each recurrent cell using a lightweight tensor-train (TT) factorization. This approach represents a novel form of weight sharing which reduces model size by several orders of magnitude, while still maintaining similar or better performance compared to standard RNNs. Experiments on image classification and speaker verification tasks demonstrate further benefits for reducing inference times and stabilizing model training and hyperparameter selection.
Submitted 10 November, 2021; v1 submitted 8 October, 2020;
originally announced October 2020.
-
On Matched Filtering for Statistical Change Point Detection
Authors:
Kevin C. Cheng,
Eric L. Miller,
Michael C. Hughes,
Shuchin Aeron
Abstract:
Non-parametric and distribution-free two-sample tests have been the foundation of many change point detection algorithms. However, randomness in the test statistic as a function of time makes them susceptible to false positives and localization ambiguity. We address these issues by deriving and applying filters matched to the expected temporal signatures of a change for various sliding window, two-sample tests under IID assumptions on the data. These filters are derived asymptotically with respect to the window size for the Wasserstein quantile test, the Wasserstein-1 distance test, Maximum Mean Discrepancy squared (MMD^2), and the Kolmogorov-Smirnov (KS) test. The matched filters are shown to have two important properties. First, they are distribution-free, and thus can be applied without prior knowledge of the underlying data distributions. Second, they are peak-preserving, which allows the filtered signal produced by our methods to maintain expected statistical significance. Through experiments on synthetic data as well as activity recognition benchmarks, we demonstrate the utility of this approach for mitigating false positives and improving the test precision. Our method allows for the localization of change points without the use of ad-hoc post-processing to remove redundant detections common to current methods. We further highlight the performance of statistical tests based on the Quantile-Quantile (Q-Q) function and show how the invariance property of the Q-Q function to order-preserving transformations allows these tests to detect change points of different scales with a single threshold within the same dataset.
Submitted 27 October, 2020; v1 submitted 9 June, 2020;
originally announced June 2020.
-
Estimation of ascertainment bias and its effect on power in clinical trials with time-to-event outcomes
Authors:
E. J. Greene,
P. Peduzzi,
J. Dziura,
C. Meng,
M. E. Miller,
T. G. Travison,
D. Esserman
Abstract:
While the gold standard for clinical trials is to blind all parties -- participants, researchers, and evaluators -- to treatment assignment, this is not always a possibility. When some or all of the above individuals know the treatment assignment, this leaves the study open to the introduction of post-randomization biases. In the Strategies to Reduce Injuries and Develop Confidence in Elders (STRIDE) trial, we were presented with the potential for the unblinded clinicians administering the treatment, as well as the individuals enrolled in the study, to introduce ascertainment bias into some but not all events comprising the primary outcome. In this manuscript, we present ways to estimate the ascertainment bias for a time-to-event outcome, and discuss its impact on the overall power of a trial versus changing of the outcome definition to a more stringent unbiased definition that restricts attention to measurements less subject to potentially differential assessment. We found that for the majority of situations, it is better to revise the definition to a more stringent definition, as was done in STRIDE, even though fewer events may be observed.
Submitted 2 October, 2020; v1 submitted 28 April, 2020;
originally announced April 2020.
-
Torus Graphs for Multivariate Phase Coupling Analysis
Authors:
Natalie Klein,
Josue Orellana,
Scott Brincat,
Earl K. Miller,
Robert E. Kass
Abstract:
Angular measurements are often modeled as circular random variables, where there are natural circular analogues of moments, including correlation. Because a product of circles is a torus, a d-dimensional vector of circular random variables lies on a d-dimensional torus. For such vectors we present here a class of graphical models, which we call torus graphs, based on the full exponential family with pairwise interactions. The topological distinction between a torus and Euclidean space has several important consequences.
Our development was motivated by the problem of identifying phase coupling among oscillatory signals recorded from multiple electrodes in the brain: oscillatory phases across electrodes might tend to advance or recede together, indicating coordination across brain areas. The data analyzed here consisted of 24 phase angles measured repeatedly across 840 experimental trials (replications) during a memory task, where the electrodes were in 4 distinct brain regions, all known to be active while memories are being stored or retrieved. In realistic numerical simulations, we found that a standard pairwise assessment, known as phase locking value, is unable to describe multivariate phase interactions, but that torus graphs can accurately identify conditional associations. Torus graphs generalize several more restrictive approaches that have appeared in various scientific literatures, and produced intuitive results in the data we analyzed. Torus graphs thus unify multivariate analysis of circular data and present fertile territory for future research.
Submitted 24 October, 2019;
originally announced October 2019.
-
Adversarial Domain Adaptation for Stable Brain-Machine Interfaces
Authors:
Ali Farshchian,
Juan A. Gallego,
Joseph P. Cohen,
Yoshua Bengio,
Lee E. Miller,
Sara A. Solla
Abstract:
Brain-Machine Interfaces (BMIs) have recently emerged as a clinically viable option to restore voluntary movements after paralysis. These devices are based on the ability to extract information about movement intent from neural signals recorded using multi-electrode arrays chronically implanted in the motor cortices of the brain. However, the inherent loss and turnover of recorded neurons requires repeated recalibrations of the interface, which can potentially alter the day-to-day user experience. The resulting need for continued user adaptation interferes with the natural, subconscious use of the BMI. Here, we introduce a new computational approach that decodes movement intent from a low-dimensional latent representation of the neural data. We implement various domain adaptation methods to stabilize the interface over significantly long times. This includes Canonical Correlation Analysis used to align the latent variables across days; this method requires prior point-to-point correspondence of the time series across domains. Alternatively, we match the empirical probability distributions of the latent variables across days through the minimization of their Kullback-Leibler divergence. These two methods provide a significant and comparable improvement in the performance of the interface. However, implementation of an Adversarial Domain Adaptation Network trained to match the empirical probability distribution of the residuals of the reconstructed neural signals outperforms the two methods based on latent variables, while requiring remarkably few data points to solve the domain adaptation problem.
Submitted 15 January, 2019; v1 submitted 28 September, 2018;
originally announced October 2018.
-
Ensemble Multi-task Gaussian Process Regression with Multiple Latent Processes
Authors:
Weitong Ruan,
Eric L. Miller
Abstract:
Multi-task/Multi-output learning seeks to exploit correlation among tasks to enhance performance over learning or solving each task independently. In this paper, we investigate this problem in the context of Gaussian Processes (GPs) and propose a new model which learns a mixture of latent processes by decomposing the covariance matrix into a sum of structured hidden components, each of which is controlled by a latent GP over input features and a "weight" over tasks. From this sum structure, we propose a parallelizable parameter learning algorithm with a predetermined initialization for the "weights". We also notice that an ensemble parameter learning approach using mini-batches of training data not only reduces the computational complexity of learning but also improves the regression performance. We evaluate our model on two datasets, the smaller Swiss Jura dataset and another relatively larger ATMS dataset from NOAA. Substantial improvements are observed compared with established alternatives.
Submitted 9 May, 2018; v1 submitted 22 September, 2017;
originally announced September 2017.
-
Machine learning for neural decoding
Authors:
Joshua I. Glaser,
Ari S. Benjamin,
Raeed H. Chowdhury,
Matthew G. Perich,
Lee E. Miller,
Konrad P. Kording
Abstract:
Despite rapid advances in machine learning tools, the majority of neural decoding approaches still use traditional methods. Modern machine learning tools, which are versatile and easy to use, have the potential to significantly improve decoding performance. This tutorial describes how to effectively apply these algorithms for typical decoding problems. We provide descriptions, best practices, and code for applying common machine learning methods, including neural networks and gradient boosting. We also provide detailed comparisons of the performance of various methods at the task of decoding spiking activity in motor cortex, somatosensory cortex, and hippocampus. Modern methods, particularly neural networks and ensembles, significantly outperform traditional approaches, such as Wiener and Kalman filters. Improving the performance of neural decoding algorithms allows neuroscientists to better understand the information contained in a neural population and can help advance engineering applications such as brain machine interfaces.
Submitted 3 July, 2020; v1 submitted 2 August, 2017;
originally announced August 2017.
-
Persistent homology analysis of brain artery trees
Authors:
Paul Bendich,
J. S. Marron,
Ezra Miller,
Alex Pieloch,
Sean Skwerer
Abstract:
New representations of tree-structured data objects, using ideas from topological data analysis, enable improved statistical analyses of a population of brain artery trees. A number of representations of each data tree arise from persistence diagrams that quantify branching and looping of vessels at multiple scales. Novel approaches to the statistical analysis, through various summaries of the persistence diagrams, lead to heightened correlations with covariates such as age and sex, relative to earlier analyses of this data set. The correlation with age continues to be significant even after controlling for correlations from earlier significant summaries.
Submitted 24 November, 2014;
originally announced November 2014.
-
Rapid Adaptation of POS Tagging for Domain Specific Uses
Authors:
John E. Miller,
Michael Bloodgood,
Manabu Torii,
K. Vijay-Shanker
Abstract:
Part-of-speech (POS) tagging is a fundamental component for performing natural language tasks such as parsing, information extraction, and question answering. When POS taggers are trained in one domain and applied in significantly different domains, their performance can degrade dramatically. We present a methodology for rapid adaptation of POS taggers to new domains. Our technique is unsupervised in that a manually annotated corpus for the new domain is not necessary. We use suffix information gathered from large amounts of raw text as well as orthographic information to increase the lexical coverage. We present an experiment in the Biological domain where our POS tagger achieves results comparable to POS taggers specifically trained to this domain.
Submitted 31 October, 2014;
originally announced November 2014.
-
Exploiting Structural Complexity for Robust and Rapid Hyperspectral Imaging
Authors:
Gregory Ely,
Shuchin Aeron,
Eric L. Miller
Abstract:
This paper presents several strategies for spectral de-noising of hyperspectral images and hypercube reconstruction from a limited number of tomographic measurements. In particular, we show that the non-noisy spectral data, when stacked across the spectral dimension, exhibits low-rank structure. On the other hand, under the same representation, the spectral noise exhibits a banded structure. Motivated by this, we show that the de-noised spectral data and the unknown spectral noise and the respective bands can be simultaneously estimated through the use of a low-rank and simultaneous sparse minimization operation without prior knowledge of the noisy bands. This result is novel for hyperspectral imaging applications. In addition, we show that imaging for the Computed Tomography Imaging Systems (CTIS) can be improved under limited angle tomography by using low-rank penalization. For both of these cases we exploit the recent results in the theory of low-rank matrix completion using nuclear norm minimization.
Submitted 9 May, 2013;
originally announced May 2013.