Reluctant Interaction Inference after Additive Modeling

Authors: Yiling Huang, Snigdha Panigrahi, Guo Yu, Jacob Bien

Abstract: Additive models enjoy the flexibility of nonlinear models while still being readily understandable to humans. By contrast, other nonlinear models, which involve interactions between features, are not only harder to fit but also substantially more complicated to explain. Guided by the principle of parsimony, a data analyst therefore may naturally be reluctant to move beyond an additive model unless it is truly warranted. To put this principle of interaction reluctance into practice, we formulate the problem as a hypothesis test with a fitted sparse additive model (SPAM) serving as the null. Because our hypotheses on interaction effects are formed after fitting a SPAM to the data, we adopt a selective inference approach to construct p-values that properly account for this data adaptivity. Our approach makes use of external randomization to obtain the distribution of test statistics conditional on the SPAM fit, allowing us to derive valid p-values, corrected for the over-optimism introduced by the data-adaptive process prior to the test. Through experiments on simulated and real data, we illustrate that--even with small amounts of external randomization--this rigorous modeling approach enjoys considerable advantages over naive methods and data splitting.

Submitted 1 June, 2025; originally announced June 2025.

Comments: 41 pages, 8 figures

arXiv:2505.06760 [pdf, other]

Quantifying uncertainty and stability among highly correlated predictors: a subspace perspective

Authors: Xiaozhu Zhang, Jacob Bien, Armeen Taeb

Abstract: We study the problem of linear feature selection when features are highly correlated. This setting presents two main challenges. First, how should false positives be defined? Intuitively, selecting a null feature that is highly correlated with a true one may be less problematic than selecting a completely uncorrelated null feature. Second, correlation among features can cause variable selection methods to produce very different feature sets across runs, making it hard to identify stable features. To address these issues, we propose a new framework based on feature subspaces -- the subspaces spanned by selected columns of the feature matrix. This framework leads to a new definition of false positives and negatives based on the "similarity" of feature subspaces. Further, instead of measuring stability of individual features, we measure stability with respect to feature subspaces. We propose and theoretically analyze a subspace generalization of stability selection (Meinshausen and Buhlmann, 2010). This procedure outputs multiple candidate stable models which can be considered interchangeable due to multicollinearity. We also propose a method for identifying substitute structures -- features that can be swapped and yield "equivalent" models. Finally, we demonstrate our framework and algorithms using both synthetic and real gene expression data. Our methods are implemented in the R package substab.

Submitted 10 May, 2025; originally announced May 2025.

arXiv:2504.12287 [pdf, other]

Trend Filtered Mixture of Experts for Automated Gating of High-Frequency Flow Cytometry Data

Authors: Sangwon Hyun, Tim Coleman, Francois Ribalet, Jacob Bien

Abstract: Ocean microbes are critical to both ocean ecosystems and the global climate. Flow cytometry, which measures cell optical properties in fluid samples, is routinely used in oceanographic research. Despite decades of accumulated data, identifying key microbial populations (a process known as "gating") remains a significant analytical challenge. To address this, we focus on gating multidimensional, high-frequency flow cytometry data collected continuously on board oceanographic research vessels, capturing time- and space-wise variations in the dynamic ocean. Our paper proposes a novel mixture-of-experts model in which both the gating function and the experts are given by trend filtering. The model leverages two key assumptions: (1) Each snapshot of flow cytometry data is a mixture of multivariate Gaussians and (2) the parameters of these Gaussians vary smoothly over time. Our method uses regularization and a constraint to ensure smoothness and that cluster means match biologically distinct microbe types. We demonstrate, using flow cytometry data from the North Pacific Ocean, that our proposed model accurately matches human-annotated gating and corrects significant errors.

Submitted 16 April, 2025; originally announced April 2025.

Comments: 23 pages (including supplement), 9 figures (including supplement)

MSC Class: 62H30 (Primary) 62G08; 92B10; 62J07 (Secondary)

arXiv:2502.09957 [pdf, other]

Thinning a Wishart Random Matrix

Authors: Ameer Dharamshi, Anna Neufeld, Lucy L. Gao, Daniela Witten, Jacob Bien

Abstract: Recent work has explored data thinning, a generalization of sample splitting that involves decomposing a (possibly matrix-valued) random variable into independent components. In the special case of an $n \times p$ random matrix with independent and identically distributed $N_p(μ, Σ)$ rows, Dharamshi et al. (2024a) provide a comprehensive analysis of the settings in which thinning is or is not possible: briefly, if $Σ$ is unknown, then one can thin provided that $n>1$. However, in some situations a data analyst may not have direct access to the data itself. For example, to preserve individuals' privacy, a data bank may provide only summary statistics such as the sample mean and sample covariance matrix. While the sample mean follows a Gaussian distribution, the sample covariance follows (up to scaling) a Wishart distribution, for which no thinning strategies have yet been proposed. In this note, we fill this gap: we show that it is possible to generate two independent data matrices with independent $N_p(μ, Σ)$ rows, based only on the sample mean and sample covariance matrix. These independent data matrices can either be used directly within a train-test paradigm, or can be used to derive independent summary statistics. Furthermore, they can be recombined to yield the original sample mean and sample covariance.

Submitted 14 February, 2025; originally announced February 2025.

arXiv:2410.09039 [pdf, other]

Semi-Supervised Learning of Noisy Mixture of Experts Models

Authors: Oh-Ran Kwon, Gourab Mukherjee, Jacob Bien

Abstract: The mixture of experts (MoE) model is a versatile framework for predictive modeling that has gained renewed interest in the age of large language models. A collection of predictive "experts" is learned along with a "gating function" that controls how much influence each expert is given when a prediction is made. This structure allows relatively simple models to excel in complex, heterogeneous data settings. In many contemporary settings, unlabeled data are widely available while labeled data are difficult to obtain. Semi-supervised learning methods seek to leverage the unlabeled data. We propose a novel method for semi-supervised learning of MoE models. We start from a semi-supervised MoE model, developed by oceanographers, that makes the strong assumption that the latent clustering structure in unlabeled data maps directly to the influence that the gating function should give each expert in the supervised task. We relax this assumption, imagining a noisy connection between the two, and propose an algorithm based on least trimmed squares, which succeeds even in the presence of misaligned data. Our theoretical analysis characterizes the conditions under which our approach yields estimators with a near-parametric rate of convergence. Simulated and real data examples demonstrate the method's efficacy.

Submitted 11 October, 2024; originally announced October 2024.

arXiv:2409.11497 [pdf, other]

Decomposing Gaussians with Unknown Covariance

Authors: Ameer Dharamshi, Anna Neufeld, Lucy L. Gao, Jacob Bien, Daniela Witten

Abstract: Common workflows in machine learning and statistics rely on the ability to partition the information in a data set into independent portions. Recent work has shown that this may be possible even when conventional sample splitting is not (e.g., when the number of samples $n=1$, or when observations are not independent and identically distributed). However, the approaches that are currently available to decompose multivariate Gaussian data require knowledge of the covariance matrix. In many important problems (such as in spatial or longitudinal data analysis, and graphical modeling), the covariance matrix may be unknown and even of primary interest. Thus, in this work we develop new approaches to decompose Gaussians with unknown covariance. First, we present a general algorithm that encompasses all previous decomposition approaches for Gaussian data as special cases, and can further handle the case of an unknown covariance. It yields a new and more flexible alternative to sample splitting when $n>1$. When $n=1$, we prove that it is impossible to partition the information in a multivariate Gaussian into independent portions without knowing the covariance matrix. Thus, we use the general algorithm to decompose a single multivariate Gaussian with unknown covariance into dependent parts with tractable conditional distributions, and demonstrate their use for inference and validation. The proposed decomposition strategy extends naturally to Gaussian processes. In simulation and on electroencephalography data, we apply these decompositions to the tasks of model selection and post-selection inference in settings where alternative strategies are unavailable.

Submitted 17 September, 2024; originally announced September 2024.

arXiv:2409.03069 [pdf, other]

Discussion of "Data fission: splitting a single data point"

Authors: Anna Neufeld, Ameer Dharamshi, Lucy L. Gao, Daniela Witten, Jacob Bien

Abstract: Leiner et al. [2023] introduce an important generalization of sample splitting, which they call data fission. They consider two cases of data fission: P1 fission and P2 fission. While P1 fission is extremely useful and easy to use, Leiner et al. [2023] provide P1 fission operations only for the Gaussian and the Poisson distributions. They provide little guidance on how to apply P2 fission operations in practice, leaving the reader unsure of how to apply data fission outside of the Gaussian and Poisson settings. In this discussion, we describe how our own work provides P1 fission operations in a wide variety of families and offers insight into when P1 fission is possible. We also provide guidance on how to actually apply P2 fission in practice, with a special focus on logistic regression. Finally, we interpret P2 fission as a remedy for distributional misspecification when carrying out P1 fission operations.

Submitted 4 September, 2024; originally announced September 2024.

Comments: 18 pages, 1 figure

arXiv:2402.16725 [pdf, other]

Inference on the proportion of variance explained in principal component analysis

Authors: Ronan Perry, Snigdha Panigrahi, Jacob Bien, Daniela Witten

Abstract: Principal component analysis (PCA) is a longstanding and well-studied approach for dimension reduction. It rests upon the assumption that the underlying signal in the data has low rank, and thus can be well-summarized using a small number of dimensions. The output of PCA is typically represented using a scree plot, which displays the proportion of variance explained (PVE) by each principal component. While the PVE is extensively reported in routine data analyses, to the best of our knowledge the notion of inference on the PVE remains unexplored. In this paper, we consider inference on the PVE. We first introduce a new population quantity for the PVE with respect to an unknown matrix mean. Critically, our interest lies in the PVE of the sample principal components (as opposed to unobserved population principal components); thus, the population PVE that we introduce is defined conditional on the sample singular vectors. We show that it is possible to conduct inference, in the sense of confidence intervals, p-values, and point estimates, on this population quantity. Furthermore, we can conduct valid inference on the PVE of a subset of the principal components, even when the subset is selected using a data-driven approach such as the elbow rule. We demonstrate the proposed approach in simulation and in an application to a gene expression dataset.
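
For orientation, the sample PVE that a scree plot displays can be computed directly from singular values. The sketch below is only that descriptive quantity, not the inferential procedure the abstract proposes; the function name `sample_pve` and the singular values are hypothetical, chosen for illustration.

```python
def sample_pve(singular_values):
    """Sample proportion of variance explained by each principal component.

    Assumes the singular values come from a centered data matrix, so the
    variance along component k is proportional to singular_values[k] ** 2.
    """
    squared = [d * d for d in singular_values]
    total = sum(squared)
    return [s / total for s in squared]


# Hypothetical singular values, largest first, for illustration only.
singular_values = [5.0, 3.0, 1.0]
pve = sample_pve(singular_values)

# Cumulative PVE is what an elbow-style rule typically inspects.
cumulative = []
running = 0.0
for share in pve:
    running += share
    cumulative.append(running)
```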

Submitted 26 March, 2025; v1 submitted 26 February, 2024; originally announced February 2024.

arXiv:2401.17647 [pdf, other]

doi 10.1080/26939169.2024.2432397

Generative AI for Data Science 101: Coding Without Learning To Code

Authors: Jacob Bien, Gourab Mukherjee

Abstract: Should one teach coding in a required introductory statistics and data science class for non-major students? Many professors advise against it, considering it a distraction from the important and challenging statistical topics that need to be covered. By contrast, other professors argue that the ability to interact flexibly with data will inspire students with a lasting love of the subject and a continued commitment to the material beyond the introductory course. With the release of large language models that write code, we saw an opportunity for a middle ground, which we tried in Fall 2023 in a required introductory data science course in our school's full-time MBA program. We taught students how to write English prompts to the artificial intelligence tool GitHub Copilot that could be turned into R code and executed. In this short article, we report on our experience using this new approach.

Submitted 21 September, 2024; v1 submitted 31 January, 2024; originally announced January 2024.

arXiv:2305.18700 [pdf, other]

Predicting Rare Events by Shrinking Towards Proportional Odds

Authors: Gregory Faletto, Jacob Bien

Abstract: Training classifiers is difficult with severe class imbalance, but many rare events are the culmination of a sequence with much more common intermediate outcomes. For example, in online marketing a user first sees an ad, then may click on it, and finally may make a purchase; estimating the probability of purchases is difficult because of their rarity. We show both theoretically and through data experiments that the more abundant data in earlier steps may be leveraged to improve estimation of probabilities of rare events. We present PRESTO, a relaxation of the proportional odds model for ordinal regression. Instead of estimating weights for one separating hyperplane that is shifted by separate intercepts for each of the estimated Bayes decision boundaries between adjacent pairs of categorical responses, we estimate separate weights for each of these transitions. We impose an L1 penalty on the differences between weights for the same feature in adjacent weight vectors in order to shrink towards the proportional odds model. We prove that PRESTO consistently estimates the decision boundary weights under a sparsity assumption. Synthetic and real data experiments show that our method can estimate rare probabilities in this setting better than both logistic regression on the rare category, which fails to borrow strength from more abundant categories, and the proportional odds model, which is too inflexible.

Submitted 29 May, 2023; originally announced May 2023.

Comments: 84 pages, 20 figures. Accepted at the Fortieth International Conference on Machine Learning (ICML 2023)

arXiv:2303.12931 [pdf, other]

Generalized Data Thinning Using Sufficient Statistics

Authors: Ameer Dharamshi, Anna Neufeld, Keshav Motwani, Lucy L. Gao, Daniela Witten, Jacob Bien

Abstract: Our goal is to develop a general strategy to decompose a random variable $X$ into multiple independent random variables, without sacrificing any information about unknown parameters. A recent paper showed that for some well-known natural exponential families, $X$ can be "thinned" into independent random variables $X^{(1)}, \ldots, X^{(K)}$, such that $X = \sum_{k=1}^K X^{(k)}$. These independent random variables can then be used for various model validation and inference tasks, including in contexts where traditional sample splitting fails. In this paper, we generalize their procedure by relaxing this summation requirement and simply asking that some known function of the independent random variables exactly reconstruct $X$. This generalization of the procedure serves two purposes. First, it greatly expands the families of distributions for which thinning can be performed. Second, it unifies sample splitting and data thinning, which on the surface seem to be very different, as applications of the same principle. This shared principle is sufficiency. We use this insight to perform generalized thinning operations for a diverse set of families.
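
The additive thinning described above, with $X = \sum_k X^{(k)}$, can be illustrated in the Poisson case with $K = 2$. The sketch relies on the standard fact that if $X \sim \text{Poisson}(λ)$ and $X^{(1)} \mid X \sim \text{Binomial}(X, ε)$, then $X^{(1)}$ and $X^{(2)} = X - X^{(1)}$ are independent Poisson variables with means $ελ$ and $(1-ε)λ$. The function name, the counts, and the choice $ε = 0.5$ are assumptions made for illustration; this is not the generalized procedure of the paper.

```python
import random


def thin_poisson_count(x, eps, rng):
    """Split a count x into (x1, x2) with x1 ~ Binomial(x, eps) and x2 = x - x1.

    If x is a draw from Poisson(lam), the two parts are independent
    Poisson(eps * lam) and Poisson((1 - eps) * lam) draws.
    """
    x1 = sum(1 for _ in range(x) if rng.random() < eps)
    return x1, x - x1


rng = random.Random(0)            # fixed seed so the split is reproducible
counts = [7, 0, 12, 3, 25]        # hypothetical Poisson-distributed observations
eps = 0.5                         # share of information sent to the first part

parts = [thin_poisson_count(x, eps, rng) for x in counts]
train = [a for a, _ in parts]     # e.g. used to fit or select a model
test = [b for _, b in parts]      # e.g. used to validate it
```

By construction the two parts sum back to the original counts, which is the summation requirement that the abstract goes on to relax.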

Submitted 11 June, 2023; v1 submitted 22 March, 2023; originally announced March 2023.

arXiv:2211.01521 [pdf, other]

Inferring independent sets of Gaussian variables after thresholding correlations

Authors: Arkajyoti Saha, Daniela Witten, Jacob Bien

Abstract: We consider testing whether a set of Gaussian variables, selected from the data, is independent of the remaining variables. We assume that this set is selected via a very simple approach that is commonly used across scientific disciplines: we select a set of variables for which the correlation with all variables outside the set falls below some threshold. Unlike other settings in selective inference, failure to account for the selection step leads, in this setting, to excessively conservative (as opposed to anti-conservative) results. Our proposed test properly accounts for the fact that the set of variables is selected from the data, and thus is not overly conservative. To develop our test, we condition on the event that the selection resulted in the set of variables in question. To achieve computational tractability, we develop a new characterization of the conditioning event in terms of the canonical correlation between the groups of random variables. In simulation studies and in the analysis of gene co-expression networks, we show that our approach has much higher power than a "naive" approach that ignores the effect of selection.

Submitted 2 November, 2022; originally announced November 2022.

Comments: 33 pages, 5 figures, 6 figure files (due to subfigures)

arXiv:2210.16710 [pdf, other]

Prediction Sets for High-Dimensional Mixture of Experts Models

Authors: Adel Javanmard, Simeng Shao, Jacob Bien

Abstract: Large datasets make it possible to build predictive models that can capture heterogeneous relationships between the response variable and features. The mixture of high-dimensional linear experts model posits that observations come from a mixture of high-dimensional linear regression models, where the mixture weights are themselves feature-dependent. In this paper, we show how to construct valid prediction sets for an $\ell_1$-penalized mixture of experts model in the high-dimensional setting. We make use of a debiasing procedure to account for the bias induced by the penalization and propose a novel strategy for combining intervals to form a prediction set with coverage guarantees in the mixture setting. Synthetic examples and an application to the prediction of critical temperatures of superconducting materials show our method to have reliable practical performance.

Submitted 29 October, 2022; originally announced October 2022.

Comments: 36 pages, 6 figures, 2 tables

arXiv:2206.01703 [pdf, other]

Interactive Exploration of Large Dendrograms with Prototypes

Authors: Andee Kaplan, Jacob Bien

Abstract: Hierarchical clustering is one of the standard methods taught for identifying and exploring the underlying structures that may be present within a data set. Students are shown examples in which the dendrogram, a visual representation of the hierarchical clustering, reveals a clear clustering structure. However, in practice, data analysts today frequently encounter data sets whose large scale undermines the usefulness of the dendrogram as a visualization tool. Densely packed branches obscure structure, and overlapping labels are impossible to read. In this paper we present a new workflow for performing hierarchical clustering via the R package called protoshiny that aims to restore hierarchical clustering to its former role of being an effective and versatile visualization tool. Our proposal leverages interactivity combined with the ability to label internal nodes in a dendrogram with a representative data point (called a prototype). After presenting the workflow, we provide three case studies to demonstrate its utility.

Submitted 3 June, 2022; originally announced June 2022.

Comments: 29 pages, 9 figures, accepted at The American Statistician

arXiv:2201.12387 [pdf, other]

Comparing trained and untrained probabilistic ensemble forecasts of COVID-19 cases and deaths in the United States

Authors: Evan L. Ray, Logan C. Brooks, Jacob Bien, Matthew Biggerstaff, Nikos I. Bosse, Johannes Bracher, Estee Y. Cramer, Sebastian Funk, Aaron Gerding, Michael A. Johansson, Aaron Rumack, Yijin Wang, Martha Zorn, Ryan J. Tibshirani, Nicholas G. Reich

Abstract: The U.S. COVID-19 Forecast Hub aggregates forecasts of the short-term burden of COVID-19 in the United States from many contributing teams. We study methods for building an ensemble that combines forecasts from these teams. These experiments have informed the ensemble methods used by the Hub. To be most useful to policy makers, ensemble forecasts must have stable performance in the presence of two key characteristics of the component forecasts: (1) occasional misalignment with the reported data, and (2) instability in the relative performance of component forecasters over time. Our results indicate that in the presence of these challenges, an untrained and robust approach to ensembling using an equally weighted median of all component forecasts is a good choice to support public health decision makers. In settings where some contributing forecasters have a stable record of good performance, trained ensembles that give those forecasters higher weight can also be helpful.
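
A minimal sketch of an untrained, equally weighted median ensemble follows, assuming each component forecast is a list of predictive quantiles at the same quantile levels. The function name and all numbers are hypothetical; the Hub's actual pipeline and data formats are not reproduced here.

```python
from statistics import median


def median_ensemble(component_forecasts):
    """Combine forecasts by taking the median across models at each quantile level.

    component_forecasts: one list per model, all of the same length and all
    reporting the same quantile levels in the same order.
    """
    return [median(values) for values in zip(*component_forecasts)]


# Hypothetical forecasts at three quantile levels from four models.
forecasts = [
    [100.0, 150.0, 210.0],
    [90.0, 140.0, 400.0],
    [120.0, 160.0, 220.0],
    [10.0, 900.0, 950.0],   # a model badly misaligned with the others
]

ensemble = median_ensemble(forecasts)
```

Because the median ignores how far an outlying value sits from the rest, the misaligned fourth model moves the ensemble much less than it would move an equally weighted mean.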

Submitted 7 June, 2022; v1 submitted 28 January, 2022; originally announced January 2022.

arXiv:2201.00494 [pdf, other]

Cluster Stability Selection

Authors: Gregory Faletto, Jacob Bien

Abstract: Stability selection (Meinshausen and Buhlmann, 2010) makes any feature selection method more stable by returning only those features that are consistently selected across many subsamples. We prove (in what is, to our knowledge, the first result of its kind) that for data containing highly correlated proxies for an important latent variable, the lasso typically selects one proxy, yet stability selection with the lasso can fail to select any proxy, leading to worse predictive performance than the lasso alone. We introduce cluster stability selection, which exploits the practitioner's knowledge that highly correlated clusters exist in the data, resulting in better feature rankings than stability selection in this setting. We consider several feature-combination approaches, including taking a weighted average of the features in each important cluster where weights are determined by the frequency with which cluster members are selected, which we show leads to better predictive models than previous proposals. We present generalizations of theoretical guarantees from Meinshausen and Buhlmann (2010) and Shah and Samworth (2012) to show that cluster stability selection retains the same guarantees. In summary, cluster stability selection enjoys the best of both worlds, yielding a sparse selected set that is both stable and has good predictive performance.

Submitted 3 January, 2022; originally announced January 2022.

Comments: 77 pages, 6 figures

arXiv:2111.08736 [pdf, other]

doi 10.1098/rspa.2021.0875

Ocean Mover's Distance: Using Optimal Transport for Analyzing Oceanographic Data

Authors: Sangwon Hyun, Aditya Mishra, Christopher L. Follett, Bror Jonsson, Gemma Kulk, Gael Forget, Marie-Fanny Racault, Thomas Jackson, Stephanie Dutkiewicz, Christian L. Müller, Jacob Bien

Abstract: Remote sensing observations from satellites and global biogeochemical models have combined to revolutionize the study of ocean biogeochemical cycling, but comparing the two data streams to each other and across time remains challenging due to the strong spatial-temporal structuring of the ocean. Here, we show that the Wasserstein distance provides a powerful metric for harnessing these structured datasets for better marine ecosystem and climate predictions. Wasserstein distance complements commonly used point-wise difference methods such as the root mean squared error, by quantifying differences in terms of spatial displacement in addition to magnitude. As a test case we consider Chlorophyll (a key indicator of phytoplankton biomass) in the North-East Pacific Ocean, obtained from model simulations, in situ measurements, and satellite observations. We focus on two main applications: 1) Comparing model predictions with satellite observations, and 2) temporal evolution of Chlorophyll both seasonally and over longer time frames. Wasserstein distance successfully isolates temporal and depth variability and quantifies shifts in biogeochemical province boundaries. It also exposes relevant temporal trends in satellite Chlorophyll consistent with climate change predictions. Our study shows that optimal transport vectors underlying Wasserstein distance provide a novel visualization tool for testing models and better understanding temporal dynamics in the ocean.
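
The contrast between point-wise error and displacement-aware distance can be seen in a one-dimensional toy example. The sketch assumes two histograms of equal total mass on the same evenly spaced grid, for which the 1-Wasserstein distance equals the grid spacing times the summed absolute difference of the cumulative distributions. The function names and the toy fields are hypothetical and do not reflect the paper's oceanographic setup.

```python
import math


def wasserstein_1d_grid(p, q, spacing=1.0):
    """1-Wasserstein distance between two histograms on a shared, evenly spaced grid.

    Both histograms are normalized to unit mass, then the absolute gap between
    their cumulative sums is accumulated across grid cells.
    """
    total_p, total_q = sum(p), sum(q)
    cum_p = cum_q = gap = 0.0
    for p_i, q_i in zip(p, q):
        cum_p += p_i / total_p
        cum_q += q_i / total_q
        gap += abs(cum_p - cum_q)
    return spacing * gap


def rmse(p, q):
    """Point-wise root mean squared error between two fields on the same grid."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)) / len(p))


# A unit of mass at cell 1, then the same mass displaced by 2 and by 5 cells.
base = [0, 1, 0, 0, 0, 0, 0, 0]
near = [0, 0, 0, 1, 0, 0, 0, 0]
far = [0, 0, 0, 0, 0, 0, 1, 0]

rmse_near, rmse_far = rmse(base, near), rmse(base, far)
w_near, w_far = wasserstein_1d_grid(base, near), wasserstein_1d_grid(base, far)
```

Here the RMSE is the same for both shifted fields, while the Wasserstein distance grows with the size of the displacement.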

Submitted 4 November, 2022; v1 submitted 16 November, 2021; originally announced November 2021.

Comments: 6 figures

MSC Class: 62P12; 86-10

Journal ref: Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, year 2022, volume 478, number 2262, pages 20210875
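
The abstract above contrasts Wasserstein distance with point-wise measures such as the root mean squared error. The following is a minimal sketch of that contrast, not the authors' method: it assumes a one-dimensional, evenly spaced grid and made-up "bump" fields, and the helper names (`rmse`, `wasserstein_1d`, `bump`) are hypothetical. On such a grid the 1-Wasserstein distance equals the summed absolute difference of the two cumulative distributions times the grid spacing.

```python
import math


def normalize(weights):
    """Scale non-negative weights so they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def rmse(a, b):
    """Point-wise root mean squared error between two fields."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def wasserstein_1d(a, b, spacing=1.0):
    """1-Wasserstein distance between two histograms on the same
    evenly spaced 1-D grid (illustrative simplification)."""
    p, q = normalize(a), normalize(b)
    cdf_gap = 0.0  # running difference between the two CDFs
    total = 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap) * spacing
    return total


def bump(center, n=50, width=2):
    """Toy field: a flat patch of 2 * width + 1 cells around center."""
    return [1.0 if abs(i - center) <= width else 0.0 for i in range(n)]


reference = bump(10)
near = bump(20)  # same patch displaced by 10 cells
far = bump(40)   # same patch displaced by 30 cells

print("RMSE near/far:", rmse(reference, near), rmse(reference, far))
print("W1   near/far:", wasserstein_1d(reference, near),
      wasserstein_1d(reference, far))
```

In this toy setting the two displaced patches do not overlap the reference, so the RMSE is the same for both, while the Wasserstein distance grows with the displacement (roughly 10 and 30 grid cells).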

arXiv:2108.05350 [pdf, other]

Controlling the False Split Rate in Tree-Based Aggregation

Authors: Simeng Shao, Jacob Bien, Adel Javanmard

Abstract: In many domains, data measurements can naturally be associated with the leaves of a tree, expressing the relationships among these measurements. For example, companies belong to industries, which in turn belong to ever coarser divisions such as sectors; microbes are commonly arranged in a taxonomic hierarchy from species to kingdoms; street blocks belong to neighborhoods, which in turn belong to larger-scale regions. The problem of tree-based aggregation that we consider in this paper asks which of these tree-defined subgroups of leaves should really be treated as a single entity and which of these entities should be distinguished from each other. We introduce the "false split rate", an error measure that describes the degree to which subgroups have been split when they should not have been. We then propose a multiple hypothesis testing algorithm for tree-based aggregation, which we prove controls this error measure. We focus on two main examples of tree-based aggregation, one which involves aggregating means and the other which involves aggregating regression coefficients. We apply this methodology to aggregate stocks based on their volatility and to aggregate neighborhoods of New York City based on taxi fares.

Submitted 11 August, 2021; originally announced August 2021.

Comments: 47 pages

arXiv:2101.12503 [pdf, other]

Tree-based Node Aggregation in Sparse Graphical Models

Authors: Ines Wilms, Jacob Bien

Abstract: High-dimensional graphical models are often estimated using regularization that is aimed at reducing the number of edges in a network. In this work, we show how even simpler networks can be produced by aggregating the nodes of the graphical model. We develop a new convex regularized method, called the tree-aggregated graphical lasso or tag-lasso, that estimates graphical models that are both edge-sparse and node-aggregated. The aggregation is performed in a data-driven fashion by leveraging side information in the form of a tree that encodes node similarity and facilitates the interpretation of the resulting aggregated nodes. We provide an efficient implementation of the tag-lasso by using the locally adaptive alternating direction method of multipliers and illustrate our proposal's practical advantages in simulation and in applications in finance and biology.

Submitted 29 January, 2021; originally announced January 2021.

arXiv:2012.02936 [pdf, other]

Selective Inference for Hierarchical Clustering

Authors: Lucy L. Gao, Jacob Bien, Daniela Witten

Abstract: Classical tests for a difference in means control the type I error rate when the groups are defined a priori. However, when the groups are instead defined via clustering, then applying a classical test yields an extremely inflated type I error rate. Notably, this problem persists even if two separate and independent data sets are used to define the groups and to test for a difference in their means. To address this problem, in this paper, we propose a selective inference approach to test for a difference in means between two clusters. Our procedure controls the selective type I error rate by accounting for the fact that the choice of null hypothesis was made based on the data. We describe how to efficiently compute exact p-values for clusters obtained using agglomerative hierarchical clustering with many commonly-used linkages. We apply our method to simulated data and to single-cell RNA-sequencing data.

Submitted 31 October, 2022; v1 submitted 4 December, 2020; originally announced December 2020.

Comments: Final accepted version

arXiv:2008.11251 [pdf, other]

Modeling Cell Populations Measured By Flow Cytometry With Covariates Using Sparse Mixture of Regressions

Authors: Sangwon Hyun, Mattias Rolf Cape, Francois Ribalet, Jacob Bien

Abstract: The ocean is filled with microscopic microalgae called phytoplankton, which together are responsible for as much photosynthesis as all plants on land combined. Our ability to predict their response to the warming ocean relies on understanding how the dynamics of phytoplankton populations is influenced by changes in environmental conditions. One powerful technique to study the dynamics of phytoplankton is flow cytometry, which measures the optical properties of thousands of individual cells per second. Today, oceanographers are able to collect flow cytometry data in real-time onboard a moving ship, providing them with fine-scale resolution of the distribution of phytoplankton across thousands of kilometers. One of the current challenges is to understand how these small and large scale variations relate to environmental conditions, such as nutrient availability, temperature, light and ocean currents. In this paper, we propose a novel sparse mixture of multivariate regressions model to estimate the time-varying phytoplankton subpopulations while simultaneously identifying the specific environmental covariates that are predictive of the observed changes to these subpopulations. We demonstrate the usefulness and interpretability of the approach using both synthetic data and real observations collected on an oceanographic cruise conducted in the north-east Pacific in the spring of 2017.

Submitted 3 August, 2022; v1 submitted 25 August, 2020; originally announced August 2020.

Comments: To appear, Annals of Applied Statistics

MSC Class: 62P12; 62P10; 92-10

arXiv:1910.03627 [pdf, other]

doi 10.1002/sta4.427

Controlling Costs: Feature Selection on a Budget

Authors: Guo Yu, Daniela Witten, Jacob Bien

Abstract: The traditional framework for feature selection treats all features as costing the same amount. However, in reality, a scientist often has considerable discretion regarding which variables to measure, and the decision involves a tradeoff between model accuracy and cost (where cost can refer to money, time, difficulty, or intrusiveness). In particular, unnecessarily including an expensive feature in a model is worse than unnecessarily including a cheap feature. We propose a procedure, which we call cheap knockoffs, for performing feature selection in a cost-conscious manner. The key idea behind our method is to force higher cost features to compete with more knockoffs than cheaper features. We derive an upper bound on the weighted false discovery proportion associated with this procedure, which corresponds to the fraction of the feature cost that is wasted on unimportant features. We prove that this bound holds simultaneously with high probability over a path of selected variable sets of increasing size. A user may thus select a set of features based, for example, on the overall budget, while knowing that no more than a particular fraction of feature cost is wasted. We investigate, through simulation and a biomedical application, the practical importance of incorporating cost considerations into the feature selection process.

Submitted 11 February, 2023; v1 submitted 8 October, 2019; originally announced October 2019.

Journal ref: Stat 11.1 (2022): e427

arXiv:1909.11640 [pdf, other]

Testing for Association in Multi-View Network Data

Authors: Lucy L. Gao, Daniela Witten, Jacob Bien

Abstract: In this paper, we consider data consisting of multiple networks, each comprised of a different edge set on a common set of nodes. Many models have been proposed for the analysis of such multi-view network data under the assumption that the data views are closely related. In this paper, we provide tools for evaluating this assumption. In particular, we ask: given two networks that each follow a stochastic block model, is there an association between the latent community memberships of the nodes in the two networks? To answer this question, we extend the stochastic block model for a single network view to the two-view setting, and develop a new hypothesis test for the null hypothesis that the latent community memberships in the two data views are independent. We apply our test to protein-protein interaction data from the HINT database (Das and Yu, 2012). We find evidence of a weak association between the latent community memberships of proteins defined with respect to binary interaction data and the latent community memberships of proteins defined with respect to co-complex association data. We also extend this proposal to the setting of a network with node covariates.

Submitted 22 March, 2021; v1 submitted 25 September, 2019; originally announced September 2019.

arXiv:1907.08414 [pdf, other]

Reluctant Interaction Modeling

Authors: Guo Yu, Jacob Bien, Ryan Tibshirani

Abstract: Including pairwise interactions between the predictors of a regression model can produce better predicting models. However, fitting such interaction models on typical data sets in biology and other fields can often require solving enormous variable selection problems with billions of interactions. The scale of such problems demands methods that are computationally cheap (both in time and memory) yet still have sound statistical properties. Motivated by these large-scale problem sizes, we adopt a very simple guiding principle: One should prefer main effects over interactions if all else is equal. This "reluctance" to interactions, while reminiscent of the hierarchy principle for interactions, is much less restrictive. We design a computationally efficient method built upon this principle and provide theoretical results indicating favorable statistical properties. Empirical results show dramatic computational improvement without sacrificing statistical properties. For example, the proposed method can solve a problem with 10 billion interactions with 5-fold cross-validation in under 7 hours on a single CPU.

Submitted 11 February, 2023; v1 submitted 19 July, 2019; originally announced July 2019.

arXiv:1901.03905 [pdf, other]

Are Clusterings of Multiple Data Views Independent?

Authors: Lucy L. Gao, Jacob Bien, Daniela Witten

Abstract: In the Pioneer 100 (P100) Wellness Project (Price and others, 2017), multiple types of data are collected on a single set of healthy participants at multiple timepoints in order to characterize and optimize wellness. One way to do this is to identify clusters, or subgroups, among the participants, and then to tailor personalized health recommendations to each subgroup. It is tempting to cluster the participants using all of the data types and timepoints, in order to fully exploit the available information. However, clustering the participants based on multiple data views implicitly assumes that a single underlying clustering of the participants is shared across all data views. If this assumption does not hold, then clustering the participants using multiple data views may lead to spurious results. In this paper, we seek to evaluate the assumption that there is some underlying relationship among the clusterings from the different data views, by asking the question: are the clusters within each data view dependent or independent? We develop a new test for answering this question, which we then apply to clinical, proteomic, and metabolomic data, across two distinct timepoints, from the P100 study. We find that while the subgroups of the participants defined with respect to any single data type seem to be dependent across time, the clustering among the participants based on one data type (e.g. proteomic data) appears not to be associated with the clustering based on another data type (e.g. clinical data).

Submitted 12 January, 2019; originally announced January 2019.

Comments: 20 pages, 4 figures, 1 table (main text); 15 pages, 9 figures (supplement)

arXiv:1803.06675 [pdf, other]

doi 10.1080/01621459.2020.1796677

Rare Feature Selection in High Dimensions

Authors: Xiaohan Yan, Jacob Bien

Abstract: It is common in modern prediction problems for many predictor variables to be counts of rarely occurring events. This leads to design matrices in which many columns are highly sparse. The challenge posed by such "rare features" has received little attention despite its prevalence in diverse areas, ranging from natural language processing (e.g., rare words) to biology (e.g., rare species). We show, both theoretically and empirically, that not explicitly accounting for the rareness of features can greatly reduce the effectiveness of an analysis. We next propose a framework for aggregating rare features into denser features in a flexible manner that creates better predictors of the response. Our strategy leverages side information in the form of a tree that encodes feature similarity. We apply our method to data from TripAdvisor, in which we predict the numerical rating of a hotel based on the text of the associated review. Our method achieves high accuracy by making effective use of rare words; by contrast, the lasso is unable to identify highly predictive words if they are too rare. A companion R package, called rare, implements our new estimator, using the alternating direction method of multipliers.

Submitted 2 February, 2023; v1 submitted 18 March, 2018; originally announced March 2018.

Comments: 42 pages, 10 figures

Journal ref: Journal of the American Statistical Association, 116:534, 887-900

arXiv:1712.02412 [pdf, other]

doi 10.1093/biomet/asz017

Estimating the error variance in a high-dimensional linear model

Authors: Guo Yu, Jacob Bien

Abstract: The lasso has been studied extensively as a tool for estimating the coefficient vector in the high-dimensional linear model; however, considerably less is known about estimating the error variance in this context. In this paper, we propose the natural lasso estimator for the error variance, which maximizes a penalized likelihood objective. A key aspect of the natural lasso is that the likelihood is expressed in terms of the natural parameterization of the multiparameter exponential family of a Gaussian with unknown mean and variance. The result is a remarkably simple estimator of the error variance with provably good performance in terms of mean squared error. These theoretical results do not require placing any assumptions on the design matrix or the true regression coefficients. We also propose a companion estimator, called the organic lasso, which theoretically does not require tuning of the regularization parameter. Both estimators do well empirically compared to preexisting methods, especially in settings where successful recovery of the true support of the coefficient vector is hard. Finally, we show that existing methods can do well under fewer assumptions than previously known, thus providing a fuller story about the problem of estimating the error variance in high-dimensional linear models.

Submitted 19 July, 2019; v1 submitted 6 December, 2017; originally announced December 2017.

Comments: Biometrika (2019)

arXiv:1711.10635 [pdf, other]

Valid Inference Corrected for Outlier Removal

Authors: Shuxiao Chen, Jacob Bien

Abstract: Ordinary least squares (OLS) estimation of a linear regression model is well-known to be highly sensitive to outliers. It is common practice to (1) identify and remove outliers by looking at the data and (2) fit OLS and form confidence intervals and p-values on the remaining data as if this were the original data collected. This standard "detect-and-forget" approach has been shown to be problematic, and in this paper we highlight the fact that it can lead to invalid inference and show how recently developed tools in selective inference can be used to properly account for outlier detection and removal. Our inferential procedures apply to a general class of outlier removal procedures that includes several of the most commonly used approaches. We conduct simulations to corroborate the theoretical results, and we apply our method to three real data sets to illustrate how our inferential results can differ from the traditional detect-and-forget strategy. A companion R package, outference, implements these new procedures with an interface that matches the functions commonly used for inference with lm in R.

Submitted 10 August, 2019; v1 submitted 28 November, 2017; originally announced November 2017.

Comments: 21 pages, 6 figures, 2 tables

arXiv:1711.03623 [pdf, other]

Interpretable Vector AutoRegressions with Exogenous Time Series

Authors: Ines Wilms, Sumanta Basu, Jacob Bien, David S. Matteson

Abstract: The Vector AutoRegressive (VAR) model is fundamental to the study of multivariate time series. Although VAR models are intensively investigated by many researchers, practitioners often show more interest in analyzing VARX models that incorporate the impact of unmodeled exogenous variables (X) into the VAR. However, since the parameter space grows quadratically with the number of time series, estimation quickly becomes challenging. While several proposals have been made to sparsely estimate large VAR models, the estimation of large VARX models is under-explored. Moreover, typically these sparse proposals involve a lasso-type penalty and do not incorporate lag selection into the estimation procedure. As a consequence, the resulting models may be difficult to interpret. In this paper, we propose a lag-based hierarchically sparse estimator, called "HVARX", for large VARX models. We illustrate the usefulness of HVARX on a cross-category management marketing application. Our results show how it provides a highly interpretable model, and improves out-of-sample forecast accuracy compared to a lasso-type approach.

Submitted 9 November, 2017; originally announced November 2017.

Comments: Presented at NIPS 2017 Symposium on Interpretable Machine Learning

arXiv:1707.09208 [pdf, other]

Sparse Identification and Estimation of Large-Scale Vector AutoRegressive Moving Averages

Authors: Ines Wilms, Sumanta Basu, Jacob Bien, David S. Matteson

Abstract: The Vector AutoRegressive Moving Average (VARMA) model is fundamental to the theory of multivariate time series; however, identifiability issues have led practitioners to abandon it in favor of the simpler but more restrictive Vector AutoRegressive (VAR) model. We narrow this gap with a new optimization-based approach to VARMA identification built upon the principle of parsimony. Among all equivalent data-generating models, we use convex optimization to seek the parameterization that is "simplest" in a certain sense. A user-specified strongly convex penalty is used to measure model simplicity, and that same penalty is then used to define an estimator that can be efficiently computed. We establish consistency of our estimators in a double-asymptotic regime. Our non-asymptotic error bound analysis accommodates both model specification and parameter estimation steps, a feature that is crucial for studying large-scale VARMA algorithms. Our analysis also provides new results on penalized estimation of infinite-order VAR, and elastic net regression under a singular covariance structure of regressors, which may be of independent interest. We illustrate the advantage of our method over VAR alternatives on three real data examples.

Submitted 8 June, 2021; v1 submitted 28 July, 2017; originally announced July 2017.

arXiv:1702.07094 [pdf, other]

BigVAR: Tools for Modeling Sparse High-Dimensional Multivariate Time Series

Authors: William Nicholson, David Matteson, Jacob Bien

Abstract: The R package BigVAR allows for the simultaneous estimation of high-dimensional time series by applying structured penalties to the conventional vector autoregression (VAR) and vector autoregression with exogenous variables (VARX) frameworks. Our methods can be utilized in many forecasting applications that make use of time-dependent data such as macroeconomics, finance, and internet traffic. Our package extends solution algorithms from the machine learning and signal processing literatures to a time-dependent setting, selects the regularization parameter by sequential cross-validation, and provides substantial improvements in forecasting performance over conventional methods. We offer a user-friendly interface that utilizes R's S4 object class structure, which makes our methodology easily accessible to practitioners. In this paper, we present an overview of our notation, the models that comprise BigVAR, and the functionality of our package with a detailed example using publicly available macroeconomic data. In addition, we present a simulation study comparing the performance of several procedures that refit the support selected by a BigVAR procedure according to several variants of least squares and conclude that refitting generally degrades forecast performance.

Submitted 22 February, 2017; originally announced February 2017.

arXiv:1607.00021 [pdf, other]

The Simulator: An Engine to Streamline Simulations

Authors: Jacob Bien

Abstract: The simulator is an R package that streamlines the process of performing simulations by creating a common infrastructure that can be easily used and reused across projects. Methodological statisticians routinely write simulations to compare their methods to preexisting ones. While developing ideas, there is a temptation to write "quick and dirty" simulations to try out ideas. This approach of rapid prototyping is useful but can sometimes backfire if bugs are introduced. Using the simulator allows one to remove the "dirty" without sacrificing the "quick." Coding is quick because the statistician focuses exclusively on those aspects of the simulation that are specific to the particular paper being written. Code written with the simulator is succinct, highly readable, and easily shared with others. The modular nature of simulations written with the simulator promotes code reusability, which saves time and facilitates reproducibility. The syntax of the simulator leads to simulation code that is easily human-readable. Other benefits of using the simulator include the ability to "step in" to a simulation and change one aspect without having to rerun the entire simulation from scratch, the straightforward integration of parallel computing into simulations, and the ability to rapidly generate plots, tables, and reports with minimal effort.

Submitted 30 June, 2016; originally announced July 2016.

arXiv:1606.00451 [pdf, other]

Graph-Guided Banding of the Covariance Matrix

Authors: Jacob Bien

Abstract: Regularization has become a primary tool for developing reliable estimators of the covariance matrix in high-dimensional settings. To curb the curse of dimensionality, numerous methods assume that the population covariance (or inverse covariance) matrix is sparse, while making no particular structural assumptions on the desired pattern of sparsity. A highly-related, yet complementary, literature studies the specific setting in which the measured variables have a known ordering, in which case a banded population matrix is often assumed. While the banded approach is conceptually and computationally easier than asking for "patternless sparsity," it is only applicable in very specific situations (such as when data are measured over time or one-dimensional space). This work proposes a generalization of the notion of bandedness that greatly expands the range of problems in which banded estimators apply. We develop convex regularizers occupying the broad middle ground between the former approach of "patternless sparsity" and the latter reliance on having a known ordering. Our framework defines bandedness with respect to a known graph on the measured variables. Such a graph is available in diverse situations, and we provide a theoretical, computational, and applied treatment of two new estimators. An R package, called ggb, implements these new methods.

Submitted 15 February, 2018; v1 submitted 1 June, 2016; originally announced June 2016.

arXiv:1604.07451 [pdf, other]

Learning Local Dependence In Ordered Data

Authors: Guo Yu, Jacob Bien

Abstract: In many applications, data come with a natural ordering. This ordering can often induce local dependence among nearby variables. However, in complex data, the width of this dependence may vary, making simple assumptions such as a constant neighborhood size unrealistic. We propose a framework for learning this local dependence based on estimating the inverse of the Cholesky factor of the covariance matrix. Penalized maximum likelihood estimation of this matrix yields a simple regression interpretation for local dependence in which variables are predicted by their neighbors. Our proposed method involves solving a convex, penalized Gaussian likelihood problem with a hierarchical group lasso penalty. The problem decomposes into independent subproblems which can be solved efficiently in parallel using first-order methods. Our method yields a sparse, symmetric, positive definite estimator of the precision matrix, encoding a Gaussian graphical model. We derive theoretical results not found in existing methods attaining this structure. In particular, our conditions for signed support recovery and estimation consistency rates in multiple norms are as mild as those in a regression problem. Empirical results show our method performing favorably compared to existing methods. We apply our method to genomic data to flexibly model linkage disequilibrium. Our method is also applied to improve the performance of discriminant analysis in sound recording classification.

Submitted 7 May, 2017; v1 submitted 25 April, 2016; originally announced April 2016.

Journal ref: Journal of Machine Learning Research (2017) 18(42) 1-60

arXiv:1604.06815 [pdf, other]

doi 10.1080/10618600.2017.1341414

Non-convex Global Minimization and False Discovery Rate Control for the TREX

Authors: Jacob Bien, Irina Gaynanova, Johannes Lederer, Christian Müller

Abstract: The TREX is a recently introduced method for performing sparse high-dimensional regression. Despite its statistical promise as an alternative to the lasso, square-root lasso, and scaled lasso, the TREX is computationally challenging in that it requires solving a non-convex optimization problem. This paper shows a remarkable result: despite the non-convexity of the TREX problem, there exists a polynomial-time algorithm that is guaranteed to find the global minimum. This result adds the TREX to a very short list of non-convex optimization problems that can be globally optimized (principal components analysis being a famous example). After deriving and developing this new approach, we demonstrate that (i) the ability of the preexisting TREX heuristic to reach the global minimum is strongly dependent on the difficulty of the underlying statistical problem, (ii) the new polynomial-time algorithm for TREX permits a novel variable ranking and selection scheme, (iii) this scheme can be incorporated into a rule that controls the false discovery rate (FDR) of included features in the model. To achieve this last aim, we provide an extension of the results of Barber & Candes (2015) to establish that the knockoff filter framework can be applied to the TREX. This investigation thus provides both a rare case study of a heuristic for non-convex optimization and a novel way of exploiting non-convexity for statistical inference.

Submitted 20 September, 2016; v1 submitted 22 April, 2016; originally announced April 2016.

Journal ref: Journal of Computational and Graphical Statistics 2017, Vol. 27, No. 1, 23-33

arXiv:1512.01631 [pdf]

doi 10.1214/17-STS622

Hierarchical Sparse Modeling: A Choice of Two Group Lasso Formulations

Authors: Xiaohan Yan, Jacob Bien

Abstract: Demanding sparsity in estimated models has become a routine practice in statistics. In many situations, we wish to require that the sparsity patterns attained honor certain problem-specific constraints. Hierarchical sparse modeling (HSM) refers to situations in which these constraints specify that one set of parameters be set to zero whenever another is set to zero. In recent years, numerous papers have developed convex regularizers for this form of sparsity structure, which arises in many areas of statistics including interaction modeling, time series analysis, and covariance estimation. In this paper, we observe that these methods fall into two frameworks, the group lasso (GL) and latent overlapping group lasso (LOG), which have not been systematically compared in the context of HSM. The purpose of this paper is to provide a side-by-side comparison of these two frameworks for HSM in terms of their statistical properties and computational efficiency. We call special attention to GL's more aggressive shrinkage of parameters deep in the hierarchy, a property not shared by LOG. In terms of computation, we introduce a finite-step algorithm that exactly solves the proximal operator of LOG for a certain simple HSM structure; we later exploit this to develop a novel path-based block coordinate descent scheme for general HSM structures. Both algorithms greatly improve the computational performance of LOG. Finally, we compare the two methods in the context of covariance estimation, where we introduce a new sparsely-banded estimator using LOG, which we show achieves the statistical advantages of an existing GL-based method but is simpler to express and more efficient to compute.

Submitted 29 November, 2017; v1 submitted 5 December, 2015; originally announced December 2015.

Comments: 30 pages, 13 figures

Journal ref: Statist. Sci. 32 (2017), no. 4, 531--560

arXiv:1508.07497 [pdf, other]

VARX-L: Structured Regularization for Large Vector Autoregressions with Exogenous Variables

Authors: William Nicholson, David Matteson, Jacob Bien

Abstract: The vector autoregression (VAR) has long proven to be an effective method for modeling the joint dynamics of macroeconomic time series as well as forecasting. A major shortcoming of the VAR that has hindered its applicability is its heavy parameterization: the parameter space grows quadratically with the number of series included, quickly exhausting the available degrees of freedom. Consequently, forecasting using VARs is intractable for low-frequency, high-dimensional macroeconomic data. However, empirical evidence suggests that VARs that incorporate more component series tend to result in more accurate forecasts. Conventional methods that allow for the estimation of large VARs either tend to require ad hoc subjective specifications or are computationally infeasible. Moreover, as global economies become more intricately intertwined, there has been substantial interest in incorporating the impact of stochastic, unmodeled exogenous variables. Vector autoregression with exogenous variables (VARX) extends the VAR to allow for the inclusion of unmodeled variables, but it similarly faces dimensionality challenges. We introduce the VARX-L framework, a structured family of VARX models, and provide methodology that allows for both efficient estimation and accurate forecasting in high-dimensional analysis. VARX-L adapts several prominent scalar regression regularization techniques to a vector time series context in order to greatly reduce the parameter space of VAR and VARX models. We also highlight a compelling extension that allows for shrinking toward reference models, such as a vector random walk. We demonstrate the efficacy of VARX-L in both low- and high-dimensional macroeconomic forecasting applications and simulated data examples. Our methodology is easily reproducible in a publicly available R package.

Submitted 27 February, 2017; v1 submitted 29 August, 2015; originally announced August 2015.

arXiv:1412.5250 [pdf, other]

High Dimensional Forecasting via Interpretable Vector Autoregression

Authors: William B. Nicholson, Ines Wilms, Jacob Bien, David S. Matteson

Abstract: Vector autoregression (VAR) is a fundamental tool for modeling multivariate time series. However, as the number of component series is increased, the VAR model becomes overparameterized. Several authors have addressed this issue by incorporating regularized approaches, such as the lasso in VAR estimation. Traditional approaches address overparameterization by selecting a low lag order, based on the assumption of short range dependence, assuming that a universal lag order applies to all components. Such an approach constrains the relationship between the components and impedes forecast performance. The lasso-based approaches work much better in high-dimensional situations but do not incorporate the notion of lag order selection. We propose a new class of hierarchical lag structures (HLag) that embed the notion of lag selection into a convex regularizer. The key modeling tool is a group lasso with nested groups which guarantees that the sparsity pattern of lag coefficients honors the VAR's ordered structure. The HLag framework offers three structures, which allow for varying levels of flexibility. A simulation study demonstrates improved performance in forecasting and lag order selection over previous approaches, and a macroeconomic application further highlights forecasting improvements as well as HLag's convenient, interpretable output.

Submitted 7 September, 2020; v1 submitted 16 December, 2014; originally announced December 2014.

arXiv:1407.4729 [pdf, other]

Sparse Partially Linear Additive Models

Authors: Yin Lou, Jacob Bien, Rich Caruana, Johannes Gehrke

Abstract: The generalized partially linear additive model (GPLAM) is a flexible and interpretable approach to building predictive models. It combines features in an additive manner, allowing each to have either a linear or nonlinear effect on the response. However, the choice of which features to treat as linear or nonlinear is typically assumed known. Thus, to make a GPLAM a viable approach in situations in which little is known a priori about the features, one must overcome two primary model selection challenges: deciding which features to include in the model and determining which of these features to treat nonlinearly. We introduce the sparse partially linear additive model (SPLAM), which combines model fitting and both of these model selection challenges into a single convex optimization problem. SPLAM provides a bridge between the lasso and sparse additive models. Through a statistical oracle inequality and thorough simulation, we demonstrate that SPLAM can outperform other methods across a broad spectrum of statistical regimes, including the high-dimensional ($p\gg N$) setting. We develop efficient algorithms that are applied to real data sets with half a million samples and over 45,000 features with excellent predictive performance.

Submitted 27 March, 2018; v1 submitted 17 July, 2014; originally announced July 2014.

Comments: Corrected typos

arXiv:1405.6210 [pdf, other]

Convex Banding of the Covariance Matrix

Authors: Jacob Bien, Florentina Bunea, Luo Xiao

Abstract: We introduce a new sparse estimator of the covariance matrix for high-dimensional models in which the variables have a known ordering. Our estimator, which is the solution to a convex optimization problem, is equivalently expressed as an estimator which tapers the sample covariance matrix by a Toeplitz, sparsely-banded, data-adaptive matrix. As a result of this adaptivity, the convex banding estimator enjoys theoretical optimality properties not attained by previous banding or tapered estimators. In particular, our convex banding estimator is minimax rate adaptive in Frobenius and operator norms, up to log factors, over commonly-studied classes of covariance matrices, and over more general classes. Furthermore, it correctly recovers the bandwidth when the true covariance is exactly banded. Our convex formulation admits a simple and efficient algorithm. Empirical studies demonstrate its practical effectiveness and illustrate that our exactly-banded estimator works well even when the true covariance matrix is only close to a banded matrix, confirming our theoretical results. Our method compares favorably with all existing methods, in terms of accuracy and speed. We illustrate the practical merits of the convex banding estimator by showing that it can be used to improve the performance of discriminant analysis for classifying sound recordings.

Submitted 23 May, 2014; originally announced May 2014.

arXiv:1211.1344 [pdf, ps, other]

doi 10.1214/14-AOAS758

Convex hierarchical testing of interactions

Authors: Jacob Bien, Noah Simon, Robert Tibshirani

Abstract: We consider the testing of all pairwise interactions in a two-class problem with many features. We devise a hierarchical testing framework that considers an interaction only when one or more of its constituent features has a nonzero main effect. The test is based on a convex optimization framework that seamlessly considers main effects and interactions together. We show - both in simulation and on a genomic data set from the SAPPHIRe study - a potential gain in power and interpretability over a standard (nonhierarchical) interaction test.

Submitted 2 June, 2015; v1 submitted 6 November, 2012; originally announced November 2012.

Comments: Published at http://dx.doi.org/10.1214/14-AOAS758 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS758

Journal ref: Annals of Applied Statistics 2015, Vol. 9, No. 1, 27-42

arXiv:1205.5050 [pdf, ps, other]

doi 10.1214/13-AOS1096

A lasso for hierarchical interactions

Authors: Jacob Bien, Jonathan Taylor, Robert Tibshirani

Abstract: We add a set of convex constraints to the lasso to produce sparse interaction models that honor the hierarchy restriction that an interaction only be included in a model if one or both variables are marginally important. We give a precise characterization of the effect of this hierarchy constraint, prove that hierarchy holds with probability one and derive an unbiased estimate for the degrees of freedom of our estimator. A bound on this estimate reveals the amount of fitting "saved" by the hierarchy constraint. We distinguish between parameter sparsity - the number of nonzero coefficients - and practical sparsity - the number of raw variables one must measure to make a new prediction. Hierarchy focuses on the latter, which is more closely tied to important data collection concerns such as cost, time and effort. We develop an algorithm, available in the R package hierNet, and perform an empirical study of our method.

Submitted 19 June, 2013; v1 submitted 22 May, 2012; originally announced May 2012.

Comments: Published at http://dx.doi.org/10.1214/13-AOS1096 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS1096

Journal ref: Annals of Statistics 2013, Vol. 41, No. 3, 1111-1141

arXiv:1202.5933 [pdf, ps, other]

doi 10.1214/11-AOAS495

Prototype selection for interpretable classification

Authors: Jacob Bien, Robert Tibshirani

Abstract: Prototype methods seek a minimal subset of samples that can serve as a distillation or condensed view of a data set. As the size of modern data sets grows, being able to present a domain specialist with a short list of "representative" samples chosen from the data set is of increasing interpretative value. While much recent statistical research has been focused on producing sparse-in-the-variables methods, this paper aims at achieving sparsity in the samples. We discuss a method for selecting prototypes in the classification setting (in which the samples fall into known discrete categories). Our method of focus is derived from three basic properties that we believe a good prototype set should satisfy. This intuition is translated into a set cover optimization problem, which we solve approximately using standard approaches. While prototype selection is usually viewed as purely a means toward building an efficient classifier, in this paper we emphasize the inherent value of having a set of prototypical elements. That said, by using the nearest-neighbor rule on the set of prototypes, we can of course discuss our method as a classifier as well.

Submitted 27 February, 2012; originally announced February 2012.

Comments: Published at http://dx.doi.org/10.1214/11-AOAS495 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org). arXiv admin note: text overlap with arXiv:0908.2284

Report number: IMS-AOAS-AOAS495

Journal ref: Annals of Applied Statistics 2011, Vol. 5, No. 4, 2403-2424

arXiv:1011.2234 [pdf, ps, other]

Strong rules for discarding predictors in lasso-type problems

Authors: Robert Tibshirani, Jacob Bien, Jerome Friedman, Trevor Hastie, Noah Simon, Jonathan Taylor, Ryan J. Tibshirani

Abstract: We consider rules for discarding predictors in lasso regression and related problems, for computational efficiency. El Ghaoui et al. (2010) propose "SAFE" rules that guarantee that a coefficient will be zero in the solution, based on the inner products of each predictor with the outcome. In this paper we propose strong rules that are not foolproof but rarely fail in practice. These can be complemented with simple checks of the Karush-Kuhn-Tucker (KKT) conditions to provide safe rules that offer substantial speed and space savings in a variety of statistical convex optimization problems.

Submitted 24 November, 2010; v1 submitted 9 November, 2010; originally announced November 2010.

Comments: 5

MSC Class: 62J07 62G08

arXiv:1011.0413 [pdf, ps, other]

CUR from a Sparse Optimization Viewpoint

Authors: Jacob Bien, Ya Xu, Michael W. Mahoney

Abstract: The CUR decomposition provides an approximation of a matrix $X$ that has low reconstruction error and that is sparse in the sense that the resulting approximation lies in the span of only a few columns of $X$. In this regard, it appears to be similar to many sparse PCA methods. However, CUR takes a randomized algorithmic approach, whereas most sparse PCA methods are framed as convex optimization problems. In this paper, we try to understand CUR from a sparse optimization viewpoint. We show that CUR is implicitly optimizing a sparse regression objective and, furthermore, cannot be directly cast as a sparse PCA method. We also observe that the sparsity attained by CUR possesses an interesting structure, which leads us to formulate a sparse PCA method that achieves a CUR-like sparsity.

Submitted 1 November, 2010; originally announced November 2010.

Comments: 9 pages; in NIPS 2010

arXiv:0908.2284 [pdf, other]

Classification by Set Cover: The Prototype Vector Machine

Authors: Jacob Bien, Robert Tibshirani

Abstract: We introduce a new nearest-prototype classifier, the prototype vector machine (PVM). It arises from a combinatorial optimization problem which we cast as a variant of the set cover problem. We propose two algorithms for approximating its solution. The PVM selects a relatively small number of representative points which can then be used for classification. It contains 1-NN as a special case. The method is compatible with any dissimilarity measure, making it amenable to situations in which the data are not embedded in an underlying feature space or in which using a non-Euclidean metric is desirable. Indeed, we demonstrate on the much studied ZIP code data how the PVM can reap the benefits of a problem-specific metric. In this example, the PVM outperforms the highly successful 1-NN with tangent distance, and does so retaining fewer than half of the data points. This example highlights the strengths of the PVM in yielding a low-error, highly interpretable model. Additionally, we apply the PVM to a protein classification problem in which a kernel-based distance is used.

Submitted 17 August, 2009; originally announced August 2009.

Comments: 24 pages, 11 figures

Showing 1–46 of 46 results for author: Bien, J