
Showing 1–46 of 46 results for author: Bien, J

  1. arXiv:2506.01219  [pdf, ps, other]

    stat.ME

    Reluctant Interaction Inference after Additive Modeling

    Authors: Yiling Huang, Snigdha Panigrahi, Guo Yu, Jacob Bien

    Abstract: Additive models enjoy the flexibility of nonlinear models while still being readily understandable to humans. By contrast, other nonlinear models, which involve interactions between features, are not only harder to fit but also substantially more complicated to explain. Guided by the principle of parsimony, a data analyst therefore may naturally be reluctant to move beyond an additive model unless…

    Submitted 1 June, 2025; originally announced June 2025.

    Comments: 41 pages, 8 figures

  2. arXiv:2505.06760  [pdf, other]

    stat.ME

    Quantifying uncertainty and stability among highly correlated predictors: a subspace perspective

    Authors: Xiaozhu Zhang, Jacob Bien, Armeen Taeb

    Abstract: We study the problem of linear feature selection when features are highly correlated. This setting presents two main challenges. First, how should false positives be defined? Intuitively, selecting a null feature that is highly correlated with a true one may be less problematic than selecting a completely uncorrelated null feature. Second, correlation among features can cause variable selection me…

    Submitted 10 May, 2025; originally announced May 2025.

  3. arXiv:2504.12287  [pdf, other]

    stat.ME stat.AP stat.ML

    Trend Filtered Mixture of Experts for Automated Gating of High-Frequency Flow Cytometry Data

    Authors: Sangwon Hyun, Tim Coleman, Francois Ribalet, Jacob Bien

    Abstract: Ocean microbes are critical to both ocean ecosystems and the global climate. Flow cytometry, which measures cell optical properties in fluid samples, is routinely used in oceanographic research. Despite decades of accumulated data, identifying key microbial populations (a process known as "gating") remains a significant analytical challenge. To address this, we focus on gating multidimensional,…

    Submitted 16 April, 2025; originally announced April 2025.

    Comments: 23 pages (including supplement), 9 figures (including supplement)

    MSC Class: 62H30 (Primary) 62G08; 92B10; 62J07 (Secondary)

  4. arXiv:2502.09957  [pdf, other]

    stat.ME stat.ML

    Thinning a Wishart Random Matrix

    Authors: Ameer Dharamshi, Anna Neufeld, Lucy L. Gao, Daniela Witten, Jacob Bien

    Abstract: Recent work has explored data thinning, a generalization of sample splitting that involves decomposing a (possibly matrix-valued) random variable into independent components. In the special case of an $n \times p$ random matrix with independent and identically distributed $N_p(μ, Σ)$ rows, Dharamshi et al. (2024a) provides a comprehensive analysis of the settings in which thinning is or is not poss…

    Submitted 14 February, 2025; originally announced February 2025.

  5. arXiv:2410.09039  [pdf, other]

    stat.ME

    Semi-Supervised Learning of Noisy Mixture of Experts Models

    Authors: Oh-Ran Kwon, Gourab Mukherjee, Jacob Bien

    Abstract: The mixture of experts (MoE) model is a versatile framework for predictive modeling that has gained renewed interest in the age of large language models. A collection of predictive "experts" is learned along with a "gating function" that controls how much influence each expert is given when a prediction is made. This structure allows relatively simple models to excel in complex, heterogeneous…

    Submitted 11 October, 2024; originally announced October 2024.

  6. arXiv:2409.11497  [pdf, other]

    stat.ME stat.ML

    Decomposing Gaussians with Unknown Covariance

    Authors: Ameer Dharamshi, Anna Neufeld, Lucy L. Gao, Jacob Bien, Daniela Witten

    Abstract: Common workflows in machine learning and statistics rely on the ability to partition the information in a data set into independent portions. Recent work has shown that this may be possible even when conventional sample splitting is not (e.g., when the number of samples $n=1$, or when observations are not independent and identically distributed). However, the approaches that are currently availabl…

    Submitted 17 September, 2024; originally announced September 2024.

  7. arXiv:2409.03069  [pdf, other]

    stat.ME

    Discussion of "Data fission: splitting a single data point"

    Authors: Anna Neufeld, Ameer Dharamshi, Lucy L. Gao, Daniela Witten, Jacob Bien

    Abstract: Leiner et al. [2023] introduce an important generalization of sample splitting, which they call data fission. They consider two cases of data fission: P1 fission and P2 fission. While P1 fission is extremely useful and easy to use, Leiner et al. [2023] provide P1 fission operations only for the Gaussian and the Poisson distributions. They provide little guidance on how to apply P2 fission operatio…

    Submitted 4 September, 2024; originally announced September 2024.

    Comments: 18 pages, 1 figure

  8. arXiv:2402.16725  [pdf, other]

    stat.ME

    Inference on the proportion of variance explained in principal component analysis

    Authors: Ronan Perry, Snigdha Panigrahi, Jacob Bien, Daniela Witten

    Abstract: Principal component analysis (PCA) is a longstanding and well-studied approach for dimension reduction. It rests upon the assumption that the underlying signal in the data has low rank, and thus can be well-summarized using a small number of dimensions. The output of PCA is typically represented using a scree plot, which displays the proportion of variance explained (PVE) by each principal compone…

    Submitted 26 March, 2025; v1 submitted 26 February, 2024; originally announced February 2024.

  9. Generative AI for Data Science 101: Coding Without Learning To Code

    Authors: Jacob Bien, Gourab Mukherjee

    Abstract: Should one teach coding in a required introductory statistics and data science class for non-major students? Many professors advise against it, considering it a distraction from the important and challenging statistical topics that need to be covered. By contrast, other professors argue that the ability to interact flexibly with data will inspire students with a lasting love of the subject and a c…

    Submitted 21 September, 2024; v1 submitted 31 January, 2024; originally announced January 2024.

  10. arXiv:2305.18700  [pdf, other]

    stat.ME math.ST stat.ML

    Predicting Rare Events by Shrinking Towards Proportional Odds

    Authors: Gregory Faletto, Jacob Bien

    Abstract: Training classifiers is difficult with severe class imbalance, but many rare events are the culmination of a sequence with much more common intermediate outcomes. For example, in online marketing a user first sees an ad, then may click on it, and finally may make a purchase; estimating the probability of purchases is difficult because of their rarity. We show both theoretically and through data ex…

    Submitted 29 May, 2023; originally announced May 2023.

    Comments: 84 pages, 20 figures. Accepted at the Fortieth International Conference on Machine Learning (ICML 2023)

  11. arXiv:2303.12931  [pdf, other]

    stat.ME math.ST stat.ML

    Generalized Data Thinning Using Sufficient Statistics

    Authors: Ameer Dharamshi, Anna Neufeld, Keshav Motwani, Lucy L. Gao, Daniela Witten, Jacob Bien

    Abstract: Our goal is to develop a general strategy to decompose a random variable $X$ into multiple independent random variables, without sacrificing any information about unknown parameters. A recent paper showed that for some well-known natural exponential families, $X$ can be "thinned" into independent random variables $X^{(1)}, \ldots, X^{(K)}$, such that $X = \sum_{k=1}^K X^{(k)}$. These independent r…

    Submitted 11 June, 2023; v1 submitted 22 March, 2023; originally announced March 2023.

  12. arXiv:2211.01521  [pdf, other]

    stat.ME stat.ML

    Inferring independent sets of Gaussian variables after thresholding correlations

    Authors: Arkajyoti Saha, Daniela Witten, Jacob Bien

    Abstract: We consider testing whether a set of Gaussian variables, selected from the data, is independent of the remaining variables. We assume that this set is selected via a very simple approach that is commonly used across scientific disciplines: we select a set of variables for which the correlation with all variables outside the set falls below some threshold. Unlike other settings in selective inferen…

    Submitted 2 November, 2022; originally announced November 2022.

    Comments: 33 pages, 5 figures, 6 figure files (due to subfigures)

  13. arXiv:2210.16710  [pdf, other]

    math.ST stat.ME stat.ML

    Prediction Sets for High-Dimensional Mixture of Experts Models

    Authors: Adel Javanmard, Simeng Shao, Jacob Bien

    Abstract: Large datasets make it possible to build predictive models that can capture heterogeneous relationships between the response variable and features. The mixture of high-dimensional linear experts model posits that observations come from a mixture of high-dimensional linear regression models, where the mixture weights are themselves feature-dependent. In this paper, we show how to construct valid pre…

    Submitted 29 October, 2022; originally announced October 2022.

    Comments: 36 pages, 6 figures, 2 tables

  14. arXiv:2206.01703  [pdf, other]

    stat.OT stat.CO

    Interactive Exploration of Large Dendrograms with Prototypes

    Authors: Andee Kaplan, Jacob Bien

    Abstract: Hierarchical clustering is one of the standard methods taught for identifying and exploring the underlying structures that may be present within a data set. Students are shown examples in which the dendrogram, a visual representation of the hierarchical clustering, reveals a clear clustering structure. However, in practice, data analysts today frequently encounter data sets whose large scale under…

    Submitted 3 June, 2022; originally announced June 2022.

    Comments: 29 pages, 9 figures, accepted at The American Statistician

  15. arXiv:2201.12387  [pdf, other]

    stat.ME

    Comparing trained and untrained probabilistic ensemble forecasts of COVID-19 cases and deaths in the United States

    Authors: Evan L. Ray, Logan C. Brooks, Jacob Bien, Matthew Biggerstaff, Nikos I. Bosse, Johannes Bracher, Estee Y. Cramer, Sebastian Funk, Aaron Gerding, Michael A. Johansson, Aaron Rumack, Yijin Wang, Martha Zorn, Ryan J. Tibshirani, Nicholas G. Reich

    Abstract: The U.S. COVID-19 Forecast Hub aggregates forecasts of the short-term burden of COVID-19 in the United States from many contributing teams. We study methods for building an ensemble that combines forecasts from these teams. These experiments have informed the ensemble methods used by the Hub. To be most useful to policy makers, ensemble forecasts must have stable performance in the presence of two…

    Submitted 7 June, 2022; v1 submitted 28 January, 2022; originally announced January 2022.

  16. arXiv:2201.00494  [pdf, other]

    stat.ME math.ST stat.CO stat.ML

    Cluster Stability Selection

    Authors: Gregory Faletto, Jacob Bien

    Abstract: Stability selection (Meinshausen and Buhlmann, 2010) makes any feature selection method more stable by returning only those features that are consistently selected across many subsamples. We prove (in what is, to our knowledge, the first result of its kind) that for data containing highly correlated proxies for an important latent variable, the lasso typically selects one proxy, yet stability sele…

    Submitted 3 January, 2022; originally announced January 2022.

    Comments: 77 pages, 6 figures

  17. Ocean Mover's Distance: Using Optimal Transport for Analyzing Oceanographic Data

    Authors: Sangwon Hyun, Aditya Mishra, Christopher L. Follett, Bror Jonsson, Gemma Kulk, Gael Forget, Marie-Fanny Racault, Thomas Jackson, Stephanie Dutkiewicz, Christian L. Müller, Jacob Bien

    Abstract: Remote sensing observations from satellites and global biogeochemical models have combined to revolutionize the study of ocean biogeochemical cycling, but comparing the two data streams to each other and across time remains challenging due to the strong spatial-temporal structuring of the ocean. Here, we show that the Wasserstein distance provides a powerful metric for harnessing these structured…

    Submitted 4 November, 2022; v1 submitted 16 November, 2021; originally announced November 2021.

    Comments: 6 figures

    MSC Class: 62P12; 86-10

    Journal ref: Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, year 2022, volume 478, number 2262, pages 20210875

  18. arXiv:2108.05350  [pdf, other]

    stat.ME cs.LG math.ST stat.ML

    Controlling the False Split Rate in Tree-Based Aggregation

    Authors: Simeng Shao, Jacob Bien, Adel Javanmard

    Abstract: In many domains, data measurements can naturally be associated with the leaves of a tree, expressing the relationships among these measurements. For example, companies belong to industries, which in turn belong to ever coarser divisions such as sectors; microbes are commonly arranged in a taxonomic hierarchy from species to kingdoms; street blocks belong to neighborhoods, which in turn belong to l…

    Submitted 11 August, 2021; originally announced August 2021.

    Comments: 47 pages

  19. arXiv:2101.12503  [pdf, other]

    stat.ME econ.EM stat.ML

    Tree-based Node Aggregation in Sparse Graphical Models

    Authors: Ines Wilms, Jacob Bien

    Abstract: High-dimensional graphical models are often estimated using regularization that is aimed at reducing the number of edges in a network. In this work, we show how even simpler networks can be produced by aggregating the nodes of the graphical model. We develop a new convex regularized method, called the tree-aggregated graphical lasso or tag-lasso, that estimates graphical models that are both edge-…

    Submitted 29 January, 2021; originally announced January 2021.

  20. arXiv:2012.02936  [pdf, other]

    stat.ME stat.ML

    Selective Inference for Hierarchical Clustering

    Authors: Lucy L. Gao, Jacob Bien, Daniela Witten

    Abstract: Classical tests for a difference in means control the type I error rate when the groups are defined a priori. However, when the groups are instead defined via clustering, then applying a classical test yields an extremely inflated type I error rate. Notably, this problem persists even if two separate and independent data sets are used to define the groups and to test for a difference in their mean…

    Submitted 31 October, 2022; v1 submitted 4 December, 2020; originally announced December 2020.

    Comments: Final accepted version

  21. arXiv:2008.11251  [pdf, other]

    stat.AP stat.ME stat.ML

    Modeling Cell Populations Measured By Flow Cytometry With Covariates Using Sparse Mixture of Regressions

    Authors: Sangwon Hyun, Mattias Rolf Cape, Francois Ribalet, Jacob Bien

    Abstract: The ocean is filled with microscopic microalgae called phytoplankton, which together are responsible for as much photosynthesis as all plants on land combined. Our ability to predict their response to the warming ocean relies on understanding how the dynamics of phytoplankton populations is influenced by changes in environmental conditions. One powerful technique to study the dynamics of phytoplan…

    Submitted 3 August, 2022; v1 submitted 25 August, 2020; originally announced August 2020.

    Comments: To appear, Annals of Applied Statistics

    MSC Class: 62P12; 62P10; 92-10

  22. Controlling Costs: Feature Selection on a Budget

    Authors: Guo Yu, Daniela Witten, Jacob Bien

    Abstract: The traditional framework for feature selection treats all features as costing the same amount. However, in reality, a scientist often has considerable discretion regarding which variables to measure, and the decision involves a tradeoff between model accuracy and cost (where cost can refer to money, time, difficulty, or intrusiveness). In particular, unnecessarily including an expensive feature i…

    Submitted 11 February, 2023; v1 submitted 8 October, 2019; originally announced October 2019.

    Journal ref: Stat 11.1 (2022): e427

  23. arXiv:1909.11640  [pdf, other]

    stat.ME stat.ML

    Testing for Association in Multi-View Network Data

    Authors: Lucy L. Gao, Daniela Witten, Jacob Bien

    Abstract: In this paper, we consider data consisting of multiple networks, each comprised of a different edge set on a common set of nodes. Many models have been proposed for the analysis of such multi-view network data under the assumption that the data views are closely related. In this paper, we provide tools for evaluating this assumption. In particular, we ask: given two networks that each follow a sto…

    Submitted 22 March, 2021; v1 submitted 25 September, 2019; originally announced September 2019.

  24. arXiv:1907.08414  [pdf, other]

    stat.ME stat.CO

    Reluctant Interaction Modeling

    Authors: Guo Yu, Jacob Bien, Ryan Tibshirani

    Abstract: Including pairwise interactions between the predictors of a regression model can produce better predicting models. However, to fit such interaction models on typical data sets in biology and other fields can often require solving enormous variable selection problems with billions of interactions. The scale of such problems demands methods that are computationally cheap (both in time and memory) ye…

    Submitted 11 February, 2023; v1 submitted 19 July, 2019; originally announced July 2019.

  25. arXiv:1901.03905  [pdf, other]

    stat.ME stat.ML

    Are Clusterings of Multiple Data Views Independent?

    Authors: Lucy L. Gao, Jacob Bien, Daniela Witten

    Abstract: In the Pioneer 100 (P100) Wellness Project (Price and others, 2017), multiple types of data are collected on a single set of healthy participants at multiple timepoints in order to characterize and optimize wellness. One way to do this is to identify clusters, or subgroups, among the participants, and then to tailor personalized health recommendations to each subgroup. It is tempting to cluster th…

    Submitted 12 January, 2019; originally announced January 2019.

    Comments: 20 pages, 4 figures, 1 table (main text); 15 pages, 9 figures (supplement)

  26. arXiv:1803.06675  [pdf, other]

    stat.ME math.ST stat.CO stat.ML

    Rare Feature Selection in High Dimensions

    Authors: Xiaohan Yan, Jacob Bien

    Abstract: It is common in modern prediction problems for many predictor variables to be counts of rarely occurring events. This leads to design matrices in which many columns are highly sparse. The challenge posed by such "rare features" has received little attention despite its prevalence in diverse areas, ranging from natural language processing (e.g., rare words) to biology (e.g., rare species). We show,…

    Submitted 2 February, 2023; v1 submitted 18 March, 2018; originally announced March 2018.

    Comments: 42 pages, 10 figures

    Journal ref: Journal of the American Statistical Association, 116:534, 887-900

  27. arXiv:1712.02412  [pdf, other]

    stat.ME stat.ML

    Estimating the error variance in a high-dimensional linear model

    Authors: Guo Yu, Jacob Bien

    Abstract: The lasso has been studied extensively as a tool for estimating the coefficient vector in the high-dimensional linear model; however, considerably less is known about estimating the error variance in this context. In this paper, we propose the natural lasso estimator for the error variance, which maximizes a penalized likelihood objective. A key aspect of the natural lasso is that the likelihood i…

    Submitted 19 July, 2019; v1 submitted 6 December, 2017; originally announced December 2017.

    Comments: Biometrika(2019)

  28. arXiv:1711.10635  [pdf, other]

    stat.ME math.ST stat.CO stat.ML

    Valid Inference Corrected for Outlier Removal

    Authors: Shuxiao Chen, Jacob Bien

    Abstract: Ordinary least squares (OLS) estimation of a linear regression model is well-known to be highly sensitive to outliers. It is common practice to (1) identify and remove outliers by looking at the data and (2) to fit OLS and form confidence intervals and p-values on the remaining data as if this were the original data collected. This standard "detect-and-forget" approach has been shown to be problema…

    Submitted 10 August, 2019; v1 submitted 28 November, 2017; originally announced November 2017.

    Comments: 21 pages, 6 figures, 2 tables

  29. arXiv:1711.03623  [pdf, other]

    stat.ML stat.AP

    Interpretable Vector AutoRegressions with Exogenous Time Series

    Authors: Ines Wilms, Sumanta Basu, Jacob Bien, David S. Matteson

    Abstract: The Vector AutoRegressive (VAR) model is fundamental to the study of multivariate time series. Although VAR models are intensively investigated by many researchers, practitioners often show more interest in analyzing VARX models that incorporate the impact of unmodeled exogenous variables (X) into the VAR. However, since the parameter space grows quadratically with the number of time series, estim…

    Submitted 9 November, 2017; originally announced November 2017.

    Comments: Presented at NIPS 2017 Symposium on Interpretable Machine Learning

  30. arXiv:1707.09208  [pdf, other]

    stat.ME

    Sparse Identification and Estimation of Large-Scale Vector AutoRegressive Moving Averages

    Authors: Ines Wilms, Sumanta Basu, Jacob Bien, David S. Matteson

    Abstract: The Vector AutoRegressive Moving Average (VARMA) model is fundamental to the theory of multivariate time series; however, identifiability issues have led practitioners to abandon it in favor of the simpler but more restrictive Vector AutoRegressive (VAR) model. We narrow this gap with a new optimization-based approach to VARMA identification built upon the principle of parsimony. Among all equival…

    Submitted 8 June, 2021; v1 submitted 28 July, 2017; originally announced July 2017.

  31. arXiv:1702.07094  [pdf, other]

    stat.CO

    BigVAR: Tools for Modeling Sparse High-Dimensional Multivariate Time Series

    Authors: William Nicholson, David Matteson, Jacob Bien

    Abstract: The R package BigVAR allows for the simultaneous estimation of high-dimensional time series by applying structured penalties to the conventional vector autoregression (VAR) and vector autoregression with exogenous variables (VARX) frameworks. Our methods can be utilized in many forecasting applications that make use of time-dependent data such as macroeconomics, finance, and internet traffic. Our…

    Submitted 22 February, 2017; originally announced February 2017.

  32. arXiv:1607.00021  [pdf, other]

    stat.CO stat.ME stat.ML stat.OT

    The Simulator: An Engine to Streamline Simulations

    Authors: Jacob Bien

    Abstract: The simulator is an R package that streamlines the process of performing simulations by creating a common infrastructure that can be easily used and reused across projects. Methodological statisticians routinely write simulations to compare their methods to preexisting ones. While developing ideas, there is a temptation to write "quick and dirty" simulations to try out ideas. This approach of rapi…

    Submitted 30 June, 2016; originally announced July 2016.

  33. arXiv:1606.00451  [pdf, other]

    stat.ME math.ST stat.CO stat.ML

    Graph-Guided Banding of the Covariance Matrix

    Authors: Jacob Bien

    Abstract: Regularization has become a primary tool for developing reliable estimators of the covariance matrix in high-dimensional settings. To curb the curse of dimensionality, numerous methods assume that the population covariance (or inverse covariance) matrix is sparse, while making no particular structural assumptions on the desired pattern of sparsity. A highly-related, yet complementary, literature s…

    Submitted 15 February, 2018; v1 submitted 1 June, 2016; originally announced June 2016.

  34. arXiv:1604.07451  [pdf, other]

    math.ST stat.CO stat.ME stat.ML

    Learning Local Dependence In Ordered Data

    Authors: Guo Yu, Jacob Bien

    Abstract: In many applications, data come with a natural ordering. This ordering can often induce local dependence among nearby variables. However, in complex data, the width of this dependence may vary, making simple assumptions such as a constant neighborhood size unrealistic. We propose a framework for learning this local dependence based on estimating the inverse of the Cholesky factor of the covariance…

    Submitted 7 May, 2017; v1 submitted 25 April, 2016; originally announced April 2016.

    Journal ref: Journal of Machine Learning Research (2017) 18(42) 1-60

  35. arXiv:1604.06815  [pdf, other]

    stat.ML cs.OH stat.CO stat.ME

    Non-convex Global Minimization and False Discovery Rate Control for the TREX

    Authors: Jacob Bien, Irina Gaynanova, Johannes Lederer, Christian Müller

    Abstract: The TREX is a recently introduced method for performing sparse high-dimensional regression. Despite its statistical promise as an alternative to the lasso, square-root lasso, and scaled lasso, the TREX is computationally challenging in that it requires solving a non-convex optimization problem. This paper shows a remarkable result: despite the non-convexity of the TREX problem, there exists a poly…

    Submitted 20 September, 2016; v1 submitted 22 April, 2016; originally announced April 2016.

    Journal ref: Journal of Computational and Graphical Statistics 2017, Vol. 27, No. 1, 23-33

  36. arXiv:1512.01631  [pdf]

    stat.ME math.ST stat.CO stat.ML

    Hierarchical Sparse Modeling: A Choice of Two Group Lasso Formulations

    Authors: Xiaohan Yan, Jacob Bien

    Abstract: Demanding sparsity in estimated models has become a routine practice in statistics. In many situations, we wish to require that the sparsity patterns attained honor certain problem-specific constraints. Hierarchical sparse modeling (HSM) refers to situations in which these constraints specify that one set of parameters be set to zero whenever another is set to zero. In recent years, numerous paper…

    Submitted 29 November, 2017; v1 submitted 5 December, 2015; originally announced December 2015.

    Comments: 30 pages, 13 figures

    Journal ref: Statist. Sci. 32 (2017), no. 4, 531--560

  37. arXiv:1508.07497  [pdf, other]

    stat.AP

    VARX-L: Structured Regularization for Large Vector Autoregressions with Exogenous Variables

    Authors: William Nicholson, David Matteson, Jacob Bien

    Abstract: The vector autoregression (VAR) has long proven to be an effective method for modeling the joint dynamics of macroeconomic time series as well as forecasting. A major shortcoming of the VAR that has hindered its applicability is its heavy parameterization: the parameter space grows quadratically with the number of series included, quickly exhausting the available degrees of freedom. Consequently,…

    Submitted 27 February, 2017; v1 submitted 29 August, 2015; originally announced August 2015.

  38. arXiv:1412.5250  [pdf, other]

    stat.ME stat.CO stat.ML

    High Dimensional Forecasting via Interpretable Vector Autoregression

    Authors: William B. Nicholson, Ines Wilms, Jacob Bien, David S. Matteson

    Abstract: Vector autoregression (VAR) is a fundamental tool for modeling multivariate time series. However, as the number of component series is increased, the VAR model becomes overparameterized. Several authors have addressed this issue by incorporating regularized approaches, such as the lasso in VAR estimation. Traditional approaches address overparameterization by selecting a low lag order, based on th…

    Submitted 7 September, 2020; v1 submitted 16 December, 2014; originally announced December 2014.

  39. arXiv:1407.4729  [pdf, other]

    stat.ME cs.LG stat.ML

    Sparse Partially Linear Additive Models

    Authors: Yin Lou, Jacob Bien, Rich Caruana, Johannes Gehrke

    Abstract: The generalized partially linear additive model (GPLAM) is a flexible and interpretable approach to building predictive models. It combines features in an additive manner, allowing each to have either a linear or nonlinear effect on the response. However, the choice of which features to treat as linear or nonlinear is typically assumed known. Thus, to make a GPLAM a viable approach in situations i…

    Submitted 27 March, 2018; v1 submitted 17 July, 2014; originally announced July 2014.

    Comments: Corrected typos

  40. arXiv:1405.6210  [pdf, other]

    math.ST stat.CO stat.ME stat.ML

    Convex Banding of the Covariance Matrix

    Authors: Jacob Bien, Florentina Bunea, Luo Xiao

    Abstract: We introduce a new sparse estimator of the covariance matrix for high-dimensional models in which the variables have a known ordering. Our estimator, which is the solution to a convex optimization problem, is equivalently expressed as an estimator which tapers the sample covariance matrix by a Toeplitz, sparsely-banded, data-adaptive matrix. As a result of this adaptivity, the convex banding estim…

    Submitted 23 May, 2014; originally announced May 2014.

  41. Convex hierarchical testing of interactions

    Authors: Jacob Bien, Noah Simon, Robert Tibshirani

    Abstract: We consider the testing of all pairwise interactions in a two-class problem with many features. We devise a hierarchical testing framework that considers an interaction only when one or more of its constituent features has a nonzero main effect. The test is based on a convex optimization framework that seamlessly considers main effects and interactions together. We show - both in simulation and on…

    Submitted 2 June, 2015; v1 submitted 6 November, 2012; originally announced November 2012.

    Comments: Published at http://dx.doi.org/10.1214/14-AOAS758 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Report number: IMS-AOAS-AOAS758

    Journal ref: Annals of Applied Statistics 2015, Vol. 9, No. 1, 27-42

  42. arXiv:1205.5050  [pdf, ps, other]

    stat.ME math.ST stat.ML

    A lasso for hierarchical interactions

    Authors: Jacob Bien, Jonathan Taylor, Robert Tibshirani

    Abstract: We add a set of convex constraints to the lasso to produce sparse interaction models that honor the hierarchy restriction that an interaction only be included in a model if one or both variables are marginally important. We give a precise characterization of the effect of this hierarchy constraint, prove that hierarchy holds with probability one and derive an unbiased estimate for the degrees of f…

    Submitted 19 June, 2013; v1 submitted 22 May, 2012; originally announced May 2012.

    Comments: Published at http://dx.doi.org/10.1214/13-AOS1096 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Report number: IMS-AOS-AOS1096

    Journal ref: Annals of Statistics 2013, Vol. 41, No. 3, 1111-1141

  43. Prototype selection for interpretable classification

    Authors: Jacob Bien, Robert Tibshirani

    Abstract: Prototype methods seek a minimal subset of samples that can serve as a distillation or condensed view of a data set. As the size of modern data sets grows, being able to present a domain specialist with a short list of "representative" samples chosen from the data set is of increasing interpretative value. While much recent statistical research has been focused on producing sparse-in-the-variables…

    Submitted 27 February, 2012; originally announced February 2012.

    Comments: Published at http://dx.doi.org/10.1214/11-AOAS495 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org). arXiv admin note: text overlap with arXiv:0908.2284

    Report number: IMS-AOAS-AOAS495

    Journal ref: Annals of Applied Statistics 2011, Vol. 5, No. 4, 2403-2424

  44. arXiv:1011.2234  [pdf, ps, other]

    math.ST stat.ML

    Strong rules for discarding predictors in lasso-type problems

    Authors: Robert Tibshirani, Jacob Bien, Jerome Friedman, Trevor Hastie, Noah Simon, Jonathan Taylor, Ryan J. Tibshirani

    Abstract: We consider rules for discarding predictors in lasso regression and related problems, for computational efficiency. El Ghaoui et al. (2010) propose "SAFE" rules that guarantee that a coefficient will be zero in the solution, based on the inner products of each predictor with the outcome. In this paper we propose strong rules that are not foolproof but rarely fail in practice. These can be complemen…

    Submitted 24 November, 2010; v1 submitted 9 November, 2010; originally announced November 2010.

    Comments: 5

    MSC Class: 62J07 62G08

  45. arXiv:1011.0413  [pdf, ps, other]

    cs.DS stat.AP stat.ML

    CUR from a Sparse Optimization Viewpoint

    Authors: Jacob Bien, Ya Xu, Michael W. Mahoney

    Abstract: The CUR decomposition provides an approximation of a matrix $X$ that has low reconstruction error and that is sparse in the sense that the resulting approximation lies in the span of only a few columns of $X$. In this regard, it appears to be similar to many sparse PCA methods. However, CUR takes a randomized algorithmic approach, whereas most sparse PCA methods are framed as convex optimization p…

    Submitted 1 November, 2010; originally announced November 2010.

    Comments: 9 pages; in NIPS 2010

  46. arXiv:0908.2284  [pdf, other]

    stat.ML

    Classification by Set Cover: The Prototype Vector Machine

    Authors: Jacob Bien, Robert Tibshirani

    Abstract: We introduce a new nearest-prototype classifier, the prototype vector machine (PVM). It arises from a combinatorial optimization problem which we cast as a variant of the set cover problem. We propose two algorithms for approximating its solution. The PVM selects a relatively small number of representative points which can then be used for classification. It contains 1-NN as a special case. The…

    Submitted 17 August, 2009; originally announced August 2009.

    Comments: 24 pages, 11 figures