-
The Lasso Distribution: Properties, Sampling Methods, and Applications in Bayesian Lasso Regression
Authors:
Mohammad Javad Davoudabadi,
Jonathon Tidswell,
Samuel Muller,
Garth Tarr,
John T. Ormerod
Abstract:
In this paper, we introduce a new probability distribution, the Lasso distribution. We derive several fundamental properties of the distribution, including closed-form expressions for its moments and moment-generating function. Additionally, we present an efficient and numerically stable algorithm for generating random samples from the distribution, facilitating its use in both theoretical and applied settings. We establish that the Lasso distribution belongs to the exponential family. A direct application of the Lasso distribution arises in the context of an existing Gibbs sampler, where the full conditional distribution of each regression coefficient follows this distribution. This leads to a more computationally efficient and theoretically grounded sampling scheme. To facilitate the adoption of our methodology, we provide an R package implementing the proposed methods. Our findings offer new insights into the probabilistic structure underlying the Lasso penalty and provide practical improvements in Bayesian inference for high-dimensional regression problems.
Submitted 12 June, 2025; v1 submitted 8 June, 2025;
originally announced June 2025.
-
Data-Adaptive Automatic Threshold Calibration for Stability Selection
Authors:
Martin Huang,
Samuel Muller,
Garth Tarr
Abstract:
Stability selection has gained popularity as a method for enhancing the performance of variable selection algorithms while controlling false discovery rates. However, achieving these desirable properties depends on correctly specifying the stable threshold parameter, which can be challenging. An arbitrary choice of this parameter can substantially alter the set of selected variables, as the variables' selection probabilities are inherently data-dependent. To address this issue, we propose Exclusion Automatic Threshold Selection (EATS), a data-adaptive algorithm that streamlines stability selection by automating the threshold specification process. Additionally, we introduce Automatic Threshold Selection (ATS), the motivation behind EATS. We evaluate our approach through an extensive simulation study, benchmarking across commonly used variable selection algorithms and several static stable threshold values.
Submitted 28 May, 2025;
originally announced May 2025.
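To make the role of the threshold in the stability selection abstract above concrete, here is a minimal sketch of the fixed-threshold baseline: selection frequencies are estimated over random half-subsamples, and a variable is kept when its frequency reaches the threshold. This is not EATS or ATS, whose details are not given in the abstract. The names `selection_frequencies`, `stable_set`, and the `select` callback are hypothetical and introduced only for illustration.

```python
import random


def selection_frequencies(select, rows, n_subsamples=100, seed=0):
    """Fraction of random half-subsamples in which each variable is selected.

    `select` stands in for any base variable selection routine (hypothetical
    here): it takes a list of rows and returns a set of variable names.
    """
    rng = random.Random(seed)
    counts = {}
    half = len(rows) // 2
    for _ in range(n_subsamples):
        subsample = rng.sample(rows, half)  # sample rows without replacement
        for name in select(subsample):
            counts[name] = counts.get(name, 0) + 1
    return {name: count / n_subsamples for name, count in counts.items()}


def stable_set(frequencies, threshold):
    """Variables whose selection frequency is at least the fixed threshold."""
    return {name for name, freq in frequencies.items() if freq >= threshold}


# Illustrative frequencies (made up): the selected set depends on the threshold.
freqs = {"x1": 0.95, "x2": 0.72, "x3": 0.58, "x4": 0.10}
print(sorted(stable_set(freqs, 0.5)))  # ['x1', 'x2', 'x3']
print(sorted(stable_set(freqs, 0.9)))  # ['x1']
```

The two calls at the end show the sensitivity the abstract describes: moving a static threshold from 0.5 to 0.9 shrinks the selected set from three variables to one.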
-
CR-Lasso: Robust cellwise regularized sparse regression
Authors:
Peng Su,
Garth Tarr,
Samuel Muller,
Suojin Wang
Abstract:
Cellwise contamination remains a challenging problem for data scientists, particularly in research fields that require the selection of sparse features. Traditional robust methods may be neither feasible nor efficient in dealing with such contaminated datasets. We propose CR-Lasso, a robust Lasso-type cellwise regularization procedure that performs feature selection in the presence of cellwise outliers by minimising a regression loss and cell deviation measure simultaneously. To evaluate the approach, we conduct empirical studies comparing its selection and prediction performance with several sparse regression methods. We show that CR-Lasso is competitive under the settings considered. We illustrate the effectiveness of the proposed method on real data through an analysis of a bone mineral density dataset.
Submitted 1 March, 2024; v1 submitted 11 July, 2023;
originally announced July 2023.
-
Regularized Predictive Models for Beef Eating Quality of Individual Meals
Authors:
Garth Tarr,
Ines Wilms
Abstract:
Faced with changing markets and evolving consumer demands, beef industries are investing in grading systems to maximise value extraction throughout their entire supply chain. The Meat Standards Australia (MSA) system is a customer-oriented total quality management system that stands out internationally by predicting quality grades of specific muscles processed by a designated cooking method. The model currently underpinning the MSA system requires laborious effort to estimate, and its predictions may be less accurate in the presence of unbalanced data sets where many "muscle x cook" combinations have few observations and/or few predictors of palatability are available. This paper proposes a novel predictive method for beef eating quality that bridges a spectrum of muscle x cook-specific models. At one extreme, each muscle x cook combination is modelled independently; at the other extreme, a pooled predictive model is obtained across all muscle x cook combinations. Via a data-driven regularization method, we cover all muscle x cook-specific models along this spectrum. We demonstrate that the proposed predictive method attains considerable accuracy improvements relative to independent or pooled approaches on unique MSA data sets.
Submitted 5 July, 2022;
originally announced July 2022.
-
Robust Variable Selection under Cellwise Contamination
Authors:
Peng Su,
Garth Tarr,
Samuel Muller
Abstract:
Cellwise outliers are widespread in data, and traditional robust methods may fail when applied to datasets under such contamination. We propose a variable selection procedure that uses a pairwise robust estimator to obtain an initial empirical covariance matrix among the response and potentially many predictors. Then we replace the primary design matrix and the response vector with their robust counterparts based on the estimated covariance matrix. Finally, we adopt the adaptive Lasso to obtain variable selection results. The proposed approach is robust to cellwise outliers in regular and high-dimensional settings, and empirical results show good performance in comparison with recently proposed alternative robust approaches, particularly in the challenging setting when contamination rates are high but the magnitude of outliers is moderate. Real data applications demonstrate the practical utility of the proposed method.
Submitted 4 September, 2023; v1 submitted 24 October, 2021;
originally announced October 2021.
-
Machine learning applications in time series hierarchical forecasting
Authors:
Mahdi Abolghasemi,
Rob J Hyndman,
Garth Tarr,
Christoph Bergmeir
Abstract:
Hierarchical forecasting (HF) is needed in many situations in the supply chain (SC) because managers often need forecasts at different levels of the SC to make a decision. Top-Down (TD), Bottom-Up (BU) and Optimal Combination (COM) are common HF models. These approaches are static and often ignore the dynamics of the series while disaggregating them. Consequently, they may fail to perform well if the investigated group of time series is subject to large changes, such as during periods of promotional sales. We address the HF problem of predicting real-world sales time series that are highly impacted by promotion. We use three machine learning (ML) models to capture sales variations over time. Artificial neural networks (ANN), extreme gradient boosting (XGBoost), and support vector regression (SVR) algorithms are used to estimate the proportions of lower-level time series from the upper level. We perform an in-depth analysis of 61 groups of time series with different volatilities and show that ML models are competitive and outperform some well-established models in the literature.
Submitted 1 December, 2019;
originally announced December 2019.
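As a point of reference for the hierarchical forecasting abstract above, the following is a minimal sketch of classical top-down disaggregation with static average historical proportions, the kind of fixed scheme the abstract describes as ignoring series dynamics. It is not the paper's ML approach, which instead estimates the proportions with ANN, XGBoost, or SVR. The function names and the example data are hypothetical, and the sketch assumes every historical upper-level total is positive.

```python
def historical_proportions(history):
    """Average share of each lower-level series in the upper-level total.

    `history` maps a series name to its past values; all lists must have the
    same length, and each period's total is assumed to be positive.
    """
    names = list(history)
    n_periods = len(history[names[0]])
    totals = [sum(history[name][t] for name in names) for t in range(n_periods)]
    return {
        name: sum(history[name][t] / totals[t] for t in range(n_periods)) / n_periods
        for name in names
    }


def top_down(upper_forecast, proportions):
    """Split an upper-level forecast across lower-level series."""
    return {name: upper_forecast * share for name, share in proportions.items()}


# Made-up sales history for two lower-level series over two periods.
history = {"store_a": [25, 50], "store_b": [75, 50]}
shares = historical_proportions(history)  # store_a: 0.375, store_b: 0.625
print(top_down(80, shares))
```

Because the shares are averaged over the whole history, a promotion that temporarily shifts demand towards one series is diluted, which is the limitation that motivates predicting the proportions period by period.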
-
Demand forecasting in supply chain: The impact of demand volatility in the presence of promotion
Authors:
Mahdi Abolghasemi,
Richard Gerlach,
Garth Tarr,
Eric Beh
Abstract:
The demand for a particular product or service is typically associated with different uncertainties that can make it volatile and challenging to predict. Demand unpredictability is a concern for supply chain managers because it can cause large forecasting errors and issues in the upstream supply chain, and impose unnecessary costs. We investigate 843 real demand time series with different values of the coefficient of variation (CoV), where promotion causes volatility over the entire demand series. In such a case, forecasting demand for different CoV values requires different models to capture the underlying behavior of the demand series and poses significant challenges due to very different and diverse demand behavior. We decompose demand into baseline and promotional demand and propose a hybrid model to forecast demand. Our results indicate that our proposed hybrid model generates robust and accurate forecasts across series with different levels of volatility. We stress the necessity of decomposition for volatile demand series. We also model demand series with a number of well-known statistical and machine learning (ML) models to investigate their performance empirically. We found that ARIMA with covariates (ARIMAX) works well to forecast volatile demand series, but exponential smoothing with covariates (ETSX) performs poorly. Support vector regression (SVR) and dynamic linear regression (DLR) models generate robust forecasts across different categories of demand with different CoV values.
Submitted 28 September, 2019;
originally announced September 2019.
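The coefficient of variation used in the demand forecasting abstract above is the standard deviation of a series divided by its mean. A minimal sketch follows, assuming the sample standard deviation (the abstract does not say which estimator is used). The function name and both demand series are made up for illustration.

```python
from statistics import mean, stdev


def coefficient_of_variation(series):
    """Sample standard deviation divided by the mean of the series."""
    average = mean(series)
    if average == 0:
        raise ValueError("CoV is undefined for a series with zero mean")
    return stdev(series) / average


# Made-up weekly demand: a steady series and one with a promotional spike.
steady = [100, 102, 98, 101, 99]
promoted = [100, 100, 400, 100, 100]

print(round(coefficient_of_variation(steady), 4))
print(round(coefficient_of_variation(promoted), 4))
```

A single promotional spike raises the CoV from under 0.02 to over 0.8 in this toy example, which illustrates why series with different CoV values may call for different forecasting models.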
-
mplot: An R Package for Graphical Model Stability and Variable Selection Procedures
Authors:
Garth Tarr,
Samuel Müller,
Alan Welsh
Abstract:
The mplot package provides an easy-to-use implementation of model stability and variable inclusion plots (Müller and Welsh 2010; Murray, Heritier, and Müller 2013) as well as the adaptive fence (Jiang, Rao, Gu, and Nguyen 2008; Jiang, Nguyen, and Rao 2009) for linear and generalised linear models. We provide a number of innovations on the standard procedures and address many practical implementation issues including the addition of redundant variables, interactive visualisations and approximating logistic models with linear models. An option is provided that combines our bootstrap approach with glmnet for higher dimensional models. The plots and graphical user interface leverage state-of-the-art web technologies to facilitate interaction with the results. The speed of implementation comes from the leaps package and cross-platform multicore support.
Submitted 28 February, 2017; v1 submitted 25 September, 2015;
originally announced September 2015.
-
Robust estimation of precision matrices under cellwise contamination
Authors:
Garth Tarr,
Samuel Müller,
Neville C. Weber
Abstract:
There is a great need for robust techniques in data mining and machine learning contexts where many standard techniques such as principal component analysis and linear discriminant analysis are inherently susceptible to outliers. Furthermore, standard robust procedures assume that fewer than half the observation rows of a data matrix are contaminated, which may not be a realistic assumption when the number of observed features is large. This work looks at the problem of estimating covariance and precision matrices under cellwise contamination. We consider using a robust pairwise covariance matrix as an input to various regularisation routines, such as the graphical lasso, QUIC and CLIME. To ensure the input covariance matrix is positive semidefinite, we use a method that transforms a symmetric matrix of pairwise covariances to the nearest covariance matrix. The result is a potentially sparse precision matrix that is resilient to moderate levels of cellwise contamination. Since this procedure is not based on subsampling, it scales well as the number of variables increases.
Submitted 8 January, 2015;
originally announced January 2015.