-
Dynamic Bayesian Item Response Model with Decomposition (D-BIRD): Modeling Cohort and Individual Learning Over Time
Authors:
Hansol Lee,
Jason B. Cho,
David S. Matteson,
Benjamin W. Domingue
Abstract:
We present D-BIRD, a Bayesian dynamic item response model for estimating student ability from sparse, longitudinal assessments. By decomposing ability into a cohort trend and individual trajectory, D-BIRD supports interpretable modeling of learning over time. We evaluate parameter recovery in simulation and demonstrate the model using real-world personalized learning data.
Submitted 26 June, 2025;
originally announced June 2025.
-
Smoothing Variances Across Time: Adaptive Stochastic Volatility
Authors:
Jason B. Cho,
David S. Matteson
Abstract:
We introduce a novel Bayesian framework for estimating time-varying volatility by extending the Random Walk Stochastic Volatility (RWSV) model with Dynamic Shrinkage Processes (DSP) in log-variances. Unlike the classical Stochastic Volatility (SV) or GARCH-type models with restrictive parametric stationarity assumptions, our proposed Adaptive Stochastic Volatility (ASV) model provides smooth yet dynamically adaptive estimates of evolving volatility and its uncertainty. We further enhance the model by incorporating a nugget effect, allowing it to flexibly capture small-scale variability while preserving smoothness elsewhere. We derive the theoretical properties of the global-local shrinkage prior DSP. Through simulation studies, we show that ASV exhibits remarkable misspecification resilience and low prediction error across various data-generating processes. Furthermore, ASV's capacity to yield locally smooth and interpretable estimates facilitates a clearer understanding of the underlying patterns and trends in volatility. As an extension, we develop the Bayesian Trend Filter with ASV (BTF-ASV) which allows joint modeling of the mean and volatility with abrupt changes. Finally, our proposed models are applied to time series data from finance, econometrics, and environmental science, highlighting their flexibility and broad applicability.
Submitted 4 June, 2025; v1 submitted 20 August, 2024;
originally announced August 2024.
-
Atomic Resolution Observations of Nanoparticle Surface Dynamics and Instabilities Enabled by Artificial Intelligence
Authors:
Peter A. Crozier,
Matan Leibovich,
Piyush Haluai,
Mai Tan,
Andrew M. Thomas,
Joshua Vincent,
Sreyas Mohan,
Adria Marcos Morales,
Shreyas A. Kulkarni,
David S. Matteson,
Yifan Wang,
Carlos Fernandez-Granda
Abstract:
Nanoparticle surface structural dynamics is believed to play a significant role in regulating functionalities such as diffusion, reactivity, and catalysis but the atomic-level processes are not well understood. Atomic resolution characterization of nanoparticle surface dynamics is challenging since it requires both high spatial and temporal resolution. Though ultrafast transmission electron microscopy (TEM) can achieve picosecond temporal resolution, it is limited to nanometer spatial resolution. On the other hand, with the high readout rate of new electron detectors, conventional TEM has the potential to visualize atomic structure with millisecond time resolutions. However, the need to limit electron dose rates to reduce beam damage yields millisecond images that are dominated by noise, obscuring structural details. Here we show that a newly developed unsupervised denoising framework based on artificial intelligence enables observations of metal nanoparticle surfaces with time resolutions down to 10 ms at moderate electron dose. On this timescale, we find that many nanoparticle surfaces continuously transition between ordered and disordered configurations. The associated stress fields can penetrate below the surface leading to defect formation and destabilization making the entire nanoparticle fluxional. Combining this unsupervised denoiser with electron microscopy greatly improves spatio-temporal characterization capabilities, opening a new window for future exploration of atomic-level structural dynamics in materials.
Submitted 2 August, 2024; v1 submitted 24 July, 2024;
originally announced July 2024.
-
Vector AutoRegressive Moving Average Models: A Review
Authors:
Marie-Christine Düker,
David S. Matteson,
Ruey S. Tsay,
Ines Wilms
Abstract:
Vector AutoRegressive Moving Average (VARMA) models form a powerful and general model class for analyzing dynamics among multiple time series. While VARMA models encompass the Vector AutoRegressive (VAR) models, their popularity in empirical applications is dominated by the latter. Can this phenomenon be explained fully by the simplicity of VAR models? Perhaps many users of VAR models have not fully appreciated what VARMA models can provide. The goal of this review is to provide a comprehensive resource for researchers and practitioners seeking insights into the advantages and capabilities of VARMA models. We start by reviewing the identification challenges inherent to VARMA models, covering both classical and modern identification schemes, and we continue along the same lines regarding estimation, specification, and diagnosis of VARMA models. We then highlight the practical utility of VARMA models in terms of Granger Causality analysis, forecasting, and structural analysis, as well as recent advances and extensions of VARMA models to further facilitate their adoption in practice. Finally, we discuss some interesting future research directions where VARMA models can fulfill their potential in applications as compared to their subclass of VAR models.
Submitted 28 June, 2024;
originally announced June 2024.
-
Bayesian changepoint detection via logistic regression and the topological analysis of image series
Authors:
Andrew M. Thomas,
Michael Jauch,
David S. Matteson
Abstract:
We present a Bayesian method for multivariate changepoint detection that allows for simultaneous inference on the location of a changepoint and the coefficients of a logistic regression model for distinguishing pre-changepoint data from post-changepoint data. In contrast to many methods for multivariate changepoint detection, the proposed method is applicable to data of mixed type and avoids strict assumptions regarding the distribution of the data and the nature of the change. The regression coefficients provide an interpretable description of a potentially complex change. For posterior inference, the model admits a simple Gibbs sampling algorithm based on Pólya-gamma data augmentation. We establish conditions under which the proposed method is guaranteed to recover the true underlying changepoint. As a testing ground for our method, we consider the problem of detecting topological changes in time series of images. We demonstrate that our proposed method BCLR, combined with a topological feature embedding, performs well on both simulated and real image data. The method also successfully recovers the location and nature of changes in more traditional changepoint tasks.
Submitted 7 March, 2025; v1 submitted 5 January, 2024;
originally announced January 2024.
-
Locally Adaptive Shrinkage Priors for Trends and Breaks in Count Time Series
Authors:
Toryn L. J. Schafer,
David S. Matteson
Abstract:
Non-stationary count time series characterized by features such as abrupt changes and fluctuations about the trend arise in many scientific domains, including biophysics, ecology, energy, epidemiology, and social science. Current approaches for integer-valued time series lack the flexibility to capture local transient features, while more flexible models for continuous data types are inadequate for universal applications to integer-valued responses such as settings with small counts. We present a modeling framework, the negative binomial Bayesian trend filter (NB-BTF), that offers an adaptive model-based solution to capturing multiscale features with valid integer-valued inference for trend filtering. The framework is a hierarchical Bayesian model with a dynamic global-local shrinkage process. The flexibility of the global-local process allows for the necessary local regularization while the temporal dependence induces a locally smooth trend. In simulation, the NB-BTF outperforms a number of alternative trend filtering methods. We then demonstrate the method on weekly power outage frequency in Massachusetts townships. Power outage frequency is characterized by a nominal low level with occasional spikes. These illustrations show the estimation of a smooth, non-stationary trend with adequate uncertainty quantification.
Submitted 31 August, 2023;
originally announced September 2023.
-
Dynamic Atomic Column Detection in Transmission Electron Microscopy Videos via Ridge Estimation
Authors:
Yuchen Xu,
Andrew M. Thomas,
Peter A. Crozier,
David S. Matteson
Abstract:
Ridge detection is a classical tool to extract curvilinear features in image processing. As such, it has great promise in applications to material science problems; specifically, for trend filtering relatively stable atom-shaped objects in image sequences, such as Transmission Electron Microscopy (TEM) videos. Standard analysis of TEM videos is limited to frame-by-frame object recognition. We instead harness temporal correlation across frames through simultaneous analysis of long image sequences, specified as a spatio-temporal image tensor. We define new ridge detection algorithms to non-parametrically estimate explicit trajectories of atomic-level object locations as a continuous function of time. Our approach is specially tailored to handle temporal analysis of objects that seemingly stochastically disappear and subsequently reappear throughout a sequence. We demonstrate that the proposed method is highly effective and efficient in simulation scenarios, and delivers notable performance improvements in TEM experiments compared to other material science benchmarks.
Submitted 1 February, 2023;
originally announced February 2023.
-
Non-fungible token transactions: data and challenges
Authors:
Jason B. Cho,
Sven Serneels,
David S. Matteson
Abstract:
Non-fungible tokens (NFTs) have recently emerged as a novel blockchain-hosted financial asset class that has attracted major transaction volumes. Investment decisions rely on data and on adequate preprocessing and analysis of those data. Owing both to the non-fungible nature of the tokens and to a blockchain being the primary data source, NFT transaction data pose several challenges not commonly encountered in traditional financial data. Using data that consist of the transaction history of eight highly valued NFT collections, a selection of such challenges is illustrated. These are: price differentiation by token traits, the possible existence of lateral swaps and wash trades in the transaction history, and, finally, severe volatility. While this paper merely scratches the surface of how data analytics can be applied in this context, the data and challenges laid out here may present opportunities for future research on the topic.
Submitted 13 October, 2022;
originally announced October 2022.
-
Feature detection and hypothesis testing for extremely noisy nanoparticle images using topological data analysis
Authors:
Andrew M. Thomas,
Peter A. Crozier,
Yuchen Xu,
David S. Matteson
Abstract:
We propose a flexible algorithm for feature detection and hypothesis testing in images with ultra low signal-to-noise ratio using cubical persistent homology. Our main application is in the identification of atomic columns and other features in transmission electron microscopy (TEM). Cubical persistent homology is used to identify local minima and their size in subregions in the frames of nanoparticle videos, which are hypothesized to correspond to relevant atomic features. We compare the performance of our algorithm to other employed methods for the detection of columns and their intensity. Additionally, Monte Carlo goodness-of-fit testing using real valued summaries of persistence diagrams derived from smoothed images (generated from pixels residing in the vacuum region of an image) is developed and employed to identify whether or not the proposed atomic features generated by our algorithm are due to noise. Using these summaries derived from the generated persistence diagrams, one can produce univariate time series for the nanoparticle videos, thus providing a means for assessing fluxional behavior. A guarantee on the false discovery rate for multiple Monte Carlo testing of identical hypotheses is also established.
Submitted 18 January, 2023; v1 submitted 27 September, 2022;
originally announced September 2022.
-
K-ARMA Models for Clustering Time Series Data
Authors:
Derek O. Hoare,
David S. Matteson,
Martin T. Wells
Abstract:
We present an approach to clustering time series data using a model-based generalization of the K-Means algorithm which we call K-Models. We prove the convergence of this general algorithm and relate it to the hard-EM algorithm for mixture modeling. We first apply our method to an AR($p$) clustering example and show how the clustering algorithm can be made robust to outliers using a least absolute deviations criterion. We then build our clustering algorithm up for ARMA($p,q$) models and extend this to ARIMA($p,d,q$) models. We develop a goodness of fit statistic for the models fitted to clusters based on the Ljung-Box statistic. We perform experiments with simulated data to show how the algorithm can be used for outlier detection and for detecting distributional drift, and we discuss the impact of the initialization method on empty clusters. We also perform experiments on real data which show that our method is competitive with other existing methods for similar time series clustering tasks.
Submitted 30 June, 2022;
originally announced July 2022.
-
Interpretable Latent Variables in Deep State Space Models
Authors:
Haoxuan Wu,
David S. Matteson,
Martin T. Wells
Abstract:
We introduce a new version of deep state-space models (DSSMs) that combines a recurrent neural network with a state-space framework to forecast time series data. The model estimates the observed series as functions of latent variables that evolve non-linearly through time. Due to the complexity and non-linearity inherent in DSSMs, previous works on DSSMs typically produced latent variables that are very difficult to interpret. Our paper focuses on producing interpretable latent parameters with two key modifications. First, we simplify the predictive decoder by restricting the response variables to be a linear transformation of the latent variables plus some noise. Second, we utilize shrinkage priors on the latent variables to reduce redundancy and improve robustness. These changes make the latent variables much easier to understand and allow us to interpret the resulting latent variables as random effects in a linear mixed model. We show on two public benchmark datasets that the resulting model improves forecasting performance.
Submitted 19 May, 2022; v1 submitted 3 March, 2022;
originally announced March 2022.
-
Bayesian Spillover Graphs for Dynamic Networks
Authors:
Grace Deng,
David S. Matteson
Abstract:
We present Bayesian Spillover Graphs (BSG), a novel method for learning temporal relationships, identifying critical nodes, and quantifying uncertainty for multi-horizon spillover effects in a dynamic system. BSG leverages both an interpretable framework via forecast error variance decompositions (FEVD) and comprehensive uncertainty quantification via Bayesian time series models to contextualize temporal relationships in terms of systemic risk and prediction variability. Forecast horizon hyperparameter $h$ allows for learning both short-term and equilibrium state network behaviors. Experiments for identifying source and sink nodes under various graph and error specifications show significant performance gains against state-of-the-art Bayesian Networks and deep-learning baselines. Applications to real-world systems also showcase BSG as an exploratory analysis tool for uncovering indirect spillovers and quantifying systemic risk.
Submitted 16 June, 2022; v1 submitted 3 March, 2022;
originally announced March 2022.
-
Drift vs Shift: Decoupling Trends and Changepoint Analysis
Authors:
Haoxuan Wu,
Toryn L. J. Schafer,
Sean Ryan,
David S. Matteson
Abstract:
We introduce a new approach for decoupling trends (drift) and changepoints (shifts) in time series. Our locally adaptive model-based approach for robustly decoupling combines Bayesian trend filtering and machine learning based regularization. An over-parameterized Bayesian dynamic linear model (DLM) is first applied to characterize drift. Then a weighted penalized likelihood estimator is paired with the estimated DLM posterior distribution to identify shifts. We show how Bayesian DLMs specified with so-called shrinkage priors can provide smooth estimates of underlying trends in the presence of complex noise components. However, their inability to shrink exactly to zero inhibits direct changepoint detection. In contrast, penalized likelihood methods are highly effective in locating changepoints. However, they require data with simple patterns in both signal and noise. The proposed decoupling approach combines the strengths of both, i.e. the flexibility of Bayesian DLMs with the hard thresholding property of penalized likelihood estimators, to provide changepoint analysis in complex, modern settings. The proposed framework is outlier robust and can identify a variety of changes, including in mean and slope. It is also easily extended for analysis of parameter shifts in time-varying parameter models like dynamic regressions. We illustrate the flexibility and contrast the performance and robustness of our approach with several alternative methods across a wide range of simulations and application examples.
Submitted 6 January, 2024; v1 submitted 17 January, 2022;
originally announced January 2022.
-
Analysis of animal-related electric outages using species distribution models and community science data
Authors:
Mei-Ling E. Feng,
Olukunle O. Owolabi,
Toryn L. J. Schafer,
Sanhita Sengupta,
Lan Wang,
David S. Matteson,
Judy P. Che-Castaldo,
Deborah A. Sunter
Abstract:
Animal-related outages (AROs) are a prevalent form of outages in electrical distribution systems. Animal-infrastructure interactions vary across focal species and regions, underlining the need to study the animal-outage relationship in more species and diverse systems. Animal activity has been used as an indicator of reliability in the electrical grid system and to describe temporal patterns in AROs. However, these ARO models have been limited by a lack of available estimates of species activity, instead approximating activity based on seasonal and weather patterns in animal-related outage records and characteristics of broad taxonomic groups, e.g., squirrels. We highlight publicly available resources to fill the ecological data gap that is limiting joint analyses between the ecology and energy sectors. Species distribution models (SDMs), a common technique to model the distribution of a species across geographic space and time, paired with data sourced from eBird, a community science database for bird observations, provided us with species-specific estimates of activity to model spatio-temporal patterns of AROs. These flexible, species-specific estimates can allow future animal indicators of grid reliability to be investigated in more diverse regions and ecological communities, providing a better understanding of the variation that exists in the animal-outage relationship. AROs were best modeled by accounting for multiple outage-prone species activity patterns and their unique relationships with seasonality and habitat availability. Different species were important for modeling outages in different landscapes and seasons depending on their distribution and migration behavior. We recommend that future models of AROs include species-specific activity data that account for the diverse spectrum of spatio-temporal activity patterns that outage-prone animals exhibit.
Submitted 22 December, 2021;
originally announced December 2021.
-
Role of Variable Renewable Energy Penetration on Electricity Price and its Volatility Across Independent System Operators in the United States
Authors:
Olukunle O. Owolabi,
Toryn L. J. Schafer,
Georgia E. Smits,
Sanhita Sengupta,
Sean E. Ryan,
Lan Wang,
David S. Matteson,
Mila Getmansky Sherman,
Deborah A. Sunter
Abstract:
The U.S. electrical grid has undergone substantial transformation with increased penetration of wind and solar -- forms of variable renewable energy (VRE). Despite the benefits of VRE for decarbonization, it has garnered some controversy for inducing unwanted effects in regional electricity markets. In this study, we examine the role of VRE penetration in system electricity price and price volatility based on hourly, real-time, historical data from six Independent System Operators (ISOs) in the U.S. using quantile and skew t-distribution regressions. After correcting for temporal effects, we found that an increase in VRE penetration is associated with a decrease in system electricity price in all ISOs studied. An increase in VRE penetration is also associated with a decrease in temporal price volatility in five of the six ISOs studied. The relationships are non-linear. These results are consistent with modern portfolio theory, in which diverse volatile assets may lead to more stable and less risky portfolios.
Submitted 28 November, 2022; v1 submitted 10 November, 2021;
originally announced December 2021.
-
Log-Gaussian Cox Process Modeling of Large Spatial Lightning Data using Spectral and Laplace Approximations
Authors:
Megan L. Gelsinger,
Maryclare Griffin,
David S. Matteson,
Joseph Guinness
Abstract:
Lightning is a destructive and highly visible product of severe storms, yet there is still much to be learned about the conditions under which lightning is most likely to occur. The GOES-16 and GOES-17 satellites, launched in 2016 and 2018 by NOAA and NASA, collect a wealth of data regarding individual lightning strike occurrence and potentially related atmospheric variables. The acute nature and inherent spatial correlation in lightning data render standard regression analyses inappropriate. Further, computational considerations are foregrounded by the desire to analyze the immense and rapidly increasing volume of lightning data. We present a new computationally feasible method that combines spectral and Laplace approximations in an EM algorithm, denoted SLEM, to fit the widely popular log-Gaussian Cox process model to large spatial point pattern datasets. In simulations, we find SLEM is competitive with contemporary techniques in terms of speed and accuracy. When applied to two lightning datasets, SLEM provides better out-of-sample prediction scores and quicker runtimes, suggesting its particular usefulness for analyzing lightning data, which tend to have sparse signals.
Submitted 30 November, 2021;
originally announced November 2021.
-
Spatial Correlation in Weather Forecast Accuracy: A Functional Time Series Approach
Authors:
Phillip A. Jang,
David S. Matteson
Abstract:
A functional time series approach is proposed for investigating spatial correlation in daily maximum temperature forecast errors for 111 cities spread across the U.S. The modelling of spatial correlation is most fruitful for longer forecast horizons, and becomes less relevant as the forecast horizon shrinks towards zero. For 6-day-ahead forecasts, the functional approach uncovers interpretable regional spatial effects, and captures the higher variance observed in inland cities versus coastal cities, as well as the higher variance observed in mountain and midwest states. The functional approach also naturally handles missing data through modelling a continuum, and can be implemented efficiently by exploiting the sparsity induced by a B-spline basis.
The temporal dependence in the data is modeled through temporal dependence in functional basis coefficients. Independent first order autoregressions with generalized autoregressive conditional heteroskedasticity [AR(1)+GARCH(1,1)] and Student-t innovations work well to capture the persistence of basis coefficients over time and the seasonal heteroskedasticity reflecting higher variance in winter. Through exploiting autocorrelation in the basis coefficients, the functional time series approach also yields a method for improving weather forecasts and uncertainty quantification. The resulting method corrects for bias in the weather forecasts, while reducing the error variance.
Submitted 22 November, 2021;
originally announced November 2021.
-
IB-GAN: A Unified Approach for Multivariate Time Series Classification under Class Imbalance
Authors:
Grace Deng,
Cuize Han,
Tommaso Dreossi,
Clarence Lee,
David S. Matteson
Abstract:
Classification of large multivariate time series with strong class imbalance is an important task in real-world applications. Standard methods of class weights, oversampling, or parametric data augmentation do not always yield significant improvements for predicting minority classes of interest. Non-parametric data augmentation with Generative Adversarial Networks (GANs) offers a promising solution. We propose Imputation Balanced GAN (IB-GAN), a novel method that joins data augmentation and classification in a one-step process via an imputation-balancing approach. IB-GAN uses imputation and resampling techniques to generate higher quality samples from randomly masked vectors than from white noise, and augments classification through a class-balanced set of real and synthetic samples. The imputation hyperparameter $p_{miss}$ allows for regularization of classifier variability by tuning innovations introduced via generator imputation. IB-GAN is simple to train and model-agnostic, pairing any deep learning classifier with a generator-discriminator duo and resulting in higher accuracy for under-observed classes. Empirical experiments on open-source UCR data and a proprietary 90K product dataset show significant performance gains against state-of-the-art parametric and GAN baselines.
Submitted 14 October, 2021;
originally announced October 2021.
-
Mixture representations and Bayesian nonparametric inference for likelihood ratio ordered distributions
Authors:
Michael Jauch,
Andrés F. Barrientos,
Víctor Peña,
David S. Matteson
Abstract:
In this article, we introduce mixture representations for likelihood ratio ordered distributions. Essentially, the ratio of two probability densities, or mass functions, is monotone if and only if one can be expressed as a mixture of one-sided truncations of the other. To illustrate the practical value of the mixture representations, we address the problem of density estimation for likelihood ratio ordered distributions. In particular, we propose a nonparametric Bayesian solution which takes advantage of the mixture representations. The prior distribution is constructed from Dirichlet process mixtures and has large support on the space of pairs of densities satisfying the monotone ratio constraint. Posterior consistency holds under reasonable conditions on the prior specification and the true unknown densities. To our knowledge, this is the first posterior consistency result in the literature on order constrained inference. With a simple modification to the prior distribution, we can test the equality of two distributions against the alternative of likelihood ratio ordering. We develop a Markov chain Monte Carlo algorithm for posterior inference and demonstrate the method in a biomedical application.
Submitted 26 October, 2023; v1 submitted 10 October, 2021;
originally announced October 2021.
-
A Survey of Estimation Methods for Sparse High-dimensional Time Series Models
Authors:
Sumanta Basu,
David S. Matteson
Abstract:
High-dimensional time series datasets are becoming increasingly common in many areas of biological and social sciences. Some important applications include gene regulatory network reconstruction using time course gene expression data, brain connectivity analysis from neuroimaging data, structural analysis of a large panel of macroeconomic indicators, and studying linkages among financial firms for more robust financial regulation. These applications have led to renewed interest in developing principled statistical methods and theory for estimating large time series models given only a relatively small number of temporally dependent samples. Sparse modeling approaches have gained popularity over the last two decades in statistics and machine learning for their interpretability and predictive accuracy. Although there is a rich literature on several sparsity inducing methods when samples are independent, research on the statistical properties of these methods for estimating time series models is still in progress.
We survey some recent advances in this area, focusing on empirically successful lasso based estimation methods for two canonical multivariate time series models - stochastic regression and vector autoregression. We discuss key technical challenges arising in high-dimensional time series analysis and outline several interesting research directions.
Submitted 30 July, 2021;
originally announced July 2021.
-
Graphical Influence Diagnostics for Changepoint Models
Authors:
Ines Wilms,
Rebecca Killick,
David S. Matteson
Abstract:
Changepoint models enjoy a wide appeal in a variety of disciplines to model the heterogeneity of ordered data. Graphical influence diagnostics to characterize the influence of single observations on changepoint models are, however, lacking. We address this gap by developing a framework for investigating instabilities in changepoint segmentations and assessing the influence of single observations on various outputs of a changepoint analysis. We construct graphical diagnostic plots that allow practitioners to assess whether instabilities occur; how and where they occur; and to detect influential individual observations triggering instability. We analyze well-log data to illustrate how such influence diagnostic plots can be used in practice to reveal features of the data that may otherwise remain hidden.
Submitted 22 July, 2021;
originally announced July 2021.
-
Classifying Contaminated Cell Cultures using Time Series Features
Authors:
Laura L. Tupper,
Charles R. Keese,
David S. Matteson
Abstract:
We examine the use of time series data, derived from Electric Cell-substrate Impedance Sensing (ECIS), to differentiate between standard mammalian cell cultures and those infected with a mycoplasma organism. With the goal of interpretable results, we perform low-dimensional feature-based classification, extracting application-relevant features from the ECIS time courses. We can achieve very high classification accuracy using only two features, which depend on the cell line under examination. Initial results also show the existence of experimental variation between plates and suggest types of features that may prove more robust to such variation. Our paper is the first to perform a broad examination of ECIS time course features in the context of detecting contamination; to combine different types of features to achieve classification accuracy while preserving interpretability; and to describe and suggest possibilities for ameliorating plate-to-plate variation.
Submitted 22 February, 2022; v1 submitted 14 May, 2021;
originally announced May 2021.
-
Testing Simultaneous Diagonalizability
Authors:
Yuchen Xu,
Marie-Christine Düker,
David S. Matteson
Abstract:
This paper proposes novel methods to test for simultaneous diagonalization of possibly asymmetric matrices. Motivated by various applications, a two-sample test as well as a generalization for multiple matrices are proposed. A partial version of the test is also studied to check whether a partial set of eigenvectors is shared across samples. Additionally, a novel algorithm for the considered testing methods is introduced. Simulation studies demonstrate favorable performance for all designs. Finally, the theoretical results are utilized to decouple vector autoregression models into multiple univariate time series, and to test for the same stationary distribution in recurrent Markov chains. These applications are demonstrated using macroeconomic indices of 8 countries and streamflow data, respectively.
Submitted 19 January, 2021;
originally announced January 2021.
-
Critical Risk Indicators (CRIs) for the electric power grid: A survey and discussion of interconnected effects
Authors:
Judy P. Che-Castaldo,
Rémi Cousin,
Stefani Daryanto,
Grace Deng,
Mei-Ling E. Feng,
Rajesh K. Gupta,
Dezhi Hong,
Ryan M. McGranaghan,
Olukunle O. Owolabi,
Tianyi Qu,
Wei Ren,
Toryn L. J. Schafer,
Ashutosh Sharma,
Chaopeng Shen,
Mila Getmansky Sherman,
Deborah A. Sunter,
Lan Wang,
David S. Matteson
Abstract:
The electric power grid is a critical societal resource connecting multiple infrastructural domains such as agriculture, transportation, and manufacturing. The electrical grid as an infrastructure is shaped by human activity and public policy in terms of demand and supply requirements. Further, the grid is subject to changes and stresses due to solar weather, climate, hydrology, and ecology. The emerging interconnected and complex network dependencies make such interactions increasingly dynamic, causing potentially large swings and thus presenting new challenges to manage the coupled human-natural system. This paper provides a survey of models and methods that seek to explore the significant interconnected impact of the electric power grid and interdependent domains. We also provide relevant critical risk indicators (CRIs) across diverse domains that may influence electric power grid risks, including climate, ecology, hydrology, finance, space weather, and agriculture. We discuss the convergence of indicators from individual domains to explore possible systemic risk, i.e., holistic risk arising from cross-domain interconnections. Our study provides an important first step towards data-driven analysis and predictive modeling of risks in the coupled interconnected systems. Further, we propose a compositional approach to risk assessment that incorporates diverse domain expertise and information, data science, and computer science to identify domain-specific CRIs and their union in systemic risk indicators.
Submitted 9 June, 2021; v1 submitted 19 January, 2021;
originally announced January 2021.
-
Developing and Evaluating Deep Neural Network-based Denoising for Nanoparticle TEM Images with Ultra-low Signal-to-Noise
Authors:
Joshua L. Vincent,
Ramon Manzorro,
Sreyas Mohan,
Binh Tang,
Dev Y. Sheth,
Eero P. Simoncelli,
David S. Matteson,
Carlos Fernandez-Granda,
Peter A. Crozier
Abstract:
A deep convolutional neural network has been developed to denoise atomic-resolution TEM image datasets of nanoparticles acquired using direct electron counting detectors, for applications where the image signal is severely limited by shot noise. The network was applied to a model system of CeO2-supported Pt nanoparticles. We leverage multislice image simulations to generate a large and flexible dataset for training and testing the network. The proposed network outperforms state-of-the-art denoising methods by a significant margin both on simulated and experimental test data. Factors contributing to the performance are identified, including most importantly (a) the geometry of the images used during training and (b) the size of the network's receptive field. Through a gradient-based analysis, we investigate the mechanisms learned by the network to denoise experimental images. This shows that the network exploits global and local information in the noisy measurements, for example, by adapting its filtering approach when it encounters atomic-level defects at the nanoparticle surface. Extensive analysis has been done to characterize the network's ability to correctly predict the exact atomic structure at the nanoparticle surface. Finally, we develop an approach based on the log-likelihood ratio test that provides a quantitative measure of the agreement between the noisy observation and the atomic-level structure in the network-denoised image.
Submitted 17 March, 2021; v1 submitted 19 January, 2021;
originally announced January 2021.
-
Clustering Future Scenarios Based on Predicted Range Maps
Authors:
Matthew Davidow,
Cory Merow,
Judy Che-Castaldo,
Toryn Schafer,
Marie-Christine Duker,
Derek Corcoran,
David Matteson
Abstract:
Predictions of biodiversity trajectories under climate change are crucial in order to act effectively in maintaining the diversity of species. In many ecological applications, future predictions are made under various global warming scenarios as described by a range of different climate models. The outputs of these various predictions call for a reliable interpretation. We propose an interpretable and flexible two-step methodology to measure the similarity between predicted species range maps and cluster the future scenario predictions utilizing a spectral clustering technique. We find that clustering based on ecological impact (predicted species range maps) is mainly driven by the amount of warming. We contrast this with clustering based only on predicted climate features, which is driven mainly by climate models. The differences between these clusterings illustrate that it is crucial to incorporate ecological information to understand the relevant differences between climate models. The findings of this work can be used to better synthesize forecasts of biodiversity loss under the wide spectrum of results that emerge when considering potential future biodiversity loss.
Submitted 17 July, 2022; v1 submitted 18 January, 2021;
originally announced January 2021.
-
Group Linear non-Gaussian Component Analysis with Applications to Neuroimaging
Authors:
Yuxuan Zhao,
David S. Matteson,
Mary Beth Nebel,
Stewart H. Mostofsky,
Benjamin Risk
Abstract:
Independent component analysis (ICA) is an unsupervised learning method popular in functional magnetic resonance imaging (fMRI). Group ICA has been used to search for biomarkers in neurological disorders including autism spectrum disorder and dementia. However, current methods use a principal component analysis (PCA) step that may remove low-variance features. Linear non-Gaussian component analysis (LNGCA) enables simultaneous dimension reduction and feature estimation including low-variance features in single-subject fMRI. We present a group LNGCA model to extract group components shared by more than one subject and subject-specific components. To determine the total number of components in each subject, we propose a parametric resampling test that samples spatially correlated Gaussian noise to match the spatial dependence observed in data. In simulations, our estimated group components achieve higher accuracy compared to group ICA. We apply our method to a resting-state fMRI study on autism spectrum disorder in 342 children (252 typically developing, 90 with autism), where the group signals include resting-state networks. We find examples of group components that appear to exhibit different levels of temporal engagement in autism versus typically developing children, as revealed using group LNGCA. This novel approach to matrix decomposition is a promising direction for feature detection in neuroimaging.
Submitted 12 January, 2021;
originally announced January 2021.
-
Copula Quadrant Similarity for Anomaly Scores
Authors:
Matthew Davidow,
David Matteson
Abstract:
Practical anomaly detection requires applying numerous approaches due to the inherent difficulty of unsupervised learning. Direct comparison between complex or opaque anomaly detection algorithms is intractable; we instead propose a framework for associating the scores of multiple methods. Our aim is to answer the question: how should one measure the similarity between anomaly scores generated by different methods? The scoring crux is the extremes, which identify the most anomalous observations. A pair of algorithms are defined here to be similar if they assign their highest scores to roughly the same small fraction of observations. To formalize this, we propose a measure based on extremal similarity in scoring distributions through a novel upper quadrant modeling approach, and contrast it with tail and other dependence measures. We illustrate our method with simulated and real experiments, applying spectral methods to cluster multiple anomaly detection methods and to contrast our similarity measure with others. We demonstrate that our method is able to detect the clusters of anomaly detection algorithms to achieve an accurate and robust ensemble algorithm.
Submitted 6 January, 2021;
originally announced January 2021.
-
Regularized Estimation in High-Dimensional Vector Auto-Regressive Models using Spatio-Temporal Information
Authors:
Zhenzhong Wang,
Abolfazl Safikhani,
Zhengyuan Zhu,
David S. Matteson
Abstract:
A Vector Auto-Regressive (VAR) model is commonly used to model multivariate time series, and there are many penalized methods to handle high dimensionality. However, for spatio-temporal data, most methods do not take the spatial and temporal structure of the data into consideration, which may lead to unreliable network detection and inaccurate forecasts. This paper proposes a data-driven weighted l1 regularized approach for spatio-temporal VAR models. Extensive simulation studies are carried out to compare the proposed method with four existing methods for high-dimensional VAR models, demonstrating improvements of our method over others in parameter estimation, network detection and out-of-sample forecasts. We also apply our method to a traffic data set to evaluate its performance in a real application. In addition, we explore the theoretical properties of l1 regularized estimation of VAR models under the weakly sparse scenario, in which exact sparsity can be viewed as a special case. To the best of our knowledge, this direction has not been considered yet in the literature. For a general stationary VAR process, we derive non-asymptotic upper bounds on l1 regularized estimation errors under the weakly sparse scenario, provide the conditions for estimation consistency, and further simplify these conditions for a special VAR(1) case.
Submitted 17 December, 2020;
originally announced December 2020.
-
Trend and Variance Adaptive Bayesian Changepoint Analysis & Local Outlier Scoring
Authors:
Haoxuan Wu,
Toryn L. J. Schafer,
David S. Matteson
Abstract:
We adaptively estimate both changepoints and local outlier processes in a Bayesian dynamic linear model with global-local shrinkage priors in a novel model we call Adaptive Bayesian Changepoints with Outliers (ABCO). We utilize a state-space approach to identify a dynamic signal in the presence of outliers and measurement error with stochastic volatility. We find that global state equation parameters are inadequate for most real applications and we include local parameters to track noise at each time-step. This setup provides a flexible framework to detect unspecified changepoints in complex series, such as those with large interruptions in local trends, with robustness to outliers and heteroskedastic noise. Finally, we compare our algorithm against several alternatives to demonstrate its efficacy in diverse simulation scenarios and two empirical examples on the U.S. economy.
Submitted 13 March, 2024; v1 submitted 18 November, 2020;
originally announced November 2020.
-
Likelihood Inference for Possibly Non-Stationary Processes via Adaptive Overdifferencing
Authors:
Maryclare Griffin,
Gennady Samorodnitsky,
David S. Matteson
Abstract:
We make an observation that facilitates exact likelihood-based inference for the parameters of the popular ARFIMA model without requiring stationarity by allowing the upper bound $\bar{d}$ for the memory parameter $d$ to exceed $0.5$: estimating the parameters of a single non-stationary ARFIMA model is equivalent to estimating the parameters of a sequence of stationary ARFIMA models. This allows for the use of existing methods for evaluating the likelihood for an invertible and stationary ARFIMA model. This enables improved inference because many standard methods perform poorly when estimates are close to the boundary of the parameter space. It also allows us to leverage the wealth of likelihood approximations that have been introduced for estimating the parameters of a stationary process. We explore how estimation of the memory parameter $d$ depends on the upper bound $\bar{d}$ and introduce adaptive procedures for choosing $\bar{d}$. We show via simulation how our adaptive procedures estimate the memory parameter well, relative to existing alternatives, when the true value is as large as 2.5.
Submitted 9 January, 2025; v1 submitted 8 November, 2020;
originally announced November 2020.
-
Extended Missing Data Imputation via GANs for Ranking Applications
Authors:
Grace Deng,
Cuize Han,
David S. Matteson
Abstract:
We propose Conditional Imputation GAN, an extended missing data imputation method based on Generative Adversarial Networks (GANs). The motivating use case is learning-to-rank, the cornerstone of modern search, recommendation system, and information retrieval applications. Empirical ranking datasets do not always follow standard Gaussian distributions or a Missing Completely At Random (MCAR) mechanism, which are standard assumptions of classic missing data imputation methods. Our methodology provides a simple solution that offers compatible imputation guarantees while relaxing assumptions for missing mechanisms and sidestepping the approximation of intractable distributions to improve imputation quality. We prove that the optimal GAN imputation is achieved for Extended Missing At Random (EMAR) and Extended Always Missing At Random (EAMAR) mechanisms, beyond the naive MCAR. Our method demonstrates the highest imputation quality on the open-source Microsoft Research Ranking (MSR) Dataset and a synthetic ranking dataset compared to state-of-the-art benchmarks and across various feature distributions. Using a proprietary Amazon Search ranking dataset, we also demonstrate comparable ranking quality metrics for ranking models trained on GAN-imputed data compared to ground-truth data.
Submitted 10 November, 2021; v1 submitted 3 November, 2020;
originally announced November 2020.
-
Deep Denoising For Scientific Discovery: A Case Study In Electron Microscopy
Authors:
Sreyas Mohan,
Ramon Manzorro,
Joshua L. Vincent,
Binh Tang,
Dev Yashpal Sheth,
Eero P. Simoncelli,
David S. Matteson,
Peter A. Crozier,
Carlos Fernandez-Granda
Abstract:
Denoising is a fundamental challenge in scientific imaging. Deep convolutional neural networks (CNNs) provide the current state of the art in denoising natural images, where they produce impressive results. However, their potential has barely been explored in the context of scientific imaging. Denoising CNNs are typically trained on real natural images artificially corrupted with simulated noise. In contrast, in scientific applications, noiseless ground-truth images are usually not available. To address this issue, we propose a simulation-based denoising (SBD) framework, in which CNNs are trained on simulated images. We test the framework on data obtained from transmission electron microscopy (TEM), an imaging technique with widespread applications in material science, biology, and medicine. SBD outperforms existing techniques by a wide margin on a simulated benchmark dataset, as well as on real data. Apart from the denoised images, SBD generates likelihood maps to visualize the agreement between the structure of the denoised image and the observed data. Our results reveal shortcomings of state-of-the-art denoising architectures, such as their small field-of-view: substantially increasing the field-of-view of the CNNs allows them to exploit non-local periodic patterns in the data, which is crucial at high noise levels. In addition, we analyze the generalization capability of SBD, demonstrating that the trained networks are robust to variations of imaging parameters and of the underlying signal structure. Finally, we release the first publicly available benchmark dataset of TEM images, containing 18,000 examples.
Submitted 13 July, 2021; v1 submitted 24 October, 2020;
originally announced October 2020.
-
PyXtal FF: a Python Library for Automated Force Field Generation
Authors:
Howard Yanxon,
David Zagaceta,
Binh Tang,
David Matteson,
Qiang Zhu
Abstract:
We present PyXtal FF, a package based on the Python programming language, for developing machine learning potentials (MLPs). The aim of PyXtal FF is to promote the application of atomistic simulations by providing several choices of structural descriptors and machine learning regressions in one platform. Based on the given choice of structural descriptors (including the atom-centered symmetry functions, embedded atom density, SO4 bispectrum, and smooth SO3 power spectrum), PyXtal FF can train the MLPs with either generalized linear regression or neural network models, by simultaneously minimizing the errors of energy/forces/stress tensors in comparison with the data from the ab-initio simulation. The trained MLP model from PyXtal FF is interfaced with the Atomic Simulation Environment (ASE) package, which allows different types of light-weight simulations such as geometry optimization, molecular dynamics simulation, and physical properties prediction. Finally, we will illustrate the performance of PyXtal FF by applying it to investigate several material systems, including the bulk SiO2, high entropy alloy NbMoTaW, and elemental Pt for general purposes. Full documentation of PyXtal FF is available at https://pyxtal-ff.readthedocs.io.
Submitted 25 July, 2020;
originally announced July 2020.
-
Modeling a Nonlinear Biophysical Trend Followed by Long-Memory Equilibrium with Unknown Change Point
Authors:
Wenyu Zhang,
Maryclare Griffin,
David S. Matteson
Abstract:
Measurements of many biological processes are characterized by an initial trend period followed by an equilibrium period. Scientists may wish to quantify features of the two periods, as well as the timing of the change point. Specifically, we are motivated by problems in the study of electrical cell-substrate impedance sensing (ECIS) data. ECIS is a popular new technology which measures cell behavior non-invasively. Previous studies using ECIS data have found that different cell types can be classified by their equilibrium behavior. However, it can be challenging to identify when equilibrium has been reached, and to quantify the relevant features of cells' equilibrium behavior. In this paper, we assume that measurements during the trend period are independent deviations from a smooth nonlinear function of time, and that measurements during the equilibrium period are characterized by a simple long memory model. We propose a method to simultaneously estimate the parameters of the trend and equilibrium processes and locate the change point between the two. We find that this method performs well in simulations and in practice. When applied to ECIS data, it produces estimates of change points and measures of cell equilibrium behavior which offer improved classification of infected and uninfected cells.
Submitted 19 September, 2020; v1 submitted 18 July, 2020;
originally announced July 2020.
-
Graph-Based Continual Learning
Authors:
Binh Tang,
David S. Matteson
Abstract:
Despite significant advances, continual learning models still suffer from catastrophic forgetting when exposed to incrementally available data from non-stationary distributions. Rehearsal approaches alleviate the problem by maintaining and replaying a small episodic memory of previous samples, often implemented as an array of independent memory slots. In this work, we propose to augment such an array with a learnable random graph that captures pairwise similarities between its samples, and use it not only to learn new tasks but also to guard against forgetting. Empirical results on several benchmark datasets show that our model consistently outperforms recently proposed baselines for task-free continual learning.
Submitted 28 February, 2021; v1 submitted 9 July, 2020;
originally announced July 2020.
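The rehearsal idea in the abstract above, a small episodic memory kept as an array of independent slots and replayed during training, can be sketched as follows. This is a minimal illustration only: the class name `EpisodicMemory`, the use of reservoir sampling to decide which samples stay in memory, and the `replay` helper are assumptions made for the example and are not taken from the paper. The learnable random graph that the paper adds on top of the slots is not modeled here.

```python
import random


class EpisodicMemory:
    """Fixed-size rehearsal buffer (hypothetical helper, not the paper's code).

    Slots are filled by reservoir sampling, so every sample seen so far has
    the same probability of being held in memory.
    """

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.slots = []          # independent memory slots
        self.seen = 0            # number of samples offered so far
        self.rng = random.Random(seed)

    def add(self, sample):
        """Offer one incoming sample to the memory."""
        self.seen += 1
        if len(self.slots) < self.capacity:
            self.slots.append(sample)
        else:
            # Keep the new sample with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.slots[j] = sample

    def replay(self, k):
        """Draw up to k stored samples to mix into the current batch."""
        return self.rng.sample(self.slots, min(k, len(self.slots)))
```

In use, each incoming sample would be passed to `add`, and each training step would combine the new data with the output of `replay`.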
-
Factor Analysis of Mixed Data for Anomaly Detection
Authors:
Matthew Davidow,
David S. Matteson
Abstract:
Anomaly detection aims to identify observations that deviate from the typical pattern of data. Anomalous observations may correspond to financial fraud, health risks, or incorrectly measured data in practice. We show that detecting anomalies in high-dimensional mixed data is enhanced by first embedding the data and then assessing an anomaly scoring scheme. We focus on unsupervised detection and the continuous and categorical (mixed) variable case. We propose a kurtosis-weighted Factor Analysis of Mixed Data for anomaly detection, FAMDAD, to obtain a continuous embedding for anomaly scoring. We illustrate that anomalies are highly separable in the first and last few ordered dimensions of this space, and test various anomaly scoring schemes within this subspace. Results are illustrated for both simulated and real datasets, and the proposed approach (FAMDAD) is highly accurate for high-dimensional mixed data throughout these diverse scenarios.
Submitted 25 May, 2020;
originally announced May 2020.
-
ABACUS: Unsupervised Multivariate Change Detection via Bayesian Source Separation
Authors:
Wenyu Zhang,
Daniel Gilbert,
David Matteson
Abstract:
Change detection involves segmenting sequential data such that observations in the same segment share some desired properties. Multivariate change detection continues to be a challenging problem due to the variety of ways change points can be correlated across channels and the potentially poor signal-to-noise ratio on individual channels. In this paper, we are interested in locating additive outliers (AO) and level shifts (LS) in the unsupervised setting. We propose ABACUS, Automatic BAyesian Changepoints Under Sparsity, a Bayesian source separation technique to recover latent signals while also detecting changes in model parameters. Multi-level sparsity achieves both dimension reduction and modeling of signal changes. We show ABACUS has competitive or superior performance in simulation studies against state-of-the-art change detection methods and established latent variable models. We also illustrate ABACUS on two real applications, modeling genomic profiles and analyzing household electricity consumption.
Submitted 14 October, 2018;
originally announced October 2018.
-
Testing for Conditional Mean Independence with Covariates through Martingale Difference Divergence
Authors:
Ze Jin,
Xiaohan Yan,
David S. Matteson
Abstract:
A crucial problem in statistics is to decide whether additional variables are needed in a regression model. We propose a new multivariate test to investigate the conditional mean independence of Y given X conditioning on some known effect Z, i.e., E(Y|X, Z) = E(Y|Z). Assuming that E(Y|Z) and Z are linearly related, we reformulate an equivalent notion of conditional mean independence through transformation, which is approximated in practice. We apply the martingale difference divergence (Shao and Zhang, 2014) to measure conditional mean dependence, and show that the estimation error from approximation is negligible, as it has no impact on the asymptotic distribution of the test statistic under some regularity assumptions. The implementation of our test is demonstrated by both simulations and a financial data example.
Submitted 17 May, 2018;
originally announced May 2018.
-
Independent Component Analysis via Energy-based and Kernel-based Mutual Dependence Measures
Authors:
Ze Jin,
David S. Matteson
Abstract:
We apply both distance-based (Jin and Matteson, 2017) and kernel-based (Pfister et al., 2016) mutual dependence measures to independent component analysis (ICA), and generalize dCovICA (Matteson and Tsay, 2017) to MDMICA, minimizing empirical dependence measures as an objective function in both deflation and parallel manners. To solve this minimization problem, we introduce Latin hypercube sampling (LHS) (McKay et al., 2000) and a global optimization method, Bayesian optimization (BO) (Mockus, 1994), to improve the initialization of the Newton-type local optimization method. The performance of MDMICA is evaluated in various simulation studies and an image data example. When the ICA model is correct, MDMICA achieves competitive results compared to existing approaches. When the ICA model is misspecified, the estimated independent components are less mutually dependent than the observed components using MDMICA, while they are prone to be even more mutually dependent than the observed components using other approaches.
Submitted 17 May, 2018;
originally announced May 2018.
-
Optimization and Testing in Linear Non-Gaussian Component Analysis
Authors:
Ze Jin,
Benjamin B. Risk,
David S. Matteson
Abstract:
Independent component analysis (ICA) decomposes multivariate data into mutually independent components (ICs). The ICA model is subject to a constraint that at most one of these components is Gaussian, which is required for model identifiability. Linear non-Gaussian component analysis (LNGCA) generalizes the ICA model to a linear latent factor model with any number of both non-Gaussian components (signals) and Gaussian components (noise), where observations are linear combinations of independent components. Although the individual Gaussian components are not identifiable, the Gaussian subspace is identifiable. We introduce an estimator along with its optimization approach in which non-Gaussian and Gaussian components are estimated simultaneously, maximizing the discrepancy of each non-Gaussian component from Gaussianity while minimizing the discrepancy of each Gaussian component from Gaussianity. When the number of non-Gaussian components is unknown, we develop a statistical test to determine it based on resampling and the discrepancy of estimated components. Through a variety of simulation studies, we demonstrate the improvements of our estimator over competing estimators, and we illustrate the effectiveness of the test to determine the number of non-Gaussian components. Further, we apply our method to real data examples and demonstrate its practical value.
Submitted 29 December, 2017; v1 submitted 23 December, 2017;
originally announced December 2017.
-
Interpretable Vector AutoRegressions with Exogenous Time Series
Authors:
Ines Wilms,
Sumanta Basu,
Jacob Bien,
David S. Matteson
Abstract:
The Vector AutoRegressive (VAR) model is fundamental to the study of multivariate time series. Although VAR models are intensively investigated by many researchers, practitioners often show more interest in analyzing VARX models that incorporate the impact of unmodeled exogenous variables (X) into the VAR. However, since the parameter space grows quadratically with the number of time series, estimation quickly becomes challenging. While several proposals have been made to sparsely estimate large VAR models, the estimation of large VARX models is under-explored. Moreover, typically these sparse proposals involve a lasso-type penalty and do not incorporate lag selection into the estimation procedure. As a consequence, the resulting models may be difficult to interpret. In this paper, we propose a lag-based hierarchically sparse estimator, called "HVARX", for large VARX models. We illustrate the usefulness of HVARX on a cross-category management marketing application. Our results show how it provides a highly interpretable model, and improves out-of-sample forecast accuracy compared to a lasso-type approach.
Submitted 9 November, 2017;
originally announced November 2017.
-
Cell Line Classification Using Electric Cell-substrate Impedance Sensing (ECIS)
Authors:
Megan L. Gelsinger,
Laura L. Tupper,
David S. Matteson
Abstract:
We consider cell line classification using multivariate time series data obtained from electric cell-substrate impedance sensing (ECIS) technology. The ECIS device, which monitors the attachment and spreading of mammalian cells in real time through the collection of electrical impedance data, has historically been used to study one cell line at a time. However, we show that if applied to data from multiple cell lines, ECIS can be used to classify unknown or potentially mislabeled cells, which may help to mitigate the current crisis of reproducibility in the biological literature. We assess a range of approaches to this new problem, testing different classification methods and deriving a dictionary of 29 features to characterize ECIS data. Our analysis also makes use of simultaneous multi-frequency ECIS data, where previous studies have focused on only one frequency. In classification tests on fifteen mammalian cell lines, we obtain very high out-of-sample accuracy. These preliminary findings provide a baseline for future large-scale studies in this field.
Submitted 20 November, 2019; v1 submitted 26 October, 2017;
originally announced October 2017.
-
Pruning and Nonparametric Multiple Change Point Detection
Authors:
Wenyu Zhang,
Nicholas James,
David Matteson
Abstract:
Change point analysis is a statistical tool to identify homogeneity within time series data. We propose a pruning approach for approximate nonparametric estimation of multiple change points. This general purpose change point detection procedure "cp3o" applies a pruning routine within a dynamic program to greatly reduce the search space and computational costs. Existing goodness-of-fit change point objectives can immediately be utilized within the framework. We further propose novel change point algorithms by applying cp3o to two popular nonparametric goodness-of-fit measures: "e-cp3o" uses E-statistics, and "ks-cp3o" uses Kolmogorov-Smirnov statistics. Simulation studies highlight the performance of these algorithms in comparison with parametric and other nonparametric change point methods. Finally, we illustrate these approaches with climatological and financial applications.
Submitted 16 September, 2017;
originally announced September 2017.
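To make the Kolmogorov-Smirnov goodness-of-fit objective mentioned in the abstract above concrete, here is a small sketch of a two-sample KS statistic used to score a single candidate split. It is a brute-force scan over one change point, not the pruned dynamic program of cp3o; the function names, the `min_size` parameter, and the sample-size scaling factor are assumptions chosen for illustration.

```python
def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x (handles ties).
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


def best_single_split(series, min_size=5):
    """Scan every admissible split and return (tau, score) of the best one.

    The score scales the KS statistic by sqrt(n1 * n2 / (n1 + n2)), the usual
    two-sample normalization, so unbalanced splits are not favored.
    """
    n = len(series)
    best_tau, best_score = None, -1.0
    for tau in range(min_size, n - min_size + 1):
        left, right = series[:tau], series[tau:]
        score = (tau * (n - tau) / n) ** 0.5 * ks_statistic(left, right)
        if score > best_score:
            best_tau, best_score = tau, score
    return best_tau, best_score
```

The scan costs one KS computation per candidate split, which is the kind of exhaustive search that a pruning routine is meant to avoid when several change points are estimated.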
-
Generalizing Distance Covariance to Measure and Test Multivariate Mutual Dependence
Authors:
Ze Jin,
David S. Matteson
Abstract:
We propose three measures of mutual dependence between multiple random vectors. All the measures are zero if and only if the random vectors are mutually independent. The first measure generalizes distance covariance from pairwise dependence to mutual dependence, while the other two measures are sums of squared distance covariance. All the measures share similar properties and asymptotic distributions to distance covariance, and capture non-linear and non-monotone mutual dependence between the random vectors. Inspired by complete and incomplete V-statistics, we define the empirical measures and simplified empirical measures as a trade-off between the complexity and power when testing mutual independence. Implementation of the tests is demonstrated by both simulation results and real data examples.
Submitted 25 February, 2018; v1 submitted 8 September, 2017;
originally announced September 2017.
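The pairwise quantity that the measures in the abstract above build on, squared sample distance covariance, can be sketched for two univariate samples as below. This shows only the standard V-statistic form for a single pair of variables, under the assumption of scalar observations; it is not the mutual dependence measures proposed in the paper, and the function names are illustrative.

```python
def centered_distances(x):
    """Double-centered matrix of pairwise absolute distances."""
    n = len(x)
    d = [[abs(xi - xj) for xj in x] for xi in x]
    row_mean = [sum(r) / n for r in d]   # equals column means by symmetry
    grand_mean = sum(row_mean) / n
    return [
        [d[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def dcov_sq(x, y):
    """Squared sample distance covariance of two equal-length samples."""
    n = len(x)
    a = centered_distances(x)
    b = centered_distances(y)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2
```

For example, with `x = [1, 2, 3, 4]` and `y = [(v - 2.5) ** 2 for v in x]`, the ordinary sample covariance is zero, yet `dcov_sq(x, y)` is positive, which illustrates the non-linear and non-monotone dependence that distance-based measures are designed to capture.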
-
Sparse Identification and Estimation of Large-Scale Vector AutoRegressive Moving Averages
Authors:
Ines Wilms,
Sumanta Basu,
Jacob Bien,
David S. Matteson
Abstract:
The Vector AutoRegressive Moving Average (VARMA) model is fundamental to the theory of multivariate time series; however, identifiability issues have led practitioners to abandon it in favor of the simpler but more restrictive Vector AutoRegressive (VAR) model. We narrow this gap with a new optimization-based approach to VARMA identification built upon the principle of parsimony. Among all equivalent data-generating models, we use convex optimization to seek the parameterization that is "simplest" in a certain sense. A user-specified strongly convex penalty is used to measure model simplicity, and that same penalty is then used to define an estimator that can be efficiently computed. We establish consistency of our estimators in a double-asymptotic regime. Our non-asymptotic error bound analysis accommodates both model specification and parameter estimation steps, a feature that is crucial for studying large-scale VARMA algorithms. Our analysis also provides new results on penalized estimation of infinite-order VAR, and elastic net regression under a singular covariance structure of regressors, which may be of independent interest. We illustrate the advantage of our method over VAR alternatives on three real data examples.
Submitted 8 June, 2021; v1 submitted 28 July, 2017;
originally announced July 2017.
-
Dynamic Shrinkage Processes
Authors:
Daniel R. Kowal,
David S. Matteson,
David Ruppert
Abstract:
We propose a novel class of dynamic shrinkage processes for Bayesian time series and regression analysis. Building upon a global-local framework of prior construction, in which continuous scale mixtures of Gaussian distributions are employed for both desirable shrinkage properties and computational tractability, we model dependence among the local scale parameters. The resulting processes inherit the desirable shrinkage behavior of popular global-local priors, such as the horseshoe prior, but provide additional localized adaptivity, which is important for modeling time series data or regression functions with local features. We construct a computationally efficient Gibbs sampling algorithm based on a Pólya-Gamma scale mixture representation of the proposed process. Using dynamic shrinkage processes, we develop a Bayesian trend filtering model that produces more accurate estimates and tighter posterior credible intervals than competing methods, and apply the model for irregular curve-fitting of minute-by-minute Twitter CPU usage data. In addition, we develop an adaptive time-varying parameter regression model to assess the efficacy of the Fama-French five-factor asset pricing model with momentum added as a sixth factor. Our dynamic analysis of manufacturing and healthcare industry data shows that, apart from market risk, no risk factors are significant except for brief periods.
Submitted 23 February, 2018; v1 submitted 3 July, 2017;
originally announced July 2017.
-
BigVAR: Tools for Modeling Sparse High-Dimensional Multivariate Time Series
Authors:
William Nicholson,
David Matteson,
Jacob Bien
Abstract:
The R package BigVAR allows for the simultaneous estimation of high-dimensional time series by applying structured penalties to the conventional vector autoregression (VAR) and vector autoregression with exogenous variables (VARX) frameworks. Our methods can be utilized in many forecasting applications that make use of time-dependent data such as macroeconomics, finance, and internet traffic. Our package extends solution algorithms from the machine learning and signal processing literatures to a time-dependent setting, selects the regularization parameter by sequential cross-validation, and provides substantial improvements in forecasting performance over conventional methods. We offer a user-friendly interface that utilizes R's S4 object class structure, which makes our methodology easily accessible to practitioners.
In this paper, we present an overview of our notation, the models that comprise BigVAR, and the functionality of our package with a detailed example using publicly available macroeconomic data. In addition, we present a simulation study comparing the performance of several procedures that refit the support selected by a BigVAR procedure according to several variants of least squares and conclude that refitting generally degrades forecast performance.
Submitted 22 February, 2017;
originally announced February 2017.
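The sequential cross-validation mentioned in the BigVAR abstract above can be illustrated with a rolling-origin scheme: the model is refit on all data before each forecast origin, a one-step-ahead forecast is scored, and the penalty with the lowest mean squared forecast error is kept. The sketch below is written in Python rather than R and uses a univariate ridge-penalized AR(1) as a stand-in for the package's structured VAR penalties; none of the function names or parameters correspond to the BigVAR interface.

```python
def ridge_ar1(y, lam):
    """Closed-form ridge estimate of the AR(1) coefficient (no intercept)."""
    num = sum(y[t] * y[t - 1] for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for t in range(1, len(y))) + lam
    return num / den if den > 0 else 0.0


def rolling_msfe(y, lam, first_origin):
    """Mean squared one-step-ahead forecast error over rolling origins."""
    errors = []
    for origin in range(first_origin, len(y)):
        phi = ridge_ar1(y[:origin], lam)   # fit only on data before origin
        forecast = phi * y[origin - 1]
        errors.append((y[origin] - forecast) ** 2)
    return sum(errors) / len(errors)


def select_lambda(y, grid, first_origin):
    """Pick the penalty in `grid` with the smallest rolling forecast error."""
    return min(grid, key=lambda lam: rolling_msfe(y, lam, first_origin))
```

Because each fit uses only observations that precede the forecast origin, the selection respects time ordering, unlike ordinary K-fold cross-validation.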
-
Mixed Data and Classification of Transit Stops
Authors:
Laura L. Tupper,
David S. Matteson,
John C. Handley
Abstract:
An analysis of the characteristics and behavior of individual bus stops can reveal clusters of similar stops, which can be of use in making routing and scheduling decisions, as well as determining what facilities to provide at each stop. This paper provides an exploratory analysis, including several possible clustering results, of a dataset provided by the Regional Transit Service of Rochester, NY. The dataset describes ridership on public buses, recording the time, location, and number of entering and exiting passengers each time a bus stops. A description of the overall behavior of bus ridership is followed by a stop-level analysis. We compare multiple measures of stop similarity, based on location, route information, and ridership volume over time.
Submitted 12 November, 2016;
originally announced November 2016.
-
Functional Autoregression for Sparsely Sampled Data
Authors:
Daniel R. Kowal,
David S. Matteson,
David Ruppert
Abstract:
We develop a hierarchical Gaussian process model for forecasting and inference of functional time series data. Unlike existing methods, our approach is especially suited for sparsely or irregularly sampled curves and for curves sampled with non-negligible measurement error. The latent process is dynamically modeled as a functional autoregression (FAR) with Gaussian process innovations. We propose a fully nonparametric dynamic functional factor model for the dynamic innovation process, with broader applicability and improved computational efficiency over standard Gaussian process models. We prove finite-sample forecasting and interpolation optimality properties of the proposed model, which remain valid with the Gaussian assumption relaxed. An efficient Gibbs sampling algorithm is developed for estimation, inference, and forecasting, with extensions for FAR(p) models with model averaging over the lag p. Extensive simulations demonstrate substantial improvements in forecasting performance and recovery of the autoregressive surface over competing methods, especially under sparse designs. We apply the proposed methods to forecast nominal and real yield curves using daily U.S. data. Real yields are observed more sparsely than nominal yields, yet the proposed methods are highly competitive in both settings.
Submitted 19 October, 2016; v1 submitted 9 March, 2016;
originally announced March 2016.