-
Critical Point-Finding Methods Reveal Gradient-Flat Regions of Deep Network Losses
Authors:
Charles G. Frye,
James Simon,
Neha S. Wadia,
Andrew Ligeralde,
Michael R. DeWeese,
Kristofer E. Bouchard
Abstract:
Despite the fact that the loss functions of deep neural networks are highly non-convex, gradient-based optimization algorithms converge to approximately the same performance from many random initial points. One thread of work has focused on explaining this phenomenon by characterizing the local curvature near critical points of the loss function, where the gradients are near zero, and demonstrating that neural network losses enjoy a no-bad-local-minima property and an abundance of saddle points. We report here that the methods used to find these putative critical points suffer from a bad local minima problem of their own: they often converge to or pass through regions where the gradient norm has a stationary point. We call these gradient-flat regions, since they arise when the gradient is approximately in the kernel of the Hessian, such that the loss is locally approximately linear, or flat, in the direction of the gradient. We describe how the presence of these regions necessitates care in both interpreting past results that claimed to find critical points of neural network losses and in designing second-order methods for optimizing neural networks.
Submitted 23 March, 2020;
originally announced March 2020.
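The gradient-flat condition described in this abstract can be illustrated on a toy function. The sketch below is not taken from the paper: the loss L(x, y) = x + y**2 and all function names are assumptions chosen for illustration. It uses the standard identity that the gradient of the squared gradient norm, 0.5 * ||grad L||^2, equals the Hessian times the gradient, so the gradient norm is stationary wherever the gradient lies in the kernel of the Hessian, even if the gradient itself is nonzero.

```python
# Toy loss L(x, y) = x + y**2 (hypothetical example, not from the paper).

def grad(x, y):
    """Gradient of L(x, y) = x + y**2."""
    return [1.0, 2.0 * y]

def hessian(x, y):
    """Hessian of L; constant for this toy loss."""
    return [[0.0, 0.0], [0.0, 2.0]]

def matvec(matrix, vector):
    """Plain matrix-vector product for small nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def grad_of_half_sq_grad_norm(x, y):
    """Gradient of 0.5 * ||grad L||^2, which equals H @ grad L."""
    return matvec(hessian(x, y), grad(x, y))

# On the line y = 0 the gradient is (1, 0): nonzero, so this is not a
# critical point of L, yet H @ grad L vanishes, so the gradient norm
# is stationary there.
flat_grad = grad(0.3, 0.0)
flat_hg = grad_of_half_sq_grad_norm(0.3, 0.0)

# Away from that line the gradient leaves the kernel of the Hessian.
other_hg = grad_of_half_sq_grad_norm(0.3, 0.5)

print(flat_grad, flat_hg, other_hg)
```

A method that minimizes the gradient norm would accept any point with y = 0 in this toy case, although the loss has no critical point at all.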
-
Sparse and Low-bias Estimation of High Dimensional Vector Autoregressive Models
Authors:
Trevor D. Ruiz,
Sharmodeep Bhattacharyya,
Mahesh Balasubramanian,
Kristofer E. Bouchard
Abstract:
Vector autoregressive (VAR) models are widely used for causal discovery and forecasting in multivariate time series analysis. In the high-dimensional setting, which is increasingly common in fields such as neuroscience and econometrics, model parameters are inferred by L1-regularized maximum likelihood (RML). A well-known feature of RML inference is that in general the technique produces a trade-off between sparsity and bias that depends on the choice of the regularization hyperparameter. In the context of multivariate time series analysis, sparse estimates are favorable for causal discovery and low-bias estimates are favorable for forecasting. However, owing to a paucity of research on hyperparameter selection methods, practitioners must rely on ad-hoc methods such as cross-validation (or manual tuning). The particular balance that such approaches achieve between the two goals -- causal discovery and forecasting -- is poorly understood. Our paper investigates this behavior and proposes a method (UoI-VAR) that achieves a better balance between sparsity and bias when the underlying causal influences are in fact sparse. We demonstrate through simulation that RML with a hyperparameter selected by cross-validation tends to overfit, producing relatively dense estimates. We further demonstrate that UoI-VAR much more effectively approximates the correct sparsity pattern with only a minor compromise in model fit, particularly so for larger data dimensions, and that the estimates produced by UoI-VAR exhibit less bias. We conclude that our method achieves improved performance especially well-suited to applications involving simultaneous causal discovery and forecasting in high-dimensional settings.
Submitted 12 March, 2025; v1 submitted 29 August, 2019;
originally announced August 2019.
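For orientation, one common way to write the L1-regularized estimator mentioned in this abstract is shown below for a first-order model. This is a sketch under stated assumptions, not necessarily the paper's exact objective: it assumes a VAR(1) process $x_t = A x_{t-1} + \varepsilon_t$ with $d$ variables and $T$ observed transitions, and uses the least-squares form that corresponds to a Gaussian likelihood with identity noise covariance. The symbols $A$, $T$, $d$, and $\lambda$ are introduced here for illustration.

```latex
\hat{A}_{\lambda}
  = \arg\min_{A \in \mathbb{R}^{d \times d}}
    \frac{1}{T} \sum_{t=1}^{T} \left\lVert x_t - A\, x_{t-1} \right\rVert_2^2
    + \lambda \sum_{i,j} \lvert A_{ij} \rvert
```

Larger values of $\lambda$ set more entries of $\hat{A}_{\lambda}$ to zero but shrink the remaining entries toward zero, which is the sparsity and bias trade-off the abstract refers to.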
-
Numerically Recovering the Critical Points of a Deep Linear Autoencoder
Authors:
Charles G. Frye,
Neha S. Wadia,
Michael R. DeWeese,
Kristofer E. Bouchard
Abstract:
Numerically locating the critical points of non-convex surfaces is a long-standing problem central to many fields. Recently, the loss surfaces of deep neural networks have been explored to gain insight into outstanding questions in optimization, generalization, and network architecture design. However, the degree to which recently-proposed methods for numerically recovering critical points actually do so has not been thoroughly evaluated. In this paper, we examine this issue in a case for which the ground truth is known: the deep linear autoencoder. We investigate two sub-problems associated with numerical critical point identification: first, because of large parameter counts, it is infeasible to find all of the critical points for contemporary neural networks, necessitating sampling approaches whose characteristics are poorly understood; second, the numerical tolerance for accurately identifying a critical point is unknown, and conservative tolerances are difficult to satisfy. We first identify connections between recently-proposed methods and well-understood methods in other fields, including chemical physics, economics, and algebraic geometry. We find that several methods work well at recovering certain information about loss surfaces, but fail to take an unbiased sample of critical points. Furthermore, numerical tolerance must be very strict to ensure that numerically-identified critical points have similar properties to true analytical critical points. We also identify a recently-published Newton method for optimization that outperforms previous methods as a critical point-finding algorithm. We expect our results will guide future attempts to numerically study critical points in large nonlinear neural networks.
Submitted 29 January, 2019;
originally announced January 2019.
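The role of a numerical tolerance in critical point identification can be sketched on a small example. The code below uses a plain Newton iteration on a hypothetical two-variable function, f(x, y) = x**3 / 3 - x + y**2; it is not the Newton variant the abstract refers to, and the function, starting point, and tolerance are all assumptions made for illustration. The iteration stops once the gradient norm falls below the tolerance, and the candidate point is then classified by counting negative Hessian eigenvalues.

```python
import math

def grad(x, y):
    """Gradient of f(x, y) = x**3 / 3 - x + y**2."""
    return (x * x - 1.0, 2.0 * y)

def hessian_diag(x, y):
    """Diagonal of the Hessian; off-diagonal terms are zero for this f."""
    return (2.0 * x, 2.0)

def newton_critical_point(x, y, tol=1e-10, max_iter=50):
    """Plain Newton iteration on grad f = 0, stopped by a gradient-norm tolerance."""
    for _ in range(max_iter):
        gx, gy = grad(x, y)
        if math.hypot(gx, gy) < tol:
            break
        hx, hy = hessian_diag(x, y)
        # The Hessian is diagonal, so the Newton step is elementwise.
        x, y = x - gx / hx, y - gy / hy
    return x, y

x_star, y_star = newton_critical_point(-0.5, 0.7)
grad_norm = math.hypot(*grad(x_star, y_star))

# Index = number of negative Hessian eigenvalues (here, the diagonal entries).
index = sum(1 for h in hessian_diag(x_star, y_star) if h < 0)

print(x_star, y_star, grad_norm, index)
```

From this starting point the iteration converges to the saddle at (-1, 0), a point that gradient descent would move away from, which is why Newton-type methods are used to sample critical points other than minima.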
-
Union of Intersections (UoI) for Interpretable Data Driven Discovery and Prediction
Authors:
Kristofer E. Bouchard,
Alejandro F. Bujan,
Farbod Roosta-Khorasani,
Shashanka Ubaru,
Prabhat,
Antoine M. Snijders,
Jian-Hua Mao,
Edward F. Chang,
Michael W. Mahoney,
Sharmodeep Bhattacharyya
Abstract:
The increasing size and complexity of scientific data could dramatically enhance discovery and prediction for basic scientific applications. Realizing this potential, however, requires novel statistical analysis methods that are both interpretable and predictive. We introduce Union of Intersections (UoI), a flexible, modular, and scalable framework for enhanced model selection and estimation. Methods based on UoI perform model selection and model estimation through intersection and union operations, respectively. We show that UoI-based methods achieve low-variance and nearly unbiased estimation of a small number of interpretable features, while maintaining high-quality prediction accuracy. We perform extensive numerical investigation to evaluate a UoI algorithm ($UoI_{Lasso}$) on synthetic and real data. In doing so, we demonstrate the extraction of interpretable functional networks from human electrophysiology recordings as well as accurate prediction of phenotypes from genotype-phenotype data with reduced features. We also show (with the $UoI_{L1Logistic}$ and $UoI_{CUR}$ variants of the basic framework) improved prediction parsimony for classification and matrix factorization on several benchmark biomedical data sets. These results suggest that methods based on the UoI framework could improve interpretation and prediction in data-driven discovery across scientific fields.
Submitted 2 November, 2017; v1 submitted 22 May, 2017;
originally announced May 2017.
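The intersection and union operations named in this abstract can be sketched with plain sets. This is only an illustration of the two set operations, not the published algorithm: the feature supports below are made-up data, the regularization values are arbitrary, and the choice of which candidate support each estimation resample prefers is assumed rather than computed from data.

```python
# Hypothetical supports (indices of nonzero coefficients) returned by a
# sparse fit on three bootstrap resamples, for two regularization strengths.
supports_by_lambda = {
    0.1: [{0, 1, 2, 5}, {0, 1, 2, 7}, {0, 1, 2, 3}],
    1.0: [{0, 1}, {0, 4}, {0, 1}],
}

def intersect_supports(supports):
    """Selection: keep only features chosen in every resample."""
    return set.intersection(*supports)

# One candidate support per regularization strength.
candidate_supports = {
    lam: intersect_supports(supports)
    for lam, supports in supports_by_lambda.items()
}

# Estimation: assume each of three estimation resamples picked the candidate
# support that predicted best on its held-out data (choices are made up here).
chosen_per_resample = [
    candidate_supports[0.1],
    candidate_supports[1.0],
    candidate_supports[0.1],
]

# Combining the per-resample choices yields the union of their supports.
final_support = set.union(*chosen_per_resample)

print(candidate_supports, final_support)
```

The intersection discards features that appear only in some resamples, which limits false positives, while the union keeps any feature that survives selection and helps prediction in at least one estimation resample.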
-
Bootstrapped Adaptive Threshold Selection for Statistical Model Selection and Estimation
Authors:
Kristofer E. Bouchard
Abstract:
A central goal of neuroscience is to understand how activity in the nervous system is related to features of the external world, or to features of the nervous system itself. A common approach is to model neural responses as a weighted combination of external features, or vice versa. The structure of the model weights can provide insight into neural representations. Often, neural input-output relationships are sparse, with only a few inputs contributing to the output. In part to account for such sparsity, structured regularizers are incorporated into model fitting optimization. However, by imposing priors, structured regularizers can make it difficult to interpret learned model parameters. Here, we investigate a simple, minimally structured model estimation method for accurate, unbiased estimation of sparse models based on Bootstrapped Adaptive Threshold Selection followed by ordinary least-squares refitting (BoATS). Through extensive numerical investigations, we show that this method often performs favorably compared to L1 and L2 regularizers. In particular, for a variety of model distributions and noise levels, BoATS more accurately recovers the parameters of sparse models, leading to more parsimonious explanations of outputs. Finally, we apply this method to the task of decoding human speech production from ECoG recordings.
Submitted 13 May, 2015;
originally announced May 2015.