-
A Primal-Dual Framework for Transformers and Neural Networks
Authors:
Tan M. Nguyen,
Tam Nguyen,
Nhat Ho,
Andrea L. Bertozzi,
Richard G. Baraniuk,
Stanley J. Osher
Abstract:
Self-attention is key to the remarkable success of transformers in sequence modeling tasks including many applications in natural language processing and computer vision. Like neural network layers, these attention mechanisms are often developed by heuristics and experience. To provide a principled framework for constructing attention layers in transformers, we show that the self-attention corresponds to the support vector expansion derived from a support vector regression problem, whose primal formulation has the form of a neural network layer. Using our framework, we derive popular attention layers used in practice and propose two new attentions: 1) the Batch Normalized Attention (Attention-BN) derived from the batch normalization layer and 2) the Attention with Scaled Head (Attention-SH) derived from using less training data to fit the SVR model. We empirically demonstrate the advantages of the Attention-BN and Attention-SH in reducing head redundancy, increasing the model's accuracy, and improving the model's efficiency in a variety of practical applications including image and time-series classification.
Submitted 19 June, 2024;
originally announced June 2024.
-
Improving Fairness and Mitigating MADness in Generative Models
Authors:
Paul Mayer,
Lorenzo Luzi,
Ali Siahkoohi,
Don H. Johnson,
Richard G. Baraniuk
Abstract:
Generative models unfairly penalize data belonging to minority classes, suffer from model autophagy disorder (MADness), and learn biased estimates of the underlying distribution parameters. Our theoretical and empirical results show that training generative models with intentionally designed hypernetworks leads to models that 1) are more fair when generating datapoints belonging to minority classes, 2) are more stable in a self-consumed (i.e., MAD) setting, and 3) learn parameters that are less statistically biased. To further mitigate unfairness, MADness, and bias, we introduce a regularization term that penalizes discrepancies between a generative model's estimated weights when trained on real data versus its own synthetic data. To facilitate training existing deep generative models within our framework, we offer a scalable implementation of hypernetworks that automatically generates a hypernetwork architecture for any given generative model.
Submitted 3 October, 2024; v1 submitted 22 May, 2024;
originally announced May 2024.
-
[Re] The Discriminative Kalman Filter for Bayesian Filtering with Nonlinear and Non-Gaussian Observation Models
Authors:
Josue Casco-Rodriguez,
Caleb Kemere,
Richard G. Baraniuk
Abstract:
Kalman filters provide a straightforward and interpretable means to estimate hidden or latent variables, and have found numerous applications in control, robotics, signal processing, and machine learning. One such application is neural decoding for neuroprostheses. In 2020, Burkhart et al. thoroughly evaluated their new version of the Kalman filter that leverages Bayes' theorem to improve filter performance for highly non-linear or non-Gaussian observation models. This work provides an open-source Python alternative to the authors' MATLAB algorithm. Specifically, we reproduce their most salient results for neuroscientific contexts and further examine the efficacy of their filter using multiple random seeds and previously unused trials from the authors' dataset. All experiments were performed offline on a single computer.
Submitted 24 January, 2024;
originally announced January 2024.
-
Boomerang: Local sampling on image manifolds using diffusion models
Authors:
Lorenzo Luzi,
Paul M Mayer,
Josue Casco-Rodriguez,
Ali Siahkoohi,
Richard G. Baraniuk
Abstract:
The inference stage of diffusion models can be seen as running a reverse-time diffusion stochastic differential equation, where samples from a Gaussian latent distribution are transformed into samples from a target distribution that usually reside on a low-dimensional manifold, e.g., an image manifold. The intermediate values between the initial latent space and the image manifold can be interpreted as noisy images, with the amount of noise determined by the forward diffusion process noise schedule. We utilize this interpretation to present Boomerang, an approach for local sampling of image manifolds. As implied by its name, Boomerang local sampling involves adding noise to an input image, moving it closer to the latent space, and then mapping it back to the image manifold through a partial reverse diffusion process. Thus, Boomerang generates images on the manifold that are "similar," but nonidentical, to the original input image. We can control the proximity of the generated images to the original by adjusting the amount of noise added. Furthermore, due to the stochastic nature of the reverse diffusion process in Boomerang, the generated images display a certain degree of stochasticity, allowing us to obtain local samples from the manifold without encountering any duplicates. Boomerang offers the flexibility to work seamlessly with any pretrained diffusion model, such as Stable Diffusion, without necessitating any adjustments to the reverse diffusion process. We present three applications for Boomerang. First, we provide a framework for constructing privacy-preserving datasets having controllable degrees of anonymity. Second, we show that using Boomerang for data augmentation increases generalization performance and outperforms state-of-the-art synthetic data augmentation. Lastly, we introduce a perceptual image enhancement framework, which enables resolution enhancement.
Submitted 17 April, 2024; v1 submitted 21 October, 2022;
originally announced October 2022.
-
Batch Normalization Explained
Authors:
Randall Balestriero,
Richard G. Baraniuk
Abstract:
A critically important, ubiquitous, and yet poorly understood ingredient in modern deep networks (DNs) is batch normalization (BN), which centers and normalizes the feature maps. To date, only limited progress has been made in understanding why BN boosts DN learning and inference performance; work has focused exclusively on showing that BN smooths a DN's loss landscape. In this paper, we study BN theoretically from the perspective of function approximation; we exploit the fact that most of today's state-of-the-art DNs are continuous piecewise affine (CPA) splines that fit a predictor to the training data via affine mappings defined over a partition of the input space (the so-called "linear regions"). We demonstrate that BN is an unsupervised learning technique that, independent of the DN's weights or gradient-based learning, adapts the geometry of a DN's spline partition to match the data. BN provides a "smart initialization" that boosts the performance of DN learning, because it adapts even a DN initialized with random weights to align its spline partition with the data. We also show that the variation of BN statistics between mini-batches introduces a dropout-like random perturbation to the partition boundaries and hence the decision boundary for classification problems. This per mini-batch perturbation reduces overfitting and improves generalization by increasing the margin between the training samples and the decision boundary.
Submitted 29 September, 2022;
originally announced September 2022.
-
A Blessing of Dimensionality in Membership Inference through Regularization
Authors:
Jasper Tan,
Daniel LeJeune,
Blake Mason,
Hamid Javadi,
Richard G. Baraniuk
Abstract:
Is overparameterization a privacy liability? In this work, we study the effect that the number of parameters has on a classifier's vulnerability to membership inference attacks. We first demonstrate how the number of parameters of a model can induce a privacy-utility trade-off: increasing the number of parameters generally improves generalization performance at the expense of lower privacy. However, remarkably, we then show that if coupled with proper regularization, increasing the number of parameters of a model can actually simultaneously increase both its privacy and performance, thereby eliminating the privacy-utility trade-off. Theoretically, we demonstrate this curious phenomenon for logistic regression with ridge regularization in a bi-level feature ensemble setting. Pursuant to our theoretical exploration, we develop a novel leave-one-out analysis tool to precisely characterize the vulnerability of a linear classifier to the optimal membership inference attack. We empirically exhibit this "blessing of dimensionality" for neural networks on a variety of tasks using early stopping as the regularizer.
Submitted 13 April, 2023; v1 submitted 27 May, 2022;
originally announced May 2022.
-
DeepTensor: Low-Rank Tensor Decomposition with Deep Network Priors
Authors:
Vishwanath Saragadam,
Randall Balestriero,
Ashok Veeraraghavan,
Richard G. Baraniuk
Abstract:
DeepTensor is a computationally efficient framework for low-rank decomposition of matrices and tensors using deep generative networks. We decompose a tensor as the product of low-rank tensor factors (e.g., a matrix as the outer product of two vectors), where each low-rank tensor is generated by a deep network (DN) that is trained in a self-supervised manner to minimize the mean-squared approximation error. Our key observation is that the implicit regularization inherent in DNs enables them to capture nonlinear signal structures (e.g., manifolds) that are out of the reach of classical linear methods like the singular value decomposition (SVD) and principal component analysis (PCA). Furthermore, in contrast to the SVD and PCA, whose performance deteriorates when the tensor's entries deviate from additive white Gaussian noise, we demonstrate that the performance of DeepTensor is robust to a wide range of distributions. We validate that DeepTensor is a robust and computationally efficient drop-in replacement for the SVD, PCA, nonnegative matrix factorization (NMF), and similar decompositions by exploring a range of real-world applications, including hyperspectral image denoising, 3D MRI tomography, and image classification. In particular, DeepTensor offers a 6dB signal-to-noise ratio improvement over standard denoising methods for signals corrupted by Poisson noise and learns to decompose 3D tensors 60 times faster than a single DN equipped with 3D convolutions.
Submitted 6 April, 2022;
originally announced April 2022.
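The DeepTensor abstract above describes a matrix as the outer product of two vectors, fitted by minimizing the mean-squared approximation error. The following is only a minimal sketch of that rank-1 building block, assuming fixed example vectors: in DeepTensor the factors are generated by deep networks, which this sketch does not attempt to reproduce. The helper names `outer` and `mse` are hypothetical and not taken from the paper.

```python
# Rank-1 matrix as an outer product, plus the mean-squared approximation error.
# Assumption: the factors are plain fixed vectors here, not network outputs.

def outer(u, v):
    """Return the outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def mse(a, b):
    """Mean-squared error between two matrices of the same shape."""
    count = sum(len(row) for row in a)
    total = sum((x - y) ** 2
                for row_a, row_b in zip(a, b)
                for x, y in zip(row_a, row_b))
    return total / count

u = [1.0, 2.0]
v = [3.0, 4.0, 5.0]
target = outer(u, v)                  # a 2x3 matrix of rank 1
guess = outer([1.0, 1.0], v)          # a poorer rank-1 approximation
print(mse(target, target))            # exact factors give zero error
print(mse(target, guess))             # mismatched factors give positive error
```

A training loop would adjust the factors (or the networks producing them) to drive this error down; that step is left out here.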
-
Parameters or Privacy: A Provable Tradeoff Between Overparameterization and Membership Inference
Authors:
Jasper Tan,
Blake Mason,
Hamid Javadi,
Richard G. Baraniuk
Abstract:
A surprising phenomenon in modern machine learning is the ability of a highly overparameterized model to generalize well (small error on the test data) even when it is trained to memorize the training data (zero error on the training data). This has led to an arms race towards increasingly overparameterized models (cf. deep learning). In this paper, we study an underexplored hidden cost of overparameterization: the fact that overparameterized models may be more vulnerable to privacy attacks, in particular the membership inference attack that predicts the (potentially sensitive) examples used to train a model. We significantly extend the relatively few empirical results on this problem by theoretically proving for an overparameterized linear regression model in the Gaussian data setting that membership inference vulnerability increases with the number of parameters. Moreover, a range of empirical studies indicates that more complex, nonlinear models exhibit the same behavior. Finally, we extend our analysis towards ridge-regularized linear regression and show in the Gaussian data setting that increased regularization also increases membership inference vulnerability in the overparameterized regime.
Submitted 30 November, 2022; v1 submitted 2 February, 2022;
originally announced February 2022.
-
Improving Transformers with Probabilistic Attention Keys
Authors:
Tam Nguyen,
Tan M. Nguyen,
Dung D. Le,
Duy Khuong Nguyen,
Viet-Anh Tran,
Richard G. Baraniuk,
Nhat Ho,
Stanley J. Osher
Abstract:
Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkable performance across a variety of natural language processing (NLP) and computer vision tasks. It has been observed that for many applications, those attention heads learn redundant embeddings, and most of them can be removed without degrading the performance of the model. Inspired by this observation, we propose Transformer with a Mixture of Gaussian Keys (Transformer-MGK), a novel transformer architecture that replaces redundant heads in transformers with a mixture of keys at each head. These mixtures of keys follow a Gaussian mixture model and allow each attention head to focus on different parts of the input sequence efficiently. Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs to compute while achieving comparable or better accuracy across tasks. Transformer-MGK can also be easily extended to use with linear attention. We empirically demonstrate the advantage of Transformer-MGK in a range of practical applications, including language modeling and tasks that involve very long sequences. On the Wikitext-103 and Long Range Arena benchmarks, Transformer-MGKs with 4 heads attain comparable or better performance to the baseline transformers with 8 heads.
Submitted 12 June, 2022; v1 submitted 16 October, 2021;
originally announced October 2021.
-
Unrolling Particles: Unsupervised Learning of Sampling Distributions
Authors:
Fernando Gama,
Nicolas Zilberstein,
Richard G. Baraniuk,
Santiago Segarra
Abstract:
Particle filtering is used to compute good nonlinear estimates of complex systems. It samples trajectories from a chosen distribution and computes the estimate as a weighted average. Easy-to-sample distributions often lead to degenerate samples where only one trajectory carries all the weight, negatively affecting the resulting performance of the estimate. While much research has been done on the design of appropriate sampling distributions that would lead to controlled degeneracy, in this paper our objective is to learn sampling distributions. Leveraging the framework of algorithm unrolling, we model the sampling distribution as a multivariate normal, and we use neural networks to learn both the mean and the covariance. We carry out unsupervised training of the model to minimize weight degeneracy, relying only on the observed measurements of the system. We show in simulations that the resulting particle filter yields good estimates in a wide range of scenarios.
Submitted 6 October, 2021;
originally announced October 2021.
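The particle-filtering abstract above describes the estimate as a weighted average of sampled trajectories and warns about degeneracy, where one trajectory carries all the weight. The sketch below illustrates both ideas on scalar particles. It uses the effective sample size, a common degeneracy diagnostic; whether the paper's training objective uses this quantity is not stated in the abstract, so treat it as an assumption. The function names are hypothetical.

```python
# Weighted-average estimate and a standard weight-degeneracy diagnostic.
# Assumption: scalar particles and nonnegative, not-all-zero weights.

def normalize(weights):
    """Scale weights so they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

def weighted_estimate(particles, weights):
    """Estimate as the weighted average of the particles."""
    w = normalize(weights)
    return sum(wi * xi for wi, xi in zip(w, particles))

def effective_sample_size(weights):
    """1 / sum(w_i^2): equals N for uniform weights, 1 when one weight dominates."""
    w = normalize(weights)
    return 1.0 / sum(wi * wi for wi in w)

particles = [0.0, 1.0, 2.0, 3.0]
uniform = [1.0, 1.0, 1.0, 1.0]        # healthy: all particles contribute
degenerate = [1.0, 0.0, 0.0, 0.0]     # degenerate: one particle carries all weight

print(weighted_estimate(particles, uniform), effective_sample_size(uniform))
print(weighted_estimate(particles, degenerate), effective_sample_size(degenerate))
```

A sampling distribution that keeps the effective sample size close to the number of particles makes better use of every sampled trajectory.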
-
A Farewell to the Bias-Variance Tradeoff? An Overview of the Theory of Overparameterized Machine Learning
Authors:
Yehuda Dar,
Vidya Muthukumar,
Richard G. Baraniuk
Abstract:
The rapid recent progress in machine learning (ML) has raised a number of scientific questions that challenge the longstanding dogma of the field. One of the most important riddles is the good empirical generalization of overparameterized models. Overparameterized models are excessively complex with respect to the size of the training dataset, which results in them perfectly fitting (i.e., interpolating) the training data, which is usually noisy. Such interpolation of noisy data is traditionally associated with detrimental overfitting, and yet a wide range of interpolating models, from simple linear models to deep neural networks, have recently been observed to generalize extremely well on fresh test data. Indeed, the recently discovered double descent phenomenon has revealed that highly overparameterized models often improve over the best underparameterized model in test performance.
Understanding learning in this overparameterized regime requires new theory and foundational empirical studies, even for the simplest case of the linear model. The underpinnings of this understanding have been laid in very recent analyses of overparameterized linear regression and related statistical learning tasks, which resulted in precise analytic characterizations of double descent. This paper provides a succinct overview of this emerging theory of overparameterized ML (henceforth abbreviated as TOPML) that explains these recent findings through a statistical signal processing perspective. We emphasize the unique aspects that define the TOPML research area as a subfield of modern ML theory and outline interesting open questions that remain.
Submitted 6 September, 2021;
originally announced September 2021.
-
The Flip Side of the Reweighted Coin: Duality of Adaptive Dropout and Regularization
Authors:
Daniel LeJeune,
Hamid Javadi,
Richard G. Baraniuk
Abstract:
Among the most successful methods for sparsifying deep (neural) networks are those that adaptively mask the network weights throughout training. By examining this masking, or dropout, in the linear case, we uncover a duality between such adaptive methods and regularization through the so-called "η-trick" that casts both as iteratively reweighted optimizations. We show that any dropout strategy that adapts to the weights in a monotonic way corresponds to an effective subquadratic regularization penalty, and therefore leads to sparse solutions. We obtain the effective penalties for several popular sparsification strategies, which are remarkably similar to classical penalties commonly used in sparse optimization. Considering variational dropout as a case study, we demonstrate similar empirical behavior between the adaptive dropout method and classical methods on the task of deep network sparsification, validating our theory.
Submitted 3 January, 2022; v1 submitted 14 June, 2021;
originally announced June 2021.
-
NePTuNe: Neural Powered Tucker Network for Knowledge Graph Completion
Authors:
Shashank Sonkar,
Arzoo Katiyar,
Richard G. Baraniuk
Abstract:
Knowledge graphs link entities through relations to provide a structured representation of real world facts. However, they are often incomplete, because they are based on only a small fraction of all plausible facts. The task of knowledge graph completion via link prediction aims to overcome this challenge by inferring missing facts represented as links between entities. Current approaches to link prediction leverage tensor factorization and/or deep learning. Factorization methods train and deploy rapidly thanks to their small number of parameters but have limited expressiveness due to their underlying linear methodology. Deep learning methods are more expressive but also computationally expensive and prone to overfitting due to their large number of trainable parameters. We propose Neural Powered Tucker Network (NePTuNe), a new hybrid link prediction model that couples the expressiveness of deep models with the speed and size of linear models. We demonstrate that NePTuNe provides state-of-the-art performance on the FB15K-237 dataset and near state-of-the-art performance on the WN18RR dataset.
Submitted 15 April, 2021;
originally announced April 2021.
-
Ensembles of Generative Adversarial Networks for Disconnected Data
Authors:
Lorenzo Luzi,
Randall Balestriero,
Richard G. Baraniuk
Abstract:
Most current computer vision datasets are composed of disconnected sets, such as images from different classes. We prove that distributions of this type of data cannot be represented with a continuous generative network without error. They can be represented in two ways: with an ensemble of networks or with a single network with truncated latent space. We show that ensembles are more desirable than truncated distributions in practice. We construct a regularized optimization problem that establishes the relationship between a single continuous GAN, an ensemble of GANs, conditional GANs, and Gaussian Mixture GANs. This regularization can be computed efficiently, and we show empirically that our framework has a performance sweet spot which can be found with hyperparameter tuning. This ensemble framework allows better performance than a single continuous GAN or cGAN while maintaining fewer total parameters.
Submitted 25 June, 2020;
originally announced June 2020.
-
Analytical Probability Distributions and EM-Learning for Deep Generative Networks
Authors:
Randall Balestriero,
Sebastien Paris,
Richard G. Baraniuk
Abstract:
Deep Generative Networks (DGNs) with probabilistic modeling of their output and latent space are currently trained via Variational Autoencoders (VAEs). In the absence of a known analytical form for the posterior and likelihood expectation, VAEs resort to approximations, including (Amortized) Variational Inference (AVI) and Monte-Carlo (MC) sampling. We exploit the Continuous Piecewise Affine (CPA) property of modern DGNs to derive their posterior and marginal distributions as well as the latter's first moments. These findings enable us to derive an analytical Expectation-Maximization (EM) algorithm that enables gradient-free DGN learning. We demonstrate empirically that EM training of DGNs produces greater likelihood than VAE training. Our findings will guide the design of new VAE AVI that better approximate the true posterior and open avenues to apply standard statistical tools for model comparison, anomaly detection, and missing data imputation.
Submitted 17 June, 2020;
originally announced June 2020.
-
An Improved Semi-Supervised VAE for Learning Disentangled Representations
Authors:
Weili Nie,
Zichao Wang,
Ankit B. Patel,
Richard G. Baraniuk
Abstract:
Learning interpretable and disentangled representations is a crucial yet challenging task in representation learning. In this work, we focus on semi-supervised disentanglement learning and extend work by Locatello et al. (2019) by introducing another source of supervision that we denote as label replacement. Specifically, during training, we replace the inferred representation associated with a data point with its ground-truth representation whenever it is available. Our extension is theoretically inspired by our proposed general framework of semi-supervised disentanglement learning in the context of VAEs which naturally motivates the supervised terms commonly used in existing semi-supervised VAEs (but not for disentanglement learning). Extensive experiments on synthetic and real datasets demonstrate both quantitatively and qualitatively the ability of our extension to significantly and consistently improve disentanglement with very limited supervision.
Submitted 22 June, 2020; v1 submitted 12 June, 2020;
originally announced June 2020.
-
Double Double Descent: On Generalization Errors in Transfer Learning between Linear Regression Tasks
Authors:
Yehuda Dar,
Richard G. Baraniuk
Abstract:
We study the transfer learning process between two linear regression problems. An important and timely special case is when the regressors are overparameterized and perfectly interpolate their training data. We examine a parameter transfer mechanism whereby a subset of the parameters of the target task solution are constrained to the values learned for a related source task. We analytically characterize the generalization error of the target task in terms of the salient factors in the transfer learning architecture, i.e., the number of examples available, the number of (free) parameters in each of the tasks, the number of parameters transferred from the source to target task, and the relation between the two tasks. Our non-asymptotic analysis shows that the generalization error of the target task follows a two-dimensional double descent trend (with respect to the number of free parameters in each of the tasks) that is controlled by the transfer learning factors. Our analysis points to specific cases where the transfer of parameters is beneficial as a substitute for extra overparameterization (i.e., additional free parameters in the target task). Specifically, we show that the usefulness of a transfer learning setting is fragile and depends on a delicate interplay among the set of transferred parameters, the relation between the tasks, and the true solution. We also demonstrate that overparameterized transfer learning is not necessarily more beneficial when the source task is closer or identical to the target task.
Submitted 28 September, 2022; v1 submitted 12 June, 2020;
originally announced June 2020.
-
MomentumRNN: Integrating Momentum into Recurrent Neural Networks
Authors:
Tan M. Nguyen,
Richard G. Baraniuk,
Andrea L. Bertozzi,
Stanley J. Osher,
Bao Wang
Abstract:
Designing deep neural networks is an art that often involves an expensive search over candidate architectures. To overcome this for recurrent neural nets (RNNs), we establish a connection between the hidden state dynamics in an RNN and gradient descent (GD). We then integrate momentum into this framework and propose a new family of RNNs, called MomentumRNNs. We theoretically prove and numerically demonstrate that MomentumRNNs alleviate the vanishing gradient issue in training RNNs. We study the momentum long short-term memory (MomentumLSTM) and verify its advantages in convergence speed and accuracy over its LSTM counterpart across a variety of benchmarks. We also demonstrate that MomentumRNN is applicable to many types of recurrent cells, including those in the state-of-the-art orthogonal RNNs. Finally, we show that other advanced momentum-based optimization methods, such as Adam and Nesterov accelerated gradients with a restart, can be easily incorporated into the MomentumRNN framework for designing new recurrent cells with even better performance. The code is available at https://github.com/minhtannguyen/MomentumRNN.
Submitted 11 October, 2020; v1 submitted 11 June, 2020;
originally announced June 2020.
-
qDKT: Question-centric Deep Knowledge Tracing
Authors:
Shashank Sonkar,
Andrew E. Waters,
Andrew S. Lan,
Phillip J. Grimaldi,
Richard G. Baraniuk
Abstract:
Knowledge tracing (KT) models, e.g., the deep knowledge tracing (DKT) model, track an individual learner's acquisition of skills over time by examining the learner's performance on questions related to those skills. A practical limitation in most existing KT models is that all questions nested under a particular skill are treated as equivalent observations of a learner's ability, which is an inaccurate assumption in real-world educational scenarios. To overcome this limitation we introduce qDKT, a variant of DKT that models every learner's success probability on individual questions over time. First, qDKT incorporates graph Laplacian regularization to smooth predictions under each skill, which is particularly useful when the number of questions in the dataset is large. Second, qDKT uses an initialization scheme inspired by the fastText algorithm, which has found success in a variety of language modeling tasks. Our experiments on several real-world datasets show that qDKT achieves state-of-the-art performance on predicting learner outcomes. Because of this, qDKT can serve as a simple, yet tough-to-beat, baseline for new question-centric KT models.
Submitted 25 May, 2020;
originally announced May 2020.
-
Deep Learning Techniques for Inverse Problems in Imaging
Authors:
Gregory Ongie,
Ajil Jalal,
Christopher A. Metzler,
Richard G. Baraniuk,
Alexandros G. Dimakis,
Rebecca Willett
Abstract:
Recent work in machine learning shows that deep neural networks can be used to solve a wide variety of inverse problems arising in computational imaging. We explore the central prevailing themes of this emerging area and present a taxonomy that can be used to categorize different problems and reconstruction methods. Our taxonomy is organized along two central axes: (1) whether or not a forward model is known and to what extent it is used in training and testing, and (2) whether the learning is supervised or unsupervised, i.e., whether or not the training relies on access to matched ground truth image and measurement pairs. We also discuss the trade-offs associated with these different reconstruction approaches, caveats and common failure modes, plus open problems and avenues for future work.
Submitted 12 May, 2020;
originally announced May 2020.
-
Educational Question Mining At Scale: Prediction, Analysis and Personalization
Authors:
Zichao Wang,
Sebastian Tschiatschek,
Simon Woodhead,
Jose Miguel Hernandez-Lobato,
Simon Peyton Jones,
Richard G. Baraniuk,
Cheng Zhang
Abstract:
Online education platforms enable teachers to share a large number of educational resources such as questions to form exercises and quizzes for students. With large volumes of available questions, it is important to have an automated way to quantify their properties and intelligently select them for students, enabling effective and personalized learning experiences. In this work, we propose a framework for mining insights from educational questions at scale. We utilize the state-of-the-art Bayesian deep learning method, in particular partial variational auto-encoders (p-VAE), to analyze real students' answers to a large collection of questions. Based on p-VAE, we propose two novel metrics that quantify question quality and difficulty, respectively, and a personalized strategy to adaptively select questions for students. We apply our proposed framework to a real-world dataset with tens of thousands of questions and tens of millions of answers from an online education platform. Our framework not only demonstrates promising results in terms of statistical metrics but also obtains highly consistent results with domain experts' evaluation.
Submitted 28 February, 2021; v1 submitted 12 March, 2020;
originally announced March 2020.
-
Subspace Fitting Meets Regression: The Effects of Supervision and Orthonormality Constraints on Double Descent of Generalization Errors
Authors:
Yehuda Dar,
Paul Mayer,
Lorenzo Luzi,
Richard G. Baraniuk
Abstract:
We study the linear subspace fitting problem in the overparameterized setting, where the estimated subspace can perfectly interpolate the training examples. Our scope includes the least-squares solutions to subspace fitting tasks with varying levels of supervision in the training data (i.e., the proportion of input-output examples of the desired low-dimensional mapping) and orthonormality of the vectors defining the learned operator. This flexible family of problems connects standard, unsupervised subspace fitting that enforces strict orthonormality with a corresponding regression task that is fully supervised and does not constrain the linear operator structure. This class of problems is defined over a supervision-orthonormality plane, where each coordinate induces a problem instance with a unique pair of supervision level and softness of orthonormality constraints. We explore this plane and show that the generalization errors of the corresponding subspace fitting problems follow double descent trends as the settings become more supervised and less orthonormally constrained.
Submitted 20 August, 2020; v1 submitted 24 February, 2020;
originally announced February 2020.
-
Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent
Authors:
Bao Wang,
Tan M. Nguyen,
Andrea L. Bertozzi,
Richard G. Baraniuk,
Stanley J. Osher
Abstract:
Stochastic gradient descent (SGD) with constant momentum and its variants such as Adam are the optimization algorithms of choice for training deep neural networks (DNNs). Since DNN training is incredibly computationally expensive, there is great interest in speeding up the convergence. Nesterov accelerated gradient (NAG) improves the convergence rate of gradient descent (GD) for convex optimization using a specially designed momentum; however, it accumulates error when an inexact gradient is used (such as in SGD), slowing convergence at best and diverging at worst. In this paper, we propose Scheduled Restart SGD (SRSGD), a new NAG-style scheme for training DNNs. SRSGD replaces the constant momentum in SGD by the increasing momentum in NAG but stabilizes the iterations by resetting the momentum to zero according to a schedule. Using a variety of models and benchmarks for image classification, we demonstrate that, in training DNNs, SRSGD significantly improves convergence and generalization; for instance in training ResNet200 for ImageNet classification, SRSGD achieves an error rate of 20.93% vs. the benchmark of 22.13%. These improvements become more significant as the network grows deeper. Furthermore, on both CIFAR and ImageNet, SRSGD reaches similar or even better error rates with significantly fewer training epochs compared to the SGD baseline.
Submitted 26 April, 2020; v1 submitted 24 February, 2020;
originally announced February 2020.
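The restart idea described in this abstract can be illustrated with a minimal sketch. The code below is not the authors' implementation: it assumes a momentum coefficient of the form k/(k+3) that grows with an iteration counter k, a fixed restart interval (the hypothetical parameter `restart_every`), and an exact gradient on a toy quadratic rather than stochastic gradients on a DNN.

```python
def srsgd_minimize(grad, w0, lr=0.1, restart_every=20, steps=200):
    """NAG-style iteration whose momentum grows as k / (k + 3) and is
    reset to zero every `restart_every` steps (an assumed schedule)."""
    w = list(w0)        # current (extrapolated) iterate
    v_prev = list(w0)   # previous gradient-step point
    k = 0               # counter that drives the momentum coefficient
    for _ in range(steps):
        g = grad(w)
        # Plain gradient step from the extrapolated point.
        v = [wi - lr * gi for wi, gi in zip(w, g)]
        # Increasing momentum; equals 0 right after a restart.
        mu = k / (k + 3)
        w = [vi + mu * (vi - vpi) for vi, vpi in zip(v, v_prev)]
        v_prev = v
        # Scheduled restart: reset the counter, and hence the momentum.
        k = 0 if k + 1 >= restart_every else k + 1
    return w


def quad_grad(w):
    # Gradient of the toy objective f(w) = 0.5 * (w[0]**2 + 10 * w[1]**2).
    return [w[0], 10.0 * w[1]]


w_final = srsgd_minimize(quad_grad, [5.0, -3.0], lr=0.05,
                         restart_every=20, steps=300)
```

On this toy problem the iterate is driven toward the minimizer at the origin. The restart schedule used in the paper's experiments may differ from the fixed interval assumed here.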
-
InfoCNF: An Efficient Conditional Continuous Normalizing Flow with Adaptive Solvers
Authors:
Tan M. Nguyen,
Animesh Garg,
Richard G. Baraniuk,
Anima Anandkumar
Abstract:
Continuous Normalizing Flows (CNFs) have emerged as promising deep generative models for a wide range of tasks thanks to their invertibility and exact likelihood estimation. However, conditioning CNFs on signals of interest for conditional image generation and downstream predictive tasks is inefficient due to the high-dimensional latent code generated by the model, which needs to be of the same size as the input data. In this paper, we propose InfoCNF, an efficient conditional CNF that partitions the latent space into a class-specific supervised code and an unsupervised code that is shared among all classes for efficient use of labeled information. Since the partitioning strategy (slightly) increases the number of function evaluations (NFEs), InfoCNF also employs gating networks to learn the error tolerances of its ordinary differential equation (ODE) solvers for better speed and performance. We show empirically that InfoCNF improves the test accuracy over the baseline while yielding comparable likelihood scores and reducing the NFEs on CIFAR10. Furthermore, applying the same partitioning strategy in InfoCNF on time-series data helps improve extrapolation performance.
Submitted 9 December, 2019;
originally announced December 2019.
-
The Implicit Regularization of Ordinary Least Squares Ensembles
Authors:
Daniel LeJeune,
Hamid Javadi,
Richard G. Baraniuk
Abstract:
Ensemble methods that average over a collection of independent predictors that are each limited to a subsampling of both the examples and features of the training data command a significant presence in machine learning, such as the ever-popular random forest, yet the nature of the subsampling effect, particularly of the features, is not well understood. We study the case of an ensemble of linear predictors, where each individual predictor is fit using ordinary least squares on a random submatrix of the data matrix. We show that, under standard Gaussianity assumptions, when the number of features selected for each predictor is optimally tuned, the asymptotic risk of a large ensemble is equal to the asymptotic ridge regression risk, which is known to be optimal among linear predictors in this setting. In addition to eliciting this implicit regularization that results from subsampling, we also connect this ensemble to the dropout technique used in training deep (neural) networks, another strategy that has been shown to have a ridge-like regularizing effect.
Submitted 24 March, 2020; v1 submitted 10 October, 2019;
originally announced October 2019.
-
Drawing Early-Bird Tickets: Towards More Efficient Training of Deep Networks
Authors:
Haoran You,
Chaojian Li,
Pengfei Xu,
Yonggan Fu,
Yue Wang,
Xiaohan Chen,
Richard G. Baraniuk,
Zhangyang Wang,
Yingyan Celine Lin
Abstract:
Frankle & Carbin (2019) show that there exist winning tickets (small but critical subnetworks) for dense, randomly initialized networks, that can be trained alone to achieve comparable accuracies to the latter in a similar number of iterations. However, the identification of these winning tickets still requires the costly train-prune-retrain process, limiting their practical benefits. In this paper, we discover for the first time that the winning tickets can be identified at the very early training stage, which we term early-bird (EB) tickets, via low-cost training schemes (e.g., early stopping and low-precision training) at large learning rates. Our finding of EB tickets is consistent with recently reported observations that the key connectivity patterns of neural networks emerge early. Furthermore, we propose a mask distance metric that can be used to identify EB tickets with low computational overhead, without needing to know the true winning tickets that emerge after the full training. Finally, we leverage the existence of EB tickets and the proposed mask distance to develop efficient training methods, which are achieved by first identifying EB tickets via low-cost schemes, and then continuing to train merely the EB tickets towards the target accuracy. Experiments based on various deep networks and datasets validate: 1) the existence of EB tickets, and the effectiveness of mask distance in efficiently identifying them; and 2) that the proposed efficient training via EB tickets can achieve up to 4.7x energy savings while maintaining comparable or even better accuracy, demonstrating a promising and easily adopted method for tackling cost-prohibitive deep network training. Code available at https://github.com/RICE-EIC/Early-Bird-Tickets.
Submitted 3 March, 2025; v1 submitted 26 September, 2019;
originally announced September 2019.
-
Out-of-Distribution Detection Using Neural Rendering Generative Models
Authors:
Yujia Huang,
Sihui Dai,
Tan Nguyen,
Richard G. Baraniuk,
Anima Anandkumar
Abstract:
Out-of-distribution (OoD) detection is a natural downstream task for deep generative models, due to their ability to learn the input probability distribution. There are mainly two classes of approaches for OoD detection using deep generative models, viz., based on likelihood measure and the reconstruction loss. However, both approaches are unable to carry out OoD detection effectively, especially when the OoD samples have smaller variance than the training samples. For instance, both flow based and VAE models assign higher likelihood to images from SVHN when trained on CIFAR-10 images. We use a recently proposed generative model known as neural rendering model (NRM) and derive metrics for OoD. We show that NRM unifies both approaches since it provides a likelihood estimate and also carries out reconstruction in each layer of the neural network. Among various measures, we found the joint likelihood of latent variables to be the most effective one for OoD detection. Our results show that when trained on CIFAR-10, lower likelihood (of latent variables) is assigned to SVHN images. Additionally, we show that this metric is consistent across other OoD datasets. To the best of our knowledge, this is the first work to show consistently lower likelihood for OoD data with smaller variance with deep generative models.
Submitted 10 July, 2019;
originally announced July 2019.
-
Implicit Rugosity Regularization via Data Augmentation
Authors:
Daniel LeJeune,
Randall Balestriero,
Hamid Javadi,
Richard G. Baraniuk
Abstract:
Deep (neural) networks have been applied productively in a wide range of supervised and unsupervised learning tasks. Unlike classical machine learning algorithms, deep networks typically operate in the overparameterized regime, where the number of parameters is larger than the number of training data points. Consequently, understanding the generalization properties and the role of (explicit or implicit) regularization in these networks is of great importance. In this work, we explore how the oft-used heuristic of data augmentation imposes an implicit regularization penalty of a novel measure of the rugosity or "roughness" based on the tangent Hessian of the function fit to the training data.
Submitted 10 October, 2019; v1 submitted 28 May, 2019;
originally announced May 2019.
-
Thresholding Graph Bandits with GrAPL
Authors:
Daniel LeJeune,
Gautam Dasarathy,
Richard G. Baraniuk
Abstract:
In this paper, we introduce a new online decision making paradigm that we call Thresholding Graph Bandits. The main goal is to efficiently identify a subset of arms in a multi-armed bandit problem whose means are above a specified threshold. While the arms in such problems are traditionally assumed to be independent, in our paradigm we further suppose that we have access to the similarity between the arms in the form of a graph, allowing us to gain information about the arm means in fewer samples. Such settings play a key role in a wide range of modern decision making problems where rapid decisions need to be made in spite of the large number of options available at each time. We present GrAPL, a novel algorithm for the thresholding graph bandit problem. We demonstrate theoretically that this algorithm is effective in taking advantage of the graph structure when available and the reward function homophily (that strongly connected arms have similar rewards) when favorable. We confirm these theoretical findings via experiments on both synthetic and real data.
Submitted 24 March, 2020; v1 submitted 22 May, 2019;
originally announced May 2019.
-
IdeoTrace: A Framework for Ideology Tracing with a Case Study on the 2016 U.S. Presidential Election
Authors:
Indu Manickam,
Andrew S. Lan,
Gautam Dasarathy,
Richard G. Baraniuk
Abstract:
The 2016 United States presidential election has been characterized as a period of extreme divisiveness that was exacerbated on social media by the influence of fake news, trolls, and social bots. However, the extent to which the public became more polarized in response to these influences over the course of the election is not well understood. In this paper we propose IdeoTrace, a framework for (i) jointly estimating the ideology of social media users and news websites and (ii) tracing changes in user ideology over time. We apply this framework to the last two months of the election period for a group of 47508 Twitter users and demonstrate that both liberal and conservative users became more polarized over time.
Submitted 30 May, 2019; v1 submitted 21 May, 2019;
originally announced May 2019.
-
Adaptive Estimation for Approximate k-Nearest-Neighbor Computations
Authors:
Daniel LeJeune,
Richard G. Baraniuk,
Reinhard Heckel
Abstract:
Algorithms often carry out equally many computations for "easy" and "hard" problem instances. In particular, algorithms for finding nearest neighbors typically have the same running time regardless of the particular problem instance. In this paper, we consider the approximate k-nearest-neighbor problem, which is the problem of finding a subset of O(k) points in a given set of points that contains the set of k nearest neighbors of a given query point. We propose an algorithm based on adaptively estimating the distances, and show that it is essentially optimal out of algorithms that are only allowed to adaptively estimate distances. We then demonstrate both theoretically and experimentally that the algorithm can achieve significant speedups relative to the naive method.
Submitted 25 February, 2019;
originally announced February 2019.
-
Sub-linear Memory Sketches for Near Neighbor Search on Streaming Data
Authors:
Benjamin Coleman,
Richard G. Baraniuk,
Anshumali Shrivastava
Abstract:
We present the first sublinear memory sketch that can be queried to find the nearest neighbors in a dataset. Our online sketching algorithm compresses an N element dataset to a sketch of size $O(N^b \log^3 N)$ in $O(N^{(b+1)} \log^3 N)$ time, where $b < 1$. This sketch can correctly report the nearest neighbors of any query that satisfies a stability condition parameterized by $b$. We achieve sublinear memory performance on stable queries by combining recent advances in locality sensitive hash (LSH)-based estimators, online kernel density estimation, and compressed sensing. Our theoretical results shed new light on the memory-accuracy tradeoff for nearest neighbor search, and our sketch, which consists entirely of short integer arrays, has a variety of attractive features in practice. We evaluate the memory-recall tradeoff of our method on a friend recommendation task in the Google Plus social media network. We obtain orders of magnitude better compression than the random projection based alternative while retaining the ability to report the nearest neighbors of practical queries.
Submitted 14 September, 2020; v1 submitted 18 February, 2019;
originally announced February 2019.
-
A Bayesian Perspective of Convolutional Neural Networks through a Deconvolutional Generative Model
Authors:
Tan Nguyen,
Nhat Ho,
Ankit Patel,
Anima Anandkumar,
Michael I. Jordan,
Richard G. Baraniuk
Abstract:
Inspired by the success of Convolutional Neural Networks (CNNs) for supervised prediction in images, we design the Deconvolutional Generative Model (DGM), a new probabilistic generative model whose inference calculations correspond to those in a given CNN architecture. The DGM uses a CNN to design the prior distribution in the probabilistic model. Furthermore, the DGM generates images from coarse to finer scales. It introduces a small set of latent variables at each scale, and enforces dependencies among all the latent variables via a conjugate prior distribution. This conjugate prior yields a new regularizer for training CNNs, based on paths rendered in the generative model: the Rendering Path Normalization (RPN). We demonstrate that this regularizer improves generalization, both in theory and in practice. In addition, likelihood estimation in the DGM yields training losses for CNNs, and inspired by this, we design a new loss termed the Max-Min cross entropy, which outperforms the traditional cross-entropy loss for object classification. The Max-Min cross entropy suggests a new deep network architecture, namely the Max-Min network, which can learn from less labeled data while maintaining good prediction performance. Our experiments demonstrate that the DGM with the RPN and the Max-Min architecture matches or exceeds the state of the art on benchmarks including SVHN, CIFAR10, and CIFAR100 for semi-supervised and supervised learning tasks.
Submitted 9 December, 2019; v1 submitted 31 October, 2018;
originally announced November 2018.
-
From Hard to Soft: Understanding Deep Network Nonlinearities via Vector Quantization and Statistical Inference
Authors:
Randall Balestriero,
Richard G. Baraniuk
Abstract:
Nonlinearity is crucial to the performance of a deep (neural) network (DN). To date there has been little progress understanding the menagerie of available nonlinearities, but recently progress has been made on understanding the rôle played by piecewise affine and convex nonlinearities like the ReLU and absolute value activation functions and max-pooling. In particular, DN layers constructed from these operations can be interpreted as max-affine spline operators (MASOs) that have an elegant link to vector quantization (VQ) and $K$-means. While this is good theoretical progress, the entire MASO approach is predicated on the requirement that the nonlinearities be piecewise affine and convex, which precludes important activation functions like the sigmoid, hyperbolic tangent, and softmax. This paper extends the MASO framework to these and an infinitely large class of new nonlinearities by linking deterministic MASOs with probabilistic Gaussian Mixture Models (GMMs). We show that, under a GMM, piecewise affine, convex nonlinearities like ReLU, absolute value, and max-pooling can be interpreted as solutions to certain natural "hard" VQ inference problems, while sigmoid, hyperbolic tangent, and softmax can be interpreted as solutions to corresponding "soft" VQ inference problems. We further extend the framework by hybridizing the hard and soft VQ optimizations to create a $β$-VQ inference that interpolates between hard, soft, and linear VQ inference. A prime example of a $β$-VQ DN nonlinearity is the swish nonlinearity, which offers state-of-the-art performance in a range of computer vision tasks but was developed ad hoc by experimentation. Finally, we validate with experiments an important assertion of our theory, namely that DN performance can be significantly improved by enforcing orthogonality in its linear filters.
Submitted 22 October, 2018;
originally announced October 2018.
-
MISSION: Ultra Large-Scale Feature Selection using Count-Sketches
Authors:
Amirali Aghazadeh,
Ryan Spring,
Daniel LeJeune,
Gautam Dasarathy,
Anshumali Shrivastava,
Richard G. Baraniuk
Abstract:
Feature selection is an important challenge in machine learning. It plays a crucial role in the explainability of machine-driven decisions that are rapidly permeating throughout modern society. Unfortunately, the explosion in the size and dimensionality of real-world datasets poses a severe challenge to standard feature selection algorithms. Today, it is not uncommon for datasets to have billions of dimensions. At such scale, even storing the feature vector is impossible, causing most existing feature selection methods to fail. Workarounds like feature hashing, a standard approach to large-scale machine learning, help with the computational feasibility, but at the cost of losing the interpretability of features. In this paper, we present MISSION, a novel framework for ultra large-scale feature selection that performs stochastic gradient descent while maintaining an efficient representation of the features in memory using a Count-Sketch data structure. MISSION retains the simplicity of feature hashing without sacrificing the interpretability of the features while using only O(log^2(p)) working memory. We demonstrate that MISSION accurately and efficiently performs feature selection on real-world, large-scale datasets with billions of dimensions.
Submitted 11 June, 2018;
originally announced June 2018.
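The Count-Sketch data structure named in the MISSION abstract above can be sketched as follows. This is a minimal, generic Count-Sketch under assumed choices (the hash family, `depth`, `width`, and the class and method names are all illustrative), not the MISSION implementation, which additionally runs stochastic gradient descent on top of the sketch.

```python
import random
import statistics

PRIME = (1 << 61) - 1  # large Mersenne prime for the illustrative hash family

class CountSketch:
    """Fixed-size table that stores signed updates for integer keys."""

    def __init__(self, depth=5, width=1024, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.table = [[0.0] * width for _ in range(depth)]
        # One (bucket hash, sign hash) parameter set per row.
        self.params = [
            (rng.randrange(1, PRIME), rng.randrange(PRIME),
             rng.randrange(1, PRIME), rng.randrange(PRIME))
            for _ in range(depth)
        ]

    def _locate(self, row, key):
        a, b, c, d = self.params[row]
        bucket = ((a * key + b) % PRIME) % self.width
        sign = 1.0 if ((c * key + d) % PRIME) % 2 == 0 else -1.0
        return bucket, sign

    def update(self, key, delta):
        # Add a signed copy of delta to one bucket in every row.
        for row in range(len(self.table)):
            bucket, sign = self._locate(row, key)
            self.table[row][bucket] += sign * delta

    def query(self, key):
        # The median over rows limits the effect of collisions with other keys.
        estimates = []
        for row in range(len(self.table)):
            bucket, sign = self._locate(row, key)
            estimates.append(sign * self.table[row][bucket])
        return statistics.median(estimates)

sketch = CountSketch()
sketch.update(123_456_789_012, 3.0)   # e.g. accumulate a weight update
sketch.update(123_456_789_012, 1.5)
estimate = sketch.query(123_456_789_012)
```

Memory use depends only on `depth * width`, not on the number of distinct feature indices, which is what makes the structure attractive when the feature vector itself cannot be stored.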
-
Unsupervised Learning with Stein's Unbiased Risk Estimator
Authors:
Christopher A. Metzler,
Ali Mousavi,
Reinhard Heckel,
Richard G. Baraniuk
Abstract:
Learning from unlabeled and noisy data is one of the grand challenges of machine learning. As such, it has seen a flurry of research with new ideas proposed continuously. In this work, we revisit a classical idea: Stein's Unbiased Risk Estimator (SURE). We show that, in the context of image recovery, SURE and its generalizations can be used to train convolutional neural networks (CNNs) for a range of image denoising and recovery problems without any ground truth data.
Specifically, our goal is to reconstruct an image $x$ from a noisy linear transformation (measurement) of the image. We consider two scenarios: one where no additional data is available and one where we have measurements of other images that are drawn from the same noisy distribution as $x$, but have no access to the clean images. Such is the case, for instance, in the context of medical imaging, microscopy, and astronomy, where noise-less ground truth data is rarely available.
We show that in this situation, SURE can be used to estimate the mean-squared-error loss associated with an estimate of $x$. Using this estimate of the loss, we train networks to perform denoising and compressed sensing recovery. In addition, we use the SURE framework to partially explain and improve upon an intriguing result presented by Ulyanov et al. in "Deep Image Prior": that a network initialized with random weights and fit to a single noisy image can effectively denoise that image.
Public implementations of the networks and methods described in this paper can be found at https://github.com/ricedsp/D-AMP_Toolbox.
Submitted 22 July, 2020; v1 submitted 26 May, 2018;
originally announced May 2018.
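For the plain denoising case ($y = x + n$ with $n \sim \mathcal{N}(0, σ^2 I)$), the SURE loss mentioned in the abstract above can be sketched as below. This is an illustrative sketch, not the authors' code: it assumes a known noise level `sigma`, uses a single-probe Monte Carlo estimate of the divergence, and the function names and the toy `shrink` denoiser are hypothetical.

```python
import numpy as np

def mc_divergence(denoiser, y, rng, h=1e-3):
    # Monte Carlo estimate of div f(y) from one random +/-1 probe vector.
    probe = rng.choice([-1.0, 1.0], size=y.shape)
    return float(np.sum(probe * (denoiser(y + h * probe) - denoiser(y))) / h)

def sure(denoiser, y, sigma, rng):
    # SURE estimate of the per-sample MSE ||f(y) - x||^2 / n, computed
    # without access to the clean signal x.
    n = y.size
    residual = np.sum((denoiser(y) - y) ** 2) / n
    divergence = mc_divergence(denoiser, y, rng)
    return residual - sigma**2 + 2.0 * sigma**2 * divergence / n

rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0.0, 6.0, 256))
noisy = clean + 0.1 * rng.normal(size=clean.shape)

def shrink(v):
    # Toy denoiser: uniform shrinkage toward zero.
    return 0.9 * v

loss_estimate = sure(shrink, noisy, sigma=0.1, rng=rng)
```

Because the estimate depends only on the noisy observation, it can stand in for the usual mean-squared-error training loss when no ground truth is available.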
-
prDeep: Robust Phase Retrieval with a Flexible Deep Network
Authors:
Christopher A. Metzler,
Philip Schniter,
Ashok Veeraraghavan,
Richard G. Baraniuk
Abstract:
Phase retrieval algorithms have become an important component in many modern computational imaging systems. For instance, in the context of ptychography and speckle correlation imaging, they enable imaging past the diffraction limit and through scattering media, respectively. Unfortunately, traditional phase retrieval algorithms struggle in the presence of noise. Progress has been made recently on more robust algorithms using signal priors, but at the expense of limiting the range of supported measurement models (e.g., to Gaussian or coded diffraction patterns). In this work we leverage the regularization-by-denoising framework and a convolutional neural network denoiser to create prDeep, a new phase retrieval algorithm that is both robust and broadly applicable. We test and validate prDeep in simulation to demonstrate that it is robust to noise and can handle a variety of system models.
A MatConvNet implementation of prDeep is available at https://github.com/ricedsp/prDeep.
Submitted 29 June, 2018; v1 submitted 28 February, 2018;
originally announced March 2018.
-
Semi-Supervised Learning via New Deep Network Inversion
Authors:
Randall Balestriero,
Vincent Roger,
Herve G. Glotin,
Richard G. Baraniuk
Abstract:
We exploit a recently derived inversion scheme for arbitrary deep neural networks to develop a new semi-supervised learning framework that applies to a wide range of systems and problems. The approach outperforms current state-of-the-art methods on MNIST reaching $99.14\%$ of test set accuracy while using $5$ labeled examples per class. Experiments with one-dimensional signals highlight the generality of the method. Importantly, our approach is simple, efficient, and requires no change in the deep network architecture.
Submitted 12 November, 2017;
originally announced November 2017.
-
DeepCodec: Adaptive Sensing and Recovery via Deep Convolutional Neural Networks
Authors:
Ali Mousavi,
Gautam Dasarathy,
Richard G. Baraniuk
Abstract:
In this paper we develop a novel computational sensing framework for sensing and recovering structured signals. When trained on a set of representative signals, our framework learns to take undersampled measurements and recover signals from them using a deep convolutional neural network. In other words, it learns a transformation from the original signals to a near-optimal number of undersampled measurements and the inverse transformation from measurements to signals. This is in contrast to traditional compressive sensing (CS) systems that use random linear measurements and convex optimization or iterative algorithms for signal recovery. We compare our new framework with $\ell_1$-minimization from the phase transition point of view and demonstrate that it outperforms $\ell_1$-minimization in the regions of the phase transition plot where $\ell_1$-minimization cannot recover the exact solution. In addition, we experimentally demonstrate how learning measurements enhances the overall recovery performance, speeds up training of the recovery framework, and leads to fewer parameters to learn.
Submitted 11 July, 2017;
originally announced July 2017.
-
Learned D-AMP: Principled Neural Network based Compressive Image Recovery
Authors:
Christopher A. Metzler,
Ali Mousavi,
Richard G. Baraniuk
Abstract:
Compressive image recovery is a challenging problem that requires fast and accurate algorithms. Recently, neural networks have been applied to this problem with promising results. By exploiting massively parallel GPU processing architectures and oodles of training data, they can run orders of magnitude faster than existing techniques. However, these methods are largely unprincipled black boxes that are difficult to train and often specific to a single measurement matrix.
It was recently demonstrated that iterative sparse-signal-recovery algorithms can be "unrolled" to form interpretable deep networks. Taking inspiration from this work, we develop a novel neural network architecture that mimics the behavior of the denoising-based approximate message passing (D-AMP) algorithm. We call this new network Learned D-AMP (LDAMP).
The LDAMP network is easy to train, can be applied to a variety of different measurement matrices, and comes with a state-evolution heuristic that accurately predicts its performance. Most importantly, it outperforms the state-of-the-art BM3D-AMP and NLR-CS algorithms in terms of both accuracy and run time. At high resolutions, and when used with sensing matrices that have fast implementations, LDAMP runs over $50\times$ faster than BM3D-AMP and hundreds of times faster than NLR-CS.
Submitted 6 November, 2017; v1 submitted 21 April, 2017;
originally announced April 2017.
-
Data-Mining Textual Responses to Uncover Misconception Patterns
Authors:
Joshua J. Michalenko,
Andrew S. Lan,
Richard G. Baraniuk
Abstract:
An important, yet largely unstudied, problem in student data analysis is to detect misconceptions from students' responses to open-response questions. Misconception detection enables instructors to deliver more targeted feedback on the misconceptions exhibited by many students in their class, thus improving the quality of instruction. In this paper, we propose a new natural language processing-based framework to detect the common misconceptions among students' textual responses to short-answer questions. We propose a probabilistic model for students' textual responses involving misconceptions and experimentally validate it on a real-world student-response dataset. Experimental results show that our proposed framework excels at classifying whether a response exhibits one or more misconceptions. More importantly, it can also automatically detect the common misconceptions exhibited across responses from multiple students to multiple questions; this property is especially important at large scale, since instructors will no longer need to manually specify all possible misconceptions that students might exhibit.
Submitted 29 March, 2017; v1 submitted 24 March, 2017;
originally announced March 2017.
-
Learning to Invert: Signal Recovery via Deep Convolutional Networks
Authors:
Ali Mousavi,
Richard G. Baraniuk
Abstract:
The promise of compressive sensing (CS) has been offset by two significant challenges. First, real-world data is not exactly sparse in a fixed basis. Second, current high-performance recovery algorithms are slow to converge, which limits CS to either non-real-time applications or scenarios where massive back-end computing is available. In this paper, we attack both of these challenges head-on by developing a new signal recovery framework we call {\em DeepInverse} that learns the inverse transformation from measurement vectors to signals using a {\em deep convolutional network}. When trained on a set of representative images, the network learns both a representation for the signals (addressing challenge one) and an inverse map approximating a greedy or convex recovery algorithm (addressing challenge two). Our experiments indicate that the DeepInverse network closely approximates the solution produced by state-of-the-art CS recovery algorithms yet is hundreds of times faster in run time. The tradeoff for the ultrafast run time is a computationally intensive, off-line training procedure typical of deep networks. However, the training needs to be completed only once, which makes the approach attractive for a host of sparse recovery problems.
Submitted 14 January, 2017;
originally announced January 2017.
-
Semi-Supervised Learning with the Deep Rendering Mixture Model
Authors:
Tan Nguyen,
Wanjia Liu,
Ethan Perez,
Richard G. Baraniuk,
Ankit B. Patel
Abstract:
Semi-supervised learning algorithms reduce the high cost of acquiring labeled training data by using both labeled and unlabeled data during learning. Deep Convolutional Networks (DCNs) have achieved great success in supervised tasks and as such have been widely employed in semi-supervised learning. In this paper we leverage the recently developed Deep Rendering Mixture Model (DRMM), a probabilistic generative model that models latent nuisance variation, and whose inference algorithm yields DCNs. We develop an EM algorithm for the DRMM to learn from both labeled and unlabeled data. Guided by the theory of the DRMM, we introduce a novel non-negativity constraint and a variational inference term. We report state-of-the-art performance on MNIST and SVHN and competitive results on CIFAR10. We also probe deeper into how a DRMM trained in a semi-supervised setting represents latent nuisance variation using synthetically rendered images. Taken together, our work provides a unified framework for supervised, unsupervised, and semi-supervised learning.
Submitted 6 December, 2016;
originally announced December 2016.
-
A Probabilistic Framework for Deep Learning
Authors:
Ankit B. Patel,
Tan Nguyen,
Richard G. Baraniuk
Abstract:
We develop a probabilistic framework for deep learning based on the Deep Rendering Mixture Model (DRMM), a new generative probabilistic model that explicitly captures variations in data due to latent task nuisance variables. We demonstrate that max-sum inference in the DRMM yields an algorithm that exactly reproduces the operations in deep convolutional neural networks (DCNs), providing a first-principles derivation. Our framework provides new insights into the successes and shortcomings of DCNs as well as a principled route to their improvement. DRMM training via the Expectation-Maximization (EM) algorithm is a powerful alternative to DCN back-propagation, and initial training results are promising. Classification based on the DRMM and other variants outperforms DCNs in supervised digit classification, training 2-3x faster while achieving similar accuracy. Moreover, the DRMM is applicable to semi-supervised and unsupervised learning tasks, achieving results that are state-of-the-art in several categories on the MNIST benchmark and comparable to state of the art on the CIFAR10 benchmark.
Submitted 6 December, 2016;
originally announced December 2016.
-
Consistent Parameter Estimation for LASSO and Approximate Message Passing
Authors:
Ali Mousavi,
Arian Maleki,
Richard G. Baraniuk
Abstract:
We consider the problem of recovering a vector $β_o \in \mathbb{R}^p$ from $n$ random and noisy linear observations $y= Xβ_o + w$, where $X$ is the measurement matrix and $w$ is noise. The LASSO estimate is given by the solution to the optimization problem $\hatβ_λ = \arg \min_β \frac{1}{2} \|y-Xβ\|_2^2 + λ\| β\|_1$. Among the iterative algorithms that have been proposed for solving this optimization problem, approximate message passing (AMP) has attracted attention for its fast convergence. Despite significant progress in the theoretical analysis of the estimates of LASSO and AMP, little is known about their behavior as a function of the regularization parameter $λ$, or the threshold parameters $τ^t$. For instance, the following basic questions have not yet been studied in the literature: (i) How does the size of the active set $\|\hatβ_λ\|_0/p$ behave as a function of $λ$? (ii) How does the mean square error $\|\hatβ_λ - β_o\|_2^2/p$ behave as a function of $λ$? (iii) How does $\|β^t - β_o \|_2^2/p$ behave as a function of $τ^1, \ldots, τ^{t-1}$? Answering these questions will help in addressing practical challenges regarding the optimal tuning of $λ$ or $τ^1, τ^2, \ldots$. This paper answers these questions in the asymptotic setting and shows how these results can be employed in deriving simple and theoretically optimal approaches for tuning the parameters $τ^1, \ldots, τ^t$ for AMP or $λ$ for LASSO. It also explores the connection between the optimal tuning of the parameters of AMP and the optimal tuning of LASSO.
Submitted 4 November, 2015; v1 submitted 3 November, 2015;
originally announced November 2015.
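To make the LASSO objective in the abstract above concrete, the sketch below minimizes $\frac{1}{2}\|y-Xβ\|_2^2 + λ\|β\|_1$ with iterative soft thresholding (ISTA). ISTA is swapped in here for simplicity; it is not the AMP algorithm studied in the paper, which uses iteration-dependent thresholds $τ^t$ and an additional correction term. The function names, problem sizes, and iteration count are assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    # Elementwise shrinkage: the proximal operator of t * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(X, y, lam, n_iter=500):
    # Step size 1/L, where L = ||X||_2^2 bounds the curvature of the
    # quadratic data-fit term.
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)          # gradient of 0.5 * ||y - X beta||^2
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 100)) / np.sqrt(50)
beta_o = np.zeros(100)
beta_o[:5] = 3.0                              # sparse ground truth
y = X @ beta_o + 0.01 * rng.normal(size=50)
beta_hat = ista(X, y, lam=0.1)
active_fraction = np.count_nonzero(beta_hat) / beta_hat.size  # ||beta_hat||_0 / p
```

Rerunning with different values of `lam` and recording `active_fraction` gives an empirical view of question (i) in the abstract, the size of the active set as a function of $λ$.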
-
An Information-Theoretic Measure of Dependency Among Variables in Large Datasets
Authors:
Ali Mousavi,
Richard G. Baraniuk
Abstract:
The maximal information coefficient (MIC), which measures the amount of dependence between two variables, is able to detect both linear and non-linear associations. However, computational cost grows rapidly as a function of the dataset size. In this paper, we develop a computationally efficient approximation to the MIC that replaces its dynamic programming step with a much simpler technique based on the uniform partitioning of the data grid. A variety of experiments demonstrate the quality of our approximation.
Submitted 17 August, 2015;
originally announced August 2015.
-
A Deep Learning Approach to Structured Signal Recovery
Authors:
Ali Mousavi,
Ankit B. Patel,
Richard G. Baraniuk
Abstract:
In this paper, we develop a new framework for sensing and recovering structured signals. In contrast to compressive sensing (CS) systems that employ linear measurements, sparse representations, and computationally complex convex/greedy algorithms, we introduce a deep learning framework that supports both linear and mildly nonlinear measurements, that learns a structured representation from training data, and that efficiently computes a signal estimate. In particular, we apply a stacked denoising autoencoder (SDA) as an unsupervised feature learner. SDA enables us to capture statistical dependencies between the different elements of certain signals and improve signal recovery performance as compared to the CS approach.
Submitted 17 August, 2015;
originally announced August 2015.
-
oASIS: Adaptive Column Sampling for Kernel Matrix Approximation
Authors:
Raajen Patel,
Thomas A. Goldstein,
Eva L. Dyer,
Azalia Mirhoseini,
Richard G. Baraniuk
Abstract:
Kernel matrices (e.g. Gram or similarity matrices) are essential for many state-of-the-art approaches to classification, clustering, and dimensionality reduction. For large datasets, the cost of forming and factoring such kernel matrices becomes intractable. To address this challenge, we introduce a new adaptive sampling algorithm called Accelerated Sequential Incoherence Selection (oASIS) that samples columns without explicitly computing the entire kernel matrix. We provide conditions under which oASIS is guaranteed to exactly recover the kernel matrix with an optimal number of columns selected. Numerical experiments on both synthetic and real-world datasets demonstrate that oASIS achieves performance comparable to state-of-the-art adaptive sampling methods at a fraction of the computational cost. The low runtime complexity of oASIS and its low memory footprint enable the solution of large problems that are simply intractable using other adaptive methods.
Submitted 19 May, 2015;
originally announced May 2015.
-
Self-Expressive Decompositions for Matrix Approximation and Clustering
Authors:
Eva L. Dyer,
Tom A. Goldstein,
Raajen Patel,
Konrad P. Kording,
Richard G. Baraniuk
Abstract:
Data-aware methods for dimensionality reduction and matrix decomposition aim to find low-dimensional structure in a collection of data. Classical approaches discover such structure by learning a basis that can efficiently express the collection. Recently, "self expression", the idea of using a small subset of data vectors to represent the full collection, has been developed as an alternative to learning. Here, we introduce a scalable method for computing sparse SElf-Expressive Decompositions (SEED). SEED is a greedy method that constructs a basis by sequentially selecting incoherent vectors from the dataset. After forming a basis from a subset of vectors in the dataset, SEED then computes a sparse representation of the dataset with respect to this basis. We develop sufficient conditions under which SEED exactly represents low rank matrices and vectors sampled from a union of independent subspaces. We show how SEED can be used in applications ranging from matrix approximation and denoising to clustering, and apply it to numerous real-world datasets. Our results demonstrate that SEED is an attractive low-complexity alternative to other sparse matrix factorization approaches such as sparse PCA and self-expressive methods for clustering.
Submitted 4 May, 2015;
originally announced May 2015.
-
A Probabilistic Theory of Deep Learning
Authors:
Ankit B. Patel,
Tan Nguyen,
Richard G. Baraniuk
Abstract:
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.
Submitted 2 April, 2015;
originally announced April 2015.