-
PCS-UQ: Uncertainty Quantification via the Predictability-Computability-Stability Framework
Authors:
Abhineet Agarwal,
Michael Xiao,
Rebecca Barter,
Omer Ronen,
Boyu Fan,
Bin Yu
Abstract:
As machine learning (ML) models are increasingly deployed in high-stakes domains, trustworthy uncertainty quantification (UQ) is critical for ensuring the safety and reliability of these models. Traditional UQ methods rely on specifying a true generative model and are not robust to misspecification. On the other hand, conformal inference allows for arbitrary ML models but does not consider model s…
▽ More
As machine learning (ML) models are increasingly deployed in high-stakes domains, trustworthy uncertainty quantification (UQ) is critical for ensuring the safety and reliability of these models. Traditional UQ methods rely on specifying a true generative model and are not robust to misspecification. On the other hand, conformal inference allows for arbitrary ML models but does not consider model selection, which leads to large interval sizes. We tackle these drawbacks by proposing a UQ method based on the predictability, computability, and stability (PCS) framework for veridical data science proposed by Yu and Kumbier. Specifically, PCS-UQ addresses model selection by using a prediction check to screen out unsuitable models. PCS-UQ then fits these screened algorithms across multiple bootstraps to assess inter-sample variability and algorithmic instability, enabling more reliable uncertainty estimates. Further, we propose a novel calibration scheme that improves local adaptivity of our prediction sets. Experiments across $17$ regression and $6$ classification datasets show that PCS-UQ achieves the desired coverage and reduces width over conformal approaches by $\approx 20\%$. Further, our local analysis shows PCS-UQ often achieves target coverage across subgroups while conformal methods fail to do so. For large deep-learning models, we propose computationally efficient approximation schemes that avoid the expensive multiple bootstrap trainings of PCS-UQ. Across three computer vision benchmarks, PCS-UQ reduces prediction set size over conformal methods by $20\%$. Theoretically, we show a modified PCS-UQ algorithm is a form of split conformal inference and achieves the desired coverage with exchangeable data.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Statistics at a Crossroads; Who is for the Challenge?
Authors:
Xuming He,
David Madigan,
Bin Yu,
Jon Wellner
Abstract:
This project was sponsored by the National Science Foundation and organized by a steering committee and a group of theme leaders. The six-member steering committee, consisting of James Berger, Xuming He, David Madigan, Susan Murphy, Bin Yu, and Jon Wellner, was responsible for the overall planning of the project.
This report is designed to be accessible to the wider audience of key stakeholders…
▽ More
This project was sponsored by the National Science Foundation and organized by a steering committee and a group of theme leaders. The six-member steering committee, consisting of James Berger, Xuming He, David Madigan, Susan Murphy, Bin Yu, and Jon Wellner, was responsible for the overall planning of the project.
This report is designed to be accessible to the wider audience of key stakeholders in statistics and data science, including academic departments, university administration, and funding agencies. After the role and the value of Statistics and Data Science are discussed in Section 1, the report focuses on the two goals related to emerging research and data-driven challenges in applications. Section 2 identifies emerging research topics from the data challenges arising from scientific and social applications, and Section 3 discusses a number of emerging areas in foundational research. How to engage with those data-driven challenges and foster interdisciplinary collaborations is also summarized in the Executive Summary. The third goal of creating a vibrant research community and maintaining an appropriate balance is addressed in Sections 4 (Professional Culture and Community Responsibilities) and 5 (Doctoral Education).
△ Less
Submitted 28 March, 2025;
originally announced March 2025.
-
Benefits of Early Stopping in Gradient Descent for Overparameterized Logistic Regression
Authors:
Jingfeng Wu,
Peter Bartlett,
Matus Telgarsky,
Bin Yu
Abstract:
In overparameterized logistic regression, gradient descent (GD) iterates diverge in norm while converging in direction to the maximum $\ell_2$-margin solution -- a phenomenon known as the implicit bias of GD. This work investigates additional regularization effects induced by early stopping in well-specified high-dimensional logistic regression. We first demonstrate that the excess logistic risk v…
▽ More
In overparameterized logistic regression, gradient descent (GD) iterates diverge in norm while converging in direction to the maximum $\ell_2$-margin solution -- a phenomenon known as the implicit bias of GD. This work investigates additional regularization effects induced by early stopping in well-specified high-dimensional logistic regression. We first demonstrate that the excess logistic risk vanishes for early-stopped GD but diverges to infinity for GD iterates at convergence. This suggests that early-stopped GD is well-calibrated, whereas asymptotic GD is statistically inconsistent. Second, we show that to attain a small excess zero-one risk, polynomially many samples are sufficient for early-stopped GD, while exponentially many samples are necessary for any interpolating estimator, including asymptotic GD. This separation underscores the statistical benefits of early stopping in the overparameterized regime. Finally, we establish nonasymptotic bounds on the norm and angular differences between early-stopped GD and $\ell_2$-regularized empirical risk minimizer, thereby connecting the implicit regularization of GD with explicit $\ell_2$-regularization.
△ Less
Submitted 18 February, 2025;
originally announced February 2025.
-
Fast Multi-Group Gaussian Process Factor Models
Authors:
Evren Gokcen,
Anna I. Jasper,
Adam Kohn,
Christian K. Machens,
Byron M. Yu
Abstract:
Gaussian processes are now commonly used in dimensionality reduction approaches tailored to neuroscience, especially to describe changes in high-dimensional neural activity over time. As recording capabilities expand to include neuronal populations across multiple brain areas, cortical layers, and cell types, interest in extending Gaussian process factor models to characterize multi-population int…
▽ More
Gaussian processes are now commonly used in dimensionality reduction approaches tailored to neuroscience, especially to describe changes in high-dimensional neural activity over time. As recording capabilities expand to include neuronal populations across multiple brain areas, cortical layers, and cell types, interest in extending Gaussian process factor models to characterize multi-population interactions has grown. However, the cubic runtime scaling of current methods with the length of experimental trials and the number of recorded populations (groups) precludes their application to large-scale multi-population recordings. Here, we improve this scaling from cubic to linear in both trial length and group number. We present two approximate approaches to fitting multi-group Gaussian process factor models based on (1) inducing variables and (2) the frequency domain. Empirically, both methods achieved orders of magnitude speed-up with minimal impact on statistical performance, in simulation and on neural recordings of hundreds of neurons across three brain areas. The frequency domain approach, in particular, consistently provided the greatest runtime benefits with the fewest trade-offs in statistical performance. We further characterize the estimation biases introduced by the frequency domain approach and demonstrate effective strategies to mitigate them. This work enables a powerful class of analysis techniques to keep pace with the growing scale of multi-population recordings, opening new avenues for exploring brain function.
△ Less
Submitted 21 December, 2024;
originally announced December 2024.
-
Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering
Authors:
Xinyan Guan,
Yanjiang Liu,
Xinyu Lu,
Boxi Cao,
Ben He,
Xianpei Han,
Le Sun,
Jie Lou,
Bowen Yu,
Yaojie Lu,
Hongyu Lin
Abstract:
The evolution of machine learning has increasingly prioritized the development of powerful models and more scalable supervision signals. However, the emergence of foundation models presents significant challenges in providing effective supervision signals necessary for further enhancing their capabilities. Consequently, there is an urgent need to explore novel supervision signals and technical app…
▽ More
The evolution of machine learning has increasingly prioritized the development of powerful models and more scalable supervision signals. However, the emergence of foundation models presents significant challenges in providing effective supervision signals necessary for further enhancing their capabilities. Consequently, there is an urgent need to explore novel supervision signals and technical approaches. In this paper, we propose verifier engineering, a novel post-training paradigm specifically designed for the era of foundation models. The core of verifier engineering involves leveraging a suite of automated verifiers to perform verification tasks and deliver meaningful feedback to foundation models. We systematically categorize the verifier engineering process into three essential stages: search, verify, and feedback, and provide a comprehensive review of state-of-the-art research developments within each stage. We believe that verifier engineering constitutes a fundamental pathway toward achieving Artificial General Intelligence.
△ Less
Submitted 18 November, 2024;
originally announced November 2024.
-
A Geometric Model with Stochastic Error for Abnormal Motion Detection of Portal Crane Bucket Grab
Authors:
Baichen Yu,
Xiao Wang,
Hansheng Wang
Abstract:
Abnormal swing angle detection of bucket grabs is crucial for efficient harbor operations. In this study, we develop a practically convenient swing angle detection method for crane operation, requiring only a single standard surveillance camera at the fly-jib head, without the need for sophisticated sensors or markers on the payload. Specifically, our algorithm takes the video images from the came…
▽ More
Abnormal swing angle detection of bucket grabs is crucial for efficient harbor operations. In this study, we develop a practically convenient swing angle detection method for crane operation, requiring only a single standard surveillance camera at the fly-jib head, without the need for sophisticated sensors or markers on the payload. Specifically, our algorithm takes the video images from the camera as input. Next, a fine-tuned 'the fifth version of the You Only Look Once algorithm' (YOLOv5) model is used to automatically detect the position of the bucket grab on the image plane. Subsequently, a novel geometric model is constructed, which takes the pixel position of the bucket grab, the steel rope length provided by the Programmable Logic Controller system, and the optical lens information of the camera into consideration. The key parameters of this geometric model are statistically estimated by a novel iterative algorithm. Once the key parameters are estimated, the algorithm can automatically detect swing angles from video streams. Being analytically simple, the computation of our algorithm is fast, as it takes about 0.01 seconds to process one single image generated by the surveillance camera. Therefore, we are able to obtain an accurate and fast estimation of the swing angle of an operating crane in real-time applications. Simulation studies are conducted to validate the model and algorithm. Real video examples from Qingdao Seaport under various weather conditions are analyzed to demonstrate its practical performance.
△ Less
Submitted 14 October, 2024;
originally announced October 2024.
-
Veridical Data Science for Medical Foundation Models
Authors:
Ahmed Alaa,
Bin Yu
Abstract:
The advent of foundation models (FMs) such as large language models (LLMs) has led to a cultural shift in data science, both in medicine and beyond. This shift involves moving away from specialized predictive models trained for specific, well-defined domain questions to generalist FMs pre-trained on vast amounts of unstructured data, which can then be adapted to various clinical tasks and question…
▽ More
The advent of foundation models (FMs) such as large language models (LLMs) has led to a cultural shift in data science, both in medicine and beyond. This shift involves moving away from specialized predictive models trained for specific, well-defined domain questions to generalist FMs pre-trained on vast amounts of unstructured data, which can then be adapted to various clinical tasks and questions. As a result, the standard data science workflow in medicine has been fundamentally altered; the foundation model lifecycle (FMLC) now includes distinct upstream and downstream processes, in which computational resources, model and data access, and decision-making power are distributed among multiple stakeholders. At their core, FMs are fundamentally statistical models, and this new workflow challenges the principles of Veridical Data Science (VDS), hindering the rigorous statistical analysis expected in transparent and scientifically reproducible data science practices. We critically examine the medical FMLC in light of the core principles of VDS: predictability, computability, and stability (PCS), and explain how it deviates from the standard data science workflow. Finally, we propose recommendations for a reimagined medical FMLC that expands and refines the PCS principles for VDS including considering the computational and accessibility constraints inherent to FMs.
△ Less
Submitted 15 September, 2024;
originally announced September 2024.
-
The Computational Curse of Big Data for Bayesian Additive Regression Trees: A Hitting Time Analysis
Authors:
Yan Shuo Tan,
Omer Ronen,
Theo Saarinen,
Bin Yu
Abstract:
Bayesian Additive Regression Trees (BART) is a popular Bayesian non-parametric regression model that is commonly used in causal inference and beyond. Its strong predictive performance is supported by theoretical guarantees that its posterior distribution concentrates around the true regression function at optimal rates under various data generative settings and for appropriate prior choices. In th…
▽ More
Bayesian Additive Regression Trees (BART) is a popular Bayesian non-parametric regression model that is commonly used in causal inference and beyond. Its strong predictive performance is supported by theoretical guarantees that its posterior distribution concentrates around the true regression function at optimal rates under various data generative settings and for appropriate prior choices. In this paper, we show that the BART sampler often converges slowly, confirming empirical observations by other researchers. Assuming discrete covariates, we show that, while the BART posterior concentrates on a set comprising all optimal tree structures (smallest bias and complexity), the Markov chain's hitting time for this set increases with $n$ (training sample size), under several common data generative settings. As $n$ increases, the approximate BART posterior thus becomes increasingly different from the exact posterior (for the same number of MCMC samples), contrasting with earlier concentration results on the exact posterior. This contrast is highlighted by our simulations showing worsening frequentist undercoverage for approximate posterior intervals and a growing ratio between the MSE of the approximate posterior and that obtainable by artificially improving convergence via averaging multiple sampler chains. Finally, based on our theoretical insights, possibilities are discussed to improve the BART sampler convergence performance.
△ Less
Submitted 28 June, 2024;
originally announced June 2024.
-
Mitigating over-exploration in latent space optimization using LES
Authors:
Omer Ronen,
Ahmed Imtiaz Humayun,
Richard Baraniuk,
Randall Balestriero,
Bin Yu
Abstract:
We develop Latent Exploration Score (LES) to mitigate over-exploration in Latent Space Optimization (LSO), a popular method for solving black-box discrete optimization problems. LSO utilizes continuous optimization within the latent space of a Variational Autoencoder (VAE) and is known to be susceptible to over-exploration, which manifests in unrealistic solutions that reduce its practicality. LES…
▽ More
We develop Latent Exploration Score (LES) to mitigate over-exploration in Latent Space Optimization (LSO), a popular method for solving black-box discrete optimization problems. LSO utilizes continuous optimization within the latent space of a Variational Autoencoder (VAE) and is known to be susceptible to over-exploration, which manifests in unrealistic solutions that reduce its practicality. LES leverages the trained decoder's approximation of the data distribution, and can be employed with any VAE decoder - including pretrained ones - without additional training, architectural changes or access to the training data. Our evaluation across five LSO benchmark tasks and twenty-two VAE models demonstrates that LES always enhances the quality of the solutions while maintaining high objective values, leading to improvements over existing solutions in most cases. We believe that new avenues to LSO will be opened by LES' ability to identify out of distribution areas, differentiability, and computational tractability. Open source code for LES is available at https://github.com/OmerRonen/les.
△ Less
Submitted 21 February, 2025; v1 submitted 13 June, 2024;
originally announced June 2024.
-
The Impact of Initialization on LoRA Finetuning Dynamics
Authors:
Soufiane Hayou,
Nikhil Ghosh,
Bin Yu
Abstract:
In this paper, we study the role of initialization in Low Rank Adaptation (LoRA) as originally introduced in Hu et al. (2021). Essentially, to start from the pretrained model as initialization for finetuning, one can either initialize B to zero and A to random (default initialization in PEFT package), or vice-versa. In both cases, the product BA is equal to zero at initialization, which makes fine…
▽ More
In this paper, we study the role of initialization in Low Rank Adaptation (LoRA) as originally introduced in Hu et al. (2021). Essentially, to start from the pretrained model as initialization for finetuning, one can either initialize B to zero and A to random (default initialization in PEFT package), or vice-versa. In both cases, the product BA is equal to zero at initialization, which makes finetuning starts from the pretrained model. These two initialization schemes are seemingly similar. They should in-principle yield the same performance and share the same optimal learning rate. We demonstrate that this is an incorrect intuition and that the first scheme (initializing B to zero and A to random) on average yields better performance compared to the other scheme. Our theoretical analysis shows that the reason behind this might be that the first initialization allows the use of larger learning rates (without causing output instability) compared to the second initialization, resulting in more efficient learning of the first scheme. We validate our results with extensive experiments on LLMs.
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
-
Towards Scalable Automated Alignment of LLMs: A Survey
Authors:
Boxi Cao,
Keming Lu,
Xinyu Lu,
Jiawei Chen,
Mengjie Ren,
Hao Xiang,
Peilin Liu,
Yaojie Lu,
Ben He,
Xianpei Han,
Le Sun,
Hongyu Lin,
Bowen Yu
Abstract:
Alignment is the most critical step in building large language models (LLMs) that meet human needs. With the rapid development of LLMs gradually surpassing human capabilities, traditional alignment methods based on human-annotation are increasingly unable to meet the scalability demands. Therefore, there is an urgent need to explore new sources of automated alignment signals and technical approach…
▽ More
Alignment is the most critical step in building large language models (LLMs) that meet human needs. With the rapid development of LLMs gradually surpassing human capabilities, traditional alignment methods based on human-annotation are increasingly unable to meet the scalability demands. Therefore, there is an urgent need to explore new sources of automated alignment signals and technical approaches. In this paper, we systematically review the recently emerging methods of automated alignment, attempting to explore how to achieve effective, scalable, automated alignment once the capabilities of LLMs exceed those of humans. Specifically, we categorize existing automated alignment methods into 4 major categories based on the sources of alignment signals and discuss the current status and potential development of each category. Additionally, we explore the underlying mechanisms that enable automated alignment and discuss the essential factors that make automated alignment technologies feasible and effective from the fundamental role of alignment.
△ Less
Submitted 3 September, 2024; v1 submitted 3 June, 2024;
originally announced June 2024.
-
Minimum-Norm Interpolation Under Covariate Shift
Authors:
Neil Mallinar,
Austin Zane,
Spencer Frei,
Bin Yu
Abstract:
Transfer learning is a critical part of real-world machine learning deployments and has been extensively studied in experimental works with overparameterized neural networks. However, even in the simplest setting of linear regression a notable gap still exists in the theoretical understanding of transfer learning. In-distribution research on high-dimensional linear regression has led to the identi…
▽ More
Transfer learning is a critical part of real-world machine learning deployments and has been extensively studied in experimental works with overparameterized neural networks. However, even in the simplest setting of linear regression a notable gap still exists in the theoretical understanding of transfer learning. In-distribution research on high-dimensional linear regression has led to the identification of a phenomenon known as \textit{benign overfitting}, in which linear interpolators overfit to noisy training labels and yet still generalize well. This behavior occurs under specific conditions on the source covariance matrix and input data dimension. Therefore, it is natural to wonder how such high-dimensional linear models behave under transfer learning. We prove the first non-asymptotic excess risk bounds for benignly-overfit linear interpolators in the transfer learning setting. From our analysis, we propose a taxonomy of \textit{beneficial} and \textit{malignant} covariate shifts based on the degree of overparameterization. We follow our analysis with empirical studies that show these beneficial and malignant covariate shifts for linear interpolators on real image data, and for fully-connected neural networks in settings where the input data dimension is larger than the training sample size.
△ Less
Submitted 17 July, 2024; v1 submitted 30 March, 2024;
originally announced April 2024.
-
Designing a Data Science simulation with MERITS: A Primer
Authors:
Corrine F Elliott,
James PC Duncan,
Tiffany M Tang,
Merle Behr,
Karl Kumbier,
Bin Yu
Abstract:
Simulations play a crucial role in the modern scientific process. Yet despite (or due to) this ubiquity, the Data Science community shares neither a comprehensive definition for a "high-quality" study nor a consolidated guide to designing one. Inspired by the Predictability-Computability-Stability (PCS) framework for 'veridical' Data Science, we propose six MERITS that a simulation study should sa…
▽ More
Simulations play a crucial role in the modern scientific process. Yet despite (or due to) this ubiquity, the Data Science community shares neither a comprehensive definition for a "high-quality" study nor a consolidated guide to designing one. Inspired by the Predictability-Computability-Stability (PCS) framework for 'veridical' Data Science, we propose six MERITS that a simulation study should satisfy. (Modularity and Efficiency support the computability of a study, encouraging clean and flexible implementation. Realism and Stability address the conceptualization of the research problem: How well does a study predict reality, such that its conclusions generalize to new data/contexts? Finally, Intuitiveness and Transparency encourage good communication and trustworthiness of study design and results.) Drawing an analogy between simulation and cooking, we moreover offer (a) a conceptual framework for thinking about the anatomy of a simulation 'recipe'; (b) a baker's dozen in guidelines to aid the Data Science practitioner in designing one; and (c) a case study demonstrating the practical utility of our framework by using it to autopsy a preexisting simulation study. With this "PCS primer" for high-quality Data Science simulation, we seek to distill and enrich the best practices of simulation across disciplines into a cohesive recipe for trustworthy, veridical Data Science.
△ Less
Submitted 15 May, 2025; v1 submitted 13 March, 2024;
originally announced March 2024.
-
Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency
Authors:
Jingfeng Wu,
Peter L. Bartlett,
Matus Telgarsky,
Bin Yu
Abstract:
We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the constant stepsize $η$ is so large that the loss initially oscillates. We show that GD exits this initial oscillatory phase rapidly -- in $\mathcal{O}(η)$ steps -- and subsequently achieves an $\tilde{\mathcal{O}}(1 / (ηt) )$ convergence rate after $t$ additional steps.…
▽ More
We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the constant stepsize $η$ is so large that the loss initially oscillates. We show that GD exits this initial oscillatory phase rapidly -- in $\mathcal{O}(η)$ steps -- and subsequently achieves an $\tilde{\mathcal{O}}(1 / (ηt) )$ convergence rate after $t$ additional steps. Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $\tilde{\mathcal{O}}(1/T^2)$ with an aggressive stepsize $η:= Θ( T)$, without any use of momentum or variable stepsize schedulers. Our proof technique is versatile and also handles general classification loss functions (where exponential tails are needed for the $\tilde{\mathcal{O}}(1/T^2)$ acceleration), nonlinear predictors in the neural tangent kernel regime, and online stochastic gradient descent (SGD) with a large stepsize, under suitable separability conditions.
△ Less
Submitted 9 June, 2024; v1 submitted 24 February, 2024;
originally announced February 2024.
-
LoRA+: Efficient Low Rank Adaptation of Large Models
Authors:
Soufiane Hayou,
Nikhil Ghosh,
Bin Yu
Abstract:
In this paper, we show that Low Rank Adaptation (LoRA) as originally introduced in Hu et al. (2021) leads to suboptimal finetuning of models with large width (embedding dimension). This is due to the fact that adapter matrices A and B in LoRA are updated with the same learning rate. Using scaling arguments for large width networks, we demonstrate that using the same learning rate for A and B does…
▽ More
In this paper, we show that Low Rank Adaptation (LoRA) as originally introduced in Hu et al. (2021) leads to suboptimal finetuning of models with large width (embedding dimension). This is due to the fact that adapter matrices A and B in LoRA are updated with the same learning rate. Using scaling arguments for large width networks, we demonstrate that using the same learning rate for A and B does not allow efficient feature learning. We then show that this suboptimality of LoRA can be corrected simply by setting different learning rates for the LoRA adapter matrices A and B with a well-chosen ratio. We call this proposed algorithm LoRA$+$. In our extensive experiments, LoRA$+$ improves performance (1-2 $\%$ improvements) and finetuning speed (up to $\sim$ 2X SpeedUp), at the same computational cost as LoRA.
△ Less
Submitted 4 July, 2024; v1 submitted 19 February, 2024;
originally announced February 2024.
-
Quantifying and mitigating the impact of label errors on model disparity metrics
Authors:
Julius Adebayo,
Melissa Hall,
Bowen Yu,
Bobbie Chern
Abstract:
Errors in labels obtained via human annotation adversely affect a model's performance. Existing approaches propose ways to mitigate the effect of label error on a model's downstream accuracy, yet little is known about its impact on a model's disparity metrics. Here we study the effect of label error on a model's disparity metrics. We empirically characterize how varying levels of label error, in b…
▽ More
Errors in labels obtained via human annotation adversely affect a model's performance. Existing approaches propose ways to mitigate the effect of label error on a model's downstream accuracy, yet little is known about its impact on a model's disparity metrics. Here we study the effect of label error on a model's disparity metrics. We empirically characterize how varying levels of label error, in both training and test data, affect these disparity metrics. We find that group calibration and other metrics are sensitive to train-time and test-time label error -- particularly for minority groups. This disparate effect persists even for models trained with noise-aware algorithms. To mitigate the impact of training-time label error, we present an approach to estimate the influence of a training input's label on a model's group disparity metric. We empirically assess the proposed approach on a variety of datasets and find significant improvement, compared to alternative approaches, in identifying training inputs that improve a model's disparity metric. We complement the approach with an automatic relabel-and-finetune scheme that produces updated models with, provably, improved group calibration error.
△ Less
Submitted 3 October, 2023;
originally announced October 2023.
-
Prominent Roles of Conditionally Invariant Components in Domain Adaptation: Theory and Algorithms
Authors:
Keru Wu,
Yuansi Chen,
Wooseok Ha,
Bin Yu
Abstract:
Domain adaptation (DA) is a statistical learning problem that arises when the distribution of the source data used to train a model differs from that of the target data used to evaluate the model. While many DA algorithms have demonstrated considerable empirical success, blindly applying these algorithms can often lead to worse performance on new datasets. To address this, it is crucial to clarify…
▽ More
Domain adaptation (DA) is a statistical learning problem that arises when the distribution of the source data used to train a model differs from that of the target data used to evaluate the model. While many DA algorithms have demonstrated considerable empirical success, blindly applying these algorithms can often lead to worse performance on new datasets. To address this, it is crucial to clarify the assumptions under which a DA algorithm has good target performance. In this work, we focus on the assumption of the presence of conditionally invariant components (CICs), which are relevant for prediction and remain conditionally invariant across the source and target data. We demonstrate that CICs, which can be estimated through conditional invariant penalty (CIP), play three prominent roles in providing target risk guarantees in DA. First, we propose a new algorithm based on CICs, importance-weighted conditional invariant penalty (IW-CIP), which has target risk guarantees beyond simple settings such as covariate shift and label shift. Second, we show that CICs help identify large discrepancies between source and target risks of other DA algorithms. Finally, we demonstrate that incorporating CICs into the domain invariant projection (DIP) algorithm can address its failure scenario caused by label-flipping features. We support our new algorithms and theoretical findings via numerical experiments on synthetic data, MNIST, CelebA, Camelyon17, and DomainNet datasets.
△ Less
Submitted 8 July, 2024; v1 submitted 19 September, 2023;
originally announced September 2023.
-
On the Role of Non-Localities in Fundamental Diagram Estimation
Authors:
Jing Liu,
Fangfang Zheng,
Boxi Yu,
Saif Jabari
Abstract:
We consider the role of non-localities in speed-density data used to fit fundamental diagrams from vehicle trajectories. We demonstrate that the use of anticipated densities results in a clear classification of speed-density data into stationary and non-stationary points, namely, acceleration and deceleration regimes and their separating boundary. The separating boundary represents a locus of stat…
▽ More
We consider the role of non-localities in speed-density data used to fit fundamental diagrams from vehicle trajectories. We demonstrate that the use of anticipated densities results in a clear classification of speed-density data into stationary and non-stationary points, namely, acceleration and deceleration regimes and their separating boundary. The separating boundary represents a locus of stationary traffic states, i.e., the fundamental diagram. To fit fundamental diagrams, we develop an enhanced cross entropy minimization method that honors equilibrium traffic physics. We illustrate the effectiveness of our proposed approach by comparing it with the traditional approach that uses local speed-density states and least squares estimation. Our experiments show that the separating boundary in our approach is invariant to varying trajectory samples within the same spatio-temporal region, providing further evidence that the separating boundary is indeed a locus of stationary traffic states.
△ Less
Submitted 31 August, 2023;
originally announced August 2023.
-
The Effect of SGD Batch Size on Autoencoder Learning: Sparsity, Sharpness, and Feature Learning
Authors:
Nikhil Ghosh,
Spencer Frei,
Wooseok Ha,
Bin Yu
Abstract:
In this work, we investigate the dynamics of stochastic gradient descent (SGD) when training a single-neuron autoencoder with linear or ReLU activation on orthogonal data. We show that for this non-convex problem, randomly initialized SGD with a constant step size successfully finds a global minimum for any batch size choice. However, the particular global minimum found depends upon the batch size…
▽ More
In this work, we investigate the dynamics of stochastic gradient descent (SGD) when training a single-neuron autoencoder with linear or ReLU activation on orthogonal data. We show that for this non-convex problem, randomly initialized SGD with a constant step size successfully finds a global minimum for any batch size choice. However, the particular global minimum found depends upon the batch size. In the full-batch setting, we show that the solution is dense (i.e., not sparse) and is highly aligned with its initialized direction, showing that relatively little feature learning occurs. On the other hand, for any batch size strictly smaller than the number of samples, SGD finds a global minimum which is sparse and nearly orthogonal to its initialization, showing that the randomness of stochastic gradients induces a qualitatively different type of "feature selection" in this setting. Moreover, if we measure the sharpness of the minimum by the trace of the Hessian, the minima found with full batch gradient descent are flatter than those found with strictly smaller batch sizes, in contrast to previous works which suggest that large batches lead to sharper minima. To prove convergence of SGD with a constant step size, we introduce a powerful tool from the theory of non-homogeneous random walks which may be of independent interest.
△ Less
Submitted 6 August, 2023;
originally announced August 2023.
-
MDI+: A Flexible Random Forest-Based Feature Importance Framework
Authors:
Abhineet Agarwal,
Ana M. Kenney,
Yan Shuo Tan,
Tiffany M. Tang,
Bin Yu
Abstract:
Mean decrease in impurity (MDI) is a popular feature importance measure for random forests (RFs). We show that the MDI for a feature $X_k$ in each tree in an RF is equivalent to the unnormalized $R^2$ value in a linear regression of the response on the collection of decision stumps that split on $X_k$. We use this interpretation to propose a flexible feature importance framework called MDI+. Speci…
▽ More
Mean decrease in impurity (MDI) is a popular feature importance measure for random forests (RFs). We show that the MDI for a feature $X_k$ in each tree in an RF is equivalent to the unnormalized $R^2$ value in a linear regression of the response on the collection of decision stumps that split on $X_k$. We use this interpretation to propose a flexible feature importance framework called MDI+. Specifically, MDI+ generalizes MDI by allowing the analyst to replace the linear regression model and $R^2$ metric with regularized generalized linear models (GLMs) and metrics better suited for the given data structure. Moreover, MDI+ incorporates additional features to mitigate known biases of decision trees against additive or smooth models. We further provide guidance on how practitioners can choose an appropriate GLM and metric based upon the Predictability, Computability, Stability framework for veridical data science. Extensive data-inspired simulations show that MDI+ significantly outperforms popular feature importance measures in identifying signal features. We also apply MDI+ to two real-world case studies on drug response prediction and breast cancer subtype classification. We show that MDI+ extracts well-established predictive genes with significantly greater stability compared to existing feature importance measures. All code and models are released in a full-fledged python package on Github.
△ Less
Submitted 4 July, 2023;
originally announced July 2023.
-
Estimands in Real-World Evidence Studies
Authors:
Jie Chen,
Daniel Scharfstein,
Hongwei Wang,
Binbing Yu,
Yang Song,
Weili He,
John Scott,
Xiwu Lin,
Hana Lee
Abstract:
A Real-World Evidence (RWE) Scientific Working Group (SWG) of the American Statistical Association Biopharmaceutical Section (ASA BIOP) has been reviewing statistical considerations for the generation of RWE to support regulatory decision-making. As part of the effort, the working group is addressing estimands in RWE studies. Constructing the right estimand -- the target of estimation -- which ref…
▽ More
A Real-World Evidence (RWE) Scientific Working Group (SWG) of the American Statistical Association Biopharmaceutical Section (ASA BIOP) has been reviewing statistical considerations for the generation of RWE to support regulatory decision-making. As part of the effort, the working group is addressing estimands in RWE studies. Constructing the right estimand -- the target of estimation -- which reflects the research question and the study objective, is one of the key components in formulating a clinical study. ICH E9(R1) describes statistical principles for constructing estimands in clinical trials with a focus on five attributes -- population, treatment, endpoints, intercurrent events, and population-level summary. However, defining estimands for clinical studies using real-world data (RWD), i.e., RWE studies, requires additional considerations due to, for example, heterogeneity of study population, complexity of treatment regimes, different types and patterns of intercurrent events, and complexities in choosing study endpoints. This paper reviews the essential components of estimands and causal inference framework, discusses considerations in constructing estimands for RWE studies, highlights similarities and differences in traditional clinical trial and RWE study estimands, and provides a roadmap for choosing appropriate estimands for RWE studies.
△ Less
Submitted 30 June, 2023;
originally announced July 2023.
-
A Mixing Time Lower Bound for a Simplified Version of BART
Authors:
Omer Ronen,
Theo Saarinen,
Yan Shuo Tan,
James Duncan,
Bin Yu
Abstract:
Bayesian Additive Regression Trees (BART) is a popular Bayesian non-parametric regression algorithm. The posterior is a distribution over sums of decision trees, and predictions are made by averaging approximate samples from the posterior.
The combination of strong predictive performance and the ability to provide uncertainty measures has led BART to be commonly used in the social sciences, bios…
▽ More
Bayesian Additive Regression Trees (BART) is a popular Bayesian non-parametric regression algorithm. The posterior is a distribution over sums of decision trees, and predictions are made by averaging approximate samples from the posterior.
The combination of strong predictive performance and the ability to provide uncertainty measures has led BART to be commonly used in the social sciences, biostatistics, and causal inference.
BART uses Markov Chain Monte Carlo (MCMC) to obtain approximate posterior samples over a parameterized space of sums of trees, but it has often been observed that the chains are slow to mix.
In this paper, we provide the first lower bound on the mixing time for a simplified version of BART in which we reduce the sum to a single tree and use a subset of the possible moves for the MCMC proposal distribution. Our lower bound for the mixing time grows exponentially with the number of data points.
Inspired by this new connection between the mixing time and the number of data points, we perform rigorous simulations on BART. We show qualitatively that BART's mixing time increases with the number of data points.
The slow mixing time of the simplified BART suggests a large variation between different runs of the simplified BART algorithm and a similar large variation is known for BART in the literature. This large variation could result in a lack of stability in the models, predictions, and posterior intervals obtained from the BART MCMC samples.
Our lower bound and simulations suggest increasing the number of chains with the number of data points.
△ Less
Submitted 17 October, 2022;
originally announced October 2022.
-
Same Root Different Leaves: Time Series and Cross-Sectional Methods in Panel Data
Authors:
Dennis Shen,
Peng Ding,
Jasjeet Sekhon,
Bin Yu
Abstract:
A central goal in social science is to evaluate the causal effect of a policy. One dominant approach is through panel data analysis in which the behaviors of multiple units are observed over time. The information across time and space motivates two general approaches: (i) horizontal regression (i.e., unconfoundedness), which exploits time series patterns, and (ii) vertical regression (e.g., synthe…
▽ More
A central goal in social science is to evaluate the causal effect of a policy. One dominant approach is through panel data analysis in which the behaviors of multiple units are observed over time. The information across time and space motivates two general approaches: (i) horizontal regression (i.e., unconfoundedness), which exploits time series patterns, and (ii) vertical regression (e.g., synthetic controls), which exploits cross-sectional patterns. Conventional wisdom states that the two approaches are fundamentally different. We establish this position to be partly false for estimation but generally true for inference. In particular, we prove that both approaches yield identical point estimates under several standard settings. For the same point estimate, however, each approach quantifies uncertainty with respect to a distinct estimand. In turn, the confidence interval developed for one estimand may have incorrect coverage for another. This emphasizes that the source of randomness that researchers assume has direct implications for the accuracy of inference.
△ Less
Submitted 8 October, 2022; v1 submitted 29 July, 2022;
originally announced July 2022.
-
Group Probability-Weighted Tree Sums for Interpretable Modeling of Heterogeneous Data
Authors:
Keyan Nasseri,
Chandan Singh,
James Duncan,
Aaron Kornblith,
Bin Yu
Abstract:
Machine learning in high-stakes domains, such as healthcare, faces two critical challenges: (1) generalizing to diverse data distributions given limited training data while (2) maintaining interpretability. To address these challenges, we propose an instance-weighted tree-sum method that effectively pools data across diverse groups to output a concise, rule-based model. Given distinct groups of in…
▽ More
Machine learning in high-stakes domains, such as healthcare, faces two critical challenges: (1) generalizing to diverse data distributions given limited training data while (2) maintaining interpretability. To address these challenges, we propose an instance-weighted tree-sum method that effectively pools data across diverse groups to output a concise, rule-based model. Given distinct groups of instances in a dataset (e.g., medical patients grouped by age or treatment site), our method first estimates group membership probabilities for each instance. Then, it uses these estimates as instance weights in FIGS (Tan et al. 2022), to grow a set of decision trees whose values sum to the final prediction. We call this new method Group Probability-Weighted Tree Sums (G-FIGS). G-FIGS achieves state-of-the-art prediction performance on important clinical datasets; e.g., holding the level of sensitivity fixed at 92%, G-FIGS increases specificity for identifying cervical spine injury by up to 10% over CART and up to 3% over FIGS alone, with larger gains at higher sensitivity levels. By keeping the total number of rules below 16 in FIGS, the final models remain interpretable, and we find that their rules match medical domain expertise. All code, data, and models are released on Github.
△ Less
Submitted 30 May, 2022;
originally announced May 2022.
-
Hierarchical Shrinkage: improving the accuracy and interpretability of tree-based methods
Authors:
Abhineet Agarwal,
Yan Shuo Tan,
Omer Ronen,
Chandan Singh,
Bin Yu
Abstract:
Tree-based models such as decision trees and random forests (RF) are a cornerstone of modern machine-learning practice. To mitigate overfitting, trees are typically regularized by a variety of techniques that modify their structure (e.g. pruning). We introduce Hierarchical Shrinkage (HS), a post-hoc algorithm that does not modify the tree structure, and instead regularizes the tree by shrinking th…
▽ More
Tree-based models such as decision trees and random forests (RF) are a cornerstone of modern machine-learning practice. To mitigate overfitting, trees are typically regularized by a variety of techniques that modify their structure (e.g. pruning). We introduce Hierarchical Shrinkage (HS), a post-hoc algorithm that does not modify the tree structure, and instead regularizes the tree by shrinking the prediction over each node towards the sample means of its ancestors. The amount of shrinkage is controlled by a single regularization parameter and the number of data points in each ancestor. Since HS is a post-hoc method, it is extremely fast, compatible with any tree growing algorithm, and can be used synergistically with other regularization techniques. Extensive experiments over a wide variety of real-world datasets show that HS substantially increases the predictive performance of decision trees, even when used in conjunction with other regularization techniques. Moreover, we find that applying HS to each tree in an RF often improves accuracy, as well as its interpretability by simplifying and stabilizing its decision boundaries and SHAP values. We further explain the success of HS in improving prediction performance by showing its equivalence to ridge regression on a (supervised) basis constructed of decision stumps associated with the internal nodes of a tree. All code and models are released in a full-fledged package available on Github (github.com/csinva/imodels)
△ Less
Submitted 1 February, 2022;
originally announced February 2022.
-
Fast Interpretable Greedy-Tree Sums
Authors:
Yan Shuo Tan,
Chandan Singh,
Keyan Nasseri,
Abhineet Agarwal,
James Duncan,
Omer Ronen,
Matthew Epland,
Aaron Kornblith,
Bin Yu
Abstract:
Modern machine learning has achieved impressive prediction performance, but often sacrifices interpretability, a critical consideration in high-stakes domains such as medicine. In such settings, practitioners often use highly interpretable decision tree models, but these suffer from inductive bias against additive structure. To overcome this bias, we propose Fast Interpretable Greedy-Tree Sums (FI…
▽ More
Modern machine learning has achieved impressive prediction performance, but often sacrifices interpretability, a critical consideration in high-stakes domains such as medicine. In such settings, practitioners often use highly interpretable decision tree models, but these suffer from inductive bias against additive structure. To overcome this bias, we propose Fast Interpretable Greedy-Tree Sums (FIGS), which generalizes the CART algorithm to simultaneously grow a flexible number of trees in summation. By combining logical rules with addition, FIGS is able to adapt to additive structure while remaining highly interpretable. Extensive experiments on real-world datasets show that FIGS achieves state-of-the-art prediction performance. To demonstrate the usefulness of FIGS in high-stakes domains, we adapt FIGS to learn clinical decision instruments (CDIs), which are tools for guiding clinical decision-making. Specifically, we introduce a variant of FIGS known as G-FIGS that accounts for the heterogeneity in medical data. G-FIGS derives CDIs that reflect domain knowledge and enjoy improved specificity (by up to 20% over CART) without sacrificing sensitivity or interpretability. To provide further insight into FIGS, we prove that FIGS learns components of additive models, a property we refer to as disentanglement. Further, we show (under oracle conditions) that unconstrained tree-sum models leverage disentanglement to generalize more efficiently than single decision tree models when fitted to additive regression functions. Finally, to avoid overfitting with an unconstrained number of splits, we develop Bagging-FIGS, an ensemble version of FIGS that borrows the variance reduction techniques of random forests. Bagging-FIGS enjoys competitive performance with random forests and XGBoost on real-world datasets.
△ Less
Submitted 8 July, 2023; v1 submitted 27 January, 2022;
originally announced January 2022.
-
Deep Probability Estimation
Authors:
Sheng Liu,
Aakash Kaku,
Weicheng Zhu,
Matan Leibovich,
Sreyas Mohan,
Boyang Yu,
Haoxiang Huang,
Laure Zanna,
Narges Razavian,
Jonathan Niles-Weed,
Carlos Fernandez-Granda
Abstract:
Reliable probability estimation is of crucial importance in many real-world applications where there is inherent (aleatoric) uncertainty. Probability-estimation models are trained on observed outcomes (e.g. whether it has rained or not, or whether a patient has died or not), because the ground-truth probabilities of the events of interest are typically unknown. The problem is therefore analogous t…
▽ More
Reliable probability estimation is of crucial importance in many real-world applications where there is inherent (aleatoric) uncertainty. Probability-estimation models are trained on observed outcomes (e.g. whether it has rained or not, or whether a patient has died or not), because the ground-truth probabilities of the events of interest are typically unknown. The problem is therefore analogous to binary classification, with the difference that the objective is to estimate probabilities rather than predicting the specific outcome. This work investigates probability estimation from high-dimensional data using deep neural networks. There exist several methods to improve the probabilities generated by these models but they mostly focus on model (epistemic) uncertainty. For problems with inherent uncertainty, it is challenging to evaluate performance without access to ground-truth probabilities. To address this, we build a synthetic dataset to study and compare different computable metrics. We evaluate existing methods on the synthetic data as well as on three real-world probability estimation tasks, all of which involve inherent uncertainty: precipitation forecasting from radar images, predicting cancer patient survival from histopathology images, and predicting car crashes from dashcam videos. We also give a theoretical analysis of a model for high-dimensional probability estimation which reproduces several of the phenomena evinced in our experiments. Finally, we propose a new method for probability estimation using neural networks, which modifies the training process to promote output probabilities that are consistent with empirical probabilities computed from the data. The method outperforms existing approaches on most metrics on the simulated as well as real-world data.
△ Less
Submitted 11 October, 2022; v1 submitted 20 November, 2021;
originally announced November 2021.
-
The Three Stages of Learning Dynamics in High-Dimensional Kernel Methods
Authors:
Nikhil Ghosh,
Song Mei,
Bin Yu
Abstract:
To understand how deep learning works, it is crucial to understand the training dynamics of neural networks. Several interesting hypotheses about these dynamics have been made based on empirically observed phenomena, but there exists a limited theoretical understanding of when and why such phenomena occur.
In this paper, we consider the training dynamics of gradient flow on kernel least-squares…
▽ More
To understand how deep learning works, it is crucial to understand the training dynamics of neural networks. Several interesting hypotheses about these dynamics have been made based on empirically observed phenomena, but there exists a limited theoretical understanding of when and why such phenomena occur.
In this paper, we consider the training dynamics of gradient flow on kernel least-squares objectives, which is a limiting dynamics of SGD trained neural networks. Using precise high-dimensional asymptotics, we characterize the dynamics of the fitted model in two "worlds": in the Oracle World the model is trained on the population distribution and in the Empirical World the model is trained on a sampled dataset. We show that under mild conditions on the kernel and $L^2$ target regression function the training dynamics undergo three stages characterized by the behaviors of the models in the two worlds. Our theoretical results also mathematically formalize some interesting deep learning phenomena. Specifically, in our setting we show that SGD progressively learns more complex functions and that there is a "deep bootstrap" phenomenon: during the second stage, the test error of both worlds remain close despite the empirical training error being much smaller. Finally, we give a concrete example comparing the dynamics of two different kernels which shows that faster training is not necessary for better generalization.
△ Less
Submitted 13 November, 2021;
originally announced November 2021.
-
A cautionary tale on fitting decision trees to data from additive models: generalization lower bounds
Authors:
Yan Shuo Tan,
Abhineet Agarwal,
Bin Yu
Abstract:
Decision trees are important both as interpretable models amenable to high-stakes decision-making, and as building blocks of ensemble methods such as random forests and gradient boosting. Their statistical properties, however, are not well understood. The most cited prior works have focused on deriving pointwise consistency guarantees for CART in a classical nonparametric regression setting. We ta…
▽ More
Decision trees are important both as interpretable models amenable to high-stakes decision-making, and as building blocks of ensemble methods such as random forests and gradient boosting. Their statistical properties, however, are not well understood. The most cited prior works have focused on deriving pointwise consistency guarantees for CART in a classical nonparametric regression setting. We take a different approach, and advocate studying the generalization performance of decision trees with respect to different generative regression models. This allows us to elicit their inductive bias, that is, the assumptions the algorithms make (or do not make) to generalize to new data, thereby guiding practitioners on when and how to apply these methods. In this paper, we focus on sparse additive generative models, which have both low statistical complexity and some nonparametric flexibility. We prove a sharp squared error generalization lower bound for a large class of decision tree algorithms fitted to sparse additive models with $C^1$ component functions. This bound is surprisingly much worse than the minimax rate for estimating such sparse additive models. The inefficiency is due not to greediness, but to the loss in power for detecting global structure when we average responses solely over each leaf, an observation that suggests opportunities to improve tree-based algorithms, for example, by hierarchical shrinkage. To prove these bounds, we develop new technical machinery, establishing a novel connection between decision tree estimation and rate-distortion theory, a sub-field of information theory.
△ Less
Submitted 18 October, 2021;
originally announced October 2021.
-
Towards Robust Waveform-Based Acoustic Models
Authors:
Dino Oglic,
Zoran Cvetkovic,
Peter Sollich,
Steve Renals,
Bin Yu
Abstract:
We study the problem of learning robust acoustic models in adverse environments, characterized by a significant mismatch between training and test conditions. This problem is of paramount importance for the deployment of speech recognition systems that need to perform well in unseen environments. First, we characterize data augmentation theoretically as an instance of vicinal risk minimization, wh…
▽ More
We study the problem of learning robust acoustic models in adverse environments, characterized by a significant mismatch between training and test conditions. This problem is of paramount importance for the deployment of speech recognition systems that need to perform well in unseen environments. First, we characterize data augmentation theoretically as an instance of vicinal risk minimization, which aims at improving risk estimates during training by replacing the delta functions that define the empirical density over the input space with an approximation of the marginal population density in the vicinity of the training samples. More specifically, we assume that local neighborhoods centered at training samples can be approximated using a mixture of Gaussians, and demonstrate theoretically that this can incorporate robust inductive bias into the learning process. We then specify the individual mixture components implicitly via data augmentation schemes, designed to address common sources of spurious correlations in acoustic models. To avoid potential confounding effects on robustness due to information loss, which has been associated with standard feature extraction techniques (e.g., FBANK and MFCC features), we focus on the waveform-based setting. Our empirical results show that the approach can generalize to unseen noise conditions, with 150% relative improvement in out-of-distribution generalization compared to training using the standard risk minimization principle. Moreover, the results demonstrate competitive performance relative to models learned using a training sample designed to match the acoustic conditions characteristic of test utterances.
△ Less
Submitted 29 June, 2022; v1 submitted 16 October, 2021;
originally announced October 2021.
-
Seven Principles for Rapid-Response Data Science: Lessons Learned from Covid-19 Forecasting
Authors:
Bin Yu,
Chandan Singh
Abstract:
In this article, we take a step back to distill seven principles out of our experience in the spring of 2020, when our 12-person rapid-response team used skills of data science and beyond to help distribute Covid PPE. This process included tapping into domain knowledge of epidemiology and medical logistics chains, curating a relevant data repository, developing models for short-term county-level d…
▽ More
In this article, we take a step back to distill seven principles out of our experience in the spring of 2020, when our 12-person rapid-response team used skills of data science and beyond to help distribute Covid PPE. This process included tapping into domain knowledge of epidemiology and medical logistics chains, curating a relevant data repository, developing models for short-term county-level death forecasting in the US, and building a website for sharing visualization (an automated AI machine). The principles are described in the context of working with Response4Life, a then-new nonprofit organization, to illustrate their necessity. Many of these principles overlap with those in standard data-science teams, but an emphasis is put on dealing with problems that require rapid response, often resembling agile software development.
△ Less
Submitted 29 March, 2022; v1 submitted 18 August, 2021;
originally announced August 2021.
-
Interpreting and improving deep-learning models with reality checks
Authors:
Chandan Singh,
Wooseok Ha,
Bin Yu
Abstract:
Recent deep-learning models have achieved impressive predictive performance by learning complex functions of many variables, often at the cost of interpretability. This chapter covers recent work aiming to interpret models by attributing importance to features and feature groups for a single prediction. Importantly, the proposed attributions assign importance to interactions between features, in a…
▽ More
Recent deep-learning models have achieved impressive predictive performance by learning complex functions of many variables, often at the cost of interpretability. This chapter covers recent work aiming to interpret models by attributing importance to features and feature groups for a single prediction. Importantly, the proposed attributions assign importance to interactions between features, in addition to features in isolation. These attributions are shown to yield insights across real-world domains, including bio-imaging, cosmology image and natural-language processing. We then show how these attributions can be used to directly improve the generalization of a neural network or to distill it into a simple model. Throughout the chapter, we emphasize the use of reality checks to scrutinize the proposed interpretation techniques.
△ Less
Submitted 18 August, 2021; v1 submitted 15 August, 2021;
originally announced August 2021.
-
Divergent Effects of Factors on Crashes under Autonomous and Conventional Driving Modes Using A Hierarchical Bayesian Approach
Authors:
Weixi Ren,
Bo Yu,
Yuren Chen,
Kun Gao,
Shan Bao
Abstract:
Influencing factors on crashes involved with autonomous vehicles (AVs) have been paid increasing attention. However, there is a lack of comparative analyses between influencing factors on crashes of AVs and human-driven vehicles. To fill this research gap, the study aims to explore the divergent effects of factors on crashes under autonomous and conventional driving modes. This study obtained 154…
▽ More
Influencing factors on crashes involved with autonomous vehicles (AVs) have been paid increasing attention. However, there is a lack of comparative analyses between influencing factors on crashes of AVs and human-driven vehicles. To fill this research gap, the study aims to explore the divergent effects of factors on crashes under autonomous and conventional driving modes. This study obtained 154 publicly available autonomous vehicle crash data (70 for the autonomous driving mode and 84 for the conventional driving mode), and 36 explanatory variables were extracted from three categories, including environment, roads, and vehicles. Then, a hierarchical Bayesian approach was applied to analyze the impacting factors on crash type and severity under both driving modes. The results showed that some factors affected both driving modes, but their degrees were different. For example, the presence of turning movement had a greater impact on the crash severity under the conventional driving mode, while the presence of turning movement led to a larger decrease in the likelihood of rear-end crashes under the autonomous driving mode. More influencing factors only had a significant impact on one of the driving modes. For example, in the autonomous driving mode, two sidewalks decreased the severity of crashes, and on-street parking was positively associated with rear-end crashes, but they were not significant in the conventional driving mode. This study could contribute to the understanding and development of autonomous driving systems and the better coordination between autonomous driving and conventional driving.
△ Less
Submitted 7 April, 2022; v1 submitted 5 August, 2021;
originally announced August 2021.
-
Adaptive wavelet distillation from neural networks through interpretations
Authors:
Wooseok Ha,
Chandan Singh,
Francois Lanusse,
Srigokul Upadhyayula,
Bin Yu
Abstract:
Recent deep-learning models have achieved impressive prediction performance, but often sacrifice interpretability and computational efficiency. Interpretability is crucial in many disciplines, such as science and medicine, where models must be carefully vetted or where interpretation is the goal itself. Moreover, interpretable models are concise and often yield computational efficiency. Here, we p…
▽ More
Recent deep-learning models have achieved impressive prediction performance, but often sacrifice interpretability and computational efficiency. Interpretability is crucial in many disciplines, such as science and medicine, where models must be carefully vetted or where interpretation is the goal itself. Moreover, interpretable models are concise and often yield computational efficiency. Here, we propose adaptive wavelet distillation (AWD), a method which aims to distill information from a trained neural network into a wavelet transform. Specifically, AWD penalizes feature attributions of a neural network in the wavelet domain to learn an effective multi-resolution wavelet transform. The resulting model is highly predictive, concise, computationally efficient, and has properties (such as a multi-scale structure) which make it easy to interpret. In close collaboration with domain experts, we showcase how AWD addresses challenges in two real-world settings: cosmological parameter inference and molecular-partner prediction. In both cases, AWD yields a scientifically interpretable and concise model which gives predictive performance better than state-of-the-art neural networks. Moreover, AWD identifies predictive features that are scientifically meaningful in the context of respective domains. All code and models are released in a full-fledged package available on Github (https://github.com/Yu-Group/adaptive-wavelets).
△ Less
Submitted 26 August, 2021; v1 submitted 19 July, 2021;
originally announced July 2021.
-
Shape-Preserving Dimensionality Reduction : An Algorithm and Measures of Topological Equivalence
Authors:
Byeongsu Yu,
Kisung You
Abstract:
We introduce a linear dimensionality reduction technique preserving topological features via persistent homology. The method is designed to find linear projection $L$ which preserves the persistent diagram of a point cloud $\mathbb{X}$ via simulated annealing. The projection $L$ induces a set of canonical simplicial maps from the Rips (or ÄŒech) filtration of $\mathbb{X}$ to that of $L\mathbb{X}$.…
▽ More
We introduce a linear dimensionality reduction technique preserving topological features via persistent homology. The method is designed to find linear projection $L$ which preserves the persistent diagram of a point cloud $\mathbb{X}$ via simulated annealing. The projection $L$ induces a set of canonical simplicial maps from the Rips (or Čech) filtration of $\mathbb{X}$ to that of $L\mathbb{X}$. In addition to the distance between persistent diagrams, the projection induces a map between filtrations, called filtration homomorphism. Using the filtration homomorphism, one can measure the difference between shapes of two filtrations directly comparing simplicial complexes with respect to quasi-isomorphism $μ_{\operatorname{quasi-iso}}$ or strong homotopy equivalence $μ_{\operatorname{equiv}}$. These $μ_{\operatorname{quasi-iso}}$ and $μ_{\operatorname{equiv}}$ measures how much portion of corresponding simplicial complexes is quasi-isomorphic or homotopy equivalence respectively. We validate the effectiveness of our framework with simple examples.
△ Less
Submitted 13 June, 2021; v1 submitted 3 June, 2021;
originally announced June 2021.
-
A stability-driven protocol for drug response interpretable prediction (staDRIP)
Authors:
Xiao Li,
Tiffany M. Tang,
Xuewei Wang,
Jean-Pierre A. Kocher,
Bin Yu
Abstract:
Modern cancer -omics and pharmacological data hold great promise in precision cancer medicine for developing individualized patient treatments. However, high heterogeneity and noise in such data pose challenges for predicting the response of cancer cell lines to therapeutic drugs accurately. As a result, arbitrary human judgment calls are rampant throughout the predictive modeling pipeline. In thi…
▽ More
Modern cancer -omics and pharmacological data hold great promise in precision cancer medicine for developing individualized patient treatments. However, high heterogeneity and noise in such data pose challenges for predicting the response of cancer cell lines to therapeutic drugs accurately. As a result, arbitrary human judgment calls are rampant throughout the predictive modeling pipeline. In this work, we develop a transparent stability-driven pipeline for drug response interpretable predictions, or staDRIP, which builds upon the PCS framework for veridical data science (Yu and Kumbier, 2020) and mitigates the impact of human judgment calls. Here we use the PCS framework for the first time in cancer research to extract proteins and genes that are important in predicting the drug responses and stable across appropriate data and model perturbations. Out of the 24 most stable proteins we identified using data from the Cancer Cell Line Encyclopedia (CCLE), 18 have been associated with the drug response or identified as a known or possible drug target in previous literature, demonstrating the utility of our stability-driven pipeline for knowledge discovery in cancer drug response prediction modeling.
△ Less
Submitted 16 November, 2020; v1 submitted 12 November, 2020;
originally announced November 2020.
-
Stable discovery of interpretable subgroups via calibration in causal studies
Authors:
Raaz Dwivedi,
Yan Shuo Tan,
Briton Park,
Mian Wei,
Kevin Horgan,
David Madigan,
Bin Yu
Abstract:
Building on Yu and Kumbier's PCS framework and for randomized experiments, we introduce a novel methodology for Stable Discovery of Interpretable Subgroups via Calibration (StaDISC), with large heterogeneous treatment effects. StaDISC was developed during our re-analysis of the 1999-2000 VIGOR study, an 8076 patient randomized controlled trial (RCT), that compared the risk of adverse events from a…
▽ More
Building on Yu and Kumbier's PCS framework and for randomized experiments, we introduce a novel methodology for Stable Discovery of Interpretable Subgroups via Calibration (StaDISC), with large heterogeneous treatment effects. StaDISC was developed during our re-analysis of the 1999-2000 VIGOR study, an 8076 patient randomized controlled trial (RCT), that compared the risk of adverse events from a then newly approved drug, Rofecoxib (Vioxx), to that from an older drug Naproxen. Vioxx was found to, on average and in comparison to Naproxen, reduce the risk of gastrointestinal (GI) events but increase the risk of thrombotic cardiovascular (CVT) events. Applying StaDISC, we fit 18 popular conditional average treatment effect (CATE) estimators for both outcomes and use calibration to demonstrate their poor global performance. However, they are locally well-calibrated and stable, enabling the identification of patient groups with larger than (estimated) average treatment effects. In fact, StaDISC discovers three clinically interpretable subgroups each for the GI outcome (totaling 29.4% of the study size) and the CVT outcome (totaling 11.0%). Complementary analyses of the found subgroups using the 2001-2004 APPROVe study, a separate independently conducted RCT with 2587 patients, provides further supporting evidence for the promise of StaDISC.
△ Less
Submitted 28 September, 2020; v1 submitted 23 August, 2020;
originally announced August 2020.
-
Revisiting minimum description length complexity in overparameterized models
Authors:
Raaz Dwivedi,
Chandan Singh,
Bin Yu,
Martin J. Wainwright
Abstract:
Complexity is a fundamental concept underlying statistical learning theory that aims to inform generalization performance. Parameter count, while successful in low-dimensional settings, is not well-justified for overparameterized settings when the number of parameters is more than the number of training samples. We revisit complexity measures based on Rissanen's principle of minimum description le…
▽ More
Complexity is a fundamental concept underlying statistical learning theory that aims to inform generalization performance. Parameter count, while successful in low-dimensional settings, is not well-justified for overparameterized settings when the number of parameters is more than the number of training samples. We revisit complexity measures based on Rissanen's principle of minimum description length (MDL) and define a novel MDL-based complexity (MDL-COMP) that remains valid for overparameterized models. MDL-COMP is defined via an optimality criterion over the encodings induced by a good Ridge estimator class. We provide an extensive theoretical characterization of MDL-COMP for linear models and kernel methods and show that it is not just a function of parameter count, but rather a function of the singular values of the design or the kernel matrix and the signal-to-noise ratio. For a linear model with $n$ observations, $d$ parameters, and i.i.d. Gaussian predictors, MDL-COMP scales linearly with $d$ when $d<n$, but the scaling is exponentially smaller -- $\log d$ for $d>n$. For kernel methods, we show that MDL-COMP informs minimax in-sample error, and can decrease as the dimensionality of the input increases. We also prove that MDL-COMP upper bounds the in-sample mean squared error (MSE). Via an array of simulations and real-data experiments, we show that a data-driven Prac-MDL-COMP informs hyper-parameter tuning for optimizing test MSE with ridge regression in limited data settings, sometimes improving upon cross-validation and (always) saving computational costs. Finally, our findings also suggest that the recently observed double decent phenomenons in overparameterized models might be a consequence of the choice of non-ideal estimators.
△ Less
Submitted 12 October, 2023; v1 submitted 17 June, 2020;
originally announced June 2020.
-
Classify and Generate Reciprocally: Simultaneous Positive-Unlabelled Learning and Conditional Generation with Extra Data
Authors:
Bing Yu,
Ke Sun,
He Wang,
Zhouchen Lin,
Zhanxing Zhu
Abstract:
The scarcity of class-labeled data is a ubiquitous bottleneck in many machine learning problems. While abundant unlabeled data typically exist and provide a potential solution, it is highly challenging to exploit them. In this paper, we address this problem by leveraging Positive-Unlabeled~(PU) classification and the conditional generation with extra unlabeled data \emph{simultaneously}. In partic…
▽ More
The scarcity of class-labeled data is a ubiquitous bottleneck in many machine learning problems. While abundant unlabeled data typically exist and provide a potential solution, it is highly challenging to exploit them. In this paper, we address this problem by leveraging Positive-Unlabeled~(PU) classification and the conditional generation with extra unlabeled data \emph{simultaneously}. In particular, we present a novel training framework to jointly target both PU classification and conditional generation when exposed to extra data, especially out-of-distribution unlabeled data, by exploring the interplay between them: 1) enhancing the performance of PU classifiers with the assistance of a novel Classifier-Noise-Invariant Conditional GAN~(CNI-CGAN) that is robust to noisy labels, 2) leveraging extra data with predicted labels from a PU classifier to help the generation. Theoretically, we prove the optimal condition of CNI-CGAN, and experimentally, we conducted extensive evaluations on diverse datasets, verifying the simultaneous improvements in both classification and generation.
△ Less
Submitted 8 February, 2024; v1 submitted 14 June, 2020;
originally announced June 2020.
-
Knowledge Distillation: A Survey
Authors:
Jianping Gou,
Baosheng Yu,
Stephen John Maybank,
Dacheng Tao
Abstract:
In recent years, deep neural networks have been successful in both industry and academia, especially for computer vision tasks. The great success of deep learning is mainly due to its scalability to encode large-scale data and to maneuver billions of model parameters. However, it is a challenge to deploy these cumbersome deep models on devices with limited resources, e.g., mobile phones and embedd…
▽ More
In recent years, deep neural networks have been successful in both industry and academia, especially for computer vision tasks. The great success of deep learning is mainly due to its scalability to encode large-scale data and to maneuver billions of model parameters. However, it is a challenge to deploy these cumbersome deep models on devices with limited resources, e.g., mobile phones and embedded devices, not only because of the high computational complexity but also the large storage requirements. To this end, a variety of model compression and acceleration techniques have been developed. As a representative type of model compression and acceleration, knowledge distillation effectively learns a small student model from a large teacher model. It has received rapid increasing attention from the community. This paper provides a comprehensive survey of knowledge distillation from the perspectives of knowledge categories, training schemes, teacher-student architecture, distillation algorithms, performance comparison and applications. Furthermore, challenges in knowledge distillation are briefly reviewed and comments on future research are discussed and forwarded.
△ Less
Submitted 20 May, 2021; v1 submitted 9 June, 2020;
originally announced June 2020.
-
How to Grow a (Product) Tree: Personalized Category Suggestions for eCommerce Type-Ahead
Authors:
Jacopo Tagliabue,
Bingqing Yu,
Marie Beaulieu
Abstract:
In an attempt to balance precision and recall in the search page, leading digital shops have been effectively nudging users into select category facets as early as in the type-ahead suggestions. In this work, we present SessionPath, a novel neural network model that improves facet suggestions on two counts: first, the model is able to leverage session embeddings to provide scalable personalization…
▽ More
In an attempt to balance precision and recall in the search page, leading digital shops have been effectively nudging users into select category facets as early as in the type-ahead suggestions. In this work, we present SessionPath, a novel neural network model that improves facet suggestions on two counts: first, the model is able to leverage session embeddings to provide scalable personalization; second, SessionPath predicts facets by explicitly producing a probability distribution at each node in the taxonomy path. We benchmark SessionPath on two partnering shops against count-based and neural models, and show how business requirements and model behavior can be combined in a principled way.
△ Less
Submitted 26 May, 2020;
originally announced May 2020.
-
Instability, Computational Efficiency and Statistical Accuracy
Authors:
Nhat Ho,
Koulik Khamaru,
Raaz Dwivedi,
Martin J. Wainwright,
Michael I. Jordan,
Bin Yu
Abstract:
Many statistical estimators are defined as the fixed point of a data-dependent operator, with estimators based on minimizing a cost function being an important special case. The limiting performance of such estimators depends on the properties of the population-level operator in the idealized limit of infinitely many samples. We develop a general framework that yields bounds on statistical accurac…
▽ More
Many statistical estimators are defined as the fixed point of a data-dependent operator, with estimators based on minimizing a cost function being an important special case. The limiting performance of such estimators depends on the properties of the population-level operator in the idealized limit of infinitely many samples. We develop a general framework that yields bounds on statistical accuracy based on the interplay between the deterministic convergence rate of the algorithm at the population level, and its degree of (in)stability when applied to an empirical object based on $n$ samples. Using this framework, we analyze both stable forms of gradient descent and some higher-order and unstable algorithms, including Newton's method and its cubic-regularized variant, as well as the EM algorithm. We provide applications of our general results to several concrete classes of models, including Gaussian mixture estimation, non-linear regression models, and informative non-response models. We exhibit cases in which an unstable algorithm can achieve the same statistical accuracy as a stable algorithm in exponentially fewer steps -- namely, with the number of iterations being reduced from polynomial to logarithmic in sample size $n$.
△ Less
Submitted 20 March, 2022; v1 submitted 22 May, 2020;
originally announced May 2020.
-
Curating a COVID-19 data repository and forecasting county-level death counts in the United States
Authors:
Nick Altieri,
Rebecca L. Barter,
James Duncan,
Raaz Dwivedi,
Karl Kumbier,
Xiao Li,
Robert Netzorg,
Briton Park,
Chandan Singh,
Yan Shuo Tan,
Tiffany Tang,
Yu Wang,
Chao Zhang,
Bin Yu
Abstract:
As the COVID-19 outbreak evolves, accurate forecasting continues to play an extremely important role in informing policy decisions. In this paper, we present our continuous curation of a large data repository containing COVID-19 information from a range of sources. We use this data to develop predictions and corresponding prediction intervals for the short-term trajectory of COVID-19 cumulative de…
▽ More
As the COVID-19 outbreak evolves, accurate forecasting continues to play an extremely important role in informing policy decisions. In this paper, we present our continuous curation of a large data repository containing COVID-19 information from a range of sources. We use this data to develop predictions and corresponding prediction intervals for the short-term trajectory of COVID-19 cumulative death counts at the county-level in the United States up to two weeks ahead. Using data from January 22 to June 20, 2020, we develop and combine multiple forecasts using ensembling techniques, resulting in an ensemble we refer to as Combined Linear and Exponential Predictors (CLEP). Our individual predictors include county-specific exponential and linear predictors, a shared exponential predictor that pools data together across counties, an expanded shared exponential predictor that uses data from neighboring counties, and a demographics-based shared exponential predictor. We use prediction errors from the past five days to assess the uncertainty of our death predictions, resulting in generally-applicable prediction intervals, Maximum (absolute) Error Prediction Intervals (MEPI). MEPI achieves a coverage rate of more than 94% when averaged across counties for predicting cumulative recorded death counts two weeks in the future. Our forecasts are currently being used by the non-profit organization, Response4Life, to determine the medical supply need for individual hospitals and have directly contributed to the distribution of medical supplies across the country. We hope that our forecasts and data repository at https://covidseverity.com can help guide necessary county-specific decision-making and help counties prepare for their continued fight against COVID-19.
△ Less
Submitted 9 August, 2020; v1 submitted 16 May, 2020;
originally announced May 2020.
-
"An Image is Worth a Thousand Features": Scalable Product Representations for In-Session Type-Ahead Personalization
Authors:
Bingqing Yu,
Jacopo Tagliabue,
Ciro Greco,
Federico Bianchi
Abstract:
We address the problem of personalizing query completion in a digital commerce setting, in which the bounce rate is typically high and recurring users are rare. We focus on in-session personalization and improve a standard noisy channel model by injecting dense vectors computed from product images at query time. We argue that image-based personalization displays several advantages over alternative…
▽ More
We address the problem of personalizing query completion in a digital commerce setting, in which the bounce rate is typically high and recurring users are rare. We focus on in-session personalization and improve a standard noisy channel model by injecting dense vectors computed from product images at query time. We argue that image-based personalization displays several advantages over alternative proposals (from data availability to business scalability), and provide quantitative evidence and qualitative support on the effectiveness of the proposed methods. Finally, we show how a shared vector space between similar shops can be used to improve the experience of users browsing across sites, opening up the possibility of applying zero-shot unsupervised personalization to increase conversions. This will prove to be particularly relevant to retail groups that manage multiple brands and/or websites and to multi-tenant SaaS providers that serve multiple clients in the same space.
△ Less
Submitted 11 March, 2020;
originally announced March 2020.
-
Transformation Importance with Applications to Cosmology
Authors:
Chandan Singh,
Wooseok Ha,
Francois Lanusse,
Vanessa Boehm,
Jia Liu,
Bin Yu
Abstract:
Machine learning lies at the heart of new possibilities for scientific discovery, knowledge generation, and artificial intelligence. Its potential benefits to these fields requires going beyond predictive accuracy and focusing on interpretability. In particular, many scientific problems require interpretations in a domain-specific interpretable feature space (e.g. the frequency domain) whereas att…
▽ More
Machine learning lies at the heart of new possibilities for scientific discovery, knowledge generation, and artificial intelligence. Its potential benefits to these fields requires going beyond predictive accuracy and focusing on interpretability. In particular, many scientific problems require interpretations in a domain-specific interpretable feature space (e.g. the frequency domain) whereas attributions to the raw features (e.g. the pixel space) may be unintelligible or even misleading. To address this challenge, we propose TRIM (TRansformation IMportance), a novel approach which attributes importances to features in a transformed space and can be applied post-hoc to a fully trained model. TRIM is motivated by a cosmological parameter estimation problem using deep neural networks (DNNs) on simulated data, but it is generally applicable across domains/models and can be combined with any local interpretation method. In our cosmology example, combining TRIM with contextual decomposition shows promising results for identifying which frequencies a DNN uses, helping cosmologists to understand and validate that the model learns appropriate physical features rather than simulation artifacts.
△ Less
Submitted 14 June, 2021; v1 submitted 4 March, 2020;
originally announced March 2020.
-
VLSI Mask Optimization: From Shallow To Deep Learning
Authors:
Haoyu Yang,
Wei Zhong,
Yuzhe Ma,
Hao Geng,
Ran Chen,
Wanli Chen,
Bei Yu
Abstract:
VLSI mask optimization is one of the most critical stages in manufacturability aware design, which is costly due to the complicated mask optimization and lithography simulation. Recent researches have shown prominent advantages of machine learning techniques dealing with complicated and big data problems, which bring potential of dedicated machine learning solution for DFM problems and facilitate…
▽ More
VLSI mask optimization is one of the most critical stages in manufacturability aware design, which is costly due to the complicated mask optimization and lithography simulation. Recent researches have shown prominent advantages of machine learning techniques dealing with complicated and big data problems, which bring potential of dedicated machine learning solution for DFM problems and facilitate the VLSI design cycle. In this paper, we focus on a heterogeneous OPC framework that assists mask layout optimization. Preliminary results show the efficiency and effectiveness of proposed frameworks that have the potential to be alternatives to existing EDA solutions.
△ Less
Submitted 16 December, 2019;
originally announced December 2019.
-
Automatic Layout Generation with Applications in Machine Learning Engine Evaluation
Authors:
Haoyu Yang,
Wen Chen,
Piyush Pathak,
Frank Gennari,
Ya-Chieh Lai,
Bei Yu
Abstract:
Machine learning-based lithography hotspot detection has been deeply studied recently, from varies feature extraction techniques to efficient learning models. It has been observed that such machine learning-based frameworks are providing satisfactory metal layer hotspot prediction results on known public metal layer benchmarks. In this work, we seek to evaluate how these machine learning-based hot…
▽ More
Machine learning-based lithography hotspot detection has been deeply studied recently, from varies feature extraction techniques to efficient learning models. It has been observed that such machine learning-based frameworks are providing satisfactory metal layer hotspot prediction results on known public metal layer benchmarks. In this work, we seek to evaluate how these machine learning-based hotspot detectors generalize to complicated patterns. We first introduce a automatic layout generation tool that can synthesize varies layout patterns given a set of design rules. The tool currently supports both metal layer and via layer generation. As a case study, we conduct hotspot detection on the generated via layer layouts with representative machine learning-based hotspot detectors, which shows that continuous study on model robustness and generality is necessary to prototype and integrate the learning engines in DFM flows. The source code of the layout generation tool will be available at https://github. com/phdyang007/layout-generation.
△ Less
Submitted 12 December, 2019;
originally announced December 2019.
-
Patch-level Neighborhood Interpolation: A General and Effective Graph-based Regularization Strategy
Authors:
Ke Sun,
Bing Yu,
Zhouchen Lin,
Zhanxing Zhu
Abstract:
Regularization plays a crucial role in machine learning models, especially for deep neural networks. The existing regularization techniques mainly rely on the i.i.d. assumption and only consider the knowledge from the current sample, without the leverage of the neighboring relationship between samples. In this work, we propose a general regularizer called \textbf{Patch-level Neighborhood Interpola…
▽ More
Regularization plays a crucial role in machine learning models, especially for deep neural networks. The existing regularization techniques mainly rely on the i.i.d. assumption and only consider the knowledge from the current sample, without the leverage of the neighboring relationship between samples. In this work, we propose a general regularizer called \textbf{Patch-level Neighborhood Interpolation~(Pani)} that conducts a non-local representation in the computation of networks. Our proposal explicitly constructs patch-level graphs in different layers and then linearly interpolates neighborhood patch features, serving as a general and effective regularization strategy. Further, we customize our approach into two kinds of popular regularization methods, namely Virtual Adversarial Training (VAT) and MixUp as well as its variants. The first derived \textbf{Pani VAT} presents a novel way to construct non-local adversarial smoothness by employing patch-level interpolated perturbations. The second derived \textbf{Pani MixUp} method extends the MixUp, and achieves superiority over MixUp and competitive performance over state-of-the-art variants of MixUp method with a significant advantage in computational efficiency. Extensive experiments have verified the effectiveness of our Pani approach in both supervised and semi-supervised settings.
△ Less
Submitted 22 October, 2023; v1 submitted 21 November, 2019;
originally announced November 2019.
-
MLPerf Inference Benchmark
Authors:
Vijay Janapa Reddi,
Christine Cheng,
David Kanter,
Peter Mattson,
Guenther Schmuelling,
Carole-Jean Wu,
Brian Anderson,
Maximilien Breughe,
Mark Charlebois,
William Chou,
Ramesh Chukka,
Cody Coleman,
Sam Davis,
Pan Deng,
Greg Diamos,
Jared Duke,
Dave Fick,
J. Scott Gardner,
Itay Hubara,
Sachin Idgunji,
Thomas B. Jablin,
Jeff Jiao,
Tom St. John,
Pankaj Kanwar,
David Lee
, et al. (22 additional authors not shown)
Abstract:
Machine-learning (ML) hardware and software system demand is burgeoning. Driven by ML applications, the number of different ML inference systems has exploded. Over 100 organizations are building ML inference chips, and the systems that incorporate existing models span at least three orders of magnitude in power consumption and five orders of magnitude in performance; they range from embedded devic…
▽ More
Machine-learning (ML) hardware and software system demand is burgeoning. Driven by ML applications, the number of different ML inference systems has exploded. Over 100 organizations are building ML inference chips, and the systems that incorporate existing models span at least three orders of magnitude in power consumption and five orders of magnitude in performance; they range from embedded devices to data-center solutions. Fueling the hardware are a dozen or more software frameworks and libraries. The myriad combinations of ML hardware and ML software make assessing ML-system performance in an architecture-neutral, representative, and reproducible manner challenging. There is a clear need for industry-wide standard ML benchmarking and evaluation criteria. MLPerf Inference answers that call. In this paper, we present our benchmarking method for evaluating ML inference systems. Driven by more than 30 organizations as well as more than 200 ML engineers and practitioners, MLPerf prescribes a set of rules and best practices to ensure comparability across systems with wildly differing architectures. The first call for submissions garnered more than 600 reproducible inference-performance measurements from 14 organizations, representing over 30 systems that showcase a wide range of capabilities. The submissions attest to the benchmark's flexibility and adaptability.
△ Less
Submitted 9 May, 2020; v1 submitted 6 November, 2019;
originally announced November 2019.
-
Interpretations are useful: penalizing explanations to align neural networks with prior knowledge
Authors:
Laura Rieger,
Chandan Singh,
W. James Murdoch,
Bin Yu
Abstract:
For an explanation of a deep learning model to be effective, it must provide both insight into a model and suggest a corresponding action in order to achieve some objective. Too often, the litany of proposed explainable deep learning methods stop at the first step, providing practitioners with insight into a model, but no way to act on it. In this paper, we propose contextual decomposition explana…
▽ More
For an explanation of a deep learning model to be effective, it must provide both insight into a model and suggest a corresponding action in order to achieve some objective. Too often, the litany of proposed explainable deep learning methods stop at the first step, providing practitioners with insight into a model, but no way to act on it. In this paper, we propose contextual decomposition explanation penalization (CDEP), a method which enables practitioners to leverage existing explanation methods in order to increase the predictive accuracy of deep learning models. In particular, when shown that a model has incorrectly assigned importance to some features, CDEP enables practitioners to correct these errors by directly regularizing the provided explanations. Using explanations provided by contextual decomposition (CD) (Murdoch et al., 2018), we demonstrate the ability of our method to increase performance on an array of toy and real datasets.
△ Less
Submitted 8 October, 2020; v1 submitted 30 September, 2019;
originally announced September 2019.