-
GUD: Generation with Unified Diffusion
Authors:
Mathis Gerdes,
Max Welling,
Miranda C. N. Cheng
Abstract:
Diffusion generative models transform noise into data by inverting a process that progressively adds noise to data samples. Inspired by concepts from the renormalization group in physics, which analyzes systems across different scales, we revisit diffusion models by exploring three key design aspects: 1) the choice of representation in which the diffusion process operates (e.g. pixel-, PCA-, Fourier-, or wavelet-basis), 2) the prior distribution that data is transformed into during diffusion (e.g. Gaussian with covariance $Σ$), and 3) the scheduling of noise levels applied separately to different parts of the data, captured by a component-wise noise schedule. Incorporating the flexibility in these choices, we develop a unified framework for diffusion generative models with greatly enhanced design freedom. In particular, we introduce soft-conditioning models that smoothly interpolate between standard diffusion models and autoregressive models (in any basis), conceptually bridging these two approaches. Our framework opens up a wide design space which may lead to more efficient training and data generation, and paves the way to novel architectures integrating different generative approaches and generation tasks.
Submitted 3 October, 2024;
originally announced October 2024.
-
One Prompt is not Enough: Automated Construction of a Mixture-of-Expert Prompts
Authors:
Ruochen Wang,
Sohyun An,
Minhao Cheng,
Tianyi Zhou,
Sung Ju Hwang,
Cho-Jui Hsieh
Abstract:
Large Language Models (LLMs) exhibit strong generalization capabilities to novel tasks when prompted with language instructions and in-context demos. Since this ability sensitively depends on the quality of prompts, various methods have been explored to automate the instruction design. While these methods demonstrated promising results, they also restricted the searched prompt to one instruction. Such simplification significantly limits their capacity, as a single demo-free instruction might not be able to cover the entire complex problem space of the targeted task. To alleviate this issue, we adopt the Mixture-of-Expert paradigm and divide the problem space into a set of sub-regions; each sub-region is governed by a specialized expert, equipped with both an instruction and a set of demos. A two-phase process is developed to construct the specialized expert for each region: (1) demo assignment: inspired by the theoretical connection between in-context learning and kernel regression, we group demos into experts based on their semantic similarity; (2) instruction assignment: a region-based joint search of an instruction per expert complements the demos assigned to it, yielding a synergistic effect. The resulting method, codenamed Mixture-of-Prompts (MoP), achieves an average win rate of 81% against prior art across several major benchmarks.
Submitted 28 June, 2024;
originally announced July 2024.
-
Regularized Optimal Transport Layers for Generalized Global Pooling Operations
Authors:
Hongteng Xu,
Minjie Cheng
Abstract:
Global pooling is one of the most significant operations in many machine learning models and tasks, serving information fusion and the representation of structured data (such as sets and graphs). However, without solid mathematical foundations, its practical implementations often depend on empirical mechanisms and thus lead to sub-optimal, even unsatisfactory performance. In this work, we develop a novel and generalized global pooling framework through the lens of optimal transport. The proposed framework is interpretable from the perspective of expectation-maximization. Essentially, it aims at learning an optimal transport across sample indices and feature dimensions, making the corresponding pooling operation maximize the conditional expectation of input data. We demonstrate that most existing pooling methods are equivalent to solving a regularized optimal transport (ROT) problem with different specializations, and more sophisticated pooling operations can be implemented by hierarchically solving multiple ROT problems. Making the parameters of the ROT problem learnable, we develop a family of regularized optimal transport pooling (ROTP) layers. We implement the ROTP layers as a new kind of deep implicit layer. Their model architectures correspond to different optimization algorithms. We test our ROTP layers in several representative set-level machine learning scenarios, including multi-instance learning (MIL), graph classification, graph set representation, and image classification. Experimental results show that applying our ROTP layers can reduce the difficulty of the design and selection of global pooling: our ROTP layers may either imitate some existing global pooling methods or lead to some new pooling layers that fit the data better. The code is available at https://github.com/SDS-Lab/ROT-Pooling.
Submitted 12 December, 2022;
originally announced December 2022.
-
Efficient Non-Parametric Optimizer Search for Diverse Tasks
Authors:
Ruochen Wang,
Yuanhao Xiong,
Minhao Cheng,
Cho-Jui Hsieh
Abstract:
Efficient and automated design of optimizers plays a crucial role in full-stack AutoML systems. However, prior methods in optimizer search are often limited by their scalability, generalizability, or sample efficiency. With the goal of democratizing research and application of optimizer search, we present the first efficient, scalable and generalizable framework that can directly search on the tasks of interest. We first observe that optimizer updates are fundamentally mathematical expressions applied to the gradient. Inspired by the innate tree structure of the underlying math expressions, we re-arrange the space of optimizers into a super-tree, where each path encodes an optimizer. This way, optimizer search can be naturally formulated as a path-finding problem, allowing a variety of well-established tree traversal methods to be used as the search algorithm. We adopt an adaptation of the Monte Carlo method to tree search, equipped with rejection sampling and equivalent-form detection that leverage the characteristics of optimizer update rules to further boost the sample efficiency. We provide a diverse set of tasks to benchmark our algorithm and demonstrate that, with only 128 evaluations, the proposed framework can discover optimizers that surpass both human-designed counterparts and prior optimizer search methods.
Submitted 27 September, 2022;
originally announced September 2022.
-
Leachable Component Clustering
Authors:
Miao Cheng,
Xinge You
Abstract:
Clustering attempts to partition data instances into several distinct groups such that the similarities among data belonging to the same partition are largely preserved. Furthermore, incomplete data frequently occurs in many real-world applications and has an adverse influence on pattern analysis. As a consequence, specific solutions for data imputation and handling have been developed to deal with missing values, with knowledge exploitation carried out as an independent stage for information understanding. In this work, a novel approach to clustering of incomplete data, termed leachable component clustering, is proposed. Unlike existing methods, the proposed method handles data imputation with Bayes alignment, and collects the lost patterns in theory. Owing to the simple numeric computation of its equations, the proposed method can learn optimized partitions while maintaining computational efficiency. Experiments on several artificial incomplete data sets demonstrate that the proposed method achieves superior performance compared with other state-of-the-art algorithms.
Submitted 28 August, 2022;
originally announced August 2022.
-
Learning Disentangled Representations for Counterfactual Regression via Mutual Information Minimization
Authors:
Mingyuan Cheng,
Xinru Liao,
Quan Liu,
Bin Ma,
Jian Xu,
Bo Zheng
Abstract:
Learning individual-level treatment effect is a fundamental problem in causal inference and has received increasing attention in many areas, especially in the user growth area which concerns many internet companies. Recently, disentangled representation learning methods that decompose covariates into three latent factors, including instrumental, confounding and adjustment factors, have witnessed great success in treatment effect estimation. However, it remains an open problem how to learn the underlying disentangled factors precisely. Specifically, previous methods fail to obtain independent disentangled factors, which is a necessary condition for identifying treatment effect. In this paper, we propose Disentangled Representations for Counterfactual Regression via Mutual Information Minimization (MIM-DRCFR), which uses a multi-task learning framework to share information when learning the latent factors and incorporates MI minimization learning criteria to ensure the independence of these factors. Extensive experiments including public benchmarks and real-world industrial user growth datasets demonstrate that our method performs much better than state-of-the-art methods.
Submitted 2 June, 2022;
originally announced June 2022.
-
Entangled q-Convolutional Neural Nets
Authors:
Vassilis Anagiannis,
Miranda C. N. Cheng
Abstract:
We introduce a machine learning model, the q-CNN model, sharing key features with convolutional neural networks and admitting a tensor network description. As examples, we apply q-CNN to the MNIST and Fashion-MNIST classification tasks. We explain how the network associates a quantum state to each classification label, and study the entanglement structure of these network states. In both our experiments on the MNIST and Fashion-MNIST datasets, we observe a distinct increase in both the left/right as well as the up/down bipartition entanglement entropy during training as the network learns the fine features of the data. More generally, we observe a universal negative correlation between the value of the entanglement entropy and the value of the cost function, suggesting that the network needs to learn the entanglement structure in order to perform the task accurately. This supports the possibility of exploiting the entanglement structure as a guide to design the machine learning algorithm suitable for given tasks.
Submitted 5 March, 2021;
originally announced March 2021.
-
Voting based ensemble improves robustness of defensive models
Authors:
Devvrit,
Minhao Cheng,
Cho-Jui Hsieh,
Inderjit Dhillon
Abstract:
Developing robust models against adversarial perturbations has been an active area of research and many algorithms have been proposed to train individual robust models. Taking these pretrained robust models, we aim to study whether it is possible to create an ensemble to further improve robustness. Several previous attempts tackled this problem by ensembling the soft-label prediction and have been proven vulnerable to the latest attack methods. In this paper, we show that if the robust training loss is diverse enough, a simple hard-label based voting ensemble can boost the robust error over each individual model. Furthermore, given a pool of robust models, we develop a principled way to select which models to ensemble. Finally, to verify the improved robustness, we conduct extensive experiments to study how to attack a voting-based ensemble and develop several new white-box attacks. On the CIFAR-10 dataset, by ensembling several state-of-the-art pre-trained defense models, our method can achieve a 59.8% robust accuracy, outperforming all the existing defensive models without using additional data.
Submitted 27 November, 2020;
originally announced November 2020.
-
Semi-Supervised Learning with Meta-Gradient
Authors:
Xin-Yu Zhang,
Taihong Xiao,
Haolin Jia,
Ming-Ming Cheng,
Ming-Hsuan Yang
Abstract:
In this work, we propose a simple yet effective meta-learning algorithm in semi-supervised learning. We notice that most existing consistency-based approaches suffer from overfitting and limited model generalization ability, especially when training with only a small number of labeled data. To alleviate this issue, we propose a learn-to-generalize regularization term by utilizing the label information and optimize the problem in a meta-learning fashion. Specifically, we seek the pseudo labels of the unlabeled data so that the model can generalize well on the labeled data, which is formulated as a nested optimization problem. We address this problem using the meta-gradient that bridges between the pseudo label and the regularization term. In addition, we introduce a simple first-order approximation to avoid computing higher-order derivatives and provide theoretic convergence analysis. Extensive evaluations on the SVHN, CIFAR, and ImageNet datasets demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
Submitted 17 March, 2021; v1 submitted 8 July, 2020;
originally announced July 2020.
-
DrNAS: Dirichlet Neural Architecture Search
Authors:
Xiangning Chen,
Ruochen Wang,
Minhao Cheng,
Xiaocheng Tang,
Cho-Jui Hsieh
Abstract:
This paper proposes a novel differentiable architecture search method by formulating it as a distribution learning problem. We treat the continuously relaxed architecture mixing weights as random variables, modeled by a Dirichlet distribution. With recently developed pathwise derivatives, the Dirichlet parameters can be easily optimized with a gradient-based optimizer in an end-to-end manner. This formulation improves the generalization ability and induces stochasticity that naturally encourages exploration in the search space. Furthermore, to alleviate the large memory consumption of differentiable NAS, we propose a simple yet effective progressive learning scheme that enables searching directly on large-scale tasks, eliminating the gap between search and evaluation phases. Extensive experiments demonstrate the effectiveness of our method. Specifically, we obtain a test error of 2.46% for CIFAR-10 and 23.7% for ImageNet under the mobile setting. On NAS-Bench-201, we also achieve state-of-the-art results on all three datasets and provide insights for the effective design of neural architecture search algorithms.
Submitted 15 March, 2021; v1 submitted 18 June, 2020;
originally announced June 2020.
-
CAT: Customized Adversarial Training for Improved Robustness
Authors:
Minhao Cheng,
Qi Lei,
Pin-Yu Chen,
Inderjit Dhillon,
Cho-Jui Hsieh
Abstract:
Adversarial training has become one of the most effective methods for improving robustness of neural networks. However, it often suffers from poor generalization on both clean and perturbed data. In this paper, we propose a new algorithm, named Customized Adversarial Training (CAT), which adaptively customizes the perturbation level and the corresponding label for each training sample in adversarial training. We show that the proposed algorithm achieves better clean and robust accuracy than previous adversarial training methods through extensive experiments.
Submitted 17 February, 2020;
originally announced February 2020.
-
Enhancing Certifiable Robustness via a Deep Model Ensemble
Authors:
Huan Zhang,
Minhao Cheng,
Cho-Jui Hsieh
Abstract:
We propose an algorithm to enhance certified robustness of a deep model ensemble by optimally weighting each base model. Unlike previous works on using ensembles to empirically improve robustness, our algorithm is based on optimizing a guaranteed robustness certificate of neural networks. Our proposed ensemble framework with certified robustness, RobBoost, formulates the optimal model selection and weighting task as an optimization problem on a lower bound of classification margin, which can be efficiently solved using coordinate descent. Experiments show that our algorithm can form a more robust ensemble than naively averaging all available models using robustly trained MNIST or CIFAR base models. Additionally, our ensemble typically has better accuracy on clean (unperturbed) data. RobBoost allows us to further improve certified robustness and clean accuracy by creating an ensemble of already certified models.
Submitted 31 October, 2019;
originally announced October 2019.
-
Sign-OPT: A Query-Efficient Hard-label Adversarial Attack
Authors:
Minhao Cheng,
Simranjit Singh,
Patrick Chen,
Pin-Yu Chen,
Sijia Liu,
Cho-Jui Hsieh
Abstract:
We study the most practical problem setup for evaluating adversarial robustness of a machine learning system with limited access: the hard-label black-box attack setting for generating adversarial examples, where limited model queries are allowed and only the decision is provided for a queried data input. Several algorithms have been proposed for this problem but they typically require a huge number (>20,000) of queries for attacking one example. Among them, one of the state-of-the-art approaches (Cheng et al., 2019) showed that a hard-label attack can be modeled as an optimization problem where the objective function can be evaluated by binary search with additional model queries, so that a zeroth-order optimization algorithm can be applied. In this paper, we adopt the same optimization formulation but propose to directly estimate the sign of the gradient in any direction instead of the gradient itself, which requires only a single query. Using this single-query oracle for retrieving the sign of the directional derivative, we develop a novel query-efficient Sign-OPT approach for hard-label black-box attack. We provide a convergence analysis of the new algorithm and conduct experiments on several models on MNIST, CIFAR-10 and ImageNet. We find that the Sign-OPT attack consistently requires 5X to 10X fewer queries when compared to the current state-of-the-art approaches, and usually converges to an adversarial example with smaller perturbation.
Submitted 13 February, 2020; v1 submitted 24 September, 2019;
originally announced September 2019.
-
Understanding Adversarial Behavior of DNNs by Disentangling Non-Robust and Robust Components in Performance Metric
Authors:
Yujun Shi,
Benben Liao,
Guangyong Chen,
Yun Liu,
Ming-Ming Cheng,
Jiashi Feng
Abstract:
The vulnerability to slight input perturbations is a worrying yet intriguing property of deep neural networks (DNNs). Despite many previous works studying the reason behind such adversarial behavior, the relationship between the generalization performance and adversarial behavior of DNNs is still little understood. In this work, we reveal such relation by introducing a metric characterizing the generalization performance of a DNN. The metric can be disentangled into an information-theoretic non-robust component, responsible for adversarial behavior, and a robust component. Then, we show by experiments that current DNNs rely heavily on optimizing the non-robust component in achieving decent performance. We also demonstrate that current state-of-the-art adversarial training algorithms indeed try to robustify the DNNs by preventing them from using the non-robust component to distinguish samples from different categories. Also, based on our findings, we take a step forward and point out the possible direction for achieving decent standard performance and adversarial robustness simultaneously. We believe that our theory could further inspire the community to make more interesting discoveries about the relationship between standard generalization and adversarial generalization of deep learning models.
Submitted 6 June, 2019;
originally announced June 2019.
-
Covariance in Physics and Convolutional Neural Networks
Authors:
Miranda C. N. Cheng,
Vassilis Anagiannis,
Maurice Weiler,
Pim de Haan,
Taco S. Cohen,
Max Welling
Abstract:
In this proceeding we give an overview of the idea of covariance (or equivariance) featured in the recent development of convolutional neural networks (CNNs). We study the similarities and differences between the use of covariance in theoretical physics and in the CNN context. Additionally, we demonstrate that the simple assumption of covariance, together with the required properties of locality, linearity and weight sharing, is sufficient to uniquely determine the form of the convolution.
Submitted 6 June, 2019;
originally announced June 2019.
-
Time-Series Analysis via Low-Rank Matrix Factorization Applied to Infant-Sleep Data
Authors:
Sheng Liu,
Mark Cheng,
Hayley Brooks,
Wayne Mackey,
David J. Heeger,
Esteban G. Tabak,
Carlos Fernandez-Granda
Abstract:
We propose a nonparametric model for time series with missing data based on low-rank matrix factorization. The model expresses each instance in a set of time series as a linear combination of a small number of shared basis functions. Constraining the functions and the corresponding coefficients to be nonnegative yields an interpretable low-dimensional representation of the data. A time-smoothing regularization term ensures that the model captures meaningful trends in the data, instead of overfitting short-term fluctuations. The low-dimensional representation makes it possible to detect outliers and cluster the time series according to the interpretable features extracted by the model, and also to perform forecasting via kernel regression. We apply our methodology to a large real-world dataset of infant-sleep data gathered by caregivers with a mobile-phone app. Our analysis automatically extracts daily-sleep patterns consistent with the existing literature. This allows us to compute sleep-development trends for the cohort, which characterize the emergence of circadian sleep and different napping habits. We apply our methodology to detect anomalous individuals, to cluster the cohort into groups with different sleeping tendencies, and to obtain improved predictions of future sleep behavior.
Submitted 6 November, 2019; v1 submitted 9 April, 2019;
originally announced April 2019.
-
A Family of Maximum Margin Criterion for Adaptive Learning
Authors:
Miao Cheng,
Zunren Liu,
Hongwei Zou,
Ah Chung Tsoi
Abstract:
In recent years, pattern analysis has played an important role in data mining and recognition, and many variants have been proposed to handle complicated scenarios. The literature is well acquainted with high-dimensional data samples, and both high dimensionality and large data volumes have become commonplace in real-world applications. In this work, an improved maximum margin criterion (MMC) method is first introduced. With the new definition of MMC, several variants of MMC, including random MMC, layered MMC, and 2D^2 MMC, are designed to make adaptive learning applicable. In particular, the MMC network is developed to learn deep features of images in light of simple deep networks. Experimental results on a variety of data sets demonstrate that the discriminant ability of the proposed MMC methods is competent for complicated application scenarios.
Submitted 7 November, 2018; v1 submitted 9 October, 2018;
originally announced October 2018.
-
Query-Efficient Hard-label Black-box Attack: An Optimization-based Approach
Authors:
Minhao Cheng,
Thong Le,
Pin-Yu Chen,
Jinfeng Yi,
Huan Zhang,
Cho-Jui Hsieh
Abstract:
We study the problem of attacking a machine learning model in the hard-label black-box setting, where no model information is revealed except that the attacker can make queries to probe the corresponding hard-label decisions. This is a very challenging problem since the direct extension of state-of-the-art white-box attacks (e.g., CW or PGD) to the hard-label black-box setting will require minimizing a non-continuous step function, which is combinatorial and cannot be solved by a gradient-based optimizer. The only current approach is based on random walk on the boundary, which requires lots of queries and lacks convergence guarantees. We propose a novel way to formulate the hard-label black-box attack as a real-valued optimization problem which is usually continuous and can be solved by any zeroth order optimization algorithm. For example, using the Randomized Gradient-Free method, we are able to bound the number of iterations needed for our algorithm to achieve stationary points. We demonstrate that our proposed method outperforms the previous random walk approach to attacking convolutional neural networks on MNIST, CIFAR, and ImageNet datasets. More interestingly, we show that the proposed algorithm can also be used to attack other discrete and non-continuous machine learning models, such as Gradient Boosting Decision Trees (GBDT).
Submitted 12 July, 2018;
originally announced July 2018.
-
Stochastic Zeroth-order Optimization via Variance Reduction method
Authors:
Liu Liu,
Minhao Cheng,
Cho-Jui Hsieh,
Dacheng Tao
Abstract:
Derivative-free optimization has become an important technique used in machine learning for optimizing black-box models. To conduct updates without explicitly computing the gradient, most current approaches iteratively sample a random search direction from a Gaussian distribution and compute the estimated gradient along that direction. However, due to the variance in the search direction, the convergence rates and query complexities of existing methods suffer from a factor of $d$, where $d$ is the problem dimension. In this paper, we introduce a novel Stochastic Zeroth-order method with Variance Reduction under Gaussian smoothing (SZVR-G) and establish the complexity for optimizing non-convex problems. With variance reduction on both sample space and search space, the complexity of our algorithm is sublinear in $d$ and is strictly better than current approaches, in both smooth and non-smooth cases. Moreover, we extend the proposed method to the mini-batch version. Our experimental results demonstrate the superior performance of the proposed method over existing derivative-free optimization techniques. Furthermore, we successfully apply our method to conduct a universal black-box attack on deep neural networks and present some interesting results.
Submitted 2 August, 2018; v1 submitted 30 May, 2018;
originally announced May 2018.
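The baseline that the abstract above attributes to "most current approaches" (sample a Gaussian search direction, then estimate the gradient along it from function values only) can be sketched as follows. This is a minimal illustration of plain Gaussian-direction gradient estimation, not the SZVR-G method itself; the function names, the smoothing radius `mu`, and the test function `sphere` are assumptions chosen for the example.

```python
import random

def zo_gradient(f, x, mu=1e-3, num_dirs=1000, rng=None):
    """Estimate the gradient of f at x using function values only.

    Each sample draws a Gaussian direction u and uses the forward
    difference (f(x + mu*u) - f(x)) / mu as the slope along u.
    The estimate is the average of slope * u over all directions.
    """
    rng = rng or random.Random()
    d = len(x)
    fx = f(x)
    grad = [0.0] * d
    for _ in range(num_dirs):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        x_shift = [xi + mu * ui for xi, ui in zip(x, u)]
        slope = (f(x_shift) - fx) / mu
        for i in range(d):
            grad[i] += slope * u[i] / num_dirs
    return grad

def sphere(x):
    """Hypothetical black-box objective; its true gradient is 2*x."""
    return sum(xi * xi for xi in x)

# Fixed seed so the run is reproducible.
estimate = zo_gradient(sphere, [1.0, 2.0, 3.0], num_dirs=20000,
                       rng=random.Random(0))
```

The estimate only approaches the true gradient after averaging many directions, and the number of directions needed grows with the dimension, which is the variance issue the abstract says variance reduction is meant to address.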
-
Towards Robust Neural Networks via Random Self-ensemble
Authors:
Xuanqing Liu,
Minhao Cheng,
Huan Zhang,
Cho-Jui Hsieh
Abstract:
Recent studies have revealed the vulnerability of deep neural networks: A small adversarial perturbation that is imperceptible to humans can easily make a well-trained deep neural network misclassify. This makes it unsafe to apply neural networks in security-critical applications. In this paper, we propose a new defense algorithm called Random Self-Ensemble (RSE) by combining two important concepts: randomness and ensemble. To protect a targeted model, RSE adds random noise layers to the neural network to prevent the strong gradient-based attacks, and ensembles the prediction over random noises to stabilize the performance. We show that our algorithm is equivalent to ensembling an infinite number of noisy models $f_ε$ without any additional memory overhead, and the proposed training procedure based on noisy stochastic gradient descent can ensure the ensemble model has a good predictive capability. Our algorithm significantly outperforms previous defense techniques on real data sets. For instance, on CIFAR-10 with VGG network (which has 92% accuracy without any attack), under the strong C&W attack within a certain distortion tolerance, the accuracy of the unprotected model drops to less than 10%, the best previous defense technique has 48% accuracy, while our method still has 86% prediction accuracy under the same level of attack. Finally, our method is simple and easy to integrate into any neural network.
Submitted 31 July, 2018; v1 submitted 2 December, 2017;
originally announced December 2017.
-
Greedy Forward Regression for Variable Screening
Authors:
Ming-Yen Cheng,
Sanying Feng,
Gaorong Li,
Heng Lian
Abstract:
Two popular variable screening methods under the ultra-high dimensional setting with the desirable sure screening property are the sure independence screening (SIS) and the forward regression (FR). Both are classical variable screening methods and recently have attracted greater attention under the new light of high-dimensional data analysis. We consider a new and simple screening method that incorporates multiple predictors in each step of forward regression, with decision on which variables to incorporate based on the same criterion. If only one step is carried out, it actually reduces to the SIS. Thus it can be regarded as a generalization and unification of the FR and the SIS. More importantly, it preserves the sure screening property and has similar computational complexity as FR in each step, yet it can discover the relevant covariates in fewer steps. Thus, it reduces the computational burden of FR drastically while retaining advantages of the latter over SIS. Furthermore, we show that it can find all the true variables if the number of steps taken is the same as the correct model size, even when using the original FR. An extensive simulation study and application to two real data examples demonstrate excellent performance of the proposed method.
Submitted 3 November, 2015;
originally announced November 2015.
-
Efficient estimation in semivarying coefficient models for longitudinal/clustered data
Authors:
Ming-Yen Cheng,
Toshio Honda,
Jialiang Li
Abstract:
In semivarying coefficient models for longitudinal/clustered data, the parametric component, which involves unknown constant coefficients, is usually of primary interest. First, we study the semiparametric efficiency bound for estimation of the constant coefficients in a general setup. It can be achieved by spline regression provided that the within-cluster covariance matrices are all known, which is an unrealistic assumption. Thus, we propose an adaptive estimator of the constant coefficients when the covariance matrices are unknown and depend only on the index random variable, such as time, and when the link function is the identity function. After preliminary estimation, based on working independence and both spline and local linear regression, we estimate the covariance matrices by applying local linear regression to the resulting residuals. Then we employ the covariance matrix estimates and spline regression to obtain our final estimators of the constant coefficients. The proposed estimator achieves the semiparametric efficiency bound under the normality assumption, and it has the smallest covariance matrix among a class of estimators even when normality is violated. We also present results of numerical studies. The simulation results demonstrate that our estimator is superior to the one based on working independence. When applied to the CD4 count data, our method identifies an interesting structure that was not found by previous analyses.
Submitted 13 September, 2015; v1 submitted 3 January, 2015;
originally announced January 2015.
-
Forward variable selection for sparse ultra-high dimensional varying coefficient models
Authors:
Ming-Yen Cheng,
Toshio Honda,
Jin-Ting Zhang
Abstract:
Varying coefficient models have numerous applications in a wide scope of scientific areas. While enjoying nice interpretability, they also allow flexibility in modeling dynamic impacts of the covariates. But, in the new era of big data, it is challenging to select the relevant variables when there are a large number of candidates. Recently, several works have focused on this important problem based on sparsity assumptions; they are subject to some limitations, however. We introduce an appealing forward variable selection procedure. It selects important variables sequentially according to a sum of squares criterion, and it employs an EBIC- or BIC-based stopping rule. Clearly it is simple to implement and fast to compute, and it possesses many other desirable properties from both theoretical and numerical viewpoints. We establish rigorous selection consistency results when either EBIC or BIC is used as the stopping criterion, under some mild regularity conditions. Notably, unlike existing methods, an extra screening step is not required to ensure selection consistency. Even if the regularity conditions fail to hold, our procedure is still useful as an effective screening procedure in a less restrictive setup. We carried out simulation and empirical studies to show the efficacy and usefulness of our procedure.
Submitted 23 October, 2014;
originally announced October 2014.
-
A New Test for One-Way ANOVA with Functional Data and Application to Ischemic Heart Screening
Authors:
Jin-Ting Zhang,
Ming-Yen Cheng,
Chi-Jen Tseng,
Hau-Tieng Wu
Abstract:
We propose and study a new global test, namely the $F_{\max}$-test, for the one-way ANOVA problem in functional data analysis. The test statistic is taken as the maximum value of the usual pointwise $F$-test statistics over the interval on which the functional responses are observed. A nonparametric bootstrap method is employed to approximate the null distribution of the test statistic and to obtain an estimated critical value for the test. The asymptotic random expression of the test statistic is derived and the asymptotic power is studied. In particular, under mild conditions, the $F_{\max}$-test asymptotically has the correct level and is root-$n$ consistent in detecting local alternatives. Via some simulation studies, it is found that in terms of both level accuracy and power, the $F_{\max}$-test outperforms the Globalized Pointwise F (GPF) test of Zhang and Liang (2013) when the functional data are highly or moderately correlated, and its performance is comparable with the latter otherwise. An application to a real ischemic heart dataset suggests that, after proper manipulation, resting electrocardiogram (ECG) signals can be used as an effective tool in clinical ischemic heart screening, without the need for further stress tests as in the current standard procedure.
Submitted 27 September, 2013;
originally announced September 2013.
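The construction described in the abstract above (take the maximum of the pointwise $F$ statistics over the observation grid, then calibrate it with a nonparametric bootstrap) can be sketched on discretized curves as follows. The abstract does not spell out the resampling scheme, so the one used here (pooling group-centered curves and resampling them with replacement) is an assumption, as are all function names and the toy data.

```python
import random

def pointwise_f(groups, j):
    """One-way ANOVA F statistic at grid index j.

    groups: list of groups; each group is a list of curves,
    and each curve is a list of values on a common grid.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    values = [[curve[j] for curve in g] for g in groups]
    grand = sum(sum(v) for v in values) / n
    means = [sum(v) / len(v) for v in values]
    ssr = sum(len(v) * (m - grand) ** 2 for v, m in zip(values, means))
    sse = sum((y - m) ** 2 for v, m in zip(values, means) for y in v)
    if sse == 0.0:  # degenerate case: no within-group variation
        return float("inf") if ssr > 0.0 else 0.0
    return (ssr / (k - 1)) / (sse / (n - k))

def f_max(groups):
    """Maximum of the pointwise F statistics over the grid."""
    grid_size = len(groups[0][0])
    return max(pointwise_f(groups, j) for j in range(grid_size))

def bootstrap_critical_value(groups, alpha=0.05, num_boot=500, rng=None):
    """Approximate the upper-alpha critical value of F_max under the null.

    Assumed scheme: center each curve by its group mean, pool the
    centered curves, and resample them with replacement into groups
    of the original sizes.
    """
    rng = rng or random.Random()
    grid_size = len(groups[0][0])
    centered = []
    for g in groups:
        mean = [sum(c[j] for c in g) / len(g) for j in range(grid_size)]
        centered.extend([[c[j] - mean[j] for j in range(grid_size)] for c in g])
    sizes = [len(g) for g in groups]
    stats = []
    for _ in range(num_boot):
        resampled = [[rng.choice(centered) for _ in range(s)] for s in sizes]
        stats.append(f_max(resampled))
    stats.sort()
    return stats[min(num_boot - 1, int((1 - alpha) * num_boot))]

# Toy data: two groups of three curves observed at two grid points.
toy = [
    [[1.0, 1.0], [2.0, 2.0], [3.0, 4.0]],
    [[2.0, 1.0], [3.0, 2.0], [4.0, 4.0]],
]
observed = f_max(toy)
critical = bootstrap_critical_value(toy, rng=random.Random(0))
```

The null hypothesis of equal group mean functions would be rejected when `observed` exceeds `critical`; with real data the grid would be much finer and the groups larger than in this toy example.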
-
Nonparametric independence screening and structure identification for ultra-high dimensional longitudinal data
Authors:
Ming-Yen Cheng,
Toshio Honda,
Jialiang Li,
Heng Peng
Abstract:
Ultra-high dimensional longitudinal data are increasingly common and the analysis is challenging both theoretically and methodologically. We offer a new automatic procedure for finding a sparse semivarying coefficient model, which is widely accepted for longitudinal data analysis. Our proposed method first reduces the number of covariates to a moderate order by employing a screening procedure, and then identifies both the varying and constant coefficients using a group SCAD estimator, which is subsequently refined by accounting for the within-subject correlation. The screening procedure is based on working independence and B-spline marginal models. Under weaker conditions than those in the literature, we show that with high probability only irrelevant variables will be screened out, and the number of selected variables can be bounded by a moderate order. This allows the desirable sparsity and oracle properties of the subsequent structure identification step. Note that existing methods require some kind of iterative screening in order to achieve this, thus they demand heavy computational effort and consistency is not guaranteed. The refined semivarying coefficient model employs profile least squares, local linear smoothing and nonparametric covariance estimation, and is semiparametric efficient. We also suggest ways to implement the proposed methods, and to select the tuning parameters. An extensive simulation study is summarized to demonstrate its finite sample performance and the yeast cell cycle data is analyzed.
Submitted 23 September, 2014; v1 submitted 19 August, 2013;
originally announced August 2013.