-
Solving the Best Subset Selection Problem via Suboptimal Algorithms
Authors:
Vikram Singh,
Min Sun
Abstract:
Best subset selection in linear regression is well known to be nonconvex and computationally challenging to solve, as the number of possible subsets grows rapidly with increasing dimensionality of the problem. As a result, finding the global optimal solution via an exact optimization method for a problem with dimensions of 1000s may take an impractical amount of CPU time. This suggests the importance of finding suboptimal procedures that can provide good approximate solutions using much less computational effort than exact methods. In this work, we introduce a new procedure and compare it with other popular suboptimal algorithms to solve the best subset selection problem. Extensive computational experiments using synthetic and real data have been performed. The results provide insights into the performance of these methods in different data settings. The new procedure is observed to be a competitive suboptimal algorithm for solving the best subset selection problem for high-dimensional data.
Submitted 31 March, 2025;
originally announced March 2025.
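To make the abstract's point about rapidly growing subset counts concrete, here is a minimal sketch that counts the candidate subsets an exact method would in principle have to consider. The function names are hypothetical and chosen only for illustration. This is plain arithmetic, not the procedure proposed in the paper.

```python
import math


def count_subsets(p: int) -> int:
    """Total number of predictor subsets (including the empty one) for p predictors."""
    return 2 ** p


def count_subsets_of_size(p: int, k: int) -> int:
    """Number of subsets containing exactly k of the p predictors."""
    return math.comb(p, k)


# The total doubles with every additional predictor.
for p in (10, 20, 40):
    print(f"p={p}: {count_subsets(p)} subsets in total")

# Even with the subset size fixed at 10, p = 1000 leaves an enormous search space.
print(f"p=1000, k=10: {count_subsets_of_size(1000, 10)} subsets")
```

Exhaustive enumeration is therefore only feasible for small problems, which is the motivation the abstract gives for suboptimal procedures.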
-
A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules
Authors:
Kairong Luo,
Haodong Wen,
Shengding Hu,
Zhenbo Sun,
Zhiyuan Liu,
Maosong Sun,
Kaifeng Lyu,
Wenguang Chen
Abstract:
Training large models is both resource-intensive and time-consuming, making it crucial to understand the quantitative relationship between model performance and hyperparameters. In this paper, we present an empirical law that describes how the pretraining loss of large language models evolves under different learning rate schedules, such as constant, cosine, and step decay schedules. Our proposed law takes a multi-power form, combining a power law based on the sum of learning rates and additional power laws to account for a loss reduction effect induced by learning rate decay. We extensively validate this law on various model sizes and architectures, and demonstrate that after fitting on a few learning rate schedules, the law accurately predicts the loss curves for unseen schedules of different shapes and horizons. Moreover, by minimizing the predicted final pretraining loss across learning rate schedules, we are able to find a schedule that outperforms the widely used cosine learning rate schedule. Interestingly, this automatically discovered schedule bears some resemblance to the recently proposed Warmup-Stable-Decay (WSD) schedule (Hu et al, 2024) but achieves a slightly lower final loss. We believe these results could offer valuable insights for understanding the dynamics of pretraining and designing learning rate schedules to improve efficiency.
Submitted 17 March, 2025;
originally announced March 2025.
-
Coherent Disaggregation and Uncertainty Quantification for Spatially Misaligned Data
Authors:
Man Ho Suen,
Mark Naylor,
Finn Lindgren
Abstract:
Spatial misalignment problems arise from both data aggregation and attempts to align misaligned data, leading to information loss. We propose a Bayesian disaggregation framework that links misaligned data to a continuous domain model using an iteratively linearised integration method via integrated nested Laplace approximation (INLA). The framework supports point pattern and aggregated count models under four covariate field scenarios: Raster at Full Resolution (RastFull), Raster Aggregation (RastAgg), Polygon Aggregation (PolyAgg), and Point Values (PointVal). The first three involve aggregation, while the latter two have incomplete fields. For PolyAgg and PointVal, we estimate the full covariate field using Value Plugin, Joint Uncertainty, and Uncertainty Plugin methods, with the latter two accounting for uncertainty propagation. These methods demonstrate superior performance, and remain more robust even under model misspecification (i.e., modelling a nonlinear field as linear).
In landslide studies, landslide occurrences are often aggregated into counts based on slope units, reducing spatial detail. The results indicate that point pattern observations and full-resolution covariate fields should be prioritized. For incomplete fields, methods incorporating uncertainty propagation are preferred. This framework supports landslide susceptibility and other spatial mapping, integrating seamlessly with INLA-extension packages.
Submitted 4 April, 2025; v1 submitted 14 February, 2025;
originally announced February 2025.
-
A Survey on Large Language Model-based Agents for Statistics and Data Science
Authors:
Maojun Sun,
Ruijian Han,
Binyan Jiang,
Houduo Qi,
Defeng Sun,
Yancheng Yuan,
Jian Huang
Abstract:
In recent years, data science agents powered by Large Language Models (LLMs), known as "data agents," have shown significant potential to transform the traditional data analysis paradigm. This survey provides an overview of the evolution, capabilities, and applications of LLM-based data agents, highlighting their role in simplifying complex data tasks and lowering the entry barrier for users without related expertise. We explore current trends in the design of LLM-based frameworks, detailing essential features such as planning, reasoning, reflection, multi-agent collaboration, user interface, knowledge integration, and system design, which enable agents to address data-centric problems with minimal human intervention. Furthermore, we analyze several case studies to demonstrate the practical applications of various data agents in real-world scenarios. Finally, we identify key challenges and propose future research directions to advance the development of data agents into intelligent statistical analysis software.
Submitted 18 December, 2024;
originally announced December 2024.
-
Score and Distribution Matching Policy: Advanced Accelerated Visuomotor Policies via Matched Distillation
Authors:
Bofang Jia,
Pengxiang Ding,
Can Cui,
Mingyang Sun,
Pengfang Qian,
Siteng Huang,
Zhaoxin Fan,
Donglin Wang
Abstract:
Visual-motor policy learning has advanced with architectures like diffusion-based policies, known for modeling complex robotic trajectories. However, their prolonged inference times hinder high-frequency control tasks requiring real-time feedback. While consistency distillation (CD) accelerates inference, it introduces errors that compromise action quality. To address these limitations, we propose the Score and Distribution Matching Policy (SDM Policy), which transforms diffusion-based policies into single-step generators through a two-stage optimization process: score matching ensures alignment with true action distributions, and distribution matching minimizes KL divergence for consistency. A dual-teacher mechanism integrates a frozen teacher for stability and an unfrozen teacher for adversarial training, enhancing robustness and alignment with target distributions. Evaluated on a 57-task simulation benchmark, SDM Policy achieves a 6x inference speedup while having state-of-the-art action quality, providing an efficient and reliable framework for high-frequency robotic tasks.
Submitted 19 December, 2024; v1 submitted 12 December, 2024;
originally announced December 2024.
-
Sparsing Law: Towards Large Language Models with Greater Activation Sparsity
Authors:
Yuqi Luo,
Chenyang Song,
Xu Han,
Yingfa Chen,
Chaojun Xiao,
Zhiyuan Liu,
Maosong Sun
Abstract:
Activation sparsity denotes the existence of substantial weakly-contributed elements within activation outputs that can be eliminated, benefiting many important applications concerned with large language models (LLMs). Although promoting greater activation sparsity within LLMs deserves deep studies, existing works lack comprehensive and quantitative research on the correlation between activation sparsity and potentially influential factors. In this paper, we present a comprehensive study on the quantitative scaling properties and influential factors of the activation sparsity within decoder-only Transformer-based LLMs. Specifically, we propose PPL-$p\%$ sparsity, a precise and performance-aware activation sparsity metric that is applicable to any activation function. Through extensive experiments, we find several important phenomena. Firstly, different activation functions exhibit comparable performance but opposite training-time sparsity trends. The activation ratio (i.e., $1-\mathrm{sparsity\ ratio}$) evolves as a convergent increasing power-law and decreasing logspace power-law with the amount of training data for SiLU-activated and ReLU-activated LLMs, respectively. These demonstrate that ReLU is more efficient as the activation function than SiLU and can leverage more training data to improve activation sparsity. Secondly, the activation ratio linearly increases with the width-depth ratio below a certain bottleneck point, indicating the potential advantage of a deeper architecture at a fixed parameter scale. Finally, at similar width-depth ratios, we surprisingly find that the limit value of activation sparsity varies weakly with the parameter scale, i.e., the activation patterns within LLMs are insensitive to the parameter scale. These empirical laws towards LLMs with greater activation sparsity have important implications for making LLMs more efficient and interpretable.
Submitted 16 May, 2025; v1 submitted 4 November, 2024;
originally announced November 2024.
-
Data-Driven Approaches for Modelling Target Behaviour
Authors:
Isabel Schlangen,
André Brandenburger,
Mengwei Sun,
James R. Hopgood
Abstract:
The performance of tracking algorithms strongly depends on the chosen model assumptions regarding the target dynamics. If there is a strong mismatch between the chosen model and the true object motion, the track quality may be poor or the track is easily lost. Still, the true dynamics might not be known a priori or it is too complex to be expressed in a tractable mathematical formulation. This paper provides a comparative study between three different methods that use machine learning to describe the underlying object motion based on training data. The first method builds on Gaussian Processes (GPs) for predicting the object motion, the second learns the parameters of an Interacting Multiple Model (IMM) filter and the third uses a Long Short-Term Memory (LSTM) network as a motion model. All methods are compared against an Extended Kalman Filter (EKF) with an analytic motion model as a benchmark and their respective strengths are highlighted in one simulated and two real-world scenarios.
Submitted 14 October, 2024;
originally announced October 2024.
-
Optimizing MCMC-Driven Bayesian Neural Networks for High-Precision Medical Image Classification in Small Sample Sizes
Authors:
Mingyu Sun
Abstract:
This paper discusses the application of a Bayesian neural network based on the Markov Chain Monte Carlo method in medical image classification with small samples. Experimental results on two medical image datasets, including lung X-ray images and breast tissue slice images, show that this MCMC-based BNN model works very well on small-sample data and greatly improves the robustness and accuracy of classification. Model accuracy reached 85% for the lung X-ray dataset and 88% for the breast tissue slice dataset. To this end, we combine data augmentation techniques such as rotation, flipping, and scaling with regularization methods like dropout and weight decay to effectively improve the diversity of the training data and the generalization ability of the model. The performance of the model was evaluated using several metrics, including accuracy, precision, recall, and the F1 score. All of these have proven the advantages of BNN in small-sample medical image classification. This study not only enriches the application of BNN in the field of medical image classification, but also provides specific implementation paths and optimization methods, providing new solutions for future medical image analysis.
Submitted 18 September, 2024;
originally announced September 2024.
-
CoCA: Cooperative Component Analysis
Authors:
Daisy Yi Ding,
Alden Green,
Min Woo Sun,
Robert Tibshirani
Abstract:
We propose Cooperative Component Analysis (CoCA), a new method for unsupervised multi-view analysis: it identifies the component that simultaneously captures significant within-view variance and exhibits strong cross-view correlation. The challenge of integrating multi-view data is particularly important in biology and medicine, where various types of "-omic" data, ranging from genomics to proteomics, are measured on the same set of samples. The goal is to uncover important, shared signals that represent underlying biological mechanisms. CoCA combines an approximation error loss to preserve information within data views and an "agreement penalty" to encourage alignment across data views. By balancing the trade-off between these two key components in the objective, CoCA has the property of interpolating between the commonly-used principal component analysis (PCA) and canonical correlation analysis (CCA) as special cases at the two ends of the solution path. CoCA chooses the degree of agreement in a data-adaptive manner, using a validation set or cross-validation to estimate test error. Furthermore, we propose a sparse variant of CoCA that incorporates the Lasso penalty to yield feature sparsity, facilitating the identification of key features driving the observed patterns. We demonstrate the effectiveness of CoCA on simulated data and two real multiomics studies of COVID-19 and ductal carcinoma in situ of breast. In both real data applications, CoCA successfully integrates multiomics data, extracting components that are not only consistently present across different data views but also more informative and predictive of disease progression. CoCA offers a powerful framework for discovering important shared signals in multi-view data, with the potential to uncover novel insights in an increasingly multi-view data world.
Submitted 23 July, 2024;
originally announced July 2024.
-
inlabru: software for fitting latent Gaussian models with non-linear predictors
Authors:
Finn Lindgren,
Fabian Bachl,
Janine Illian,
Man Ho Suen,
Håvard Rue,
Andrew E. Seaton
Abstract:
The integrated nested Laplace approximation (INLA) method has become a popular approach for computationally efficient approximate Bayesian computation. In particular, by leveraging sparsity in random effect precision matrices, INLA is commonly used in spatial and spatio-temporal applications. However, the speed of INLA comes at the cost of restricting the user to the family of latent Gaussian models and the likelihoods currently implemented in {INLA}, the main software implementation of the INLA methodology.
{inlabru} is a software package that extends the types of models that can be fitted using INLA by allowing the latent predictor to be non-linear in its parameters, moving beyond the additive linear predictor framework to allow more complex functional relationships. For inference it uses an approximate iterative method based on the first-order Taylor expansion of the non-linear predictor, fitting the model using INLA for each linearised model configuration.
{inlabru} automates much of the workflow required to fit models using {R-INLA}, simplifying the process for users to specify, fit and predict from models. There is additional support for fitting joint likelihood models by building each likelihood individually. {inlabru} also supports the direct use of spatial data structures, such as those implemented in the {sf} and {terra} packages.
In this paper we outline the statistical theory, model structure and basic syntax required for users to understand and develop their own models using {inlabru}. We evaluate the approximate inference method using a Bayesian method checking approach. We provide three examples modelling simulated spatial data that demonstrate the benefits of the additional flexibility provided by {inlabru}.
Submitted 30 June, 2024;
originally announced July 2024.
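The first-order Taylor expansion mentioned in the abstract can be written out as follows. The notation is an assumption made for illustration and is not taken from the abstract: $\tilde{\eta}$ denotes the non-linear predictor, $u$ the latent vector, and $u_0$ the current linearisation point.

```latex
\bar{\eta}(u) = \tilde{\eta}(u_0) + B\,(u - u_0),
\qquad
B = \left.\frac{\partial \tilde{\eta}(u)}{\partial u}\right|_{u = u_0}
```

Because $\bar{\eta}$ is linear in $u$, each linearised model stays within the latent Gaussian framework that INLA can fit. Iterating then amounts to refitting with an updated $u_0$ until the linearisation point stabilises.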
-
Recovering Linear Causal Models with Latent Variables via Cholesky Factorization of Covariance Matrix
Authors:
Yunfeng Cai,
Xu Li,
Minging Sun,
Ping Li
Abstract:
Discovering the causal relationship via recovering the directed acyclic graph (DAG) structure from the observed data is a well-known challenging combinatorial problem. When there are latent variables, the problem becomes even more difficult. In this paper, we first propose a DAG structure recovering algorithm, which is based on the Cholesky factorization of the covariance matrix of the observed data. The algorithm is fast and easy to implement and has theoretical guarantees for exact recovery. On synthetic and real-world datasets, the algorithm is significantly faster than previous methods and achieves state-of-the-art performance. Furthermore, under the equal error variances assumption, we incorporate an optimization procedure into the Cholesky-factorization-based algorithm to handle the DAG recovering problem with latent variables. Numerical simulations show that the modified "Cholesky + optimization" algorithm is able to recover the ground truth graph in most cases and outperforms existing algorithms.
Submitted 1 November, 2023;
originally announced November 2023.
-
Imitating Human Behaviour with Diffusion Models
Authors:
Tim Pearce,
Tabish Rashid,
Anssi Kanervisto,
Dave Bignell,
Mingfei Sun,
Raluca Georgescu,
Sergio Valcarcel Macua,
Shan Zheng Tan,
Ida Momennejad,
Katja Hofmann,
Sam Devlin
Abstract:
Diffusion models have emerged as powerful generative models in the text-to-image domain. This paper studies their application as observation-to-action models for imitating human behaviour in sequential environments. Human behaviour is stochastic and multimodal, with structured correlations between action dimensions. Meanwhile, standard modelling choices in behaviour cloning are limited in their expressiveness and may introduce bias into the cloned policy. We begin by pointing out the limitations of these choices. We then propose that diffusion models are an excellent fit for imitating human behaviour, since they learn an expressive distribution over the joint action space. We introduce several innovations to make diffusion models suitable for sequential environments: designing suitable architectures, investigating the role of guidance, and developing reliable sampling strategies. Experimentally, diffusion models closely match human demonstrations in a simulated robotic control task and a modern 3D gaming environment.
Submitted 3 March, 2023; v1 submitted 25 January, 2023;
originally announced January 2023.
-
On the Overlooked Structure of Stochastic Gradients
Authors:
Zeke Xie,
Qian-Yuan Tang,
Mingming Sun,
Ping Li
Abstract:
Stochastic gradients closely relate to both optimization and generalization of deep neural networks (DNNs). Some works attempted to explain the success of stochastic optimization for deep learning by the arguably heavy-tail properties of gradient noise, while other works presented theoretical and empirical evidence against the heavy-tail hypothesis on gradient noise. Unfortunately, formal statistical tests for analyzing the structure and heavy tails of stochastic gradients in deep learning are still under-explored. In this paper, we mainly make two contributions. First, we conduct formal statistical tests on the distribution of stochastic gradients and gradient noise across both parameters and iterations. Our statistical tests reveal that dimension-wise gradients usually exhibit power-law heavy tails, while iteration-wise gradients and stochastic gradient noise caused by minibatch training usually do not exhibit power-law heavy tails. Second, we further discover that the covariance spectra of stochastic gradients have power-law structures overlooked by previous studies, and we present their theoretical implications for the training of DNNs. While previous studies believed that the anisotropic structure of stochastic gradients matters to deep learning, they did not expect that the gradient covariance can have such an elegant mathematical structure. Our work challenges the existing belief and provides novel insights on the structure of stochastic gradients in deep learning.
Submitted 20 October, 2023; v1 submitted 5 December, 2022;
originally announced December 2022.
-
Confidence intervals for the Cox model test error from cross-validation
Authors:
Min Woo Sun,
Robert Tibshirani
Abstract:
Cross-validation (CV) is one of the most widely used techniques in statistical learning for estimating the test error of a model, but its behavior is not yet fully understood. It has been shown that standard confidence intervals for test error using estimates from CV may have coverage below nominal levels. This phenomenon occurs because each sample is used in both the training and testing procedures during CV and as a result, the CV estimates of the errors become correlated. Without accounting for this correlation, the estimate of the variance is smaller than it should be. One way to mitigate this issue is by estimating the mean squared error of the prediction error instead using nested CV. This approach has been shown to achieve superior coverage compared to intervals derived from standard CV. In this work, we generalize the nested CV idea to the Cox proportional hazards model and explore various choices of test error for this setting.
Submitted 6 October, 2023; v1 submitted 26 January, 2022;
originally announced January 2022.
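As a rough illustration of the "standard" interval that the abstract says can undercover, the sketch below computes a naive K-fold CV confidence interval by treating the held-out errors as independent. It is a deliberately simplified stand-in: it uses squared error with a mean-only predictor instead of a Cox model, the function name and defaults are hypothetical, and it does not implement the nested CV correction studied in the paper.

```python
import math
import random
import statistics


def naive_cv_interval(y, n_folds=5, z=1.96, seed=0):
    """Naive CV interval for the squared-error test error of a mean-only predictor."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    # Assign shuffled indices to folds in round-robin fashion.
    folds = [idx[k::n_folds] for k in range(n_folds)]

    errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train = [y[i] for i in idx if i not in held_out]
        pred = statistics.fmean(train)  # "fit" the model on the training folds
        errors.extend((y[i] - pred) ** 2 for i in test_idx)

    err_hat = statistics.fmean(errors)
    # Naive standard error: ignores the correlation between held-out errors
    # that arises because every sample is also used for training.
    se = statistics.stdev(errors) / math.sqrt(len(errors))
    return err_hat, (err_hat - z * se, err_hat + z * se)


rng = random.Random(1)
y = [rng.gauss(0.0, 1.0) for _ in range(200)]
estimate, (lower, upper) = naive_cv_interval(y)
print(f"CV error: {estimate:.3f}, naive 95% interval: ({lower:.3f}, {upper:.3f})")
```

The standard error here is the quantity the abstract describes as too small, since it does not account for the correlation among the CV error estimates.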
-
Incompatibility Clustering as a Defense Against Backdoor Poisoning Attacks
Authors:
Charles Jin,
Melinda Sun,
Martin Rinard
Abstract:
We propose a novel clustering mechanism based on an incompatibility property between subsets of data that emerges during model training. This mechanism partitions the dataset into subsets that generalize only to themselves, i.e., training on one subset does not improve performance on the other subsets. Leveraging the interaction between the dataset and the training process, our clustering mechanism partitions datasets into clusters that are defined by--and therefore meaningful to--the objective of the training process.
We apply our clustering mechanism to defend against data poisoning attacks, in which the attacker injects malicious poisoned data into the training dataset to affect the trained model's output. Our evaluation focuses on backdoor attacks against deep neural networks trained to perform image classification using the GTSRB and CIFAR-10 datasets. Our results show that (1) these attacks produce poisoned datasets in which the poisoned and clean data are incompatible and (2) our technique successfully identifies (and removes) the poisoned data. In an end-to-end evaluation, our defense reduces the attack success rate to below 1% on 134 out of 165 scenarios, with only a 2% drop in clean accuracy on CIFAR-10 and a negligible drop in clean accuracy on GTSRB.
Submitted 27 April, 2023; v1 submitted 8 May, 2021;
originally announced May 2021.
-
Learning Deep Neural Networks under Agnostic Corrupted Supervision
Authors:
Boyang Liu,
Mengying Sun,
Ding Wang,
Pang-Ning Tan,
Jiayu Zhou
Abstract:
Training deep neural models in the presence of corrupted supervision is challenging as the corrupted data points may significantly impact the generalization performance. To alleviate this problem, we present an efficient robust algorithm that achieves strong guarantees without any assumption on the type of corruption and provides a unified framework for both classification and regression problems. Unlike many existing approaches that quantify the quality of the data points (e.g., based on their individual loss values) and filter them accordingly, the proposed algorithm focuses on controlling the collective impact of data points on the average gradient. Even when a corrupted data point fails to be excluded by our algorithm, the data point will have a very limited impact on the overall loss, as compared with state-of-the-art filtering methods based on loss values. Extensive experiments on multiple benchmark datasets have demonstrated the robustness of our algorithm under different types of corruption.
Submitted 12 February, 2021;
originally announced February 2021.
-
Improving Auto-Augment via Augmentation-Wise Weight Sharing
Authors:
Keyu Tian,
Chen Lin,
Ming Sun,
Luping Zhou,
Junjie Yan,
Wanli Ouyang
Abstract:
The recent progress on automatically searching augmentation policies has boosted the performance substantially for various tasks. A key component of automatic augmentation search is the evaluation process for a particular augmentation policy, which is utilized to return reward and usually runs thousands of times. A plain evaluation process, which includes full model training and validation, would be time-consuming. To achieve efficiency, many choose to sacrifice evaluation reliability for speed. In this paper, we dive into the dynamics of augmented training of the model. This inspires us to design a powerful and efficient proxy task based on the Augmentation-Wise Weight Sharing (AWS) to form a fast yet accurate evaluation process in an elegant way. Comprehensive analysis verifies the superiority of this approach in terms of effectiveness and efficiency. The augmentation policies found by our method achieve superior accuracies compared with existing auto-augmentation search methods. On CIFAR-10, we achieve a top-1 error rate of 1.24%, which is currently the best performing single model without extra training data. On ImageNet, we get a top-1 error rate of 20.36% for ResNet-50, which leads to 3.34% absolute error rate reduction over the baseline augmentation.
Submitted 22 October, 2020; v1 submitted 30 September, 2020;
originally announced September 2020.
-
Improving MF-DFA model with applications in precious metals market
Authors:
Zhongjun Wang,
Mengye Sun,
A. M. Elsawah
Abstract:
With the aggravation of the global economic crisis and inflation, the precious metals with safe-haven function have become more popular. An improved MF-DFA method is proposed to analyze price fluctuations of the precious metals market. Based on the widely used multifractal detrended fluctuation analysis method (MF-DFA), we compare these two methods and find that the Bi-OSW-MF-DFA method possesses better efficiency. This article analyzes the degree of multifractality between spot gold market and spot silver market as well as their risks. From the numerical results and figures, it is found that two elements constitute the contributions in the formation of multifractality in time series and the risk of the spot silver market is higher than that of the spot gold market. This attempt could lead to a better understanding of complicated precious metals market.
Submitted 26 June, 2020;
originally announced June 2020.
-
Ansor: Generating High-Performance Tensor Programs for Deep Learning
Authors:
Lianmin Zheng,
Chengfan Jia,
Minmin Sun,
Zhao Wu,
Cody Hao Yu,
Ameer Haj-Ali,
Yida Wang,
Jun Yang,
Danyang Zhuo,
Koushik Sen,
Joseph E. Gonzalez,
Ion Stoica
Abstract:
High-performance tensor programs are crucial to guarantee efficient execution of deep neural networks. However, obtaining performant tensor programs for different operators on various hardware platforms is notoriously challenging. Currently, deep learning systems rely on vendor-provided kernel libraries or various search strategies to get performant tensor programs. These approaches either require significant engineering effort to develop platform-specific optimization code or fall short of finding high-performance programs due to restricted search space and ineffective exploration strategy.
We present Ansor, a tensor program generation framework for deep learning applications. Compared with existing search strategies, Ansor explores many more optimization combinations by sampling programs from a hierarchical representation of the search space. Ansor then fine-tunes the sampled programs with evolutionary search and a learned cost model to identify the best programs. Ansor can find high-performance programs that are outside the search space of existing state-of-the-art approaches. In addition, Ansor utilizes a task scheduler to simultaneously optimize multiple subgraphs in deep neural networks. We show that Ansor improves the execution performance of deep neural networks relative to the state-of-the-art on the Intel CPU, ARM CPU, and NVIDIA GPU by up to $3.8\times$, $2.6\times$, and $1.7\times$, respectively.
Submitted 15 October, 2023; v1 submitted 11 June, 2020;
originally announced June 2020.
-
Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems
Authors:
Weijie Zhao,
Deping Xie,
Ronglai Jia,
Yulei Qian,
Ruiquan Ding,
Mingming Sun,
Ping Li
Abstract:
Neural networks of ads systems usually take input from multiple resources, e.g., query-ad relevance, ad features and user portraits. These inputs are encoded into one-hot or multi-hot binary features, with typically only a tiny fraction of nonzero feature values per example. Deep learning models in online advertising industries can have terabyte-scale parameters that do not fit in the GPU memory nor the CPU main memory on a computing node. For example, a sponsored online advertising system can contain more than $10^{11}$ sparse features, making the neural network a massive model with around 10 TB parameters. In this paper, we introduce a distributed GPU hierarchical parameter server for massive scale deep learning ads systems. We propose a hierarchical workflow that utilizes GPU High-Bandwidth Memory, CPU main memory and SSD as 3-layer hierarchical storage. All the neural network training computations are contained in GPUs. Extensive experiments on real-world data confirm the effectiveness and the scalability of the proposed system. A 4-node hierarchical GPU parameter server can train a model more than 2X faster than a 150-node in-memory distributed parameter server in an MPI cluster. In addition, the price-performance ratio of our proposed system is 4-9 times better than an MPI-cluster solution.
Submitted 12 March, 2020;
originally announced March 2020.
-
Denoised Smoothing: A Provable Defense for Pretrained Classifiers
Authors:
Hadi Salman,
Mingjie Sun,
Greg Yang,
Ashish Kapoor,
J. Zico Kolter
Abstract:
We present a method for provably defending any pretrained image classifier against $\ell_p$ adversarial attacks. This method, for instance, allows public vision API providers and users to seamlessly convert pretrained non-robust classification services into provably robust ones. By prepending a custom-trained denoiser to any off-the-shelf image classifier and using randomized smoothing, we effectively create a new classifier that is guaranteed to be $\ell_p$-robust to adversarial examples, without modifying the pretrained classifier. Our approach applies to both the white-box and the black-box settings of the pretrained classifier. We refer to this defense as denoised smoothing, and we demonstrate its effectiveness through extensive experimentation on ImageNet and CIFAR-10. Finally, we use our approach to provably defend the Azure, Google, AWS, and ClarifAI image classification APIs. Our code replicating all the experiments in the paper can be found at: https://github.com/microsoft/denoised-smoothing.
Submitted 20 September, 2020; v1 submitted 4 March, 2020;
originally announced March 2020.
-
Towards an Efficient and General Framework of Robust Training for Graph Neural Networks
Authors:
Kaidi Xu,
Sijia Liu,
Pin-Yu Chen,
Mengshu Sun,
Caiwen Ding,
Bhavya Kailkhura,
Xue Lin
Abstract:
Graph Neural Networks (GNNs) have made significant advances on several fundamental inference tasks. As a result, there is a surge of interest in using these models for making potentially important decisions in high-regret applications. However, despite GNNs' impressive performance, it has been observed that carefully crafted perturbations on graph structures (or node attributes) lead them to make wrong predictions. The presence of these adversarial examples raises serious security concerns. Most of the existing robust GNN design/training methods are only applicable to white-box settings where model parameters are known and gradient-based methods can be used by performing convex relaxation of the discrete graph domain. More importantly, these methods are not efficient or scalable, which makes them infeasible for time-sensitive tasks and massive graph datasets. To overcome these limitations, we propose a general framework which leverages greedy search algorithms and zeroth-order methods to obtain robust GNNs in a generic and efficient manner. On several applications, we show that the proposed techniques are significantly less computationally expensive and, in some cases, more robust than the state-of-the-art methods, making them suitable for large-scale problems which were out of the reach of traditional robust training methods.
Submitted 25 February, 2020;
originally announced February 2020.
-
Few-shot acoustic event detection via meta-learning
Authors:
Bowen Shi,
Ming Sun,
Krishna C. Puvvada,
Chieh-Chi Kao,
Spyros Matsoukas,
Chao Wang
Abstract:
We study few-shot acoustic event detection (AED) in this paper. Few-shot learning enables detection of new events with very limited labeled data. Compared to other research areas like computer vision, few-shot learning for audio recognition has been under-studied. We formulate the few-shot AED problem and explore different ways of utilizing traditional supervised methods for this setting as well as a variety of meta-learning approaches, which are conventionally used to solve few-shot classification problems. Compared to supervised baselines, meta-learning models achieve superior performance, thus showing their effectiveness in generalizing to new audio events. Our analysis, including the impact of initialization and domain discrepancy, further validates the advantage of meta-learning approaches in few-shot AED.
Submitted 21 February, 2020;
originally announced February 2020.
-
Acoustic scene analysis with multi-head attention networks
Authors:
Weimin Wang,
Weiran Wang,
Ming Sun,
Chao Wang
Abstract:
Acoustic Scene Classification (ASC) is a challenging task, as a single scene may involve multiple events that contain complex sound patterns. For example, a cooking scene may contain several sound sources including silverware clinking, chopping, frying, etc. What complicates ASC more is that classes of different activities could have overlapping sound patterns (e.g., both cooking and dishwashing could have a silverware clinking sound). In this paper, we propose a multi-head attention network to model the complex temporal input structures for ASC. The proposed network takes the audio's time-frequency representation as input, and it leverages standard VGG plus LSTM layers to extract a high-level feature representation. Furthermore, it applies multiple attention heads to summarize various patterns of sound events into a fixed-dimensional representation, for the purpose of final scene classification. The whole network is trained in an end-to-end fashion with back-propagation. Experimental results confirm that our model discovers meaningful sound patterns through the attention mechanism, without using explicit supervision in the alignment. We evaluated our proposed model using the DCASE 2018 Task 5 dataset, and achieved competitive performance on par with the previous winner's results.
Submitted 16 September, 2019;
originally announced September 2019.
-
GEAR: Graph-based Evidence Aggregating and Reasoning for Fact Verification
Authors:
Jie Zhou,
Xu Han,
Cheng Yang,
Zhiyuan Liu,
Lifeng Wang,
Changcheng Li,
Maosong Sun
Abstract:
Fact verification (FV) is a challenging task which requires retrieving relevant evidence from plain text and using the evidence to verify given claims. Many claims require simultaneously integrating and reasoning over several pieces of evidence for verification. However, previous work employs simple models to extract information from evidence without letting the pieces of evidence communicate with each other, e.g., merely concatenating the evidence for processing. Therefore, these methods are unable to grasp sufficient relational and logical information among the evidence. To alleviate this issue, we propose a graph-based evidence aggregating and reasoning (GEAR) framework which enables information to transfer on a fully-connected evidence graph and then utilizes different aggregators to collect multi-evidence information. We further employ BERT, an effective pre-trained language representation model, to improve the performance. Experimental results on the large-scale benchmark dataset FEVER have demonstrated that GEAR can leverage multi-evidence information for FV and thus achieves a promising result with a test FEVER score of 67.10%. Our code is available at https://github.com/thunlp/GEAR.
Submitted 22 July, 2019;
originally announced August 2019.
-
Characterizing Attacks on Deep Reinforcement Learning
Authors:
Xinlei Pan,
Chaowei Xiao,
Warren He,
Shuang Yang,
Jian Peng,
Mingjie Sun,
Jinfeng Yi,
Zijiang Yang,
Mingyan Liu,
Bo Li,
Dawn Song
Abstract:
Recent studies show that Deep Reinforcement Learning (DRL) models are vulnerable to adversarial attacks, which attack DRL models by adding small perturbations to the observations. However, some attacks assume full availability of the victim model, and some require a huge amount of computation, making them less feasible for real-world applications. In this work, we make further explorations of the vulnerabilities of DRL by studying other aspects of attacks on DRL using realistic and efficient attacks. First, we adapt and propose efficient black-box attacks when we do not have access to DRL model parameters. Second, to address the high computational demands of existing attacks, we introduce efficient online sequential attacks that exploit temporal consistency across consecutive steps. Third, we explore the possibility of an attacker perturbing other aspects in the DRL setting, such as the environment dynamics. Finally, to account for imperfections in how an attacker would inject perturbations in the physical world, we devise a method for generating robust physical perturbations to be printed. The attack is evaluated on a real-world robot under various conditions. We conduct extensive experiments both in simulation, such as Atari games, robotics and autonomous driving, and on real-world robotics, to compare the effectiveness of the proposed attacks with baseline approaches. To the best of our knowledge, we are the first to apply adversarial attacks on DRL systems to physical robots.
Submitted 16 February, 2022; v1 submitted 21 July, 2019;
originally announced July 2019.
-
Quantifying Similarity between Relations with Fact Distribution
Authors:
Weize Chen,
Hao Zhu,
Xu Han,
Zhiyuan Liu,
Maosong Sun
Abstract:
We introduce a conceptually simple and effective method to quantify the similarity between relations in knowledge bases. Specifically, our approach is based on the divergence between the conditional probability distributions over entity pairs. In this paper, these distributions are parameterized by a very simple neural network. Although computing the exact similarity is intractable, we provide a sampling-based method to get a good approximation. We empirically show that the outputs of our approach significantly correlate with human judgments. By applying our method to various tasks, we also find that (1) our approach could effectively detect redundant relations extracted by open information extraction (Open IE) models, that (2) even the most competitive models for relational classification still make mistakes among very similar relations, and that (3) our approach could be incorporated into negative sampling and softmax classification to alleviate these mistakes. The source code and experiment details of this paper can be obtained from https://github.com/thunlp/relation-similarity.
Submitted 21 July, 2019;
originally announced July 2019.
-
Adversarial Imitation Learning from Incomplete Demonstrations
Authors:
Mingfei Sun,
Xiaojuan Ma
Abstract:
Imitation learning targets deriving a mapping from states to actions, a.k.a. policy, from expert demonstrations. Existing methods for imitation learning typically require any actions in the demonstrations to be fully available, which is hard to ensure in real applications. Though algorithms for learning with unobservable actions have been proposed, they focus solely on state information and overlook the fact that the action sequence could still be partially available and provide useful information for policy deriving. In this paper, we propose a novel algorithm called Action-Guided Adversarial Imitation Learning (AGAIL) that learns a policy from demonstrations with incomplete action sequences, i.e., incomplete demonstrations. The core idea of AGAIL is to separate demonstrations into state and action trajectories, and train a policy with state trajectories while using actions as auxiliary information to guide the training whenever applicable. Built upon the Generative Adversarial Imitation Learning, AGAIL has three components: a generator, a discriminator, and a guide. The generator learns a policy with rewards provided by the discriminator, which tries to distinguish state distributions between demonstrations and samples generated by the policy. The guide provides additional rewards to the generator when demonstrated actions for specific states are available. We compare AGAIL to other methods on benchmark tasks and show that AGAIL consistently delivers comparable performance to the state-of-the-art methods even when the action sequence in demonstrations is only partially available.
Submitted 23 June, 2019; v1 submitted 29 May, 2019;
originally announced May 2019.
-
Mutual Information Maximization in Graph Neural Networks
Authors:
Xinhan Di,
Pengqian Yu,
Rui Bu,
Mingchao Sun
Abstract:
A variety of graph neural networks (GNNs) frameworks for representation learning on graphs have been recently developed. These frameworks rely on aggregation and iteration scheme to learn the representation of nodes. However, information between nodes is inevitably lost in the scheme during learning. In order to reduce the loss, we extend the GNNs frameworks by exploring the aggregation and iteration scheme in the methodology of mutual information. We propose a new approach of enlarging the normal neighborhood in the aggregation of GNNs, which aims at maximizing mutual information. Based on a series of experiments conducted on several benchmark datasets, we show that the proposed approach improves the state-of-the-art performance for four types of graph tasks, including supervised and semi-supervised graph classification, graph link prediction and graph edge generation and classification.
Submitted 23 March, 2020; v1 submitted 21 May, 2019;
originally announced May 2019.
-
Learning a Multi-Modal Policy via Imitating Demonstrations with Mixed Behaviors
Authors:
Fang-I Hsiao,
Jui-Hsuan Kuo,
Min Sun
Abstract:
We propose a novel approach to train a multi-modal policy from mixed demonstrations without their behavior labels. We develop a method to discover the latent factors of variation in the demonstrations. Specifically, our method is based on the variational autoencoder with a categorical latent variable. The encoder infers discrete latent factors corresponding to different behaviors from demonstrations. The decoder, as a policy, performs the behaviors accordingly. Once learned, the policy is able to reproduce a specific behavior by simply conditioning on a categorical vector. We evaluate our method on three different tasks, including a challenging task with high-dimensional visual inputs. Experimental results show that our approach is better than various baseline methods and competitive with a multi-modal policy trained by ground truth behavior labels.
Submitted 25 March, 2019;
originally announced March 2019.
-
Graph Neural Networks: A Review of Methods and Applications
Authors:
Jie Zhou,
Ganqu Cui,
Shengding Hu,
Zhengyan Zhang,
Cheng Yang,
Zhiyuan Liu,
Lifeng Wang,
Changcheng Li,
Maosong Sun
Abstract:
Lots of learning tasks require dealing with graph data which contains rich relation information among elements. Modeling physics systems, learning molecular fingerprints, predicting protein interface, and classifying diseases demand a model to learn from graph inputs. In other domains such as learning from non-structural data like texts and images, reasoning on extracted structures (like the dependency trees of sentences and the scene graphs of images) is an important research topic which also needs graph reasoning models. Graph neural networks (GNNs) are neural models that capture the dependence of graphs via message passing between the nodes of graphs. In recent years, variants of GNNs such as graph convolutional network (GCN), graph attention network (GAT), graph recurrent network (GRN) have demonstrated ground-breaking performances on many deep learning tasks. In this survey, we propose a general design pipeline for GNN models and discuss the variants of each component, systematically categorize the applications, and propose four open problems for future research.
Submitted 6 October, 2021; v1 submitted 20 December, 2018;
originally announced December 2018.
-
A deep learning-based remaining useful life prediction approach for bearings
Authors:
Cheng Cheng,
Guijun Ma,
Yong Zhang,
Mingyang Sun,
Fei Teng,
Han Ding,
Ye Yuan
Abstract:
In industrial applications, nearly half the failures of motors are caused by the degradation of rolling element bearings (REBs). Therefore, accurately estimating the remaining useful life (RUL) for REBs is of crucial importance to ensure the reliability and safety of mechanical systems. To tackle this challenge, model-based approaches are often limited by the complexity of mathematical modeling. Conventional data-driven approaches, on the other hand, require massive efforts to extract the degradation features and construct a health index. In this paper, a novel online data-driven framework is proposed to exploit the adoption of deep convolutional neural networks (CNN) in predicting the RUL of bearings. More concretely, the raw vibrations of training bearings are first processed using the Hilbert-Huang transform (HHT) and a novel nonlinear degradation indicator is constructed as the label for learning. The CNN is then employed to identify the hidden pattern between the extracted degradation indicator and the vibration of training bearings, which makes it possible to estimate the degradation of the test bearings automatically. Finally, the test bearings' RULs are predicted by using an $\varepsilon$-support vector regression model. The superior performance of the proposed RUL estimation framework, compared with the state-of-the-art approaches, is demonstrated through the experimental results. The generality of the proposed CNN model is also validated by transferring it to bearings undergoing different operating conditions.
Submitted 30 August, 2022; v1 submitted 8 December, 2018;
originally announced December 2018.
-
InstaNAS: Instance-aware Neural Architecture Search
Authors:
An-Chieh Cheng,
Chieh Hubert Lin,
Da-Cheng Juan,
Wei Wei,
Min Sun
Abstract:
Conventional Neural Architecture Search (NAS) aims at finding a single architecture that achieves the best performance, which usually optimizes task-related learning objectives such as accuracy. However, a single architecture may not be representative enough for the whole dataset with high diversity and variety. Intuitively, electing domain-expert architectures that are proficient in domain-specific features can further benefit architecture-related objectives such as latency. In this paper, we propose InstaNAS, an instance-aware NAS framework that employs a controller trained to search for a "distribution of architectures" instead of a single architecture. This allows the model to use sophisticated architectures, which usually come with a large architecture-related cost, for the difficult samples, and shallow architectures for the easy samples. During the inference phase, the controller assigns each of the unseen input samples a domain-expert architecture that can achieve high accuracy with customized inference costs. Experiments within a search space inspired by MobileNetV2 show InstaNAS can achieve up to 48.8% latency reduction without compromising accuracy on a series of datasets against MobileNetV2.
Submitted 23 May, 2019; v1 submitted 26 November, 2018;
originally announced November 2018.
-
Data Poisoning Attack against Unsupervised Node Embedding Methods
Authors:
Mingjie Sun,
Jian Tang,
Huichen Li,
Bo Li,
Chaowei Xiao,
Yao Chen,
Dawn Song
Abstract:
Unsupervised node embedding methods (e.g., DeepWalk, LINE, and node2vec) have attracted growing interest given their simplicity and effectiveness. However, although these methods have been proved effective in a variety of applications, none of the existing work has analyzed their robustness. This could be very risky if these methods are attacked by an adversarial party. In this paper, we take the task of link prediction as an example, which is one of the most fundamental problems for graph analysis, and introduce a data poisoning attack to node embedding methods. We give a complete characterization of the attacker's utilities and present efficient solutions to adversarial attacks for two popular node embedding methods: DeepWalk and LINE. We evaluate our proposed attack model on multiple real-world graphs. Experimental results show that our proposed model can significantly affect the results of link prediction by slightly changing the graph structures (e.g., adding or removing a few edges). We also show that our proposed model is very general and can be transferable across different embedding methods. Finally, we conduct a case study on a coauthor network to better understand our attack method.
Submitted 1 November, 2018; v1 submitted 30 October, 2018;
originally announced October 2018.
-
FewRel: A Large-Scale Supervised Few-Shot Relation Classification Dataset with State-of-the-Art Evaluation
Authors:
Xu Han,
Hao Zhu,
Pengfei Yu,
Ziyun Wang,
Yuan Yao,
Zhiyuan Liu,
Maosong Sun
Abstract:
We present a Few-Shot Relation Classification Dataset (FewRel), consisting of 70,000 sentences on 100 relations derived from Wikipedia and annotated by crowdworkers. The relation of each sentence is first recognized by distant supervision methods, and then filtered by crowdworkers. We adapt the most recent state-of-the-art few-shot learning methods for relation classification and conduct a thorough evaluation of these methods. Empirical results show that even the most competitive few-shot learning models struggle on this task, especially as compared with humans. We also show that a range of different reasoning skills are needed to solve our task. These results indicate that few-shot relation classification remains an open problem and still requires further research. Our detailed analysis points to multiple directions for future research. All details and resources about the dataset and baselines are released on http://zhuhao.me/fewrel.
Submitted 26 October, 2018; v1 submitted 23 October, 2018;
originally announced October 2018.
-
Rethinking the Value of Network Pruning
Authors:
Zhuang Liu,
Mingjie Sun,
Tinghui Zhou,
Gao Huang,
Trevor Darrell
Abstract:
Network pruning is widely used for reducing the heavy inference cost of deep models in low-resource settings. A typical pruning algorithm is a three-stage pipeline, i.e., training (a large model), pruning and fine-tuning. During pruning, according to a certain criterion, redundant weights are pruned and important weights are kept to best preserve the accuracy. In this work, we make several surprising observations which contradict common beliefs. For all state-of-the-art structured pruning algorithms we examined, fine-tuning a pruned model only gives comparable or worse performance than training that model with randomly initialized weights. For pruning algorithms which assume a predefined target network architecture, one can get rid of the full pipeline and directly train the target network from scratch. Our observations are consistent for multiple network architectures, datasets, and tasks, which imply that: 1) training a large, over-parameterized model is often not necessary to obtain an efficient final model, 2) learned "important" weights of the large model are typically not useful for the small pruned model, 3) the pruned architecture itself, rather than a set of inherited "important" weights, is more crucial to the efficiency in the final model, which suggests that in some cases pruning can be useful as an architecture search paradigm. Our results suggest the need for more careful baseline evaluations in future research on structured pruning methods. We also compare with the "Lottery Ticket Hypothesis" (Frankle & Carbin 2019), and find that with optimal learning rate, the "winning ticket" initialization as used in Frankle & Carbin (2019) does not bring improvement over random initialization.
Submitted 5 March, 2019; v1 submitted 11 October, 2018;
originally announced October 2018.
-
Searching Toward Pareto-Optimal Device-Aware Neural Architectures
Authors:
An-Chieh Cheng,
Jin-Dong Dong,
Chi-Hung Hsu,
Shu-Huan Chang,
Min Sun,
Shih-Chieh Chang,
Jia-Yu Pan,
Yu-Ting Chen,
Wei Wei,
Da-Cheng Juan
Abstract:
Recent breakthroughs in Neural Architecture Search (NAS) have achieved state-of-the-art performance in many tasks such as image classification and language understanding. However, most existing works only optimize for model accuracy and largely ignore other important factors imposed by the underlying hardware and devices, such as latency and energy, when making inference. In this paper, we first introduce the problem of NAS and provide a survey of recent works. Then we take a deep dive into two recent advancements on extending NAS into multiple-objective frameworks: MONAS and DPP-Net. Both MONAS and DPP-Net are capable of optimizing accuracy and other objectives imposed by devices, searching for neural architectures that can be best deployed on a wide spectrum of devices: from embedded systems and mobile devices to workstations. Experimental results are poised to show that architectures found by MONAS and DPP-Net achieve Pareto optimality w.r.t. the given objectives for various devices.
Submitted 29 August, 2018; v1 submitted 29 August, 2018;
originally announced August 2018.
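The Pareto-optimality criterion mentioned in the abstract above can be illustrated with a minimal sketch. It assumes every objective is to be minimized (for example, error rate and latency); the function names and the sample numbers are hypothetical and are not taken from MONAS or DPP-Net.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in at least one.

    Assumption: lower is better for every objective.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (error rate, latency in ms) pairs for five candidate architectures.
candidates = [(0.10, 50), (0.08, 80), (0.12, 40), (0.11, 60), (0.08, 90)]
front = pareto_front(candidates)
```

Here (0.11, 60) is dominated by (0.10, 50) and (0.08, 90) by (0.08, 80), so only the remaining three candidates are kept as accuracy/latency trade-offs.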
-
A New ECOC Algorithm for Multiclass Microarray Data Classification
Authors:
Mengxin Sun,
Kunhong Liu,
Qingqi Hong,
Beizhan Wang
Abstract:
The classification of multi-class microarray datasets is a hard task because of the small sample size in each class and the heavy overlaps among classes. To effectively solve these problems, we propose a novel Error Correcting Output Code (ECOC) algorithm, named ECOCECS, which enhances class separability by using Data Complexity (DC) measures during the encoding process. In this algorithm, two nearest-neighbor-related DC measures are deployed to extract the intrinsic overlapping information from microarray data. Our ECOC algorithm aims to search for an optimal class split scheme by minimizing these measures. The class splitting process ends when each class is separated from the others, and then the class assignment scheme is mapped to a coding matrix. Experiments are carried out on five microarray datasets, and results demonstrate the effectiveness and robustness of our method in comparison with six state-of-the-art ECOC methods. In short, our work confirms the possibility of applying DC to the ECOC framework.
Submitted 21 June, 2018;
originally announced July 2018.
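For readers unfamiliar with the general ECOC framework the abstract above builds on, the following minimal sketch shows how a coding matrix is used at prediction time with standard Hamming decoding. The coding matrix and class names are hypothetical; the sketch does not reproduce the ECOCECS encoding procedure, which the abstract only summarizes.

```python
def hamming(a, b):
    """Number of positions at which two codewords differ."""
    return sum(x != y for x, y in zip(a, b))


def ecoc_decode(coding_matrix, outputs):
    """Return the class whose codeword is nearest to the binary classifier outputs."""
    return min(coding_matrix, key=lambda cls: hamming(coding_matrix[cls], outputs))


# Hypothetical coding matrix: one row (codeword) per class, one column per
# binary classifier. Any two rows differ in at least three positions.
CODING_MATRIX = {
    "A": (+1, +1, +1, +1, +1, +1),
    "B": (+1, +1, +1, -1, -1, -1),
    "C": (-1, -1, +1, +1, +1, -1),
}

# The last binary classifier is wrong, yet the sample is still decoded as "B".
predicted = ecoc_decode(CODING_MATRIX, (+1, +1, +1, -1, -1, +1))
```

Because the rows are at least three positions apart, a single erroneous binary classifier cannot change the decoded class, which is the error-correcting property the framework is named after.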
-
Subspace Network: Deep Multi-Task Censored Regression for Modeling Neurodegenerative Diseases
Authors:
Mengying Sun,
Inci M. Baytas,
Liang Zhan,
Zhangyang Wang,
Jiayu Zhou
Abstract:
Over the past decade, a wide spectrum of machine learning models has been developed to model neurodegenerative diseases, associating biomarkers, especially non-intrusive neuroimaging markers, with key clinical scores measuring the cognitive status of patients. Multi-task learning (MTL) has been commonly utilized by these studies to address high dimensionality and small cohort size challenges. However, most existing MTL approaches are based on linear models and suffer from two major limitations: 1) they cannot explicitly consider upper/lower bounds in these clinical scores; 2) they lack the capability to capture complicated non-linear interactions among the variables. In this paper, we propose Subspace Network, an efficient deep modeling approach for non-linear multi-task censored regression. Each layer of the subspace network performs a multi-task censored regression to improve upon the predictions from the last layer via sketching a low-dimensional subspace to perform knowledge transfer among learning tasks. Under mild assumptions, for each layer the parametric subspace can be recovered using only one pass of training data. Empirical results demonstrate that the proposed subspace network quickly picks up the correct parameter subspaces, and outperforms state-of-the-art methods in predicting neurodegenerative clinical scores using information in brain imaging.
Submitted 28 February, 2018; v1 submitted 18 February, 2018;
originally announced February 2018.
-
Self-paced Convolutional Neural Network for Computer Aided Detection in Medical Imaging Analysis
Authors:
Xiang Li,
Aoxiao Zhong,
Ming Lin,
Ning Guo,
Mu Sun,
Arkadiusz Sitek,
Jieping Ye,
James Thrall,
Quanzheng Li
Abstract:
Tissue characterization has long been an important component of Computer Aided Diagnosis (CAD) systems for automatic lesion detection and further clinical planning. Motivated by the superior performance of deep learning methods on various computer vision problems, there has been increasing work applying deep learning to medical image analysis. However, the development of a robust and reliable deep learning model for computer-aided diagnosis is still highly challenging due to the combination of the high heterogeneity in the medical images and the relative lack of training samples. Specifically, annotation and labeling of medical images is much more expensive and time-consuming than in other applications and often involves manual labor from multiple domain experts. In this work, we propose a multi-stage, self-paced learning framework utilizing a convolutional neural network (CNN) to classify Computed Tomography (CT) image patches. The key contribution of this approach is that we augment the size of the training set by refining the unlabeled instances with a self-paced learning CNN. By implementing the framework on high performance computing servers including the NVIDIA DGX1 machine, we obtained experimental results showing that the self-paced boosted network consistently outperformed the original network even with very scarce manual labels. The performance gain indicates that applications with limited training samples such as medical image analysis can benefit from using the proposed framework.
Submitted 19 July, 2017;
originally announced July 2017.
-
Max-Pooling Loss Training of Long Short-Term Memory Networks for Small-Footprint Keyword Spotting
Authors:
Ming Sun,
Anirudh Raju,
George Tucker,
Sankaran Panchapagesan,
Gengshen Fu,
Arindam Mandal,
Spyros Matsoukas,
Nikko Strom,
Shiv Vitaladevuni
Abstract:
We propose a max-pooling based loss function for training Long Short-Term Memory (LSTM) networks for small-footprint keyword spotting (KWS), with low CPU, memory, and latency requirements. The max-pooling loss training can be further guided by initializing with a cross-entropy loss trained network. A posterior smoothing based evaluation approach is employed to measure keyword spotting performance. Our experimental results show that LSTM models trained using cross-entropy loss or max-pooling loss outperform a cross-entropy loss trained baseline feed-forward Deep Neural Network (DNN). In addition, max-pooling loss trained LSTM with randomly initialized network performs better compared to cross-entropy loss trained LSTM. Finally, the max-pooling loss trained LSTM initialized with a cross-entropy pre-trained network shows the best performance, which yields $67.6\%$ relative reduction compared to baseline feed-forward DNN in Area Under the Curve (AUC) measure.
Submitted 5 May, 2017;
originally announced May 2017.
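Two ideas named in the abstract above, a max-pooling based loss and posterior smoothing, can be sketched roughly as follows. This is an illustrative reading and not the authors' implementation: it assumes the loss for a keyword segment is the negative log of the highest per-frame keyword posterior, and that smoothing is a trailing moving average. The function names and the window size are hypothetical.

```python
import math


def max_pooling_loss(keyword_posteriors):
    """Loss for one keyword segment, taken only at the most confident frame.

    Assumption: each entry is the network's posterior for the target keyword
    at one frame, and only the maximum over time contributes to the loss.
    """
    best = max(keyword_posteriors)
    return -math.log(best)


def smooth_posteriors(posteriors, window):
    """Trailing moving average over the last `window` frames (fewer at the start)."""
    smoothed = []
    for i in range(len(posteriors)):
        start = max(0, i - window + 1)
        chunk = posteriors[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


loss = max_pooling_loss([0.1, 0.8, 0.5])            # uses only the 0.8 frame
smoothed = smooth_posteriors([0.0, 1.0, 1.0], window=2)
```

A detection threshold would then be applied to the smoothed sequence rather than to the raw per-frame posteriors, which makes single-frame spikes less likely to trigger a false alarm.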
-
Tactics of Adversarial Attack on Deep Reinforcement Learning Agents
Authors:
Yen-Chen Lin,
Zhang-Wei Hong,
Yuan-Hong Liao,
Meng-Li Shih,
Ming-Yu Liu,
Min Sun
Abstract:
We introduce two tactics to attack agents trained by deep reinforcement learning algorithms using adversarial examples, namely the strategically-timed attack and the enchanting attack. In the strategically-timed attack, the adversary aims at minimizing the agent's reward by only attacking the agent at a small subset of time steps in an episode. Limiting the attack activity to this subset helps prevent detection of the attack by the agent. We propose a novel method to determine when an adversarial example should be crafted and applied. In the enchanting attack, the adversary aims at luring the agent to a designated target state. This is achieved by combining a generative model and a planning algorithm: while the generative model predicts the future states, the planning algorithm generates a preferred sequence of actions for luring the agent. A sequence of adversarial examples is then crafted to lure the agent to take the preferred sequence of actions. We apply the two tactics to agents trained by state-of-the-art deep reinforcement learning algorithms, including DQN and A3C. In 5 Atari games, our strategically-timed attack reduces as much reward as the uniform attack (i.e., attacking at every time step) does by attacking the agent 4 times less often. Our enchanting attack lures the agent toward designated target states with a more than 70% success rate. Videos are available at http://yenchenlin.me/adversarial_attack_RL/
Submitted 12 November, 2019; v1 submitted 7 March, 2017;
originally announced March 2017.
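The idea of attacking only at a small subset of time steps, as described in the abstract above, can be illustrated with a minimal sketch. The abstract does not spell out the timing criterion, so the rule used here is an assumption made for illustration: attack when the gap between the agent's most and least preferred action probabilities exceeds a threshold. All names and numbers are hypothetical.

```python
def should_attack(action_probs, threshold):
    """Assumed timing rule: attack when the agent strongly prefers one action."""
    return max(action_probs) - min(action_probs) > threshold


def attack_steps(policy_trace, threshold):
    """Indices of the time steps at which an adversarial example would be applied."""
    return [t for t, probs in enumerate(policy_trace) if should_attack(probs, threshold)]


# Hypothetical per-step action distributions of an agent with two actions.
trace = [[0.5, 0.5], [0.9, 0.1], [0.6, 0.4], [0.05, 0.95]]
steps = attack_steps(trace, threshold=0.5)
```

With this trace only steps 1 and 3 are selected, so the adversary perturbs half of the observations instead of all of them; raising the threshold makes the attack rarer still.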
-
Generalized Canonical Correlation Analysis for Classification
Authors:
Cencheng Shen,
Ming Sun,
Minh Tang,
Carey E. Priebe
Abstract:
For multiple multivariate data sets, we derive conditions under which Generalized Canonical Correlation Analysis (GCCA) improves classification performance of the projected datasets, compared to standard Canonical Correlation Analysis (CCA) using only two data sets. We illustrate our theoretical results with simulations and a real data experiment.
Submitted 26 June, 2014; v1 submitted 30 April, 2013;
originally announced April 2013.
-
Generalized Canonical Correlation Analysis for Disparate Data Fusion
Authors:
Ming Sun,
Carey E. Priebe,
Minh Tang
Abstract:
Manifold matching works to identify embeddings of multiple disparate data spaces into the same low-dimensional space, where joint inference can be pursued. It is an enabling methodology for fusion and inference from multiple and massive disparate data sources. In this paper we focus on a method called Canonical Correlation Analysis (CCA) and its generalization Generalized Canonical Correlation Analysis (GCCA), which belong to the more general Reduced Rank Regression (RRR) framework. We present an efficiency investigation of CCA and GCCA under different training conditions for a particular text document classification task.
Submitted 17 September, 2012;
originally announced September 2012.
-
A Comparative Study of Collaborative Filtering Algorithms
Authors:
Joonseok Lee,
Mingxuan Sun,
Guy Lebanon
Abstract:
Collaborative filtering is a rapidly advancing research area. Every year several new techniques are proposed, and yet it is not clear which of the techniques work best and under what conditions. In this paper we conduct a study comparing several collaborative filtering techniques -- both classic and recent state-of-the-art -- in a variety of experimental contexts. Specifically, we report conclusions controlling for number of items, number of users, sparsity level, performance criteria, and computational complexity. Our conclusions identify which algorithms work well and under what conditions, and contribute both to the industrial deployment of collaborative filtering algorithms and to the research community.
Submitted 14 May, 2012;
originally announced May 2012.