Search | arXiv e-print repository

Adversarially Pretrained Transformers may be Universally Robust In-Context Learners

Authors: Soichiro Kumano, Hiroshi Kera, Toshihiko Yamasaki

Abstract: Adversarial training is one of the most effective adversarial defenses, but it incurs a high computational cost. In this study, we show that transformers adversarially pretrained on diverse tasks can serve as robust foundation models and eliminate the need for adversarial training in downstream tasks. Specifically, we theoretically demonstrate that through in-context learning, a single adversarial… ▽ More Adversarial training is one of the most effective adversarial defenses, but it incurs a high computational cost. In this study, we show that transformers adversarially pretrained on diverse tasks can serve as robust foundation models and eliminate the need for adversarial training in downstream tasks. Specifically, we theoretically demonstrate that through in-context learning, a single adversarially pretrained transformer can robustly generalize to multiple unseen tasks without any additional training, i.e., without any parameter updates. This robustness stems from the model's focus on robust features and its resistance to attacks that exploit non-predictive features. Besides these positive findings, we also identify several limitations. Under certain conditions (though unrealistic), no universally robust single-layer transformers exist. Moreover, robust transformers exhibit an accuracy--robustness trade-off and require a large number of in-context demonstrations. The code is available at https://github.com/s-kumano/universally-robust-in-context-learner. △ Less

Submitted 20 May, 2025; originally announced May 2025.

arXiv:2505.14021 [pdf, ps, other]

Adversarial Training from Mean Field Perspective

Authors: Soichiro Kumano, Hiroshi Kera, Toshihiko Yamasaki

Abstract: Although adversarial training is known to be effective against adversarial examples, training dynamics are not well understood. In this study, we present the first theoretical analysis of adversarial training in random deep neural networks without any assumptions on data distributions. We introduce a new theoretical framework based on mean field theory, which addresses the limitations of existing… ▽ More Although adversarial training is known to be effective against adversarial examples, training dynamics are not well understood. In this study, we present the first theoretical analysis of adversarial training in random deep neural networks without any assumptions on data distributions. We introduce a new theoretical framework based on mean field theory, which addresses the limitations of existing mean field-based approaches. Based on this framework, we derive (empirically tight) upper bounds of $\ell_q$ norm-based adversarial loss with $\ell_p$ norm-based adversarial examples for various values of $p$ and $q$. Moreover, we prove that networks without shortcuts are generally not adversarially trainable and that adversarial training reduces network capacity. We also show that network width alleviates these issues. Furthermore, we present the various impacts of the input and output dimensions on the upper bounds and time evolution of the weight variance. △ Less

Submitted 20 May, 2025; originally announced May 2025.

Comments: NeurIPS23

arXiv:2410.23677 [pdf, other]

Wide Two-Layer Networks can Learn from Adversarial Perturbations

Authors: Soichiro Kumano, Hiroshi Kera, Toshihiko Yamasaki

Abstract: Adversarial examples have raised several open questions, such as why they can deceive classifiers and transfer between different models. A prevailing hypothesis to explain these phenomena suggests that adversarial perturbations appear as random noise but contain class-specific features. This hypothesis is supported by the success of perturbation learning, where classifiers trained solely on advers… ▽ More Adversarial examples have raised several open questions, such as why they can deceive classifiers and transfer between different models. A prevailing hypothesis to explain these phenomena suggests that adversarial perturbations appear as random noise but contain class-specific features. This hypothesis is supported by the success of perturbation learning, where classifiers trained solely on adversarial examples and the corresponding incorrect labels generalize well to correctly labeled test data. Although this hypothesis and perturbation learning are effective in explaining intriguing properties of adversarial examples, their solid theoretical foundation is limited. In this study, we theoretically explain the counterintuitive success of perturbation learning. We assume wide two-layer networks and the results hold for any data distribution. We prove that adversarial perturbations contain sufficient class-specific features for networks to generalize from them. Moreover, the predictions of classifiers trained on mislabeled adversarial examples coincide with those of classifiers trained on correctly labeled clean samples. The code is available at https://github.com/s-kumano/perturbation-learning. △ Less

Submitted 17 January, 2025; v1 submitted 31 October, 2024; originally announced October 2024.

Comments: NeurIPS24

arXiv:2406.02889 [pdf, other]

Language-guided Detection and Mitigation of Unknown Dataset Bias

Authors: Zaiying Zhao, Soichiro Kumano, Toshihiko Yamasaki

Abstract: Dataset bias is a significant problem in training fair classifiers. When attributes unrelated to classification exhibit strong biases towards certain classes, classifiers trained on such dataset may overfit to these bias attributes, substantially reducing the accuracy for minority groups. Mitigation techniques can be categorized according to the availability of bias information (\ie, prior knowled… ▽ More Dataset bias is a significant problem in training fair classifiers. When attributes unrelated to classification exhibit strong biases towards certain classes, classifiers trained on such dataset may overfit to these bias attributes, substantially reducing the accuracy for minority groups. Mitigation techniques can be categorized according to the availability of bias information (\ie, prior knowledge). Although scenarios with unknown biases are better suited for real-world settings, previous work in this field often suffers from a lack of interpretability regarding biases and lower performance. In this study, we propose a framework to identify potential biases as keywords without prior knowledge based on the partial occurrence in the captions. We further propose two debiasing methods: (a) handing over to an existing debiasing approach which requires prior knowledge by assigning pseudo-labels, and (b) employing data augmentation via text-to-image generative models, using acquired bias keywords as prompts. Despite its simplicity, experimental results show that our framework not only outperforms existing methods without prior knowledge, but also is even comparable with a method that assumes prior knowledge. △ Less

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2402.10470 [pdf, other]

Theoretical Understanding of Learning from Adversarial Perturbations

Authors: Soichiro Kumano, Hiroshi Kera, Toshihiko Yamasaki

Abstract: It is not fully understood why adversarial examples can deceive neural networks and transfer between different networks. To elucidate this, several studies have hypothesized that adversarial perturbations, while appearing as noises, contain class features. This is supported by empirical evidence showing that networks trained on mislabeled adversarial examples can still generalize well to correctly… ▽ More It is not fully understood why adversarial examples can deceive neural networks and transfer between different networks. To elucidate this, several studies have hypothesized that adversarial perturbations, while appearing as noises, contain class features. This is supported by empirical evidence showing that networks trained on mislabeled adversarial examples can still generalize well to correctly labeled test samples. However, a theoretical understanding of how perturbations include class features and contribute to generalization is limited. In this study, we provide a theoretical framework for understanding learning from perturbations using a one-hidden-layer network trained on mutually orthogonal samples. Our results highlight that various adversarial perturbations, even perturbations of a few pixels, contain sufficient class features for generalization. Moreover, we reveal that the decision boundary when learning from perturbations matches that from standard samples except for specific regions under mild conditions. The code is available at https://github.com/s-kumano/learning-from-adversarial-perturbations. △ Less

Submitted 16 February, 2024; originally announced February 2024.

Comments: ICLR24

arXiv:2402.02150 [pdf, other]

Data-Driven Prediction of Seismic Intensity Distributions Featuring Hybrid Classification-Regression Models

Authors: Koyu Mizutani, Haruki Mitarai, Kakeru Miyazaki, Soichiro Kumano, Toshihiko Yamasaki

Abstract: Earthquakes are among the most immediate and deadly natural disasters that humans face. Accurately forecasting the extent of earthquake damage and assessing potential risks can be instrumental in saving numerous lives. In this study, we developed linear regression models capable of predicting seismic intensity distributions based on earthquake parameters: location, depth, and magnitude. Because it… ▽ More Earthquakes are among the most immediate and deadly natural disasters that humans face. Accurately forecasting the extent of earthquake damage and assessing potential risks can be instrumental in saving numerous lives. In this study, we developed linear regression models capable of predicting seismic intensity distributions based on earthquake parameters: location, depth, and magnitude. Because it is completely data-driven, it can predict intensity distributions without geographical information. The dataset comprises seismic intensity data from earthquakes that occurred in the vicinity of Japan between 1997 and 2020, specifically containing 1,857 instances of earthquakes with a magnitude of 5.0 or greater, sourced from the Japan Meteorological Agency. We trained both regression and classification models and combined them to take advantage of both to create a hybrid model. The proposed model outperformed commonly used Ground Motion Prediction Equations (GMPEs) in terms of the correlation coefficient, F1 score, and MCC. Furthermore, the proposed model can predict even abnormal seismic intensity distributions, a task at conventional GMPEs often struggle. △ Less

Submitted 3 February, 2024; originally announced February 2024.

arXiv:2209.02369 [pdf, other]

Improving Robustness to Out-of-Distribution Data by Frequency-based Augmentation

Authors: Koki Mukai, Soichiro Kumano, Toshihiko Yamasaki

Abstract: Although Convolutional Neural Networks (CNNs) have high accuracy in image recognition, they are vulnerable to adversarial examples and out-of-distribution data, and the difference from human recognition has been pointed out. In order to improve the robustness against out-of-distribution data, we present a frequency-based data augmentation technique that replaces the frequency components with other… ▽ More Although Convolutional Neural Networks (CNNs) have high accuracy in image recognition, they are vulnerable to adversarial examples and out-of-distribution data, and the difference from human recognition has been pointed out. In order to improve the robustness against out-of-distribution data, we present a frequency-based data augmentation technique that replaces the frequency components with other images of the same class. When the training data are CIFAR10 and the out-of-distribution data are SVHN, the Area Under Receiver Operating Characteristic (AUROC) curve of the model trained with the proposed method increases from 89.22\% to 98.15\%, and further increased to 98.59\% when combined with another data augmentation method. Furthermore, we experimentally demonstrate that the robust model for out-of-distribution data uses a lot of high-frequency components of the image. △ Less

Submitted 6 September, 2022; originally announced September 2022.

Comments: ICIP 2022

arXiv:2208.07565 [pdf, other]

Prediction of Seismic Intensity Distributions Using Neural Networks

Authors: Koyu Mizutani, Haruki Mitarai, Kakeru Miyazaki, Ryugo Shimamura, Soichiro Kumano, Toshihiko Yamasaki

Abstract: The ground motion prediction equation is commonly used to predict the seismic intensity distribution. However, it is not easy to apply this method to seismic distributions affected by underground plate structures, which are commonly known as abnormal seismic distributions. This study proposes a hybrid of regression and classification approaches using neural networks. The proposed model treats the… ▽ More The ground motion prediction equation is commonly used to predict the seismic intensity distribution. However, it is not easy to apply this method to seismic distributions affected by underground plate structures, which are commonly known as abnormal seismic distributions. This study proposes a hybrid of regression and classification approaches using neural networks. The proposed model treats the distributions as 2-dimensional data like an image. Our method can accurately predict seismic intensity distributions, even abnormal distributions. △ Less

Submitted 16 August, 2022; originally announced August 2022.

Comments: 2 pages, 2 figures, IEEE GCCE2022 accepted

arXiv:2205.14629 [pdf, other]

Superclass Adversarial Attack

Authors: Soichiro Kumano, Hiroshi Kera, Toshihiko Yamasaki

Abstract: Adversarial attacks have only focused on changing the predictions of the classifier, but their danger greatly depends on how the class is mistaken. For example, when an automatic driving system mistakes a Persian cat for a Siamese cat, it is hardly a problem. However, if it mistakes a cat for a 120km/h minimum speed sign, serious problems can arise. As a stepping stone to more threatening adversar… ▽ More Adversarial attacks have only focused on changing the predictions of the classifier, but their danger greatly depends on how the class is mistaken. For example, when an automatic driving system mistakes a Persian cat for a Siamese cat, it is hardly a problem. However, if it mistakes a cat for a 120km/h minimum speed sign, serious problems can arise. As a stepping stone to more threatening adversarial attacks, we consider the superclass adversarial attack, which causes misclassification of not only fine classes, but also superclasses. We conducted the first comprehensive analysis of superclass adversarial attacks (an existing and 19 new methods) in terms of accuracy, speed, and stability, and identified several strategies to achieve better performance. Although this study is aimed at superclass misclassification, the findings can be applied to other problem settings involving multiple classes, such as top-k and multi-label classification attacks. △ Less

Submitted 14 July, 2022; v1 submitted 29 May, 2022; originally announced May 2022.

Comments: ICML Workshop 2022 on Adversarial Machine Learning Frontiers

arXiv:2012.03843 [pdf, other]

Are DNNs fooled by extremely unrecognizable images?

Authors: Soichiro Kumano, Hiroshi Kera, Toshihiko Yamasaki

Abstract: Fooling images are a potential threat to deep neural networks (DNNs). These images are not recognizable to humans as natural objects, such as dogs and cats, but are misclassified by DNNs as natural-object classes with high confidence scores. Despite their original design concept, existing fooling images retain some features that are characteristic of the target objects if looked into closely. Henc… ▽ More Fooling images are a potential threat to deep neural networks (DNNs). These images are not recognizable to humans as natural objects, such as dogs and cats, but are misclassified by DNNs as natural-object classes with high confidence scores. Despite their original design concept, existing fooling images retain some features that are characteristic of the target objects if looked into closely. Hence, DNNs can react to these features. In this paper, we address the question of whether there can be fooling images with no characteristic pattern of natural objects locally or globally. As a minimal case, we introduce single-color images with a few pixels altered, called sparse fooling images (SFIs). We first prove that SFIs always exist under mild conditions for linear and nonlinear models and reveal that complex models are more likely to be vulnerable to SFI attacks. With two SFI generation methods, we demonstrate that in deeper layers, SFIs end up with similar features to those of natural images, and consequently, fool DNNs successfully. Among other layers, we discovered that the max pooling layer causes the vulnerability against SFIs. The defense against SFIs and transferability are also discussed. This study highlights the new vulnerability of DNNs by introducing a novel class of images that distributes extremely far from natural images. △ Less

Submitted 26 March, 2022; v1 submitted 7 December, 2020; originally announced December 2020.

Showing 1–10 of 10 results for author: Kumano, S