Pre-training Vision Transformers with Very Limited Synthesized Images

Nakamura, Ryo; Kataoka, Hirokatsu; Takashima, Sora; Noriega, Edgar Josafat Martinez; Yokota, Rio; Inoue, Nakamasa


arXiv:2307.14710 (cs)

[Submitted on 27 Jul 2023 (v1), last revised 31 Jul 2023 (this version, v2)]

Abstract: Formula-driven supervised learning (FDSL) is a pre-training method that relies on synthetic images generated from mathematical formulae such as fractals. Prior work on FDSL has shown that pre-training vision transformers on such synthetic datasets can yield competitive accuracy on a wide range of downstream tasks. These synthetic images are categorized according to the parameters of the mathematical formula that generates them. In the present work, we hypothesize that the process for generating different instances of the same category in FDSL can be viewed as a form of data augmentation. We validate this hypothesis by replacing the instances with data augmentation, which means we need only a single image per category. Our experiments show that this one-instance fractal database (OFDB) performs better than the original dataset, in which instances were explicitly generated. We further scale up OFDB to 21,000 categories and show that it matches, or even surpasses, the model pre-trained on ImageNet-21k in ImageNet-1k fine-tuning. OFDB contains 21k images, whereas ImageNet-21k has 14M. This opens new possibilities for pre-training vision transformers with much smaller datasets.
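The idea of keeping one image per category and relying on augmentation for within-category variety can be illustrated with a small sketch. This is not the authors' pipeline: the abstract does not specify the fractal renderer or the augmentations, so the iterated-function-system renderer, the flip-and-crop augmentation, the parameter ranges, and every function name below (`render_ifs`, `augment`, `build_one_instance_dataset`, `sample_batch`) are assumptions made purely for illustration.

```python
import numpy as np


def render_ifs(params, n_points=5000, size=64, seed=0):
    """Render one binary image from an iterated function system (IFS).

    params has shape (k, 6); each row (a, b, c, d, e, f) is the affine map
    (x, y) -> (a*x + b*y + e, c*x + d*y + f). In this sketch the parameter
    set plays the role of the category definition.
    """
    rng = np.random.default_rng(seed)
    pts = np.zeros((n_points, 2))
    x, y = 0.0, 0.0
    for i in range(n_points):
        a, b, c, d, e, f = params[rng.integers(len(params))]
        x, y = a * x + b * y + e, c * x + d * y + f
        pts[i] = (x, y)
    # Map the point cloud onto the pixel grid.
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    ij = ((pts - lo) / (hi - lo + 1e-9) * (size - 1)).astype(int)
    img = np.zeros((size, size), dtype=np.float32)
    img[ij[:, 1], ij[:, 0]] = 1.0
    return img


def augment(img, rng, pad=4):
    """Assumed augmentation: random horizontal flip plus random padded crop."""
    if rng.random() < 0.5:
        img = img[:, ::-1]
    padded = np.pad(img, pad)
    dy, dx = rng.integers(0, 2 * pad + 1, size=2)
    h, w = img.shape
    return padded[dy:dy + h, dx:dx + w]


def build_one_instance_dataset(n_categories, n_maps=3, seed=0):
    """Store exactly one rendered image per category."""
    rng = np.random.default_rng(seed)
    images = []
    for _ in range(n_categories):
        # Entries in [-0.45, 0.45] keep each linear part contractive
        # (Frobenius norm at most 0.9), so the iterated points stay bounded.
        linear = rng.uniform(-0.45, 0.45, size=(n_maps, 4))
        shift = rng.uniform(-1.0, 1.0, size=(n_maps, 2))
        images.append(render_ifs(np.column_stack([linear, shift])))
    return np.stack(images)


def sample_batch(images, batch_size, rng):
    """Draw labels, then create per-sample variety by augmentation only."""
    labels = rng.integers(len(images), size=batch_size)
    batch = np.stack([augment(images[y], rng) for y in labels])
    return batch, labels


images = build_one_instance_dataset(n_categories=5)
batch, labels = sample_batch(images, batch_size=8, rng=np.random.default_rng(1))
```

In this sketch the stored dataset has as many images as categories, and every training batch is produced by augmenting those stored images on the fly, which mirrors the one-image-per-category setup described in the abstract.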

Comments:	Accepted to ICCV 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2307.14710 [cs.CV]
	(or arXiv:2307.14710v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2307.14710

Submission history

From: Ryo Nakamura
[v1] Thu, 27 Jul 2023 08:58:39 UTC (5,527 KB)
[v2] Mon, 31 Jul 2023 01:06:05 UTC (5,535 KB)
