Subobject-level Image Tokenization

Chen, Delong; Cahyawijaya, Samuel; Liu, Jianfeng; Wang, Baoyuan; Fung, Pascale

arXiv:2402.14327 (cs)

[Submitted on 22 Feb 2024 (v1), last revised 12 Mar 2025 (this version, v3)]

Abstract: Patch-based image tokenization ignores the morphology of the visual world, limiting effective and efficient learning of image understanding. Inspired by subword tokenization, we introduce subobject-level adaptive token segmentation and explore several approaches, including superpixel, SAM, and a proposed Efficient and PanOptiC (EPOC) image tokenizer. Our EPOC combines boundary detection -- a simple task that can be handled well by a compact model -- with watershed segmentation, which inherently guarantees no pixels are left unsegmented. Intrinsic evaluations across 5 datasets demonstrate that EPOC's segmentation aligns well with human annotations of both object- and part-level visual morphology, producing more monosemantic tokens and offering substantial efficiency advantages. For extrinsic evaluation, we designed a token embedding that handles arbitrary-shaped tokens, and trained VLMs with different tokenizers on 4 datasets of object recognition and detailed captioning. The results reveal that subobject tokenization enables faster convergence and better generalization while using fewer visual tokens.
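The claim that watershed segmentation leaves no pixel unsegmented can be illustrated with a minimal sketch. This is not the EPOC implementation: the boundary map below is a hand-written toy grid standing in for a boundary detector's output, the seed pixels are picked by hand, and `watershed_tokens` is a hypothetical helper name. It uses a simplified priority-flood variant of marker-based watershed in pure Python.

```python
import heapq


def watershed_tokens(boundary, markers):
    """Simplified marker-based watershed (priority flood) on a 2D grid.

    boundary: list of rows of floats; higher means "more likely a boundary".
    markers:  list of (row, col) seed pixels; seed i receives label i + 1.
    Returns a grid of integer labels. Because the flood keeps expanding
    through 4-connected neighbours until the queue is empty, every pixel
    ends up with a non-zero label as long as at least one seed is given.
    """
    h, w = len(boundary), len(boundary[0])
    labels = [[0] * w for _ in range(h)]
    heap = []
    for i, (r, c) in enumerate(markers):
        labels[r][c] = i + 1
        # (priority, insertion order, row, col); order breaks ties stably.
        heapq.heappush(heap, (boundary[r][c], i, r, c))
    order = len(markers)
    while heap:
        _, _, r, c = heapq.heappop(heap)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and labels[nr][nc] == 0:
                # Unlabelled neighbours inherit the label of the pixel
                # that reaches them first; low-boundary pixels are
                # expanded before high-boundary ones.
                labels[nr][nc] = labels[r][c]
                heapq.heappush(heap, (boundary[nr][nc], order, nr, nc))
                order += 1
    return labels


# Toy boundary map: a vertical ridge in column 3 separates two flat regions.
boundary = [[1.0 if col == 3 else 0.0 for col in range(7)] for _ in range(5)]
# Hand-picked seeds, one per region (an assumption of this sketch).
labels = watershed_tokens(boundary, markers=[(2, 1), (2, 5)])
```

In this toy case each flat region receives its own label, and the ridge pixels are absorbed into an adjacent region rather than being left out, which is the coverage property the abstract refers to.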

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as: arXiv:2402.14327 [cs.CV] (or arXiv:2402.14327v3 [cs.CV] for this version)
DOI: https://doi.org/10.48550/arXiv.2402.14327

Submission history

From: Delong Chen
[v1] Thu, 22 Feb 2024 06:47:44 UTC (1,209 KB)
[v2] Tue, 23 Apr 2024 13:41:47 UTC (2,837 KB)
[v3] Wed, 12 Mar 2025 18:22:25 UTC (5,149 KB)
