Max Pooling with Vision Transformers reconciles class and shape in weakly supervised semantic segmentation

Rossetti, Simone; Zappia, Damiano; Sanzari, Marta; Schaerf, Marco; Pirri, Fiora

arXiv:2210.17400 (cs)

[Submitted on 31 Oct 2022]

Authors: Simone Rossetti (1 and 2), Damiano Zappia (1), Marta Sanzari (2), Marco Schaerf (1 and 2), Fiora Pirri (1 and 2) ((1) DeepPlants, (2) DIAG Sapienza)

Abstract: Weakly Supervised Semantic Segmentation (WSSS) research has explored many directions to improve the typical pipeline of a CNN, class activation maps (CAM), and refinements, given the image-class label as the only supervision. Though the gap with fully supervised methods has narrowed, closing it further seems unlikely within this framework. On the other hand, WSSS methods based on Vision Transformers (ViT) have not yet explored valid alternatives to CAM. ViT features have been shown to retain scene layout and object boundaries in self-supervised learning. To confirm these findings, we prove that the advantages of transformers in self-supervised methods are further strengthened by Global Max Pooling (GMP), which can leverage patch features to negotiate pixel-label probability with class probability. This work proposes a new WSSS method dubbed ViT-PCM (ViT Patch-Class Mapping), not based on CAM. The presented end-to-end network learns, with a single optimization process, refined shape and proper localization for segmentation masks. Our model outperforms the state of the art on baseline pseudo-masks (BPM), where we achieve $69.3\%$ mIoU on the PascalVOC 2012 $val$ set. We show that our approach uses the fewest parameters while obtaining higher accuracy than all other approaches. In a sentence, quantitative and qualitative results of our method reveal that ViT-PCM is an excellent alternative to CNN-CAM based architectures.
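The role the abstract assigns to Global Max Pooling, linking per-patch label probabilities to image-level class probabilities, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the per-patch softmax, the binary cross-entropy loss, the random logits standing in for ViT patch outputs, and all function and variable names are assumptions made only for illustration.

```python
import numpy as np

def patch_softmax(patch_logits):
    """Softmax over the class axis for each patch: (N, C) -> (N, C)."""
    shifted = patch_logits - patch_logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def global_max_pool(patch_probs):
    """Per-class maximum over all patches: (N, C) -> (C,)."""
    return patch_probs.max(axis=0)

def multilabel_bce(class_probs, labels, eps=1e-7):
    """Binary cross-entropy between pooled class scores and image-level labels."""
    p = np.clip(class_probs, eps, 1.0 - eps)
    return float(-np.mean(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p)))

# Hypothetical setup: 196 patches (a 14x14 grid) and 21 classes.
rng = np.random.default_rng(0)
num_patches, num_classes = 196, 21
patch_logits = rng.normal(size=(num_patches, num_classes))  # stand-in for ViT patch outputs

# Image-level supervision only: two classes are marked as present.
labels = np.zeros(num_classes)
labels[[0, 15]] = 1.0

patch_probs = patch_softmax(patch_logits)    # patch-level label probabilities
class_probs = global_max_pool(patch_probs)   # image-level class probabilities via GMP
loss = multilabel_bce(class_probs, labels)   # trained using image labels alone

# At inference, a coarse patch-level mask is the argmax class of each patch.
patch_labels = patch_probs.argmax(axis=1).reshape(14, 14)
```

In this sketch the image-level score for each class is determined by its single most confident patch, so a loss computed only from image labels still produces a gradient at the patch level, which is the sense in which pooling ties the two probabilities together.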

Comments:	28 pages, 9 images, ECCV 2022 conference
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2210.17400 [cs.CV]
	(or arXiv:2210.17400v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2210.17400

Submission history

From: Simone Rossetti
[v1] Mon, 31 Oct 2022 15:32:23 UTC (32,762 KB)
