A Simple and Generalist Approach for Panoptic Segmentation

Authors: Nedyalko Prisadnikov, Wouter Van Gansbeke, Danda Pani Paudel, Luc Van Gool

Abstract: Panoptic segmentation is an important computer vision task, where the current state-of-the-art solutions require specialized components to perform well. We propose a simple generalist framework based on a deep encoder - shallow decoder architecture with per-pixel prediction. Essentially fine-tuning a massively pretrained image model with minimal additional components. Naively this method does not yield good results. We show that this is due to imbalance during training and propose a novel method for reducing it - centroid regression in the space of spectral positional embeddings. Our method achieves panoptic quality (PQ) of 55.1 on the challenging MS-COCO dataset, state-of-the-art performance among generalist methods.

Submitted 7 March, 2025; v1 submitted 29 August, 2024; originally announced August 2024.

arXiv:2404.05519 [pdf, other]

Investigating the Effectiveness of Cross-Attention to Unlock Zero-Shot Editing of Text-to-Video Diffusion Models

Authors: Saman Motamed, Wouter Van Gansbeke, Luc Van Gool

Abstract: With recent advances in image and video diffusion models for content creation, a plethora of techniques have been proposed for customizing their generated content. In particular, manipulating the cross-attention layers of Text-to-Image (T2I) diffusion models has shown great promise in controlling the shape and location of objects in the scene. Transferring image-editing techniques to the video domain, however, is extremely challenging as object motion and temporal consistency are difficult to capture accurately. In this work, we take a first look at the role of cross-attention in Text-to-Video (T2V) diffusion models for zero-shot video editing. While one-shot models have shown potential in controlling motion and camera movement, we demonstrate zero-shot control over object shape, position and movement in T2V models. We show that despite the limitations of current T2V models, cross-attention guidance can be a promising approach for editing videos.

Submitted 8 April, 2024; originally announced April 2024.

Comments: Generative Models for Computer Vision CVPR 2024 Workshop

arXiv:2401.10227 [pdf, other]

A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting

Authors: Wouter Van Gansbeke, Bert De Brabandere

Abstract: Panoptic and instance segmentation networks are often trained with specialized object detection modules, complex loss functions, and ad-hoc post-processing steps to manage the permutation-invariance of the instance masks. This work builds upon Stable Diffusion and proposes a latent diffusion approach for panoptic segmentation, resulting in a simple architecture that omits these complexities. Our training consists of two steps: (1) training a shallow autoencoder to project the segmentation masks to latent space; (2) training a diffusion model to allow image-conditioned sampling in latent space. This generative approach unlocks the exploration of mask completion or inpainting. The experimental validation on COCO and ADE20k yields strong segmentation results. Finally, we demonstrate our model's adaptability to multi-tasking by introducing learnable task embeddings.

Submitted 16 July, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

Comments: Accepted at ECCV 2024, Code: https://github.com/segments-ai/latent-diffusion-segmentation

arXiv:2206.06363 [pdf, other]

Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation

Authors: Wouter Van Gansbeke, Simon Vandenhende, Luc Van Gool

Abstract: The task of unsupervised semantic segmentation aims to cluster pixels into semantically meaningful groups. Specifically, pixels assigned to the same cluster should share high-level semantic properties like their object or part category. This paper presents MaskDistill: a novel framework for unsupervised semantic segmentation based on three key ideas. First, we advocate a data-driven strategy to generate object masks that serve as a pixel grouping prior for semantic segmentation. This approach omits handcrafted priors, which are often designed for specific scene compositions and limit the applicability of competing frameworks. Second, MaskDistill clusters the object masks to obtain pseudo-ground-truth for training an initial object segmentation model. Third, we leverage this model to filter out low-quality object masks. This strategy mitigates the noise in our pixel grouping prior and results in a clean collection of masks which we use to train a final segmentation model. By combining these components, we can considerably outperform previous works for unsupervised semantic segmentation on PASCAL (+11% mIoU) and COCO (+4% mask AP50). Interestingly, as opposed to existing approaches, our framework does not latch onto low-level image cues and is not limited to object-centric datasets. The code and models will be made available.

Submitted 13 June, 2022; originally announced June 2022.

Comments: Code: https://github.com/wvangansbeke/MaskDistill

arXiv:2106.05967 [pdf, other]

Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations

Authors: Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, Luc Van Gool

Abstract: Contrastive self-supervised learning has outperformed supervised pretraining on many downstream tasks like segmentation and object detection. However, current methods are still primarily applied to curated datasets like ImageNet. In this paper, we first study how biases in the dataset affect existing methods. Our results show that current contrastive approaches work surprisingly well across: (i) object- versus scene-centric, (ii) uniform versus long-tailed and (iii) general versus domain-specific datasets. Second, given the generality of the approach, we try to realize further gains with minor modifications. We show that learning additional invariances -- through the use of multi-scale cropping, stronger augmentations and nearest neighbors -- improves the representations. Finally, we observe that MoCo learns spatially structured representations when trained with a multi-crop strategy. The representations can be used for semantic segment retrieval and video instance segmentation without finetuning. Moreover, the results are on par with specialized models. We hope this work will serve as a useful study for other researchers. The code and models are available at https://github.com/wvangansbeke/Revisiting-Contrastive-SSL.

Submitted 14 December, 2021; v1 submitted 10 June, 2021; originally announced June 2021.

Comments: NeurIPS 2021. Code: https://github.com/wvangansbeke/Revisiting-Contrastive-SSL

arXiv:2102.06191 [pdf, other]

Unsupervised Semantic Segmentation by Contrasting Object Mask Proposals

Authors: Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, Luc Van Gool

Abstract: Being able to learn dense semantic representations of images without supervision is an important problem in computer vision. However, despite its significance, this problem remains rather unexplored, with a few exceptions that considered unsupervised semantic segmentation on small-scale datasets with a narrow visual domain. In this paper, we make a first attempt to tackle the problem on datasets that have been traditionally utilized for the supervised case. To achieve this, we introduce a two-step framework that adopts a predetermined mid-level prior in a contrastive optimization objective to learn pixel embeddings. This marks a large deviation from existing works that relied on proxy tasks or end-to-end clustering. Additionally, we argue about the importance of having a prior that contains information about objects, or their parts, and discuss several possibilities to obtain such a prior in an unsupervised manner. Experimental evaluation shows that our method comes with key advantages over existing works. First, the learned pixel embeddings can be directly clustered in semantic groups using K-Means on PASCAL. Under the fully unsupervised setting, there is no precedent in solving the semantic segmentation task on such a challenging benchmark. Second, our representations can improve over strong baselines when transferred to new datasets, e.g. COCO and DAVIS. The code is available.

Submitted 3 August, 2021; v1 submitted 11 February, 2021; originally announced February 2021.

Comments: ICCV 2021 - Code: https://github.com/wvangansbeke/Unsupervised-Semantic-Segmentation

arXiv:2005.12320 [pdf, other]

SCAN: Learning to Classify Images without Labels

Authors: Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, Marc Proesmans, Luc Van Gool

Abstract: Can we automatically group images into semantically meaningful clusters when ground-truth annotations are absent? The task of unsupervised image classification remains an important, and open challenge in computer vision. Several recent approaches have tried to tackle this problem in an end-to-end fashion. In this paper, we deviate from recent works, and advocate a two-step approach where feature learning and clustering are decoupled. First, a self-supervised task from representation learning is employed to obtain semantically meaningful features. Second, we use the obtained features as a prior in a learnable clustering approach. In doing so, we remove the ability for cluster learning to depend on low-level features, which is present in current end-to-end learning approaches. Experimental evaluation shows that we outperform state-of-the-art methods by large margins, in particular +26.6% on CIFAR10, +25.0% on CIFAR100-20 and +21.3% on STL10 in terms of classification accuracy. Furthermore, our method is the first to perform well on a large-scale dataset for image classification. In particular, we obtain promising results on ImageNet, and outperform several semi-supervised learning methods in the low-data regime without the use of any ground-truth annotations. The code is made publicly available at https://github.com/wvangansbeke/Unsupervised-Classification.

Submitted 3 July, 2020; v1 submitted 25 May, 2020; originally announced May 2020.

Comments: Accepted at ECCV 2020. Includes supplementary. Code and pretrained models at https://github.com/wvangansbeke/Unsupervised-Classification

arXiv:2004.13379 [pdf, other]

doi 10.1109/TPAMI.2021.3054719

Multi-Task Learning for Dense Prediction Tasks: A Survey

Authors: Simon Vandenhende, Stamatios Georgoulis, Wouter Van Gansbeke, Marc Proesmans, Dengxin Dai, Luc Van Gool

Abstract: With the advent of deep learning, many dense prediction tasks, i.e. tasks that produce pixel-level predictions, have seen significant performance improvements. The typical approach is to learn these tasks in isolation, that is, a separate neural network is trained for each individual task. Yet, recent multi-task learning (MTL) techniques have shown promising results w.r.t. performance, computations and/or memory footprint, by jointly tackling multiple tasks through a learned shared representation. In this survey, we provide a well-rounded view on state-of-the-art deep learning approaches for MTL in computer vision, explicitly emphasizing on dense prediction tasks. Our contributions concern the following. First, we consider MTL from a network architecture point-of-view. We include an extensive overview and discuss the advantages/disadvantages of recent popular MTL models. Second, we examine various optimization methods to tackle the joint learning of multiple tasks. We summarize the qualitative elements of these works and explore their commonalities and differences. Finally, we provide an extensive experimental evaluation across a variety of dense prediction benchmarks to examine the pros and cons of the different methods, including both architectural and optimization based strategies.

Submitted 24 January, 2021; v1 submitted 28 April, 2020; originally announced April 2020.

Comments: Accepted to T-PAMI. Code + Suppl. Mat. can be found here: https://github.com/SimonVandenhende/Multi-Task-Learning-PyTorch IEEE Copyright Notice

arXiv:2001.02613 [pdf, other]

Don't Forget The Past: Recurrent Depth Estimation from Monocular Video

Authors: Vaishakh Patil, Wouter Van Gansbeke, Dengxin Dai, Luc Van Gool

Abstract: Autonomous cars need continuously updated depth information. Thus far, depth is mostly estimated independently for a single frame at a time, even if the method starts from video input. Our method produces a time series of depth maps, which makes it an ideal candidate for online learning approaches. In particular, we put three different types of depth estimation (supervised depth prediction, self-supervised depth prediction, and self-supervised depth completion) into a common framework. We integrate the corresponding networks with a ConvLSTM such that the spatiotemporal structures of depth across frames can be exploited to yield a more accurate depth estimation. Our method is flexible. It can be applied to monocular videos only or be combined with different types of sparse depth patterns. We carefully study the architecture of the recurrent network and its training strategy. We are first to successfully exploit recurrent networks for real-time self-supervised monocular depth estimation and completion. Extensive experiments show that our recurrent method outperforms its image-based counterpart consistently and significantly in both self-supervised scenarios. It also outperforms previous depth estimation methods of the three popular groups. Please refer to https://www.trace.ethz.ch/publications/2020/rec_depth_estimation/ for details.

Submitted 28 July, 2020; v1 submitted 8 January, 2020; originally announced January 2020.

Comments: Please refer to our webpage for details https://www.trace.ethz.ch/publications/2020/rec_depth_estimation/

arXiv:1902.05356 [pdf, other]

Sparse and noisy LiDAR completion with RGB guidance and uncertainty

Authors: Wouter Van Gansbeke, Davy Neven, Bert De Brabandere, Luc Van Gool

Abstract: This work proposes a new method to accurately complete sparse LiDAR maps guided by RGB images. For autonomous vehicles and robotics the use of LiDAR is indispensable in order to achieve precise depth predictions. A multitude of applications depend on the awareness of their surroundings, and use depth cues to reason and react accordingly. On the one hand, monocular depth prediction methods fail to generate absolute and precise depth maps. On the other hand, stereoscopic approaches are still significantly outperformed by LiDAR based approaches. The goal of the depth completion task is to generate dense depth predictions from sparse and irregular point clouds which are mapped to a 2D plane. We propose a new framework which extracts both global and local information in order to produce proper depth maps. We argue that simple depth completion does not require a deep network. However, we additionally propose a fusion method with RGB guidance from a monocular camera in order to leverage object information and to correct mistakes in the sparse input. This improves the accuracy significantly. Moreover, confidence masks are exploited in order to take into account the uncertainty in the depth predictions from each modality. This fusion method outperforms the state-of-the-art and ranks first on the KITTI depth completion benchmark. Our code with visualizations is available.

Submitted 14 February, 2019; originally announced February 2019.

Comments: 7 pages, 3 figures

arXiv:1902.00293 [pdf, other]

End-to-end Lane Detection through Differentiable Least-Squares Fitting

Authors: Wouter Van Gansbeke, Bert De Brabandere, Davy Neven, Marc Proesmans, Luc Van Gool

Abstract: Lane detection is typically tackled with a two-step pipeline in which a segmentation mask of the lane markings is predicted first, and a lane line model (like a parabola or spline) is fitted to the post-processed mask next. The problem with such a two-step approach is that the parameters of the network are not optimized for the true task of interest (estimating the lane curvature parameters) but for a proxy task (segmenting the lane markings), resulting in sub-optimal performance. In this work, we propose a method to train a lane detector in an end-to-end manner, directly regressing the lane parameters. The architecture consists of two components: a deep network that predicts a segmentation-like weight map for each lane line, and a differentiable least-squares fitting module that returns for each map the parameters of the best-fitting curve in the weighted least-squares sense. These parameters can subsequently be supervised with a loss function of choice. Our method relies on the observation that it is possible to backpropagate through a least-squares fitting procedure. This leads to an end-to-end method where the features are optimized for the true task of interest: the network implicitly learns to generate features that prevent instabilities during the model fitting step, as opposed to two-step pipelines that need to handle outliers with heuristics. Additionally, the system is not just a black box but offers a degree of interpretability because the intermediately generated segmentation-like weight maps can be inspected and visualized. Code and a video is available at github.com/wvangansbeke/LaneDetection_End2End.

Submitted 5 September, 2019; v1 submitted 1 February, 2019; originally announced February 2019.

Comments: Accepted at ICCVW 2019 (CVRSUAD-Road Scene Understanding and Autonomous Driving)

Showing 1–11 of 11 results for author: Van Gansbeke, W