Skip to main content

Showing 1–22 of 22 results for author: Hu, V T

Searching in archive cs. Search in all archives.
.
  1. arXiv:2502.11234  [pdf, other

    cs.CV

    MaskFlow: Discrete Flows For Flexible and Efficient Long Video Generation

    Authors: Michael Fuest, Vincent Tao Hu, Björn Ommer

    Abstract: Generating long, high-quality videos remains a challenge due to the complex interplay of spatial and temporal dynamics and hardware limitations. In this work, we introduce MaskFlow, a unified video generation framework that combines discrete representations with flow-matching to enable efficient generation of high-quality long videos. By leveraging a frame-level masking strategy during training, M… ▽ More

    Submitted 12 March, 2025; v1 submitted 16 February, 2025; originally announced February 2025.

    Comments: Project page: https://compvis.github.io/maskflow/

  2. arXiv:2501.04765  [pdf, other

    cs.CV cs.AI

    TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training

    Authors: Felix Krause, Timy Phan, Ming Gui, Stefan Andreas Baumann, Vincent Tao Hu, Björn Ommer

    Abstract: Diffusion models have emerged as the mainstream approach for visual generation. However, these models typically suffer from sample inefficiency and high training costs. Consequently, methods for efficient finetuning, inference and personalization were quickly adopted by the community. However, training these models in the first place remains very costly. While several recent approaches - including… ▽ More

    Submitted 27 March, 2025; v1 submitted 8 January, 2025; originally announced January 2025.

  3. arXiv:2412.11917  [pdf, other

    cs.CV

    Does VLM Classification Benefit from LLM Description Semantics?

    Authors: Pingchuan Ma, Lennart Rietdorf, Dmytro Kotovenko, Vincent Tao Hu, Björn Ommer

    Abstract: Accurately describing images with text is a foundation of explainable AI. Vision-Language Models (VLMs) like CLIP have recently addressed this by aligning images and texts in a shared embedding space, expressing semantic similarities between vision and language embeddings. VLM classification can be improved with descriptions generated by Large Language Models (LLMs). However, it is difficult to de… ▽ More

    Submitted 19 December, 2024; v1 submitted 16 December, 2024; originally announced December 2024.

    Comments: AAAI-25 (extended version), Code: https://github.com/CompVis/DisCLIP

  4. arXiv:2412.06787  [pdf, other

    cs.CV cs.AI

    [MASK] is All You Need

    Authors: Vincent Tao Hu, Björn Ommer

    Abstract: In generative models, two paradigms have gained attraction in various applications: next-set prediction-based Masked Generative Models and next-noise prediction-based Non-Autoregressive Models, e.g., Diffusion Models. In this work, we propose using discrete-state models to connect them and explore their scalability in the vision domain. First, we conduct a step-by-step analysis in a unified design… ▽ More

    Submitted 10 December, 2024; v1 submitted 9 December, 2024; originally announced December 2024.

    Comments: Technical Report (WIP), Project Page(code, model, dataset): https://compvis.github.io/mask/

  5. arXiv:2412.03512  [pdf, other

    cs.CV

    Distillation of Diffusion Features for Semantic Correspondence

    Authors: Frank Fundel, Johannes Schusterbauer, Vincent Tao Hu, Björn Ommer

    Abstract: Semantic correspondence, the task of determining relationships between different parts of images, underpins various applications including 3D reconstruction, image-to-image translation, object tracking, and visual place recognition. Recent studies have begun to explore representations learned in large generative image models for semantic correspondence, demonstrating promising results. Building on… ▽ More

    Submitted 4 December, 2024; originally announced December 2024.

    Comments: WACV 2025, Page: https://compvis.github.io/distilldift

  6. arXiv:2412.02632  [pdf, other

    cs.CV cs.AI

    Scaling Image Tokenizers with Grouped Spherical Quantization

    Authors: Jiangtao Wang, Zhen Qin, Yifan Zhang, Vincent Tao Hu, Björn Ommer, Rania Briq, Stefan Kesselheim

    Abstract: Vision tokenizers have gained a lot of attraction due to their scalability and compactness; previous works depend on old-school GAN-based hyperparameters, biased comparisons, and a lack of comprehensive analysis of the scaling behaviours. To tackle those issues, we introduce Grouped Spherical Quantization (GSQ), featuring spherical codebook initialization and lookup regularization to constrain cod… ▽ More

    Submitted 4 December, 2024; v1 submitted 3 December, 2024; originally announced December 2024.

  7. arXiv:2407.20785  [pdf, other

    cs.CV

    Retinex-Diffusion: On Controlling Illumination Conditions in Diffusion Models via Retinex Theory

    Authors: Xiaoyan Xing, Vincent Tao Hu, Jan Hendrik Metzen, Konrad Groh, Sezer Karaoglu, Theo Gevers

    Abstract: This paper introduces a novel approach to illumination manipulation in diffusion models, addressing the gap in conditional image generation with a focus on lighting conditions. We conceptualize the diffusion model as a black-box image render and strategically decompose its energy function in alignment with the image formation model. Our method effectively separates and controls illumination-relate… ▽ More

    Submitted 28 July, 2024; originally announced July 2024.

  8. arXiv:2407.00783  [pdf, other

    cs.CV cs.AI

    Diffusion Models and Representation Learning: A Survey

    Authors: Michael Fuest, Pingchuan Ma, Ming Gui, Johannes Schusterbauer, Vincent Tao Hu, Bjorn Ommer

    Abstract: Diffusion Models are popular generative modeling methods in various vision tasks, attracting significant attention. They can be considered a unique instance of self-supervised learning methods due to their independence from label annotation. This survey explores the interplay between diffusion models and representation learning. It provides an overview of diffusion models' essential aspects, inc… ▽ More

    Submitted 30 June, 2024; originally announced July 2024.

    Comments: Github Repo: https://github.com/dongzhuoyao/Diffusion-Representation-Learning-Survey-Taxonomy

  9. arXiv:2403.17064  [pdf, other

    cs.CV cs.AI cs.LG

    Continuous, Subject-Specific Attribute Control in T2I Models by Identifying Semantic Directions

    Authors: Stefan Andreas Baumann, Felix Krause, Michael Neumayr, Nick Stracke, Melvin Sevi, Vincent Tao Hu, Björn Ommer

    Abstract: Recent advances in text-to-image (T2I) diffusion models have significantly improved the quality of generated images. However, providing efficient control over individual subjects, particularly the attributes characterizing them, remains a key challenge. While existing methods have introduced mechanisms to modulate attribute expression, they typically provide either detailed, object-specific locali… ▽ More

    Submitted 14 March, 2025; v1 submitted 25 March, 2024; originally announced March 2024.

    Comments: CVPR 2025. Project page: https://compvis.github.io/attribute-control

  10. arXiv:2403.13802  [pdf, other

    cs.CV cs.AI cs.CL cs.LG

    ZigMa: A DiT-style Zigzag Mamba Diffusion Model

    Authors: Vincent Tao Hu, Stefan Andreas Baumann, Ming Gui, Olga Grebenkova, Pingchuan Ma, Johannes Schusterbauer, Björn Ommer

    Abstract: The diffusion model has long been plagued by scalability and quadratic complexity issues, especially within transformer-based structures. In this study, we aim to leverage the long sequence modeling capability of a State-Space Model called Mamba to extend its applicability to visual data generation. Firstly, we identify a critical oversight in most current Mamba-based vision methods, namely the la… ▽ More

    Submitted 24 November, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

    Comments: ECCV 2024 Project Page: https://taohu.me/zigma/

  11. arXiv:2403.13788  [pdf, other

    cs.CV

    DepthFM: Fast Monocular Depth Estimation with Flow Matching

    Authors: Ming Gui, Johannes Schusterbauer, Ulrich Prestel, Pingchuan Ma, Dmytro Kotovenko, Olga Grebenkova, Stefan Andreas Baumann, Vincent Tao Hu, Björn Ommer

    Abstract: Current discriminative depth estimation methods often produce blurry artifacts, while generative approaches suffer from slow sampling due to curvatures in the noise-to-depth transport. Our method addresses these challenges by framing depth estimation as a direct transport between image and depth distributions. We are the first to explore flow matching in this field, and we demonstrate that its int… ▽ More

    Submitted 19 December, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

    Comments: AAAI 2025, Project Page: https://github.com/CompVis/depth-fm

  12. arXiv:2402.10821  [pdf, other

    cs.CV

    Training Class-Imbalanced Diffusion Model Via Overlap Optimization

    Authors: Divin Yan, Lu Qi, Vincent Tao Hu, Ming-Hsuan Yang, Meng Tang

    Abstract: Diffusion models have made significant advances recently in high-quality image synthesis and related tasks. However, diffusion models trained on real-world datasets, which often follow long-tailed distributions, yield inferior fidelity for tail classes. Deep generative models, including diffusion models, are biased towards classes with abundant training images. To address the observed appearance o… ▽ More

    Submitted 16 February, 2024; originally announced February 2024.

    Comments: Technique Report

  13. arXiv:2312.10825  [pdf, other

    cs.CV cs.LG

    Latent Space Editing in Transformer-Based Flow Matching

    Authors: Vincent Tao Hu, David W Zhang, Pascal Mettes, Meng Tang, Deli Zhao, Cees G. M. Snoek

    Abstract: This paper strives for image editing via generative models. Flow Matching is an emerging generative modeling technique that offers the advantage of simple and efficient training. Simultaneously, a new transformer-based U-ViT has recently been proposed to replace the commonly used UNet for better scalability and performance in generative modeling. Hence, Flow Matching with a transformer backbone of… ▽ More

    Submitted 17 December, 2023; originally announced December 2023.

    Comments: AAAI 2024 with Appendix

  14. arXiv:2312.08895  [pdf, other

    cs.CV

    Motion Flow Matching for Human Motion Synthesis and Editing

    Authors: Vincent Tao Hu, Wenzhe Yin, Pingchuan Ma, Yunlu Chen, Basura Fernando, Yuki M Asano, Efstratios Gavves, Pascal Mettes, Bjorn Ommer, Cees G. M. Snoek

    Abstract: Human motion synthesis is a fundamental task in computer animation. Recent methods based on diffusion models or GPT structure demonstrate commendable performance but exhibit drawbacks in terms of slow sampling speeds and error accumulation. In this paper, we propose \emph{Motion Flow Matching}, a novel generative model designed for human motion generation featuring efficient sampling and effective… ▽ More

    Submitted 14 December, 2023; originally announced December 2023.

    Comments: WIP

  15. arXiv:2312.08825  [pdf, other

    cs.CV

    Guided Diffusion from Self-Supervised Diffusion Features

    Authors: Vincent Tao Hu, Yunlu Chen, Mathilde Caron, Yuki M. Asano, Cees G. M. Snoek, Bjorn Ommer

    Abstract: Guidance serves as a key concept in diffusion models, yet its effectiveness is often limited by the need for extra data annotation or classifier pretraining. That is why guidance was harnessed from self-supervised learning backbones, like DINO. However, recent studies have revealed that the feature representation derived from diffusion model itself is discriminative for numerous downstream tasks a… ▽ More

    Submitted 14 December, 2023; originally announced December 2023.

    Comments: Work In Progress

  16. Boosting Latent Diffusion with Flow Matching

    Authors: Johannes Schusterbauer, Ming Gui, Pingchuan Ma, Nick Stracke, Stefan A. Baumann, Vincent Tao Hu, Björn Ommer

    Abstract: Visual synthesis has recently seen significant leaps in performance, largely due to breakthroughs in generative models. Diffusion models have been a key enabler, as they excel in image diversity. However, this comes at the cost of slow training and synthesis, which is only partially alleviated by latent diffusion. To this end, flow matching is an appealing approach due to its complementary charact… ▽ More

    Submitted 4 December, 2024; v1 submitted 12 December, 2023; originally announced December 2023.

    Comments: ECCV 2024 (Oral), Project Page: https://compvis.github.io/fm-boosting/

  17. arXiv:2311.17121  [pdf, other

    cs.CV cs.LG

    ScribbleGen: Generative Data Augmentation Improves Scribble-supervised Semantic Segmentation

    Authors: Jacob Schnell, Jieke Wang, Lu Qi, Vincent Tao Hu, Meng Tang

    Abstract: Recent advances in generative models, such as diffusion models, have made generating high-quality synthetic images widely accessible. Prior works have shown that training on synthetic images improves many perception tasks, such as image classification, object detection, and semantic segmentation. We are the first to explore generative data augmentations for scribble-supervised semantic segmentatio… ▽ More

    Submitted 16 April, 2024; v1 submitted 28 November, 2023; originally announced November 2023.

  18. arXiv:2311.14542  [pdf, other

    cs.CV

    ToddlerDiffusion: Interactive Structured Image Generation with Cascaded Schrödinger Bridge

    Authors: Eslam Abdelrahman, Liangbing Zhao, Vincent Tao Hu, Matthieu Cord, Patrick Perez, Mohamed Elhoseiny

    Abstract: Diffusion models break down the challenging task of generating data from high-dimensional distributions into a series of easier denoising steps. Inspired by this paradigm, we propose a novel approach that extends the diffusion framework into modality space, decomposing the complex task of RGB image generation into simpler, interpretable stages. Our method, termed ToddlerDiffusion, cascades modalit… ▽ More

    Submitted 5 October, 2024; v1 submitted 24 November, 2023; originally announced November 2023.

  19. arXiv:2210.06462  [pdf, other

    cs.CV cs.AI cs.LG

    Self-Guided Diffusion Models

    Authors: Vincent Tao Hu, David W Zhang, Yuki M. Asano, Gertjan J. Burghouts, Cees G. M. Snoek

    Abstract: Diffusion models have demonstrated remarkable progress in image generation quality, especially when guidance is used to control the generative process. However, guidance requires a large amount of image-annotation pairs for training and is thus dependent on their availability, correctness and unbiasedness. In this paper, we eliminate the need for such annotation by instead leveraging the flexibili… ▽ More

    Submitted 27 November, 2023; v1 submitted 12 October, 2022; originally announced October 2022.

    Comments: CVPR 2023

  20. arXiv:2106.10137  [pdf, other

    cs.CV

    Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting

    Authors: Martine Toering, Ioannis Gatopoulos, Maarten Stol, Vincent Tao Hu

    Abstract: Instance-level contrastive learning techniques, which rely on data augmentation and a contrastive loss function, have found great success in the domain of visual representation learning. They are not suitable for exploiting the rich dynamical structure of video however, as operations are done on many augmented instances. In this paper we propose "Video Cross-Stream Prototypical Contrasting", a nov… ▽ More

    Submitted 20 October, 2021; v1 submitted 18 June, 2021; originally announced June 2021.

    Comments: WACV2022

  21. arXiv:2008.06374  [pdf, other

    cs.CV

    PointMixup: Augmentation for Point Clouds

    Authors: Yunlu Chen, Vincent Tao Hu, Efstratios Gavves, Thomas Mensink, Pascal Mettes, Pengwan Yang, Cees G. M. Snoek

    Abstract: This paper introduces data augmentation for point clouds by interpolation between examples. Data augmentation by interpolation has shown to be a simple and effective approach in the image domain. Such a mixup is however not directly transferable to point clouds, as we do not have a one-to-one correspondence between the points of two different objects. In this paper, we define data augmentation bet… ▽ More

    Submitted 14 August, 2020; originally announced August 2020.

    Comments: Accepted as Spotlight presentation at European Conference on Computer Vision (ECCV), 2020

  22. arXiv:2008.05826  [pdf, other

    cs.CV cs.LG eess.IV

    Localizing the Common Action Among a Few Videos

    Authors: Pengwan Yang, Vincent Tao Hu, Pascal Mettes, Cees G. M. Snoek

    Abstract: This paper strives to localize the temporal extent of an action in a long untrimmed video. Where existing work leverages many examples with their start, their ending, and/or the class of the action during training time, we propose few-shot common action localization. The start and end of an action in a long untrimmed video is determined based on just a hand-full of trimmed video examples containin… ▽ More

    Submitted 25 August, 2020; v1 submitted 13 August, 2020; originally announced August 2020.

    Comments: ECCV 2020