Search | arXiv e-print repository

Boosting Generative Image Modeling via Joint Image-Feature Synthesis

Authors: Theodoros Kouzelis, Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, Nikos Komodakis

Abstract: Latent diffusion models (LDMs) dominate high-quality image generation, yet integrating representation learning with generative modeling remains a challenge. We introduce a novel generative image modeling framework that seamlessly bridges this gap by leveraging a diffusion model to jointly model low-level image latents (from a variational autoencoder) and high-level semantic features (from a pretra… ▽ More Latent diffusion models (LDMs) dominate high-quality image generation, yet integrating representation learning with generative modeling remains a challenge. We introduce a novel generative image modeling framework that seamlessly bridges this gap by leveraging a diffusion model to jointly model low-level image latents (from a variational autoencoder) and high-level semantic features (from a pretrained self-supervised encoder like DINO). Our latent-semantic diffusion approach learns to generate coherent image-feature pairs from pure noise, significantly enhancing both generative quality and training efficiency, all while requiring only minimal modifications to standard Diffusion Transformer architectures. By eliminating the need for complex distillation objectives, our unified design simplifies training and unlocks a powerful new inference strategy: Representation Guidance, which leverages learned semantics to steer and refine image generation. Evaluated in both conditional and unconditional settings, our method delivers substantial improvements in image quality and training convergence speed, establishing a new direction for representation-aware generative modeling. △ Less

Submitted 22 April, 2025; originally announced April 2025.

arXiv:2503.15697 [pdf, other]

Technical Report for the 5th CLVision Challenge at CVPR: Addressing the Class-Incremental with Repetition using Unlabeled Data -- 4th Place Solution

Authors: Panagiota Moraiti, Efstathios Karypidis

Abstract: This paper outlines our approach to the 5th CLVision challenge at CVPR, which addresses the Class-Incremental with Repetition (CIR) scenario. In contrast to traditional class incremental learning, this novel setting introduces unique challenges and research opportunities, particularly through the integration of unlabeled data into the training process. In the CIR scenario, encountered classes may… ▽ More This paper outlines our approach to the 5th CLVision challenge at CVPR, which addresses the Class-Incremental with Repetition (CIR) scenario. In contrast to traditional class incremental learning, this novel setting introduces unique challenges and research opportunities, particularly through the integration of unlabeled data into the training process. In the CIR scenario, encountered classes may reappear in later learning experiences, and each experience may involve only a subset of the overall class distribution. Additionally, the unlabeled data provided during training may include instances of unseen classes, or irrelevant classes which should be ignored. Our approach focuses on retaining previously learned knowledge by utilizing knowledge distillation and pseudo-labeling techniques. The key characteristic of our method is the exploitation of unlabeled data during training, in order to maintain optimal performance on instances of previously encountered categories and reduce the detrimental effects of catastrophic forgetting. Our method achieves an average accuracy of 16.68\% during the pre-selection phase and 21.19% during the final evaluation phase, outperforming the baseline accuracy of 9.39%. We provide the implementation code at https://github.com/panagiotamoraiti/continual-learning-challenge-2024 . △ Less

Submitted 19 March, 2025; originally announced March 2025.

arXiv:2501.08303 [pdf, other]

Advancing Semantic Future Prediction through Multimodal Visual Sequence Transformers

Authors: Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, Nikos Komodakis

Abstract: Semantic future prediction is important for autonomous systems navigating dynamic environments. This paper introduces FUTURIST, a method for multimodal future semantic prediction that uses a unified and efficient visual sequence transformer architecture. Our approach incorporates a multimodal masked visual modeling objective and a novel masking mechanism designed for multimodal training. This allo… ▽ More Semantic future prediction is important for autonomous systems navigating dynamic environments. This paper introduces FUTURIST, a method for multimodal future semantic prediction that uses a unified and efficient visual sequence transformer architecture. Our approach incorporates a multimodal masked visual modeling objective and a novel masking mechanism designed for multimodal training. This allows the model to effectively integrate visible information from various modalities, improving prediction accuracy. Additionally, we propose a VAE-free hierarchical tokenization process, which reduces computational complexity, streamlines the training pipeline, and enables end-to-end training with high-resolution, multimodal inputs. We validate FUTURIST on the Cityscapes dataset, demonstrating state-of-the-art performance in future semantic segmentation for both short- and mid-term forecasting. We provide the implementation code at https://github.com/Sta8is/FUTURIST . △ Less

Submitted 14 January, 2025; originally announced January 2025.

arXiv:2412.11673 [pdf, other]

DINO-Foresight: Looking into the Future with DINO

Authors: Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, Nikos Komodakis

Abstract: Predicting future dynamics is crucial for applications like autonomous driving and robotics, where understanding the environment is key. Existing pixel-level methods are computationally expensive and often focus on irrelevant details. To address these challenges, we introduce DINO-Foresight, a novel framework that operates in the semantic feature space of pretrained Vision Foundation Models (VFMs)… ▽ More Predicting future dynamics is crucial for applications like autonomous driving and robotics, where understanding the environment is key. Existing pixel-level methods are computationally expensive and often focus on irrelevant details. To address these challenges, we introduce DINO-Foresight, a novel framework that operates in the semantic feature space of pretrained Vision Foundation Models (VFMs). Our approach trains a masked feature transformer in a self-supervised manner to predict the evolution of VFM features over time. By forecasting these features, we can apply off-the-shelf, task-specific heads for various scene understanding tasks. In this framework, VFM features are treated as a latent space, to which different heads attach to perform specific tasks for future-frame analysis. Extensive experiments show that our framework outperforms existing methods, demonstrating its robustness and scalability. Additionally, we highlight how intermediate transformer representations in DINO-Foresight improve downstream task performance, offering a promising path for the self-supervised enhancement of VFM features. We provide the implementation code at https://github.com/Sta8is/DINO-Foresight . △ Less

Submitted 16 December, 2024; originally announced December 2024.

arXiv:2204.05983 [pdf]

doi 10.37394/23206.2022.21.19

Comparison Analysis of Traditional Machine Learning and Deep Learning Techniques for Data and Image Classification

Authors: Efstathios Karypidis, Stylianos G. Mouslech, Kassiani Skoulariki, Alexandros Gazis

Abstract: The purpose of the study is to analyse and compare the most common machine learning and deep learning techniques used for computer vision 2D object classification tasks. Firstly, we will present the theoretical background of the Bag of Visual words model and Deep Convolutional Neural Networks (DCNN). Secondly, we will implement a Bag of Visual Words model, the VGG16 CNN Architecture. Thirdly, we w… ▽ More The purpose of the study is to analyse and compare the most common machine learning and deep learning techniques used for computer vision 2D object classification tasks. Firstly, we will present the theoretical background of the Bag of Visual words model and Deep Convolutional Neural Networks (DCNN). Secondly, we will implement a Bag of Visual Words model, the VGG16 CNN Architecture. Thirdly, we will present our custom and novice DCNN in which we test the aforementioned implementations on a modified version of the Belgium Traffic Sign dataset. Our results showcase the effects of hyperparameters on traditional machine learning and the advantage in terms of accuracy of DCNNs compared to classical machine learning methods. As our tests indicate, our proposed solution can achieve similar - and in some cases better - results than existing DCNNs architectures. Finally, the technical merit of this article lies in the presented computationally simpler DCNN architecture, which we believe can pave the way towards using more efficient architectures for basic tasks. △ Less

Submitted 11 April, 2022; originally announced April 2022.

Comments: 9 pages, 9 figures, 4 tables. This is an Accepted Manuscript of an article published by Wseas Transactions on Mathematics on 2022, available online: https://doi.org/10.37394/23206.2022.21.19

ACM Class: K.6.3; C.5.2; C.5.3; C.5.5; C.5.m; C.5.0

Journal ref: WSEAS, Transactions on Mathematics, vol. 21, pp. 122-130, March, 2022

Showing 1–5 of 5 results for author: Karypidis, E