Showing 1–30 of 30 results for author: Mezaris, V

Searching in archive cs.
  1. arXiv:2505.03319  [pdf, ps, other]

    cs.CV cs.AI cs.MM

    SD-VSum: A Method and Dataset for Script-Driven Video Summarization

    Authors: Manolis Mylonas, Evlampios Apostolidis, Vasileios Mezaris

    Abstract: In this work, we introduce the task of script-driven video summarization, which aims to produce a summary of the full-length video by selecting the parts that are most relevant to a user-provided script outlining the visual content of the desired summary. Following, we extend a recently-introduced large-scale dataset for generic video summarization (VideoXum) by producing natural language descript…

    Submitted 6 May, 2025; originally announced May 2025.

    Comments: Under review

  2. arXiv:2504.09914  [pdf, other]

    cs.CV

    Improving Multimodal Hateful Meme Detection Exploiting LMM-Generated Knowledge

    Authors: Maria Tzelepi, Vasileios Mezaris

    Abstract: Memes have become a dominant form of communication in social media in recent years. Memes are typically humorous and harmless, however there are also memes that promote hate speech, being in this way harmful to individuals and groups based on their identity. Therefore, detecting hateful content in memes has emerged as a task of critical importance. The need for understanding the complex interactio…

    Submitted 14 April, 2025; originally announced April 2025.

    Comments: Accepted for publication, Multimodal Learning and Applications Workshop (MULA 2025) @ IEEE/CVF CVPR 2025, Nashville, TN, USA, June 2025. This is the authors' "accepted version"

  3. arXiv:2502.03957  [pdf, other]

    cs.CV cs.AI cs.CR

    Improving the Perturbation-Based Explanation of Deepfake Detectors Through the Use of Adversarially-Generated Samples

    Authors: Konstantinos Tsigos, Evlampios Apostolidis, Vasileios Mezaris

    Abstract: In this paper, we introduce the idea of using adversarially-generated samples of the input images that were classified as deepfakes by a detector, to form perturbation masks for inferring the importance of different input features and produce visual explanations. We generate these samples based on Natural Evolution Strategies, aiming to flip the original deepfake detector's decision and classify t…

    Submitted 6 February, 2025; originally announced February 2025.

    Comments: Accepted for publication, AI4MFDD Workshop @ IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025), Tucson, AZ, USA, Feb. 2025. This is the authors' "accepted version"

  4. P-TAME: Explain Any Image Classifier with Trained Perturbations

    Authors: Mariano V. Ntrougkas, Vasileios Mezaris, Ioannis Patras

    Abstract: The adoption of Deep Neural Networks (DNNs) in critical fields where predictions need to be accompanied by justifications is hindered by their inherent black-box nature. In this paper, we introduce P-TAME (Perturbation-based Trainable Attention Mechanism for Explanations), a model-agnostic method for explaining DNN-based image classifiers. P-TAME employs an auxiliary image classifier to extract fe…

    Submitted 3 June, 2025; v1 submitted 29 January, 2025; originally announced January 2025.

    Comments: Published in IEEE Open Journal of Signal Processing (Volume 6)

    Journal ref: IEEE Open Journal of Signal Processing, vol. 6, pp. 536-545, 2025

  5. arXiv:2501.16917  [pdf, other]

    cs.CV

    B-FPGM: Lightweight Face Detection via Bayesian-Optimized Soft FPGM Pruning

    Authors: Nikolaos Kaparinos, Vasileios Mezaris

    Abstract: Face detection is a computer vision application that increasingly demands lightweight models to facilitate deployment on devices with limited computational resources. Neural network pruning is a promising technique that can effectively reduce network size without significantly affecting performance. In this work, we propose a novel face detection pruning pipeline that leverages Filter Pruning via…

    Submitted 28 January, 2025; originally announced January 2025.

    Comments: Accepted for publication, RWS Workshop @ IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025), Tucson, AZ, USA, Feb. 2025. This is the authors' "accepted version"

  6. arXiv:2412.17415  [pdf, other]

    cs.CV cs.AI cs.MM

    VidCtx: Context-aware Video Question Answering with Image Models

    Authors: Andreas Goulas, Vasileios Mezaris, Ioannis Patras

    Abstract: To address computational and memory limitations of Large Multimodal Models in the Video Question-Answering task, several recent methods extract textual representations per frame (e.g., by captioning) and feed them to a Large Language Model (LLM) that processes them to produce the final response. However, in this way, the LLM does not have access to visual information and often has to process repet…

    Submitted 7 April, 2025; v1 submitted 23 December, 2024; originally announced December 2024.

    Comments: Accepted in IEEE ICME 2025. This is the authors' accepted version

  7. arXiv:2412.11663  [pdf, other]

    cs.CV cs.MM

    LMM-Regularized CLIP Embeddings for Image Classification

    Authors: Maria Tzelepi, Vasileios Mezaris

    Abstract: In this paper we deal with image classification tasks using the powerful CLIP vision-language model. Our goal is to advance the classification performance using the CLIP's image encoder, by proposing a novel Large Multimodal Model (LMM) based regularization method. The proposed method uses an LMM to extract semantic descriptions for the images of the dataset. Then, it uses the CLIP's text encoder,…

    Submitted 16 December, 2024; originally announced December 2024.

    Comments: Accepted for publication, 26th Int. Symp. on Multimedia (IEEE ISM 2024), Tokyo, Japan, Dec. 2024. This is the authors' "accepted version"

  8. arXiv:2406.12668  [pdf, other]

    cs.CV

    Disturbing Image Detection Using LMM-Elicited Emotion Embeddings

    Authors: Maria Tzelepi, Vasileios Mezaris

    Abstract: In this paper we deal with the task of Disturbing Image Detection (DID), exploiting knowledge encoded in Large Multimodal Models (LMMs). Specifically, we propose to exploit LMM knowledge in a two-fold manner: first by extracting generic semantic descriptions, and second by extracting elicited emotions. Subsequently, we use the CLIP's text encoder in order to obtain the text embeddings of both the…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: Accepted for publication, LVLM Workshop @ IEEE Int. Conf. on Image Processing (ICIP 2024), Abu Dhabi, United Arab Emirates, Oct. 2024. This is the authors' "accepted version"

  9. arXiv:2406.12662  [pdf, other]

    cs.CV

    Online Anchor-based Training for Image Classification Tasks

    Authors: Maria Tzelepi, Vasileios Mezaris

    Abstract: In this paper, we aim to improve the performance of a deep learning model towards image classification tasks, proposing a novel anchor-based training methodology, named Online Anchor-based Training (OAT). The OAT method, guided by the insights provided in the anchor-based object detection methodologies, instead of learning directly the class labels, proposes to train a model to learn perc…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: Accepted for publication, IEEE Int. Conf. on Image Processing (ICIP 2024), Abu Dhabi, United Arab Emirates, Oct. 2024. This is the authors' "accepted version"

  10. arXiv:2406.03071  [pdf, other]

    cs.CV cs.AI cs.MM

    Exploiting LMM-based knowledge for image classification tasks

    Authors: Maria Tzelepi, Vasileios Mezaris

    Abstract: In this paper we address image classification tasks leveraging knowledge encoded in Large Multimodal Models (LMMs). More specifically, we use the MiniGPT-4 model to extract semantic descriptions for the images, in a multimodal prompting fashion. In the current literature, vision language models such as CLIP, among other approaches, are utilized as feature extractors, using only the image encoder,…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: Accepted for publication, 25th Int. Conf. on Engineering Applications of Neural Networks (EANN/EAAAI 2024), Corfu, Greece, June 2024. This is the "submitted manuscript"

  11. arXiv:2406.02991  [pdf, other]

    cs.CV cs.MM

    A Human-Annotated Video Dataset for Training and Evaluation of 360-Degree Video Summarization Methods

    Authors: Ioannis Kontostathis, Evlampios Apostolidis, Vasileios Mezaris

    Abstract: In this paper we introduce a new dataset for 360-degree video summarization: the transformation of 360-degree video content to concise 2D-video summaries that can be consumed via traditional devices, such as TV sets and smartphones. The dataset includes ground-truth human-generated summaries, that can be used for training and objectively evaluating 360-degree video summarization methods. Using thi…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: Accepted for publication, 1st Int. Workshop on Video for Immersive Experiences (Video4IMX-2024) at ACM IMX 2024, Stockholm, Sweden, June 2024. This is the "accepted version"

  12. arXiv:2405.10082  [pdf, other]

    cs.CV cs.AI

    An Integrated Framework for Multi-Granular Explanation of Video Summarization

    Authors: Konstantinos Tsigos, Evlampios Apostolidis, Vasileios Mezaris

    Abstract: In this paper, we propose an integrated framework for multi-granular explanation of video summarization. This framework integrates methods for producing explanations both at the fragment level (indicating which video fragments influenced the most the decisions of the summarizer) and the more fine-grained visual object level (highlighting which visual objects were the most influential for the summa…

    Submitted 16 May, 2024; originally announced May 2024.

    Comments: Under review

  13. arXiv:2405.00384  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    Visual and audio scene classification for detecting discrepancies in video: a baseline method and experimental protocol

    Authors: Konstantinos Apostolidis, Jakob Abesser, Luca Cuccovillo, Vasileios Mezaris

    Abstract: This paper presents a baseline approach and an experimental protocol for a specific content verification problem: detecting discrepancies between the audio and video modalities in multimedia content. We first design and optimize an audio-visual scene classifier, to compare with existing classification baselines that use both modalities. Then, by applying this classifier separately to the audio and…

    Submitted 1 May, 2024; originally announced May 2024.

    Comments: Accepted for publication, 3rd ACM Int. Workshop on Multimedia AI against Disinformation (MAD'24) at ACM ICMR'24, June 10, 2024, Phuket, Thailand. This is the "accepted version"

  14. arXiv:2404.18649  [pdf, other]

    cs.CV cs.AI

    Towards Quantitative Evaluation of Explainable AI Methods for Deepfake Detection

    Authors: Konstantinos Tsigos, Evlampios Apostolidis, Spyridon Baxevanakis, Symeon Papadopoulos, Vasileios Mezaris

    Abstract: In this paper we propose a new framework for evaluating the performance of explanation methods on the decisions of a deepfake detector. This framework assesses the ability of an explanation method to spot the regions of a fake image with the biggest influence on the decision of the deepfake detector, by examining the extent to which these regions can be modified through a set of adversarial attack…

    Submitted 29 April, 2024; originally announced April 2024.

    Comments: Accepted for publication, 3rd ACM Int. Workshop on Multimedia AI against Disinformation (MAD'24) at ACM ICMR'24, June 10, 2024, Phuket, Thailand. This is the "accepted version"

  15. arXiv:2403.04523  [pdf, ps, other]

    cs.CV cs.AI cs.LG cs.MM

    T-TAME: Trainable Attention Mechanism for Explaining Convolutional Networks and Vision Transformers

    Authors: Mariano V. Ntrougkas, Nikolaos Gkalelis, Vasileios Mezaris

    Abstract: The development and adoption of Vision Transformers and other deep-learning architectures for image classification tasks has been rapid. However, the "black box" nature of neural networks is a barrier to adoption in applications where explainability is essential. While some techniques for generating explanations have been proposed, primarily for Convolutional Neural Networks, adapting such techniq…

    Submitted 3 June, 2025; v1 submitted 7 March, 2024; originally announced March 2024.

    Comments: Accepted

    Journal ref: IEEE Access 12 (2024) 76880-76900

  16. arXiv:2312.02616  [pdf, other]

    cs.CV

    Facilitating the Production of Well-tailored Video Summaries for Sharing on Social Media

    Authors: Evlampios Apostolidis, Konstantinos Apostolidis, Vasileios Mezaris

    Abstract: This paper presents a web-based tool that facilitates the production of tailored summaries for online sharing on social media. Through an interactive user interface, it supports a "one-click" video summarization process. Based on the integrated AI models for video summarization and aspect ratio transformation, it facilitates the generation of multiple summaries of a full-length video according t…

    Submitted 5 December, 2023; originally announced December 2023.

    Comments: Accepted for publication, 30th Int. Conf. on MultiMedia Modeling (MMM 2024), Amsterdam, NL, Jan.-Feb. 2024. This is the "submitted manuscript" version

  17. arXiv:2312.02576  [pdf, other]

    cs.CV

    An Integrated System for Spatio-Temporal Summarization of 360-degrees Videos

    Authors: Ioannis Kontostathis, Evlampios Apostolidis, Vasileios Mezaris

    Abstract: In this work, we present an integrated system for spatiotemporal summarization of 360-degrees videos. The video summary production mainly involves the detection of salient events and their synopsis into a concise summary. The analysis relies on state-of-the-art methods for saliency detection in 360-degrees video (ATSal and SST-Sal) and video summarization (CA-SUM). It also contains a mechanism tha…

    Submitted 5 December, 2023; originally announced December 2023.

    Comments: Accepted for publication, 30th Int. Conf. on MultiMedia Modeling (MMM 2024), Amsterdam, NL, Jan.-Feb. 2024. This is the "submitted manuscript" version

  18. arXiv:2312.01790  [pdf, other]

    cs.CV

    MMFusion: Combining Image Forensic Filters for Visual Manipulation Detection and Localization

    Authors: Kostas Triaridis, Konstantinos Tsigos, Vasileios Mezaris

    Abstract: Recent image manipulation localization and detection techniques typically leverage forensic artifacts and traces that are produced by a noise-sensitive filter, such as SRM or Bayar convolution. In this paper, we showcase that different filters commonly used in such approaches excel at unveiling different types of manipulations and provide complementary forensic traces. Thus, we explore ways of com…

    Submitted 16 October, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

    Comments: This version (v2): extended journal version, submitted for publication. Initial version (v1), arXiv:2312.01790v1, presented and published in the 30th Int. Conf. on MultiMedia Modeling (MMM 2024), Amsterdam, NL, Jan.-Feb. 2024. This is the "submitted manuscript" version

  19. arXiv:2311.16613  [pdf, other]

    cs.CV

    Filter-Pruning of Lightweight Face Detectors Using a Geometric Median Criterion

    Authors: Konstantinos Gkrispanis, Nikolaos Gkalelis, Vasileios Mezaris

    Abstract: Face detectors are becoming a crucial component of many applications, including surveillance, that often have to run on edge devices with limited processing power and memory. Therefore, there's a pressing demand for compact face detection models that can function efficiently across resource-constrained devices. Over recent years, network pruning techniques have attracted a lot of attention from re…

    Submitted 28 November, 2023; originally announced November 2023.

    Comments: Accepted for publication in the IEEE/CVF WACV 2024 Workshops proceedings, Hawaii, USA, Jan. 2024

  20. arXiv:2308.12673  [pdf, other]

    cs.CV cs.LG cs.MM

    Masked Feature Modelling: Feature Masking for the Unsupervised Pre-training of a Graph Attention Network Block for Bottom-up Video Event Recognition

    Authors: Dimitrios Daskalakis, Nikolaos Gkalelis, Vasileios Mezaris

    Abstract: In this paper, we introduce Masked Feature Modelling (MFM), a novel approach for the unsupervised pre-training of a Graph Attention Network (GAT) block. MFM utilizes a pretrained Visual Tokenizer to reconstruct masked features of objects within a video, leveraging the MiniKinetics dataset. We then incorporate the pre-trained GAT block into a state-of-the-art bottom-up supervised video-event recogn…

    Submitted 25 August, 2023; v1 submitted 24 August, 2023; originally announced August 2023.

    Comments: 8 pages

  21. arXiv:2301.07565  [pdf, other]

    cs.CV

    Gated-ViGAT: Efficient Bottom-Up Event Recognition and Explanation Using a New Frame Selection Policy and Gating Mechanism

    Authors: Nikolaos Gkalelis, Dimitrios Daskalakis, Vasileios Mezaris

    Abstract: In this paper, Gated-ViGAT, an efficient approach for video event recognition, utilizing bottom-up (object) information, a new frame sampling policy and a gating mechanism is proposed. Specifically, the frame sampling policy uses weighted in-degrees (WiDs), derived from the adjacency matrices of graph attention networks (GATs), and a dissimilarity measure to select the most salient and at the same…

    Submitted 18 January, 2023; originally announced January 2023.

    Comments: Accepted for publication in the proceedings of IEEE Int. Symposium on Multimedia (ISM), Naples, Italy, Dec. 2022

  22. TAME: Attention Mechanism Based Feature Fusion for Generating Explanation Maps of Convolutional Neural Networks

    Authors: Mariano Ntrougkas, Nikolaos Gkalelis, Vasileios Mezaris

    Abstract: The apparent "black box" nature of neural networks is a barrier to adoption in applications where explainability is essential. This paper presents TAME (Trainable Attention Mechanism for Explanations), a method for generating explanation maps with a multi-branch hierarchical attention mechanism. TAME combines a target model's feature maps from multiple layers using an attention mechanism, transf…

    Submitted 18 January, 2023; originally announced January 2023.

    Comments: Accepted for publication in the proceedings of IEEE Int. Symposium on Multimedia (ISM), Naples, Italy, Dec. 2022

    Journal ref: ISM (2022) 58-65

  23. arXiv:2211.11351  [pdf, other]

    cs.CV

    Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval

    Authors: Damianos Galanopoulos, Vasileios Mezaris

    Abstract: In this paper we tackle the cross-modal video retrieval problem and, more specifically, we focus on text-to-video retrieval. We investigate how to optimally combine multiple diverse textual and visual features into feature pairs that lead to generating multiple joint feature spaces, which encode text-video pairs into comparable representations. To learn these representations our proposed network a…

    Submitted 21 November, 2022; originally announced November 2022.

    Comments: Accepted for publication; to be included in Proc. ECCV Workshops 2022. The version posted here is the "submitted manuscript" version

  24. arXiv:2209.11189  [pdf, other]

    cs.CV

    Learning Visual Explanations for DCNN-Based Image Classifiers Using an Attention Mechanism

    Authors: Ioanna Gkartzonika, Nikolaos Gkalelis, Vasileios Mezaris

    Abstract: In this paper two new learning-based eXplainable AI (XAI) methods for deep convolutional neural network (DCNN) image classifiers, called L-CAM-Fm and L-CAM-Img, are proposed. Both methods use an attention mechanism that is inserted in the original (frozen) DCNN and is trained to derive class activation maps (CAMs) from the last convolutional layer's feature maps. During training, CAMs are applied…

    Submitted 22 September, 2022; originally announced September 2022.

    Comments: Accepted for publication; to be included in Proc. ECCV Workshops 2022. The version posted here is the "submitted manuscript" version

  25. arXiv:2207.09927  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM

    ViGAT: Bottom-up event recognition and explanation in video using factorized graph attention network

    Authors: Nikolaos Gkalelis, Dimitrios Daskalakis, Vasileios Mezaris

    Abstract: In this paper a pure-attention bottom-up approach, called ViGAT, that utilizes an object detector together with a Vision Transformer (ViT) backbone network to derive object and frame features, and a head network to process these features for the task of event recognition and explanation in video, is proposed. The ViGAT head consists of graph attention network (GAT) blocks factorized along the spat…

    Submitted 31 October, 2022; v1 submitted 20 July, 2022; originally announced July 2022.

    Journal ref: IEEE Access, vol. 10, pp. 108797-108816, 2022

  26. arXiv:2101.06072  [pdf, ps, other]

    cs.CV cs.LG cs.MM

    Video Summarization Using Deep Neural Networks: A Survey

    Authors: Evlampios Apostolidis, Eleni Adamantidou, Alexandros I. Metsai, Vasileios Mezaris, Ioannis Patras

    Abstract: Video summarization technologies aim to create a concise and complete synopsis by selecting the most informative parts of the video content. Several approaches have been developed over the last couple of decades and the current state of the art is represented by methods that rely on modern deep neural network architectures. This work focuses on the recent advances in the area and provides a compre…

    Submitted 27 September, 2021; v1 submitted 15 January, 2021; originally announced January 2021.

    Comments: Accepted for publication at the Proceedings of the IEEE

  27. arXiv:1608.06770  [pdf, other]

    cs.MM cs.CV

    Automatic Synchronization of Multi-User Photo Galleries

    Authors: E. Sansone, K. Apostolidis, N. Conci, G. Boato, V. Mezaris, F. G. B. De Natale

    Abstract: In this paper we address the issue of photo galleries synchronization, where pictures related to the same event are collected by different users. Existing solutions to address the problem are usually based on unrealistic assumptions, like time consistency across photo galleries, and often heavily rely on heuristics, limiting therefore the applicability to real-world scenarios. We propose a solutio…

    Submitted 16 January, 2017; v1 submitted 24 August, 2016; originally announced August 2016.

    Comments: ACCEPTED to IEEE Transactions on Multimedia

  28. Learning to detect video events from zero or very few video examples

    Authors: Christos Tzelepis, Damianos Galanopoulos, Vasileios Mezaris, Ioannis Patras

    Abstract: In this work we deal with the problem of high-level event detection in video. Specifically, we study the challenging problems of i) learning to detect video events from solely a textual description of the event, without using any positive video examples, and ii) additionally exploiting very few positive training samples together with a small number of "related" videos. For learning only from an…

    Submitted 25 November, 2015; originally announced November 2015.

    Comments: Image and Vision Computing Journal, Elsevier, 2015, accepted for publication

    Journal ref: Image and Vision Computing Journal, Elsevier, 2015

  29. arXiv:1504.07000  [pdf, other]

    cs.LG

    Accelerated kernel discriminant analysis

    Authors: Nikolaos Gkalelis, Vasileios Mezaris

    Abstract: In this paper, using a novel matrix factorization and simultaneous reduction to diagonal form approach (or in short simultaneous reduction approach), Accelerated Kernel Discriminant Analysis (AKDA) and Accelerated Kernel Subclass Discriminant Analysis (AKSDA) are proposed. Specifically, instead of performing the simultaneous reduction of the between- and within-class or subclass scatter matrices,…

    Submitted 27 November, 2017; v1 submitted 27 April, 2015; originally announced April 2015.

    Comments: 14 pages, journal, under review

  30. Linear Maximum Margin Classifier for Learning from Uncertain Data

    Authors: Christos Tzelepis, Vasileios Mezaris, Ioannis Patras

    Abstract: In this paper, we propose a maximum margin classifier that deals with uncertainty in data input. More specifically, we reformulate the SVM framework such that each training example can be modeled by a multi-dimensional Gaussian distribution described by its mean vector and its covariance matrix -- the latter modeling the uncertainty. We address the classification problem and define a cost function…

    Submitted 19 November, 2017; v1 submitted 15 April, 2015; originally announced April 2015.

    Comments: IEEE Transactions on Pattern Analysis and Machine Intelligence. (c) 2017 IEEE. DOI: 10.1109/TPAMI.2017.2772235 Author's accepted version. The final publication is available at http://ieeexplore.ieee.org/document/8103808/