Search | arXiv e-print repository

arXiv:2505.19795 [pdf, ps, other]

The Missing Point in Vision Transformers for Universal Image Segmentation

Authors: Sajjad Shahabodini, Mobina Mansoori, Farnoush Bayatmakou, Jamshid Abouei, Konstantinos N. Plataniotis, Arash Mohammadi

Abstract: Image segmentation remains a challenging task in computer vision, demanding robust mask generation and precise classification. Recent mask-based approaches yield high-quality masks by capturing global context. However, accurately classifying these masks, especially in the presence of ambiguous boundaries and imbalanced class distributions, remains an open challenge. In this work, we introduce ViT-… ▽ More Image segmentation remains a challenging task in computer vision, demanding robust mask generation and precise classification. Recent mask-based approaches yield high-quality masks by capturing global context. However, accurately classifying these masks, especially in the presence of ambiguous boundaries and imbalanced class distributions, remains an open challenge. In this work, we introduce ViT-P, a novel two-stage segmentation framework that decouples mask generation from classification. The first stage employs a proposal generator to produce class-agnostic mask proposals, while the second stage utilizes a point-based classification model built on the Vision Transformer (ViT) to refine predictions by focusing on mask central points. ViT-P serves as a pre-training-free adapter, allowing the integration of various pre-trained vision transformers without modifying their architecture, ensuring adaptability to dense prediction tasks. Furthermore, we demonstrate that coarse and bounding box annotations can effectively enhance classification without requiring additional training on fine annotation datasets, reducing annotation costs while maintaining strong performance. Extensive experiments across COCO, ADE20K, and Cityscapes datasets validate the effectiveness of ViT-P, achieving state-of-the-art results with 54.0 PQ on ADE20K panoptic segmentation, 87.4 mIoU on Cityscapes semantic segmentation, and 63.6 mIoU on ADE20K semantic segmentation. The code and pretrained models are available at: https://github.com/sajjad-sh33/ViT-P}{https://github.com/sajjad-sh33/ViT-P. △ Less

Submitted 26 May, 2025; originally announced May 2025.

arXiv:2505.19779 [pdf, ps, other]

Advancements in Medical Image Classification through Fine-Tuning Natural Domain Foundation Models

Authors: Mobina Mansoori, Sajjad Shahabodini, Farnoush Bayatmakou, Jamshid Abouei, Konstantinos N. Plataniotis, Arash Mohammadi

Abstract: Using massive datasets, foundation models are large-scale, pre-trained models that perform a wide range of tasks. These models have shown consistently improved results with the introduction of new methods. It is crucial to analyze how these trends impact the medical field and determine whether these advancements can drive meaningful change. This study investigates the application of recent state-o… ▽ More Using massive datasets, foundation models are large-scale, pre-trained models that perform a wide range of tasks. These models have shown consistently improved results with the introduction of new methods. It is crucial to analyze how these trends impact the medical field and determine whether these advancements can drive meaningful change. This study investigates the application of recent state-of-the-art foundation models, DINOv2, MAE, VMamba, CoCa, SAM2, and AIMv2, for medical image classification. We explore their effectiveness on datasets including CBIS-DDSM for mammography, ISIC2019 for skin lesions, APTOS2019 for diabetic retinopathy, and CHEXPERT for chest radiographs. By fine-tuning these models and evaluating their configurations, we aim to understand the potential of these advancements in medical image classification. The results indicate that these advanced models significantly enhance classification outcomes, demonstrating robust performance despite limited labeled data. Based on our results, AIMv2, DINOv2, and SAM2 models outperformed others, demonstrating that progress in natural domain training has positively impacted the medical domain and improved classification outcomes. Our code is publicly available at: https://github.com/sajjad-sh33/Medical-Transfer-Learning. △ Less

Submitted 26 May, 2025; originally announced May 2025.

arXiv:2409.09484 [pdf, other]

Self-Prompting Polyp Segmentation in Colonoscopy using Hybrid Yolo-SAM 2 Model

Authors: Mobina Mansoori, Sajjad Shahabodini, Jamshid Abouei, Konstantinos N. Plataniotis, Arash Mohammadi

Abstract: Early diagnosis and treatment of polyps during colonoscopy are essential for reducing the incidence and mortality of Colorectal Cancer (CRC). However, the variability in polyp characteristics and the presence of artifacts in colonoscopy images and videos pose significant challenges for accurate and efficient polyp detection and segmentation. This paper presents a novel approach to polyp segmentati… ▽ More Early diagnosis and treatment of polyps during colonoscopy are essential for reducing the incidence and mortality of Colorectal Cancer (CRC). However, the variability in polyp characteristics and the presence of artifacts in colonoscopy images and videos pose significant challenges for accurate and efficient polyp detection and segmentation. This paper presents a novel approach to polyp segmentation by integrating the Segment Anything Model (SAM 2) with the YOLOv8 model. Our method leverages YOLOv8's bounding box predictions to autonomously generate input prompts for SAM 2, thereby reducing the need for manual annotations. We conducted exhaustive tests on five benchmark colonoscopy image datasets and two colonoscopy video datasets, demonstrating that our method exceeds state-of-the-art models in both image and video segmentation tasks. Notably, our approach achieves high segmentation accuracy using only bounding box annotations, significantly reducing annotation time and effort. This advancement holds promise for enhancing the efficiency and scalability of polyp detection in clinical settings https://github.com/sajjad-sh33/YOLO_SAM2. △ Less

Submitted 14 September, 2024; originally announced September 2024.

arXiv:2408.05892 [pdf, other]

Polyp SAM 2: Advancing Zero shot Polyp Segmentation in Colorectal Cancer Detection

Authors: Mobina Mansoori, Sajjad Shahabodini, Jamshid Abouei, Konstantinos N. Plataniotis, Arash Mohammadi

Abstract: Polyp segmentation plays a crucial role in the early detection and diagnosis of colorectal cancer. However, obtaining accurate segmentations often requires labor-intensive annotations and specialized models. Recently, Meta AI Research released a general Segment Anything Model 2 (SAM 2), which has demonstrated promising performance in several segmentation tasks. In this manuscript, we evaluate the… ▽ More Polyp segmentation plays a crucial role in the early detection and diagnosis of colorectal cancer. However, obtaining accurate segmentations often requires labor-intensive annotations and specialized models. Recently, Meta AI Research released a general Segment Anything Model 2 (SAM 2), which has demonstrated promising performance in several segmentation tasks. In this manuscript, we evaluate the performance of SAM 2 in segmenting polyps under various prompted settings. We hope this report will provide insights to advance the field of polyp segmentation and promote more interesting work in the future. This project is publicly available at https://github.com/ sajjad-sh33/Polyp-SAM-2. △ Less

Submitted 7 September, 2024; v1 submitted 11 August, 2024; originally announced August 2024.

arXiv:2402.10851 [pdf, other]

HistoSegCap: Capsules for Weakly-Supervised Semantic Segmentation of Histological Tissue Type in Whole Slide Images

Authors: Mobina Mansoori, Sajjad Shahabodini, Jamshid Abouei, Arash Mohammadi, Konstantinos N. Plataniotis

Abstract: Digital pathology involves converting physical tissue slides into high-resolution Whole Slide Images (WSIs), which pathologists analyze for disease-affected tissues. However, large histology slides with numerous microscopic fields pose challenges for visual search. To aid pathologists, Computer Aided Diagnosis (CAD) systems offer visual assistance in efficiently examining WSIs and identifying diag… ▽ More Digital pathology involves converting physical tissue slides into high-resolution Whole Slide Images (WSIs), which pathologists analyze for disease-affected tissues. However, large histology slides with numerous microscopic fields pose challenges for visual search. To aid pathologists, Computer Aided Diagnosis (CAD) systems offer visual assistance in efficiently examining WSIs and identifying diagnostically relevant regions. This paper presents a novel histopathological image analysis method employing Weakly Supervised Semantic Segmentation (WSSS) based on Capsule Networks, the first such application. The proposed model is evaluated using the Atlas of Digital Pathology (ADP) dataset and its performance is compared with other histopathological semantic segmentation methodologies. The findings underscore the potential of Capsule Networks in enhancing the precision and efficiency of histopathological image analysis. Experimental results show that the proposed model outperforms traditional methods in terms of accuracy and the mean Intersection-over-Union (mIoU) metric. △ Less

Submitted 16 February, 2024; originally announced February 2024.

arXiv:2402.10846 [pdf, other]

FedD2S: Personalized Data-Free Federated Knowledge Distillation

Authors: Kawa Atapour, S. Jamal Seyedmohammadi, Jamshid Abouei, Arash Mohammadi, Konstantinos N. Plataniotis

Abstract: This paper addresses the challenge of mitigating data heterogeneity among clients within a Federated Learning (FL) framework. The model-drift issue, arising from the noniid nature of client data, often results in suboptimal personalization of a global model compared to locally trained models for each client. To tackle this challenge, we propose a novel approach named FedD2S for Personalized Federa… ▽ More This paper addresses the challenge of mitigating data heterogeneity among clients within a Federated Learning (FL) framework. The model-drift issue, arising from the noniid nature of client data, often results in suboptimal personalization of a global model compared to locally trained models for each client. To tackle this challenge, we propose a novel approach named FedD2S for Personalized Federated Learning (pFL), leveraging knowledge distillation. FedD2S incorporates a deep-to-shallow layer-dropping mechanism in the data-free knowledge distillation process to enhance local model personalization. Through extensive simulations on diverse image datasets-FEMNIST, CIFAR10, CINIC0, and CIFAR100-we compare FedD2S with state-of-the-art FL baselines. The proposed approach demonstrates superior performance, characterized by accelerated convergence and improved fairness among clients. The introduced layer-dropping technique effectively captures personalized knowledge, resulting in enhanced performance compared to alternative FL models. Moreover, we investigate the impact of key hyperparameters, such as the participation ratio and layer-dropping rate, providing valuable insights into the optimal configuration for FedD2S. The findings demonstrate the efficacy of adaptive layer-dropping in the knowledge distillation process to achieve enhanced personalization and performance across diverse datasets and tasks. △ Less

Submitted 16 February, 2024; originally announced February 2024.

arXiv:2303.12097 [pdf, other]

CLSA: Contrastive Learning-based Survival Analysis for Popularity Prediction in MEC Networks

Authors: Zohreh Hajiakhondi-Meybodi, Arash Mohammadi, Jamshid Abouei, Konstantinos N. Plataniotis

Abstract: Mobile Edge Caching (MEC) integrated with Deep Neural Networks (DNNs) is an innovative technology with significant potential for the future generation of wireless networks, resulting in a considerable reduction in users' latency. The MEC network's effectiveness, however, heavily relies on its capacity to predict and dynamically update the storage of caching nodes with the most popular contents. To… ▽ More Mobile Edge Caching (MEC) integrated with Deep Neural Networks (DNNs) is an innovative technology with significant potential for the future generation of wireless networks, resulting in a considerable reduction in users' latency. The MEC network's effectiveness, however, heavily relies on its capacity to predict and dynamically update the storage of caching nodes with the most popular contents. To be effective, a DNN-based popularity prediction model needs to have the ability to understand the historical request patterns of content, including their temporal and spatial correlations. Existing state-of-the-art time-series DNN models capture the latter by simultaneously inputting the sequential request patterns of multiple contents to the network, considerably increasing the size of the input sample. This motivates us to address this challenge by proposing a DNN-based popularity prediction framework based on the idea of contrasting input samples against each other, designed for the Unmanned Aerial Vehicle (UAV)-aided MEC networks. Referred to as the Contrastive Learning-based Survival Analysis (CLSA), the proposed architecture consists of a self-supervised Contrastive Learning (CL) model, where the temporal information of sequential requests is learned using a Long Short Term Memory (LSTM) network as the encoder of the CL architecture. Followed by a Survival Analysis (SA) network, the output of the proposed CLSA architecture is probabilities for each content's future popularity, which are then sorted in descending order to identify the Top-K popular contents. Based on the simulation results, the proposed CLSA architecture outperforms its counterparts across the classification accuracy and cache-hit ratio. △ Less

Submitted 21 March, 2023; originally announced March 2023.

arXiv:2210.15125 [pdf, other]

ViT-CAT: Parallel Vision Transformers with Cross Attention Fusion for Popularity Prediction in MEC Networks

Authors: Zohreh HajiAkhondi-Meybodi, Arash Mohammadi, Ming Hou, Jamshid Abouei, Konstantinos N. Plataniotis

Abstract: Mobile Edge Caching (MEC) is a revolutionary technology for the Sixth Generation (6G) of wireless networks with the promise to significantly reduce users' latency via offering storage capacities at the edge of the network. The efficiency of the MEC network, however, critically depends on its ability to dynamically predict/update the storage of caching nodes with the top-K popular contents. Convent… ▽ More Mobile Edge Caching (MEC) is a revolutionary technology for the Sixth Generation (6G) of wireless networks with the promise to significantly reduce users' latency via offering storage capacities at the edge of the network. The efficiency of the MEC network, however, critically depends on its ability to dynamically predict/update the storage of caching nodes with the top-K popular contents. Conventional statistical caching schemes are not robust to the time-variant nature of the underlying pattern of content requests, resulting in a surge of interest in using Deep Neural Networks (DNNs) for time-series popularity prediction in MEC networks. However, existing DNN models within the context of MEC fail to simultaneously capture both temporal correlations of historical request patterns and the dependencies between multiple contents. This necessitates an urgent quest to develop and design a new and innovative popularity prediction architecture to tackle this critical challenge. The paper addresses this gap by proposing a novel hybrid caching framework based on the attention mechanism. Referred to as the parallel Vision Transformers with Cross Attention (ViT-CAT) Fusion, the proposed architecture consists of two parallel ViT networks, one for collecting temporal correlation, and the other for capturing dependencies between different contents. Followed by a Cross Attention (CA) module as the Fusion Center (FC), the proposed ViT-CAT is capable of learning the mutual information between temporal and spatial correlations, as well, resulting in improving the classification accuracy, and decreasing the model's complexity about 8 times. Based on the simulation results, the proposed ViT-CAT architecture outperforms its counterparts across the classification accuracy, complexity, and cache-hit ratio. △ Less

Submitted 26 October, 2022; originally announced October 2022.

arXiv:2210.05874 [pdf, other]

Multi-Content Time-Series Popularity Prediction with Multiple-Model Transformers in MEC Networks

Authors: Zohreh HajiAkhondi-Meybodi, Arash Mohammadi, Ming Hou, Elahe Rahimian, Shahin Heidarian, Jamshid Abouei, Konstantinos N. Plataniotis

Abstract: Coded/uncoded content placement in Mobile Edge Caching (MEC) has evolved as an efficient solution to meet the significant growth of global mobile data traffic by boosting the content diversity in the storage of caching nodes. To meet the dynamic nature of the historical request pattern of multimedia contents, the main focus of recent researches has been shifted to develop data-driven and real-time… ▽ More Coded/uncoded content placement in Mobile Edge Caching (MEC) has evolved as an efficient solution to meet the significant growth of global mobile data traffic by boosting the content diversity in the storage of caching nodes. To meet the dynamic nature of the historical request pattern of multimedia contents, the main focus of recent researches has been shifted to develop data-driven and real-time caching schemes. In this regard and with the assumption that users' preferences remain unchanged over a short horizon, the Top-K popular contents are identified as the output of the learning model. Most existing datadriven popularity prediction models, however, are not suitable for the coded/uncoded content placement frameworks. On the one hand, in coded/uncoded content placement, in addition to classifying contents into two groups, i.e., popular and nonpopular, the probability of content request is required to identify which content should be stored partially/completely, where this information is not provided by existing data-driven popularity prediction models. On the other hand, the assumption that users' preferences remain unchanged over a short horizon only works for content with a smooth request pattern. To tackle these challenges, we develop a Multiple-model (hybrid) Transformer-based Edge Caching (MTEC) framework with higher generalization ability, suitable for various types of content with different time-varying behavior, that can be adapted with coded/uncoded content placement frameworks. Simulation results corroborate the effectiveness of the proposed MTEC caching framework in comparison to its counterparts in terms of the cache-hit ratio, classification accuracy, and the transferred byte volume. △ Less

Submitted 11 October, 2022; originally announced October 2022.

arXiv:2112.00633 [pdf, other]

TEDGE-Caching: Transformer-based Edge Caching Towards 6G Networks

Authors: Zohreh Hajiakhondi Meybodi, Arash Mohammadi, Elahe Rahimian, Shahin Heidarian, Jamshid Abouei, Konstantinos N. Plataniotis

Abstract: As a consequence of the COVID-19 pandemic, the demand for telecommunication for remote learning/working and telemedicine has significantly increased. Mobile Edge Caching (MEC) in the 6G networks has been evolved as an efficient solution to meet the phenomenal growth of the global mobile data traffic by bringing multimedia content closer to the users. Although massive connectivity enabled by MEC ne… ▽ More As a consequence of the COVID-19 pandemic, the demand for telecommunication for remote learning/working and telemedicine has significantly increased. Mobile Edge Caching (MEC) in the 6G networks has been evolved as an efficient solution to meet the phenomenal growth of the global mobile data traffic by bringing multimedia content closer to the users. Although massive connectivity enabled by MEC networks will significantly increase the quality of communications, there are several key challenges ahead. The limited storage of edge nodes, the large size of multimedia content, and the time-variant users' preferences make it critical to efficiently and dynamically predict the popularity of content to store the most upcoming requested ones before being requested. Recent advancements in Deep Neural Networks (DNNs) have drawn much research attention to predict the content popularity in proactive caching schemes. Existing DNN models in this context, however, suffer from longterm dependencies, computational complexity, and unsuitability for parallel computing. To tackle these challenges, we propose an edge caching framework incorporated with the attention-based Vision Transformer (ViT) neural network, referred to as the Transformer-based Edge (TEDGE) caching, which to the best of our knowledge, is being studied for the first time. Moreover, the TEDGE caching framework requires no data pre-processing and additional contextual information. Simulation results corroborate the effectiveness of the proposed TEDGE caching framework in comparison to its counterparts. △ Less

Submitted 1 December, 2021; originally announced December 2021.

arXiv:2101.11787 [pdf, other]

Joint Transmission Scheme and Coded Content Placement in Cluster-centric UAV-aided Cellular Networks

Authors: Zohreh HajiAkhondi-Meybodi, Arash Mohammadi, Jamshid Abouei, Ming Hou, Konstantinos N. Plataniotis

Abstract: Recently, as a consequence of the COVID-19 pandemic, dependence on telecommunication for remote working and telemedicine has significantly increased. In cellular networks, incorporation of Unmanned Aerial Vehicles (UAVs) can result in enhanced connectivity for outdoor users due to the high probability of establishing Line of Sight (LoS) links. The UAV's limited battery life and its signal attenuat… ▽ More Recently, as a consequence of the COVID-19 pandemic, dependence on telecommunication for remote working and telemedicine has significantly increased. In cellular networks, incorporation of Unmanned Aerial Vehicles (UAVs) can result in enhanced connectivity for outdoor users due to the high probability of establishing Line of Sight (LoS) links. The UAV's limited battery life and its signal attenuation in indoor areas, however, make it inefficient to manage users' requests in indoor environments. Referred to as the Cluster centric and Coded UAV-aided Femtocaching (CCUF) framework, the network's coverage in both indoor and outdoor environments increases via a two-phase clustering for FAPs' formation and UAVs' deployment. First objective is to increase the content diversity. In this context, we propose a coded content placement in a cluster-centric cellular network, which is integrated with the Coordinated Multi-Point (CoMP) to mitigate the inter-cell interference in edge areas. Then, we compute, experimentally, the number of coded contents to be stored in each caching node to increase the cache-hit ratio, Signal-to-Interference-plus-Noise Ratio (SINR), and cache diversity and decrease the users' access delay and cache redundancy for different content popularity profiles. Capitalizing on clustering, our second objective is to assign the best caching node to indoor/outdoor users for managing their requests. In this regard, we define the movement speed of ground users as the decision metric of the transmission scheme for serving outdoor users' requests to avoid frequent handovers between FAPs and increase the battery life of UAVs. Simulation results illustrate that the proposed CCUF implementation increases the cache hit-ratio, SINR, and cache diversity and decrease the users' access delay, cache redundancy and UAVs' energy consumption. △ Less

Submitted 15 July, 2021; v1 submitted 27 January, 2021; originally announced January 2021.

arXiv:1803.04966 [pdf]

An Improvement Technique based on Structural Similarity Thresholding for Digital Watermarking

Authors: Amin Banitalebi-Dehkordi, Mehdi Banitalebi-Dehkordi, Jamshid Abouei, Said Nader-Esfahani

Abstract: Digital watermarking is extensively used in ownership authentication and copyright protection. In this paper, we propose an efficient thresholding scheme to improve the watermark embedding procedure in an image. For the proposed algorithm, watermark casting is performed separately in each block of an image, and embedding in each block continues until a certain structural similarity threshold is re… ▽ More Digital watermarking is extensively used in ownership authentication and copyright protection. In this paper, we propose an efficient thresholding scheme to improve the watermark embedding procedure in an image. For the proposed algorithm, watermark casting is performed separately in each block of an image, and embedding in each block continues until a certain structural similarity threshold is reached. Numerical evaluations demonstrate that our scheme improves the imperceptibility of the watermark when the capacity remains fix, and at the same time, robustness against attacks is assured. The proposed method is applicable to most image watermarking algorithms. We verify this issue on watermarking schemes in Discrete Cosine Transform (DCT), wavelet, and spatial domain. △ Less

Submitted 13 March, 2018; originally announced March 2018.

Journal ref: ACENG, 2014

Showing 1–12 of 12 results for author: Abouei, J