-
A Survey on Class-Agnostic Counting: Advancements from Reference-Based to Open-World Text-Guided Approaches
Authors:
Luca Ciampi,
Ali Azmoudeh,
Elif Ecem Akbaba,
Erdi Sarıtaş,
Ziya Ata Yazıcı,
Hazım Kemal Ekenel,
Giuseppe Amato,
Fabrizio Falchi
Abstract:
Visual object counting has recently shifted towards class-agnostic counting (CAC), which addresses the challenge of counting objects across arbitrary categories -- a crucial capability for flexible and generalizable counting systems. Unlike humans, who effortlessly identify and count objects from diverse categories without prior knowledge, most existing counting methods are restricted to enumerating instances of known classes, requiring extensive labeled datasets for training and struggling in open-vocabulary settings. In contrast, CAC aims to count objects belonging to classes never seen during training, operating in a few-shot setting. In this paper, we present the first comprehensive review of CAC methodologies. We propose a taxonomy to categorize CAC approaches into three paradigms based on how target object classes can be specified: reference-based, reference-less, and open-world text-guided. Reference-based approaches achieve state-of-the-art performance by relying on exemplar-guided mechanisms. Reference-less methods eliminate exemplar dependency by leveraging inherent image patterns. Finally, open-world text-guided methods use vision-language models, enabling object class descriptions via textual prompts, offering a flexible and promising solution. Based on this taxonomy, we provide an overview of the architectures of 29 CAC approaches and report their results on gold-standard benchmarks. We compare their performance and discuss their strengths and limitations. Specifically, we present results on the FSC-147 dataset, setting a leaderboard using gold-standard metrics, and on the CARPK dataset to assess generalization capabilities. Finally, we offer a critical discussion of persistent challenges, such as annotation dependency and generalization, alongside future directions. We believe this survey will be a valuable resource, showcasing CAC advancements and guiding future research.
Submitted 28 April, 2025; v1 submitted 31 January, 2025;
originally announced January 2025.
-
Impact of Face Alignment on Face Image Quality
Authors:
Eren Onaran,
Erdi Sarıtaş,
Hazım Kemal Ekenel
Abstract:
Face alignment is a crucial step in preparing face images for feature extraction in facial analysis tasks. For applications such as face recognition, facial expression recognition, and facial attribute classification, alignment is widely utilized during both training and inference to standardize the positions of key landmarks in the face. It is well known that whether alignment is applied, and by which method, significantly affects the performance of facial analysis models. However, the impact of alignment on face image quality has not been thoroughly investigated. Current face image quality assessment (FIQA) studies often assume alignment as a prerequisite but do not explicitly evaluate how alignment affects quality metrics, especially with the advent of modern deep learning-based detectors that integrate detection and landmark localization. To address this need, our study examines the impact of face alignment on face image quality scores. We conducted experiments on the LFW, IJB-B, and SCFace datasets, employing MTCNN and RetinaFace models for face detection and alignment. To evaluate face image quality, we utilized several assessment methods, including SER-FIQ, FaceQAN, DifFIQA, and SDD-FIQA. Our analysis included examining quality score distributions for the LFW and IJB-B datasets and analyzing average quality scores at varying distances in the SCFace dataset. Our findings reveal that face image quality assessment methods are sensitive to alignment. Moreover, this sensitivity increases under challenging real-life conditions, highlighting the importance of evaluating alignment's role in quality assessment.
Submitted 16 December, 2024;
originally announced December 2024.
-
Physics-Driven Autoregressive State Space Models for Medical Image Reconstruction
Authors:
Bilal Kabas,
Fuat Arslan,
Valiyeh A. Nezhad,
Saban Ozturk,
Emine U. Saritas,
Tolga Çukur
Abstract:
Medical image reconstruction from undersampled acquisitions is an ill-posed problem involving inversion of the imaging operator linking measurement and image domains. Physics-driven (PD) models have gained prominence in reconstruction tasks due to their desirable performance and generalization. These models jointly promote data fidelity and artifact suppression, typically by combining data-consistency mechanisms with learned network modules. Artifact suppression depends on the network's ability to disentangle artifacts from true tissue signals, both of which can exhibit contextual structure across diverse spatial scales. Convolutional neural networks (CNNs) are strong in capturing local correlations, albeit relatively insensitive to non-local context. While transformers promise to alleviate this limitation, practical implementations frequently involve design compromises to reduce computational cost by balancing local and non-local sensitivity, occasionally resulting in performance comparable to or trailing that of CNNs. To enhance contextual sensitivity without incurring high complexity, we introduce a novel physics-driven autoregressive state-space model (MambaRoll) for medical image reconstruction. In each cascade of its unrolled architecture, MambaRoll employs a physics-driven state-space module (PD-SSM) to aggregate contextual features efficiently at a given spatial scale, and autoregressively predicts finer-scale feature maps conditioned on coarser-scale features to capture multi-scale context. Learning across scales is further enhanced via a deep multi-scale decoding (DMSD) loss tailored to the autoregressive prediction task. Demonstrations on accelerated MRI and sparse-view CT reconstructions show that MambaRoll consistently outperforms state-of-the-art data-driven and physics-driven methods based on CNN, transformer, and SSM backbones.
Submitted 8 July, 2025; v1 submitted 12 December, 2024;
originally announced December 2024.
-
Analyzing the Feature Extractor Networks for Face Image Synthesis
Authors:
Erdi Sarıtaş,
Hazım Kemal Ekenel
Abstract:
Advancements like Generative Adversarial Networks have attracted the attention of researchers toward face image synthesis to generate ever more realistic images. Consequently, the need for evaluation criteria to assess the realism of the generated images has become apparent. While FID computed with InceptionV3 is one of the primary choices for benchmarking, concerns about InceptionV3's limitations for face images have emerged. This study investigates the behavior of diverse feature extractors -- InceptionV3, CLIP, DINOv2, and ArcFace -- considering a variety of metrics -- FID, KID, and Precision & Recall. The FFHQ dataset serves as the target domain, while the CelebA-HQ dataset and synthetic datasets generated using StyleGAN2 and Projected FastGAN serve as the source domains. Experiments include in-depth analysis of the features: $L_2$ normalization, model attention during extraction, and domain distributions in the feature space. We aim to give valuable insights into the behavior of feature extractors for evaluating face image synthesis methodologies. The code is publicly available at https://github.com/ThEnded32/AnalyzingFeatureExtractors.
Submitted 4 June, 2024;
originally announced June 2024.
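As a rough illustration of the FID metric mentioned in the abstract above, the following sketch computes the Fréchet distance between two sets of feature vectors under a Gaussian assumption. It is not taken from the authors' repository: the function name `frechet_distance` and the `l2_normalize` flag are hypothetical, the features are assumed to come from any extractor (InceptionV3, CLIP, DINOv2, ArcFace), and the matrix square-root trace is obtained from the eigenvalues of the covariance product rather than from a dedicated matrix square-root routine.

```python
import numpy as np


def frechet_distance(feats_a, feats_b, l2_normalize=False):
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (num_samples, feature_dim).
    l2_normalize: hypothetical flag; if True, each feature vector is scaled
    to unit L2 norm first (rows are assumed to be non-zero).
    """
    a = np.asarray(feats_a, dtype=np.float64)
    b = np.asarray(feats_b, dtype=np.float64)
    if l2_normalize:
        a = a / np.linalg.norm(a, axis=1, keepdims=True)
        b = b / np.linalg.norm(b, axis=1, keepdims=True)

    # Fit a Gaussian (mean and covariance) to each feature set.
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    cov_a = np.cov(a, rowvar=False)
    cov_b = np.cov(b, rowvar=False)

    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs; clip tiny negative numerical noise.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
```

Shifting every feature by a constant vector leaves the covariance unchanged, so the distance reduces to the squared norm of the shift, which gives a quick sanity check for any implementation.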
-
Analyzing the Effect of Combined Degradations on Face Recognition
Authors:
Erdi Sarıtaş,
Hazım Kemal Ekenel
Abstract:
A face recognition model is typically trained on large datasets of images that may be collected from controlled environments. This results in performance discrepancies when applied to real-world scenarios due to the domain gap between clean and in-the-wild images. Therefore, some researchers have investigated the robustness of these models by analyzing synthetic degradations. Yet, existing studies have mostly focused on single degradation factors, which may not fully capture the complexity of real-world degradations. This work addresses this problem by analyzing the impact of both single and combined degradations using a real-world degradation pipeline extended with under/over-exposure conditions. We use the LFW dataset for our experiments and assess the model's performance based on verification accuracy. Results reveal that the model behaves differently under single and combined degradations: a degradation whose individual effect is negligible can significantly lower performance when combined with others. This work emphasizes the importance of accounting for real-world complexity to assess the robustness of face recognition models in real-world settings. The code is publicly available at https://github.com/ThEnded32/AnalyzingCombinedDegradations.
Submitted 4 June, 2024;
originally announced June 2024.
-
FD-Net: An Unsupervised Deep Forward-Distortion Model for Susceptibility Artifact Correction in EPI
Authors:
Abdallah Zaid Alkilani,
Tolga Çukur,
Emine Ulku Saritas
Abstract:
Recent learning-based correction approaches in echo planar imaging (EPI) estimate a displacement field, unwarp the reversed-PE image pair with the estimated field, and average the unwarped pair to yield a corrected image. Unsupervised learning in these unwarping-based methods is commonly attained via a similarity constraint between the unwarped images in reversed-PE directions, neglecting consistency with the acquired EPI images. This work introduces an unsupervised deep-learning method for fast and effective correction of susceptibility artifacts in reversed phase-encode (PE) image pairs acquired with EPI. FD-Net predicts both the susceptibility-induced displacement field and the underlying anatomically-correct image. Unlike previous methods, FD-Net enforces the forward-distortions of the correct image in both PE directions to be consistent with the acquired reversed-PE image pair. FD-Net further leverages a multiresolution architecture to maintain high local and global performance. FD-Net performs competitively with a gold-standard reference method (TOPUP) in image quality, while enabling a leap in computational efficiency. Furthermore, FD-Net outperforms recent unwarping-based methods for unsupervised correction in terms of both image and field quality. The unsupervised FD-Net method introduces a deep forward-distortion approach to enable fast, high-fidelity correction of susceptibility artifacts in EPI by maintaining consistency with measured data. Therefore, it holds great promise for improving the anatomical accuracy of EPI.
Submitted 18 March, 2023;
originally announced March 2023.
-
COVID-19 Detection from Respiratory Sounds with Hierarchical Spectrogram Transformers
Authors:
Idil Aytekin,
Onat Dalmaz,
Kaan Gonc,
Haydar Ankishan,
Emine U Saritas,
Ulas Bagci,
Haydar Celik,
Tolga Cukur
Abstract:
Monitoring of prevalent airborne diseases such as COVID-19 characteristically involves respiratory assessments. While auscultation is a mainstream method for preliminary screening of disease symptoms, its utility is hampered by the need for dedicated hospital visits. Remote monitoring based on recordings of respiratory sounds on portable devices is a promising alternative, which can assist in early assessment of COVID-19 that primarily affects the lower respiratory tract. In this study, we introduce a novel deep learning approach to distinguish patients with COVID-19 from healthy controls given audio recordings of cough or breathing sounds. The proposed approach leverages a novel hierarchical spectrogram transformer (HST) on spectrogram representations of respiratory sounds. HST embodies self-attention mechanisms over local windows in spectrograms, and window size is progressively grown over model stages to capture local to global context. HST is compared against state-of-the-art conventional and deep-learning baselines. Demonstrations on crowd-sourced multi-national datasets indicate that HST outperforms competing methods, achieving over 83% area under the receiver operating characteristic curve (AUC) in detecting COVID-19 cases.
Submitted 26 May, 2023; v1 submitted 19 July, 2022;
originally announced July 2022.
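For readers unfamiliar with the AUC figure quoted in the abstract above, the sketch below shows one standard way to compute the area under the ROC curve: as the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not data or code from the paper.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise ranking.

    labels: iterable of 0 (healthy control) or 1 (positive case).
    scores: iterable of classifier scores, higher meaning more likely positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")

    # Count the pairs where the positive sample is ranked above the negative
    # one; a tie contributes half a win.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ordered correctly.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

The pairwise form is quadratic in the number of samples, which is acceptable for a sketch; rank-based implementations give the same value in O(n log n).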