-
Self-Supervised Multi-View Representation Learning using Vision-Language Model for 3D/4D Facial Expression Recognition
Authors:
Muzammil Behzad
Abstract:
Facial expression recognition (FER) is a fundamental task in affective computing with applications in human-computer interaction, mental health analysis, and behavioral understanding. In this paper, we propose SMILE-VLM, a self-supervised vision-language model for 3D/4D FER that unifies multiview visual representation learning with natural language supervision. SMILE-VLM learns robust, semantically aligned, and view-invariant embeddings through three core components: multiview decorrelation via a Barlow Twins-style loss, vision-language contrastive alignment, and cross-modal redundancy minimization. Our framework achieves state-of-the-art performance on multiple benchmarks. We further extend SMILE-VLM to the task of 4D micro-expression recognition (MER) to recognize subtle affective cues. Extensive results demonstrate that SMILE-VLM not only surpasses existing unsupervised methods but also matches or exceeds supervised baselines, offering a scalable and annotation-efficient solution for expressive facial behavior understanding.
Submitted 1 June, 2025;
originally announced June 2025.
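For readers unfamiliar with the Barlow Twins-style decorrelation mentioned in the abstract above, the following is a minimal NumPy sketch of the generic loss: it pushes the cross-correlation matrix between two views' embeddings toward the identity. This is not SMILE-VLM's implementation; the function name, the `lambda_offdiag` weight, and the toy data are assumptions made for illustration only.

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lambda_offdiag=5e-3, eps=1e-8):
    """Generic Barlow Twins-style loss for two views of shape (batch, dim).

    Hypothetical helper for illustration; not taken from the paper.
    """
    n = z_a.shape[0]
    # Standardize each embedding dimension over the batch.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    # Cross-correlation matrix between the two views, shape (dim, dim).
    c = z_a.T @ z_b / n
    diag = np.diag(c)
    # Invariance term: diagonal entries should be 1.
    on_diag = np.sum((diag - 1.0) ** 2)
    # Redundancy-reduction term: off-diagonal entries should be 0.
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)
    return on_diag + lambda_offdiag * off_diag

rng = np.random.default_rng(0)
view_a = rng.normal(size=(256, 8))
view_b = rng.normal(size=(256, 8))          # unrelated second view
loss_same = barlow_twins_loss(view_a, view_a)
loss_diff = barlow_twins_loss(view_a, view_b)
```

Two identical views give a loss near zero, while two unrelated views are penalized mainly through the diagonal term.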
-
Beam-Guided Knowledge Replay for Knowledge-Rich Image Captioning using Vision-Language Model
Authors:
Reem AlJunaid,
Muzammil Behzad
Abstract:
Generating informative and knowledge-rich image captions remains a challenge for many existing captioning models, which often produce generic descriptions that lack specificity and contextual depth. To address this limitation, we propose KRCapVLM, a novel knowledge replay-based image captioning framework using a vision-language model. We incorporate beam search decoding to generate more diverse and coherent captions. We also integrate attention-based modules into the image encoder to enhance feature representation. Finally, we employ training schedulers to improve stability and ensure smoother convergence during training. These proposals yield substantial gains in both caption quality and knowledge recognition. Our proposed model demonstrates clear improvements in both the accuracy of knowledge recognition and the overall quality of generated captions. It shows a stronger ability to generalize to previously unseen knowledge concepts, producing more informative and contextually relevant descriptions. These results indicate the effectiveness of our approach in enhancing the model's capacity to generate meaningful, knowledge-grounded captions across a range of scenarios.
Submitted 29 May, 2025;
originally announced May 2025.
-
Underwater Diffusion Attention Network with Contrastive Language-Image Joint Learning for Underwater Image Enhancement
Authors:
Afrah Shaahid,
Muzammil Behzad
Abstract:
Underwater images are often affected by complex degradations such as light absorption, scattering, color casts, and artifacts, making enhancement critical for effective object detection, recognition, and scene understanding in aquatic environments. Existing methods, especially diffusion-based approaches, typically rely on synthetic paired datasets due to the scarcity of real underwater references, introducing bias and limiting generalization. Furthermore, fine-tuning these models can degrade learned priors, resulting in unrealistic enhancements due to domain shifts. To address these challenges, we propose UDAN-CLIP, an image-to-image diffusion framework pre-trained on synthetic underwater datasets and enhanced with a customized classifier based on a vision-language model, a spatial attention module, and a novel CLIP-Diffusion loss. The classifier preserves natural in-air priors and semantically guides the diffusion process, while the spatial attention module focuses on correcting localized degradations such as haze and low contrast. The proposed CLIP-Diffusion loss further strengthens visual-textual alignment and helps maintain semantic consistency during enhancement. The proposed contributions empower our UDAN-CLIP model to perform more effective underwater image enhancement, producing results that are not only visually compelling but also more realistic and detail-preserving. These improvements are consistently validated through both quantitative metrics and qualitative visual comparisons, demonstrating the model's ability to correct distortions and restore natural appearance in challenging underwater conditions.
Submitted 26 May, 2025;
originally announced May 2025.
-
Deformable Attentive Visual Enhancement for Referring Segmentation Using Vision-Language Model
Authors:
Alaa Dalaq,
Muzammil Behzad
Abstract:
Image segmentation is a fundamental task in computer vision, aimed at partitioning an image into semantically meaningful regions. Referring image segmentation extends this task by using natural language expressions to localize specific objects, requiring effective integration of visual and linguistic information. In this work, we propose SegVLM, a vision-language model that incorporates architectural improvements to enhance segmentation accuracy and cross-modal alignment. The model integrates squeeze-and-excitation (SE) blocks for dynamic feature recalibration, deformable convolutions for geometric adaptability, and residual connections for deep feature learning. We also introduce a novel referring-aware fusion (RAF) loss that balances region-level alignment, boundary precision, and class imbalance. Extensive experiments and ablation studies demonstrate that each component contributes to consistent performance improvements. SegVLM also shows strong generalization across diverse datasets and referring expression scenarios.
Submitted 25 May, 2025;
originally announced May 2025.
-
MOSAIC: A Multi-View 2.5D Organ Slice Selector with Cross-Attentional Reasoning for Anatomically-Aware CT Localization in Medical Organ Segmentation
Authors:
Hania Ghouse,
Muzammil Behzad
Abstract:
Efficient and accurate multi-organ segmentation from abdominal CT volumes is a fundamental challenge in medical image analysis. Existing 3D segmentation approaches are computationally and memory intensive, often processing entire volumes that contain many anatomically irrelevant slices. Meanwhile, 2D methods suffer from class imbalance and lack cross-view contextual awareness. To address these limitations, we propose a novel, anatomically-aware slice selector pipeline that reduces input volume prior to segmentation. Our unified framework introduces a vision-language model (VLM) for cross-view organ presence detection using fused tri-slice (2.5D) representations from axial, sagittal, and coronal planes. Our proposed model acts as an "expert" in anatomical localization, reasoning over multi-view representations to selectively retain slices with high structural relevance. This enables spatially consistent filtering across orientations while preserving contextual cues. More importantly, since standard segmentation metrics such as Dice or IoU fail to measure the spatial precision of such slice selection, we introduce a novel metric, Slice Localization Concordance (SLC), which jointly captures anatomical coverage and spatial alignment with organ-centric reference slices. Unlike segmentation-specific metrics, SLC provides a model-agnostic evaluation of localization fidelity. Our model offers substantial improvements over several baselines across all organs, demonstrating both accurate and reliable organ-focused slice filtering. These results show that our method enables efficient and spatially consistent organ filtering, thereby significantly reducing downstream segmentation cost while maintaining high anatomical fidelity.
Submitted 15 May, 2025;
originally announced May 2025.
-
Unsupervised Multiview Contrastive Language-Image Joint Learning with Pseudo-Labeled Prompts Via Vision-Language Model for 3D/4D Facial Expression Recognition
Authors:
Muzammil Behzad
Abstract:
In this paper, we introduce MultiviewVLM, a vision-language model designed for unsupervised contrastive multiview representation learning of facial emotions from 3D/4D data. Our architecture integrates pseudo-labels derived from generated textual prompts to guide implicit alignment of emotional semantics. To capture shared information across multi-views, we propose a joint embedding space that aligns multiview representations without requiring explicit supervision. We further enhance the discriminability of our model through a novel multiview contrastive learning strategy that leverages stable positive-negative pair sampling. A gradient-friendly loss function is introduced to promote smoother and more stable convergence, and the model is optimized for distributed training to ensure scalability. Extensive experiments demonstrate that MultiviewVLM outperforms existing state-of-the-art methods and can be easily adapted to various real-world applications with minimal modifications.
Submitted 14 May, 2025;
originally announced May 2025.
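As background for the contrastive language-image learning named in the abstract above, here is a minimal NumPy sketch of a generic CLIP-style symmetric contrastive loss over paired image and text embeddings. It is not MultiviewVLM's gradient-friendly loss or its pair-sampling strategy, which the abstract does not specify; the function name, the `temperature` default, and the toy embeddings are assumptions for illustration.

```python
import numpy as np

def symmetric_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Generic CLIP-style loss; row i of each input is a matched pair.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # L2-normalize so that dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The correct match for row i is column i.
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(logits.T))

rng = np.random.default_rng(0)
images = rng.normal(size=(16, 32))
loss_matched = symmetric_contrastive_loss(images, images)
# Shift the text rows by one so every pair is mismatched.
loss_mismatched = symmetric_contrastive_loss(images, np.roll(images, 1, axis=0))
```

Correctly paired embeddings yield a much lower loss than the deliberately mismatched pairing.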
-
Contrastive Language-Image Learning with Augmented Textual Prompts for 3D/4D FER Using Vision-Language Model
Authors:
Muzammil Behzad,
Guoying Zhao
Abstract:
In this paper, we introduce AffectVLM, a vision-language model designed to integrate multiviews for a semantically rich and visually comprehensive understanding of facial emotions from 3D/4D data. To effectively capture visual features, we propose a joint representation learning framework paired with a novel gradient-friendly loss function that accelerates model convergence towards optimal feature representation. Additionally, we introduce augmented textual prompts to enhance the model's linguistic capabilities and employ mixed view augmentation to expand the visual dataset. We also develop a Streamlit app for real-time interactive inference and enable distributed learning of the model. Extensive experiments validate the superior performance of AffectVLM across multiple benchmarks.
Submitted 28 April, 2025;
originally announced April 2025.
-
Towards Reading Beyond Faces for Sparsity-Aware 4D Affect Recognition
Authors:
Muzammil Behzad,
Nhat Vo,
Xiaobai Li,
Guoying Zhao
Abstract:
In this paper, we present a sparsity-aware deep network for automatic 4D facial expression recognition (FER). Given 4D data, we first propose a novel augmentation method to combat the data limitation problem for deep learning. This is achieved by projecting the input data into RGB and depth map images and then iteratively performing randomized channel concatenation. We also introduce an effective way to capture the facial muscle movements encoded in the given 3D landmarks from three orthogonal planes (TOP), yielding the TOP-landmarks over multi-views. Importantly, we then present a sparsity-aware deep network to compute the sparse representations of convolutional features over multi-views. This is not only effective for higher recognition accuracy but is also computationally convenient. For training, the TOP-landmarks and sparse representations are used to train a long short-term memory (LSTM) network. The refined predictions are achieved when the learned features collaborate over multi-views. Extensive experimental results achieved on the BU-4DFE dataset show the significance of our method over the state-of-the-art methods by reaching a promising accuracy of 99.69% for 4D FER.
Submitted 19 August, 2020; v1 submitted 8 February, 2020;
originally announced February 2020.
-
Landmarks-assisted Collaborative Deep Framework for Automatic 4D Facial Expression Recognition
Authors:
Muzammil Behzad,
Nhat Vo,
Xiaobai Li,
Guoying Zhao
Abstract:
We propose a novel landmarks-assisted collaborative end-to-end deep framework for automatic 4D FER. Using 4D face scan data, we calculate its various geometrical images, and afterwards use rank pooling to generate their dynamic images encapsulating important facial muscle movements over time. In addition, the given 3D landmarks are projected on a 2D plane as binary images and convolutional layers are used to extract sequences of feature vectors for every landmark video. During the training stage, the dynamic images are used to train an end-to-end deep network, while the feature vectors of landmark images are used to train a long short-term memory (LSTM) network. The final, improved set of expression predictions is obtained when the dynamic and landmark images collaborate over multi-views using the proposed deep framework. Performance results obtained from extensive experimentation on the widely-adopted BU-4DFE database under globally used settings prove that our proposed collaborative framework outperforms the state-of-the-art 4D FER methods and reaches a promising classification accuracy of 96.7%, demonstrating its effectiveness.
Submitted 7 February, 2020; v1 submitted 11 October, 2019;
originally announced October 2019.
-
Automatic 4D Facial Expression Recognition via Collaborative Cross-domain Dynamic Image Network
Authors:
Muzammil Behzad,
Nhat Vo,
Xiaobai Li,
Guoying Zhao
Abstract:
This paper proposes a novel 4D Facial Expression Recognition (FER) method using Collaborative Cross-domain Dynamic Image Network (CCDN). Given 4D face scan data, we first compute its geometrical images, and then combine their correlated information in the proposed cross-domain image representations. The acquired set is then used to generate cross-domain dynamic images (CDI) via rank pooling, which encapsulates facial deformations over time in a single image. For the training phase, these CDIs are fed into an end-to-end deep learning model, and the resultant predictions collaborate over multi-views for performance gain in expression classification. Furthermore, we propose a 4D augmentation scheme that not only expands the training data scale but also introduces significant facial muscle movement patterns to improve the FER performance. Results from extensive experiments on the commonly used BU-4DFE dataset under widely adopted settings show that our proposed method outperforms the state-of-the-art 4D FER methods by achieving an accuracy of 96.5%, indicating its effectiveness.
Submitted 7 February, 2020; v1 submitted 6 May, 2019;
originally announced May 2019.
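To illustrate how rank pooling can summarize a sequence in a single image, as the abstract above describes, the sketch below uses the well-known approximate rank pooling shortcut, in which frame t of T receives the fixed weight 2t - T - 1. This is a swapped-in simplification, not necessarily the exact rank pooling used for the paper's CDIs; the function name and the toy sequences are assumptions for illustration.

```python
import numpy as np

def approximate_dynamic_image(frames):
    """Collapse a (T, H, W) or (T, H, W, C) sequence into one image.

    Uses approximate rank pooling weights alpha_t = 2t - T - 1 (t = 1..T).
    Hypothetical helper for illustration; not taken from the paper.
    """
    frames = np.asarray(frames, dtype=np.float64)
    num_frames = frames.shape[0]
    t = np.arange(1, num_frames + 1)
    # Later frames get positive weights, earlier frames negative ones.
    alpha = 2 * t - num_frames - 1
    # Weighted sum over the time axis.
    return np.tensordot(alpha, frames, axes=(0, 0))

# A static sequence carries no motion, so its dynamic image is zero.
static = np.ones((5, 4, 4))
static_image = approximate_dynamic_image(static)

# A sequence whose intensity rises over time (0, 1, 2) gives a positive image.
rising = np.arange(3)[:, None, None] * np.ones((3, 2, 2))
rising_image = approximate_dynamic_image(rising)
```

Because the weights sum to zero, static content cancels out and only temporal change survives in the pooled image.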
-
Toward Performance Optimization in IoT-based Next-Gen Wireless Sensor Networks
Authors:
Muzammil Behzad,
Manal Abdullah,
Muhammad Talal Hassan,
Yao Ge,
Mahmood Ashraf Khan
Abstract:
In this paper, we propose a novel framework for performance optimization in Internet of Things (IoT)-based next-generation wireless sensor networks. In particular, a computationally-convenient system is presented to combat two major research problems in sensor networks. First is the conventionally-tackled resource optimization problem which triggers the drainage of battery at a faster rate within a network. Such drainage promotes inefficient resource usage thereby causing sudden death of the network. The second main bottleneck for such networks is that of data degradation. This is because the nodes in such networks communicate via a wireless channel, where the inevitable presence of noise corrupts the data, making it unsuitable for practical applications. Therefore, we present a layer-adaptive method via a 3-tier communication mechanism to ensure the efficient use of resources. This is supported with a mathematical coverage model that deals with the formation of coverage holes. We also present a transform-domain based robust algorithm to effectively remove the unwanted components from the data. Our proposed framework offers a handy algorithm that enjoys desirable complexity for real-time applications as shown by the extensive simulation results.
Submitted 23 June, 2018;
originally announced June 2018.
-
Image Denoising via Collaborative Dual-Domain Patch Filtering
Authors:
Muzammil Behzad
Abstract:
In this paper, we propose a novel image denoising algorithm exploiting features from both the spatial and the transformed domain. We implement intensity-invariance based improved grouping for collaborative support-agnostic sparse reconstruction. For collaboration, we first stack similar-structured patches via an intensity-invariant correlation measure. The grouped patches collaborate to yield desirable sparse estimates for noise filtering. This is because similar patches share the same support in the transformed domain; such similar supports can be used as probabilities of active taps to refine the sparse estimates. This ultimately produces a very useful patch estimate, thus increasing the quality of the recovered image by discarding the noise-causing components. A region growing based spatially developed post-processor is then applied to further enhance the smooth regions by extracting the spatial domain features. We also extend our proposed method to the denoising of color images. Comparison results with the state-of-the-art algorithms in terms of peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) index from extensive experiments over a broad range of scenarios demonstrate the superiority of our proposed algorithm.
Submitted 1 May, 2018;
originally announced May 2018.
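The peak signal-to-noise ratio used for evaluation in the abstract above follows the standard definition, 10 * log10(peak^2 / MSE). The sketch below computes it with NumPy; the function name, the default `peak` of 255 (8-bit images), and the toy arrays are assumptions for illustration, and the paper's exact evaluation protocol is not specified here.

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images.

    Hypothetical helper for illustration; assumes 8-bit range by default.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        # Identical images: the ratio is unbounded.
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

clean = np.zeros((8, 8))
off_by_one = np.ones((8, 8))   # every pixel differs by 1, so MSE = 1
value_db = psnr(clean, off_by_one)
value_identical = psnr(clean, clean)
```

Higher values indicate a denoised image closer to the reference; with an MSE of 1 on an 8-bit scale the result is about 48.13 dB.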
-
M-BEHZAD: Minimum distance Based Energy efficiency using Hemisphere Zoning with Advanced Divide-and-Rule Scheme for Wireless Sensor Networks
Authors:
Muzammil Behzad
Abstract:
Routing protocols are widely used to boost energy efficiency in WSNs. In this paper, we propose a novel routing protocol, Minimum distance Based Energy efficiency using Hemisphere Zoning with Advanced Divide-and-Rule scheme (M-BEHZAD), to maximize network lifespan, throughput and stability period of the sensors deployed in an unattended network zone. To accomplish these objectives, a static clustering technique along with threshold-conscious transmissions has been used. The robustness of our proposed scheme lies in its Cluster Heads (CHs) selection and network field division, which we introduce as 'Hemisphere Zoning (HZ)'. We have implemented a 3-Tier architecture to minimize the communication distance, which not only leads to better network performance but also significantly reduces the energy and coverage holes, and results in a longer stability period. We have also utilized the Uniform Random Model (URM) to compute packets dropped to make our scheme a more practical approach. Results from comprehensive simulations using MATLAB validate its applicability.
Submitted 3 April, 2018;
originally announced April 2018.
-
Layer-Adaptive Communication and Collaborative Transformed-Domain Representations for Performance Optimization in WSNs
Authors:
Muzammil Behzad,
Manal Abdullah,
Muhammad Talal Hassan,
Yao Ge,
Mahmood Ashraf Khan
Abstract:
In this paper, we combat the problem of performance optimization in wireless sensor networks. Specifically, a novel framework is proposed to handle two major research issues. Firstly, we optimize the utilization of resources available to various nodes at hand. This is achieved via the proposed optimal network clustering enriched with a layer-adaptive 3-tier communication mechanism to diminish energy holes. We also introduce a mathematical coverage model that helps us minimize the number of coverage holes. Secondly, we present a novel approach to recover the corrupted version of the data received over noisy wireless channels. A robust sparse-domain based recovery method equipped with a specially developed averaging filter is used to take care of the unwanted noisy components added to the data samples. Our proposed framework provides a handy routing protocol that enjoys improved computational complexity and elongated network lifetime as demonstrated with the help of extensive simulation results.
Submitted 12 December, 2017;
originally announced December 2017.
-
Image Denoising Via Collaborative Support-Agnostic Recovery
Authors:
Muzammil Behzad,
Mudassir Masood,
Tarig Ballal,
Maha Shadaydeh,
Tareq Y. Al-Naffouri
Abstract:
In this paper, we propose a novel image denoising algorithm using collaborative support-agnostic sparse reconstruction. An observed image is first divided into patches. Similarly structured patches are grouped together to be utilized for collaborative processing. In the proposed collaborative schemes, similar patches are assumed to share the same support taps. For sparse reconstruction, the likelihood of a tap being active in a patch is computed and refined through a collaboration process with other similar patches in the same group. This provides very good patch support estimation, hence enhancing the quality of image restoration. Performance comparisons with state-of-the-art algorithms, in terms of SSIM and PSNR, demonstrate the superiority of the proposed algorithm.
Submitted 9 September, 2016;
originally announced September 2016.