Generative Adversarial Synthesis of Radar Point Cloud Scenes

Authors: Muhammad Saad Nawaz, Thomas Dallmann, Torsten Schoen, Dirk Heberling

Abstract: For the validation and verification of automotive radars, datasets of realistic traffic scenarios are required, which, however, are laborious to acquire. In this paper, we introduce radar scene synthesis using GANs as an alternative to real dataset acquisition and simulation-based approaches. We train a PointNet++ based GAN model to generate realistic radar point cloud scenes and use a binary classifier to evaluate the performance of scenes generated using this model against a test set of real scenes. We demonstrate that our GAN model achieves similar performance (~87%) to the real scenes test set.

Submitted 17 October, 2024; originally announced October 2024.

Comments: ICMIM 2024; 7th IEEE MTT Conference

arXiv:2404.09342

Face-voice Association in Multilingual Environments (FAME) Challenge 2024 Evaluation Plan

Authors: Muhammad Saad Saeed, Shah Nawaz, Muhammad Salman Tahir, Rohan Kumar Das, Muhammad Zaigham Zaheer, Marta Moscati, Markus Schedl, Muhammad Haris Khan, Karthik Nandakumar, Muhammad Haroon Yousaf

Abstract: Advances in technology have led to the use of multimodal systems in various real-world applications. Among them, audio-visual systems are among the widely used multimodal systems. In recent years, associating the face and voice of a person has gained attention due to the presence of a unique correlation between them. The Face-voice Association in Multilingual Environments (FAME) Challenge 2024 focuses on exploring face-voice association under the unique condition of a multilingual scenario. This condition is inspired by the fact that half of the world's population is bilingual and people most often communicate in multilingual scenarios. The challenge uses a dataset named Multilingual Audio-Visual (MAV-Celeb) for exploring face-voice association in multilingual environments. This report provides details of the challenge, dataset, baselines, and tasks for the FAME Challenge.

Submitted 22 July, 2024; v1 submitted 14 April, 2024; originally announced April 2024.

Comments: ACM Multimedia Conference - Grand Challenge

arXiv:2309.09837

Frame-to-Utterance Convergence: A Spectra-Temporal Approach for Unified Spoofing Detection

Authors: Awais Khan, Khalid Mahmood Malik, Shah Nawaz

Abstract: Voice spoofing attacks pose a significant threat to automated speaker verification systems. Existing anti-spoofing methods often simulate specific attack types, such as synthetic or replay attacks. However, in real-world scenarios, the countermeasures are unaware of the generation schema of the attack, necessitating a unified solution. Current unified solutions struggle to detect spoofing artifacts, especially with recent spoofing mechanisms. For instance, the spoofing algorithms inject spectral or temporal anomalies, which are challenging to identify. To this end, we present a spectra-temporal fusion leveraging frame-level and utterance-level coefficients. We introduce a novel local spectral deviation coefficient (SDC) for frame-level inconsistencies and employ a bi-LSTM-based network for sequential temporal coefficients (STC), which capture utterance-level artifacts. Our spectra-temporal fusion strategy combines these coefficients, and an auto-encoder generates spectra-temporal deviated coefficients (STDC) to enhance robustness. Our proposed approach addresses multiple spoofing categories, including synthetic, replay, and partial deepfake attacks. Extensive evaluation on diverse datasets (ASVspoof2019, ASVspoof2021, VSDC, partial spoofs, and in-the-wild deepfakes) demonstrated its robustness for a wide range of voice applications.

Submitted 18 September, 2023; originally announced September 2023.

arXiv:2308.01966

DCTM: Dilated Convolutional Transformer Model for Multimodal Engagement Estimation in Conversation

Authors: Vu Ngoc Tu, Van Thong Huynh, Hyung-Jeong Yang, M. Zaigham Zaheer, Shah Nawaz, Karthik Nandakumar, Soo-Hyung Kim

Abstract: Conversational engagement estimation is posed as a regression problem, entailing the identification of the favorable attention and involvement of the participants in the conversation. This task arises as a crucial pursuit to gain insights into human interaction dynamics and behavior patterns within a conversation. In this research, we introduce a dilated convolutional Transformer for modeling and estimating human engagement in the MULTIMEDIATE 2023 competition. Our proposed system surpasses the baseline models, exhibiting a noteworthy 7% improvement on the test set and 4% on the validation set. Moreover, we employ different modality fusion mechanisms and show that, for this type of data, a simple concatenated method with self-attention fusion gains the best performance.

Submitted 31 July, 2023; originally announced August 2023.

Comments: Accepted in ACMM Grand Challenge

arXiv:2302.13033

Speaker Recognition in Realistic Scenario Using Multimodal Data

Authors: Saqlain Hussain Shah, Muhammad Saad Saeed, Shah Nawaz, Muhammad Haroon Yousaf

Abstract: In recent years, an association has been established between faces and voices of celebrities, leveraging large scale audio-visual information from YouTube. The availability of large scale audio-visual datasets is instrumental in developing speaker recognition methods based on standard Convolutional Neural Networks. Thus, the aim of this paper is to leverage large scale audio-visual information to improve the speaker recognition task. To achieve this, we propose a two-branch network to learn joint representations of faces and voices in a multimodal system. Afterwards, features are extracted from the two-branch network to train a classifier for speaker recognition. We evaluated our proposed framework on a large scale audio-visual dataset named VoxCeleb1. Our results show that the addition of facial information improved the performance of speaker recognition. Moreover, our results indicate that there is an overlap between face and voice.

Submitted 25 February, 2023; originally announced February 2023.

Comments: Accepted at the International Conference on Artificial Intelligence (ICAI'2023)

arXiv:2005.10937

doi 10.1109/ACCESS.2021.3061499

Non-Coherent and Backscatter Communications: Enabling Ultra-Massive Connectivity in 6G Wireless Networks

Authors: Syed Junaid Nawaz, Shree Krishna Sharma, Babar Mansoor, Mohmammad N. Patwary, Noor M. Khan

Abstract: With the commencement of 5G wireless networks, researchers around the globe have started paying attention to the imminent challenges that may emerge in the beyond 5G (B5G) era. Various revolutionary technologies and innovative services are offered in 5G networks, which, along with many principal advantages, are anticipated to bring a boom in the number of connected wireless devices and the types of use-cases that may cause the scarcity of network resources. These challenges, which partly emerged with the advent of massive machine-type communications (mMTC) services, require extensive research innovations to sustain the evolution towards enhanced-mMTC (e-mMTC) with scalable network cost in 6th generation (6G) wireless networks. Towards delivering the anticipated massive connectivity requirements with optimal energy and spectral efficiency besides low hardware cost, this paper presents an enabling framework for 6G networks, which utilizes two emerging technologies, namely, non-coherent communications and backscatter communications (BsC). Recognizing the coherence between these technologies for their joint potential of delivering e-mMTC services in the B5G era, a comprehensive review of their state-of-the-art is conducted. The joint scope of non-coherent and BsC with other emerging 6G technologies is also identified, where the reviewed technologies include unmanned aerial vehicles (UAVs)-assisted communications, visible light communications (VLC), quantum-assisted communications, reconfigurable large intelligent surfaces (RLIS), non-orthogonal multiple access (NOMA), and machine learning-aided intelligent networks. Subsequently, the scope of these enabling technologies for different device types, service types, and optimization parameters is analyzed...

Submitted 20 February, 2021; v1 submitted 21 May, 2020; originally announced May 2020.

Comments: 6G Wireless Networks, Preprint, 34 pages, 11 Figures

arXiv:2004.13780

Cross-modal Speaker Verification and Recognition: A Multilingual Perspective

Authors: Muhammad Saad Saeed, Shah Nawaz, Pietro Morerio, Arif Mahmood, Ignazio Gallo, Muhammad Haroon Yousaf, Alessio Del Bue

Abstract: Recent years have seen a surge in finding association between faces and voices within a cross-modal biometric application along with speaker recognition. Inspired by this, we introduce the challenging task of establishing association between faces and voices across multiple languages spoken by the same set of persons. The aim of this paper is to answer two closely related questions: "Is face-voice association language independent?" and "Can a speaker be recognised irrespective of the spoken language?". These two questions are very important to understand the effectiveness and to boost the development of multilingual biometric systems. To answer them, we collected a Multilingual Audio-Visual dataset, containing human speech clips of 154 identities with 3 language annotations extracted from various videos uploaded online. Extensive experiments on the three splits of the proposed dataset have been performed to investigate and answer these novel research questions that clearly point out the relevance of the multilingual problem.

Submitted 22 April, 2021; v1 submitted 28 April, 2020; originally announced April 2020.

Comments: Accepted: CVPRW

arXiv:1909.08685

Deep Latent Space Learning for Cross-modal Mapping of Audio and Visual Signals

Authors: Shah Nawaz, Muhammad Kamran Janjua, Ignazio Gallo, Arif Mahmood, Alessandro Calefati

Abstract: We propose a novel deep training algorithm for joint representation of audio and visual information which consists of a single stream network (SSNet) coupled with a novel loss function to learn a shared deep latent space representation of multimodal information. The proposed framework characterizes the shared latent space by leveraging the class centers, which helps to eliminate the need for pairwise or triplet supervision. We quantitatively and qualitatively evaluate the proposed approach on VoxCeleb, a benchmark audio-visual dataset, on a multitude of tasks including cross-modal verification, cross-modal matching, and cross-modal retrieval. State-of-the-art performance is achieved on cross-modal verification and matching while comparable results are observed on the remaining applications. Our experiments demonstrate the effectiveness of the technique for cross-modal biometric applications.

Submitted 18 September, 2019; originally announced September 2019.

Comments: Accepted to DICTA 2019

arXiv:1812.02483

Propagation Channels for mmWave Vehicular Communications: State-of-the-art and Future Research Directions

Authors: Furqan Jameel, Shurjeel Wyne, Syed Junaid Nawaz, Zheng Chang

Abstract: Vehicular communications essentially support automotive applications for safety and infotainment. For this reason, industry leaders envision an enhanced role of vehicular communications in the fifth generation of mobile communications technology. Over the years, the number of vehicle-mounted sensors has increased steadily, which potentially leads to a larger volume of critical data communications in a short time. Also, emerging applications such as remote/autonomous driving and infotainment such as high-definition movie streaming require data-rates on the order of multiple Gbit/s. Such high data-rates require a large system bandwidth, but very limited bandwidth is available in the sub-6 GHz cellular bands. This has sparked research interest in the millimeter wave (mmWave) band (10 GHz-300 GHz), where a large bandwidth is available to support the high data-rate and low-latency communications envisioned for emerging vehicular applications. However, leveraging mmWave communications requires a thorough understanding of the relevant vehicular propagation channels, which are significantly different from those investigated below 6 GHz. Despite their significance, very few investigations of mmWave vehicular channels are reported in the literature. This work highlights the key attributes of mmWave vehicular communication channels and surveys the recent literature on channel characterization efforts in order to provide a gap analysis and propose possible directions for future research.

Submitted 6 December, 2018; originally announced December 2018.

Showing 1–9 of 9 results for author: Nawaz, S