Search | arXiv e-print repository

Multi-Stage Speaker Diarization for Noisy Classrooms

Authors: Ali Sartaz Khan, Tolulope Ogunremi, Ahmed Adel Attia, Dorottya Demszky

Abstract: Speaker diarization, the process of identifying "who spoke when" in audio recordings, is essential for understanding classroom dynamics. However, classroom settings present distinct challenges, including poor recording quality, high levels of background noise, overlapping speech, and the difficulty of accurately capturing children's voices. This study investigates the effectiveness of multi-stage diarization models using Nvidia's NeMo diarization pipeline. We assess the impact of denoising on diarization accuracy and compare various voice activity detection (VAD) models, including self-supervised transformer-based frame-wise VAD models. We also explore a hybrid VAD approach that integrates Automatic Speech Recognition (ASR) word-level timestamps with frame-level VAD predictions. We conduct experiments using two datasets from English-speaking classrooms to separate teacher vs. student speech and to separate all speakers. Our results show that denoising significantly improves the Diarization Error Rate (DER) by reducing the rate of missed speech. Additionally, training on both denoised and noisy datasets leads to substantial performance gains in noisy conditions. The hybrid VAD model leads to further improvements in speech detection, achieving a DER as low as 17% in teacher-student experiments and 45% in all-speaker experiments. However, we also identified trade-offs between voice activity detection and speaker confusion. Overall, our study highlights the effectiveness of multi-stage diarization models and integrating ASR-based information for enhancing speaker diarization in noisy classroom environments.

Submitted 27 May, 2025; v1 submitted 16 May, 2025; originally announced May 2025.
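
Note on the metric: the abstract above reports results as Diarization Error Rate (DER). A minimal sketch of the commonly used definition follows, assuming DER is the sum of missed speech, false-alarm speech, and speaker-confusion time divided by total reference speech time. The function name and the example durations are illustrative assumptions, not values from the paper, and the paper's exact scoring setup (collar, overlap handling) is not stated here.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Return DER as a fraction, given durations in seconds.

    Assumes the common definition:
        DER = (missed + false_alarm + confusion) / total_speech
    where total_speech is the total reference speech time.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Hypothetical durations, chosen only to illustrate the arithmetic.
der = diarization_error_rate(missed=6.0, false_alarm=2.0,
                             confusion=9.0, total_speech=100.0)
print(f"DER = {der:.2%}")
```

Under this definition, lowering missed speech (for example through better VAD) can reduce DER even when speaker confusion rises, which is consistent with the trade-off the abstract mentions.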

arXiv:2503.16565 [pdf, other]

Gene42: Long-Range Genomic Foundation Model With Dense Attention

Authors: Kirill Vishniakov, Boulbaba Ben Amor, Engin Tekin, Nancy A. ElNaker, Karthik Viswanathan, Aleksandr Medvedev, Aahan Singh, Maryam Nadeem, Mohammad Amaan Sayeed, Praveenkumar Kanithi, Tiago Magalhaes, Natalia Vassilieva, Dwarikanath Mahapatra, Marco Pimentel, and Shadab Khan

Abstract: We introduce Gene42, a novel family of Genomic Foundation Models (GFMs) designed to manage context lengths of up to 192,000 base pairs (bp) at a single-nucleotide resolution. Gene42 models utilize a decoder-only (LLaMA-style) architecture with a dense self-attention mechanism. Initially trained on fixed-length sequences of 4,096 bp, our models underwent continuous pretraining to extend the context length to 192,000 bp. This iterative extension allowed for the comprehensive processing of large-scale genomic data and the capture of intricate patterns and dependencies within the human genome. Gene42 is the first dense attention model capable of handling such extensive long context lengths in genomics, challenging state-space models that often rely on convolutional operators among other mechanisms. Our pretrained models exhibit notably low perplexity values and high reconstruction accuracy, highlighting their strong ability to model genomic data. Extensive experiments on various genomic benchmarks have demonstrated state-of-the-art performance across multiple tasks, including biotype classification, regulatory region identification, chromatin profiling prediction, variant pathogenicity prediction, and species classification. The models are publicly available at huggingface.co/inceptionai.

Submitted 20 March, 2025; originally announced March 2025.

arXiv:2407.20003 [pdf, other]

On the Effects of Irrelevant Variables in Treatment Effect Estimation with Deep Disentanglement

Authors: Ahmad Saeed Khan, Erik Schaffernicht, Johannes Andreas Stork

Abstract: Estimating treatment effects from observational data is paramount in healthcare, education, and economics, but current deep disentanglement-based methods to address selection bias handle irrelevant variables insufficiently. We demonstrate in experiments that this leads to prediction errors. We disentangle pre-treatment variables with a deep embedding method and explicitly identify and represent irrelevant variables, in addition to instrumental, confounding and adjustment latent factors. To this end, we introduce a reconstruction objective and create an embedding space for irrelevant variables using an attached autoencoder. Instead of relying on serendipitous suppression of irrelevant variables as in previous deep disentanglement approaches, we explicitly force irrelevant variables into this embedding space and employ orthogonalization to prevent irrelevant information from leaking into the latent space representations of the other factors. Our experiments with synthetic and real-world benchmark datasets show that we can better identify irrelevant variables and more precisely predict treatment effects than previous methods, while prediction quality degrades less when additional irrelevant variables are introduced.

Submitted 26 August, 2024; v1 submitted 29 July, 2024; originally announced July 2024.

Comments: Paper is accepted at ECAI-2024

arXiv:2103.01223 [pdf]

doi 10.30534/ijatcse/2021/151012021

Offshore Software Maintenance Outsourcing Predicting Clients Proposal using Supervised Learning

Authors: Atif Ikram, Masita Abdul Jalil, Amir Bin Ngah, Ahmad Salman Khan, Tahir Iqbal

Abstract: In software engineering, software maintenance is the process of correcting, updating, and improving software products after they are handed over to the customer. Through offshore software maintenance outsourcing (OSMO), clients can gain advantages such as reduced cost, saved time, and improved quality. In most cases, the OSMO vendor generates considerable revenue. However, the selection of an appropriate proposal among multiple clients is one of the critical problems for OSMO vendors. The purpose of this paper is to suggest an effective machine learning technique that can be used by OSMO vendors to assess or predict the OSMO client proposal. The dataset is generated through a survey of OSMO vendors working in a developing country. The results showed that supervised learning-based classifiers like Naïve Bayesian, SMO, and Logistics achieved 69.75, 81.81, and 87.27 percent testing accuracy, respectively. This study concludes that supervised learning is the most suitable technique to predict the OSMO client's proposal.

Submitted 1 March, 2021; originally announced March 2021.

Comments: 10 pages, 2 figures

Journal ref: International Journal of Advanced Trends in Computer Science and Engineering, 2021

arXiv:2101.10658 [pdf]

Software Effort Estimation Accuracy Prediction of Machine Learning Techniques: A Systematic Performance Evaluation

Authors: Yasir Mahmood, Nazri Kama, Azri Azmi, Ahmad Salman Khan, Mazlan Ali

Abstract: Software effort estimation accuracy is a key factor in effectively planning, controlling, and delivering a successful software project within budget and schedule. Both overestimation and underestimation are key challenges for future software development; hence, there is a continuous need for accuracy in software effort estimation (SEE). Researchers and practitioners are striving to identify which machine learning estimation technique gives more accurate results based on evaluation measures, datasets and other relevant attributes. The authors of related research are generally not aware of previously published results of machine learning effort estimation techniques. The main aim of this study is to help researchers determine which machine learning technique yields promising effort estimation accuracy in software development. In this paper, the performance of the machine learning ensemble technique is compared with that of the solo technique based on the two most commonly used accuracy evaluation metrics. We used the systematic literature review methodology proposed by Kitchenham and Charters. This includes searching for the most relevant papers, applying quality assessment criteria, extracting data and drawing results. We have evaluated the state-of-the-art accuracy performance of 28 selected studies (14 ensemble, 14 solo) using Mean Magnitude of Relative Error (MMRE) and PRED(25) as a set of reliable accuracy metrics for comparing the two techniques and answering the research questions stated in this study. We found that machine learning techniques are the most frequently implemented in the construction of ensemble effort estimation (EEE) techniques. The results of this study revealed that the EEE techniques usually yield more promising estimation accuracy than the solo techniques.

Submitted 26 January, 2021; originally announced January 2021.

Comments: 27 pages, 15 figures, 8 tables
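
Note on the metrics: the abstract above compares techniques using MMRE and PRED(25). A minimal sketch of how these two metrics are usually computed follows, assuming the standard definitions: the magnitude of relative error is |actual − predicted| / actual, MMRE is its mean over projects, and PRED(25) is the fraction of projects whose relative error is at most 25%. The function names and effort values are illustrative assumptions, not data from the reviewed studies.

```python
def magnitude_of_relative_error(actual, predicted):
    """MRE for one project: |actual - predicted| / actual."""
    if actual <= 0:
        raise ValueError("actual effort must be positive")
    return abs(actual - predicted) / actual


def mmre(actuals, predictions):
    """Mean Magnitude of Relative Error over all projects (lower is better)."""
    errors = [magnitude_of_relative_error(a, p)
              for a, p in zip(actuals, predictions)]
    return sum(errors) / len(errors)


def pred(actuals, predictions, level=25):
    """PRED(level): fraction of projects with MRE <= level percent
    (higher is better)."""
    errors = [magnitude_of_relative_error(a, p)
              for a, p in zip(actuals, predictions)]
    within = sum(1 for e in errors if e <= level / 100.0)
    return within / len(errors)


# Hypothetical effort values (e.g. person-hours), for illustration only.
actual_effort = [100.0, 200.0, 400.0]
estimated_effort = [90.0, 260.0, 400.0]
print("MMRE     =", round(mmre(actual_effort, estimated_effort), 4))
print("PRED(25) =", round(pred(actual_effort, estimated_effort), 4))
```

The two metrics move in opposite directions: a better estimator has a lower MMRE and a higher PRED(25).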

arXiv:1906.02728 [pdf, other]

Feature-level and Model-level Audiovisual Fusion for Emotion Recognition in the Wild

Authors: Jie Cai, Zibo Meng, Ahmed Shehab Khan, Zhiyuan Li, James O'Reilly, Shizhong Han, Ping Liu, Min Chen, Yan Tong

Abstract: Emotion recognition plays an important role in human-computer interaction (HCI) and has been extensively studied for decades. Although tremendous improvements have been achieved for posed expressions, recognizing human emotions in "close-to-real-world" environments remains a challenge. In this paper, we proposed two strategies to fuse information extracted from different modalities, i.e., audio and visual. Specifically, we utilized LBP-TOP, an ensemble of CNNs, and a bi-directional LSTM (BLSTM) to extract features from the visual channel, and the OpenSmile toolkit to extract features from the audio channel. Two kinds of fusion methods, i.e., feature-level fusion and model-level fusion, were developed to utilize the information extracted from the two channels. Experimental results on the EmotiW2018 AFEW dataset have shown that the proposed fusion methods outperform the baseline methods significantly and achieve better or at least comparable performance compared with the state-of-the-art methods, where the model-level fusion performs better when one of the channels totally fails.

Submitted 6 June, 2019; originally announced June 2019.

arXiv:1903.08051 [pdf, other]

Identity-Free Facial Expression Recognition using conditional Generative Adversarial Network

Authors: Jie Cai, Zibo Meng, Ahmed Shehab Khan, Zhiyuan Li, James O'Reilly, Shizhong Han, Yan Tong

Abstract: A novel Identity-Free conditional Generative Adversarial Network (IF-GAN) was proposed for Facial Expression Recognition (FER) to explicitly reduce high inter-subject variations caused by identity-related facial attributes, e.g., age, race, and gender. As part of an end-to-end system, a cGAN was designed to transform a given input facial expression image to an "average" identity face with the same expression as the input. Then, identity-free FER is possible since the generated images have the same synthetic "average" identity and differ only in their displayed expressions. Experiments on four facial expression datasets, one with spontaneous expressions, show that IF-GAN outperforms the baseline CNN and achieves state-of-the-art performance for FER.

Submitted 20 May, 2021; v1 submitted 19 March, 2019; originally announced March 2019.

arXiv:1812.07067 [pdf, other]

Probabilistic Attribute Tree in Convolutional Neural Networks for Facial Expression Recognition

Authors: Jie Cai, Zibo Meng, Ahmed Shehab Khan, Zhiyuan Li, James O'Reilly, Yan Tong

Abstract: In this paper, we proposed a novel Probabilistic Attribute Tree-CNN (PAT-CNN) to explicitly deal with the large intra-class variations caused by identity-related attributes, e.g., age, race, and gender. Specifically, a novel PAT module with an associated PAT loss was proposed to learn features in a hierarchical tree structure organized according to attributes, where the final features are less affected by the attributes. Then, expression-related features are extracted from leaf nodes. Samples are probabilistically assigned to tree nodes at different levels such that expression-related features can be learned from all samples weighted by probabilities. We further proposed a semi-supervised strategy to learn the PAT-CNN from limited attribute-annotated samples to make the best use of available data. Experimental results on five facial expression datasets have demonstrated that the proposed PAT-CNN outperforms the baseline models by explicitly modeling attributes. More impressively, the PAT-CNN using a single model achieves the best performance for faces in the wild on the SFEW dataset, compared with the state-of-the-art methods using an ensemble of hundreds of CNNs.

Submitted 17 December, 2018; originally announced December 2018.

Comments: 10 pages

arXiv:1710.03144 [pdf, other]

Island Loss for Learning Discriminative Features in Facial Expression Recognition

Authors: Jie Cai, Zibo Meng, Ahmed Shehab Khan, Zhiyuan Li, James O'Reilly, Yan Tong

Abstract: Over the past few years, Convolutional Neural Networks (CNNs) have shown promise on facial expression recognition. However, the performance degrades dramatically under real-world settings due to variations introduced by subtle facial appearance changes, head pose variations, illumination changes, and occlusions. In this paper, a novel island loss is proposed to enhance the discriminative power of the deeply learned features. Specifically, the IL is designed to reduce the intra-class variations while enlarging the inter-class differences simultaneously. Experimental results on four benchmark expression databases have demonstrated that the CNN with the proposed island loss (IL-CNN) outperforms the baseline CNN models with either traditional softmax loss or the center loss and achieves comparable or better performance compared with the state-of-the-art methods for facial expression recognition.

Submitted 23 October, 2017; v1 submitted 9 October, 2017; originally announced October 2017.

Comments: 8 pages, 3 figures

arXiv:1707.05395 [pdf, other]

Incremental Boosting Convolutional Neural Network for Facial Action Unit Recognition

Authors: Shizhong Han, Zibo Meng, Ahmed Shehab Khan, Yan Tong

Abstract: Recognizing facial action units (AUs) from spontaneous facial expressions is still a challenging problem. Most recently, CNNs have shown promise on facial AU recognition. However, the learned CNNs are often overfitted and do not generalize well to unseen subjects due to limited AU-coded training images. We proposed a novel Incremental Boosting CNN (IB-CNN) to integrate boosting into the CNN via an incremental boosting layer that selects discriminative neurons from the lower layer and is incrementally updated on successive mini-batches. In addition, a novel loss function that accounts for errors from both the incremental boosted classifier and individual weak classifiers was proposed to fine-tune the IB-CNN. Experimental results on four benchmark AU databases have demonstrated that the IB-CNN yields significant improvement over the traditional CNN and the boosting CNN without incremental learning, as well as outperforming the state-of-the-art CNN-based methods in AU recognition. The improvement is more impressive for the AUs that have the lowest frequencies in the databases.

Submitted 17 July, 2017; originally announced July 2017.

Comments: NIPS2016

arXiv:1707.00860 [pdf, other]

Conditional generation of multi-modal data using constrained embedding space mapping

Authors: Subhajit Chaudhury, Sakyasingha Dasgupta, Asim Munawar, Md. A. Salam Khan, Ryuki Tachibana

Abstract: We present a conditional generative model that maps low-dimensional embeddings of multiple modalities of data to a common latent space, hence extracting semantic relationships between them. The embedding specific to a modality is first extracted and subsequently a constrained optimization procedure is performed to project the two embedding spaces to a common manifold. The individual embeddings are generated back from this common latent space. However, in order to enable independent conditional inference for separately extracting the corresponding embeddings from the common latent space representation, we deploy a proxy variable trick, wherein the single shared latent space is replaced by the respective separate latent spaces of each modality. We design an objective function such that, during training, we can force these separate spaces to lie close to each other by minimizing the distance between their probability distribution functions. Experimental results demonstrate that the learned joint model can generalize to learning concepts of double MNIST digits with additional attributes of colors, from both textual and speech input.

Submitted 25 July, 2017; v1 submitted 4 July, 2017; originally announced July 2017.

Comments: 7 pages, 4 figures, ICML 2017 Workshop on Implicit Models

arXiv:1706.08487 [pdf, ps, other]

doi 10.1109/LSP.2017.2723159

Non-Orthogonal Multiple Access combined with Random Linear Network Coded Cooperation

Authors: Amjad Saeed Khan, Ioannis Chatzigeorgiou

Abstract: This letter considers two groups of source nodes. Each group transmits packets to its own designated destination node over single-hop links and via a cluster of relay nodes shared by both groups. In an effort to boost reliability without sacrificing throughput, a scheme is proposed, whereby packets at the relay nodes are combined using two methods: packets delivered by different groups are mixed using non-orthogonal multiple access principles, while packets originating from the same group are mixed using random linear network coding. An analytical framework that characterizes the performance of the proposed scheme is developed, compared to simulation results and benchmarked against a counterpart scheme that is based on orthogonal multiple access.

Submitted 26 June, 2017; originally announced June 2017.

arXiv:1607.06143 [pdf, ps, other]

doi 10.1109/LCOMM.2016.2594768

Improved bounds on the decoding failure probability of network coding over multi-source multi-relay networks

Authors: Amjad Saeed Khan, Ioannis Chatzigeorgiou

Abstract: This paper considers a multi-source multi-relay network, in which relay nodes employ a coding scheme based on random linear network coding on source packets and generate coded packets. If a destination node collects enough coded packets, it can recover the packets of all source nodes. The links between source-to-relay nodes and relay-to-destination nodes are modeled as packet erasure channels. Improved bounds on the probability of decoding failure are presented, which are markedly close to simulation results and notably better than previous bounds. Examples demonstrate the tightness and usefulness of the new bounds over the old bounds.

Submitted 20 July, 2016; originally announced July 2016.

Comments: 4 pages, 5 figures, accepted for publication in IEEE Communications Letters

arXiv:1508.03664 [pdf, ps, other]

doi 10.1109/LCOMM.2015.2470662

Rethinking the Intercept Probability of Random Linear Network Coding

Authors: Amjad Saeed Khan, Andrea Tassi, Ioannis Chatzigeorgiou

Abstract: This letter considers a network comprising a transmitter, which employs random linear network coding to encode a message, a legitimate receiver, which can recover the message if it gathers a sufficient number of linearly independent coded packets, and an eavesdropper. Closed-form expressions for the probability of the eavesdropper intercepting enough coded packets to recover the message are derived. Transmission with and without feedback is studied. Furthermore, an optimization model that minimizes the intercept probability under delay and reliability constraints is presented. Results validate the proposed analysis and quantify the secrecy gain offered by a feedback link from the legitimate receiver.

Submitted 14 August, 2015; originally announced August 2015.

Comments: IEEE Communications Letters, to appear

arXiv:1503.05696 [pdf, ps, other]

doi 10.1109/ICCW.2015.7247305

Performance Analysis of Random Linear Network Coding in Two-Source Single-Relay Networks

Authors: Amjad Saeed Khan, Ioannis Chatzigeorgiou

Abstract: This paper considers the multiple-access relay channel in a setting where two source nodes transmit packets to a destination node, both directly and via a relay node, over packet erasure channels. Intra-session network coding is used at the source nodes and inter-session network coding is employed at the relay node to combine the recovered source packets of both source nodes. In this work, we investigate the performance of the network-coded system in terms of the probability that the destination node will successfully recover the source packets of the two source nodes. We build our analysis on fundamental probability expressions for random matrices over finite fields and we derive upper bounds on the system performance for the case of systematic and non-systematic network coding. Simulation results show that the upper bounds are very tight and accurately predict the decoding probability at the destination node. Our analysis also exposes the clear benefits of systematic network coding at the source nodes compared to non-systematic transmission.

Submitted 19 March, 2015; originally announced March 2015.

Comments: Proc. ICC 2015, Workshop on Cooperative and Cognitive Mobile Networks (CoCoNet), to appear
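
Note on the underlying probability: the abstract above builds on probability expressions for random matrices over finite fields. A minimal sketch of the textbook building block follows, assuming non-systematic random linear network coding in which a receiver holds n coded packets whose coefficient vectors are drawn uniformly at random over GF(q), and decoding succeeds when the n × k coefficient matrix has rank k. The function name and parameters are illustrative assumptions; the paper's bounds for the relay setting are not reproduced here.

```python
def full_rank_probability(n_received, k_source, q=2):
    """Probability that a uniformly random n x k matrix over GF(q) has rank k.

    Uses the standard product form, valid for n >= k:
        P = prod_{i=0}^{k-1} (1 - q^(i - n))
    Each factor is the chance that the next column lies outside the span
    of the i columns chosen before it.
    """
    if n_received < k_source:
        return 0.0  # fewer coded packets than source packets: cannot decode
    probability = 1.0
    for i in range(k_source):
        probability *= 1.0 - float(q) ** (i - n_received)
    return probability


# Illustration: 10 source packets over GF(2), with 0 to 4 extra coded packets.
for extra in range(5):
    p = full_rank_probability(10 + extra, 10, q=2)
    print(f"n = {10 + extra}: P(decode) = {p:.4f}")
```

The sketch shows why a few extra coded packets, or a larger field size q, raise the decoding probability quickly.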

Showing 1–15 of 15 results for author: Khan, A S