Showing 1–24 of 24 results for author: Jalal, M

  1. AWARE-NET: Adaptive Weighted Averaging for Robust Ensemble Network in Deepfake Detection

    Authors: Muhammad Salman, Iqra Tariq, Mishal Zulfiqar, Muqadas Jalal, Sami Aujla, Sumbal Fatima

    Abstract: Deepfake detection has become increasingly important due to the rise of synthetic media, which poses significant risks to digital identity and cyber presence for security and trust. While multiple approaches have improved detection accuracy, challenges remain in achieving consistent performance across diverse datasets and manipulation types. In response, we propose a novel two-tier ensemble framew…

    Submitted 1 May, 2025; originally announced May 2025.

    Journal ref: IET Conference Proceedings CP917, Volume 2025, Issue 3, Pages 526-533, The Institution of Engineering and Technology, 2025

  2. arXiv:2503.18973  [pdf]

    eess.IV cs.AI cs.CV

    Automated diagnosis of lung diseases using vision transformer: a comparative study on chest x-ray classification

    Authors: Muhammad Ahmad, Sardar Usman, Ildar Batyrshin, Muhammad Muzammil, K. Sajid, M. Hasnain, Muhammad Jalal, Grigori Sidorov

    Abstract: Background: Lung disease is a significant health issue, particularly in children and elderly individuals. It often results from lung infections and is one of the leading causes of mortality in children. Globally, lung-related diseases claim many lives each year, making early and accurate diagnoses crucial. Radiographs are valuable tools for the diagnosis of such conditions. The most prevalent lung…

    Submitted 22 March, 2025; originally announced March 2025.

  3. arXiv:2501.14680  [pdf, other]

    eess.AS cs.SD

    Diffusion based Text-to-Music Generation with Global and Local Text based Conditioning

    Authors: Jisi Zhang, Pablo Peso Parada, Md Asif Jalal, Karthikeyan Saravanan

    Abstract: Diffusion based Text-To-Music (TTM) models generate music corresponding to text descriptions. Typically UNet based diffusion models condition on text embeddings generated from a pre-trained large language model or from a cross-modality audio-language representation model. This work proposes a diffusion based TTM, in which the UNet is conditioned on both (i) a uni-modal language model (e.g., T5) vi…

    Submitted 27 January, 2025; v1 submitted 24 January, 2025; originally announced January 2025.

    Comments: Accepted at ICASSP 2025

  4. arXiv:2501.09113  [pdf, other]

    eess.AS cs.SD

    persoDA: Personalized Data Augmentation for Personalized ASR

    Authors: Pablo Peso Parada, Spyros Fontalis, Md Asif Jalal, Karthikeyan Saravanan, Anastasios Drosou, Mete Ozay, Gil Ho Lee, Jungin Lee, Seokyeong Jung

    Abstract: Data augmentation (DA) is ubiquitously used in training of Automatic Speech Recognition (ASR) models. DA offers increased data variability, robustness and generalization against different acoustic distortions. Recently, personalization of ASR models on mobile devices has been shown to improve Word Error Rate (WER). This paper evaluates data augmentation in this context and proposes persoDA; a DA m…

    Submitted 17 January, 2025; v1 submitted 15 January, 2025; originally announced January 2025.

    Comments: ICASSP'25-Copyright 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

  5. Detecting Frames in News Headlines and Lead Images in U.S. Gun Violence Coverage

    Authors: Isidora Chara Tourni, Lei Guo, Hengchang Hu, Edward Halim, Prakash Ishwar, Taufiq Daryanto, Mona Jalal, Boqi Chen, Margrit Betke, Fabian Zhafransyah, Sha Lai, Derry Tanti Wijaya

    Abstract: News media structure their reporting of events or issues using certain perspectives. When describing an incident involving gun violence, for example, some journalists may focus on mental health or gun regulation, while others may emphasize the discussion of gun rights. Such perspectives are called "frames" in communication research. We study, for the first time, the value of combining lead i…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: published at Findings of the Association for Computational Linguistics: EMNLP 2021

  6. arXiv:2406.17159  [pdf, other]

    eess.AS cs.MM cs.SD

    Exploring compressibility of transformer based text-to-music (TTM) models

    Authors: Vasileios Moschopoulos, Thanasis Kotsiopoulos, Pablo Peso Parada, Konstantinos Nikiforidis, Alexandros Stergiadis, Gerasimos Papakostas, Md Asif Jalal, Jisi Zhang, Anastasios Drosou, Karthikeyan Saravanan

    Abstract: State-of-the-art Text-To-Music (TTM) generative AI models are large and require desktop or server class compute, making them infeasible for deployment on mobile phones. This paper presents an analysis of trade-offs between model compression and generation performance of TTM models. We study compression through knowledge distillation and specific modifications that enable applicability over the var…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: Proceedings of INTERSPEECH 2024

  7. arXiv:2401.13146  [pdf, other]

    eess.AS cs.CL cs.SD

    Locality enhanced dynamic biasing and sampling strategies for contextual ASR

    Authors: Md Asif Jalal, Pablo Peso Parada, George Pavlidis, Vasileios Moschopoulos, Karthikeyan Saravanan, Chrysovalantis-Giorgos Kontoulis, Jisi Zhang, Anastasios Drosou, Gil Ho Lee, Jungin Lee, Seokyeong Jung

    Abstract: Automatic Speech Recognition (ASR) still faces challenges when recognizing time-variant rare-phrases. Contextual biasing (CB) modules bias the ASR model towards such contextually-relevant phrases. During training, a list of biasing phrases is selected from a large pool of phrases following a sampling strategy. In this work we firstly analyse different sampling strategies to provide insights into the t…

    Submitted 23 January, 2024; originally announced January 2024.

    Comments: Accepted for IEEE ASRU 2023

  8. arXiv:2401.12085  [pdf, other]

    eess.AS cs.SD

    Consistency Based Unsupervised Self-training For ASR Personalisation

    Authors: Jisi Zhang, Vandana Rajan, Haaris Mehmood, David Tuckey, Pablo Peso Parada, Md Asif Jalal, Karthikeyan Saravanan, Gil Ho Lee, Jungin Lee, Seokyeong Jung

    Abstract: On-device Automatic Speech Recognition (ASR) models trained on speech data of a large population might underperform for individuals unseen during training. This is due to a domain shift between user data and the original training data, differed by user's speaking characteristics and environmental acoustic conditions. ASR personalisation is a solution that aims to exploit user data to improve model…

    Submitted 22 January, 2024; originally announced January 2024.

    Comments: Accepted for IEEE ASRU 2023

  9. arXiv:2307.13343  [pdf, other]

    eess.AS cs.CR cs.SD

    On-Device Speaker Anonymization of Acoustic Embeddings for ASR based on Flexible Location Gradient Reversal Layer

    Authors: Md Asif Jalal, Pablo Peso Parada, Jisi Zhang, Karthikeyan Saravanan, Mete Ozay, Myoungji Han, Jung In Lee, Seokyeong Jung

    Abstract: Smart devices serviced by large-scale AI models necessitate user data transfer to the cloud for inference. For speech applications, this means transferring private user information, e.g., speaker identity. Our paper proposes a privacy-enhancing framework that targets speaker identity anonymization while preserving speech recognition accuracy for our downstream task - Automatic Speech Recognition…

    Submitted 25 July, 2023; originally announced July 2023.

    Comments: Proceedings of INTERSPEECH 2023

  10. arXiv:2306.17500  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Empirical Interpretation of the Relationship Between Speech Acoustic Context and Emotion Recognition

    Authors: Anna Ollerenshaw, Md Asif Jalal, Rosanna Milner, Thomas Hain

    Abstract: Speech emotion recognition (SER) is vital for obtaining emotional intelligence and understanding the contextual meaning of speech. Variations of consonant-vowel (CV) phonemic boundaries can enrich acoustic context with linguistic cues, which impacts SER. In practice, speech emotions are treated as single labels over an acoustic segment for a given time duration. However, phone boundaries within sp…

    Submitted 30 June, 2023; originally announced June 2023.

  11. arXiv:2303.00550  [pdf, other]

    eess.AS cs.SD

    Towards domain generalisation in ASR with elitist sampling and ensemble knowledge distillation

    Authors: Rehan Ahmad, Md Asif Jalal, Muhammad Umar Farooq, Anna Ollerenshaw, Thomas Hain

    Abstract: Knowledge distillation has been widely used for model compression and domain adaptation for speech applications. In the presence of multiple teachers, knowledge can easily be transferred to the student by averaging the models' output. However, previous research shows that the student does not adapt well with such combination. This paper proposes to use an elitist sampling strategy at the output of ens…

    Submitted 1 March, 2023; originally announced March 2023.

  12. arXiv:2211.02000  [pdf, other]

    cs.SD cs.CL eess.AS

    Dynamic Kernels and Channel Attention for Low Resource Speaker Verification

    Authors: Anna Ollerenshaw, Md Asif Jalal, Thomas Hain

    Abstract: State-of-the-art speaker verification frameworks have typically focused on developing models with increasingly deeper (more layers) and wider (number of channels) models to improve their verification performance. Instead, this paper proposes an approach to increase the model resolution capability using attention-based dynamic kernels in a convolutional neural network to adapt the model parameters…

    Submitted 27 February, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

  13. arXiv:2211.01993  [pdf, other]

    cs.CL cs.SD eess.AS

    Probing Statistical Representations For End-To-End ASR

    Authors: Anna Ollerenshaw, Md Asif Jalal, Thomas Hain

    Abstract: End-to-End automatic speech recognition (ASR) models aim to learn a generalised speech representation to perform recognition. In this domain there is little research to analyse internal representation dependencies and their relationship to modelling approaches. This paper investigates cross-domain language model dependencies within transformer architectures using SVCCA and uses these insights to e…

    Submitted 3 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  14. A cross-corpus study on speech emotion recognition

    Authors: Rosanna Milner, Md Asif Jalal, Raymond W. M. Ng, Thomas Hain

    Abstract: For speech emotion datasets, it has been difficult to acquire large quantities of reliable data and acted emotions may be over the top compared to less expressive emotions displayed in everyday life. Lately, larger datasets with natural emotions have been created. Instead of ignoring smaller, acted datasets, this study investigates whether information learnt from acted emotions is useful for detec…

    Submitted 5 July, 2022; originally announced July 2022.

    Comments: ASRU 2019

    Journal ref: IEEE Workshop on Automatic Speech Recognition and Understanding 2019

  15. Insights on Neural Representations for End-to-End Speech Recognition

    Authors: Anna Ollerenshaw, Md Asif Jalal, Thomas Hain

    Abstract: End-to-end automatic speech recognition (ASR) models aim to learn a generalised speech representation. However, there are limited tools available to understand the internal functions and the effect of hierarchical dependencies within the model architecture. It is crucial to understand the correlations between the layer-wise representations, to derive insights on the relationship between neural rep…

    Submitted 19 May, 2022; originally announced May 2022.

    Comments: Submitted to Interspeech 2021

    Journal ref: Proc. Interspeech 2021, 4079-4083

  16. arXiv:2102.11420  [pdf, other]

    cs.SD eess.AS

    Investigating Deep Neural Structures and their Interpretability in the Domain of Voice Conversion

    Authors: Samuel J. Broughton, Md Asif Jalal, Roger K. Moore

    Abstract: Generative Adversarial Networks (GANs) are machine learning networks based around creating synthetic data. Voice Conversion (VC) is a subset of voice translation that involves translating the paralinguistic features of a source speaker to a target speaker while preserving the linguistic information. The aim of non-parallel conditional GANs for VC is to translate an acoustic speech feature sequence…

    Submitted 22 February, 2021; originally announced February 2021.

    Comments: For demo, see https://samuelbroughton.github.io/interpretability-demo-2020/

  17. arXiv:2008.06974  [pdf, other]

    cs.CL cs.IR cs.LG

    OpenFraming: We brought the ML; you bring the data. Interact with your data and discover its frames

    Authors: Alyssa Smith, David Assefa Tofu, Mona Jalal, Edward Edberg Halim, Yimeng Sun, Vidya Akavoor, Margrit Betke, Prakash Ishwar, Lei Guo, Derry Wijaya

    Abstract: When journalists cover a news story, they can cover the story from multiple angles or perspectives. A news article written about COVID-19 for example, might focus on personal preventative actions such as mask-wearing, while another might focus on COVID-19's impact on the economy. These perspectives are called "frames," which when used may influence public perception and opinion of the issue. We in…

    Submitted 16 August, 2020; originally announced August 2020.

    Comments: 8 pages, 8 figures, EMNLP 2020 demonstration papers

  18. arXiv:2008.05955  [pdf, other]

    cs.CV cs.GR cs.LG cs.RO eess.IV

    SIDOD: A Synthetic Image Dataset for 3D Object Pose Recognition with Distractors

    Authors: Mona Jalal, Josef Spjut, Ben Boudaoud, Margrit Betke

    Abstract: We present a new, publicly-available image dataset generated by the NVIDIA Deep Learning Data Synthesizer intended for use in object detection, pose estimation, and tracking applications. This dataset contains 144k stereo image pairs that synthetically combine 18 camera viewpoints of three photorealistic virtual environments with up to 10 objects (chosen randomly from the 21 object models of the Y…

    Submitted 11 August, 2020; originally announced August 2020.

    Comments: 3 pages, 4 figures, 1 table, Accepted at CVPR 2019 Workshop

  19. arXiv:2008.05060  [pdf, other]

    cs.CV cs.LG eess.SP stat.ML

    Online Graph Completion: Multivariate Signal Recovery in Computer Vision

    Authors: Won Hwa Kim, Mona Jalal, Seongjae Hwang, Sterling C. Johnson, Vikas Singh

    Abstract: The adoption of "human-in-the-loop" paradigms in computer vision and machine learning is leading to various applications where the actual data acquisition (e.g., human supervision) and the underlying inference algorithms are closely intertwined. While classical work in active learning provides effective solutions when the learning module involves classification and regression tasks, many practical…

    Submitted 11 August, 2020; originally announced August 2020.

    Comments: 9 pages, 7 figures, CVPR 2017 Conference

  20. arXiv:2002.05242  [pdf, other]

    cs.CV cs.HC cs.LG

    Leveraging Affect Transfer Learning for Behavior Prediction in an Intelligent Tutoring System

    Authors: Nataniel Ruiz, Hao Yu, Danielle A. Allessio, Mona Jalal, Ajjen Joshi, Thomas Murray, John J. Magee, Jacob R. Whitehill, Vitaly Ablavsky, Ivon Arroyo, Beverly P. Woolf, Stan Sclaroff, Margrit Betke

    Abstract: In this work, we propose a video-based transfer learning approach for predicting problem outcomes of students working with an intelligent tutoring system (ITS). By analyzing a student's face and gestures, our method predicts the outcome of a student answering a problem in an ITS from a video feed. Our work is motivated by the reasoning that the ability to predict such outcomes enables tutoring sys…

    Submitted 8 April, 2022; v1 submitted 12 February, 2020; originally announced February 2020.

    Comments: Published at IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2021 - Best Poster Award (4% award rate)

  21. arXiv:2002.04181  [pdf, other]

    cs.CL cs.LG cs.SI

    Performance Comparison of Crowdworkers and NLP Tools on Named-Entity Recognition and Sentiment Analysis of Political Tweets

    Authors: Mona Jalal, Kate K. Mays, Lei Guo, Margrit Betke

    Abstract: We report results of a comparison of the accuracy of crowdworkers and seven Natural Language Processing (NLP) toolkits in solving two important NLP tasks, named-entity recognition (NER) and entity-level sentiment (ELS) analysis. We here focus on a challenging dataset, 1,000 political tweets that were collected during the U.S. presidential primary election in February 2016. Each tweet refers to at…

    Submitted 11 August, 2020; v1 submitted 10 February, 2020; originally announced February 2020.

    Comments: 4 pages, 1 figure, Accepted at WiNLP Workshop at NAACL 2018

  22. arXiv:1909.00134  [pdf, other]

    cs.CV

    Scraping Social Media Photos Posted in Kenya and Elsewhere to Detect and Analyze Food Types

    Authors: Kaihong Wang, Mona Jalal, Sankara Jefferson, Yi Zheng, Elaine O. Nsoesie, Margrit Betke

    Abstract: Monitoring population-level changes in diet could be useful for education and for implementing interventions to improve health. Research has shown that data from social media sources can be used for monitoring dietary behavior. We propose a scrape-by-location methodology to create food image datasets from Instagram posts. We used it to collect 3.56 million images over a period of 20 days in March…

    Submitted 31 August, 2019; originally announced September 2019.

    Comments: Another version of the paper was submitted to the ACM International Conference on Multimedia (ACMMM2019)

  23. arXiv:1906.00290  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    Adaptive Online Learning for Gradient-Based Optimizers

    Authors: Saeed Masoudian, Ali Arabzadeh, Mahdi Jafari Siavoshani, Milad Jalal, Alireza Amouzad

    Abstract: As application demands for online convex optimization accelerate, the need for designing new methods that simultaneously cover a large class of convex functions and impose the lowest possible regret is highly rising. Known online optimization methods usually perform well only in specific settings, and their performance depends highly on the geometry of the decision space and cost functions. Howeve…

    Submitted 1 June, 2019; originally announced June 2019.

  24. arXiv:1810.01771  [pdf, other]

    cs.CV

    SAVOIAS: A Diverse, Multi-Category Visual Complexity Dataset

    Authors: Elham Saraee, Mona Jalal, Margrit Betke

    Abstract: Visual complexity identifies the level of intricacy and details in an image or the level of difficulty to describe the image. It is an important concept in a variety of areas such as cognitive psychology, computer vision and visualization, and advertisement. Yet, efforts to create large, downloadable image datasets with diverse content and unbiased groundtruthing are lacking. In this work, we intr…

    Submitted 3 October, 2018; originally announced October 2018.

    Comments: 10 pages, 4 figures, 4 tables