Showing 1–11 of 11 results for author: Javed, T

Searching in archive cs.
  1. arXiv:2408.14026  [pdf, other]

    cs.CL cs.SD eess.AS

    Empowering Low-Resource Language ASR via Large-Scale Pseudo Labeling

    Authors: Kaushal Santosh Bhogale, Deovrat Mehendale, Niharika Parasa, Sathish Kumar Reddy G, Tahir Javed, Pratyush Kumar, Mitesh M. Khapra

    Abstract: In this study, we tackle the challenge of limited labeled data for low-resource languages in ASR, focusing on Hindi. Specifically, we explore pseudo-labeling, by proposing a generic framework combining multiple ideas from existing works. Our framework integrates multiple base models for transcription and evaluators for assessing audio-transcript pairs, resulting in robust pseudo-labeling for low r…

    Submitted 26 August, 2024; originally announced August 2024.

  2. arXiv:2408.11440  [pdf, other]

    cs.CL

    LAHAJA: A Robust Multi-accent Benchmark for Evaluating Hindi ASR Systems

    Authors: Tahir Javed, Janki Nawale, Sakshi Joshi, Eldho George, Kaushal Bhogale, Deovrat Mehendale, Mitesh M. Khapra

    Abstract: Hindi, one of the most spoken languages of India, exhibits a diverse array of accents due to its usage among individuals from diverse linguistic origins. To enable a robust evaluation of Hindi ASR systems on multiple accents, we create a benchmark, LAHAJA, which contains read and extempore speech on a diverse set of topics and use cases, with a total of 12.5 hours of Hindi audio, sourced from 132 s…

    Submitted 21 August, 2024; originally announced August 2024.

  3. arXiv:2403.01926  [pdf, other]

    cs.CL

    IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages

    Authors: Tahir Javed, Janki Atul Nawale, Eldho Ittan George, Sakshi Joshi, Kaushal Santosh Bhogale, Deovrat Mehendale, Ishvinder Virender Sethi, Aparna Ananthanarayanan, Hafsah Faquih, Pratiti Palit, Sneha Ravishankar, Saranya Sukumaran, Tripura Panchagnula, Sunjay Murali, Kunal Sharad Gandhi, Ambujavalli R, Manickam K M, C Venkata Vaijayanthi, Krishnan Srinivasa Raghavan Karunganni, Pratyush Kumar, Mitesh M Khapra

    Abstract: We present INDICVOICES, a dataset of natural and spontaneous speech containing a total of 7348 hours of read (9%), extempore (74%) and conversational (17%) audio from 16237 speakers covering 145 Indian districts and 22 languages. Of these 7348 hours, 1639 hours have already been transcribed, with a median of 73 hours per language. Through this paper, we share our journey of capturing the cultural,…

    Submitted 4 March, 2024; originally announced March 2024.

  4. arXiv:2305.15760  [pdf, other]

    cs.CL cs.SD eess.AS

    Svarah: Evaluating English ASR Systems on Indian Accents

    Authors: Tahir Javed, Sakshi Joshi, Vignesh Nagarajan, Sai Sundaresan, Janki Nawale, Abhigyan Raman, Kaushal Bhogale, Pratyush Kumar, Mitesh M. Khapra

    Abstract: India is the second largest English-speaking country in the world with a speaker base of roughly 130 million. Thus, it is imperative that automatic speech recognition (ASR) systems for English should be evaluated on Indian accents. Unfortunately, Indian speakers find a very poor representation in existing English ASR benchmarks such as LibriSpeech, Switchboard, Speech Accent Archive, etc. In this…

    Submitted 25 May, 2023; originally announced May 2023.

  5. arXiv:2305.15386  [pdf, other]

    cs.CL cs.SD eess.AS

    Vistaar: Diverse Benchmarks and Training Sets for Indian Language ASR

    Authors: Kaushal Santosh Bhogale, Sai Sundaresan, Abhigyan Raman, Tahir Javed, Mitesh M. Khapra, Pratyush Kumar

    Abstract: Improving ASR systems is necessary to make new LLM-based use-cases accessible to people across the globe. In this paper, we focus on Indian languages, and make the case that diverse benchmarks are required to evaluate and improve ASR systems for Indian languages. To address this, we collate Vistaar as a set of 59 benchmarks across various language and domain combinations, on which we evaluate 3 pu…

    Submitted 2 August, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: Accepted in INTERSPEECH 2023

  6. arXiv:2208.12666  [pdf, other]

    cs.CL cs.SD eess.AS

    Effectiveness of Mining Audio and Text Pairs from Public Data for Improving ASR Systems for Low-Resource Languages

    Authors: Kaushal Santosh Bhogale, Abhigyan Raman, Tahir Javed, Sumanth Doddapaneni, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra

    Abstract: End-to-end (E2E) models have become the default choice for state-of-the-art speech recognition systems. Such models are trained on large amounts of labelled data, which are often not available for low-resource languages. Techniques such as self-supervised learning and transfer learning hold promise, but have not yet been effective in training accurate models. On the other hand, collecting labelled…

    Submitted 26 August, 2022; originally announced August 2022.

  7. arXiv:2208.11761  [pdf, other]

    cs.CL cs.SD eess.AS

    IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian languages

    Authors: Tahir Javed, Kaushal Santosh Bhogale, Abhigyan Raman, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra

    Abstract: A cornerstone in AI research has been the creation and adoption of standardized training and test datasets to earmark the progress of state-of-the-art models. A particularly successful example is the GLUE dataset for training and evaluating Natural Language Understanding (NLU) models for English. The large body of research around self-supervised BERT-based language models revolved around performan…

    Submitted 15 December, 2022; v1 submitted 24 August, 2022; originally announced August 2022.

  8. arXiv:2111.03945  [pdf, other]

    cs.CL cs.SD eess.AS

    Towards Building ASR Systems for the Next Billion Users

    Authors: Tahir Javed, Sumanth Doddapaneni, Abhigyan Raman, Kaushal Santosh Bhogale, Gowtham Ramesh, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra

    Abstract: Recent methods in speech and language technology pretrain very LARGE models which are fine-tuned for specific tasks. However, the benefits of such LARGE models are often limited to a few resource rich languages of the world. In this work, we make multiple contributions towards building ASR systems for low resource languages from the Indian subcontinent. First, we curate 17,000 hours of raw speech…

    Submitted 22 December, 2021; v1 submitted 6 November, 2021; originally announced November 2021.

  9. arXiv:2107.03141  [pdf, other]

    cs.CL

    Hierarchical Text Classification of Urdu News using Deep Neural Network

    Authors: Taimoor Ahmed Javed, Waseem Shahzad, Umair Arshad

    Abstract: Digital text is increasing day by day on the internet. It is very challenging to classify a large and heterogeneous collection of data, which requires improved information processing methods to organize text. To classify a large corpus, one common approach is to use hierarchical text classification, which aims to classify textual data in a hierarchical structure. Several approaches have been…

    Submitted 7 July, 2021; originally announced July 2021.

    Comments: 22 pages with 16 figures

  10. arXiv:2102.12362  [pdf, other]

    cs.CR cs.CL

    Detecting Compliance of Privacy Policies with Data Protection Laws

    Authors: Ayesha Qamar, Tehreem Javed, Mirza Omer Beg

    Abstract: Privacy policies are the legal documents that describe the practices that an organization or company has adopted in the handling of the personal data of its users. But as policies are legal documents, they are often written in extensive legal jargon that is difficult to understand. Though work has been done on privacy policies, none caters to the problem of verifying if a given privacy po…

    Submitted 21 February, 2021; originally announced February 2021.

  11. arXiv:2011.09145  [pdf, other]

    cs.SI cs.CY

    A First Look at COVID-19 Messages on WhatsApp in Pakistan

    Authors: R. Tallal Javed, Mirza Elaaf Shuja, Muhammad Usama, Junaid Qadir, Waleed Iqbal, Gareth Tyson, Ignacio Castro, Kiran Garimella

    Abstract: The worldwide spread of COVID-19 has prompted extensive online discussions, creating an 'infodemic' on social media platforms such as WhatsApp and Twitter. However, the information shared on these platforms is prone to be unreliable and/or misleading. In this paper, we present the first analysis of COVID-19 discourse on public WhatsApp groups from Pakistan. Building on a large scale annotation of…

    Submitted 19 November, 2020; v1 submitted 18 November, 2020; originally announced November 2020.