
Showing 1–16 of 16 results for author: Doddapaneni, S

  1. arXiv:2410.13394

    cs.CL

    Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs

    Authors: Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Dilip Venkatesh, Raj Dabre, Anoop Kunchukuttan, Mitesh M. Khapra

    Abstract: Evaluating machine-generated text remains a significant challenge in NLP, especially for non-English languages. Current methodologies, including automated metrics, human assessments, and LLM-based evaluations, predominantly focus on English, revealing a significant gap in multilingual evaluation frameworks. We introduce the Cross Lingual Auto Evaluation (CIA) Suite, an extensible framework that in…

    Submitted 17 October, 2024; originally announced October 2024.

  2. arXiv:2408.00960

    cs.CL cs.AI cs.IR

    PERSOMA: PERsonalized SOft ProMpt Adapter Architecture for Personalized Language Prompting

    Authors: Liam Hebert, Krishna Sayana, Ambarish Jash, Alexandros Karatzoglou, Sukhdeep Sodhi, Sumanth Doddapaneni, Yanli Cai, Dima Kuzmin

    Abstract: Understanding the nuances of a user's extensive interaction history is key to building accurate and personalized natural language systems that can adapt to evolving user preferences. To address this, we introduce PERSOMA, Personalized Soft Prompt Adapter architecture. Unlike previous personalized prompting methods for large language models, PERSOMA offers a novel approach to efficiently capture us…

    Submitted 1 August, 2024; originally announced August 2024.

  3. arXiv:2406.13439

    cs.CL

    Finding Blind Spots in Evaluator LLMs with Interpretable Checklists

    Authors: Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Sshubam Verma, Mitesh M. Khapra

    Abstract: Large Language Models (LLMs) are increasingly relied upon to evaluate text outputs of other LLMs, thereby influencing leaderboards and development decisions. However, concerns persist over the accuracy of these assessments and the potential for misleading conclusions. In this work, we investigate the effectiveness of LLMs as evaluators for text generation tasks. We propose FBI, a novel framework d…

    Submitted 26 November, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

    Comments: EMNLP 2024

  4. IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages

    Authors: Mohammed Safi Ur Rahman Khan, Priyam Mehta, Ananth Sankar, Umashankar Kumaravelan, Sumanth Doddapaneni, Suriyaprasaad B, Varun Balan G, Sparsh Jain, Anoop Kunchukuttan, Pratyush Kumar, Raj Dabre, Mitesh M. Khapra

    Abstract: Despite the considerable advancements in English LLMs, the progress in building comparable models for other languages has been hindered due to the scarcity of tailored resources. Our work aims to bridge this divide by introducing an expansive suite of resources specifically designed for the development of Indic LLMs, covering 22 languages, containing a total of 251B tokens and 74.8M instruction-re…

    Submitted 28 November, 2024; v1 submitted 10 March, 2024; originally announced March 2024.

    Comments: ACL-2024 Outstanding Paper

  5. arXiv:2401.04858

    cs.CL cs.AI cs.IR cs.LG

    User Embedding Model for Personalized Language Prompting

    Authors: Sumanth Doddapaneni, Krishna Sayana, Ambarish Jash, Sukhdeep Sodhi, Dima Kuzmin

    Abstract: Modeling long histories plays a pivotal role in enhancing recommendation systems, allowing to capture user's evolving preferences, resulting in more precise and personalized recommendations. In this study we tackle the challenges of modeling long user histories for preference understanding in natural language. Specifically, we introduce a new User Embedding Module (UEM) that efficiently processes…

    Submitted 9 January, 2024; originally announced January 2024.

  6. arXiv:2305.16307

    cs.CL

    IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages

    Authors: Jay Gala, Pranjal A. Chitale, Raghavan AK, Varun Gumma, Sumanth Doddapaneni, Aswanth Kumar, Janki Nawale, Anupama Sujatha, Ratish Puduppully, Vivek Raghavan, Pratyush Kumar, Mitesh M. Khapra, Raj Dabre, Anoop Kunchukuttan

    Abstract: India has a rich linguistic landscape with languages from 4 major language families spoken by over a billion people. 22 of these languages, listed in the Constitution of India (referred to as scheduled languages), are the focus of this work. Given the linguistic diversity, high-quality and accessible Machine Translation (MT) systems are essential in a country like India. Prior to this work, ther…

    Submitted 20 December, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

    Comments: Accepted at TMLR

  7. arXiv:2305.07491

    cs.CL

    A Comprehensive Analysis of Adapter Efficiency

    Authors: Nandini Mundra, Sumanth Doddapaneni, Raj Dabre, Anoop Kunchukuttan, Ratish Puduppully, Mitesh M. Khapra

    Abstract: Adapters have been positioned as a parameter-efficient fine-tuning (PEFT) approach, whereby a minimal number of parameters are added to the model and fine-tuned. However, adapters have not been sufficiently analyzed to understand if PEFT translates to benefits in training/deployment efficiency and maintainability/extensibility. Through extensive experiments on many adapters, tasks, and languages i…

    Submitted 12 May, 2023; originally announced May 2023.

  8. arXiv:2305.05858

    cs.CL

    Vārta: A Large-Scale Headline-Generation Dataset for Indic Languages

    Authors: Rahul Aralikatte, Ziling Cheng, Sumanth Doddapaneni, Jackie Chi Kit Cheung

    Abstract: We present Vārta, a large-scale multilingual dataset for headline generation in Indic languages. This dataset includes 41.8 million news articles in 14 different Indic languages (and English), which come from a variety of high-quality sources. To the best of our knowledge, this is the largest collection of curated articles for Indic languages currently available. We use the data collected in a ser…

    Submitted 9 May, 2023; originally announced May 2023.

    Comments: Findings of ACL 2023

  9. arXiv:2212.10168

    cs.CL

    Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages

    Authors: Arnav Mhaske, Harshit Kedia, Sumanth Doddapaneni, Mitesh M. Khapra, Pratyush Kumar, Rudra Murthy V, Anoop Kunchukuttan

    Abstract: We present Naamapadam, the largest publicly available Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. The dataset contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location, and Organization) for 9 out of the 11 languages. The training dataset has been automaticall…

    Submitted 28 May, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: ACL 2023

  10. arXiv:2212.05409

    cs.CL

    Towards Leaving No Indic Language Behind: Building Monolingual Corpora, Benchmark and Models for Indic Languages

    Authors: Sumanth Doddapaneni, Rahul Aralikatte, Gowtham Ramesh, Shreya Goyal, Mitesh M. Khapra, Anoop Kunchukuttan, Pratyush Kumar

    Abstract: Building Natural Language Understanding (NLU) capabilities for Indic languages, which have a collective speaker base of more than one billion speakers, is absolutely crucial. In this work, we aim to improve the NLU capabilities of Indic languages by making contributions along 3 important axes: (i) monolingual corpora, (ii) NLU testsets, and (iii) multilingual LLMs focusing on Indic languages. Specifically…

    Submitted 24 May, 2023; v1 submitted 10 December, 2022; originally announced December 2022.

    Comments: ACL 2023

  11. arXiv:2208.12666

    cs.CL cs.SD eess.AS

    Effectiveness of Mining Audio and Text Pairs from Public Data for Improving ASR Systems for Low-Resource Languages

    Authors: Kaushal Santosh Bhogale, Abhigyan Raman, Tahir Javed, Sumanth Doddapaneni, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra

    Abstract: End-to-end (E2E) models have become the default choice for state-of-the-art speech recognition systems. Such models are trained on large amounts of labelled data, which are often not available for low-resource languages. Techniques such as self-supervised learning and transfer learning hold promise, but have not yet been effective in training accurate models. On the other hand, collecting labelled…

    Submitted 26 August, 2022; originally announced August 2022.

  12. arXiv:2203.06414

    cs.CL

    A Survey of Adversarial Defences and Robustness in NLP

    Authors: Shreya Goyal, Sumanth Doddapaneni, Mitesh M. Khapra, Balaraman Ravindran

    Abstract: In the past few years, it has become increasingly evident that deep neural networks are not resilient enough to withstand adversarial perturbations in input data, leaving them vulnerable to attack. Various authors have proposed strong adversarial attacks for computer vision and Natural Language Processing (NLP) tasks. As a response, many defense mechanisms have also been proposed to prevent these…

    Submitted 18 April, 2023; v1 submitted 12 March, 2022; originally announced March 2022.

    Comments: Accepted for publication at ACM Computing Surveys

  13. arXiv:2111.06916

    cs.CL cs.AI cs.LG

    Offense Detection in Dravidian Languages using Code-Mixing Index based Focal Loss

    Authors: Debapriya Tula, Shreyas MS, Viswanatha Reddy, Pranjal Sahu, Sumanth Doddapaneni, Prathyush Potluri, Rohan Sukumaran, Parth Patwa

    Abstract: Over the past decade, we have seen exponential growth in online content fueled by social media platforms. Data generation of this scale comes with the caveat of insurmountable offensive content in it. The complexity of identifying offensive content is exacerbated by the usage of multiple modalities (image, language, etc.), code-mixed language and more. Moreover, even after careful sampling and ann…

    Submitted 6 May, 2022; v1 submitted 12 November, 2021; originally announced November 2021.

    Comments: Accepted for publication at SN Computer Science Journal

  14. arXiv:2111.03945

    cs.CL cs.SD eess.AS

    Towards Building ASR Systems for the Next Billion Users

    Authors: Tahir Javed, Sumanth Doddapaneni, Abhigyan Raman, Kaushal Santosh Bhogale, Gowtham Ramesh, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra

    Abstract: Recent methods in speech and language technology pretrain very LARGE models which are fine-tuned for specific tasks. However, the benefits of such LARGE models are often limited to a few resource rich languages of the world. In this work, we make multiple contributions towards building ASR systems for low resource languages from the Indian subcontinent. First, we curate 17,000 hours of raw speech…

    Submitted 22 December, 2021; v1 submitted 6 November, 2021; originally announced November 2021.

  15. arXiv:2107.00676

    cs.CL

    A Primer on Pretrained Multilingual Language Models

    Authors: Sumanth Doddapaneni, Gowtham Ramesh, Mitesh M. Khapra, Anoop Kunchukuttan, Pratyush Kumar

    Abstract: Multilingual Language Models (MLLMs) such as mBERT, XLM, XLM-R, etc. have emerged as a viable option for bringing the power of pretraining to a large number of languages. Given their success in zero-shot transfer learning, there has emerged a large body of work in (i) building bigger MLLMs covering a large number of languages (ii) creating exhaustive benchmarks covering a wider variety…

    Submitted 23 December, 2021; v1 submitted 1 July, 2021; originally announced July 2021.

  16. arXiv:2104.05596

    cs.CL

    Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

    Authors: Gowtham Ramesh, Sumanth Doddapaneni, Aravinth Bheemaraj, Mayank Jobanputra, Raghavan AK, Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Mahalakshmi J, Divyanshu Kakwani, Navneet Kumar, Aswin Pradeep, Srihari Nagaraj, Kumar Deepak, Vivek Raghavan, Anoop Kunchukuttan, Pratyush Kumar, Mitesh Shantadevi Khapra

    Abstract: We present Samanantar, the largest publicly available parallel corpora collection for Indic languages. The collection contains a total of 49.7 million sentence pairs between English and 11 Indic languages (from two language families). Specifically, we compile 12.4 million sentence pairs from existing, publicly-available parallel corpora, and additionally mine 37.4 million sentence pairs from the w…

    Submitted 12 June, 2023; v1 submitted 12 April, 2021; originally announced April 2021.

    Comments: Accepted to the Transactions of the Association for Computational Linguistics (TACL)