Skip to main content

Showing 1–50 of 75 results for author: Cahyawijaya, S

Searching in archive cs. Search in all archives.
.
  1. arXiv:2506.12450  [pdf, ps, other

    cs.CL

    Language Surgery in Multilingual Large Language Models

    Authors: Joanito Agili Lopo, Muhammad Ravi Shulthan Habibi, Tack Hwa Wong, Muhammad Ilham Ghozali, Fajri Koto, Genta Indra Winata, Peerat Limkonchotiwat, Alham Fikri Aji, Samuel Cahyawijaya

    Abstract: Large Language Models (LLMs) have demonstrated remarkable generalization capabilities across tasks and languages, revolutionizing natural language processing. This paper investigates the naturally emerging representation alignment in LLMs, particularly in the middle layers, and its implications for disentangling language-specific and language-agnostic information. We empirically confirm the existe… ▽ More

    Submitted 14 June, 2025; originally announced June 2025.

  2. arXiv:2505.21344  [pdf, ps, other

    cs.AI cs.CL

    The Multilingual Divide and Its Impact on Global AI Safety

    Authors: Aidan Peppin, Julia Kreutzer, Alice Schoenauer Sebag, Kelly Marchisio, Beyza Ermis, John Dang, Samuel Cahyawijaya, Shivalika Singh, Seraphina Goldfarb-Tarrant, Viraat Aryabumi, Aakanksha, Wei-Yin Ko, Ahmet Üstün, Matthias Gallé, Marzieh Fadaee, Sara Hooker

    Abstract: Despite advances in large language model capabilities in recent years, a large gap remains in their capabilities and safety performance for many languages beyond a relatively small handful of globally dominant languages. This paper provides researchers, policymakers and governance experts with an overview of key challenges to bridging the "language gap" in AI and minimizing safety risks across lan… ▽ More

    Submitted 27 May, 2025; originally announced May 2025.

  3. arXiv:2504.20356  [pdf, other

    cs.CL

    What Causes Knowledge Loss in Multilingual Language Models?

    Authors: Maria Khelli, Samuel Cahyawijaya, Ayu Purwarianti, Genta Indra Winata

    Abstract: Cross-lingual transfer in natural language processing (NLP) models enhances multilingual performance by leveraging shared linguistic knowledge. However, traditional methods that process all data simultaneously often fail to mimic real-world scenarios, leading to challenges like catastrophic forgetting, where fine-tuning on new tasks degrades performance on previously learned ones. Our study explor… ▽ More

    Submitted 28 April, 2025; originally announced April 2025.

  4. arXiv:2504.00698  [pdf

    cs.CL cs.AI cs.LG

    Command A: An Enterprise-Ready Large Language Model

    Authors: Team Cohere, :, Aakanksha, Arash Ahmadian, Marwan Ahmed, Jay Alammar, Milad Alizadeh, Yazeed Alnumay, Sophia Althammer, Arkady Arkhangorodsky, Viraat Aryabumi, Dennis Aumiller, Raphaël Avalos, Zahara Aviv, Sammie Bae, Saurabh Baji, Alexandre Barbet, Max Bartolo, Björn Bebensee, Neeral Beladia, Walter Beller-Morales, Alexandre Bérard, Andrew Berneshawi, Anna Bialas, Phil Blunsom , et al. (205 additional authors not shown)

    Abstract: In this report we describe the development of Command A, a powerful large language model purpose-built to excel at real-world enterprise use cases. Command A is an agent-optimised and multilingual-capable model, with support for 23 languages of global business, and a novel hybrid architecture balancing efficiency with top of the range performance. It offers best-in-class Retrieval Augmented Genera… ▽ More

    Submitted 14 April, 2025; v1 submitted 1 April, 2025; originally announced April 2025.

    Comments: 55 pages

  5. arXiv:2503.11280  [pdf, ps, other

    cs.CL

    High-Dimensional Interlingual Representations of Large Language Models

    Authors: Bryan Wilie, Samuel Cahyawijaya, Junxian He, Pascale Fung

    Abstract: Large language models (LLMs) trained on massive multilingual datasets hint at the formation of interlingual constructs--a shared subspace in the representation space. However, evidence regarding this phenomenon is mixed, leaving it unclear whether these models truly develop unified interlingual representations, or present a partially aligned constructs. We explore 31 diverse languages varying on t… ▽ More

    Submitted 20 June, 2025; v1 submitted 14 March, 2025; originally announced March 2025.

  6. arXiv:2503.07920  [pdf, other

    cs.CV cs.AI cs.CL

    Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia

    Authors: Samuel Cahyawijaya, Holy Lovenia, Joel Ruben Antony Moniz, Tack Hwa Wong, Mohammad Rifqi Farhansyah, Thant Thiri Maung, Frederikus Hudi, David Anugraha, Muhammad Ravi Shulthan Habibi, Muhammad Reza Qorib, Amit Agarwal, Joseph Marvin Imperial, Hitesh Laxmichand Patel, Vicky Feliren, Bahrul Ilmi Nasution, Manuel Antonio Rufino, Genta Indra Winata, Rian Adam Rajagede, Carlos Rafael Catalan, Mohamed Fazli Imam, Priyaranjan Pattnayak, Salsabila Zahirah Pranida, Kevin Pratama, Yeshil Bangera, Adisai Na-Thalang , et al. (67 additional authors not shown)

    Abstract: Southeast Asia (SEA) is a region of extraordinary linguistic and cultural diversity, yet it remains significantly underrepresented in vision-language (VL) research. This often results in artificial intelligence (AI) models that fail to capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing high-quality, culturally relevant data for SEA… ▽ More

    Submitted 18 March, 2025; v1 submitted 10 March, 2025; originally announced March 2025.

    Comments: [SEA-VL Dataset] https://huggingface.co/collections/SEACrowd/sea-vl-multicultural-vl-dataset-for-southeast-asia-67cf223d0c341d4ba2b236e7 [Appendix J] https://github.com/SEACrowd/seacrowd.github.io/blob/master/docs/SEA_VL_Appendix_J.pdf

  7. arXiv:2502.17956  [pdf, other

    cs.CL

    Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments

    Authors: Patomporn Payoungkhamdee, Pume Tuchinda, Jinheon Baek, Samuel Cahyawijaya, Can Udomcharoenchaikit, Potsawee Manakul, Peerat Limkonchotiwat, Ekapol Chuangsuwanich, Sarana Nutanong

    Abstract: Multi-step reasoning is essential for large language models (LLMs), yet multilingual performance remains challenging. While Chain-of-Thought (CoT) prompting improves reasoning, it struggles with non-English languages due to the entanglement of reasoning and execution. Program-of-Thought (PoT) prompting separates reasoning from execution, offering a promising alternative but shifting the challenge… ▽ More

    Submitted 22 May, 2025; v1 submitted 25 February, 2025; originally announced February 2025.

  8. arXiv:2410.21573  [pdf, other

    cs.CL cs.AI

    Thank You, Stingray: Multilingual Large Language Models Can Not (Yet) Disambiguate Cross-Lingual Word Sense

    Authors: Samuel Cahyawijaya, Ruochen Zhang, Holy Lovenia, Jan Christian Blaise Cruz, Elisa Gilbert, Hiroki Nomoto, Alham Fikri Aji

    Abstract: Multilingual large language models (LLMs) have gained prominence, but concerns arise regarding their reliability beyond English. This study addresses the gap in cross-lingual semantic evaluation by introducing a novel benchmark for cross-lingual sense disambiguation, StingrayBench. In this paper, we demonstrate using false friends -- words that are orthographically similar but have completely diff… ▽ More

    Submitted 30 October, 2024; v1 submitted 28 October, 2024; originally announced October 2024.

  9. arXiv:2410.12705  [pdf, other

    cs.CL cs.AI cs.CV

    WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines

    Authors: Genta Indra Winata, Frederikus Hudi, Patrick Amadeus Irawan, David Anugraha, Rifki Afina Putri, Yutong Wang, Adam Nohejl, Ubaidillah Ariq Prathama, Nedjma Ousidhoum, Afifa Amriani, Anar Rzayev, Anirban Das, Ashmari Pramodya, Aulia Adila, Bryan Wilie, Candy Olivia Mawalim, Ching Lam Cheng, Daud Abolade, Emmanuele Chersoni, Enrico Santus, Fariz Ikhwantri, Garry Kuwanto, Hanyang Zhao, Haryo Akbarianto Wibowo, Holy Lovenia , et al. (26 additional authors not shown)

    Abstract: Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering… ▽ More

    Submitted 8 May, 2025; v1 submitted 16 October, 2024; originally announced October 2024.

    Comments: Best Theme Paper at NAACL 2025

  10. arXiv:2409.14785  [pdf, other

    cs.CL cs.CV

    Towards Efficient and Robust VQA-NLE Data Generation with Large Vision-Language Models

    Authors: Patrick Amadeus Irawan, Genta Indra Winata, Samuel Cahyawijaya, Ayu Purwarianti

    Abstract: Natural Language Explanation (NLE) aims to elucidate the decision-making process by providing detailed, human-friendly explanations in natural language. It helps demystify the decision-making processes of large vision-language models (LVLMs) through the use of language models. While existing methods for creating a Vision Question-Answering with Natural Language Explanation (VQA-NLE) datasets can p… ▽ More

    Submitted 9 December, 2024; v1 submitted 23 September, 2024; originally announced September 2024.

    Comments: Preprint

  11. arXiv:2409.13897  [pdf

    cs.CL cs.AI

    LLM for Everyone: Representing the Underrepresented in Large Language Models

    Authors: Samuel Cahyawijaya

    Abstract: Natural language processing (NLP) has witnessed a profound impact of large language models (LLMs) that excel in a multitude of tasks. However, the limitation of LLMs in multilingual settings, particularly in underrepresented languages, remains a significant hurdle. This thesis aims to bridge the gap in NLP research and development by focusing on underrepresented languages. A comprehensive evaluati… ▽ More

    Submitted 20 September, 2024; originally announced September 2024.

    Comments: PhD thesis

  12. arXiv:2409.12435  [pdf, other

    cs.CL

    Linguistic Minimal Pairs Elicit Linguistic Similarity in Large Language Models

    Authors: Xinyu Zhou, Delong Chen, Samuel Cahyawijaya, Xufeng Duan, Zhenguang G. Cai

    Abstract: We introduce a novel analysis that leverages linguistic minimal pairs to probe the internal linguistic representations of Large Language Models (LLMs). By measuring the similarity between LLM activation differences across minimal pairs, we quantify the and gain insight into the linguistic knowledge captured by LLMs. Our large-scale experiments, spanning 100+ LLMs and 150k minimal pairs in three la… ▽ More

    Submitted 13 December, 2024; v1 submitted 18 September, 2024; originally announced September 2024.

    Comments: COLING 2025

  13. arXiv:2407.03282  [pdf, other

    cs.CL

    LLM Internal States Reveal Hallucination Risk Faced With a Query

    Authors: Ziwei Ji, Delong Chen, Etsuko Ishii, Samuel Cahyawijaya, Yejin Bang, Bryan Wilie, Pascale Fung

    Abstract: The hallucination problem of Large Language Models (LLMs) significantly limits their reliability and trustworthiness. Humans have a self-awareness process that allows us to recognize what we don't know when faced with queries. Inspired by this, our paper investigates whether LLMs can estimate their own hallucination risk before response generation. We analyze the internal mechanisms of LLMs broadl… ▽ More

    Submitted 29 September, 2024; v1 submitted 3 July, 2024; originally announced July 2024.

  14. arXiv:2406.19764  [pdf, other

    cs.CL

    Belief Revision: The Adaptability of Large Language Models Reasoning

    Authors: Bryan Wilie, Samuel Cahyawijaya, Etsuko Ishii, Junxian He, Pascale Fung

    Abstract: The capability to reason from text is crucial for real-world NLP applications. Real-world scenarios often involve incomplete or evolving data. In response, individuals update their beliefs and understandings accordingly. However, most existing evaluations assume that language models (LMs) operate with consistent information. We introduce Belief-R, a new dataset designed to test LMs' belief revisio… ▽ More

    Submitted 17 October, 2024; v1 submitted 28 June, 2024; originally announced June 2024.

  15. arXiv:2406.10118  [pdf, other

    cs.CL

    SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages

    Authors: Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James V. Miranda, Jennifer Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek Mansurov, Joseph Marvin Imperial, Onno P. Kampman, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, Frederikus Hudi, Railey Montalan, Ryan Ignatius, Joanito Agili Lopo, William Nixon, Börje F. Karlsson, James Jaya, Ryandito Diandaru, Yuze Gao, Patrick Amadeus, Bin Wang, Jan Christian Blaise Cruz, Chenxi Whitehouse , et al. (36 additional authors not shown)

    Abstract: Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due t… ▽ More

    Submitted 10 March, 2025; v1 submitted 14 June, 2024; originally announced June 2024.

    Comments: https://seacrowd.github.io/ Published in EMNLP 2024

  16. arXiv:2406.05967  [pdf, other

    cs.CV cs.AI cs.CL cs.LG

    CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark

    Authors: David Romero, Chenyang Lyu, Haryo Akbarianto Wibowo, Teresa Lynn, Injy Hamed, Aditya Nanda Kishore, Aishik Mandal, Alina Dragonetti, Artem Abzaliev, Atnafu Lambebo Tonja, Bontu Fufa Balcha, Chenxi Whitehouse, Christian Salamea, Dan John Velasco, David Ifeoluwa Adelani, David Le Meur, Emilio Villa-Cueva, Fajri Koto, Fauzan Farooqui, Frederico Belcavello, Ganzorig Batnasan, Gisela Vallejo, Grainne Caulfield, Guido Ivetta, Haiyue Song , et al. (51 additional authors not shown)

    Abstract: Visual Question Answering (VQA) is an important task in multimodal AI, and it is often used to test the ability of vision-language models to understand and reason on knowledge present in both visual and textual data. However, most of the current VQA models use datasets that are primarily focused on English and a few major world languages, with images that are typically Western-centric. While recen… ▽ More

    Submitted 4 November, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

    Comments: 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks

  17. arXiv:2405.00485  [pdf, other

    cs.CV

    What Makes for Good Image Captions?

    Authors: Delong Chen, Samuel Cahyawijaya, Etsuko Ishii, Ho Shu Chan, Yejin Bang, Pascale Fung

    Abstract: This paper establishes a formal information-theoretic framework for image captioning, conceptualizing captions as compressed linguistic representations that selectively encode semantic units in images. Our framework posits that good image captions should balance three key aspects: informationally sufficient, minimally redundant, and readily comprehensible by humans. By formulating these aspects as… ▽ More

    Submitted 28 September, 2024; v1 submitted 1 May, 2024; originally announced May 2024.

  18. arXiv:2404.07900  [pdf, other

    cs.CL cs.AI

    High-Dimension Human Value Representation in Large Language Models

    Authors: Samuel Cahyawijaya, Delong Chen, Yejin Bang, Leila Khalatbari, Bryan Wilie, Ziwei Ji, Etsuko Ishii, Pascale Fung

    Abstract: The widespread application of LLMs across various tasks and fields has necessitated the alignment of these models with human values and preferences. Given various approaches of human value alignment, there is an urgent need to understand the scope and nature of human values injected into these LLMs before their deployment and adoption. We propose UniVaR, a high-dimensional neural representation of… ▽ More

    Submitted 25 March, 2025; v1 submitted 11 April, 2024; originally announced April 2024.

  19. arXiv:2404.06138  [pdf, other

    cs.CL

    Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages

    Authors: Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Rifki Afina Putri, Emmanuel Dave, Jhonson Lee, Nuur Shadieq, Wawan Cenggoro, Salsabil Maulana Akbar, Muhammad Ihza Mahendra, Dea Annisayanti Putri, Bryan Wilie, Genta Indra Winata, Alham Fikri Aji, Ayu Purwarianti, Pascale Fung

    Abstract: Large language models (LLMs) show remarkable human-like capability in various domains and languages. However, a notable quality gap arises in low-resource languages, e.g., Indonesian indigenous languages, rendering them ineffective and inefficient in such linguistic contexts. To bridge this quality gap, we introduce Cendol, a collection of Indonesian LLMs encompassing both decoder-only and encoder… ▽ More

    Submitted 7 July, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

    Comments: Cendol models are released under Apache 2.0 license and will be made publicly available soon

  20. arXiv:2403.16512  [pdf, other

    cs.CL cs.AI

    LLMs Are Few-Shot In-Context Low-Resource Language Learners

    Authors: Samuel Cahyawijaya, Holy Lovenia, Pascale Fung

    Abstract: In-context learning (ICL) empowers large language models (LLMs) to perform diverse tasks in underrepresented languages using only short in-context information, offering a crucial avenue for narrowing the gap between high-resource and low-resource languages. Nonetheless, there is only a handful of works explored ICL for low-resource languages with most of them focusing on relatively high-resource l… ▽ More

    Submitted 25 June, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

  21. arXiv:2402.14327  [pdf, other

    cs.CV cs.CL

    Subobject-level Image Tokenization

    Authors: Delong Chen, Samuel Cahyawijaya, Jianfeng Liu, Baoyuan Wang, Pascale Fung

    Abstract: Patch-based image tokenization ignores the morphology of the visual world, limiting effective and efficient learning of image understanding. Inspired by subword tokenization, we introduce subobject-level adaptive token segmentation and explore several approaches, including superpixel, SAM, and a proposed Efficient and PanOptiC (EPOC) image tokenizer. Our EPOC combines boundary detection -- a simpl… ▽ More

    Submitted 12 March, 2025; v1 submitted 22 February, 2024; originally announced February 2024.

  22. arXiv:2401.06034  [pdf, other

    cs.CL

    LinguAlchemy: Fusing Typological and Geographical Elements for Unseen Language Generalization

    Authors: Muhammad Farid Adilazuarda, Samuel Cahyawijaya, Alham Fikri Aji, Genta Indra Winata, Ayu Purwarianti

    Abstract: Pretrained language models (PLMs) have become remarkably adept at task and language generalization. Nonetheless, they often fail when faced with unseen languages. In this work, we present LinguAlchemy, a regularization method that incorporates various linguistic information covering typological, geographical, and phylogenetic features to align PLMs representation to the corresponding linguistic in… ▽ More

    Submitted 4 October, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

  23. arXiv:2311.12405  [pdf, other

    cs.CL

    IndoRobusta: Towards Robustness Against Diverse Code-Mixed Indonesian Local Languages

    Authors: Muhammad Farid Adilazuarda, Samuel Cahyawijaya, Genta Indra Winata, Pascale Fung, Ayu Purwarianti

    Abstract: Significant progress has been made on Indonesian NLP. Nevertheless, exploration of the code-mixing phenomenon in Indonesian is limited, despite many languages being frequently mixed with Indonesian in daily conversation. In this work, we explore code-mixing in Indonesian with four embedded languages, i.e., English, Sundanese, Javanese, and Malay; and introduce IndoRobusta, a framework to evaluate… ▽ More

    Submitted 21 November, 2023; originally announced November 2023.

  24. arXiv:2311.12375  [pdf, other

    cs.CL

    The Obscure Limitation of Modular Multilingual Language Models

    Authors: Muhammad Farid Adilazuarda, Samuel Cahyawijaya, Ayu Purwarianti

    Abstract: We expose the limitation of modular multilingual language models (MLMs) in multilingual inference scenarios with unknown languages. Existing evaluations of modular MLMs exclude the involvement of language identification (LID) modules, which obscures the performance of real-case multilingual scenarios of modular MLMs. In this work, we showcase the effect of adding LID on the multilingual evaluation… ▽ More

    Submitted 21 November, 2023; originally announced November 2023.

  25. arXiv:2311.00958  [pdf, other

    cs.CL cs.AI

    IndoToD: A Multi-Domain Indonesian Benchmark For End-to-End Task-Oriented Dialogue Systems

    Authors: Muhammad Dehan Al Kautsar, Rahmah Khoirussyifa' Nurdini, Samuel Cahyawijaya, Genta Indra Winata, Ayu Purwarianti

    Abstract: Task-oriented dialogue (ToD) systems have been mostly created for high-resource languages, such as English and Chinese. However, there is a need to develop ToD systems for other regional or local languages to broaden their ability to comprehend the dialogue contexts in various languages. This paper introduces IndoToD, an end-to-end multi domain ToD benchmark in Indonesian. We extend two English To… ▽ More

    Submitted 1 November, 2023; originally announced November 2023.

    Comments: 2023 1st Workshop in South East Asian Language Processing (SEALP), Co-located with AACL 2023

  26. arXiv:2310.08885  [pdf, other

    cs.CL

    InstructTODS: Large Language Models for End-to-End Task-Oriented Dialogue Systems

    Authors: Willy Chung, Samuel Cahyawijaya, Bryan Wilie, Holy Lovenia, Pascale Fung

    Abstract: Large language models (LLMs) have been used for diverse tasks in natural language processing (NLP), yet remain under-explored for task-oriented dialogue systems (TODS), especially for end-to-end TODS. We present InstructTODS, a novel off-the-shelf framework for zero-shot end-to-end task-oriented dialogue systems that can adapt to diverse domains without fine-tuning. By leveraging LLMs, InstructTOD… ▽ More

    Submitted 13 October, 2023; originally announced October 2023.

  27. arXiv:2310.05338  [pdf, other

    cs.CV cs.CL

    Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models

    Authors: Holy Lovenia, Wenliang Dai, Samuel Cahyawijaya, Ziwei Ji, Pascale Fung

    Abstract: Object hallucination poses a significant challenge in vision-language (VL) models, often leading to the generation of nonsensical or unfaithful responses with non-existent objects. However, the absence of a general measurement for evaluating object hallucination in VL models has hindered our understanding and ability to mitigate this issue. In this work, we present NOPE (Negative Object Presence E… ▽ More

    Submitted 13 August, 2024; v1 submitted 8 October, 2023; originally announced October 2023.

    Comments: Published in ALVR Workshop at ACL 2024

  28. arXiv:2309.14381  [pdf, other

    cs.CL cs.AI

    Survey of Social Bias in Vision-Language Models

    Authors: Nayeon Lee, Yejin Bang, Holy Lovenia, Samuel Cahyawijaya, Wenliang Dai, Pascale Fung

    Abstract: In recent years, the rapid advancement of machine learning (ML) models, particularly transformer-based pre-trained models, has revolutionized Natural Language Processing (NLP) and Computer Vision (CV) fields. However, researchers have discovered that these models can inadvertently capture and reinforce social biases present in their training datasets, leading to potential social harms, such as une… ▽ More

    Submitted 24 September, 2023; originally announced September 2023.

  29. arXiv:2309.10661  [pdf, other

    cs.CL cs.AI

    NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages

    Authors: Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Dea Adhista, Emmanuel Dave, Sarah Oktavianti, Salsabil Maulana Akbar, Jhonson Lee, Nuur Shadieq, Tjeng Wawan Cenggoro, Hanung Wahyuning Linuwih, Bryan Wilie, Galih Pradipta Muridan, Genta Indra Winata, David Moeljadi, Alham Fikri Aji, Ayu Purwarianti, Pascale Fung

    Abstract: Democratizing access to natural language processing (NLP) technology is crucial, especially for underrepresented and extremely low-resource languages. Previous research has focused on developing labeled and unlabeled corpora for these languages through online scraping and document translation. While these methods have proven effective and cost-efficient, we have identified limitations in the resul… ▽ More

    Submitted 19 September, 2023; v1 submitted 19 September, 2023; originally announced September 2023.

  30. arXiv:2309.10413  [pdf, other

    cs.CL

    PICK: Polished & Informed Candidate Scoring for Knowledge-Grounded Dialogue Systems

    Authors: Bryan Wilie, Yan Xu, Willy Chung, Samuel Cahyawijaya, Holy Lovenia, Pascale Fung

    Abstract: Grounding dialogue response generation on external knowledge is proposed to produce informative and engaging responses. However, current knowledge-grounded dialogue (KGD) systems often fail to align the generated responses with human-preferred qualities due to several issues like hallucination and the lack of coherence. Upon analyzing multiple language model generations, we observe the presence of… ▽ More

    Submitted 19 September, 2023; originally announced September 2023.

  31. arXiv:2306.14517  [pdf, other

    cs.CL cs.SD eess.AS

    Cross-Lingual Cross-Age Group Adaptation for Low-Resource Elderly Speech Emotion Recognition

    Authors: Samuel Cahyawijaya, Holy Lovenia, Willy Chung, Rita Frieske, Zihan Liu, Pascale Fung

    Abstract: Speech emotion recognition plays a crucial role in human-computer interactions. However, most speech emotion recognition research is biased toward English-speaking adults, which hinders its applicability to other demographic groups in different languages and age groups. In this work, we analyze the transferability of emotion recognition across three different languages--English, Mandarin Chinese,… ▽ More

    Submitted 26 June, 2023; originally announced June 2023.

    Comments: Accepted in INTERSPEECH 2023

  32. arXiv:2305.16171  [pdf

    cs.CL

    Multi-lingual and Multi-cultural Figurative Language Understanding

    Authors: Anubha Kabra, Emmy Liu, Simran Khanuja, Alham Fikri Aji, Genta Indra Winata, Samuel Cahyawijaya, Anuoluwapo Aremu, Perez Ogayo, Graham Neubig

    Abstract: Figurative language permeates human communication, but at the same time is relatively understudied in NLP. Datasets have been created in English to accelerate progress towards measuring and improving figurative language processing in language models (LMs). However, the use of figurative language is an expression of our cultural and societal experiences, making it difficult for these phrases to be… ▽ More

    Submitted 25 May, 2023; originally announced May 2023.

    Comments: ACL 2023 Findings

  33. arXiv:2305.14716  [pdf, other

    cs.CL

    GlobalBench: A Benchmark for Global Progress in Natural Language Processing

    Authors: Yueqi Song, Catherine Cui, Simran Khanuja, Pengfei Liu, Fahim Faisal, Alissa Ostapenko, Genta Indra Winata, Alham Fikri Aji, Samuel Cahyawijaya, Yulia Tsvetkov, Antonios Anastasopoulos, Graham Neubig

    Abstract: Despite the major advances in NLP, significant disparities in NLP system performance across languages still exist. Arguably, these are due to uneven resource allocation and sub-optimal incentives to work on less resourced languages. To track and further incentivize the global development of equitable language technology, we introduce GlobalBench. Prior multilingual benchmarks are static and have f… ▽ More

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: Preprint, 9 pages

  34. arXiv:2305.14235  [pdf, other

    cs.CL cs.AI

    Multilingual Large Language Models Are Not (Yet) Code-Switchers

    Authors: Ruochen Zhang, Samuel Cahyawijaya, Jan Christian Blaise Cruz, Genta Indra Winata, Alham Fikri Aji

    Abstract: Multilingual Large Language Models (LLMs) have recently shown great capabilities in a wide range of tasks, exhibiting state-of-the-art performance through zero-shot or few-shot prompting methods. While there have been extensive studies on their abilities in monolingual tasks, the investigation of their potential in the context of code-switching (CSW), the practice of alternating languages within a… ▽ More

    Submitted 23 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: Accepted at EMNLP 2023

  35. arXiv:2305.13627  [pdf, other

    cs.CL cs.AI

    InstructAlign: High-and-Low Resource Language Alignment via Continual Crosslingual Instruction Tuning

    Authors: Samuel Cahyawijaya, Holy Lovenia, Tiezheng Yu, Willy Chung, Pascale Fung

    Abstract: Large language models (LLMs) that are tuned with instructions have demonstrated remarkable capabilities in various tasks and languages. However, their ability to generalize to underrepresented languages is limited due to the scarcity of available data. Additionally, directly adapting new languages to instruction-tuned LLMs can result in catastrophic forgetting, which leads to the loss of multitask… ▽ More

    Submitted 24 October, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

  36. arXiv:2303.13592  [pdf, other

    cs.CL cs.AI

    Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages

    Authors: Zheng-Xin Yong, Ruochen Zhang, Jessica Zosa Forde, Skyler Wang, Arjun Subramonian, Holy Lovenia, Samuel Cahyawijaya, Genta Indra Winata, Lintang Sutawika, Jan Christian Blaise Cruz, Yin Lin Tan, Long Phan, Rowena Garcia, Thamar Solorio, Alham Fikri Aji

    Abstract: While code-mixing is a common linguistic practice in many parts of the world, collecting high-quality and low-cost code-mixed data remains a challenge for natural language processing (NLP) research. The recent proliferation of Large Language Models (LLMs) compels one to ask: how capable are these systems in generating code-mixed data? In this paper, we explore prompting multilingual LLMs in a zero… ▽ More

    Submitted 12 September, 2023; v1 submitted 23 March, 2023; originally announced March 2023.

    Comments: Updating Authors

  37. arXiv:2302.14680  [pdf, other

    cs.CL cs.AI cs.CV

    Which One Are You Referring To? Multimodal Object Identification in Situated Dialogue

    Authors: Holy Lovenia, Samuel Cahyawijaya, Pascale Fung

    Abstract: The demand for multimodal dialogue systems has been rising in various domains, emphasizing the importance of interpreting multimodal inputs from conversational and situational contexts. We explore three methods to tackle this problem and evaluate them on the largest situated dialogue dataset, SIMMC 2.1. Our best method, scene-dialogue alignment, improves the performance by ~20% F1-score compared t… ▽ More

    Submitted 15 March, 2023; v1 submitted 28 February, 2023; originally announced February 2023.

    Comments: Accepted at EACL SRW 2023

  38. arXiv:2302.04023  [pdf, other

    cs.CL cs.AI

    A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

    Authors: Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, Pascale Fung

    Abstract: This paper proposes a framework for quantitatively evaluating interactive LLMs such as ChatGPT using publicly available data sets. We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks. We evaluate the multitask, multilingual and multi-modal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset.… ▽ More

    Submitted 28 November, 2023; v1 submitted 8 February, 2023; originally announced February 2023.

    Comments: 45 pages, AACL 2023

  39. arXiv:2301.11813  [pdf, other

    eess.IV cs.CV

    Biomedical Image Reconstruction: A Survey

    Authors: Samuel Cahyawijaya

    Abstract: Biomedical image reconstruction research has been developed for more than five decades, giving rise to various techniques such as central and filtered back projection. With the rise of deep learning technology, biomedical image reconstruction field has undergone a massive paradigm shift from analytical and iterative methods to deep learning methods To drive scientific discussion on advanced deep l… ▽ More

    Submitted 4 January, 2023; originally announced January 2023.

  40. arXiv:2212.09648  [pdf, other

    cs.CL cs.AI

    NusaCrowd: Open Source Initiative for Indonesian NLP Resources

    Authors: Samuel Cahyawijaya, Holy Lovenia, Alham Fikri Aji, Genta Indra Winata, Bryan Wilie, Rahmad Mahendra, Christian Wibisono, Ade Romadhony, Karissa Vincentio, Fajri Koto, Jennifer Santoso, David Moeljadi, Cahya Wirawan, Frederikus Hudi, Ivan Halim Parmonangan, Ika Alfina, Muhammad Satrio Wicaksono, Ilham Firdausi Putra, Samsul Rahmadani, Yulianti Oenang, Ali Akbar Septiandri, James Jaya, Kaustubh D. Dhole, Arie Ardiyanti Suryani, Rifki Afina Putri , et al. (22 additional authors not shown)

    Abstract: We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple exp… ▽ More

    Submitted 21 July, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

  41. arXiv:2211.07713  [pdf, other

    cs.CL cs.AI

    How Long Is Enough? Exploring the Optimal Intervals of Long-Range Clinical Note Language Modeling

    Authors: Samuel Cahyawijaya, Bryan Wilie, Holy Lovenia, Huan Zhong, MingQian Zhong, Yuk-Yu Nancy Ip, Pascale Fung

    Abstract: Large pre-trained language models (LMs) have been widely adopted in biomedical and clinical domains, introducing many powerful LMs such as bio-lm and BioELECTRA. However, the applicability of these methods to real clinical use cases is hindered, due to the limitation of pre-trained LMs in processing long textual data with thousands of words, which is a common length for a clinical note. In this wo… ▽ More

    Submitted 25 October, 2022; originally announced November 2022.

  42. arXiv:2211.05100  [pdf, other

    cs.CL

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Authors: BigScience Workshop, :, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major , et al. (369 additional authors not shown)

    Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access… ▽ More

    Submitted 27 June, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

  43. arXiv:2209.01638  [pdf, other

    cs.CL

    Every picture tells a story: Image-grounded controllable stylistic story generation

    Authors: Holy Lovenia, Bryan Wilie, Romain Barraud, Samuel Cahyawijaya, Willy Chung, Pascale Fung

    Abstract: Generating a short story out of an image is arduous. Unlike image captioning, story generation from an image poses multiple challenges: preserving the story coherence, appropriately assessing the quality of the story, steering the generated story into a certain style, and addressing the scarcity of image-story pair reference datasets limiting supervision during training. In this work, we introduce… ▽ More

    Submitted 11 September, 2022; v1 submitted 4 September, 2022; originally announced September 2022.

    Comments: Accepted in LaTeCH-CLfL 2022 (6th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature), COLING 2022

  44. arXiv:2207.10524  [pdf, other

    cs.CL cs.AI

    NusaCrowd: A Call for Open and Reproducible NLP Research in Indonesian Languages

    Authors: Samuel Cahyawijaya, Alham Fikri Aji, Holy Lovenia, Genta Indra Winata, Bryan Wilie, Rahmad Mahendra, Fajri Koto, David Moeljadi, Karissa Vincentio, Ade Romadhony, Ayu Purwarianti

    Abstract: At the center of the underlying issues that halt Indonesian natural language processing (NLP) research advancement, we find data scarcity. Resources in Indonesian languages, especially the local ones, are extremely scarce and underrepresented. Many Indonesian researchers do not publish their dataset. Furthermore, the few public datasets that we have are scattered across different platforms, thus m… ▽ More

    Submitted 1 August, 2022; v1 submitted 21 July, 2022; originally announced July 2022.

  45. arXiv:2207.02663  [pdf, other

    cs.CL cs.SD eess.AS

    Kaggle Competition: Cantonese Audio-Visual Speech Recognition for In-car Commands

    Authors: Wenliang Dai, Samuel Cahyawijaya, Tiezheng Yu, Elham J Barezi, Pascale Fung

    Abstract: With the rise of deep learning and intelligent vehicles, the smart assistant has become an essential in-car component to facilitate driving and provide extra functionalities. In-car smart assistants should be able to process general as well as car-related commands and perform corresponding actions, which eases driving and improves safety. However, in this research field, most datasets are in major… ▽ More

    Submitted 6 July, 2022; originally announced July 2022.

  46. arXiv:2206.15076  [pdf, other

    cs.CL

    BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing

    Authors: Jason Alan Fries, Leon Weber, Natasha Seelam, Gabriel Altay, Debajyoti Datta, Samuele Garda, Myungsun Kang, Ruisi Su, Wojciech Kusa, Samuel Cahyawijaya, Fabio Barth, Simon Ott, Matthias Samwald, Stephen Bach, Stella Biderman, Mario Sänger, Bo Wang, Alison Callahan, Daniel León Periñán, Théo Gigant, Patrick Haller, Jenny Chim, Jose David Posada, John Michael Giorgi, Karthik Rangasai Sivaraman , et al. (18 additional authors not shown)

    Abstract: Training and evaluating language models increasingly requires the construction of meta-datasets --diverse collections of curated data with clear provenance. Natural language prompting has recently lead to improved zero-shot generalization by transforming existing, supervised datasets into a diversity of novel pretraining tasks, highlighting the benefits of meta-dataset curation. While successful i… ▽ More

    Submitted 30 June, 2022; originally announced June 2022.

    Comments: Submitted to NeurIPS 2022 Datasets and Benchmarks Track

  47. arXiv:2206.11249  [pdf, other

    cs.CL cs.AI cs.LG

    GEMv2: Multilingual NLG Benchmarking in a Single Line of Code

    Authors: Sebastian Gehrmann, Abhik Bhattacharjee, Abinaya Mahendiran, Alex Wang, Alexandros Papangelis, Aman Madaan, Angelina McMillan-Major, Anna Shvets, Ashish Upadhyay, Bingsheng Yao, Bryan Wilie, Chandra Bhagavatula, Chaobin You, Craig Thomson, Cristina Garbacea, Dakuo Wang, Daniel Deutsch, Deyi Xiong, Di Jin, Dimitra Gkatzia, Dragomir Radev, Elizabeth Clark, Esin Durmus, Faisal Ladhak, Filip Ginter , et al. (52 additional authors not shown)

    Abstract: Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables the comparison on equal footing using leaderboards, but the evaluation choices become sub-optimal as better alternatives arise. This problem is especially pertinent in natural language generation which requires ever-improving suites of datasets, metrics, an… ▽ More

    Submitted 24 June, 2022; v1 submitted 22 June, 2022; originally announced June 2022.

  48. arXiv:2205.15960  [pdf, other

    cs.CL

    NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages

    Authors: Genta Indra Winata, Alham Fikri Aji, Samuel Cahyawijaya, Rahmad Mahendra, Fajri Koto, Ade Romadhony, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Pascale Fung, Timothy Baldwin, Jey Han Lau, Rico Sennrich, Sebastian Ruder

    Abstract: Natural language processing (NLP) has a significant impact on society via technologies such as machine translation and search engines. Despite its success, NLP technology is only widely available for high-resource languages such as English and Chinese, while it remains inaccessible to many languages due to the unavailability of data resources and benchmarks. In this work, we focus on developing re… ▽ More

    Submitted 12 April, 2023; v1 submitted 31 May, 2022; originally announced May 2022.

    Comments: EACL 2023

  49. arXiv:2205.05989  [pdf, other

    cs.CL cs.AI cs.LG

    Towards Answering Open-ended Ethical Quandary Questions

    Authors: Yejin Bang, Nayeon Lee, Tiezheng Yu, Leila Khalatbari, Yan Xu, Samuel Cahyawijaya, Dan Su, Bryan Wilie, Romain Barraud, Elham J. Barezi, Andrea Madotto, Hayden Kee, Pascale Fung

    Abstract: Considerable advancements have been made in various NLP tasks based on the impressive power of large language models (LLMs) and many NLP applications are deployed in our daily lives. In this work, we challenge the capability of LLMs with the new task of Ethical Quandary Generative Question Answering. Ethical quandary questions are more challenging to address because multiple conflicting answers ma… ▽ More

    Submitted 1 February, 2023; v1 submitted 12 May, 2022; originally announced May 2022.

    Comments: 16 pages

  50. arXiv:2204.06699  [pdf, other

    cs.LG cs.AI

    SNP2Vec: Scalable Self-Supervised Pre-Training for Genome-Wide Association Study

    Authors: Samuel Cahyawijaya, Tiezheng Yu, Zihan Liu, Tiffany T. W. Mak, Xiaopu Zhou, Nancy Y. Ip, Pascale Fung

    Abstract: Self-supervised pre-training methods have brought remarkable breakthroughs in the understanding of text, image, and speech. Recent developments in genomics has also adopted these pre-training methods for genome understanding. However, they focus only on understanding haploid sequences, which hinders their applicability towards understanding genetic variations, also known as single nucleotide polym… ▽ More

    Submitted 13 April, 2022; originally announced April 2022.