Skip to main content

Showing 1–50 of 96 results for author: Aji, A F

Searching in archive cs. Search in all archives.
.
  1. arXiv:2507.04415  [pdf, ps, other

    cs.CL

    MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind

    Authors: Emilio Villa-Cueva, S M Masrur Ahmed, Rendi Chevi, Jan Christian Blaise Cruz, Kareem Elzeky, Fermin Cristobal, Alham Fikri Aji, Skyler Wang, Rada Mihalcea, Thamar Solorio

    Abstract: Understanding Theory of Mind is essential for building socially intelligent multimodal agents capable of perceiving and interpreting human behavior. We introduce MOMENTS (Multimodal Mental States), a comprehensive benchmark designed to assess the ToM capabilities of multimodal large language models (LLMs) through realistic, narrative-rich scenarios presented in short films. MOMENTS includes over 2… ▽ More

    Submitted 6 July, 2025; originally announced July 2025.

  2. arXiv:2506.12450  [pdf, ps, other

    cs.CL

    Language Surgery in Multilingual Large Language Models

    Authors: Joanito Agili Lopo, Muhammad Ravi Shulthan Habibi, Tack Hwa Wong, Muhammad Ilham Ghozali, Fajri Koto, Genta Indra Winata, Peerat Limkonchotiwat, Alham Fikri Aji, Samuel Cahyawijaya

    Abstract: Large Language Models (LLMs) have demonstrated remarkable generalization capabilities across tasks and languages, revolutionizing natural language processing. This paper investigates the naturally emerging representation alignment in LLMs, particularly in the middle layers, and its implications for disentangling language-specific and language-agnostic information. We empirically confirm the existe… ▽ More

    Submitted 14 June, 2025; originally announced June 2025.

  3. arXiv:2506.01789  [pdf, ps, other

    cs.LG cs.AI cs.CL cs.CV eess.AS

    Datasheets Aren't Enough: DataRubrics for Automated Quality Metrics and Accountability

    Authors: Genta Indra Winata, David Anugraha, Emmy Liu, Alham Fikri Aji, Shou-Yi Hung, Aditya Parashar, Patrick Amadeus Irawan, Ruochen Zhang, Zheng-Xin Yong, Jan Christian Blaise Cruz, Niklas Muennighoff, Seungone Kim, Hanyang Zhao, Sudipta Kar, Kezia Erina Suryoraharjo, M. Farid Adilazuarda, En-Shiun Annie Lee, Ayu Purwarianti, Derry Tanti Wijaya, Monojit Choudhury

    Abstract: High-quality datasets are fundamental to training and evaluating machine learning models, yet their creation-especially with accurate human annotations-remains a significant challenge. Many dataset paper submissions lack originality, diversity, or rigorous quality control, and these shortcomings are often overlooked during peer review. Submissions also frequently omit essential details about datas… ▽ More

    Submitted 3 June, 2025; v1 submitted 2 June, 2025; originally announced June 2025.

    Comments: Preprint

  4. arXiv:2506.01592  [pdf, ps, other

    cs.CL

    Statement-Tuning Enables Efficient Cross-lingual Generalization in Encoder-only Models

    Authors: Ahmed Elshabrawy, Thanh-Nhi Nguyen, Yeeun Kang, Lihan Feng, Annant Jain, Faadil Abdullah Shaikh, Jonibek Mansurov, Mohamed Fazli Mohamed Imam, Jesus-German Ortiz-Barajas, Rendi Chevi, Alham Fikri Aji

    Abstract: Large Language Models (LLMs) excel in zero-shot and few-shot tasks, but achieving similar performance with encoder-only models like BERT and RoBERTa has been challenging due to their architecture. However, encoders offer advantages such as lower computational and memory costs. Recent work adapts them for zero-shot generalization using Statement Tuning, which reformulates tasks into finite template… ▽ More

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: Accepted to ACL 2025 (Findings)

  5. arXiv:2505.24456  [pdf, ps, other

    cs.CL

    CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation

    Authors: Emilio Villa-Cueva, Sholpan Bolatzhanova, Diana Turmakhan, Kareem Elzeky, Henok Biadglign Ademtew, Alham Fikri Aji, Israel Abebe Azime, Jinheon Baek, Frederico Belcavello, Fermin Cristobal, Jan Christian Blaise Cruz, Mary Dabre, Raj Dabre, Toqeer Ehsan, Naome A Etori, Fauzan Farooqui, Jiahui Geng, Guido Ivetta, Thanmay Jayakumar, Soyeong Jeong, Zheng Wei Lim, Aishik Mandal, Sofia Martinelli, Mihail Minkov Mihaylov, Daniil Orel , et al. (9 additional authors not shown)

    Abstract: Cultural content poses challenges for machine translation systems due to the differences in conceptualizations between cultures, where language alone may fail to convey sufficient context to capture region-specific meanings. In this work, we investigate whether images can act as cultural context in multimodal translation. We introduce CaMMT, a human-curated benchmark of over 5,800 triples of image… ▽ More

    Submitted 30 May, 2025; originally announced May 2025.

  6. arXiv:2505.16408  [pdf, ps, other

    cs.CL

    From Surveys to Narratives: Rethinking Cultural Value Adaptation in LLMs

    Authors: Muhammad Farid Adilazuarda, Chen Cecilia Liu, Iryna Gurevych, Alham Fikri Aji

    Abstract: Adapting cultural values in Large Language Models (LLMs) presents significant challenges, particularly due to biases and limited training data. Prior work primarily aligns LLMs with different cultural values using World Values Survey (WVS) data. However, it remains unclear whether this approach effectively captures cultural nuances or produces distinct cultural representations for various downstre… ▽ More

    Submitted 22 May, 2025; originally announced May 2025.

  7. arXiv:2505.13141  [pdf, ps, other

    cs.CL

    Language-Specific Latent Process Hinders Cross-Lingual Performance

    Authors: Zheng Wei Lim, Alham Fikri Aji, Trevor Cohn

    Abstract: Large language models (LLMs) are demonstrably capable of cross-lingual transfer, but can produce inconsistent output when prompted with the same queries written in different languages. To understand how language models are able to generalize knowledge from one language to the others, we apply the logit lens to interpret the implicit steps taken by LLMs to solve multilingual multi-choice reasoning… ▽ More

    Submitted 27 May, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

  8. arXiv:2505.08910  [pdf, ps, other

    cs.CV cs.CL

    Behind Maya: Building a Multilingual Vision Language Model

    Authors: Nahid Alam, Karthik Reddy Kanjula, Surya Guthikonda, Timothy Chung, Bala Krishna S Vegesna, Abhipsha Das, Anthony Susevski, Ryan Sze-Yin Chan, S M Iftekhar Uddin, Shayekh Bin Islam, Roshan Santhosh, Snegha A, Drishti Sharma, Chen Liu, Isha Chaturvedi, Genta Indra Winata, Ashvanth. S, Snehanshu Mukherjee, Alham Fikri Aji

    Abstract: In recent times, we have seen a rapid development of large Vision-Language Models (VLMs). They have shown impressive results on academic benchmarks, primarily in widely spoken languages but lack performance on low-resource languages and varied cultural contexts. To address these limitations, we introduce Maya, an open-source Multilingual VLM. Our contributions are: 1) a multilingual image-text pre… ▽ More

    Submitted 15 May, 2025; v1 submitted 13 May, 2025; originally announced May 2025.

    Comments: Accepted at VLMs4ALL CVPR 2025 Workshop; corrected workshop name spelling

  9. arXiv:2505.05408  [pdf, other

    cs.CL cs.AI cs.LG

    Crosslingual Reasoning through Test-Time Scaling

    Authors: Zheng-Xin Yong, M. Farid Adilazuarda, Jonibek Mansurov, Ruochen Zhang, Niklas Muennighoff, Carsten Eickhoff, Genta Indra Winata, Julia Kreutzer, Stephen H. Bach, Alham Fikri Aji

    Abstract: Reasoning capabilities of large language models are primarily studied for English, even when pretrained models are multilingual. In this work, we investigate to what extent English reasoning finetuning with long chain-of-thoughts (CoTs) can generalize across languages. First, we find that scaling up inference compute for English-centric reasoning language models (RLMs) improves multilingual mathem… ▽ More

    Submitted 8 May, 2025; originally announced May 2025.

  10. arXiv:2504.20966  [pdf, ps, other

    cs.LG

    Softpick: No Attention Sink, No Massive Activations with Rectified Softmax

    Authors: Zayd M. K. Zuhri, Erland Hilman Fuadi, Alham Fikri Aji

    Abstract: We introduce softpick, a rectified, not sum-to-one, drop-in replacement for softmax in transformer attention mechanisms that eliminates attention sink and massive activations. Our experiments with 340M and 1.8B parameter models demonstrate that softpick achieves 0\% sink rate consistently. The softpick transformers produce hidden states with significantly lower kurtosis and creates sparse attentio… ▽ More

    Submitted 30 May, 2025; v1 submitted 29 April, 2025; originally announced April 2025.

    Comments: Updated to include experiments on 1.8B parameter models

  11. arXiv:2504.17083  [pdf, other

    cs.CL

    How Individual Traits and Language Styles Shape Preferences In Open-ended User-LLM Interaction: A Preliminary Study

    Authors: Rendi Chevi, Kentaro Inui, Thamar Solorio, Alham Fikri Aji

    Abstract: What makes an interaction with the LLM more preferable for the user? While it is intuitive to assume that information accuracy in the LLM's responses would be one of the influential variables, recent studies have found that inaccurate LLM's responses could still be preferable when they are perceived to be more authoritative, certain, well-articulated, or simply verbose. These variables interesting… ▽ More

    Submitted 23 April, 2025; originally announced April 2025.

    Comments: Accepted at GenAICHI 2025 @ ACM CHI 2025

  12. arXiv:2503.07920  [pdf, other

    cs.CV cs.AI cs.CL

    Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia

    Authors: Samuel Cahyawijaya, Holy Lovenia, Joel Ruben Antony Moniz, Tack Hwa Wong, Mohammad Rifqi Farhansyah, Thant Thiri Maung, Frederikus Hudi, David Anugraha, Muhammad Ravi Shulthan Habibi, Muhammad Reza Qorib, Amit Agarwal, Joseph Marvin Imperial, Hitesh Laxmichand Patel, Vicky Feliren, Bahrul Ilmi Nasution, Manuel Antonio Rufino, Genta Indra Winata, Rian Adam Rajagede, Carlos Rafael Catalan, Mohamed Fazli Imam, Priyaranjan Pattnayak, Salsabila Zahirah Pranida, Kevin Pratama, Yeshil Bangera, Adisai Na-Thalang , et al. (67 additional authors not shown)

    Abstract: Southeast Asia (SEA) is a region of extraordinary linguistic and cultural diversity, yet it remains significantly underrepresented in vision-language (VL) research. This often results in artificial intelligence (AI) models that fail to capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing high-quality, culturally relevant data for SEA… ▽ More

    Submitted 18 March, 2025; v1 submitted 10 March, 2025; originally announced March 2025.

    Comments: [SEA-VL Dataset] https://huggingface.co/collections/SEACrowd/sea-vl-multicultural-vl-dataset-for-southeast-asia-67cf223d0c341d4ba2b236e7 [Appendix J] https://github.com/SEACrowd/seacrowd.github.io/blob/master/docs/SEA_VL_Appendix_J.pdf

  13. arXiv:2503.07269  [pdf, other

    cs.CL

    SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection

    Authors: Shamsuddeen Hassan Muhammad, Nedjma Ousidhoum, Idris Abdulmumin, Seid Muhie Yimam, Jan Philip Wahle, Terry Ruas, Meriem Beloucif, Christine De Kock, Tadesse Destaw Belay, Ibrahim Said Ahmad, Nirmal Surange, Daniela Teodorescu, David Ifeoluwa Adelani, Alham Fikri Aji, Felermino Ali, Vladimir Araujo, Abinew Ali Ayele, Oana Ignat, Alexander Panchenko, Yi Zhou, Saif M. Mohammad

    Abstract: We present our shared task on text-based emotion detection, covering more than 30 languages from seven distinct language families. These languages are predominantly low-resource and are spoken across various continents. The data instances are multi-labeled with six emotional classes, with additional datasets in 11 languages annotated for emotion intensity. Participants were asked to predict labels… ▽ More

    Submitted 24 April, 2025; v1 submitted 10 March, 2025; originally announced March 2025.

    Comments: SemEval2025 Task11 (Task Description Paper). arXiv admin note: text overlap with arXiv:2502.11926

  14. arXiv:2503.06291  [pdf, other

    cs.CL

    IteRABRe: Iterative Recovery-Aided Block Reduction

    Authors: Haryo Akbarianto Wibowo, Haiyue Song, Hideki Tanaka, Masao Utiyama, Alham Fikri Aji, Raj Dabre

    Abstract: Large Language Models (LLMs) have grown increasingly expensive to deploy, driving the need for effective model compression techniques. While block pruning offers a straightforward approach to reducing model size, existing methods often struggle to maintain performance or require substantial computational resources for recovery. We present IteRABRe, a simple yet effective iterative pruning method t… ▽ More

    Submitted 8 March, 2025; originally announced March 2025.

    Comments: 8 pages

    MSC Class: 68T50

  15. arXiv:2503.01493  [pdf, other

    cs.CL

    Llama-3.1-Sherkala-8B-Chat: An Open Large Language Model for Kazakh

    Authors: Fajri Koto, Rituraj Joshi, Nurdaulet Mukhituly, Yuxia Wang, Zhuohan Xie, Rahul Pal, Daniil Orel, Parvez Mullah, Diana Turmakhan, Maiya Goloburda, Mohammed Kamran, Samujjwal Ghosh, Bokang Jia, Jonibek Mansurov, Mukhammed Togmanov, Debopriyo Banerjee, Nurkhan Laiyk, Akhmed Sakip, Xudong Han, Ekaterina Kochmar, Alham Fikri Aji, Aaryamonvikram Singh, Alok Anil Jadhav, Satheesh Katipomu, Samta Kamboj , et al. (10 additional authors not shown)

    Abstract: Llama-3.1-Sherkala-8B-Chat, or Sherkala-Chat (8B) for short, is a state-of-the-art instruction-tuned open generative large language model (LLM) designed for Kazakh. Sherkala-Chat (8B) aims to enhance the inclusivity of LLM advancements for Kazakh speakers. Adapted from the LLaMA-3.1-8B model, Sherkala-Chat (8B) is trained on 45.3B tokens across Kazakh, English, Russian, and Turkish. With 8 billion… ▽ More

    Submitted 3 March, 2025; originally announced March 2025.

    Comments: Technical Report

  16. arXiv:2502.20864  [pdf, ps, other

    cs.CL

    Do Language Models Understand Honorific Systems in Javanese?

    Authors: Mohammad Rifqi Farhansyah, Iwan Darmawan, Adryan Kusumawardhana, Genta Indra Winata, Alham Fikri Aji, Derry Tanti Wijaya

    Abstract: The Javanese language features a complex system of honorifics that vary according to the social status of the speaker, listener, and referent. Despite its cultural and linguistic significance, there has been limited progress in developing a comprehensive corpus to capture these variations for natural language processing (NLP) tasks. In this paper, we present Unggah-Ungguh, a carefully curated data… ▽ More

    Submitted 29 May, 2025; v1 submitted 28 February, 2025; originally announced February 2025.

    Comments: ACL 2025 - Main Conference

  17. arXiv:2502.18431  [pdf, other

    cs.CL cs.AI

    TextGames: Learning to Self-Play Text-Based Puzzle Games via Language Model Reasoning

    Authors: Frederikus Hudi, Genta Indra Winata, Ruochen Zhang, Alham Fikri Aji

    Abstract: Reasoning is a fundamental capability of large language models (LLMs), enabling them to comprehend, analyze, and solve complex problems. In this paper, we introduce TextGames, an innovative benchmark specifically crafted to assess LLMs through demanding text-based games that require advanced skills in pattern recognition, spatial awareness, arithmetic, and logical reasoning. Our analysis probes LL… ▽ More

    Submitted 25 February, 2025; originally announced February 2025.

  18. arXiv:2502.18148  [pdf

    cs.CL

    NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts

    Authors: Muhammad Farid Adilazuarda, Musa Izzanardi Wijanarko, Lucky Susanto, Khumaisa Nur'aini, Derry Wijaya, Alham Fikri Aji

    Abstract: Indonesia is rich in languages and scripts. However, most NLP progress has been made using romanized text. In this paper, we present NusaAksara, a novel public benchmark for Indonesian languages that includes their original scripts. Our benchmark covers both text and image modalities and encompasses diverse tasks such as image segmentation, OCR, transliteration, translation, and language identific… ▽ More

    Submitted 25 February, 2025; originally announced February 2025.

  19. arXiv:2502.12829  [pdf, other

    cs.CL

    KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan

    Authors: Mukhammed Togmanov, Nurdaulet Mukhituly, Diana Turmakhan, Jonibek Mansurov, Maiya Goloburda, Akhmed Sakip, Zhuohan Xie, Yuxia Wang, Bekassyl Syzdykov, Nurkhan Laiyk, Alham Fikri Aji, Ekaterina Kochmar, Preslav Nakov, Fajri Koto

    Abstract: Despite having a population of twenty million, Kazakhstan's culture and language remain underrepresented in the field of natural language processing. Although large language models (LLMs) continue to advance worldwide, progress in Kazakh language has been limited, as seen in the scarcity of dedicated models and benchmark evaluations. To address this gap, we introduce KazMMLU, the first MMLU-style… ▽ More

    Submitted 18 February, 2025; originally announced February 2025.

  20. arXiv:2502.11926  [pdf, ps, other

    cs.CL

    BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages

    Authors: Shamsuddeen Hassan Muhammad, Nedjma Ousidhoum, Idris Abdulmumin, Jan Philip Wahle, Terry Ruas, Meriem Beloucif, Christine de Kock, Nirmal Surange, Daniela Teodorescu, Ibrahim Said Ahmad, David Ifeoluwa Adelani, Alham Fikri Aji, Felermino D. M. A. Ali, Ilseyar Alimova, Vladimir Araujo, Nikolay Babakov, Naomi Baes, Ana-Maria Bucur, Andiswa Bukula, Guanqun Cao, Rodrigo Tufino Cardenas, Rendi Chevi, Chiamaka Ijeoma Chukwuneke, Alexandra Ciobotaru, Daryna Dementieva , et al. (23 additional authors not shown)

    Abstract: People worldwide use language in subtle and complex ways to express emotions. Although emotion recognition--an umbrella term for several NLP tasks--impacts various applications within NLP and beyond, most work in this area has focused on high-resource languages. This has led to significant disparities in research efforts and proposed solutions, particularly for under-resourced languages, which oft… ▽ More

    Submitted 29 May, 2025; v1 submitted 17 February, 2025; originally announced February 2025.

    Comments: Accepted at ACL2025 (Main)

  21. arXiv:2502.11614  [pdf, other

    cs.CL cs.AI

    Is Human-Like Text Liked by Humans? Multilingual Human Detection and Preference Against AI

    Authors: Yuxia Wang, Rui Xing, Jonibek Mansurov, Giovanni Puccetti, Zhuohan Xie, Minh Ngoc Ta, Jiahui Geng, Jinyan Su, Mervat Abassy, Saad El Dine Ahmed, Kareem Elozeiri, Nurkhan Laiyk, Maiya Goloburda, Tarek Mahmoud, Raj Vardhan Tomar, Alexander Aziz, Ryuto Koike, Masahiro Kaneko, Artem Shelmanov, Ekaterina Artemova, Vladislav Mikhailov, Akim Tsvigun, Alham Fikri Aji, Nizar Habash, Iryna Gurevych , et al. (1 additional authors not shown)

    Abstract: Prior studies have shown that distinguishing text generated by large language models (LLMs) from human-written one is highly challenging, and often no better than random guessing. To verify the generalizability of this finding across languages and domains, we perform an extensive case study to identify the upper bound of human detection accuracy. Across 16 datasets covering 9 languages and 9 domai… ▽ More

    Submitted 23 May, 2025; v1 submitted 17 February, 2025; originally announced February 2025.

  22. arXiv:2502.11495  [pdf, other

    cs.CL

    Balanced Multi-Factor In-Context Learning for Multilingual Large Language Models

    Authors: Masahiro Kaneko, Alham Fikri Aji, Timothy Baldwin

    Abstract: Multilingual large language models (MLLMs) are able to leverage in-context learning (ICL) to achieve high performance by leveraging cross-lingual knowledge transfer without parameter updates. However, their effectiveness is highly sensitive to example selection, particularly in multilingual settings. Based on the findings of existing work, three key factors influence multilingual ICL: (1) semantic… ▽ More

    Submitted 17 February, 2025; originally announced February 2025.

  23. arXiv:2502.01703  [pdf, other

    cs.LG cs.AI cs.CL

    QLESS: A Quantized Approach for Data Valuation and Selection in Large Language Model Fine-Tuning

    Authors: Moses Ananta, Muhammad Farid Adilazuarda, Zayd Muhammad Kawakibi Zuhri, Ayu Purwarianti, Alham Fikri Aji

    Abstract: Fine-tuning large language models (LLMs) is often constrained by the computational costs of processing massive datasets. We propose \textbf{QLESS} (Quantized Low-rank Gradient Similarity Search), which integrates gradient quantization with the LESS framework to enable memory-efficient data valuation and selection. QLESS employs a two-step compression process: first, it obtains low-dimensional grad… ▽ More

    Submitted 3 February, 2025; originally announced February 2025.

  24. arXiv:2501.14249  [pdf, other

    cs.LG cs.AI cs.CL

    Humanity's Last Exam

    Authors: Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Dmitry Dodonov, Tung Nguyen, Jaeho Lee, Daron Anderson, Mikhail Doroshenko, Alun Cennyth Stokes , et al. (1084 additional authors not shown)

    Abstract: Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of… ▽ More

    Submitted 19 April, 2025; v1 submitted 24 January, 2025; originally announced January 2025.

    Comments: 29 pages, 6 figures

  25. arXiv:2501.12660  [pdf, other

    cs.CL

    Extracting General-use Transformers for Low-resource Languages via Knowledge Distillation

    Authors: Jan Christian Blaise Cruz, Alham Fikri Aji

    Abstract: In this paper, we propose the use of simple knowledge distillation to produce smaller and more efficient single-language transformers from Massively Multilingual Transformers (MMTs) to alleviate tradeoffs associated with the use of such in low-resource settings. Using Tagalog as a case study, we show that these smaller single-language models perform on-par with strong baselines in a variety of ben… ▽ More

    Submitted 22 January, 2025; originally announced January 2025.

    Comments: LoResLM Workshop @ COLING 2025

  26. arXiv:2501.11012  [pdf, other

    cs.CL

    GenAI Content Detection Task 1: English and Multilingual Machine-Generated Text Detection: AI vs. Human

    Authors: Yuxia Wang, Artem Shelmanov, Jonibek Mansurov, Akim Tsvigun, Vladislav Mikhailov, Rui Xing, Zhuohan Xie, Jiahui Geng, Giovanni Puccetti, Ekaterina Artemova, Jinyan Su, Minh Ngoc Ta, Mervat Abassy, Kareem Ashraf Elozeiri, Saad El Dine Ahmed El Etter, Maiya Goloburda, Tarek Mahmoud, Raj Vardhan Tomar, Nurkhan Laiyk, Osama Mohammed Afzal, Ryuto Koike, Masahiro Kaneko, Alham Fikri Aji, Nizar Habash, Iryna Gurevych , et al. (1 additional authors not shown)

    Abstract: We present the GenAI Content Detection Task~1 -- a shared task on binary machine generated text detection, conducted as a part of the GenAI workshop at COLING 2025. The task consists of two subtasks: Monolingual (English) and Multilingual. The shared task attracted many participants: 36 teams made official submissions to the Monolingual subtask during the test phase and 26 teams -- to the Multilin… ▽ More

    Submitted 22 February, 2025; v1 submitted 19 January, 2025; originally announced January 2025.

    Comments: 18 pages

    Journal ref: Proceedings of the 1stWorkshop on GenAI Content Detection (GenAIDetect) in COLING 2025

  27. arXiv:2501.10674  [pdf, other

    cs.CV cs.CL

    Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No!

    Authors: Mohamed Fazli Imam, Chenyang Lyu, Alham Fikri Aji

    Abstract: Multimodal Large Language Models (MLLMs) have achieved significant advancements in tasks like Visual Question Answering (VQA) by leveraging foundational Large Language Models (LLMs). However, their abilities in specific areas such as visual temporal understanding, which is crucial for comprehending real-world dynamics, remain underexplored. To address this, we propose a challenging evaluation benc… ▽ More

    Submitted 18 February, 2025; v1 submitted 18 January, 2025; originally announced January 2025.

    Comments: Our dataset can be found at \url{https://huggingface.co/datasets/fazliimam/temporal-vqa}

  28. arXiv:2412.15255  [pdf, ps, other

    cs.CL cs.AI

    Data Laundering: Artificially Boosting Benchmark Results through Knowledge Distillation

    Authors: Jonibek Mansurov, Akhmed Sakip, Alham Fikri Aji

    Abstract: In this paper, we show that knowledge distillation can be subverted to manipulate language model benchmark scores, revealing a critical vulnerability in current evaluation practices. We introduce "Data Laundering," a process that enables the covert transfer of benchmark-specific knowledge through seemingly legitimate intermediate training steps. Through extensive experiments with a 2-layer BERT st… ▽ More

    Submitted 4 June, 2025; v1 submitted 15 December, 2024; originally announced December 2024.

    Comments: 14 pages

  29. arXiv:2412.07112  [pdf, other

    cs.CV cs.CL

    Maya: An Instruction Finetuned Multilingual Multimodal Model

    Authors: Nahid Alam, Karthik Reddy Kanjula, Surya Guthikonda, Timothy Chung, Bala Krishna S Vegesna, Abhipsha Das, Anthony Susevski, Ryan Sze-Yin Chan, S M Iftekhar Uddin, Shayekh Bin Islam, Roshan Santhosh, Snegha A, Drishti Sharma, Chen Liu, Isha Chaturvedi, Genta Indra Winata, Ashvanth. S, Snehanshu Mukherjee, Alham Fikri Aji

    Abstract: The rapid development of large Vision-Language Models (VLMs) has led to impressive results on academic benchmarks, primarily in widely spoken languages. However, significant gaps remain in the ability of current VLMs to handle low-resource languages and varied cultural contexts, largely due to a lack of high-quality, diverse, and safety-vetted data. Consequently, these models often struggle to und… ▽ More

    Submitted 9 December, 2024; originally announced December 2024.

  30. arXiv:2411.19346  [pdf, other

    cs.CV cs.CL cs.LG

    CLIP meets DINO for Tuning Zero-Shot Classifier using Unlabeled Image Collections

    Authors: Mohamed Fazli Imam, Rufael Fedaku Marew, Jameel Hassan, Mustansar Fiaz, Alham Fikri Aji, Hisham Cholakkal

    Abstract: In the era of foundation models, CLIP has emerged as a powerful tool for aligning text & visual modalities into a common embedding space. However, the alignment objective used to train CLIP often results in subpar visual features for fine-grained tasks. In contrast, SSL-pretrained models like DINO excel at extracting rich visual features due to their specialized training paradigm. Yet, these SSL m… ▽ More

    Submitted 10 April, 2025; v1 submitted 28 November, 2024; originally announced November 2024.

  31. arXiv:2410.23507  [pdf, other

    cs.CL

    Efficient and Interpretable Grammatical Error Correction with Mixture of Experts

    Authors: Muhammad Reza Qorib, Alham Fikri Aji, Hwee Tou Ng

    Abstract: Error type information has been widely used to improve the performance of grammatical error correction (GEC) models, whether for generating corrections, re-ranking them, or combining GEC models. Combining GEC models that have complementary strengths in correcting different error types is very effective in producing better corrections. However, system combination incurs a high computational cost du… ▽ More

    Submitted 30 October, 2024; originally announced October 2024.

    Comments: Findings of EMNLP 2024

  32. arXiv:2410.21573  [pdf, other

    cs.CL cs.AI

    Thank You, Stingray: Multilingual Large Language Models Can Not (Yet) Disambiguate Cross-Lingual Word Sense

    Authors: Samuel Cahyawijaya, Ruochen Zhang, Holy Lovenia, Jan Christian Blaise Cruz, Elisa Gilbert, Hiroki Nomoto, Alham Fikri Aji

    Abstract: Multilingual large language models (LLMs) have gained prominence, but concerns arise regarding their reliability beyond English. This study addresses the gap in cross-lingual semantic evaluation by introducing a novel benchmark for cross-lingual sense disambiguation, StingrayBench. In this paper, we demonstrate using false friends -- words that are orthographically similar but have completely diff… ▽ More

    Submitted 30 October, 2024; v1 submitted 28 October, 2024; originally announced October 2024.

  33. arXiv:2410.12705  [pdf, other

    cs.CL cs.AI cs.CV

    WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines

    Authors: Genta Indra Winata, Frederikus Hudi, Patrick Amadeus Irawan, David Anugraha, Rifki Afina Putri, Yutong Wang, Adam Nohejl, Ubaidillah Ariq Prathama, Nedjma Ousidhoum, Afifa Amriani, Anar Rzayev, Anirban Das, Ashmari Pramodya, Aulia Adila, Bryan Wilie, Candy Olivia Mawalim, Ching Lam Cheng, Daud Abolade, Emmanuele Chersoni, Enrico Santus, Fariz Ikhwantri, Garry Kuwanto, Hanyang Zhao, Haryo Akbarianto Wibowo, Holy Lovenia , et al. (26 additional authors not shown)

    Abstract: Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering… ▽ More

    Submitted 8 May, 2025; v1 submitted 16 October, 2024; originally announced October 2024.

    Comments: Best Theme Paper at NAACL 2025

  34. arXiv:2409.15380  [pdf, ps, other

    cs.CL cs.AI

    Kalahi: A handcrafted, grassroots cultural LLM evaluation suite for Filipino

    Authors: Jann Railey Montalan, Jian Gang Ngui, Wei Qi Leong, Yosephine Susanto, Hamsawardhini Rengarajan, Alham Fikri Aji, William Chandra Tjhi

    Abstract: Multilingual large language models (LLMs) today may not necessarily provide culturally appropriate and relevant responses to its Filipino users. We introduce Kalahi, a cultural LLM evaluation suite collaboratively created by native Filipino speakers. It is composed of 150 high-quality, handcrafted and nuanced prompts that test LLMs for generations that are relevant to shared Filipino cultural know… ▽ More

    Submitted 28 June, 2025; v1 submitted 20 September, 2024; originally announced September 2024.

    Comments: Accepted for presentation at Paclic 38, 2024

    Journal ref: https://aclanthology.org/2024.paclic-1.49/

  35. arXiv:2408.04284  [pdf, other

    cs.CL

    LLM-DetectAIve: a Tool for Fine-Grained Machine-Generated Text Detection

    Authors: Mervat Abassy, Kareem Elozeiri, Alexander Aziz, Minh Ngoc Ta, Raj Vardhan Tomar, Bimarsha Adhikari, Saad El Dine Ahmed, Yuxia Wang, Osama Mohammed Afzal, Zhuohan Xie, Jonibek Mansurov, Ekaterina Artemova, Vladislav Mikhailov, Rui Xing, Jiahui Geng, Hasan Iqbal, Zain Muhammad Mujahid, Tarek Mahmoud, Akim Tsvigun, Alham Fikri Aji, Artem Shelmanov, Nizar Habash, Iryna Gurevych, Preslav Nakov

    Abstract: The ease of access to large language models (LLMs) has enabled a widespread of machine-generated texts, and now it is often hard to tell whether a piece of text was human-written or machine-generated. This raises concerns about potential misuse, particularly within educational and academic domains. Thus, it is important to develop practical systems that can automate the process. Here, we present o… ▽ More

    Submitted 14 March, 2025; v1 submitted 8 August, 2024; originally announced August 2024.

  36. arXiv:2406.19349  [pdf, ps, other

    cs.CL cs.AI

    IndoToxic2024: A Demographically-Enriched Dataset of Hate Speech and Toxicity Types for Indonesian Language

    Authors: Lucky Susanto, Musa Izzanardi Wijanarko, Prasetia Anugrah Pratama, Traci Hong, Ika Idris, Alham Fikri Aji, Derry Wijaya

    Abstract: Hate speech poses a significant threat to social harmony. Over the past two years, Indonesia has seen a ten-fold increase in the online hate speech ratio, underscoring the urgent need for effective detection mechanisms. However, progress is hindered by the limited availability of labeled data for Indonesian texts. The condition is even worse for marginalized minorities, such as Shia, LGBTQ, and ot… ▽ More

    Submitted 12 June, 2025; v1 submitted 27 June, 2024; originally announced June 2024.

    Comments: This work has been substantially expanded and finalized as IndoDiscourse (see [https://huggingface.co/datasets/Exqrch/IndoDiscourse]). IndoToxic should be considered a draft/precursor version and is no longer maintained

  37. arXiv:2406.16524  [pdf, other

    cs.CL

    The Privileged Students: On the Value of Initialization in Multilingual Knowledge Distillation

    Authors: Haryo Akbarianto Wibowo, Thamar Solorio, Alham Fikri Aji

    Abstract: Knowledge distillation (KD) has proven to be a successful strategy to improve the performance of smaller models in many NLP tasks. However, most of the work in KD only explores monolingual scenarios. In this paper, we investigate the value of KD in multilingual settings. We find the significance of KD and model initialization by analyzing how well the student model acquires multilingual knowledge… ▽ More

    Submitted 8 November, 2024; v1 submitted 24 June, 2024; originally announced June 2024.

    Comments: 8 pages

    MSC Class: 68T50

  38. arXiv:2406.11661  [pdf, other

    cs.CL

    Cultural Conditioning or Placebo? On the Effectiveness of Socio-Demographic Prompting

    Authors: Sagnik Mukherjee, Muhammad Farid Adilazuarda, Sunayana Sitaram, Kalika Bali, Alham Fikri Aji, Monojit Choudhury

    Abstract: Socio-demographic prompting is a commonly employed approach to study cultural biases in LLMs as well as for aligning models to certain cultures. In this paper, we systematically probe four LLMs (Llama 3, Mistral v0.2, GPT-3.5 Turbo and GPT-4) with prompts that are conditioned on culturally sensitive and non-sensitive cues, on datasets that are supposed to be culturally sensitive (EtiCor and CALI)… ▽ More

    Submitted 20 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

  39. arXiv:2406.10118  [pdf, other

    cs.CL

    SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages

    Authors: Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James V. Miranda, Jennifer Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek Mansurov, Joseph Marvin Imperial, Onno P. Kampman, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, Frederikus Hudi, Railey Montalan, Ryan Ignatius, Joanito Agili Lopo, William Nixon, Börje F. Karlsson, James Jaya, Ryandito Diandaru, Yuze Gao, Patrick Amadeus, Bin Wang, Jan Christian Blaise Cruz, Chenxi Whitehouse , et al. (36 additional authors not shown)

    Abstract: Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due t… ▽ More

    Submitted 10 March, 2025; v1 submitted 14 June, 2024; originally announced June 2024.

    Comments: https://seacrowd.github.io/ Published in EMNLP 2024

  40. arXiv:2406.09297  [pdf, other

    cs.LG

    MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding

    Authors: Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji

    Abstract: Auto-regressive inference of transformers benefit greatly from Key-Value (KV) caching, but can lead to major memory bottlenecks as model size, batch size, and sequence length grow at scale. We introduce Multi-Layer Key-Value (MLKV) sharing, a novel approach extending KV sharing across transformer layers to reduce memory usage beyond what was possible with Multi-Query Attention (MQA) and Grouped-Qu… ▽ More

    Submitted 15 October, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

  41. arXiv:2406.05967  [pdf, other

    cs.CV cs.AI cs.CL cs.LG

    CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark

    Authors: David Romero, Chenyang Lyu, Haryo Akbarianto Wibowo, Teresa Lynn, Injy Hamed, Aditya Nanda Kishore, Aishik Mandal, Alina Dragonetti, Artem Abzaliev, Atnafu Lambebo Tonja, Bontu Fufa Balcha, Chenxi Whitehouse, Christian Salamea, Dan John Velasco, David Ifeoluwa Adelani, David Le Meur, Emilio Villa-Cueva, Fajri Koto, Fauzan Farooqui, Frederico Belcavello, Ganzorig Batnasan, Gisela Vallejo, Grainne Caulfield, Guido Ivetta, Haiyue Song , et al. (51 additional authors not shown)

    Abstract: Visual Question Answering (VQA) is an important task in multimodal AI, and it is often used to test the ability of vision-language models to understand and reason on knowledge present in both visual and textual data. However, most of the current VQA models use datasets that are primarily focused on English and a few major world languages, with images that are typically Western-centric. While recen… ▽ More

    Submitted 4 November, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

    Comments: 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks

  42. arXiv:2405.14782  [pdf, other

    cs.CL

    Lessons from the Trenches on Reproducible Evaluation of Language Models

    Authors: Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, Anthony DiPofi, Julen Etxaniz, Benjamin Fattori, Jessica Zosa Forde, Charles Foster, Jeffrey Hsu, Mimansa Jaiswal, Wilson Y. Lee, Haonan Li, Charles Lovering, Niklas Muennighoff, Ellie Pavlick, Jason Phang, Aviya Skowron, Samson Tan , et al. (5 additional authors not shown)

    Abstract: Effective evaluation of language models remains an open challenge in NLP. Researchers and engineers face methodological issues such as the sensitivity of models to evaluation setup, difficulty of proper comparisons across methods, and the lack of reproducibility and transparency. In this paper we draw on three years of experience in evaluating large language models to provide guidance and lessons… ▽ More

    Submitted 29 May, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

  43. arXiv:2404.17342  [pdf, other

    cs.CL cs.AI

    From Multiple-Choice to Extractive QA: A Case Study for English and Arabic

    Authors: Teresa Lynn, Malik H. Altakrori, Samar Mohamed Magdy, Rocktim Jyoti Das, Chenyang Lyu, Mohamed Nasr, Younes Samih, Kirill Chirkunov, Alham Fikri Aji, Preslav Nakov, Shantanu Godbole, Salim Roukos, Radu Florian, Nizar Habash

    Abstract: The rapid evolution of Natural Language Processing (NLP) has favoured major languages such as English, leaving a significant gap for many others due to limited resources. This is especially evident in the context of data annotation, a task whose importance cannot be underestimated, but which is time-consuming and costly. Thus, any dataset for resource-poor languages is precious, in particular when… ▽ More

    Submitted 24 January, 2025; v1 submitted 26 April, 2024; originally announced April 2024.

    Comments: Paper 8 pages, Appendix 12 pages. Published at COLING2025

  44. arXiv:2404.14183  [pdf, other

    cs.CL

    SemEval-2024 Task 8: Multidomain, Multimodel and Multilingual Machine-Generated Text Detection

    Authors: Yuxia Wang, Jonibek Mansurov, Petar Ivanov, Jinyan Su, Artem Shelmanov, Akim Tsvigun, Osama Mohammed Afzal, Tarek Mahmoud, Giovanni Puccetti, Thomas Arnold, Chenxi Whitehouse, Alham Fikri Aji, Nizar Habash, Iryna Gurevych, Preslav Nakov

    Abstract: We present the results and the main findings of SemEval-2024 Task 8: Multigenerator, Multidomain, and Multilingual Machine-Generated Text Detection. The task featured three subtasks. Subtask A is a binary classification task determining whether a text is written by a human or generated by a machine. This subtask has two tracks: a monolingual track focused solely on English texts and a multilingual… ▽ More

    Submitted 22 April, 2024; originally announced April 2024.

    Comments: 23 pages, 12 tables

    Journal ref: Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

  45. arXiv:2404.12897  [pdf, other

    cs.CL

    Enabling Natural Zero-Shot Prompting on Encoder Models via Statement-Tuning

    Authors: Ahmed Elshabrawy, Yongxin Huang, Iryna Gurevych, Alham Fikri Aji

    Abstract: While Large Language Models (LLMs) exhibit remarkable capabilities in zero-shot and few-shot scenarios, they often require computationally prohibitive sizes. Conversely, smaller Masked Language Models (MLMs) like BERT and RoBERTa achieve state-of-the-art results through fine-tuning but struggle with extending to few-shot and zero-shot settings due to their architectural constraints. Hence, we prop… ▽ More

    Submitted 17 October, 2024; v1 submitted 19 April, 2024; originally announced April 2024.

  46. arXiv:2404.06138  [pdf, other

    cs.CL

    Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages

    Authors: Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Rifki Afina Putri, Emmanuel Dave, Jhonson Lee, Nuur Shadieq, Wawan Cenggoro, Salsabil Maulana Akbar, Muhammad Ihza Mahendra, Dea Annisayanti Putri, Bryan Wilie, Genta Indra Winata, Alham Fikri Aji, Ayu Purwarianti, Pascale Fung

    Abstract: Large language models (LLMs) show remarkable human-like capability in various domains and languages. However, a notable quality gap arises in low-resource languages, e.g., Indonesian indigenous languages, rendering them ineffective and inefficient in such linguistic contexts. To bridge this quality gap, we introduce Cendol, a collection of Indonesian LLMs encompassing both decoder-only and encoder… ▽ More

    Submitted 7 July, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

    Comments: Cendol models are released under Apache 2.0 license and will be made publicly available soon

  47. arXiv:2403.18933  [pdf, other

    cs.CL

    SemEval-2024 Task 1: Semantic Textual Relatedness for African and Asian Languages

    Authors: Nedjma Ousidhoum, Shamsuddeen Hassan Muhammad, Mohamed Abdalla, Idris Abdulmumin, Ibrahim Said Ahmad, Sanchit Ahuja, Alham Fikri Aji, Vladimir Araujo, Meriem Beloucif, Christine De Kock, Oumaima Hourrane, Manish Shrivastava, Thamar Solorio, Nirmal Surange, Krishnapriya Vishnubhotla, Seid Muhie Yimam, Saif M. Mohammad

    Abstract: We present the first shared task on Semantic Textual Relatedness (STR). While earlier shared tasks primarily focused on semantic similarity, we instead investigate the broader phenomenon of semantic relatedness across 14 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Punjabi, Spanish, and Telugu. The… ▽ More

    Submitted 17 April, 2024; v1 submitted 27 March, 2024; originally announced March 2024.

    Comments: SemEval 2024 Task Description Paper. arXiv admin note: text overlap with arXiv:2402.08638

  48. arXiv:2403.15412  [pdf, other

    cs.CY cs.AI cs.CL

    Towards Measuring and Modeling "Culture" in LLMs: A Survey

    Authors: Muhammad Farid Adilazuarda, Sagnik Mukherjee, Pradhyumna Lavania, Siddhant Singh, Alham Fikri Aji, Jacki O'Neill, Ashutosh Modi, Monojit Choudhury

    Abstract: We present a survey of more than 90 recent papers that aim to study cultural representation and inclusion in large language models (LLMs). We observe that none of the studies explicitly define "culture, which is a complex, multifaceted concept; instead, they probe the models on some specially designed datasets which represent certain aspects of "culture". We call these aspects the proxies of cultu… ▽ More

    Submitted 4 September, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

  49. arXiv:2402.14523  [pdf, other

    cs.CL cs.SD eess.AS

    Daisy-TTS: Simulating Wider Spectrum of Emotions via Prosody Embedding Decomposition

    Authors: Rendi Chevi, Alham Fikri Aji

    Abstract: We often verbally express emotions in a multifaceted manner, they may vary in their intensities and may be expressed not just as a single but as a mixture of emotions. This wide spectrum of emotions is well-studied in the structural model of emotions, which represents variety of emotions as derivative products of primary emotions with varying degrees of intensity. In this paper, we propose an emot… ▽ More

    Submitted 27 June, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

    Comments: Project Page: https://rendchevi.github.io/daisy-tts; Updates: (1) Fixed typos, missing references, and layout, (2) Revise explanation on emotion classifier or discriminator

  50. arXiv:2402.13887  [pdf, other

    cs.CL

    Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models

    Authors: Chenyang Lyu, Minghao Wu, Alham Fikri Aji

    Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various applications, fundamentally reshaping the landscape of natural language processing (NLP) research. However, recent evaluation frameworks often rely on the output probabilities of LLMs for predictions, primarily due to computational constraints, diverging from real-world LLM usage scenarios. While widely employed,… ▽ More

    Submitted 9 July, 2024; v1 submitted 21 February, 2024; originally announced February 2024.

    Comments: Accepted to KnowledgeableLMs @ ACL 2024