-
UNCERTAINTY-LINE: Length-Invariant Estimation of Uncertainty for Large Language Models
Authors:
Roman Vashurin,
Maiya Goloburda,
Preslav Nakov,
Maxim Panov
Abstract:
Large Language Models (LLMs) have become indispensable tools across various applications, making it more important than ever to ensure the quality and the trustworthiness of their outputs. This has led to growing interest in uncertainty quantification (UQ) methods for assessing the reliability of LLM outputs. Many existing UQ techniques rely on token probabilities, which inadvertently introduces a bias with respect to the length of the output. While some methods attempt to account for this, we demonstrate that such biases persist even in length-normalized approaches. To address the problem, here we propose UNCERTAINTY-LINE (Length-INvariant Estimation), a simple debiasing procedure that regresses uncertainty scores on output length and uses the residuals as corrected, length-invariant estimates. Our method is post-hoc, model-agnostic, and applicable to a range of UQ measures. Through extensive evaluation on machine translation, summarization, and question-answering tasks, we demonstrate that UNCERTAINTY-LINE consistently improves uncertainty estimates over even nominally length-normalized UQ methods across multiple metrics and models.
Submitted 25 May, 2025;
originally announced May 2025.
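The debiasing idea in the UNCERTAINTY-LINE abstract (regress uncertainty scores on output length, keep the residuals) can be illustrated with a minimal sketch. This is not the authors' implementation: the linear form of the regression, the function names `fit_length_trend` and `debias`, and the toy numbers are all assumptions made for illustration only.

```python
from statistics import fmean


def fit_length_trend(lengths, scores):
    """Fit score ~ slope * length + intercept by ordinary least squares.

    Assumption: a simple linear trend; the abstract only says the scores
    are regressed on output length.
    """
    mean_len = fmean(lengths)
    mean_score = fmean(scores)
    var_len = sum((n - mean_len) ** 2 for n in lengths)
    if var_len == 0:
        # All outputs have the same length: there is no trend to remove.
        return 0.0, mean_score
    cov = sum((n - mean_len) * (u - mean_score) for n, u in zip(lengths, scores))
    slope = cov / var_len
    return slope, mean_score - slope * mean_len


def debias(lengths, scores, slope, intercept):
    """Return residuals: each score minus the value predicted from its length."""
    return [u - (slope * n + intercept) for n, u in zip(lengths, scores)]


# Hypothetical data: output lengths in tokens and raw uncertainty scores.
lengths = [5, 10, 20, 40]
scores = [1.0, 2.5, 4.0, 9.0]

slope, intercept = fit_length_trend(lengths, scores)
residuals = debias(lengths, scores, slope, intercept)
print(residuals)
```

By construction, least-squares residuals have zero linear correlation with length on the data used for the fit. In a post-hoc setting one would presumably fit the trend on held-out generations and then apply `debias` to new outputs; that split is also an assumption here.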
-
Llama-3.1-Sherkala-8B-Chat: An Open Large Language Model for Kazakh
Authors:
Fajri Koto,
Rituraj Joshi,
Nurdaulet Mukhituly,
Yuxia Wang,
Zhuohan Xie,
Rahul Pal,
Daniil Orel,
Parvez Mullah,
Diana Turmakhan,
Maiya Goloburda,
Mohammed Kamran,
Samujjwal Ghosh,
Bokang Jia,
Jonibek Mansurov,
Mukhammed Togmanov,
Debopriyo Banerjee,
Nurkhan Laiyk,
Akhmed Sakip,
Xudong Han,
Ekaterina Kochmar,
Alham Fikri Aji,
Aaryamonvikram Singh,
Alok Anil Jadhav,
Satheesh Katipomu,
Samta Kamboj
, et al. (10 additional authors not shown)
Abstract:
Llama-3.1-Sherkala-8B-Chat, or Sherkala-Chat (8B) for short, is a state-of-the-art instruction-tuned open generative large language model (LLM) designed for Kazakh. Sherkala-Chat (8B) aims to enhance the inclusivity of LLM advancements for Kazakh speakers. Adapted from the LLaMA-3.1-8B model, Sherkala-Chat (8B) is trained on 45.3B tokens across Kazakh, English, Russian, and Turkish. With 8 billion parameters, it demonstrates strong knowledge and reasoning abilities in Kazakh, significantly outperforming existing open Kazakh and multilingual models of similar scale while achieving competitive performance in English. We release Sherkala-Chat (8B) as an open-weight instruction-tuned model and provide a detailed overview of its training, fine-tuning, safety alignment, and evaluation, aiming to advance research and support diverse real-world applications.
Submitted 3 March, 2025;
originally announced March 2025.
-
Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh
Authors:
Nurkhan Laiyk,
Daniil Orel,
Rituraj Joshi,
Maiya Goloburda,
Yuxia Wang,
Preslav Nakov,
Fajri Koto
Abstract:
Instruction tuning in low-resource languages remains underexplored due to limited text data, particularly in government and cultural domains. To address this, we introduce and open-source a large-scale (10,600 samples) instruction-following (IFT) dataset, covering key institutional and cultural knowledge relevant to Kazakhstan. Our dataset enhances LLMs' understanding of procedural, legal, and structural governance topics. We employ LLM-assisted data generation, comparing open-weight and closed-weight models for dataset construction, and select GPT-4o as the backbone. Each entity of our dataset undergoes full manual verification to ensure high quality. We also show that fine-tuning Qwen, Falcon, and Gemma on our dataset leads to consistent performance improvements in both multiple-choice and generative tasks, demonstrating the potential of LLM-assisted instruction tuning for low-resource languages.
Submitted 19 February, 2025;
originally announced February 2025.
-
Qorgau: Evaluating LLM Safety in Kazakh-Russian Bilingual Contexts
Authors:
Maiya Goloburda,
Nurkhan Laiyk,
Diana Turmakhan,
Yuxia Wang,
Mukhammed Togmanov,
Jonibek Mansurov,
Askhat Sametov,
Nurdaulet Mukhituly,
Minghan Wang,
Daniil Orel,
Zain Muhammad Mujahid,
Fajri Koto,
Timothy Baldwin,
Preslav Nakov
Abstract:
Large language models (LLMs) are known to have the potential to generate harmful content, posing risks to users. While significant progress has been made in developing taxonomies for LLM risks and safety evaluation prompts, most studies have focused on monolingual contexts, primarily in English. However, language- and region-specific risks in bilingual contexts are often overlooked, and core findings can diverge from those in monolingual settings. In this paper, we introduce Qorgau, a novel dataset specifically designed for safety evaluation in Kazakh and Russian, reflecting the unique bilingual context in Kazakhstan, where both Kazakh (a low-resource language) and Russian (a high-resource language) are spoken. Experiments with both multilingual and language-specific LLMs reveal notable differences in safety performance, emphasizing the need for tailored, region-specific datasets to ensure the responsible and safe deployment of LLMs in countries like Kazakhstan. Warning: this paper contains example data that may be offensive, harmful, or biased.
Submitted 19 February, 2025;
originally announced February 2025.
-
KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan
Authors:
Mukhammed Togmanov,
Nurdaulet Mukhituly,
Diana Turmakhan,
Jonibek Mansurov,
Maiya Goloburda,
Akhmed Sakip,
Zhuohan Xie,
Yuxia Wang,
Bekassyl Syzdykov,
Nurkhan Laiyk,
Alham Fikri Aji,
Ekaterina Kochmar,
Preslav Nakov,
Fajri Koto
Abstract:
Despite having a population of twenty million, Kazakhstan's culture and language remain underrepresented in the field of natural language processing. Although large language models (LLMs) continue to advance worldwide, progress in the Kazakh language has been limited, as seen in the scarcity of dedicated models and benchmark evaluations. To address this gap, we introduce KazMMLU, the first MMLU-style dataset specifically designed for the Kazakh language. KazMMLU comprises 23,000 questions that cover various educational levels, including STEM, humanities, and social sciences, sourced from authentic educational materials and manually validated by native speakers and educators. The dataset includes 10,969 Kazakh questions and 12,031 Russian questions, reflecting Kazakhstan's bilingual education system and rich local context. Our evaluation of several state-of-the-art multilingual models (Llama-3.1, Qwen-2.5, GPT-4, and DeepSeek V3) demonstrates substantial room for improvement, as even the best-performing models struggle to achieve competitive performance in Kazakh and Russian. These findings underscore significant performance gaps compared to high-resource languages. We hope that our dataset will enable further research and development of Kazakh-centric LLMs. Data and code will be made available upon acceptance.
Submitted 18 February, 2025;
originally announced February 2025.
-
Is Human-Like Text Liked by Humans? Multilingual Human Detection and Preference Against AI
Authors:
Yuxia Wang,
Rui Xing,
Jonibek Mansurov,
Giovanni Puccetti,
Zhuohan Xie,
Minh Ngoc Ta,
Jiahui Geng,
Jinyan Su,
Mervat Abassy,
Saad El Dine Ahmed,
Kareem Elozeiri,
Nurkhan Laiyk,
Maiya Goloburda,
Tarek Mahmoud,
Raj Vardhan Tomar,
Alexander Aziz,
Ryuto Koike,
Masahiro Kaneko,
Artem Shelmanov,
Ekaterina Artemova,
Vladislav Mikhailov,
Akim Tsvigun,
Alham Fikri Aji,
Nizar Habash,
Iryna Gurevych
, et al. (1 additional author not shown)
Abstract:
Prior studies have shown that distinguishing text generated by large language models (LLMs) from human-written text is highly challenging, and often no better than random guessing. To verify the generalizability of this finding across languages and domains, we perform an extensive case study to identify the upper bound of human detection accuracy. Across 16 datasets covering 9 languages and 9 domains, 19 annotators achieved an average detection accuracy of 87.6%, thus challenging previous conclusions. We find that major gaps between human and machine text lie in concreteness, cultural nuances, and diversity. Explicitly explaining these distinctions in the prompt can partially bridge the gaps in over 50% of the cases. However, we also find that humans do not always prefer human-written text, particularly when they cannot clearly identify its source.
Submitted 23 May, 2025; v1 submitted 17 February, 2025;
originally announced February 2025.
-
Uncertainty Quantification for LLMs through Minimum Bayes Risk: Bridging Confidence and Consistency
Authors:
Roman Vashurin,
Maiya Goloburda,
Albina Ilina,
Aleksandr Rubashevskii,
Preslav Nakov,
Artem Shelmanov,
Maxim Panov
Abstract:
Uncertainty quantification (UQ) methods for Large Language Models (LLMs) encompass a variety of approaches, with two major types being particularly prominent: information-based, which focus on model confidence expressed as token probabilities, and consistency-based, which assess the semantic relationship between multiple outputs generated using repeated sampling. Several recent methods have combined these two approaches to boost UQ performance. However, they sometimes fail to outperform much simpler baseline methods. Our work discusses the fundamental approach to constructing uncertainty measures that directly links uncertainty with the minimum Bayes risks achieved by LLM decoding. Building on these findings, we propose a novel approach to integrating model confidence with output consistency, resulting in a family of efficient and robust UQ methods. Our investigation reveals distinctive characteristics of LLMs as probabilistic models, which help to explain why these UQ methods underperform in certain tasks. Based on these findings, we propose a new way of synthesizing model confidence and output consistency, leading to a family of efficient and robust UQ methods. We evaluate our approach across various tasks such as question answering, abstractive summarization, and machine translation, demonstrating sizable improvements over state-of-the-art UQ approaches.
Submitted 29 May, 2025; v1 submitted 7 February, 2025;
originally announced February 2025.
-
GenAI Content Detection Task 1: English and Multilingual Machine-Generated Text Detection: AI vs. Human
Authors:
Yuxia Wang,
Artem Shelmanov,
Jonibek Mansurov,
Akim Tsvigun,
Vladislav Mikhailov,
Rui Xing,
Zhuohan Xie,
Jiahui Geng,
Giovanni Puccetti,
Ekaterina Artemova,
Jinyan Su,
Minh Ngoc Ta,
Mervat Abassy,
Kareem Ashraf Elozeiri,
Saad El Dine Ahmed El Etter,
Maiya Goloburda,
Tarek Mahmoud,
Raj Vardhan Tomar,
Nurkhan Laiyk,
Osama Mohammed Afzal,
Ryuto Koike,
Masahiro Kaneko,
Alham Fikri Aji,
Nizar Habash,
Iryna Gurevych
, et al. (1 additional author not shown)
Abstract:
We present the GenAI Content Detection Task 1, a shared task on binary machine-generated text detection, conducted as part of the GenAI workshop at COLING 2025. The task consists of two subtasks: Monolingual (English) and Multilingual. The shared task attracted many participants: 36 teams made official submissions to the Monolingual subtask during the test phase, and 26 teams to the Multilingual subtask. We provide a comprehensive overview of the data, a summary of the results (including system rankings and performance scores), detailed descriptions of the participating systems, and an in-depth analysis of submissions. https://github.com/mbzuai-nlp/COLING-2025-Workshop-on-MGT-Detection-Task1
Submitted 22 February, 2025; v1 submitted 19 January, 2025;
originally announced January 2025.