
Prior Prompt Engineering for Reinforcement Fine-Tuning

Authors: Pittawat Taveekitworachai, Potsawee Manakul, Sarana Nutanong, Kunat Pipatanakul

Abstract: This paper investigates prior prompt engineering (pPE) in the context of reinforcement fine-tuning (RFT), where language models (LMs) are incentivized to exhibit behaviors that maximize performance through reward signals. While existing RFT research has primarily focused on algorithms, reward shaping, and data curation, the design of the prior prompt--the instructions prepended to queries during training to elicit behaviors such as step-by-step reasoning--remains underexplored. We investigate whether different pPE approaches can guide LMs to internalize distinct behaviors after RFT. Inspired by inference-time prompt engineering (iPE), we translate five representative iPE strategies--reasoning, planning, code-based reasoning, knowledge recall, and null-example utilization--into corresponding pPE approaches. We experiment with Qwen2.5-7B using each of the pPE approaches, then evaluate performance on in-domain and out-of-domain benchmarks (e.g., AIME2024, HumanEval+, and GPQA-Diamond). Our results show that all pPE-trained models surpass their iPE-prompted counterparts, with the null-example pPE approach achieving the largest average performance gain and the highest improvement on AIME2024 and GPQA-Diamond, surpassing the commonly used reasoning approach. Furthermore, by adapting a behavior-classification framework, we demonstrate that different pPE strategies instill distinct behavioral styles in the resulting models. These findings position pPE as a powerful yet understudied axis for RFT.

Submitted 20 May, 2025; originally announced May 2025.

Comments: 25 pages, 42 figures

arXiv:2504.05898

Assessing Thai Dialect Performance in LLMs with Automatic Benchmarks and Human Evaluation

Authors: Peerat Limkonchotiwat, Kanruethai Masuk, Surapon Nonesung, Chalermpun Mai-On, Sarana Nutanong, Wuttikorn Ponwitayarat, Potsawee Manakul

Abstract: Large language models show promising results in various NLP tasks. Despite these successes, the robustness and consistency of LLMs in underrepresented languages remain largely unexplored, especially concerning local dialects. Existing benchmarks also focus on main dialects, neglecting LLMs' ability on local dialect texts. In this paper, we introduce a Thai local dialect benchmark covering Northern (Lanna), Northeastern (Isan), and Southern (Dambro) Thai, evaluating LLMs on five NLP tasks: summarization, question answering, translation, conversation, and food-related tasks. Furthermore, we propose a human evaluation guideline and metric for Thai local dialects to assess generation fluency and dialect-specific accuracy. Results show that LLM performance declines significantly in local Thai dialects compared to standard Thai, with only proprietary models like GPT-4o and Gemini2 demonstrating some fluency.

Submitted 8 April, 2025; originally announced April 2025.

Comments: Datasets and codes are available at https://github.com/mrpeerat/Thai_local_benchmark

arXiv:2502.17956

Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments

Authors: Patomporn Payoungkhamdee, Pume Tuchinda, Jinheon Baek, Samuel Cahyawijaya, Can Udomcharoenchaikit, Potsawee Manakul, Peerat Limkonchotiwat, Ekapol Chuangsuwanich, Sarana Nutanong

Abstract: Multi-step reasoning is essential for large language models (LLMs), yet multilingual performance remains challenging. While Chain-of-Thought (CoT) prompting improves reasoning, it struggles with non-English languages due to the entanglement of reasoning and execution. Program-of-Thought (PoT) prompting separates reasoning from execution, offering a promising alternative but shifting the challenge to generating programs from non-English questions. We propose a framework to evaluate PoT by separating multilingual reasoning from code execution to examine (i) the impact of fine-tuning on question-reasoning alignment and (ii) how reasoning quality affects answer correctness. Our findings demonstrate that PoT fine-tuning substantially enhances multilingual reasoning, outperforming CoT fine-tuned models. We further demonstrate a strong correlation between reasoning quality (measured through code quality) and answer accuracy, highlighting its potential as a test-time performance improvement heuristic.

Submitted 22 May, 2025; v1 submitted 25 February, 2025; originally announced February 2025.

arXiv:2502.10868

NitiBench: A Comprehensive Study of LLM Framework Capabilities for Thai Legal Question Answering

Authors: Pawitsapak Akarajaradwong, Pirat Pothavorn, Chompakorn Chaksangchaichot, Panuthep Tasawong, Thitiwat Nopparatbundit, Sarana Nutanong

Abstract: The application of large language models (LLMs) in the legal domain holds significant potential for information retrieval and question answering, yet Thai legal QA systems face challenges due to a lack of standardized evaluation benchmarks and the complexity of Thai legal structures. This paper introduces NitiBench, a benchmark comprising two datasets: the NitiBench-CCL, covering general Thai financial law, and the NitiBench-Tax, which includes real-world tax law cases requiring advanced legal reasoning. We evaluate retrieval-augmented generation (RAG) and long-context LLM-based approaches to address three key research questions: the impact of domain-specific components like section-based chunking and cross-referencing, the comparative performance of different retrievers and LLMs, and the viability of long-context LLMs as an alternative to RAG. Our results show that section-based chunking significantly improves retrieval and end-to-end performance, current retrievers struggle with complex queries, and long-context LLMs still underperform RAG-based systems in Thai legal QA. To support fair evaluation, we propose tailored multi-label retrieval metrics and an LLM-as-judge method for coverage and contradiction detection. These findings highlight the limitations of current Thai legal NLP solutions and provide a foundation for future research in the field. We also make our code and dataset publicly available.

Submitted 8 March, 2025; v1 submitted 15 February, 2025; originally announced February 2025.

arXiv:2407.19164

Addressing Topic Leakage in Cross-Topic Evaluation for Authorship Verification

Authors: Jitkapat Sawatphol, Can Udomcharoenchaikit, Sarana Nutanong

Abstract: Authorship verification (AV) aims to identify whether a pair of texts has the same author. We address the challenge of evaluating AV models' robustness against topic shifts. The conventional evaluation assumes minimal topic overlap between training and test data. However, we argue that there can still be topic leakage in test data, causing misleading model performance and unstable rankings. To address this, we propose an evaluation method called Heterogeneity-Informed Topic Sampling (HITS), which creates a smaller dataset with a heterogeneously distributed topic set. Our experimental results demonstrate that HITS-sampled datasets yield a more stable ranking of models across random seeds and evaluation splits. Our contributions include: 1. An analysis of causes and effects of topic leakage. 2. A demonstration of the HITS in reducing the effects of topic leakage, and 3. The Robust Authorship Verification bENchmark (RAVEN) that allows topic shortcut test to uncover AV models' reliance on topic-specific features.

Submitted 27 July, 2024; originally announced July 2024.

Comments: Accepted for publication in Transactions of the Association for Computational Linguistics

arXiv:2406.03125

Space Decomposition for Sentence Embedding

Authors: Wuttikorn Ponwitayarat, Peerat Limkonchotiwat, Ekapol Chuangsuwanich, Sarana Nutanong

Abstract: Determining sentence pair similarity is crucial for various NLP tasks. A common technique to address this is typically evaluated on a continuous semantic textual similarity scale from 0 to 5. However, based on a linguistic observation in STS annotation guidelines, we found that the score in the range [4,5] indicates an upper-range sample, while the rest are lower-range samples. This necessitates a new approach to treating the upper-range and lower-range classes separately. In this paper, we introduce a novel embedding space decomposition method called MixSP utilizing a Mixture of Specialized Projectors, designed to distinguish and rank upper-range and lower-range samples accurately. The experimental results demonstrate that MixSP decreased the overlap representation between upper-range and lower-range classes significantly while outperforming competitors on STS and zero-shot benchmarks.

Submitted 5 June, 2024; originally announced June 2024.

Comments: ACL Findings 2024. The code and pre-trained models are available at https://github.com/KornWtp/MixSP

arXiv:2403.16127

WangchanLion and WangchanX MRC Eval

Authors: Wannaphong Phatthiyaphaibun, Surapon Nonesung, Patomporn Payoungkhamdee, Peerat Limkonchotiwat, Can Udomcharoenchaikit, Jitkapat Sawatphol, Chompakorn Chaksangchaichot, Ekapol Chuangsuwanich, Sarana Nutanong

Abstract: This technical report describes the development of WangchanLion, an instruction fine-tuned model focusing on Machine Reading Comprehension (MRC) in the Thai language. Our model is based on SEA-LION and a collection of instruction following datasets. To promote open research and reproducibility, we publicly release all training data, code, and the final model weights under the Apache-2 license. To assess the contextual understanding capability, we conducted extensive experimental studies using two Thai MRC datasets, XQuAD and Iapp_wiki_qa_squad. Experimental results demonstrate the model's ability to comprehend the context and produce an answer faithful to the reference one in 0-shot and 1-shot settings. In addition, our evaluation goes beyond the traditional MRC. We propose a new evaluation scheme assessing the answer's correctness, helpfulness, conciseness, and contextuality. Our code is available publicly at https://github.com/vistec-AI/WangchanLion.

Submitted 23 April, 2024; v1 submitted 24 March, 2024; originally announced March 2024.

arXiv:2311.03228

An Efficient Self-Supervised Cross-View Training For Sentence Embedding

Authors: Peerat Limkonchotiwat, Wuttikorn Ponwitayarat, Lalita Lowphansirikul, Can Udomcharoenchaikit, Ekapol Chuangsuwanich, Sarana Nutanong

Abstract: Self-supervised sentence representation learning is the task of constructing an embedding space for sentences without relying on human annotation efforts. One straightforward approach is to finetune a pretrained language model (PLM) with a representation learning method such as contrastive learning. While this approach achieves impressive performance on larger PLMs, the performance rapidly degrades as the number of parameters decreases. In this paper, we propose a framework called Self-supervised Cross-View Training (SCT) to narrow the performance gap between large and small PLMs. To evaluate the effectiveness of SCT, we compare it to 5 baseline and state-of-the-art competitors on seven Semantic Textual Similarity (STS) benchmarks using 5 PLMs with the number of parameters ranging from 4M to 340M. The experimental results show that SCT outperforms the competitors for PLMs with less than 100M parameters in 18 of 21 cases.

Submitted 6 November, 2023; originally announced November 2023.

Comments: Accepted to TACL. The code and pre-trained models are available at https://github.com/mrpeerat/SCT

arXiv:2306.10348

Typo-Robust Representation Learning for Dense Retrieval

Authors: Panuthep Tasawong, Wuttikorn Ponwitayarat, Peerat Limkonchotiwat, Can Udomcharoenchaikit, Ekapol Chuangsuwanich, Sarana Nutanong

Abstract: Dense retrieval is a basic building block of information retrieval applications. One of the main challenges of dense retrieval in real-world settings is the handling of queries containing misspelled words. A popular approach for handling misspelled queries is minimizing the representations discrepancy between misspelled queries and their pristine ones. Unlike the existing approaches, which only focus on the alignment between misspelled and pristine queries, our method also improves the contrast between each misspelled query and its surrounding queries. To assess the effectiveness of our proposed method, we compare it against the existing competitors using two benchmark datasets and two base encoders. Our method outperforms the competitors in all cases with misspelled queries. Our code and models are available at https://github.com/panuthept/DST-DenseRetrieval.

Submitted 17 June, 2023; originally announced June 2023.

Comments: 5 pages, 2 figures

ACM Class: I.2.7

arXiv:2208.04799

Thai Wav2Vec2.0 with CommonVoice V8

Authors: Wannaphong Phatthiyaphaibun, Chompakorn Chaksangchaichot, Peerat Limkonchotiwat, Ekapol Chuangsuwanich, Sarana Nutanong

Abstract: Recently, Automatic Speech Recognition (ASR), a system that converts audio into text, has caught a lot of attention in the machine learning community. Thus, a lot of publicly available models were released in HuggingFace. However, most of these ASR models are available in English; only a minority of the models are available in Thai. Additionally, most of the Thai ASR models are closed-sourced, and the performance of existing open-sourced models lacks robustness. To address this problem, we train a new ASR model on a pre-trained XLSR-Wav2Vec model with the Thai CommonVoice corpus V8 and train a trigram language model to boost the performance of our ASR model. We hope that our models will be beneficial to individuals and the ASR community in Thailand.

Submitted 9 August, 2022; originally announced August 2022.

arXiv:2101.09635

WangchanBERTa: Pretraining transformer-based Thai Language Models

Authors: Lalita Lowphansirikul, Charin Polpanumas, Nawat Jantrakulchai, Sarana Nutanong

Abstract: Transformer-based language models, more specifically BERT-based architectures, have achieved state-of-the-art performance in many downstream tasks. However, for a relatively low-resource language such as Thai, the choices of models are limited to training a BERT-based model based on a much smaller dataset or finetuning multi-lingual models, both of which yield suboptimal downstream performance. Moreover, large-scale multi-lingual pretraining does not take into account language-specific features for Thai. To overcome these limitations, we pretrain a language model based on RoBERTa-base architecture on a large, deduplicated, cleaned training set (78GB in total size), curated from diverse domains of social media posts, news articles and other publicly available datasets. We apply text processing rules that are specific to Thai, most importantly preserving spaces, which are important chunk and sentence boundaries in Thai, before subword tokenization. We also experiment with word-level, syllable-level and SentencePiece tokenization with a smaller dataset to explore the effects of tokenization on downstream performance. Our model wangchanberta-base-att-spm-uncased trained on the 78.5GB dataset outperforms strong baselines (NBSVM, CRF and ULMFit) and multi-lingual models (XLMR and mBERT) on both sequence classification and token classification tasks in human-annotated, mono-lingual contexts.

Submitted 20 March, 2021; v1 submitted 23 January, 2021; originally announced January 2021.

Comments: 24 pages, edited the citation of the syllable-level tokenizer from [Chormai et al., 2020] to [Phatthiyaphaibun et al., 2020] as the authors used the syllable-level tokenizer from PyThaiNLP [Phatthiyaphaibun et al., 2020] in the experiments

arXiv:2007.03541

doi 10.1007/s10579-021-09536-6

scb-mt-en-th-2020: A Large English-Thai Parallel Corpus

Authors: Lalita Lowphansirikul, Charin Polpanumas, Attapol T. Rutherford, Sarana Nutanong

Abstract: The primary objective of our work is to build a large-scale English-Thai dataset for machine translation. We construct an English-Thai machine translation dataset with over 1 million segment pairs, curated from various sources, namely news, Wikipedia articles, SMS messages, task-based dialogs, web-crawled data and government documents. The methodology for gathering data, building parallel texts and removing noisy sentence pairs is presented in a reproducible manner. We train machine translation models based on this dataset. Our models perform comparably to the Google Translation API (as of May 2020) for Thai-English and outperform Google when the Open Parallel Corpus (OPUS) is included in the training data for both Thai-English and English-Thai translation. The dataset, pre-trained models, and source code to reproduce our work are available for public use.

Submitted 7 July, 2020; originally announced July 2020.

Comments: 35 pages, 4 figures

arXiv:2002.03118

Shipper Cooperation in Stochastic Drone Delivery: A Dynamic Bayesian Game Approach

Authors: Suttinee Sawadsitang, Dusit Niyato, Tan Puay Siew, Ping Wang, Sarana Nutanong

Abstract: With recent technological innovation, unmanned aerial vehicles, known as drones, have found numerous applications including package and parcel delivery for shippers. Drone delivery offers benefits over conventional ground-based vehicle delivery in terms of higher speed, lower cost, greater environmental friendliness, and lower manpower requirements. However, most existing studies on drone delivery planning and scheduling focus on a single shipper and ignore uncertainty factors. As such, in this paper, we consider a scenario in which multiple shippers can cooperate to minimize their drone delivery cost. We propose the Bayesian Shipper Cooperation in Stochastic Drone Delivery (BCoSDD) framework. The framework is composed of three functions, i.e., package assignment, shipper cooperation formation and cost management. The uncertainties of drone breakdown and misbehavior of cooperative shippers are taken into account by using multistage stochastic programming optimization and dynamic Bayesian coalition formation game. We conduct extensive performance evaluation of the BCoSDD framework by using customer locations from the Solomon benchmark suite and a real Singapore logistics industry. As a result, the framework can help the shippers plan and schedule their drone delivery effectively.

Submitted 8 February, 2020; originally announced February 2020.

Comments: 15 Pages, 10 figures, 2 tables. This paper is still under review

arXiv:1908.07406

Multi-Objective Optimization for Drone Delivery

Authors: Suttinee Sawadsitang, Dusit Niyato, Puay Siew Tan, Sarana Nutanong

Abstract: Recently, the unmanned aerial vehicle (UAV), also known as a drone, has become an alternative means of package delivery. Although drone delivery scheduling has been studied in recent years, most existing models are formulated as a single-objective optimization problem. However, in practice, drone delivery scheduling has multiple objectives that the shipper has to achieve. Moreover, drone delivery typically faces unexpected events, e.g., breakdown or inability to take off, that can significantly affect the scheduling problem. Therefore, in this paper, we propose a multi-objective and three-stage stochastic optimization model for drone delivery scheduling, called the multi-objective optimization for drone delivery (MODD) system. To handle the multi-objective optimization in the MODD system, we apply the $\varepsilon$-constraint method. The performance evaluation is performed using a real dataset from Singapore delivery services.

Submitted 24 July, 2019; originally announced August 2019.

Comments: 5 pages, 4 figures

Journal ref: 2019 IEEE 90th Vehicular Technology Conference: VTC2019-Fall

arXiv:cs/0402018

P2P Networks for Content Sharing

Authors: Choon Hoong Ding, Sarana Nutanong, Rajkumar Buyya

Abstract: Peer-to-peer (P2P) technologies have been widely used for content sharing, popularly called "file-swapping" networks. This chapter gives a broad overview of content sharing P2P technologies. It starts with the fundamental concept of P2P computing followed by the analysis of network topologies used in peer-to-peer systems. Next, three milestone peer-to-peer technologies, Napster, Gnutella, and Fasttrack, are explored in detail, and the chapter concludes with a comparison table in the last section.

Submitted 10 February, 2004; originally announced February 2004.

Comments: 35 pages, 26 figures

Report number: GRIDS-TR-2003-7

ACM Class: C.2.4

Journal ref: Technical Report, GRIDS-TR-2003-7, Grid Computing and Distributed Systems Laboratory, University of Melbourne, Australia, December 2003

Showing 1–15 of 15 results for author: Nutanong, S