Showing 1–4 of 4 results for author: Sawatphol, J

  1. arXiv:2410.17145

    cs.CL cs.AI cs.LG

    Can General-Purpose Large Language Models Generalize to English-Thai Machine Translation?

    Authors: Jirat Chiaranaipanich, Naiyarat Hanmatheekuna, Jitkapat Sawatphol, Krittamate Tiankanon, Jiramet Kinchagawat, Amrest Chinkamol, Parinthapat Pengpun, Piyalitt Ittichaiwong, Peerat Limkonchotiwat

    Abstract: Large language models (LLMs) perform well on common tasks but struggle with generalization in low-resource and low-computation settings. We examine this limitation by testing various LLMs and specialized translation models on English-Thai machine translation and code-switching datasets. Our findings reveal that under more strict computational constraints, such as 4-bit quantization, LLMs fail to t…

    Submitted 22 October, 2024; originally announced October 2024.

    Comments: Accepted in GenBench EMNLP 2024

  2. arXiv:2407.19164

    cs.CL

    Addressing Topic Leakage in Cross-Topic Evaluation for Authorship Verification

    Authors: Jitkapat Sawatphol, Can Udomcharoenchaikit, Sarana Nutanong

    Abstract: Authorship verification (AV) aims to identify whether a pair of texts has the same author. We address the challenge of evaluating AV models' robustness against topic shifts. The conventional evaluation assumes minimal topic overlap between training and test data. However, we argue that there can still be topic leakage in test data, causing misleading model performance and unstable rankings. To add…

    Submitted 27 July, 2024; originally announced July 2024.

    Comments: Accepted for publication in Transactions of the Association for Computational Linguistics

  3. arXiv:2406.06000

    cs.CL

    ThaiCoref: Thai Coreference Resolution Dataset

    Authors: Pontakorn Trakuekul, Wei Qi Leong, Charin Polpanumas, Jitkapat Sawatphol, William Chandra Tjhi, Attapol T. Rutherford

    Abstract: While coreference resolution is a well-established research area in Natural Language Processing (NLP), research focusing on Thai language remains limited due to the lack of large annotated corpora. In this work, we introduce ThaiCoref, a dataset for Thai coreference resolution. Our dataset comprises 777,271 tokens, 44,082 mentions and 10,429 entities across four text genres: university essays, new…

    Submitted 9 June, 2024; originally announced June 2024.

  4. arXiv:2403.16127

    cs.CL cs.AI

    WangchanLion and WangchanX MRC Eval

    Authors: Wannaphong Phatthiyaphaibun, Surapon Nonesung, Patomporn Payoungkhamdee, Peerat Limkonchotiwat, Can Udomcharoenchaikit, Jitkapat Sawatphol, Chompakorn Chaksangchaichot, Ekapol Chuangsuwanich, Sarana Nutanong

    Abstract: This technical report describes the development of WangchanLion, an instruction fine-tuned model focusing on Machine Reading Comprehension (MRC) in the Thai language. Our model is based on SEA-LION and a collection of instruction following datasets. To promote open research and reproducibility, we publicly release all training data, code, and the final model weights under the Apache-2 license. To…

    Submitted 23 April, 2024; v1 submitted 24 March, 2024; originally announced March 2024.