Showing 1–4 of 4 results for author: Phatthiyaphaibun, W

  1. arXiv:2502.12982  [pdf, other]

    cs.CL cs.AI cs.LG

    Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs

    Authors: Longxu Dou, Qian Liu, Fan Zhou, Changyu Chen, Zili Wang, Ziqi Jin, Zichen Liu, Tongyao Zhu, Cunxiao Du, Penghui Yang, Haonan Wang, Jiaheng Liu, Yongchi Zhao, Xiachong Feng, Xin Mao, Man Tsung Yeung, Kunat Pipatanakul, Fajri Koto, Min Si Thu, Hynek Kydlíček, Zeyi Liu, Qunshu Lin, Sittipong Sripaisarnmongkol, Kridtaphad Sae-Khow, Nirattisai Thongchim, et al. (16 additional authors not shown)

    Abstract: Sailor2 is a family of cutting-edge multilingual language models for South-East Asian (SEA) languages, available in 1B, 8B, and 20B sizes to suit diverse applications. Building on Qwen2.5, Sailor2 undergoes continuous pre-training on 500B tokens (400B SEA-specific and 100B replay tokens) to support 13 SEA languages while retaining proficiency in Chinese and English. Sailor2-20B model achieves a 50…

    Submitted 18 February, 2025; originally announced February 2025.

    Comments: 49 pages, 16 figures. Technical Report of Sailor2: https://sea-sailor.github.io/blog/sailor2/

  2. arXiv:2403.16127  [pdf, other]

    cs.CL cs.AI

    WangchanLion and WangchanX MRC Eval

    Authors: Wannaphong Phatthiyaphaibun, Surapon Nonesung, Patomporn Payoungkhamdee, Peerat Limkonchotiwat, Can Udomcharoenchaikit, Jitkapat Sawatphol, Chompakorn Chaksangchaichot, Ekapol Chuangsuwanich, Sarana Nutanong

    Abstract: This technical report describes the development of WangchanLion, an instruction fine-tuned model focusing on Machine Reading Comprehension (MRC) in the Thai language. Our model is based on SEA-LION and a collection of instruction following datasets. To promote open research and reproducibility, we publicly release all training data, code, and the final model weights under the Apache-2 license. To…

    Submitted 23 April, 2024; v1 submitted 24 March, 2024; originally announced March 2024.

  3. PyThaiNLP: Thai Natural Language Processing in Python

    Authors: Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, Pattarawat Chormai, Peerat Limkonchotiwat, Thanathip Suntorntip, Can Udomcharoenchaikit

    Abstract: We present PyThaiNLP, a free and open-source natural language processing (NLP) library for Thai language implemented in Python. It provides a wide range of software, models, and datasets for Thai language. We first provide a brief historical context of tools for Thai language prior to the development of PyThaiNLP. We then outline the functionalities it provided as well as datasets and pre-trained…

    Submitted 7 December, 2023; originally announced December 2023.

    Comments: 12 pages, 2 figures, LaTeX; typos corrected, timeline clarified for section 2. In Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), pages 25-36, Singapore, Singapore. Empirical Methods in Natural Language Processing

    ACM Class: I.2.7

  4. arXiv:2208.04799  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Thai Wav2Vec2.0 with CommonVoice V8

    Authors: Wannaphong Phatthiyaphaibun, Chompakorn Chaksangchaichot, Peerat Limkonchotiwat, Ekapol Chuangsuwanich, Sarana Nutanong

    Abstract: Recently, Automatic Speech Recognition (ASR), a system that converts audio into text, has caught a lot of attention in the machine learning community. Thus, a lot of publicly available models were released in HuggingFace. However, most of these ASR models are available in English; only a minority of the models are available in Thai. Additionally, most of the Thai ASR models are closed-sourced, and…

    Submitted 9 August, 2022; originally announced August 2022.