Showing 1–50 of 65 results for author: Nguyen, N L

Searching in archive cs.
  1. arXiv:2504.19606  [pdf, other]

    cs.CL

    Coreference Resolution for Vietnamese Narrative Texts

    Authors: Hieu-Dai Tran, Duc-Vu Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Coreference resolution is a vital task in natural language processing (NLP) that involves identifying and linking different expressions in a text that refer to the same entity. This task is particularly challenging for Vietnamese, a low-resource language with limited annotated datasets. To address these challenges, we developed a comprehensive annotated dataset using narrative texts from VnExpress…

    Submitted 28 April, 2025; originally announced April 2025.

    Comments: Accepted at PACLIC 2024

  2. arXiv:2501.13992  [pdf, other]

    cs.LG cs.AI

    Dual-Branch HNSW Approach with Skip Bridges and LID-Driven Optimization

    Authors: Hy Nguyen, Nguyen Hung Nguyen, Nguyen Linh Bao Nguyen, Srikanth Thudumu, Hung Du, Rajesh Vasa, Kon Mouzakis

    Abstract: The Hierarchical Navigable Small World (HNSW) algorithm is widely used for approximate nearest neighbor (ANN) search, leveraging the principles of navigable small-world graphs. However, it faces some limitations. The first is the local optima problem, which arises from the algorithm's greedy search strategy, selecting neighbors based solely on proximity at each step. This often leads to cluster di…

    Submitted 25 April, 2025; v1 submitted 23 January, 2025; originally announced January 2025.

  3. An Attempt to Develop a Neural Parser based on Simplified Head-Driven Phrase Structure Grammar on Vietnamese

    Authors: Duc-Vu Nguyen, Thang Chau Phan, Quoc-Nam Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: In this paper, we aimed to develop a neural parser for Vietnamese based on simplified Head-Driven Phrase Structure Grammar (HPSG). The existing corpora, VietTreebank and VnDT, had around 15% of constituency and dependency tree pairs that did not adhere to simplified HPSG rules. To attempt to address the issue of the corpora not adhering to simplified HPSG rules, we randomly permuted samples from t…

    Submitted 28 April, 2025; v1 submitted 26 November, 2024; originally announced November 2024.

    Comments: Accepted at SoICT 2024

  4. arXiv:2411.13407  [pdf, other]

    cs.CL

    Transformer-Based Contextualized Language Models Joint with Neural Networks for Natural Language Inference in Vietnamese

    Authors: Dat Van-Thanh Nguyen, Tin Van Huynh, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Natural Language Inference (NLI) is a task within Natural Language Processing (NLP) that holds value for various AI applications. However, there have been limited studies on Natural Language Inference in Vietnamese that explore the concept of joint models. Therefore, we conducted experiments using various combinations of contextualized language models (CLM) and neural networks. We use CLM to creat…

    Submitted 20 November, 2024; v1 submitted 20 November, 2024; originally announced November 2024.

  5. arXiv:2410.14132  [pdf, other]

    cs.CV cs.CL

    ViConsFormer: Constituting Meaningful Phrases of Scene Texts using Transformer-based Method in Vietnamese Text-based Visual Question Answering

    Authors: Nghia Hieu Nguyen, Tho Thanh Quan, Ngan Luu-Thuy Nguyen

    Abstract: Text-based VQA is a challenging task that requires machines to use scene texts in given images to yield the most appropriate answer for the given question. The main challenge of text-based VQA is exploiting the meaning and information from scene texts. Recent studies tackled this challenge by considering the spatial information of scene texts in images via embedding 2D coordinates of their boundin…

    Submitted 23 October, 2024; v1 submitted 17 October, 2024; originally announced October 2024.

    Comments: PACLIC 2024

  6. arXiv:2406.17716  [pdf, other]

    cs.CL

    ViANLI: Adversarial Natural Language Inference for Vietnamese

    Authors: Tin Van Huynh, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: The development of Natural Language Inference (NLI) datasets and models has been inspired by innovations in annotation design. With the rapid development of machine learning models today, the performance of existing machine learning models has quickly reached state-of-the-art results on a variety of tasks related to natural language processing, including natural language inference tasks. By using…

    Submitted 1 July, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

  7. arXiv:2404.18397  [pdf, other]

    cs.CV

    ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question Answering by Understanding Vietnamese Text in Images

    Authors: Huy Quang Pham, Thang Kien-Bao Nguyen, Quan Van Nguyen, Dan Quang Tran, Nghia Hieu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Optical Character Recognition - Visual Question Answering (OCR-VQA) is the task of answering text information contained in images that have just been significantly developed in the English language in recent years. However, there are limited studies of this task in low-resource languages such as Vietnamese. To this end, we introduce a novel dataset, ViOCRVQA (Vietnamese Optical Character Recogniti…

    Submitted 28 April, 2024; originally announced April 2024.

  8. arXiv:2404.10652  [pdf, other]

    cs.CL

    ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images

    Authors: Quan Van Nguyen, Dan Quang Tran, Huy Quang Pham, Thang Kien-Bao Nguyen, Nghia Hieu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Visual Question Answering (VQA) is a complicated task that requires the capability of simultaneously processing natural language and images. This task was initially researched with a focus on developing methods to help machines understand objects and scene contexts in images. However, some scene text that carries explicit information about the full content of the image is not mentioned. Along wit…

    Submitted 16 May, 2025; v1 submitted 16 April, 2024; originally announced April 2024.

  9. arXiv:2403.15882  [pdf, other]

    cs.CL

    VLUE: A New Benchmark and Multi-task Knowledge Transfer Learning for Vietnamese Natural Language Understanding

    Authors: Phong Nguyen-Thuan Do, Son Quoc Tran, Phu Gia Hoang, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: The success of Natural Language Understanding (NLU) benchmarks in various languages, such as GLUE for English, CLUE for Chinese, KLUE for Korean, and IndoNLU for Indonesian, has facilitated the evaluation of new NLU models across a wide range of tasks. To establish a standardized set of benchmarks for Vietnamese NLU, we introduce the first Vietnamese Language Understanding Evaluation (VLUE) benchm…

    Submitted 23 March, 2024; originally announced March 2024.

    Comments: Accepted at NAACL 2024 (Findings)

  10. arXiv:2402.02655  [pdf, other]

    cs.CL

    VlogQA: Task, Dataset, and Baseline Models for Vietnamese Spoken-Based Machine Reading Comprehension

    Authors: Thinh Phuoc Ngo, Khoa Tran Anh Dang, Son T. Luu, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: This paper presents the development process of a Vietnamese spoken language corpus for machine reading comprehension (MRC) tasks and provides insights into the challenges and opportunities associated with using real-world data for machine reading comprehension tasks. The existing MRC corpora in Vietnamese mainly focus on formal written documents such as Wikipedia articles, online newspapers, or te…

    Submitted 6 April, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

    Comments: To appear as the main conference paper at EACL 2024

  11. arXiv:2312.09000  [pdf, ps, other]

    cs.CL

    ComOM at VLSP 2023: A Dual-Stage Framework with BERTology and Unified Multi-Task Instruction Tuning Model for Vietnamese Comparative Opinion Mining

    Authors: Dang Van Thin, Duong Ngoc Hao, Ngan Luu-Thuy Nguyen

    Abstract: The ComOM shared task aims to extract comparative opinions from product reviews in Vietnamese language. There are two sub-tasks, including (1) Comparative Sentence Identification (CSI) and (2) Comparative Element Extraction (CEE). The first task is to identify whether the input is a comparative review, and the purpose of the second task is to extract the quintuplets mentioned in the comparative re…

    Submitted 14 December, 2023; originally announced December 2023.

    Comments: Accepted manuscript at VLSP 2023

  12. Abusive Span Detection for Vietnamese Narrative Texts

    Authors: Nhu-Thanh Nguyen, Khoa Thi-Kim Phan, Duc-Vu Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Abuse in its various forms, including physical, psychological, verbal, sexual, financial, and cultural, has a negative impact on mental health. However, there are limited studies on applying natural language processing (NLP) in this field in Vietnam. Therefore, we aim to contribute by building a human-annotated Vietnamese dataset for detecting abusive content in Vietnamese narrative texts. We sour…

    Submitted 12 December, 2023; originally announced December 2023.

    Comments: Accepted at SoICT 2023

  13. arXiv:2310.18046  [pdf, other]

    cs.CL cs.CV

    ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese

    Authors: Khiem Vinh Tran, Hao Phu Phan, Kiet Van Nguyen, Ngan Luu Thuy Nguyen

    Abstract: In recent years, Visual Question Answering (VQA) has gained significant attention for its diverse applications, including intelligent car assistance, aiding visually impaired individuals, and document image information retrieval using natural language queries. VQA requires effective integration of information from questions and images to generate accurate answers. Neural models for VQA have made r…

    Submitted 27 October, 2023; originally announced October 2023.

    Comments: A pre-print version and submitted to journal

  14. arXiv:2307.15335  [pdf, other]

    cs.CL cs.CV

    BARTPhoBEiT: Pre-trained Sequence-to-Sequence and Image Transformers Models for Vietnamese Visual Question Answering

    Authors: Khiem Vinh Tran, Kiet Van Nguyen, Ngan Luu Thuy Nguyen

    Abstract: Visual Question Answering (VQA) is an intricate and demanding task that integrates natural language processing (NLP) and computer vision (CV), capturing the interest of researchers. The English language, renowned for its wealth of resources, has witnessed notable advancements in both datasets and models designed for VQA. However, there is a lack of models that target specific countries such as Vie…

    Submitted 28 July, 2023; originally announced July 2023.

  15. OpenViVQA: Task, Dataset, and Multimodal Fusion Models for Visual Question Answering in Vietnamese

    Authors: Nghia Hieu Nguyen, Duong T. D. Vo, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: In recent years, visual question answering (VQA) has attracted attention from the research community because of its highly potential applications (such as virtual assistance on intelligent cars, assistant devices for blind people, or information retrieval from document images using natural language as queries) and challenge. The VQA task requires methods that have the ability to fuse the informati…

    Submitted 6 May, 2023; originally announced May 2023.

    Comments: submitted to Elsevier

  16. arXiv:2304.06871  [pdf, other]

    cs.CV eess.IV

    L1BSR: Exploiting Detector Overlap for Self-Supervised Single-Image Super-Resolution of Sentinel-2 L1B Imagery

    Authors: Ngoc Long Nguyen, Jérémy Anger, Axel Davy, Pablo Arias, Gabriele Facciolo

    Abstract: High-resolution satellite imagery is a key element for many Earth monitoring applications. Satellites such as Sentinel-2 feature characteristics that are favorable for super-resolution algorithms such as aliasing and band-misalignment. Unfortunately the lack of reliable high-resolution (HR) ground truth limits the application of deep learning methods to this task. In this work we propose L1BSR, a…

    Submitted 17 April, 2023; v1 submitted 13 April, 2023; originally announced April 2023.

    Comments: EarthVision 2023

  17. arXiv:2303.18162  [pdf, ps, other]

    cs.CL

    ViMMRC 2.0 -- Enhancing Machine Reading Comprehension on Vietnamese Literature Text

    Authors: Son T. Luu, Khoi Trong Hoang, Tuong Quang Pham, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Machine reading comprehension has been an interesting and challenging task in recent years, with the purpose of extracting useful information from texts. To attain the computer ability to understand the reading text and answer relevant information, we introduce ViMMRC 2.0 - an extension of the previous ViMMRC for the task of multiple-choice reading comprehension in Vietnamese Textbooks which conta…

    Submitted 7 June, 2025; v1 submitted 31 March, 2023; originally announced March 2023.

    Comments: Accepted for publication at International Journal of Asian Language Processing

  18. arXiv:2303.13355  [pdf, other]

    cs.CL cs.AI

    Revealing Weaknesses of Vietnamese Language Models Through Unanswerable Questions in Machine Reading Comprehension

    Authors: Son Quoc Tran, Phong Nguyen-Thuan Do, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Although the curse of multilinguality significantly restricts the language abilities of multilingual models in monolingual settings, researchers now still have to rely on multilingual models to develop state-of-the-art systems in Vietnamese Machine Reading Comprehension. This difficulty in researching is because of the limited number of high-quality works in developing Vietnamese language models.…

    Submitted 16 March, 2023; originally announced March 2023.

    Comments: Accepted at The 2023 EACL Student Research Workshop

  19. arXiv:2303.05879  [pdf, other]

    eess.IV cs.CV

    Handheld Burst Super-Resolution Meets Multi-Exposure Satellite Imagery

    Authors: Jamy Lafenetre, Ngoc Long Nguyen, Gabriele Facciolo, Thomas Eboli

    Abstract: Image resolution is an important criterion for many applications based on satellite imagery. In this work, we adapt a state-of-the-art kernel regression technique for smartphone camera burst super-resolution to satellites. This technique leverages the local structure of the image to optimally steer the fusion kernels, limiting blur in the final high-resolution prediction, denoising the image, and…

    Submitted 10 March, 2023; originally announced March 2023.

    Comments: 9 pages

  20. EVJVQA Challenge: Multilingual Visual Question Answering

    Authors: Ngan Luu-Thuy Nguyen, Nghia Hieu Nguyen, Duong T. D Vo, Khanh Quoc Tran, Kiet Van Nguyen

    Abstract: Visual Question Answering (VQA) is a challenging task of natural language processing (NLP) and computer vision (CV), attracting significant attention from researchers. English is a resource-rich language that has witnessed various developments in datasets and models for visual question answering. Visual question answering in other languages also would be developed for resources and models. In addi…

    Submitted 17 April, 2024; v1 submitted 22 February, 2023; originally announced February 2023.

    Comments: VLSP2022 EVJVQA challenge

  21. arXiv:2302.11494  [pdf, other]

    cs.CV eess.IV

    On The Role of Alias and Band-Shift for Sentinel-2 Super-Resolution

    Authors: Ngoc Long Nguyen, Jérémy Anger, Lara Raad, Bruno Galerne, Gabriele Facciolo

    Abstract: In this work, we study the problem of single-image super-resolution (SISR) of Sentinel-2 imagery. We show that thanks to its unique sensor specification, namely the inter-band shift and alias, that deep-learning methods are able to recover fine details. By training a model using a simple $L_1$ loss, results are free of hallucinated details. For this study, we build a dataset of pairs of images Sen…

    Submitted 17 April, 2023; v1 submitted 22 February, 2023; originally announced February 2023.

    Comments: 4 pages, 3 figures

  22. arXiv:2301.10186  [pdf, other]

    cs.CL

    ViHOS: Hate Speech Spans Detection for Vietnamese

    Authors: Phu Gia Hoang, Canh Duc Luu, Khanh Quoc Tran, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: The rise in hateful and offensive language directed at other users is one of the adverse side effects of the increased use of social networking platforms. This could make it difficult for human moderators to review tagged comments filtered by classification systems. To help address this issue, we present the ViHOS (Vietnamese Hate and Offensive Spans) dataset, the first human-annotated corpus cont…

    Submitted 26 January, 2023; v1 submitted 24 January, 2023; originally announced January 2023.

    Comments: EACL 2023

  23. arXiv:2301.00429  [pdf, other]

    cs.CL

    Integrating Semantic Information into Sketchy Reading Module of Retro-Reader for Vietnamese Machine Reading Comprehension

    Authors: Hang Thi-Thu Le, Viet-Duc Ho, Duc-Vu Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Machine Reading Comprehension has become one of the most advanced and popular research topics in the fields of Natural Language Processing in recent years. The classification of answerability questions is a relatively significant sub-task in machine reading comprehension; however, there haven't been many studies. Retro-Reader is one of the studies that has solved this problem effectively. However,…

    Submitted 1 January, 2023; originally announced January 2023.

    Comments: In Proceedings of the 9th NAFOSTED Conference on Information and Computer Science (NICS 2022)

  24. arXiv:2301.00422  [pdf, other]

    cs.CL

    Leveraging Semantic Representations Combined with Contextual Word Representations for Recognizing Textual Entailment in Vietnamese

    Authors: Quoc-Loc Duong, Duc-Vu Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: RTE is a significant problem and is a reasonably active research community. The proposed research works on the approach to this problem are pretty diverse with many different directions. For Vietnamese, the RTE problem is moderately new, but this problem plays a vital role in natural language understanding systems. Currently, methods to solve this problem based on contextual word representation le…

    Submitted 1 January, 2023; originally announced January 2023.

    Comments: In Proceedings of the 9th NAFOSTED Conference on Information and Computer Science (NICS 2022)

  25. arXiv:2301.00418  [pdf, ps, other]

    cs.CL

    Is word segmentation necessary for Vietnamese sentiment classification?

    Authors: Duc-Vu Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: To the best of our knowledge, this paper made the first attempt to answer whether word segmentation is necessary for Vietnamese sentiment classification. To do this, we presented five pre-trained monolingual S4-based language models for Vietnamese, including one model without word segmentation, and four models using RDRsegmenter, uitnlp, pyvi, or underthesea toolkits in the pre-processing data ph…

    Submitted 1 January, 2023; originally announced January 2023.

    Comments: In Proceedings of the 16th International Conference on Computing and Communication Technologies (RIVF 2022)

  26. arXiv:2211.08170  [pdf, other]

    cs.CL cs.DB cs.IR cs.LG

    A Comparative Study of Question Answering over Knowledge Bases

    Authors: Khiem Vinh Tran, Hao Phu Phan, Khang Nguyen Duc Quach, Ngan Luu-Thuy Nguyen, Jun Jo, Thanh Tam Nguyen

    Abstract: Question answering over knowledge bases (KBQA) has become a popular approach to help users extract information from knowledge bases. Although several systems exist, choosing one suitable for a particular application scenario is difficult. In this article, we provide a comparative study of six representative KBQA systems on eight benchmark datasets. In that, we study various question types, propert…

    Submitted 15 November, 2022; originally announced November 2022.

  27. arXiv:2209.11505  [pdf, ps, other]

    cs.AI cs.CC cs.LG

    The complexity of unsupervised learning of lexicographic preferences

    Authors: Hélène Fargier, Pierre-François Gimenez, Jérôme Mengin, Bao Ngoc Le Nguyen

    Abstract: This paper considers the task of learning users' preferences on a combinatorial set of alternatives, as generally used by online configurators, for example. In many settings, only a set of selected alternatives during past interactions is available to the learner. Fargier et al. [2018] propose an approach to learn, in such a setting, a model of the users' preferences that ranks previously chosen a…

    Submitted 23 September, 2022; originally announced September 2022.

    Journal ref: 13th Multidisciplinary Workshop on Advances in Preference Handling, Jul 2022, Vienne, Austria

  28. arXiv:2209.10482  [pdf, other]

    cs.CL

    SMTCE: A Social Media Text Classification Evaluation Benchmark and BERTology Models for Vietnamese

    Authors: Luan Thanh Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Text classification is a typical natural language processing or computational linguistics task with various interesting applications. As the number of users on social media platforms increases, data acceleration promotes emerging studies on Social Media Text Classification (SMTC) or social media text mining on these valuable resources. In contrast to English, Vietnamese, one of the low-resource la…

    Submitted 21 September, 2022; originally announced September 2022.

    Comments: Accepted at The 36th annual Meeting of Pacific Asia Conference on Language, Information and Computation (PACLIC 36)

  29. arXiv:2206.09600  [pdf, other]

    cs.CL

    SPBERTQA: A Two-Stage Question Answering System Based on Sentence Transformers for Medical Texts

    Authors: Nhung Thi-Hong Nguyen, Phuong Phan-Dieu Ha, Luan Thanh Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Question answering (QA) systems have gained explosive attention in recent years. However, QA tasks in Vietnamese do not have many datasets. Significantly, there is mostly no dataset in the medical domain. Therefore, we built a Vietnamese Healthcare Question Answering dataset (ViHealthQA), including 10,015 question-answer passage pairs for this task, in which questions from health-interested users…

    Submitted 20 June, 2022; originally announced June 2022.

  30. arXiv:2205.02031  [pdf, other]

    cs.CV eess.IV

    Self-Supervised Super-Resolution for Multi-Exposure Push-Frame Satellites

    Authors: Ngoc Long Nguyen, Jérémy Anger, Axel Davy, Pablo Arias, Gabriele Facciolo

    Abstract: Modern Earth observation satellites capture multi-exposure bursts of push-frame images that can be super-resolved via computational means. In this work, we propose a super-resolution method for such multi-exposure sequences, a problem that has received very little attention in the literature. The proposed method can handle the signal-dependent noise in the inputs, process sequences of any length,…

    Submitted 4 May, 2022; originally announced May 2022.

    Comments: CVPR 2022

  31. arXiv:2204.07002  [pdf, other]

    cs.CL

    XLMRQA: Open-Domain Question Answering on Vietnamese Wikipedia-based Textual Knowledge Source

    Authors: Kiet Van Nguyen, Phong Nguyen-Thuan Do, Nhat Duy Nguyen, Tin Van Huynh, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Question answering (QA) is a natural language understanding task within the fields of information retrieval and information extraction that has attracted much attention from the computational linguistics and artificial intelligence research community in recent years because of the strong development of machine reading comprehension-based models. A reader-based QA system is a high-level search engi…

    Submitted 13 August, 2022; v1 submitted 14 April, 2022; originally announced April 2022.

    Comments: Accepted by ACIIDS 2022

  32. VLSP 2021 - ViMRC Challenge: Vietnamese Machine Reading Comprehension

    Authors: Kiet Van Nguyen, Son Quoc Tran, Luan Thanh Nguyen, Tin Van Huynh, Son T. Luu, Ngan Luu-Thuy Nguyen

    Abstract: One of the emerging research trends in natural language understanding is machine reading comprehension (MRC) which is the task to find answers to human questions based on textual data. Existing Vietnamese datasets for MRC research concentrate solely on answerable questions. However, in reality, questions can be unanswerable for which the correct answer is not stated in the given textual data. To a…

    Submitted 4 April, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

    Comments: The 8th International Workshop on Vietnamese Language and Speech Processing (VLSP 2021)

  33. arXiv:2112.09488  [pdf, other]

    cs.CL

    Joint Chinese Word Segmentation and Part-of-speech Tagging via Two-stage Span Labeling

    Authors: Duc-Vu Nguyen, Linh-Bao Vo, Ngoc-Linh Tran, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Chinese word segmentation and part-of-speech tagging are necessary tasks in terms of computational linguistics and application of natural language processing. Many researchers still debate the demand for Chinese word segmentation and part-of-speech tagging in the deep learning era. Nevertheless, resolving ambiguities and detecting unknown words are challenging problems in this field. Previous stu…

    Submitted 17 December, 2021; originally announced December 2021.

    Comments: In Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation (PACLIC 2021)

  34. arXiv:2110.00156  [pdf, other]

    cs.CL

    Span Labeling Approach for Vietnamese and Chinese Word Segmentation

    Authors: Duc-Vu Nguyen, Linh-Bao Vo, Dang Van Thin, Ngan Luu-Thuy Nguyen

    Abstract: In this paper, we propose a span labeling approach to model n-gram information for Vietnamese word segmentation, namely SPAN SEG. We compare the span labeling approach with the conditional random field by using encoders with the same architecture. Since Vietnamese and Chinese have similar linguistic phenomena, we evaluated the proposed method on the Vietnamese treebank benchmark dataset and five C…

    Submitted 30 September, 2021; originally announced October 2021.

    Comments: In Proceedings of the 18th Pacific Rim International Conference on Artificial Intelligence (PRICAI 2021)

  35. arXiv:2108.13741  [pdf, other]

    cs.CL cs.AI

    Monolingual versus Multilingual BERTology for Vietnamese Extractive Multi-Document Summarization

    Authors: Huy Quoc To, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen, Anh Gia-Tuan Nguyen

    Abstract: Recent researches have demonstrated that BERT shows potential in a wide range of natural language processing tasks. It is adopted as an encoder for many state-of-the-art automatic summarizing systems, which achieve excellent performance. However, so far, there is not much work done for Vietnamese. In this paper, we showcase how BERT can be implemented for extractive text summarization in Vietnames…

    Submitted 16 October, 2021; v1 submitted 31 August, 2021; originally announced August 2021.

  36. arXiv:2105.09043  [pdf, other]

    cs.CL

    Sentence Extraction-Based Machine Reading Comprehension for Vietnamese

    Authors: Phong Nguyen-Thuan Do, Nhat Duy Nguyen, Tin Van Huynh, Kiet Van Nguyen, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: The development of natural language processing (NLP) in general and machine reading comprehension in particular has attracted the great attention of the research community. In recent years, there are a few datasets for machine reading comprehension tasks in Vietnamese with large sizes, such as UIT-ViQuAD and UIT-ViNewsQA. However, the datasets are not diverse in answers to serve the research. In t…

    Submitted 11 June, 2021; v1 submitted 19 May, 2021; originally announced May 2021.

    Comments: Accepted by KSEM 2021 (International Conference on Knowledge Science, Engineering and Management)

  37. Conversational Machine Reading Comprehension for Vietnamese Healthcare Texts

    Authors: Son T. Luu, Mao Nguyen Bui, Loi Duc Nguyen, Khiem Vinh Tran, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Machine reading comprehension (MRC) is a sub-field in natural language processing that aims to assist computers understand unstructured texts and then answer questions related to them. In practice, the conversation is an essential way to communicate and transfer information. To help machines understand conversation texts, we present UIT-ViCoQA, a new corpus for conversational machine reading compr…

    Submitted 30 September, 2021; v1 submitted 4 May, 2021; originally announced May 2021.

    Comments: Published at The 13th International Conference on Computational Collective Intelligence (ICCCI 2021)

  38. arXiv:2104.11969  [pdf, ps, other]

    cs.CL

    Vietnamese Complaint Detection on E-Commerce Websites

    Authors: Nhung Thi-Hong Nguyen, Phuong Phan-Dieu Ha, Luan Thanh Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Customer product reviews play a role in improving the quality of products and services for business organizations or their brands. Complaining is an attitude that expresses dissatisfaction with an event or a product not meeting customer expectations. In this paper, we build an Open-domain Complaint Detection dataset (UIT-ViOCD), including 5,485 human-annotated reviews on four categories about produ…

    Submitted 5 July, 2021; v1 submitted 24 April, 2021; originally announced April 2021.

  39. UIT-ISE-NLP at SemEval-2021 Task 5: Toxic Spans Detection with BiLSTM-CRF and ToxicBERT Comment Classification

    Authors: Son T. Luu, Ngan Luu-Thuy Nguyen

    Abstract: We present our works on SemEval-2021 Task 5 about Toxic Spans Detection. This task aims to build a model for identifying toxic words in whole posts. We use the BiLSTM-CRF model combining with ToxicBERT Classification to train the detection model for identifying toxic words in posts. Our model achieves 62.23% by F1-score on the Toxic Spans Detection task.

    Submitted 29 July, 2021; v1 submitted 20 April, 2021; originally announced April 2021.

    Comments: Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

  40. A Large-scale Dataset for Hate Speech Detection on Vietnamese Social Media Texts

    Authors: Son T. Luu, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: In recent years, Vietnam witnesses the mass development of social network users on different social platforms such as Facebook, Youtube, Instagram, and Tiktok. On social medias, hate speech has become a critical problem for social network users. To solve this problem, we introduce the ViHSD - a human-annotated dataset for automatically detecting hate speech on the social network. This dataset cont…

    Submitted 20 July, 2021; v1 submitted 21 March, 2021; originally announced March 2021.

    Comments: IEA/AIE 2021: Advances and Trends in Artificial Intelligence. Artificial Intelligence Practices, pp 415-426

  41. Constructive and Toxic Speech Detection for Open-domain Social Media Comments in Vietnamese

    Authors: Luan Thanh Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: The rise of social media has led to the increasing of comments on online forums. However, there still exists invalid comments which are not informative for users. Moreover, those comments are also quite toxic and harmful to people. In this paper, we create a dataset for constructive and toxic speech detection, named UIT-ViCTSD (Vietnamese Constructive and Toxic Speech Detection dataset) with 10,00…

    Submitted 6 September, 2021; v1 submitted 18 March, 2021; originally announced March 2021.

    Comments: IEA/AIE 2021: Advances and Trends in Artificial Intelligence. Artificial Intelligence Practices pp 572-583

  42. arXiv:2103.09519  [pdf, other]

    cs.CL cs.LG

    Investigating Monolingual and Multilingual BERT Models for Vietnamese Aspect Category Detection

    Authors: Dang Van Thin, Lac Si Le, Vu Xuan Hoang, Ngan Luu-Thuy Nguyen

    Abstract: Aspect category detection (ACD) is one of the challenging tasks in the Aspect-Based Sentiment Analysis problem. The purpose of this task is to identify the aspect categories mentioned in user-generated reviews from a set of pre-defined categories. In this paper, we investigate the performance of various monolingual pre-trained language models compared with multilingual models on the Vietnamese asp…

    Submitted 17 March, 2021; originally announced March 2021.

    Comments: 6 pages, 1 figure

  43. arXiv:2102.12136   

    cs.CL

    Augmenting Part-of-speech Tagging with Syntactic Information for Vietnamese and Chinese

    Authors: Duc-Vu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Word segmentation and part-of-speech tagging are two critical preliminary steps for downstream tasks in Vietnamese natural language processing. In reality, people tend to also consider the phrase boundary when performing word segmentation and part-of-speech tagging rather than solely processing word by word from left to right. In this paper, we implement this idea to improve word segmentation and par…

    Submitted 16 June, 2021; v1 submitted 24 February, 2021; originally announced February 2021.

    Comments: The comparison with existing methods in this paper is unfair because the hyper-parameters of Bi-LSTM are different compared with previous research. Importantly, there is a data leakage issue w.r.t this paper's experimental setup

  44. arXiv:2101.10200  [pdf, other]

    cs.CV

    Proba-V-ref: Repurposing the Proba-V challenge for reference-aware super resolution

    Authors: Ngoc Long Nguyen, Jérémy Anger, Axel Davy, Pablo Arias, Gabriele Facciolo

    Abstract: The PROBA-V Super-Resolution challenge distributes real low-resolution image series and corresponding high-resolution targets to advance research on Multi-Image Super Resolution (MISR) for satellite images. However, in the PROBA-V dataset the low-resolution image corresponding to the high-resolution target is not identified. We argue that in doing so, the challenge ranks the proposed methods not o…

    Submitted 26 January, 2021; v1 submitted 25 January, 2021; originally announced January 2021.

    Comments: 5 pages

  45. Gender Prediction Based on Vietnamese Names with Machine Learning Techniques

    Authors: Huy Quoc To, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen, Anh Gia-Tuan Nguyen

    Abstract: Because biological gender is one aspect of how an individual is presented, much work has been done on gender classification based on people's names. Proposals for English and Chinese are numerous; still, few works have addressed Vietnamese so far. We propose a new dataset for gender prediction based on Vietnamese names. This dataset comprises over 26,000 full names annotate…

    Submitted 23 March, 2021; v1 submitted 21 October, 2020; originally announced October 2020.

    Comments: 6 pages, 6 figures. NLPIR 2020: 4th International Conference on Natural Language Processing and Information Retrieval

  46. arXiv:2010.09623  [pdf, other]

    cs.CL

    An Empirical Study for Vietnamese Constituency Parsing with Pre-training

    Authors: Tuan-Vi Tran, Xuan-Thien Pham, Duc-Vu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: In this work, we use a span-based approach for Vietnamese constituency parsing. Our method follows the self-attention encoder architecture and a chart decoder using a CKY-style inference algorithm. We analyze experimental results comparing our method with the pre-trained models XLM-RoBERTa and PhoBERT on two Vietnamese datasets, VietTreebank and NIIVTB1. The result…

    Submitted 19 October, 2020; v1 submitted 19 October, 2020; originally announced October 2020.

  47. arXiv:2009.14725  [pdf, other]

    cs.CL

    A Vietnamese Dataset for Evaluating Machine Reading Comprehension

    Authors: Kiet Van Nguyen, Duc-Vu Nguyen, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Over 97 million people speak Vietnamese as their native language in the world. However, there are few research studies on machine reading comprehension (MRC) for Vietnamese, the task of understanding a text and answering questions related to it. Due to the lack of benchmark datasets for Vietnamese, we present the Vietnamese Question Answering Dataset (UIT-ViQuAD), a new dataset for the low-resourc…

    Submitted 7 November, 2020; v1 submitted 30 September, 2020; originally announced September 2020.

    Comments: Accepted by The 28th International Conference on Computational Linguistics (COLING 2020)

  48. arXiv:2009.13060  [pdf, other]

    cs.CL

    A Simple and Efficient Ensemble Classifier Combining Multiple Neural Network Models on Social Media Datasets in Vietnamese

    Authors: Huy Duc Huynh, Hang Thi-Thuy Do, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Text classification is a popular topic in natural language processing that has attracted numerous research efforts worldwide. The significant increase of data on social media requires considerable attention from researchers to analyze such data. There are various studies in this field in many languages, but few for the Vietnamese language. Therefore, this study aims to classify Vietnamese…

    Submitted 28 September, 2020; v1 submitted 28 September, 2020; originally announced September 2020.

    Comments: Accepted by The 34th Pacific Asia Conference on Language, Information and Computation (PACLIC2020)

  49. arXiv:2009.12319  [pdf, other]

    cs.CL

    Empirical Study of Text Augmentation on Social Media Text in Vietnamese

    Authors: Son T. Luu, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: In the text classification problem, the imbalance of labels in datasets affects the performance of text-classification models. In practice, not all user comments on social networking sites are visible: administrators often allow only positive comments and hide negative ones. Thus, when collecting the data about user comments on the social network, the data is usually…

    Submitted 9 October, 2020; v1 submitted 25 September, 2020; originally announced September 2020.

    Comments: Accepted by The 34th Pacific Asia Conference on Language, Information and Computation

  50. arXiv:2009.02935  [pdf, other]

    cs.CL

    UIT-HSE at WNUT-2020 Task 2: Exploiting CT-BERT for Identifying COVID-19 Information on the Twitter Social Network

    Authors: Khiem Vinh Tran, Hao Phu Phan, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Abstract: Recently, COVID-19 has affected a variety of real-life aspects of the world and led to dreadful consequences. More and more tweets about COVID-19 have been shared publicly on Twitter. However, a plurality of those tweets are uninformative, which makes it challenging to build automatic systems to detect the informative ones for useful AI applications. In this paper, we present our results at the W-NUT 2…

    Submitted 13 November, 2020; v1 submitted 7 September, 2020; originally announced September 2020.

    Comments: Accepted by 2020 The 6th Workshop on Noisy User-generated Text (W-NUT) - EMNLP 2020

    Journal ref: https://www.aclweb.org/anthology/2020.wnut-1.53/