-
Datasheets Aren't Enough: DataRubrics for Automated Quality Metrics and Accountability
Authors:
Genta Indra Winata,
David Anugraha,
Emmy Liu,
Alham Fikri Aji,
Shou-Yi Hung,
Aditya Parashar,
Patrick Amadeus Irawan,
Ruochen Zhang,
Zheng-Xin Yong,
Jan Christian Blaise Cruz,
Niklas Muennighoff,
Seungone Kim,
Hanyang Zhao,
Sudipta Kar,
Kezia Erina Suryoraharjo,
M. Farid Adilazuarda,
En-Shiun Annie Lee,
Ayu Purwarianti,
Derry Tanti Wijaya,
Monojit Choudhury
Abstract:
High-quality datasets are fundamental to training and evaluating machine learning models, yet their creation, especially with accurate human annotations, remains a significant challenge. Many dataset paper submissions lack originality, diversity, or rigorous quality control, and these shortcomings are often overlooked during peer review. Submissions also frequently omit essential details about dataset construction and properties. While existing tools such as datasheets aim to promote transparency, they are largely descriptive and do not provide standardized, measurable methods for evaluating data quality. Similarly, metadata requirements at conferences promote accountability but are inconsistently enforced. To address these limitations, this position paper advocates for the integration of systematic, rubric-based evaluation metrics into the dataset review process, particularly as submission volumes continue to grow. We also explore scalable, cost-effective methods for synthetic data generation, including dedicated tools and LLM-as-a-judge approaches, to support more efficient evaluation. As a call to action, we introduce DataRubrics, a structured framework for assessing the quality of both human- and model-generated datasets. Leveraging recent advances in LLM-based evaluation, DataRubrics offers a reproducible, scalable, and actionable solution for dataset quality assessment, enabling both authors and reviewers to uphold higher standards in data-centric research. We also release code to support reproducibility of LLM-based evaluations at https://github.com/datarubrics/datarubrics.
Submitted 3 June, 2025; v1 submitted 2 June, 2025;
originally announced June 2025.
-
CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation
Authors:
Emilio Villa-Cueva,
Sholpan Bolatzhanova,
Diana Turmakhan,
Kareem Elzeky,
Henok Biadglign Ademtew,
Alham Fikri Aji,
Israel Abebe Azime,
Jinheon Baek,
Frederico Belcavello,
Fermin Cristobal,
Jan Christian Blaise Cruz,
Mary Dabre,
Raj Dabre,
Toqeer Ehsan,
Naome A Etori,
Fauzan Farooqui,
Jiahui Geng,
Guido Ivetta,
Thanmay Jayakumar,
Soyeong Jeong,
Zheng Wei Lim,
Aishik Mandal,
Sofia Martinelli,
Mihail Minkov Mihaylov,
Daniil Orel
, et al. (9 additional authors not shown)
Abstract:
Cultural content poses challenges for machine translation systems due to the differences in conceptualizations between cultures, where language alone may fail to convey sufficient context to capture region-specific meanings. In this work, we investigate whether images can act as cultural context in multimodal translation. We introduce CaMMT, a human-curated benchmark of over 5,800 triples of images along with parallel captions in English and regional languages. Using this dataset, we evaluate five Vision Language Models (VLMs) in text-only and text+image settings. Through automatic and human evaluations, we find that visual context generally improves translation quality, especially in handling Culturally-Specific Items (CSIs), disambiguation, and correct gender usage. By releasing CaMMT, we aim to support broader efforts in building and evaluating multimodal translation systems that are better aligned with cultural nuance and regional variation.
Submitted 30 May, 2025;
originally announced May 2025.
-
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
Authors:
Samuel Cahyawijaya,
Holy Lovenia,
Joel Ruben Antony Moniz,
Tack Hwa Wong,
Mohammad Rifqi Farhansyah,
Thant Thiri Maung,
Frederikus Hudi,
David Anugraha,
Muhammad Ravi Shulthan Habibi,
Muhammad Reza Qorib,
Amit Agarwal,
Joseph Marvin Imperial,
Hitesh Laxmichand Patel,
Vicky Feliren,
Bahrul Ilmi Nasution,
Manuel Antonio Rufino,
Genta Indra Winata,
Rian Adam Rajagede,
Carlos Rafael Catalan,
Mohamed Fazli Imam,
Priyaranjan Pattnayak,
Salsabila Zahirah Pranida,
Kevin Pratama,
Yeshil Bangera,
Adisai Na-Thalang
, et al. (67 additional authors not shown)
Abstract:
Southeast Asia (SEA) is a region of extraordinary linguistic and cultural diversity, yet it remains significantly underrepresented in vision-language (VL) research. This often results in artificial intelligence (AI) models that fail to capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing high-quality, culturally relevant data for SEA languages. By involving contributors from SEA countries, SEA-VL aims to ensure better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages in VL research. Beyond crowdsourcing, our initiative goes one step further in the exploration of the automatic collection of culturally relevant images through crawling and image generation. First, we find that image crawling achieves approximately 85% cultural relevance while being more cost- and time-efficient than crowdsourcing. Second, despite the substantial progress in generative vision models, synthetic images remain unreliable in accurately reflecting SEA cultures. The generated images often fail to reflect the nuanced traditions and cultural contexts of the region. Collectively, we gather 1.28M SEA culturally relevant images, more than 50 times larger than other existing datasets. Through SEA-VL, we aim to bridge the representation gap in SEA, fostering the development of more inclusive AI systems that authentically represent diverse cultures across SEA.
Submitted 18 March, 2025; v1 submitted 10 March, 2025;
originally announced March 2025.
-
Political Events using RAG with LLMs
Authors:
Muhammad Arslan,
Saba Munawar,
Christophe Cruz
Abstract:
In the contemporary digital landscape, media content stands as the foundation for political news analysis, offering invaluable insights sourced from various channels like news articles, social media updates, speeches, and reports. Natural Language Processing (NLP) has revolutionized Political Information Extraction (IE), automating tasks such as Event Extraction (EE) from these diverse media outlets. While traditional NLP methods often necessitate specialized expertise to build rule-based systems or train machine learning models with domain-specific datasets, the emergence of Large Language Models (LLMs) driven by Generative Artificial Intelligence (GenAI) presents a promising alternative. These models offer accessibility, alleviating challenges associated with model construction from scratch and reducing the dependency on extensive datasets during the training phase, thus facilitating rapid implementation. However, challenges persist in handling domain-specific tasks, leading to the development of the Retrieval-Augmented Generation (RAG) framework. RAG enhances LLMs by integrating external data retrieval, enriching their contextual understanding, and expanding their knowledge base beyond pre-existing training data. To illustrate RAG's efficacy, we introduce the Political EE system, specifically tailored to extract political event information from news articles. Understanding these political insights is essential for remaining informed about the latest political advancements, whether on a national or global scale.
Submitted 6 January, 2025;
originally announced February 2025.
-
Sustainable Digitalization of Business with Multi-Agent RAG and LLM
Authors:
Muhammad Arslan,
Saba Munawar,
Christophe Cruz
Abstract:
Businesses heavily rely on data sourced from various channels like news articles, financial reports, and consumer reviews to drive their operations, enabling informed decision-making and identifying opportunities. However, traditional manual methods for data extraction are often time-consuming and resource-intensive, prompting the adoption of digital transformation initiatives to enhance efficiency. Yet, concerns persist regarding the sustainability of such initiatives and their alignment with the United Nations (UN)'s Sustainable Development Goals (SDGs). This research aims to explore the integration of Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) as a sustainable solution for Information Extraction (IE) and processing. The research methodology involves reviewing existing solutions for business decision-making, noting that many systems require training new machine learning models, which are resource-intensive and have significant environmental impacts. Instead, we propose a sustainable business solution using pre-existing LLMs that can work with diverse datasets. We link domain-specific datasets to tailor LLMs to company needs and employ a Multi-Agent architecture to divide tasks such as information retrieval, enrichment, and classification among specialized agents. This approach optimizes the extraction process and improves overall efficiency. Through the utilization of these technologies, businesses can optimize resource utilization, improve decision-making processes, and contribute to sustainable development goals, thereby fostering environmental responsibility within the corporate sector.
Submitted 6 January, 2025;
originally announced February 2025.
-
Unlocking the Potential of Generative AI through Neuro-Symbolic Architectures: Benefits and Limitations
Authors:
Oualid Bougzime,
Samir Jabbar,
Christophe Cruz,
Frédéric Demoly
Abstract:
Neuro-symbolic artificial intelligence (NSAI) represents a transformative approach in artificial intelligence (AI) by combining deep learning's ability to handle large-scale and unstructured data with the structured reasoning of symbolic methods. By leveraging their complementary strengths, NSAI enhances generalization, reasoning, and scalability while addressing key challenges such as transparency and data efficiency. This paper systematically studies diverse NSAI architectures, highlighting their unique approaches to integrating neural and symbolic components. It examines the alignment of contemporary AI techniques such as retrieval-augmented generation, graph neural networks, reinforcement learning, and multi-agent systems with NSAI paradigms. This study then evaluates these architectures against a comprehensive set of criteria, including generalization, reasoning capabilities, transferability, and interpretability, thereby providing a comparative analysis of their respective strengths and limitations. Notably, the Neuro > Symbolic < Neuro model consistently outperforms its counterparts across all evaluation metrics. This result aligns with state-of-the-art research that highlights the efficacy of such architectures in harnessing advanced technologies like multi-agent systems.
Submitted 16 February, 2025;
originally announced February 2025.
-
Enhancing Knowledge Graph Construction: Evaluating with Emphasis on Hallucination, Omission, and Graph Similarity Metrics
Authors:
Hussam Ghanem,
Christophe Cruz
Abstract:
Recent advancements in large language models have demonstrated significant potential in the automated construction of knowledge graphs from unstructured text. This paper builds upon our previous work [16], which evaluated various models using metrics like precision, recall, F1 score, triple matching, and graph matching, and introduces a refined approach to address the critical issues of hallucination and omission. We propose an enhanced evaluation framework incorporating BERTScore for graph similarity, setting a practical threshold of 95% for graph matching. Our experiments focus on the Mistral model, comparing its original and fine-tuned versions in zero-shot and few-shot settings. We further extend our experiments using examples from the KELM-sub training dataset, illustrating that the fine-tuned model significantly improves knowledge graph construction accuracy while reducing exact hallucinations and omissions. However, our findings also reveal that the fine-tuned models perform worse in generalization tasks on the KELM-sub dataset. This study underscores the importance of comprehensive evaluation metrics in advancing the state-of-the-art in knowledge graph construction from textual data.
Submitted 7 February, 2025;
originally announced February 2025.
-
Extracting General-use Transformers for Low-resource Languages via Knowledge Distillation
Authors:
Jan Christian Blaise Cruz,
Alham Fikri Aji
Abstract:
In this paper, we propose the use of simple knowledge distillation to produce smaller and more efficient single-language transformers from Massively Multilingual Transformers (MMTs) to alleviate tradeoffs associated with the use of such in low-resource settings. Using Tagalog as a case study, we show that these smaller single-language models perform on par with strong baselines in a variety of benchmark tasks in a much more efficient manner. Furthermore, we investigate additional steps during the distillation process that improve the soft-supervision of the target language, and provide a number of analyses and ablations to show the efficacy of the proposed method.
Submitted 22 January, 2025;
originally announced January 2025.
-
Thank You, Stingray: Multilingual Large Language Models Can Not (Yet) Disambiguate Cross-Lingual Word Sense
Authors:
Samuel Cahyawijaya,
Ruochen Zhang,
Holy Lovenia,
Jan Christian Blaise Cruz,
Elisa Gilbert,
Hiroki Nomoto,
Alham Fikri Aji
Abstract:
Multilingual large language models (LLMs) have gained prominence, but concerns arise regarding their reliability beyond English. This study addresses the gap in cross-lingual semantic evaluation by introducing a novel benchmark for cross-lingual sense disambiguation, StingrayBench. In this paper, we demonstrate using false friends -- words that are orthographically similar but have completely different meanings in two languages -- as a possible approach to pinpoint the limitation of cross-lingual sense disambiguation in LLMs. We collect false friends in four language pairs, namely Indonesian-Malay, Indonesian-Tagalog, Chinese-Japanese, and English-German; and challenge LLMs to distinguish the use of them in context. In our analysis of various models, we observe they tend to be biased toward higher-resource languages. We also propose new metrics for quantifying the cross-lingual sense bias and comprehension based on our benchmark. Our work contributes to developing more diverse and inclusive language modeling, promoting fairer access for the wider multilingual community.
Submitted 30 October, 2024; v1 submitted 28 October, 2024;
originally announced October 2024.
-
Exploring Quantum Neural Networks for Demand Forecasting
Authors:
Gleydson Fernandes de Jesus,
Maria Heloísa Fraga da Silva,
Otto Menegasso Pires,
Lucas Cruz da Silva,
Clebson dos Santos Cruz,
Valéria Loureiro da Silva
Abstract:
Forecasting demand for assets and services can be addressed in various markets, providing a competitive advantage when the predictive models used demonstrate high accuracy. However, the training of machine learning models incurs high computational costs, which may limit the training of prediction models based on available computational capacity. In this context, this paper presents an approach for training demand prediction models using quantum neural networks. For this purpose, a quantum neural network was used to forecast demand for vehicle financing. A classical recurrent neural network was used to compare the results, and they show a similar predictive capacity between the classical and quantum models, with the advantage of using a lower number of training parameters and also converging in fewer steps. Utilizing quantum computing techniques offers a promising solution to overcome the limitations of traditional machine learning approaches in training predictive models for complex market dynamics.
Submitted 19 October, 2024;
originally announced October 2024.
-
WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines
Authors:
Genta Indra Winata,
Frederikus Hudi,
Patrick Amadeus Irawan,
David Anugraha,
Rifki Afina Putri,
Yutong Wang,
Adam Nohejl,
Ubaidillah Ariq Prathama,
Nedjma Ousidhoum,
Afifa Amriani,
Anar Rzayev,
Anirban Das,
Ashmari Pramodya,
Aulia Adila,
Bryan Wilie,
Candy Olivia Mawalim,
Ching Lam Cheng,
Daud Abolade,
Emmanuele Chersoni,
Enrico Santus,
Fariz Ikhwantri,
Garry Kuwanto,
Hanyang Zhao,
Haryo Akbarianto Wibowo,
Holy Lovenia
, et al. (26 additional authors not shown)
Abstract:
Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points, making it the largest multicultural VQA benchmark to date. It includes tasks for identifying dish names and their origins. We provide evaluation datasets in two sizes (12k and 60k instances) alongside a training dataset (1 million instances). Our findings show that while VLMs perform better with correct location context, they struggle with adversarial contexts and predicting specific regional cuisines and languages. To support future research, we release a knowledge base with annotated food entries and images along with the VQA data.
Submitted 8 May, 2025; v1 submitted 16 October, 2024;
originally announced October 2024.
-
Data Augmentation for 3DMM-based Arousal-Valence Prediction for HRI
Authors:
Christian Arzate Cruz,
Yotam Sechayk,
Takeo Igarashi,
Randy Gomez
Abstract:
Humans use multiple communication channels to interact with each other. For instance, body gestures or facial expressions are commonly used to convey an intent. The use of such non-verbal cues has motivated the development of prediction models. One such approach is predicting arousal and valence (AV) from facial expressions. However, making these models accurate for human-robot interaction (HRI) settings is challenging as it requires handling multiple subjects, challenging conditions, and a wide range of facial expressions. In this paper, we propose a data augmentation (DA) technique to improve the performance of AV predictors using 3D morphable models (3DMM). We then utilize this approach in an HRI setting with a mediator robot and a group of three humans. Our augmentation method creates synthetic sequences for underrepresented values in the AV space of the SEWA dataset, which is the most comprehensive dataset with continuous AV labels. Results show that using our DA method improves the accuracy and robustness of AV prediction in real-time applications. The accuracy of our models on the SEWA dataset is 0.793 for arousal and valence.
Submitted 30 September, 2024;
originally announced October 2024.
-
SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
Authors:
Holy Lovenia,
Rahmad Mahendra,
Salsabil Maulana Akbar,
Lester James V. Miranda,
Jennifer Santoso,
Elyanah Aco,
Akhdan Fadhilah,
Jonibek Mansurov,
Joseph Marvin Imperial,
Onno P. Kampman,
Joel Ruben Antony Moniz,
Muhammad Ravi Shulthan Habibi,
Frederikus Hudi,
Railey Montalan,
Ryan Ignatius,
Joanito Agili Lopo,
William Nixon,
Börje F. Karlsson,
James Jaya,
Ryandito Diandaru,
Yuze Gao,
Patrick Amadeus,
Bin Wang,
Jan Christian Blaise Cruz,
Chenxi Whitehouse
, et al. (36 additional authors not shown)
Abstract:
Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative initiative that consolidates a comprehensive resource hub that fills the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in SEA.
Submitted 10 March, 2025; v1 submitted 14 June, 2024;
originally announced June 2024.
-
CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark
Authors:
David Romero,
Chenyang Lyu,
Haryo Akbarianto Wibowo,
Teresa Lynn,
Injy Hamed,
Aditya Nanda Kishore,
Aishik Mandal,
Alina Dragonetti,
Artem Abzaliev,
Atnafu Lambebo Tonja,
Bontu Fufa Balcha,
Chenxi Whitehouse,
Christian Salamea,
Dan John Velasco,
David Ifeoluwa Adelani,
David Le Meur,
Emilio Villa-Cueva,
Fajri Koto,
Fauzan Farooqui,
Frederico Belcavello,
Ganzorig Batnasan,
Gisela Vallejo,
Grainne Caulfield,
Guido Ivetta,
Haiyue Song
, et al. (51 additional authors not shown)
Abstract:
Visual Question Answering (VQA) is an important task in multimodal AI, and it is often used to test the ability of vision-language models to understand and reason on knowledge present in both visual and textual data. However, most of the current VQA models use datasets that are primarily focused on English and a few major world languages, with images that are typically Western-centric. While recent efforts have tried to increase the number of languages covered on VQA datasets, they still lack diversity in low-resource languages. More importantly, although these datasets often extend their linguistic range via translation or some other approaches, they usually keep images the same, resulting in narrow cultural representation. To address these limitations, we construct CVQA, a new Culturally-diverse multilingual Visual Question Answering benchmark, designed to cover a rich set of languages and cultures, where we engage native speakers and cultural experts in the data collection process. As a result, CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions. We then benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models. This benchmark can serve as a probing evaluation suite for assessing the cultural capability and bias of multimodal models and hopefully encourage more research efforts toward increasing cultural awareness and linguistic diversity in this field.
Submitted 4 November, 2024; v1 submitted 9 June, 2024;
originally announced June 2024.
-
Spatial Summation of Localized Pressure for Haptic Sensory Prostheses
Authors:
Sreela Kodali,
Cihualpilli Camino Cruz,
Thomas C. Bulea,
Kevin S. Rao,
Diana Bharucha-Goebel,
Alexander T. Chesler,
Carsten G. Bonnemann,
Allison M. Okamura
Abstract:
A host of medical conditions, including amputations, diabetes, stroke, and genetic disease, result in loss of touch sensation. Because most types of sensory loss have no pharmacological treatment or rehabilitative therapy, we propose a haptic sensory prosthesis that provides substitutive feedback. The wrist and forearm are compelling locations for feedback due to available skin area and not occluding the hands, but have reduced mechanoreceptor density compared to the fingertips. Focusing on localized pressure as the feedback modality, we hypothesize that we can improve on prior devices by invoking a wider range of stimulus intensity using multiple points of pressure to evoke spatial summation, which is the cumulative perceptual experience from multiple points of stimuli. We conducted a preliminary perceptual test to investigate this idea and found that the just-noticeable difference is reduced with two points of pressure compared to one, motivating future work using spatial summation in sensory prostheses.
Submitted 3 April, 2024;
originally announced April 2024.
-
Transforming Competition into Collaboration: The Revolutionary Role of Multi-Agent Systems and Language Models in Modern Organizations
Authors:
Carlos Jose Xavier Cruz
Abstract:
This article explores the dynamic influence of computational entities based on multi-agent systems theory (SMA) combined with large language models (LLM), which are characterized by their ability to simulate complex human interactions, as a possibility to revolutionize human user interaction from the use of specialized artificial agents to support everything from operational organizational processes to strategic decision making based on applied knowledge and human orchestration. Previous investigations reveal that there are limitations, particularly in the autonomous approach of artificial agents, especially when dealing with new challenges and pragmatic tasks such as inducing logical reasoning and problem solving. It is also considered that traditional techniques, such as the stimulation of chains of thoughts, require explicit human guidance. In our approach we employ agents developed from large language models (LLM), each with distinct prototyping that considers behavioral elements, driven by strategies that stimulate the generation of knowledge based on the use case proposed in the scenario (role-play) business, using a discussion approach between agents (guided conversation). We demonstrate the potential of developing agents useful for organizational strategies, based on multi-agent system theories (SMA) and innovative uses based on large language models (LLM based), offering a differentiated and adaptable experiment to different applications, complexities, domains, and capabilities from LLM.
△ Less
Submitted 15 March, 2024; v1 submitted 12 March, 2024;
originally announced March 2024.
-
Trustworthy human-centric based Automated Decision-Making Systems
Authors:
Marcelino Cabrera,
Carlos Cruz,
Pavel Novoa-Hernández,
David A. Pelta,
José Luis Verdegay
Abstract:
Automated Decision-Making Systems (ADS) have become pervasive across various fields, activities, and occupations, to enhance performance. However, this widespread adoption introduces potential risks, including the misuse of ADS. Such misuse may manifest when ADS is employed in situations where it is unnecessary or when essential requirements, conditions, and terms are overlooked, leading to unintended consequences. This research paper presents a thorough examination of the implications, distinctions, and ethical considerations associated with digitalization, digital transformation, and the utilization of ADS in contemporary society and future contexts. Emphasis is placed on the imperative need for regulation, transparency, and ethical conduct in the deployment of ADS.
Submitted 22 December, 2023;
originally announced January 2024.
-
Samsung R&D Institute Philippines at WMT 2023
Authors:
Jan Christian Blaise Cruz
Abstract:
In this paper, we describe the constrained MT systems submitted by Samsung R&D Institute Philippines to the WMT 2023 General Translation Task for two directions: en$\rightarrow$he and he$\rightarrow$en. Our systems comprise Transformer-based sequence-to-sequence models that are trained with a mix of best practices: comprehensive data preprocessing pipelines, synthetic backtranslated data, and the use of noisy channel reranking during online decoding. Our models perform comparably to, and sometimes outperform, strong baseline unconstrained systems such as mBART50 M2M and NLLB 200 MoE despite having significantly fewer parameters on two public benchmarks: FLORES-200 and NTREX-128.
Submitted 24 October, 2023;
originally announced October 2023.
-
LASIGE and UNICAGE solution to the NASA LitCoin NLP Competition
Authors:
Pedro Ruas,
Diana F. Sousa,
André Neves,
Carlos Cruz,
Francisco M. Couto
Abstract:
Biomedical Natural Language Processing (NLP) tends to become cumbersome for most researchers, frequently due to the amount and heterogeneity of text to be processed. To address this challenge, the industry is continuously developing highly efficient tools and creating more flexible engineering solutions. This work presents the integration between industry data engineering solutions for efficient data processing and academic systems developed for Named Entity Recognition (LasigeUnicage_NER) and Relation Extraction (BiOnt). Our design reflects an integration of those components with external knowledge in the form of additional training data from other datasets and biomedical ontologies. We used this pipeline in the 2022 LitCoin NLP Challenge, where our team LasigeUnicage was awarded the 7th Prize out of approximately 200 participating teams, reflecting a successful collaboration between academia (LASIGE) and industry (Unicage). The software supporting this work is available at https://github.com/lasigeBioTM/Litcoin-Lasige_Unicage.
Submitted 10 August, 2023;
originally announced August 2023.
-
Towards Automated Semantic Segmentation in Mammography Images
Authors:
Cesar A. Sierra-Franco,
Jan Hurtado,
Victor de A. Thomaz,
Leonardo C. da Cruz,
Santiago V. Silva,
Alberto B. Raposo
Abstract:
Mammography images are widely used to detect non-palpable breast lesions or nodules, preventing cancer and providing the opportunity to plan interventions when necessary. The identification of some structures of interest is essential to make a diagnosis and evaluate image adequacy. Thus, computer-aided detection systems can be helpful in assisting medical interpretation by automatically segmenting these landmark structures. In this paper, we propose a deep learning-based framework for the segmentation of the nipple, the pectoral muscle, the fibroglandular tissue, and the fatty tissue on standard-view mammography images. We introduce a large private segmentation dataset and extensive experiments considering different deep-learning model architectures. Our experiments demonstrate accurate segmentation performance on varied and challenging cases, showing that this framework can be integrated into clinical practice.
Submitted 18 July, 2023;
originally announced July 2023.
-
Knowledge Graph for NLG in the context of conversational agents
Authors:
Hussam Ghanem,
Massinissa Atmani,
Christophe Cruz
Abstract:
The use of knowledge graphs (KGs) enhances the accuracy and comprehensiveness of the responses provided by a conversational agent. While generating answers during conversations consists in generating text from these KGs, it is still regarded as a challenging task that has gained significant attention in recent years. In this document, we provide a review of different architectures used for knowledge graph-to-text generation including: Graph Neural Networks, the Graph Transformer, and linearization with seq2seq models. We discuss the advantages and limitations of each architecture and conclude that the choice of architecture will depend on the specific requirements of the task at hand. We also highlight the importance of considering constraints such as execution time and model validity, particularly in the context of conversational agents. Based on these constraints and the availability of labeled data for the domains of DAVI, we choose to use seq2seq Transformer-based models (PLMs) for the Knowledge Graph-to-Text Generation task. We aim to refine benchmark datasets of kg-to-text generation on PLMs and to explore the emotional and multilingual dimensions in our future work. Overall, this review provides insights into the different approaches for knowledge graph-to-text generation and outlines future directions for research in this area.
Submitted 4 July, 2023;
originally announced July 2023.
-
Pseudo-Labeling Enhanced by Privileged Information and Its Application to In Situ Sequencing Images
Authors:
Marzieh Haghighi,
Mario C. Cruz,
Erin Weisbart,
Beth A. Cimini,
Avtar Singh,
Julia Bauman,
Maria E. Lozada,
Sanam L. Kavari,
James T. Neal,
Paul C. Blainey,
Anne E. Carpenter,
Shantanu Singh
Abstract:
Various strategies for label-scarce object detection have been explored by the computer vision research community. These strategies mainly rely on assumptions that are specific to natural images and not directly applicable to the biological and biomedical vision domains. For example, most semi-supervised learning strategies rely on a small set of labeled data as a confident source of ground truth. In many biological vision applications, however, the ground truth is unknown and indirect information might be available in the form of noisy estimations or orthogonal evidence. In this work, we frame a crucial problem in spatial transcriptomics - decoding barcodes from In-Situ-Sequencing (ISS) images - as a semi-supervised object detection (SSOD) problem. Our proposed framework incorporates additional available sources of information into a semi-supervised learning framework in the form of privileged information. The privileged information is incorporated into the teacher's pseudo-labeling in a teacher-student self-training iteration. Although the available privileged information could be data domain specific, we have introduced a general strategy of pseudo-labeling enhanced by privileged information (PLePI) and exemplified the concept using ISS images, as well as on the COCO benchmark using extra evidence provided by CLIP.
Submitted 27 June, 2023;
originally announced June 2023.
-
Unlocking Insights into Business Trajectories with Transformer-based Spatio-temporal Data Analysis
Authors:
Muhammad Arslan,
Christophe Cruz
Abstract:
The world of business is constantly evolving and staying ahead of the curve requires a deep understanding of market trends and performance. This article addresses this requirement by modeling business trajectories using news articles data.
Submitted 7 June, 2023;
originally announced June 2023.
-
Imbalanced Multi-label Classification for Business-related Text with Moderately Large Label Spaces
Authors:
Muhammad Arslan,
Christophe Cruz
Abstract:
In this study, we compared the performance of four different methods for multi-label text classification using a specific imbalanced business dataset. The four methods we evaluated were fine-tuned BERT, Binary Relevance, Classifier Chains, and Label Powerset. The results show that fine-tuned BERT outperforms the other three methods by a significant margin, achieving high values of accuracy, F1 Score, Precision, and Recall. Binary Relevance also performs well on this dataset, while Classifier Chains and Label Powerset demonstrate relatively poor performance. These findings highlight the effectiveness of fine-tuned BERT for multi-label text classification tasks, and suggest that it may be a useful tool for businesses seeking to analyze complex and multifaceted texts.
Submitted 12 June, 2023;
originally announced June 2023.
-
Multilingual Large Language Models Are Not (Yet) Code-Switchers
Authors:
Ruochen Zhang,
Samuel Cahyawijaya,
Jan Christian Blaise Cruz,
Genta Indra Winata,
Alham Fikri Aji
Abstract:
Multilingual Large Language Models (LLMs) have recently shown great capabilities in a wide range of tasks, exhibiting state-of-the-art performance through zero-shot or few-shot prompting methods. While there have been extensive studies on their abilities in monolingual tasks, the investigation of their potential in the context of code-switching (CSW), the practice of alternating languages within an utterance, remains relatively uncharted. In this paper, we provide a comprehensive empirical analysis of various multilingual LLMs, benchmarking their performance across four tasks: sentiment analysis, machine translation, summarization and word-level language identification. Our results indicate that despite multilingual LLMs exhibiting promising outcomes in certain tasks using zero or few-shot prompting, they still underperform in comparison to fine-tuned models of much smaller scales. We argue that current "multilingualism" in LLMs does not inherently imply proficiency with code-switching texts, calling for future research to bridge this discrepancy.
Submitted 23 October, 2023; v1 submitted 23 May, 2023;
originally announced May 2023.
-
Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages
Authors:
Zheng-Xin Yong,
Ruochen Zhang,
Jessica Zosa Forde,
Skyler Wang,
Arjun Subramonian,
Holy Lovenia,
Samuel Cahyawijaya,
Genta Indra Winata,
Lintang Sutawika,
Jan Christian Blaise Cruz,
Yin Lin Tan,
Long Phan,
Rowena Garcia,
Thamar Solorio,
Alham Fikri Aji
Abstract:
While code-mixing is a common linguistic practice in many parts of the world, collecting high-quality and low-cost code-mixed data remains a challenge for natural language processing (NLP) research. The recent proliferation of Large Language Models (LLMs) compels one to ask: how capable are these systems in generating code-mixed data? In this paper, we explore prompting multilingual LLMs in a zero-shot manner to generate code-mixed data for seven languages in South East Asia (SEA), namely Indonesian, Malay, Chinese, Tagalog, Vietnamese, Tamil, and Singlish. We find that publicly available multilingual instruction-tuned models such as BLOOMZ and Flan-T5-XXL are incapable of producing texts with phrases or clauses from different languages. ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing. For instance, ChatGPT generates fluent and natural Singlish texts (an English-based creole spoken in Singapore), but for the English-Tamil language pair, the system mostly produces grammatically incorrect or semantically meaningless utterances. Furthermore, it may erroneously introduce languages not specified in the prompt. Based on our investigation, existing multilingual LLMs exhibit a wide range of proficiency in code-mixed data generation for SEA languages. As such, we advise against using LLMs in this context without extensive human checks.
Submitted 12 September, 2023; v1 submitted 23 March, 2023;
originally announced March 2023.
-
Quantum algorithm for finding minimum values in a Quantum Random Access Memory
Authors:
Anton S. Albino,
Lucas Q. Galvão,
Ethan Hansen,
Mauro Q. Nooblath Neto,
Clebson Cruz
Abstract:
Finding the minimum value in an unordered database is a common and fundamental task in computer science. However, the optimal classical deterministic algorithm can find the minimum value with a time complexity that grows linearly with the number of elements in the database. In this paper, we present the proposal of a quantum algorithm for finding the minimum value of a database, which is quadratically faster than its best classical analogs. We assume a Quantum Random Access Memory (QRAM) that stores values from a database and perform an iterative search based on an oracle whose role is to limit the searched values by controlling the states of the most significant qubits. A complexity analysis was performed in order to demonstrate the advantage of this quantum algorithm over its classical counterparts. Furthermore, we demonstrate how the proposed algorithm would be used in an unsupervised machine learning task through a quantum version of the K-means algorithm.
Submitted 12 January, 2023;
originally announced January 2023.
-
Smart meter data processing: a showcase for simple and efficient textual processing
Authors:
Miguel Ferreira,
André Neves,
Rodrigo Gorjão,
Carlos Cruz,
Miguel L. Pardal
Abstract:
The increase in the production and collection of data from devices is an ongoing trend due to the roll-out of more cyber-physical applications. Smart meters, because of their importance in power grids, are a class of such devices whose produced data requires meticulous processing. In this paper, we use Unicage, a data processing system based on classic Unix shell scripting, that delivers excellent performance in a simple package. We use this methodology to process smart meter data in XML format, subjected to the constraints posed by a real use case. We develop a solution that parses, validates and performs a simple aggregation of 27 million XML files in less than 10 minutes. We present a study of the solution as well as the benefits of its adoption.
Submitted 27 December, 2022;
originally announced December 2022.
-
Leaf Tar Spot Detection Using RGB Images
Authors:
Sriram Baireddy,
Da-Young Lee,
Carlos Gongora-Canul,
Christian D. Cruz,
Edward J. Delp
Abstract:
Tar spot disease is a fungal disease that appears as a series of black circular spots containing spores on corn leaves. Tar spot has proven to be an impactful disease in terms of reducing crop yield. To quantify disease progression, experts usually have to visually phenotype leaves from the plant. This process is very time-consuming and is difficult to incorporate in any high-throughput phenotyping system. Deep neural networks could provide quick, automated tar spot detection with sufficient ground truth. However, manually labeling tar spots in images to serve as ground truth is also tedious and time-consuming. In this paper we first describe an approach that uses automated image analysis tools to generate ground truth images that are then used for training a Mask R-CNN. We show that a Mask R-CNN can be used effectively to detect tar spots in close-up images of leaf surfaces. We additionally show that the Mask R-CNN can also be used for in-field images of whole leaves to capture the number of tar spots and area of the leaf infected by the disease.
Submitted 2 May, 2022;
originally announced May 2022.
-
Towards Automatic Construction of Filipino WordNet: Word Sense Induction and Synset Induction Using Sentence Embeddings
Authors:
Dan John Velasco,
Axel Alba,
Trisha Gail Pelagio,
Bryce Anthony Ramirez,
Unisse Chua,
Briane Paul Samson,
Jan Christian Blaise Cruz,
Charibeth Cheng
Abstract:
Wordnets are indispensable tools for various natural language processing applications. Unfortunately, wordnets get outdated, and producing or updating wordnets can be slow and costly in terms of time and resources. This problem intensifies for low-resource languages. This study proposes a method for word sense induction and synset induction using only two linguistic resources, namely, an unlabeled corpus and a sentence embeddings-based language model. The resulting sense inventory and synonym sets can be used in automatically creating a wordnet. We applied this method on a corpus of Filipino text. The sense inventory and synsets were evaluated by matching them with the sense inventory of the machine translated Princeton WordNet, as well as comparing the synsets to the Filipino WordNet. This study empirically shows that 30% of the induced word senses are valid and 40% of the induced synsets are valid, of which 20% are novel synsets.
Submitted 19 October, 2023; v1 submitted 7 April, 2022;
originally announced April 2022.
-
Using Synthetic Data for Conversational Response Generation in Low-resource Settings
Authors:
Gabriel Louis Tan,
Adrian Paule Ty,
Schuyler Ng,
Denzel Adrian Co,
Jan Christian Blaise Cruz,
Charibeth Cheng
Abstract:
Response generation is a task in natural language processing (NLP) where a model is trained to respond to human statements. Conversational response generators take this one step further with the ability to respond within the context of previous responses. While there are existing techniques for training such models, they all require an abundance of conversational data which are not always available for low-resource languages. In this research, we make three contributions. First, we released the first Filipino conversational dataset collected from a popular Philippine online forum, which we named the PEx Conversations Dataset. Second, we introduce a data augmentation (DA) methodology for Filipino data by employing a Tagalog RoBERTa model to increase the size of the existing corpora. Lastly, we published the first Filipino conversational response generator capable of generating responses related to the previous 3 responses. With the supplementary synthetic data, we were able to improve the performance of the response generator by up to 12.2% in BERTScore, 10.7% in perplexity, and 11.7% in content word usage as compared to training with zero synthetic data.
Submitted 6 April, 2022;
originally announced April 2022.
-
Data Processing Matters: SRPH-Konvergen AI's Machine Translation System for WMT'21
Authors:
Lintang Sutawika,
Jan Christian Blaise Cruz
Abstract:
In this paper, we describe the submission of the joint Samsung Research Philippines-Konvergen AI team for the WMT'21 Large Scale Multilingual Translation Task - Small Track 2. We submit a standard Seq2Seq Transformer model to the shared task without any training or architecture tricks, relying mainly on the strength of our data preprocessing techniques to boost performance. Our final submission model scored 22.92 average BLEU on the FLORES-101 devtest set, and scored 22.97 average BLEU on the contest's hidden test set, ranking us sixth overall. Despite using only a standard Transformer, our model ranked first in Indonesian to Javanese, showing that data preprocessing matters as much as, if not more than, cutting-edge model architectures and training techniques.
Submitted 19 November, 2021;
originally announced November 2021.
-
Improving Large-scale Language Models and Resources for Filipino
Authors:
Jan Christian Blaise Cruz,
Charibeth Cheng
Abstract:
In this paper, we improve on existing language resources for the low-resource Filipino language in two ways. First, we outline the construction of the TLUnified dataset, a large-scale pretraining corpus that serves as an improvement over smaller existing pretraining datasets for the language in terms of scale and topic variety. Second, we pretrain new Transformer language models following the RoBERTa pretraining technique to supplant existing models trained with small corpora. Our new RoBERTa models show significant improvements over existing Filipino models in three benchmark datasets with an average gain of 4.47% test accuracy across the three classification tasks of varying difficulty.
Submitted 11 November, 2021;
originally announced November 2021.
-
A Survey on Interactive Reinforcement Learning: Design Principles and Open Challenges
Authors:
Christian Arzate Cruz,
Takeo Igarashi
Abstract:
Interactive reinforcement learning (RL) has been successfully used in various applications in different fields, which has also motivated HCI researchers to contribute in this area. In this paper, we survey interactive RL to empower human-computer interaction (HCI) researchers with the technical background in RL needed to design new interaction techniques and propose new applications. We elucidate the roles played by HCI researchers in interactive RL, identifying ideas and promising research directions. Furthermore, we propose generic design principles that will provide researchers with a guide to effectively implement interactive RL applications.
Submitted 27 May, 2021;
originally announced May 2021.
-
MarioMix: Creating Aligned Playstyles for Bots with Interactive Reinforcement Learning
Authors:
Christian Arzate Cruz,
Takeo Igarashi
Abstract:
In this paper, we propose a generic framework that enables game developers without knowledge of machine learning to create bot behaviors with playstyles that align with their preferences. Our framework is based on interactive reinforcement learning (RL), and we used it to create a behavior authoring tool called MarioMix. This tool enables non-experts to create bots with varied playstyles for the game titled Super Mario Bros. The main interaction procedure of MarioMix consists of presenting short clips of gameplay displaying precomputed bots with different playstyles to end-users. Then, end-users can select the bot with the playstyle that behaves as intended. We evaluated MarioMix by incorporating input from game designers working in the industry.
Submitted 27 May, 2021;
originally announced May 2021.
-
Interactive Explanations: Diagnosis and Repair of Reinforcement Learning Based Agent Behaviors
Authors:
Christian Arzate Cruz,
Takeo Igarashi
Abstract:
Reinforcement learning techniques successfully generate convincing agent behaviors, but it is still difficult to tailor the behavior to align with a user's specific preferences. What is missing is a communication method for the system to explain the behavior and for the user to repair it. In this paper, we present a novel interaction method that uses interactive explanations using templates of natural language as a communication method. The main advantage of this interaction method is that it enables a two-way communication channel between users and the agent; the bot can explain its thinking procedure to the users, and the users can communicate their behavior preferences to the bot using the same interactive explanations. In this manner, the thinking procedure of the bot is transparent, and users can provide corrections to the bot that include a suggested action to take, a goal to achieve, and the reasons behind these decisions. We tested our proposed method in a clone of the video game named Super Mario Bros., and the results demonstrate that our interactive explanation approach is effective at diagnosing and repairing bot behaviors.
Submitted 27 May, 2021;
originally announced May 2021.
-
Exploiting News Article Structure for Automatic Corpus Generation of Entailment Datasets
Authors:
Jan Christian Blaise Cruz,
Jose Kristian Resabal,
James Lin,
Dan John Velasco,
Charibeth Cheng
Abstract:
Transformers represent the state-of-the-art in Natural Language Processing (NLP) in recent years, proving effective even in tasks done in low-resource languages. While pretrained transformers for these languages can be made, it is challenging to measure their true performance and capacity due to the lack of hard benchmark datasets, as well as the difficulty and cost of producing them. In this paper, we present three contributions: First, we propose a methodology for automatically producing Natural Language Inference (NLI) benchmark datasets for low-resource languages using published news articles. Through this, we create and release NewsPH-NLI, the first sentence entailment benchmark dataset in the low-resource Filipino language. Second, we produce new pretrained transformers based on the ELECTRA technique to further alleviate the resource scarcity in Filipino, benchmarking them on our dataset against other commonly-used transfer learning techniques. Lastly, we perform analyses on transfer learning techniques to shed light on their true performance when operating in low-data domains through the use of degradation tests.
Submitted 13 August, 2021; v1 submitted 22 October, 2020;
originally announced October 2020.
-
Establishing Baselines for Text Classification in Low-Resource Languages
Authors:
Jan Christian Blaise Cruz,
Charibeth Cheng
Abstract:
While transformer-based finetuning techniques have proven effective in tasks that involve low-resource, low-data environments, a lack of properly established baselines and benchmark datasets makes it hard to compare different approaches that are aimed at tackling the low-resource setting. In this work, we provide three contributions. First, we introduce two previously unreleased datasets as benchmark datasets for text classification and low-resource multilabel text classification for the low-resource language Filipino. Second, we pretrain better BERT and DistilBERT models for use within the Filipino setting. Third, we introduce a simple degradation test that benchmarks a model's resistance to performance degradation as the number of training samples is reduced. We analyze our pretrained models' degradation speeds and look towards the use of this method for comparing models aimed at operating within the low-resource setting. We release all our models and datasets for the research community to use.
Submitted 5 May, 2020;
originally announced May 2020.
-
Simplifying Paragraph-level Question Generation via Transformer Language Models
Authors:
Luis Enrico Lopez,
Diane Kathryn Cruz,
Jan Christian Blaise Cruz,
Charibeth Cheng
Abstract:
Question generation (QG) is a natural language generation task where a model is trained to ask questions corresponding to some input text. Most recent approaches frame QG as a sequence-to-sequence problem and rely on additional features and mechanisms to increase performance; however, these often increase model complexity, and can rely on auxiliary data unavailable in practical use. A single Transformer-based unidirectional language model leveraging transfer learning can be used to produce high quality questions while disposing of additional task-specific complexity. Our QG model, finetuned from GPT-2 Small, outperforms several paragraph-level QG baselines on the SQuAD dataset by 0.95 METEOR points. Human evaluators rated questions as easy to answer, relevant to their context paragraph, and corresponding well to natural human speech. Also introduced is a new set of baseline scores on the RACE dataset, which has not previously been used for QG tasks. Further experimentation with varying model capacities and datasets with non-identification type questions is recommended in order to further verify the robustness of pretrained Transformer-based LMs as question generators.
Submitted 13 August, 2021; v1 submitted 3 May, 2020;
originally announced May 2020.
-
Flashlight CNN Image Denoising
Authors:
Pham Huu Thanh Binh,
Cristóvão Cruz,
Karen Egiazarian
Abstract:
This paper proposes a learning-based denoising method called FlashLight CNN (FLCNN) that implements a deep neural network for image denoising. The proposed approach is based on deep residual networks and inception networks, and it is able to leverage many more parameters than residual networks alone for denoising grayscale images corrupted by additive white Gaussian noise (AWGN). FlashLight CNN demonstrates state-of-the-art performance when compared quantitatively and visually with current state-of-the-art image denoising methods.
Submitted 2 July, 2020; v1 submitted 2 March, 2020;
originally announced March 2020.
-
Automated Smart Wick System-Based Microfarm Using Internet of Things
Authors:
R. Jorda, Jr.,
C. Alcabasa,
A. Buhay,
E. C. Dela Cruz,
J. P. Mendoza,
A. Tolentino,
L. K. Tolentino,
E. Fernandez,
A. Thio-ac,
J. Velasco,
N. Arago
Abstract:
This paper presents a study conducted to allow urban farmers to remotely monitor their farm through the design and development of an Internet of Things-based (IoT) microfarm prototype that uses a wick system as its planting method. The system detects three environmental parameters, namely light intensity, soil moisture, and temperature, using corresponding sensors connected to the Arduino microcontroller, the sensor node of the system. Irregularities in these parameters were neutralized by parameter regulators such as LED growlight strips, a water pump, and an air cooler. The data collected by these sensors were gathered by the Arduino microcontroller and sent to the Web database through the IoT gateway, which was the Raspberry Pi computer chip. These data were also sent to an Android unit installed with the Microfarm Companion application, which was capable of monitoring and controlling the environmental parameters observed in the microfarm. The application allows the user to view the current value of each parameter and to choose whether to control the parameter regulators automatically or manually. The microfarm system runs autonomously, which reduces the labor required to produce healthy plants and crops. Mustard greens samples were used in testing the system. After a month of monitoring the height of the samples, it was observed that the average height of the samples was about 0.23 cm taller than the standard height. The proponents have also tested the system functionality by evaluating the sensor data log, which provides the values gathered by the sensors and the turn-on times of the parameter regulators. From these data, it can be observed that whenever the values obtained by the sensors fall outside the threshold range, the parameter regulators turn on, indicating that the system is working properly.
Submitted 30 October, 2019;
originally announced November 2019.
-
Localization of Fake News Detection via Multitask Transfer Learning
Authors:
Jan Christian Blaise Cruz,
Julianne Agatha Tan,
Charibeth Cheng
Abstract:
The use of the internet as a fast medium of spreading fake news reinforces the need for computational tools that combat it. Techniques that train fake news classifiers exist, but they all assume an abundance of resources including large labeled datasets and expert-curated corpora, which low-resource languages may not have. In this work, we make two main contributions: First, we alleviate resource scarcity by constructing the first expertly-curated benchmark dataset for fake news detection in Filipino, which we call "Fake News Filipino." Second, we benchmark Transfer Learning (TL) techniques and show that they can be used to train robust fake news classifiers from little data, achieving 91% accuracy on our fake news dataset, reducing the error by 14% compared to established few-shot baselines. Furthermore, lifting ideas from multitask learning, we show that augmenting transformer-based transfer techniques with auxiliary language modeling losses improves their performance by adapting to writing style. Using this, we improve TL performance by 4-6%, achieving an accuracy of 96% on our best model. Lastly, we show that our method generalizes well to different types of news articles, including political news, entertainment news, and opinion articles.
Submitted 15 May, 2020; v1 submitted 21 October, 2019;
originally announced October 2019.
-
Vertex arboricity of cographs
Authors:
Sebastián González Hermosillo de la Maza,
Pavol Hell,
César Hernández Cruz,
Seyyed Aliasghar Hosseini,
Payam Valadkhan
Abstract:
Arboricity is a graph parameter akin to chromatic number, in that it seeks to partition the vertices into the smallest number of sparse subgraphs. Where for the chromatic number we are partitioning the vertices into independent sets, for the arboricity we want to partition the vertices into cycle-free subsets (i.e., forests). Arboricity is NP-hard in general, and our focus is on the arboricity of cographs. For arboricity two, we obtain the complete list of minimal cograph obstructions. These minimal obstructions do generalize to higher arboricities; however, we no longer have a complete list, and in fact, the number of minimal cograph obstructions grows exponentially with arboricity. We obtain bounds on their size and the height of their cotrees.
More generally, we consider the following common generalization of colouring and partition into forests: given non-negative integers $p$ and $q$, we ask if a given cograph $G$ admits a vertex partition into $p$ forests and $q$ independent sets. We give a polynomial-time dynamic programming algorithm for this problem. In fact, the algorithm solves a more general problem which also includes several other problems such as finding a maximum $q$-colourable subgraph, maximum subgraph of arboricity-$p$, minimum vertex feedback set and minimum $q$ of a $q$-colourable vertex feedback set.
Submitted 16 July, 2019;
originally announced July 2019.
-
Evaluating Language Model Finetuning Techniques for Low-resource Languages
Authors:
Jan Christian Blaise Cruz,
Charibeth Cheng
Abstract:
Unlike mainstream languages (such as English and French), low-resource languages often suffer from a lack of expert-annotated corpora and benchmark resources that make it hard to apply state-of-the-art techniques directly. In this paper, we alleviate this scarcity problem for the low-resourced Filipino language in two ways. First, we introduce a new benchmark language modeling dataset in Filipino which we call WikiText-TL-39. Second, we show that language model finetuning techniques such as BERT and ULMFiT can be used to consistently train robust classifiers in low-resource settings, experiencing at most a 0.0782 increase in validation error when the number of training examples is decreased from 10K to 1K while finetuning using a privately-held sentiment dataset.
Submitted 30 June, 2019;
originally announced July 2019.
-
Nonlocality-Reinforced Convolutional Neural Networks for Image Denoising
Authors:
Cristóvão Cruz,
Alessandro Foi,
Vladimir Katkovnik,
Karen Egiazarian
Abstract:
We introduce a paradigm for nonlocal sparsity reinforced deep convolutional neural network denoising. It is a combination of a local multiscale denoising by a convolutional neural network (CNN) based denoiser and a nonlocal denoising based on a nonlocal filter (NLF) exploiting the mutual similarities between groups of patches. CNN models are leveraged with noise levels that progressively decrease at every iteration of our framework, while their output is regularized by a nonlocal prior implicit within the NLF. Unlike complicated neural networks that embed the nonlocality prior within the layers of the network, our framework is modular: it uses standard pre-trained CNNs together with standard nonlocal filters. An instance of the proposed framework, called NN3D, is evaluated over large grayscale image datasets, showing state-of-the-art performance.
Submitted 21 June, 2018; v1 submitted 6 March, 2018;
originally announced March 2018.
-
Single Image Super-Resolution based on Wiener Filter in Similarity Domain
Authors:
Cristóvão Cruz,
Rakesh Mehta,
Vladimir Katkovnik,
Karen Egiazarian
Abstract:
Single image super resolution (SISR) is an ill-posed problem aiming at estimating a plausible high resolution (HR) image from a single low resolution (LR) image. Current state-of-the-art SISR methods are patch-based. They use either external data or internal self-similarity to learn a prior for an HR image. External data based methods utilize a large number of patches from the training data, while self-similarity based approaches leverage one or more similar patches from the input image. In this paper we propose a self-similarity based approach that is able to use large groups of similar patches extracted from the input image to solve the SISR problem. We introduce a novel prior leading to collaborative filtering of patch groups in a 1D similarity domain and couple it with an iterative back-projection framework. The performance of the proposed algorithm is evaluated on a number of SISR benchmark datasets. Without using any external data, the proposed approach outperforms the current non-CNN based methods on the tested datasets for various scaling factors. On certain datasets, the gain is over 1 dB when compared to the recent method A+. For a high sampling rate (x4), the proposed method performs similarly to very recent state-of-the-art deep convolutional network based approaches.
Submitted 29 November, 2017; v1 submitted 13 April, 2017;
originally announced April 2017.
-
Semantic HMC for Big Data Analysis
Authors:
Thomas Hassan,
Rafael Peixoto,
Christophe Cruz,
Aurélie Bertaux,
Nuno Silva
Abstract:
Analyzing Big Data can help corporations to improve their efficiency. In this work we present a new vision to derive Value from Big Data using a Semantic Hierarchical Multi-label Classification called Semantic HMC, based on a non-supervised Ontology learning process. We also propose a Semantic HMC process, using scalable Machine-Learning techniques and Rule-based reasoning.
Submitted 2 December, 2014;
originally announced December 2014.
-
Toward the Automatic Generation of a Semantic VRML Model from Unorganized 3D Point Clouds
Authors:
Helmi Ben Hmida,
Christophe Cruz,
Christophe Nicolle,
Frank Boochs
Abstract:
This paper presents our experience regarding the creation of a 3D semantic facility model out of unorganized 3D point clouds. To that end, a knowledge-based approach to object detection using the OWL ontology language is presented. This knowledge is used to define SWRL detection rules. In addition, the combination of 3D processing built-ins and topological built-ins in SWRL rules aims at combining geometrical analysis of 3D point clouds and specialists' knowledge. This combination allows more flexible and intelligent detection and annotation of the objects contained in 3D point clouds. The created WiDOP prototype takes a set of 3D point clouds as input, and produces an indexed scene of colored objects visualized in the VRML language as output. The context of the study is the detection of railway objects materialized within the Deutsche Bahn scene, such as signals, technical cupboards, electric poles, etc. The resulting enriched and populated domain ontology, which contains the annotations of objects in the point clouds, is used to feed a GIS system.
Submitted 21 January, 2013;
originally announced January 2013.
-
From 9-IM Topological Operators to Qualitative Spatial Relations using 3D Selective Nef Complexes and Logic Rules for bodies
Authors:
Helmi Ben Hmida,
Christophe Cruz,
Frank Boochs,
Christophe Nicolle
Abstract:
This paper presents a method to automatically compute topological relations using SWRL rules. The calculation of these rules is based on the definition of a Selective Nef Complexes (Nef polyhedra) structure generated from a standard polyhedron. The Selective Nef Complexes structure is a data model providing a set of binary Boolean operators such as Union, Difference, Intersection and Symmetric difference, and unary operators such as Interior, Closure and Boundary. In this work, these operators are used to compute topological relations between objects defined by the constraints of the 9 Intersection Model (9-IM) from Egenhofer. With the help of these constraints, we defined a procedure to compute the topological relations on Nef polyhedra. These topological relationships are Disjoint, Meets, Contains, Inside, Covers, CoveredBy, Equals and Overlaps, and are defined in a top-level ontology with specific semantic properties on each relation, such as Transitive, Symmetric, Asymmetric, Functional, Reflexive, and Irreflexive. The results of the computation of topological relationships are stored in an OWL-DL ontology, which then allows inference over these new relationships between objects. In addition, logic rules based on the Semantic Web Rule Language allow the definition of logic programs that define which topological relationships have to be computed on which kind of objects with specific attributes. For instance, a "Building" that overlaps a "Railway" is a "RailStation".
Submitted 21 January, 2013;
originally announced January 2013.
-
Knowledge Base Approach for 3D Objects Detection in Point Clouds Using 3D Processing and Specialists Knowledge
Authors:
Helmi Ben Hmida,
Christophe Cruz,
Frank Boochs,
Christophe Nicolle
Abstract:
This paper presents a knowledge-based approach to object detection using the OWL ontology language, the Semantic Web Rule Language, and 3D processing built-ins, aiming at combining geometrical analysis of 3D point clouds and specialists' knowledge. Here, we share our experience regarding the creation of a 3D semantic facility model out of unorganized 3D point clouds. The knowledge captured in the OWL ontology is used to define SWRL detection rules. In addition, the combination of 3D processing built-ins and topological built-ins in SWRL rules allows a more flexible and intelligent detection and annotation of the objects contained in 3D point clouds. The created WiDOP prototype takes a set of 3D point clouds as input, and produces as output a populated ontology corresponding to an indexed scene visualized in the VRML language. The context of the study is the detection of railway objects materialized within the Deutsche Bahn scene, such as signals, technical cupboards, electric poles, etc. The resulting enriched and populated ontology, which contains the annotations of objects in the point clouds, is used to feed a GIS system or an IFC file for architecture purposes.
Submitted 21 January, 2013;
originally announced January 2013.