-
Using Federated Machine Learning in Predictive Maintenance of Jet Engines
Authors:
Asaph Matheus Barbosa,
Thao Vy Nhat Ngo,
Elaheh Jafarigol,
Theodore B. Trafalis,
Emuobosa P. Ojoboh
Abstract:
The goal of this paper is to predict the Remaining Useful Life (RUL) of turbine jet engines using a federated machine learning framework. Federated Learning enables multiple edge devices/nodes or servers to collaboratively train a shared model without sharing sensitive data, thus preserving data privacy and security. By implementing a nonlinear model, the system aims to capture complex relationships and patterns in the engine data to enhance the accuracy of RUL predictions. This approach leverages decentralized computation, allowing models to be trained locally at each device before aggregating the learned weights at a central server. By predicting the RUL of jet engines accurately, maintenance schedules can be optimized, downtime reduced, and operational efficiency improved, ultimately leading to cost savings and enhanced performance in the aviation industry. Computational results are provided by using the C-MAPSS dataset which is publicly available on the NASA website and is a valuable resource for studying and analyzing engine degradation behaviors in various operational scenarios.
Submitted 7 February, 2025;
originally announced February 2025.
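The abstract above describes training models locally on each device and then aggregating the learned weights at a central server. A minimal sketch of that aggregation step follows, assuming sample-weighted averaging (a common rule often called FedAvg). The abstract does not say which aggregation rule the authors use, and the function name and inputs are hypothetical.

```python
from typing import List, Sequence


def federated_average(
    client_weights: Sequence[Sequence[float]],
    client_sizes: Sequence[int],
) -> List[float]:
    """Average per-client weight vectors, weighted by local sample counts.

    Hypothetical helper for illustration only; not the paper's code.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client weight vector")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    dim = len(client_weights[0])
    averaged = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        if len(weights) != dim:
            raise ValueError("all clients must share the same model shape")
        # Each client contributes in proportion to its share of the data.
        for i, w in enumerate(weights):
            averaged[i] += w * size / total
    return averaged


if __name__ == "__main__":
    # Two clients; the second holds three times as much data as the first.
    print(federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

Only the weight vectors and sample counts reach the server in this sketch; the raw engine data stays on each client, which is the privacy property the abstract emphasizes.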
-
Energy-Aware Resource Allocation for Energy Harvesting Powered Wireless Sensor Nodes
Authors:
Ngoc M. Ngo,
Trung T. Nguyen,
Phuc H. Nguyen,
Van-Dinh Nguyen
Abstract:
Low harvested energy poses a significant challenge to sustaining continuous communication in energy harvesting (EH)-powered wireless sensor networks. This is mainly due to intermittent and limited power availability from radio frequency signals. In this paper, we introduce a novel energy-aware resource allocation problem aimed at enabling the asynchronous accumulate-then-transmit protocol, offering an alternative to the extensively studied harvest-then-transmit approach. Specifically, we jointly optimize power allocation and time fraction dedicated to EH to maximize the average long-term system throughput, accounting for both data and energy queue lengths. By leveraging inner approximation and network utility maximization techniques, we develop a simple yet efficient iterative algorithm that guarantees at least a local optimum and achieves long-term utility improvement. Numerical results highlight the proposed approach's effectiveness in terms of both queue length and sustained system throughput.
Submitted 11 January, 2025;
originally announced January 2025.
-
LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models
Authors:
Hieu Man,
Nghia Trung Ngo,
Viet Dac Lai,
Ryan A. Rossi,
Franck Dernoncourt,
Thien Huu Nguyen
Abstract:
Recent advancements in large language models (LLMs) based embedding models have established new state-of-the-art benchmarks for text embedding tasks, particularly in dense vector-based retrieval. However, these models predominantly focus on English, leaving multilingual embedding capabilities largely unexplored. To address this limitation, we present LUSIFER, a novel zero-shot approach that adapts LLM-based embedding models for multilingual tasks without requiring multilingual supervision. LUSIFER's architecture combines a multilingual encoder, serving as a language-universal learner, with an LLM-based embedding model optimized for embedding-specific tasks. These components are seamlessly integrated through a minimal set of trainable parameters that act as a connector, effectively transferring the multilingual encoder's language understanding capabilities to the specialized embedding model. Additionally, to comprehensively evaluate multilingual embedding performance, we introduce a new benchmark encompassing 5 primary embedding tasks, 123 diverse datasets, and coverage across 14 languages. Extensive experimental results demonstrate that LUSIFER significantly enhances the multilingual performance across various embedding tasks, particularly for medium and low-resource languages, without requiring explicit multilingual training data.
Submitted 7 May, 2025; v1 submitted 1 January, 2025;
originally announced January 2025.
-
GROOT: Effective Design of Biological Sequences with Limited Experimental Data
Authors:
Thanh V. T. Tran,
Nhat Khang Ngo,
Viet Anh Nguyen,
Truong Son Hy
Abstract:
Latent space optimization (LSO) is a powerful method for designing discrete, high-dimensional biological sequences that maximize expensive black-box functions, such as wet lab experiments. This is accomplished by learning a latent space from available data and using a surrogate model to guide optimization algorithms toward optimal outputs. However, existing methods struggle when labeled data is limited, as training the surrogate model with few labeled data points can lead to subpar outputs, offering no advantage over the training data itself. We address this challenge by introducing GROOT, a Graph-based Latent Smoothing for Biological Sequence Optimization. In particular, GROOT generates pseudo-labels for neighbors sampled around the training latent embeddings. These pseudo-labels are then refined and smoothed by Label Propagation. Additionally, we theoretically and empirically justify our approach, demonstrating GROOT's ability to extrapolate to regions beyond the training set while maintaining reliability within an upper bound of their expected distances from the training regions. We evaluate GROOT on various biological sequence design tasks, including protein optimization (GFP and AAV) and three tasks with exact oracles from Design-Bench. The results demonstrate that GROOT matches and surpasses existing methods without requiring access to black-box oracles or vast amounts of labeled data, highlighting its practicality and effectiveness. We release our code at https://anonymous.4open.science/r/GROOT-D554
Submitted 17 November, 2024;
originally announced November 2024.
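The GROOT abstract mentions that pseudo-labels are refined and smoothed by Label Propagation. As a rough illustration of that general idea, the sketch below runs a textbook-style propagation on a small graph: each node's label is repeatedly mixed with the average label of its neighbors. The update rule, the `alpha` mixing weight, and the function name are assumptions for illustration; the abstract does not specify the authors' exact formulation.

```python
from typing import List, Sequence


def propagate_labels(
    adjacency: Sequence[Sequence[float]],
    initial_labels: Sequence[float],
    alpha: float = 0.5,
    iterations: int = 20,
) -> List[float]:
    """Smooth scalar labels over a graph (hypothetical helper).

    Update: labels <- alpha * (neighbor average) + (1 - alpha) * initial.
    """
    n = len(initial_labels)
    # Row-normalize the adjacency matrix so each row sums to 1.
    transition = []
    for i, row in enumerate(adjacency):
        row_sum = sum(row)
        if row_sum > 0:
            transition.append([v / row_sum for v in row])
        else:
            # Isolated node: point it at itself so its label is unchanged.
            transition.append([1.0 if j == i else 0.0 for j in range(n)])

    labels = list(initial_labels)
    for _ in range(iterations):
        labels = [
            alpha * sum(transition[i][j] * labels[j] for j in range(n))
            + (1 - alpha) * initial_labels[i]
            for i in range(n)
        ]
    return labels


if __name__ == "__main__":
    # Path graph 0 - 1 - 2 with a single high label in the middle.
    path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
    print(propagate_labels(path, [0.0, 10.0, 0.0]))
```

Because every update is a convex combination of existing values, the smoothed labels stay within the range of the initial labels.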
-
Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering
Authors:
Nghia Trung Ngo,
Chien Van Nguyen,
Franck Dernoncourt,
Thien Huu Nguyen
Abstract:
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs) in knowledge-intensive tasks such as those from the medical domain. However, the sensitive nature of the medical domain necessitates a completely accurate and trustworthy system. While existing RAG benchmarks primarily focus on the standard retrieve-answer setting, they overlook many practical scenarios that measure crucial aspects of a reliable medical system. This paper addresses this gap by providing a comprehensive evaluation framework for medical question-answering (QA) systems in a RAG setting for these situations, including sufficiency, integration, and robustness. We introduce the Medical Retrieval-Augmented Generation Benchmark (MedRGB), which provides various supplementary elements to four medical QA datasets for testing LLMs' ability to handle these specific scenarios. Utilizing MedRGB, we conduct extensive evaluations of both state-of-the-art commercial LLMs and open-source models across multiple retrieval conditions. Our experimental results reveal current models' limited ability to handle noise and misinformation in the retrieved documents. We further analyze the LLMs' reasoning processes to provide valuable insights and future directions for developing RAG systems in this critical medical domain.
Submitted 14 November, 2024;
originally announced November 2024.
-
Zero-shot Cross-lingual Transfer Learning with Multiple Source and Target Languages for Information Extraction: Language Selection and Adversarial Training
Authors:
Nghia Trung Ngo,
Thien Huu Nguyen
Abstract:
The majority of previous research addressing multi-lingual IE is limited to the zero-shot cross-lingual single-transfer (one-to-one) setting, with high-resource languages predominantly as source training data. As a result, these works provide little understanding and benefit for the realistic goal of developing a multi-lingual IE system that can generalize to as many languages as possible. Our study aims to fill this gap by providing a detailed analysis of Cross-Lingual Multi-Transferability (many-to-many transfer learning) for the recent IE corpora that cover a diverse set of languages. Specifically, we first determine the correlation between single-transfer performance and a wide range of linguistic-based distances. From the obtained insights, a combined language distance metric can be developed that is not only highly correlated but also robust across different tasks and model scales. Next, we investigate the more general zero-shot multi-lingual transfer settings where multiple languages are involved in the training and evaluation processes. Language clustering based on the newly defined distance can provide directions for achieving the optimal cost-performance trade-off in the data (language) selection problem. Finally, a relational-transfer setting is proposed to further incorporate multi-lingual unlabeled data based on adversarial training using the relation induced from the above linguistic distance.
Submitted 13 November, 2024;
originally announced November 2024.
-
Range-aware Positional Encoding via High-order Pretraining: Theory and Practice
Authors:
Viet Anh Nguyen,
Nhat Khang Ngo,
Truong Son Hy
Abstract:
Unsupervised pre-training on vast amounts of graph data is critical in real-world applications wherein labeled data is limited, such as molecular property prediction or materials science. Existing approaches pre-train models for specific graph domains, neglecting the inherent connections within networks. This limits their ability to transfer knowledge to various supervised tasks. In this work, we propose a novel pre-training strategy on graphs that focuses on modeling their multi-resolution structural information, allowing us to capture global information of the whole graph while preserving local structures around its nodes. We extend the work of Wavelet Positional Encoding (WavePE) from (Ngo et al., 2023) by pretraining a High-Order Permutation-Equivariant Autoencoder (HOPE-WavePE) to reconstruct node connectivities from their multi-resolution wavelet signals. Unlike existing positional encodings, our method is designed to be sensitive to the input graph size in downstream tasks, which allows it to efficiently capture global structure on graphs. Since our approach relies solely on the graph structure, it is also domain-agnostic and adaptable to datasets from various domains, therefore paving the wave for developing general graph structure encoders and graph foundation models. We theoretically demonstrate that there exists a parametrization of such an architecture that can predict the output adjacency up to arbitrarily low error. We also evaluate HOPE-WavePE on graph-level prediction tasks of different areas and show its superiority compared to other methods.
Submitted 27 September, 2024;
originally announced September 2024.
-
ULLME: A Unified Framework for Large Language Model Embeddings with Generation-Augmented Learning
Authors:
Hieu Man,
Nghia Trung Ngo,
Franck Dernoncourt,
Thien Huu Nguyen
Abstract:
Large Language Models (LLMs) excel in various natural language processing tasks, but leveraging them for dense passage embedding remains challenging. This is due to their causal attention mechanism and the misalignment between their pre-training objectives and the text ranking tasks. Despite some recent efforts to address these issues, existing frameworks for LLM-based text embeddings have been limited by their support for only a limited range of LLM architectures and fine-tuning strategies, limiting their practical application and versatility. In this work, we introduce the Unified framework for Large Language Model Embedding (ULLME), a flexible, plug-and-play implementation that enables bidirectional attention across various LLMs and supports a range of fine-tuning strategies. We also propose Generation-augmented Representation Learning (GRL), a novel fine-tuning method to boost LLMs for text embedding tasks. GRL enforces consistency between representation-based and generation-based relevance scores, leveraging LLMs' powerful generative abilities for learning passage embeddings. To showcase our framework's flexibility and effectiveness, we release three pre-trained models from ULLME with different backbone architectures, ranging from 1.5B to 8B parameters, all of which demonstrate strong performance on the Massive Text Embedding Benchmark. Our framework is publicly available at: https://github.com/nlp-uoregon/ullme. A demo video for ULLME can also be found at https://rb.gy/ws1ile.
Submitted 6 August, 2024;
originally announced August 2024.
-
Private Blockchain-based Procurement and Asset Management System with QR Code
Authors:
Alonel A. Hugo,
Gerard Nathaniel C. Ngo
Abstract:
The developed system aims to incorporate private blockchain technology into the procurement process for the supply office. The procurement process includes the canvassing, purchasing, delivery and inspection of items, inventory, and disposal. The blockchain-based system includes a distributed ledger technology, a peer-to-peer network, a Proof-of-Authority consensus mechanism, and the SHA3-512 cryptographic hash function algorithm. This will ensure trust and proper accountability to the custodian of the property while safeguarding sensitive information in the procurement records. The extreme prototyping model will be used as the software development life cycle. It is mostly used for web-based applications and involves users more heavily. The prototype version of the system allows the users to get a better understanding of the system being developed. It also reduces time and cost, yields quicker user feedback, and allows missing or difficult functions to be recognized and confusing processes to be addressed at an early stage. The implementation of private blockchain technology offers increased privacy, enhanced security, improved efficiency, and reduced complexity over a traditional blockchain network. The use of SHA3-512 as the cryptographic hash function algorithm is much faster than its predecessors when cryptography is handled by hardware components. Furthermore, it is not vulnerable to length extension attacks, making it reliable in terms of data security. The study recommends the use of private blockchain-based technology with the procurement and asset management system in the supply office. The procurement records will be protected against tampering using this technology. This will promote the trust and confidence of the stakeholders. The implementation of blockchain technology in developing a system serves as an advancement and innovation in terms of securing data.
Submitted 12 July, 2024;
originally announced July 2024.
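The abstract above states that procurement records are protected against tampering using SHA3-512. A minimal sketch of how hash chaining can make tampering detectable follows, using Python's standard `hashlib.sha3_512`. The record layout, field names, and helper functions are hypothetical, and the sketch omits the Proof-of-Authority consensus and peer-to-peer layers the abstract also mentions.

```python
import hashlib
import json
from typing import Dict, List

GENESIS_HASH = "0" * 128  # placeholder predecessor for the first record


def record_hash(prev_hash: str, payload: Dict) -> str:
    """SHA3-512 digest over the previous hash plus a canonical JSON payload."""
    data = prev_hash.encode("utf-8") + json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha3_512(data).hexdigest()


def build_chain(payloads: List[Dict]) -> List[Dict]:
    """Link records so each one commits to the hash of its predecessor."""
    chain = []
    prev = GENESIS_HASH
    for payload in payloads:
        digest = record_hash(prev, payload)
        chain.append({"prev_hash": prev, "payload": payload, "hash": digest})
        prev = digest
    return chain


def verify_chain(chain: List[Dict]) -> bool:
    """Recompute every digest; any edited record breaks the chain."""
    prev = GENESIS_HASH
    for block in chain:
        if block["prev_hash"] != prev:
            return False
        if block["hash"] != record_hash(prev, block["payload"]):
            return False
        prev = block["hash"]
    return True


if __name__ == "__main__":
    ledger = build_chain([{"item": "printer", "qty": 2}, {"item": "chair", "qty": 10}])
    print(verify_chain(ledger))        # chain is intact
    ledger[0]["payload"]["qty"] = 3    # simulate tampering with a stored record
    print(verify_chain(ledger))        # verification now fails
```

Changing any stored payload alters its recomputed digest, so the mismatch surfaces at verification time without needing to compare against a separate copy of the record.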
-
Blurry-Consistency Segmentation Framework with Selective Stacking on Differential Interference Contrast 3D Breast Cancer Spheroid
Authors:
Thanh-Huy Nguyen,
Thi Kim Ngan Ngo,
Mai Anh Vu,
Ting-Yuan Tu
Abstract:
The ability of three-dimensional (3D) spheroid modeling to study the invasive behavior of breast cancer cells has drawn increased attention. The deep learning-based image processing framework is very effective at speeding up the cell morphological analysis process. Out-of-focus photos taken while capturing 3D cells under several z-slices, however, could negatively impact the deep learning model. In this work, we created a new algorithm to handle blurry images while preserving the stacked image quality. Furthermore, we proposed a unique training architecture that leverages consistency training to help reduce the bias of the model when dense-slice stacking is applied. Additionally, the model's stability is increased under the sparse-slice stacking effect by utilizing the self-training approach. The new blurring stacking technique and training flow are combined with the suggested architecture and self-training mechanism to provide an innovative yet easy-to-use framework. Our methods produced noteworthy experimental outcomes in terms of both quantitative and qualitative aspects.
Submitted 8 June, 2024;
originally announced June 2024.
-
Boosting Skull-Stripping Performance for Pediatric Brain Images
Authors:
William Kelley,
Nathan Ngo,
Adrian V. Dalca,
Bruce Fischl,
Lilla Zöllei,
Malte Hoffmann
Abstract:
Skull-stripping is the removal of background and non-brain anatomical features from brain images. While many skull-stripping tools exist, few target pediatric populations. With the emergence of multi-institutional pediatric data acquisition efforts to broaden the understanding of perinatal brain development, it is essential to develop robust and well-tested tools ready for the relevant data processing. However, the broad range of neuroanatomical variation in the developing brain, combined with additional challenges such as high motion levels, as well as shoulder and chest signal in the images, leaves many adult-specific tools ill-suited for pediatric skull-stripping. Building on an existing framework for robust and accurate skull-stripping, we propose developmental SynthStrip (d-SynthStrip), a skull-stripping model tailored to pediatric images. This framework exposes networks to highly variable images synthesized from label maps. Our model substantially outperforms pediatric baselines across scan types and age cohorts. In addition, the <1-minute runtime of our tool compares favorably to the fastest baselines. We distribute our model at https://w3id.org/synthstrip.
Submitted 26 February, 2024;
originally announced February 2024.
-
E(3)-Equivariant Mesh Neural Networks
Authors:
Thuan Trang,
Nhat Khang Ngo,
Daniel Levy,
Thieu N. Vo,
Siamak Ravanbakhsh,
Truong Son Hy
Abstract:
Triangular meshes are widely used to represent three-dimensional objects. As a result, many recent works have addressed the need for geometric deep learning on 3D meshes. However, we observe that the complexities in many of these architectures do not translate to practical performance, and simple deep models for geometric graphs are competitive in practice. Motivated by this observation, we minimally extend the update equations of E(n)-Equivariant Graph Neural Networks (EGNNs) (Satorras et al., 2021) to incorporate mesh face information, and further improve them to account for long-range interactions through hierarchy. The resulting architecture, Equivariant Mesh Neural Network (EMNN), outperforms other, more complicated equivariant methods on mesh tasks, with a fast run-time and no expensive pre-processing. Our implementation is available at https://github.com/HySonLab/EquiMesh
Submitted 18 February, 2024; v1 submitted 7 February, 2024;
originally announced February 2024.
-
Target-aware Variational Auto-encoders for Ligand Generation with Multimodal Protein Representation Learning
Authors:
Nhat Khang Ngo,
Truong Son Hy
Abstract:
Without knowledge of specific pockets, generating ligands based on the global structure of a protein target plays a crucial role in drug discovery as it helps reduce the search space for potential drug-like candidates in the pipeline. However, contemporary methods require optimizing tailored networks for each protein, which is arduous and costly. To address this issue, we introduce TargetVAE, a target-aware variational auto-encoder that generates ligands with high binding affinities to arbitrary protein targets, guided by a novel multimodal deep neural network built based on graph Transformers as the prior for the generative model. This is the first effort to unify different representations of proteins (e.g., sequence of amino-acids, 3D structure) into a single model that we name as Protein Multimodal Network (PMN). Our multimodal architecture learns from the entire protein structures and is able to capture their sequential, topological and geometrical information. We showcase the superiority of our approach by conducting extensive experiments and evaluations, including the assessment of generative model quality, ligand generation for unseen targets, docking score computation, and binding affinity prediction. Empirical results demonstrate the promising performance of our proposed approach. Our software package is publicly available at https://github.com/HySonLab/Ligand_Generation
Submitted 2 August, 2023;
originally announced September 2023.
-
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
Authors:
Thuat Nguyen,
Chien Van Nguyen,
Viet Dac Lai,
Hieu Man,
Nghia Trung Ngo,
Franck Dernoncourt,
Ryan A. Rossi,
Thien Huu Nguyen
Abstract:
The driving factors behind the development of large language models (LLMs) with impressive learning capabilities are their colossal model sizes and extensive training datasets. Along with the progress in natural language processing, LLMs have been frequently made accessible to the public to foster deeper investigation and applications. However, when it comes to training datasets for these LLMs, especially the recent state-of-the-art models, they are often not fully disclosed. Creating training data for high-performing LLMs involves extensive cleaning and deduplication to ensure the necessary level of quality. The lack of transparency for training data has thus hampered research on attributing and addressing hallucination and bias issues in LLMs, hindering replication efforts and further advancements in the community. These challenges become even more pronounced in multilingual learning scenarios, where the available multilingual text datasets are often inadequately collected and cleaned. Consequently, there is a lack of open-source and readily usable datasets to effectively train LLMs in multiple languages. To overcome this issue, we present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages, tailored for LLM development. Our dataset undergoes meticulous cleaning and deduplication through a rigorous pipeline of multiple stages to accomplish the best quality for model training, including language identification, URL-based filtering, metric-based cleaning, document refinement, and data deduplication. CulturaX is fully released to the public on HuggingFace to facilitate research and advancements in multilingual LLMs: https://huggingface.co/datasets/uonlp/CulturaX.
Submitted 17 September, 2023;
originally announced September 2023.
-
Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback
Authors:
Viet Dac Lai,
Chien Van Nguyen,
Nghia Trung Ngo,
Thuat Nguyen,
Franck Dernoncourt,
Ryan A. Rossi,
Thien Huu Nguyen
Abstract:
A key technology for the development of large language models (LLMs) involves instruction tuning that helps align the models' responses with human expectations to realize impressive learning abilities. The two major approaches for instruction tuning are supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), which are currently applied to produce the best commercial LLMs (e.g., ChatGPT). To improve the accessibility of LLMs for research and development efforts, various instruction-tuned open-source LLMs have also been introduced recently, e.g., Alpaca, Vicuna, to name a few. However, existing open-source LLMs have only been instruction-tuned for English and a few popular languages, thus hindering their impact and accessibility for many other languages in the world. Among the few very recent works exploring instruction tuning for LLMs in multiple languages, SFT has been used as the only approach to instruction-tune LLMs for multiple languages. This has left a significant gap for fine-tuned LLMs based on RLHF in diverse languages and raised important questions on how RLHF can boost the performance of multilingual instruction tuning. To overcome this issue, we present Okapi, the first system with instruction-tuned LLMs based on RLHF for multiple languages. Okapi introduces instruction and response-ranked data in 26 diverse languages to facilitate the experiments and development of future multilingual LLM research. We also present benchmark datasets to enable the evaluation of generative LLMs in multiple languages. Our experiments demonstrate the advantages of RLHF for multilingual instruction over SFT for different base models and datasets. Our framework and resources are released at https://github.com/nlp-uoregon/Okapi.
Submitted 1 August, 2023; v1 submitted 29 July, 2023;
originally announced July 2023.
-
ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning
Authors:
Viet Dac Lai,
Nghia Trung Ngo,
Amir Pouran Ben Veyseh,
Hieu Man,
Franck Dernoncourt,
Trung Bui,
Thien Huu Nguyen
Abstract:
Over the last few years, large language models (LLMs) have emerged as the most important breakthroughs in natural language processing (NLP) that fundamentally transform research and developments in the field. ChatGPT represents one of the most exciting LLM systems developed recently to showcase impressive skills for language generation and highly attract public attention. Among various exciting applications discovered for ChatGPT in English, the model can process and generate texts for multiple languages due to its multilingual training data. Given the broad adoption of ChatGPT for English in different problems and areas, a natural question is whether ChatGPT can also be applied effectively for other languages or it is necessary to develop more language-specific technologies. The answer to this question requires a thorough evaluation of ChatGPT over multiple tasks with diverse languages and large datasets (i.e., beyond reported anecdotes), which is still missing or limited in current research. Our work aims to fill this gap for the evaluation of ChatGPT and similar LLMs to provide more comprehensive information for multilingual NLP applications. While this work will be an ongoing effort to include additional experiments in the future, our current paper evaluates ChatGPT on 7 different tasks, covering 37 diverse languages with high, medium, low, and extremely low resources. We also focus on the zero-shot learning setting for ChatGPT to improve reproducibility and better simulate the interactions of general users. Compared to the performance of previous models, our extensive experimental results demonstrate a worse performance of ChatGPT for different NLP tasks and languages, calling for further research to develop better models and understanding for multilingual learning.
Submitted 12 April, 2023;
originally announced April 2023.
-
Modeling Polypharmacy and Predicting Drug-Drug Interactions using Deep Generative Models on Multimodal Graphs
Authors:
Nhat Khang Ngo,
Truong Son Hy,
Risi Kondor
Abstract:
Latent representations of drugs and their targets produced by contemporary graph autoencoder models have proved useful in predicting many types of node-pair interactions on large networks, including drug-drug, drug-target, and target-target interactions. However, most existing approaches either model the nodes' latent spaces with rigid node distributions or do not effectively capture the interrelations between drugs; these limitations hinder the methods from accurately predicting drug-pair interactions. In this paper, we demonstrate the effectiveness of variational graph autoencoders (VGAE) in modeling latent node representations on multimodal networks. Our approach can produce flexible latent spaces for each node type of the multimodal graph; the embeddings are later used for predicting links among node pairs under different edge types. To further enhance the models' performance, we suggest a new method that concatenates Morgan fingerprints, which capture the molecular structure of each drug, with their latent embeddings before passing them to the decoding stage for link prediction. Our proposed model shows competitive results on three multimodal networks: (1) a multimodal graph consisting of drug and protein nodes, (2) a multimodal graph constructed from a subset of the DrugBank database involving drug nodes under different interaction types, and (3) a multimodal graph consisting of drug and cell line nodes. Our source code is publicly available at https://github.com/HySonLab/drug-interactions.
Submitted 16 February, 2023;
originally announced February 2023.
-
Multiresolution Graph Transformers and Wavelet Positional Encoding for Learning Hierarchical Structures
Authors:
Nhat Khang Ngo,
Truong Son Hy,
Risi Kondor
Abstract:
Contemporary graph learning algorithms are not well-defined for large molecules since they do not consider the hierarchical interactions among the atoms, which are essential to determine the molecular properties of macromolecules. In this work, we propose Multiresolution Graph Transformers (MGT), the first graph transformer architecture that can learn to represent large molecules at multiple scales. MGT can learn to produce representations for the atoms and group them into meaningful functional groups or repeating units. We also introduce Wavelet Positional Encoding (WavePE), a new positional encoding method that can guarantee localization in both spectral and spatial domains. Our proposed model achieves competitive results on two macromolecule datasets consisting of polymers and peptides, and one drug-like molecule dataset. Importantly, our model outperforms other state-of-the-art methods and achieves chemical accuracy in estimating molecular properties (e.g., GAP, HOMO and LUMO) calculated by Density Functional Theory (DFT) in the polymers dataset. Furthermore, the visualizations, including clustering results on macromolecules and low-dimensional spaces of their representations, demonstrate the capability of our methodology in learning to represent long-range and hierarchical structures. Our PyTorch implementation is publicly available at https://github.com/HySonLab/Multires-Graph-Transformer
Submitted 21 July, 2023; v1 submitted 16 February, 2023;
originally announced February 2023.
-
Predicting Drug-Drug Interactions using Deep Generative Models on Graphs
Authors:
Nhat Khang Ngo,
Truong Son Hy,
Risi Kondor
Abstract:
Latent representations of drugs and their targets produced by contemporary graph autoencoder-based models have proved useful in predicting many types of node-pair interactions on large networks, including drug-drug, drug-target, and target-target interactions. However, most existing approaches model the nodes' latent spaces with rigid and disjoint node distributions; these limitations hinder the methods from generating new links among pairs of nodes. In this paper, we demonstrate the effectiveness of variational graph autoencoders (VGAE) in modeling latent node representations on multimodal networks. Our approach can produce flexible latent spaces for each node type of the multimodal graph; the embeddings are later used for predicting links among node pairs under different edge types. To further enhance the models' performance, we suggest a new method that concatenates Morgan fingerprints, which capture the molecular structure of each drug, with their latent embeddings before passing them to the decoding stage for link prediction. Our proposed model shows competitive results on two multimodal networks: (1) a multi-graph consisting of drug and protein nodes, and (2) a multi-graph consisting of drug and cell line nodes. Our source code is publicly available at https://github.com/HySonLab/drug-interactions.
Submitted 30 October, 2022; v1 submitted 14 September, 2022;
originally announced September 2022.
-
Cryptographic and Financial Fairness
Authors:
Daniele Friolo,
Fabio Massacci,
Chan Nam Ngo,
Daniele Venturi
Abstract:
A recent trend in multi-party computation is to achieve cryptographic fairness via monetary penalties, i.e. each honest player either obtains the output or receives a compensation in the form of a cryptocurrency. We pioneer another type of fairness, financial fairness, that is closer to the real-world valuation of financial transactions. Intuitively, a penalty protocol is financially fair if the net present cost of participation (the total value of cash inflows less cash outflows, weighted by the relative discount rate) is the same for all honest participants, even when some parties cheat.
We formally define the notion, show several impossibility results based on game theory, and analyze the practical effects of (lack of) financial fairness if one were to run the protocols for real on Bitcoin using Bloomberg's dark pool trading.
For example, we show that the ladder protocol (CRYPTO'14), and its variants (CCS'15 and CCS'16), fail to achieve financial fairness both in theory and in practice, while the penalty protocols of Kumaresan and Bentov (CCS'14) and Baum, David and Dowsley (FC'20) are financially fair.
This version contains formal definitions, detailed security proofs, demos and experimental data in the appendix.
Submitted 11 August, 2022; v1 submitted 21 July, 2022;
originally announced July 2022.
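The abstract above describes financial fairness as equality, across honest participants, of the discounted total of cash inflows less cash outflows. A minimal sketch of that comparison follows, assuming a per-round discount factor of `(1 + rate) ** t`; the function name `net_present_value`, the two deposit schedules, and the 10% rate are hypothetical illustrations, not the paper's formal definition or data.

```python
from typing import Sequence, Tuple


def net_present_value(flows: Sequence[Tuple[float, float]], rate: float) -> float:
    """Discounted sum of (inflow - outflow); round t is divided by (1 + rate) ** t."""
    return sum(
        (inflow - outflow) / (1.0 + rate) ** t
        for t, (inflow, outflow) in enumerate(flows)
    )


# Hypothetical schedules: one (inflow, outflow) pair per protocol round.
party_a = [(0.0, 1.0), (0.0, 0.0), (1.0, 0.0)]  # deposits in round 0, refunded in round 2
party_b = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]  # deposits in round 1, refunded in round 2

rate = 0.10  # assumed discount rate per round
npv_a = net_present_value(party_a, rate)
npv_b = net_present_value(party_b, rate)

# Under this toy check, the schedule is fair only if both values coincide.
fair = abs(npv_a - npv_b) < 1e-9
```

In this toy schedule both parties deposit and recover the same nominal amount, yet the party that locks funds earlier carries the larger discounted cost, so the check reports the schedule as not fair. With a zero discount rate the two values coincide.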
-
SHREC'22 Track: Sketch-Based 3D Shape Retrieval in the Wild
Authors:
Jie Qin,
Shuaihang Yuan,
Jiaxin Chen,
Boulbaba Ben Amor,
Yi Fang,
Nhat Hoang-Xuan,
Chi-Bien Chu,
Khoi-Nguyen Nguyen-Ngoc,
Thien-Tri Cao,
Nhat-Khang Ngo,
Tuan-Luc Huynh,
Hai-Dang Nguyen,
Minh-Triet Tran,
Haoyang Luo,
Jianning Wang,
Zheng Zhang,
Zihao Xin,
Yang Wang,
Feng Wang,
Ying Tang,
Haiqin Chen,
Yan Wang,
Qunying Zhou,
Ji Zhang,
Hongyuan Wang
Abstract:
Sketch-based 3D shape retrieval (SBSR) is an important yet challenging task, which has drawn more and more attention in recent years. Existing approaches address the problem in a restricted setting, without appropriately simulating real application scenarios. To mimic the realistic setting, in this track, we adopt large-scale sketches drawn by amateurs with different levels of drawing skill, as well as a variety of 3D shapes including not only CAD models but also models scanned from real objects. We define two SBSR tasks and construct two benchmarks consisting of more than 46,000 CAD models, 1,700 realistic models, and 145,000 sketches in total. Four teams participated in this track and submitted 15 runs for the two tasks, evaluated by 7 commonly adopted metrics. We hope that the benchmarks, the comparative results, and the open-sourced evaluation code will foster future research in this direction among the 3D object retrieval community.
Submitted 11 July, 2022;
originally announced July 2022.
-
FAMIE: A Fast Active Learning Framework for Multilingual Information Extraction
Authors:
Minh Van Nguyen,
Nghia Trung Ngo,
Bonan Min,
Thien Huu Nguyen
Abstract:
This paper presents FAMIE, a comprehensive and efficient active learning (AL) toolkit for multilingual information extraction. FAMIE is designed to address a fundamental problem in existing AL frameworks where annotators need to wait for a long time between annotation batches due to the time-consuming nature of model training and data selection at each AL iteration. This hinders the engagement, productivity, and efficiency of annotators. Based on the idea of using a small proxy network for fast data selection, we introduce a novel knowledge distillation mechanism to synchronize the proxy network with the main large model (i.e., BERT-based) to ensure the appropriateness of the selected annotation examples for the main model. Our AL framework can support multiple languages. The experiments demonstrate the advantages of FAMIE in terms of competitive performance and time efficiency for sequence labeling with AL. We publicly release our code (https://github.com/nlp-uoregon/famie) and demo website (http://nlp.uoregon.edu:9000/). A demo video for FAMIE is provided at: https://youtu.be/I2i8n_jAyrY.
Submitted 4 May, 2022; v1 submitted 16 February, 2022;
originally announced February 2022.
-
MAGNeto: An Efficient Deep Learning Method for the Extractive Tags Summarization Problem
Authors:
Hieu Trong Phung,
Anh Tuan Vu,
Tung Dinh Nguyen,
Lam Thanh Do,
Giang Nam Ngo,
Trung Thanh Tran,
Ngoc C. Lê
Abstract:
In this work, we study a new image annotation task named Extractive Tags Summarization (ETS). The goal is to extract important tags from the context provided by an image and its corresponding tags. We adapt some state-of-the-art deep learning models to utilize both visual and textual information. Our proposed solution consists of widely used blocks such as convolutional and self-attention layers, together with a novel combination of auxiliary loss functions and a gating mechanism that glues and elevates these fundamental components into a unified architecture. In addition, we introduce a loss function that aims to reduce the imbalance of the training data and a simple but effective data augmentation technique dedicated to alleviating the effect of outliers on the final results. Last but not least, we explore an unsupervised pre-training strategy to further boost the performance of the model by making use of the abundant available unlabeled data. Our model achieves a 90% $F_1$ score on the public NUS-WIDE benchmark and a 50% $F_1$ score on a noisy large-scale real-world private dataset. Source code for reproducing the experiments is publicly available at: https://github.com/pixta-dev/labteam
Submitted 9 November, 2020;
originally announced November 2020.
-
Conjoined Dirichlet Process
Authors:
Michelle N. Ngo,
Dustin S. Pluta,
Alexander N. Ngo,
Babak Shahbaba
Abstract:
Biclustering is a class of techniques that simultaneously clusters the rows and columns of a matrix to sort heterogeneous data into homogeneous blocks. Although many algorithms have been proposed to find biclusters, existing methods suffer from the pre-specification of the number of biclusters or place constraints on the model structure. To address these issues, we develop a novel, non-parametric probabilistic biclustering method based on Dirichlet processes to identify biclusters with strong co-occurrence in both rows and columns. The proposed method utilizes dual Dirichlet process mixture models to learn row and column clusters, with the number of resulting clusters determined by the data rather than pre-specified. Probabilistic biclusters are identified by modeling the mutual dependence between the row and column clusters. We apply our method to two different applications, text mining and gene expression analysis, and demonstrate that our method improves bicluster extraction in many settings compared to existing approaches.
Submitted 8 February, 2020;
originally announced February 2020.
-
SELF: Learning to Filter Noisy Labels with Self-Ensembling
Authors:
Duc Tam Nguyen,
Chaithanya Kumar Mummadi,
Thi Phuong Nhung Ngo,
Thi Hoai Phuong Nguyen,
Laura Beggel,
Thomas Brox
Abstract:
Deep neural networks (DNNs) have been shown to over-fit a dataset when trained with noisy labels for a long enough time. To overcome this problem, we present a simple and effective method, self-ensemble label filtering (SELF), to progressively filter out the wrong labels during training. Our method improves task performance by gradually allowing supervision only from the potentially non-noisy (clean) labels and stopping learning on the filtered noisy labels. For the filtering, we form running averages of predictions over the entire training dataset using the network output at different training epochs. We show that these ensemble estimates yield more accurate identification of inconsistent predictions throughout training than the single estimates of the network at the most recent training epoch. While filtered samples are removed entirely from the supervised training loss, we dynamically leverage them via semi-supervised learning in the unsupervised loss. We demonstrate the positive effect of such an approach on various image classification tasks under both symmetric and asymmetric label noise and at different noise ratios. It substantially outperforms all previous works on noise-aware learning across different datasets and can be applied to a broad set of network architectures.
Submitted 4 October, 2019;
originally announced October 2019.
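The filtering step described in the SELF abstract above, running averages of per-sample predictions across epochs, can be sketched roughly as follows. This is only an illustration under stated assumptions: the exponential moving average with a `momentum` parameter, the agreement rule in `keep_mask`, and the toy probabilities are all hypothetical choices, and the paper's exact procedure may differ.

```python
from typing import List

Probs = List[List[float]]  # one row of class probabilities per training sample


def update_running_average(avg: Probs, probs: Probs, momentum: float = 0.9) -> Probs:
    """Blend the previous running average with the current epoch's predictions."""
    return [
        [momentum * a + (1.0 - momentum) * p for a, p in zip(avg_row, prob_row)]
        for avg_row, prob_row in zip(avg, probs)
    ]


def keep_mask(avg: Probs, labels: List[int]) -> List[bool]:
    """Keep a sample only if the averaged prediction's top class matches its label."""
    return [
        max(range(len(row)), key=row.__getitem__) == label
        for row, label in zip(avg, labels)
    ]


# Toy data (assumed): two samples, two classes, three epochs of network outputs.
labels = [0, 1]
epoch_probs = [
    [[0.6, 0.4], [0.7, 0.3]],
    [[0.8, 0.2], [0.4, 0.6]],
    [[0.9, 0.1], [0.8, 0.2]],
]

avg = epoch_probs[0]
for probs in epoch_probs[1:]:
    avg = update_running_average(avg, probs, momentum=0.5)

mask = keep_mask(avg, labels)  # samples with mask False would leave the supervised loss
```

In this toy run the second sample's label agrees with the network's output in one epoch only, while the averaged prediction consistently favors the other class, so the sample is filtered. That is the kind of case where an ensemble estimate and a single-epoch estimate can disagree.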
-
DeepUSPS: Deep Robust Unsupervised Saliency Prediction With Self-Supervision
Authors:
Duc Tam Nguyen,
Maximilian Dax,
Chaithanya Kumar Mummadi,
Thi Phuong Nhung Ngo,
Thi Hoai Phuong Nguyen,
Zhongyu Lou,
Thomas Brox
Abstract:
Deep neural network (DNN) based salient object detection in images based on high-quality labels is expensive. Alternative unsupervised approaches rely on careful selection of multiple handcrafted saliency methods to generate noisy pseudo-ground-truth labels. In this work, we propose a two-stage mechanism for robust unsupervised object saliency prediction, where the first stage involves refinement of the noisy pseudo labels generated from different handcrafted methods. Each handcrafted method is substituted by a deep network that learns to generate the pseudo labels. These labels are refined incrementally in multiple iterations via our proposed self-supervision technique. In the second stage, the refined labels produced from multiple networks representing multiple saliency methods are used to train the actual saliency detection network. We show that this self-learning procedure outperforms all the existing unsupervised methods over different datasets. Results are even comparable to those of fully-supervised state-of-the-art approaches. The code is available at https://tinyurl.com/wtlhgo3 .
Submitted 15 March, 2021; v1 submitted 28 September, 2019;
originally announced September 2019.