-
TabFlex: Scaling Tabular Learning to Millions with Linear Attention
Authors:
Yuchen Zeng,
Tuan Dinh,
Wonjun Kang,
Andreas C Mueller
Abstract:
Leveraging the in-context learning (ICL) capability of Large Language Models (LLMs) for tabular classification has gained significant attention for its training-free adaptability across diverse datasets. Recent advancements, like TabPFN, excel in small-scale tabular datasets but struggle to scale for large and complex datasets. Our work enhances the efficiency and scalability of TabPFN for larger datasets by incorporating linear attention mechanisms as a scalable alternative to quadratic-complexity self-attention. Our model, TabFlex, efficiently handles tabular datasets with thousands of features and hundreds of classes, scaling seamlessly to millions of samples. For instance, TabFlex processes the poker-hand dataset with over a million samples in just 5 seconds. Our extensive evaluations demonstrate that TabFlex can achieve over a 2x speedup compared to TabPFN and a 1.5x speedup over XGBoost, outperforming 25 tested baselines in terms of efficiency across a diverse range of datasets. Furthermore, TabFlex remains highly effective on large-scale datasets, delivering strong performance with significantly reduced computational costs, especially when combined with data-efficient techniques such as dimensionality reduction and data sampling.
Submitted 5 June, 2025;
originally announced June 2025.
-
Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons
Authors:
Isik Baran Sandan,
Tu Anh Dinh,
Jan Niehues
Abstract:
Large Language Models (LLMs) have been shown to be effective evaluators across various domains such as machine translation or the scientific domain. Current LLM-as-a-Judge approaches rely mostly on individual assessments or a single round of pairwise assessments, preventing the judge LLM from developing a global ranking perspective. To address this, we present Knockout Assessment, an LLM-as-a-Judge method using a knockout tournament system with iterative pairwise comparisons. Experiments across three LLMs on two datasets show that knockout assessment improves scoring accuracy, increasing Pearson correlation with expert evaluations by 0.07 on average for university-level exam scoring and machine translation evaluations, aligning LLM assessments more closely with human scoring.
Submitted 5 June, 2025; v1 submitted 4 June, 2025;
originally announced June 2025.
-
Synchronic Web Digital Identity: Speculations on the Art of the Possible
Authors:
Thien-Nam Dinh,
Justin Li,
Mitch Negus,
Ken Goss
Abstract:
As search, social media, and artificial intelligence continue to reshape collective knowledge, the preservation of trust on the public infosphere has become a defining challenge of our time. Given the breadth and versatility of adversarial threats, the best--and perhaps only--defense is an equally broad and versatile infrastructure for digital identity.
This document discusses the opportunities and implications of building such an infrastructure from the perspective of a national laboratory. The technical foundation for this discussion is the emergence of the Synchronic Web, a Sandia-developed infrastructure for asserting cryptographic provenance at Internet scale. As of the writing of this document, there is ongoing work to develop the underlying technology and apply it to multiple mission-specific domains within Sandia. The primary objective of this document is to extend the body of existing work toward the more public-facing domain of digital identity.
Our approach depends on a non-standard, but philosophically defensible notion of identity: digital identity is an unbroken sequence of states in a well-defined digital space. From this foundation, we abstractly describe the infrastructural foundations and applied configurations that we expect to underpin future notions of digital identity.
Submitted 2 June, 2025;
originally announced June 2025.
-
KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization
Authors:
Zhaolin Li,
Yining Liu,
Danni Liu,
Tuan Nam Nguyen,
Enes Yavuz Ugan,
Tu Anh Dinh,
Carlos Mullov,
Alexander Waibel,
Jan Niehues
Abstract:
This paper presents KIT's submissions to the IWSLT 2025 low-resource track. We develop both cascaded systems, consisting of Automatic Speech Recognition (ASR) and Machine Translation (MT) models, and end-to-end (E2E) Speech Translation (ST) systems for three language pairs: Bemba, North Levantine Arabic, and Tunisian Arabic into English. Building upon pre-trained models, we fine-tune our systems with different strategies to utilize resources efficiently. This study further explores system enhancement with synthetic data and model regularization. Specifically, we investigate MT-augmented ST by generating translations from ASR data using MT models. For North Levantine, which lacks parallel ST training data, a system trained solely on synthetic data slightly surpasses the cascaded system trained on real data. We also explore augmentation using text-to-speech models by generating synthetic speech from MT data, demonstrating the benefits of synthetic data in improving both ASR and ST performance for Bemba. Additionally, we apply intra-distillation to enhance model performance. Our experiments show that this approach consistently improves results across ASR, MT, and ST tasks, as well as across different pre-trained models. Finally, we apply Minimum Bayes Risk decoding to combine the cascaded and end-to-end systems, achieving an improvement of approximately 1.5 BLEU points.
Submitted 26 May, 2025;
originally announced May 2025.
-
HAKES: Scalable Vector Database for Embedding Search Service
Authors:
Guoyu Hu,
Shaofeng Cai,
Tien Tuan Anh Dinh,
Zhongle Xie,
Cong Yue,
Gang Chen,
Beng Chin Ooi
Abstract:
Modern deep learning models capture the semantics of complex data by transforming them into high-dimensional embedding vectors. Emerging applications, such as retrieval-augmented generation, use approximate nearest neighbor (ANN) search in the embedding vector space to find similar data. Existing vector databases provide indexes for efficient ANN searches, with graph-based indexes being the most popular due to their low latency and high recall in real-world high-dimensional datasets. However, these indexes are costly to build, suffer from significant contention under concurrent read-write workloads, and scale poorly to multiple servers.
Our goal is to build a vector database that achieves high throughput and high recall under concurrent read-write workloads. To this end, we first propose an ANN index with an explicit two-stage design combining a fast filter stage with highly compressed vectors and a refine stage to ensure recall, and we devise a novel lightweight machine learning technique to fine-tune the index parameters. We introduce an early termination check to dynamically adapt the search process for each query. Next, we add support for writes while maintaining search performance by decoupling the management of the learned parameters. Finally, we design HAKES, a distributed vector database that serves the new index in a disaggregated architecture. We evaluate our index and system against 12 state-of-the-art indexes and three distributed vector databases, using high-dimensional embedding datasets generated by deep learning models. The experimental results show that our index outperforms index baselines in the high recall region and under concurrent read-write workloads. Furthermore, HAKES is scalable and achieves up to 16x higher throughput than the baselines. The HAKES project is open-sourced at https://www.comp.nus.edu.sg/~dbsystem/hakes/.
Submitted 18 May, 2025;
originally announced May 2025.
-
SuoiAI: Building a Dataset for Aquatic Invertebrates in Vietnam
Authors:
Tue Vo,
Lakshay Sharma,
Tuan Dinh,
Khuong Dinh,
Trang Nguyen,
Trung Phan,
Minh Do,
Duong Vu
Abstract:
Understanding and monitoring aquatic biodiversity is critical for ecological health and conservation efforts. This paper proposes SuoiAI, an end-to-end pipeline for building a dataset of aquatic invertebrates in Vietnam and employing machine learning (ML) techniques for species classification. We outline the methods for data collection, annotation, and model training, focusing on reducing annotation effort through semi-supervised learning and leveraging state-of-the-art object detection and classification models. Our approach aims to overcome challenges such as data scarcity, fine-grained classification, and deployment in diverse environmental conditions.
Submitted 21 April, 2025;
originally announced April 2025.
-
FairDAG: Consensus Fairness over Concurrent Causal Design
Authors:
Dakai Kang,
Junchao Chen,
Tien Tuan Anh Dinh,
Mohammad Sadoghi
Abstract:
The rise of cryptocurrencies like Bitcoin and Ethereum has driven interest in blockchain technology, with Ethereum's smart contracts enabling the growth of decentralized finance (DeFi). However, research has shown that adversaries exploit transaction ordering to extract profits through attacks like front-running, sandwich attacks, and liquidation manipulation. This issue affects both permissionless and permissioned blockchains, as block proposers have full control over transaction ordering. To address this, a more fair approach to transaction ordering is essential.
Existing fairness protocols, such as Pompe and Themis, operate on leader-based consensus protocols, which not only suffer from low throughput but also allow adversaries to manipulate transaction ordering. To address these limitations, we propose FairDAG-AB and FairDAG-RL, which leverage DAG-based consensus protocols.
We theoretically demonstrate that FairDAG protocols not only uphold fairness guarantees, as previous fairness protocols do, but also achieve higher throughput and greater resilience to adversarial ordering manipulation. Our deployment and evaluation on CloudLab further validate these claims.
Submitted 2 April, 2025;
originally announced April 2025.
-
CCaaLF: Concurrency Control as a Learnable Function
Authors:
Hexiang Pan,
Shaofeng Cai,
Tien Tuan Anh Dinh,
Yuncheng Wu,
Yeow Meng Chee,
Gang Chen,
Beng Chin Ooi
Abstract:
Concurrency control (CC) algorithms are important in modern transactional databases, as they enable high performance by executing transactions concurrently while ensuring correctness. However, state-of-the-art CC algorithms struggle to perform well across diverse workloads, and most do not consider workload drifts.
In this paper, we propose CCaaLF (Concurrency Control as a Learnable Function), a novel learned concurrency control algorithm designed to achieve high performance across varying workloads. The algorithm is quick to optimize, making it robust against dynamic workloads. CCaaLF learns an agent function that captures a large number of design choices from existing CC algorithms. The function is implemented as an efficient in-database lookup table that maps database states to concurrency control actions. The learning process is based on a combination of Bayesian optimization and a novel graph reduction algorithm, which converges quickly to a function that achieves high transaction throughput. We compare CCaaLF against five state-of-the-art CC algorithms and show that our algorithm consistently outperforms them in terms of transaction throughput and optimization time.
Submitted 25 March, 2025; v1 submitted 13 March, 2025;
originally announced March 2025.
-
Energy Scale Degradation in Sparse Quantum Solvers: A Barrier to Quantum Utility
Authors:
Thang N. Dinh,
Cao P. Cong
Abstract:
Quantum computing offers a promising route for tackling hard optimization problems by encoding them as Ising models. However, sparse qubit connectivity requires the use of minor-embedding, mapping logical qubits onto chains of physical qubits, which necessitates stronger intra-chain coupling to maintain consistency. This elevated coupling strength forces a rescaling of the Hamiltonian due to hardware-imposed limits on the allowable ranges of coupling strengths, reducing the energy gaps between competing states and thus degrading the solver's performance. Here, we introduce a theoretical model that quantifies this degradation. We show that as the connectivity degree increases, the effective temperature rises as a polynomial function, resulting in a success probability that decays exponentially. Our analysis further establishes worst-case bounds on the energy scale degradation based on the inverse conductance of chain subgraphs, revealing the two most important drivers of chain strength: chain volume and chain connectivity. Our findings indicate that achieving quantum advantage is inherently challenging. Experiments on D-Wave quantum annealers validate these findings, highlighting the need for hardware with improved connectivity and optimized scale-aware embedding algorithms.
Submitted 11 March, 2025;
originally announced March 2025.
-
Sparse Orthogonal Matching Pursuit-based Parameter Estimation for Integrated Sensing and Communications
Authors:
Ngoc-Son Duong,
Khac-Hoang Ngo,
Thai-Mai Dinh,
Van-Linh Nguyen
Abstract:
Accurate parameter estimation such as angle of arrival (AOA) is essential to enhance the performance of integrated sensing and communication (ISAC) in mmWave multiple-input multiple-output (MIMO) systems. This work presents a sensing-aided communication channel estimation mechanism, where the sensing channel shares the same AOA with the uplink communication channel. First, we propose a novel orthogonal matching pursuit (OMP)-based method for coarsely estimating the AOA in a sensing channel, offering improved accuracy compared to conventional methods that rely on rotational invariance techniques. Next, we refine the coarse estimates obtained in the first step by modifying the Space-Alternating Generalized Expectation Maximization algorithm for fine parameter estimation. Through simulations and mathematical analysis, we demonstrate that scenarios with shared AOA achieve a better Cramer-Rao lower bound (CRLB) than those without sharing. This finding highlights the potential of leveraging joint sensing and communication channels to enhance parameter estimation accuracy, particularly in channel or location estimation applications.
Submitted 4 March, 2025;
originally announced March 2025.
-
Are Generative Models Underconfident? Better Quality Estimation with Boosted Model Probability
Authors:
Tu Anh Dinh,
Jan Niehues
Abstract:
Quality Estimation (QE) is the task of estimating the quality of model output during inference when the ground truth is not available. Deriving output quality from the model's output probability is the most trivial and low-effort way. However, we show that the output probability of text-generation models can appear underconfident. At each output step, there can be multiple correct options, making the probability distribution spread out more. Thus, lower probability does not necessarily mean lower output quality. Based on this observation, we propose a QE approach called BoostedProb, which boosts the model's confidence in cases where there are multiple viable output options. With no increase in complexity, BoostedProb is notably better than raw model probability in different settings, achieving on average +0.194 improvement in Pearson correlation to ground-truth quality. It also comes close to or outperforms more costly approaches like supervised or ensemble-based QE in certain settings.
Submitted 29 May, 2025; v1 submitted 16 February, 2025;
originally announced February 2025.
-
Data clustering: an essential technique in data science
Authors:
Tai Dinh,
Wong Hauchi,
Daniil Lisik,
Michal Koren,
Dat Tran,
Philip S. Yu,
Joaquín Torres-Sospedra
Abstract:
This paper explores the critical role of data clustering in data science, emphasizing its methodologies, tools, and diverse applications. Traditional techniques, such as partitional and hierarchical clustering, are analyzed alongside advanced approaches such as data stream, density-based, graph-based, and model-based clustering for handling complex structured datasets. The paper highlights key principles underpinning clustering, outlines widely used tools and frameworks, introduces the workflow of clustering in data science, discusses challenges in practical implementation, and examines various applications of clustering. By focusing on these foundations and applications, the discussion underscores clustering's transformative potential. The paper concludes with insights into future research directions, emphasizing clustering's role in driving innovation and enabling data-driven decision-making.
Submitted 30 January, 2025; v1 submitted 24 December, 2024;
originally announced December 2024.
-
Scalable Quantum-Inspired Optimization through Dynamic Qubit Compression
Authors:
Co Tran,
Quoc-Bao Tran,
Hy Truong Son,
Thang N Dinh
Abstract:
Hard combinatorial optimization problems, often mapped to Ising models, promise potential solutions with quantum advantage but are constrained by limited qubit counts in near-term devices. We present an innovative quantum-inspired framework that dynamically compresses large Ising models to fit available quantum hardware of different sizes. Thus, we aim to bridge the gap between large-scale optimization and current hardware capabilities. Our method leverages a physics-inspired GNN architecture to capture complex interactions in Ising models and accurately predict alignments among neighboring spins (aka qubits) at ground states. By progressively merging such aligned spins, we can reduce the model size while preserving the underlying optimization structure. It also provides a natural trade-off between the solution quality and size reduction, meeting different hardware constraints of quantum computing devices. Extensive numerical studies on Ising instances of diverse topologies show that our method can reduce instance size at multiple levels with virtually no losses in solution quality on the latest D-Wave quantum annealers.
Submitted 24 December, 2024;
originally announced December 2024.
-
SeSeMI: Secure Serverless Model Inference on Sensitive Data
Authors:
Guoyu Hu,
Yuncheng Wu,
Gang Chen,
Tien Tuan Anh Dinh,
Beng Chin Ooi
Abstract:
Model inference systems are essential for implementing end-to-end data analytics pipelines that deliver the benefits of machine learning models to users. Existing cloud-based model inference systems are costly, not easy to scale, and must be trusted in handling the models and user request data. Serverless computing presents a new opportunity, as it provides elasticity and fine-grained pricing. Our goal is to design a serverless model inference system that protects models and user request data from untrusted cloud providers. It offers high performance and low cost, while requiring no intrusive changes to the current serverless platforms. To realize our goal, we leverage trusted hardware. We identify and address three challenges in using trusted hardware for serverless model inference. These challenges arise from the high-level abstraction of serverless computing, the performance overhead of trusted hardware, and the characteristics of model inference workloads. We present SeSeMI, a secure, efficient, and cost-effective serverless model inference system. It adds three novel features non-intrusively to the existing serverless infrastructure and nothing else. The first feature is a key service that establishes secure channels between the user and the serverless instances, which also provides access control to models and users' data. The second is an enclave runtime that allows one enclave to process multiple concurrent requests. The final feature is a model packer that allows multiple models to be executed by one serverless instance. We build SeSeMI on top of Apache OpenWhisk, and conduct extensive experiments with three popular machine learning models. The results show that SeSeMI achieves low latency and low cost at scale for realistic workloads.
Submitted 16 December, 2024;
originally announced December 2024.
-
Speech-based Multimodel Pipeline for Vietnamese Services Quality Assessment
Authors:
Quang-Anh N. D.,
Minh-Duc Pham,
Thai Kim Dinh
Abstract:
In the evolving landscape of customer service within the digital economy, traditional methods of service quality assessment have shown significant limitations. This research proposes a novel deep-learning approach to service quality assessment, focusing on the Vietnamese service sector. By leveraging a multi-modal pipeline that transcends traditional evaluation methods, the research addresses the limitations of conventional assessments by analyzing speech, speaker interactions and emotional content, offering a more comprehensive and objective means of understanding customer service interactions. This aims to provide organizations with a sophisticated tool for evaluating and improving service quality in the digital economy.
Submitted 18 December, 2024; v1 submitted 12 December, 2024;
originally announced December 2024.
-
Emotional Vietnamese Speech-Based Depression Diagnosis Using Dynamic Attention Mechanism
Authors:
Quang-Anh N. D.,
Manh-Hung Ha,
Thai Kim Dinh,
Minh-Duc Pham,
Ninh Nguyen Van
Abstract:
Major depressive disorder is a prevalent and serious mental health condition that negatively impacts a person's emotions, thoughts, actions, and overall perception of the world. It is difficult to determine whether a person is depressed because the symptoms of depression are not always apparent. However, the voice can be one of the factors from which signs of depression can be recognized. People who are depressed express discomfort and sadness, and they may speak slowly, with a trembling voice, and with reduced emotion. In this study, we propose the Dynamic Convolutional Block Attention Module (Dynamic-CBAM), to be used within an Attention-GRU network, to classify emotions by analyzing human audio signals. Based on the results, we can identify which patients are depressed or prone to depression, so that treatment and prevention can be started as soon as possible. The research delves into the intricate computational steps involved in implementing an Attention-GRU deep learning architecture. Through experimentation, the model achieved an Unweighted Accuracy (UA) of 0.87, a Weighted Accuracy (WA) of 0.86, and an F1 score of 0.87 on the VNEMOS dataset. Training code is released at https://github.com/fiyud/Emotional-Vietnamese-Speech-Based-Depression-Diagnosis-Using-Dynamic-Attention-Mechanism
Submitted 11 December, 2024;
originally announced December 2024.
-
Automating Data Science Pipelines with Tensor Completion
Authors:
Shaan Pakala,
Bryce Graw,
Dawon Ahn,
Tam Dinh,
Mehnaz Tabassum Mahin,
Vassilis Tsotras,
Jia Chen,
Evangelos E. Papalexakis
Abstract:
Hyperparameter optimization is an essential component in many data science pipelines and typically entails exhaustive time and resource-consuming computations in order to explore the combinatorial search space. Similar to this problem, other key operations in data science pipelines exhibit the exact same properties. Important examples are: neural architecture search, where the goal is to identify the best design choices for a neural network, and query cardinality estimation, where given different predicate values for a SQL query the goal is to estimate the size of the output. In this paper, we abstract away those essential components of data science pipelines and we model them as instances of tensor completion, where each variable of the search space corresponds to one mode of the tensor, and the goal is to identify all missing entries of the tensor, corresponding to all combinations of variable values, starting from a very small sample of observed entries. In order to do so, we first conduct a thorough experimental evaluation of existing state-of-the-art tensor completion techniques and introduce domain-inspired adaptations (such as smoothness across the discretized variable space) and an ensemble technique which is able to achieve state-of-the-art performance. We extensively evaluate existing and proposed methods in a number of datasets generated corresponding to (a) hyperparameter optimization for non-neural network models, (b) neural architecture search, and (c) variants of query cardinality estimation, demonstrating the effectiveness of tensor completion as a tool for automating data science pipelines. Furthermore, we release our generated datasets and code in order to provide benchmarks for future work on this topic.
Submitted 8 October, 2024;
originally announced October 2024.
-
Categorical data clustering: 25 years beyond K-modes
Authors:
Tai Dinh,
Wong Hauchi,
Philippe Fournier-Viger,
Daniil Lisik,
Minh-Quyet Ha,
Hieu-Chi Dam,
Van-Nam Huynh
Abstract:
The clustering of categorical data is a common and important task in computer science, offering profound implications across a spectrum of applications. Unlike purely numerical data, categorical data often lack inherent ordering as in nominal data, or have varying levels of order as in ordinal data, thus requiring specialized methodologies for efficient organization and analysis. This review provides a comprehensive synthesis of categorical data clustering in the past twenty-five years, starting from the introduction of K-modes. It elucidates the pivotal role of categorical data clustering in diverse fields such as health sciences, natural sciences, social sciences, education, engineering and economics. Practical comparisons are conducted for algorithms having public implementations, highlighting distinguishing clustering methodologies and revealing the performance of recent algorithms on several benchmark categorical datasets. Finally, challenges and opportunities in the field are discussed.
Submitted 24 January, 2025; v1 submitted 30 August, 2024;
originally announced August 2024.
-
SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading
Authors:
Tu Anh Dinh,
Carlos Mullov,
Leonard Bärmann,
Zhaolin Li,
Danni Liu,
Simon Reiß,
Jueun Lee,
Nathan Lerzer,
Fabian Ternava,
Jianfeng Gao,
Tobias Röddiger,
Alexander Waibel,
Tamim Asfour,
Michael Beigl,
Rainer Stiefelhagen,
Carsten Dachsbacher,
Klemens Böhm,
Jan Niehues
Abstract:
With the rapid development of Large Language Models (LLMs), it is crucial to have benchmarks which can evaluate the ability of LLMs on different domains. One common use of LLMs is performing tasks on scientific topics, such as writing algorithms, querying databases or giving mathematical proofs. Inspired by the way university students are evaluated on such tasks, in this paper, we propose SciEx - a benchmark consisting of university computer science exam questions, to evaluate LLMs' ability to solve scientific tasks. SciEx is (1) multilingual, containing both English and German exams, (2) multi-modal, containing questions that involve images, and (3) diverse, containing various types of free-form questions with different difficulty levels, due to the nature of university exams. We evaluate the performance of various state-of-the-art LLMs on our new benchmark. Since SciEx questions are free-form, it is not straightforward to evaluate LLM performance. Therefore, we provide human expert grading of the LLM outputs on SciEx. We show that the free-form exams in SciEx remain challenging for the current LLMs, where the best LLM only achieves 59.4% exam grade on average. We also provide detailed comparisons between LLM performance and student performance on SciEx. To enable future evaluation of new LLMs, we propose using LLM-as-a-judge to grade the LLM answers on SciEx. Our experiments show that, although they do not perform perfectly on solving the exams, LLMs are decent as graders, achieving 0.948 Pearson correlation with expert grading.
Submitted 2 October, 2024; v1 submitted 14 June, 2024;
originally announced June 2024.
-
Repairing Reed-Solomon Codes with Side Information
Authors:
Thi Xinh Dinh,
Ba Thong Le,
Son Hoang Dau,
Serdar Boztas,
Stanislav Kruglik,
Han Mao Kiah,
Emanuele Viterbo,
Tuvi Etzion,
Yeow Meng Chee
Abstract:
We generalize the problem of recovering a lost/erased symbol in a Reed-Solomon code to the scenario in which some side information about the lost symbol is known. The side information is represented as a set $S$ of linearly independent combinations of the sub-symbols of the lost symbol. When $S = \varnothing$, this reduces to the standard problem of repairing a single codeword symbol. When $S$ is a set of sub-symbols of the erased one, this becomes the repair problem with partially lost/erased symbol. We first establish that the minimum repair bandwidth depends on $|S|$ and not the content of $S$ and construct a lower bound on the repair bandwidth of a linear repair scheme with side information $S$. We then consider the well-known subspace-polynomial repair schemes and show that their repair bandwidths can be optimized by choosing the right subspaces. Finally, we demonstrate several parameter regimes where the optimal bandwidths can be achieved for full-length Reed-Solomon codes.
Submitted 12 May, 2024;
originally announced May 2024.
-
Quality Estimation with $k$-nearest Neighbors and Automatic Evaluation for Model-specific Quality Estimation
Authors:
Tu Anh Dinh,
Tobias Palzer,
Jan Niehues
Abstract:
Providing quality scores along with Machine Translation (MT) output, so-called reference-free Quality Estimation (QE), is crucial to inform users about the reliability of the translation. We propose a model-specific, unsupervised QE approach, termed $k$NN-QE, that extracts information from the MT model's training data using $k$-nearest neighbors. Measuring the performance of model-specific QE is not straightforward, since they provide quality scores on their own MT output, thus cannot be evaluated using benchmark QE test sets containing human quality scores on premade MT output. Therefore, we propose an automatic evaluation method that uses quality scores from reference-based metrics as gold standard instead of human-generated ones. We are the first to conduct detailed analyses and conclude that this automatic method is sufficient, and the reference-based MetricX-23 is best for the task.
Submitted 27 April, 2024;
originally announced April 2024.
-
Twin Auto-Encoder Model for Learning Separable Representation in Cyberattack Detection
Authors:
Phai Vu Dinh,
Quang Uy Nguyen,
Thai Hoang Dinh,
Diep N. Nguyen,
Bao Son Pham,
Eryk Dutkiewicz
Abstract:
Representation learning (RL) methods for cyberattack detection face the diversity and sophistication of attack data, leading to the issue of mixed representations of different classes, particularly as the number of classes increases. To address this, the paper proposes a novel deep learning architecture/model called the Twin Auto-Encoder (TAE). TAE first maps the input data into latent space and then deterministically shifts data samples of different classes further apart to create separable data representations, referred to as representation targets. TAE's decoder then projects the input data into these representation targets. After training, TAE's decoder extracts data representations. TAE's representation target serves as a novel dynamic codeword, which refers to the vector that represents a specific class. This vector is updated after each training epoch for every data sample, in contrast to the conventional fixed codeword that does not incorporate information from the input data. We conduct extensive experiments on diverse cybersecurity datasets, including seven IoT botnet datasets, two network IDS datasets, three malware datasets, one cloud DDoS dataset, and ten artificial datasets as the number of classes increases. TAE boosts accuracy and F-score in attack detection by around 2% compared to state-of-the-art models, achieving up to 96.1% average accuracy in IoT attack detection. Additionally, TAE is well-suited for cybersecurity applications and potentially for IoT systems, with a model size of approximately 1 MB and an average running time of around 2.6E-07 seconds for extracting a data sample.
Submitted 28 April, 2025; v1 submitted 21 March, 2024;
originally announced March 2024.
-
Investigating Training Strategies and Model Robustness of Low-Rank Adaptation for Language Modeling in Speech Recognition
Authors:
Yu Yu,
Chao-Han Huck Yang,
Tuan Dinh,
Sungho Ryu,
Jari Kolehmainen,
Roger Ren,
Denis Filimonov,
Prashanth G. Shivakumar,
Ankur Gandhe,
Ariya Rastow,
Jia Xu,
Ivan Bulyko,
Andreas Stolcke
Abstract:
The use of low-rank adaptation (LoRA) with frozen pretrained language models (PLMs) has become increasingly popular as a mainstream, resource-efficient modeling approach for memory-constrained hardware. In this study, we first explore how to enhance model performance by introducing various LoRA training strategies, achieving relative word error rate reductions of 3.50% on the public Librispeech dataset and of 3.67% on an internal dataset in the messaging domain. To further characterize the stability of LoRA-based second-pass speech recognition models, we examine robustness against input perturbations. These perturbations are rooted in homophone replacements and a novel metric called N-best Perturbation-based Rescoring Robustness (NPRR), both designed to measure the relative degradation in the performance of rescoring models. Our experimental results indicate that while advanced variants of LoRA, such as dynamic rank-allocated LoRA, lead to performance degradation in $1$-best perturbation, they alleviate the degradation in $N$-best perturbation. This finding is in comparison to fully-tuned models and vanilla LoRA tuning baselines, suggesting that a comprehensive selection is needed when using LoRA-based adaptation for compute-cost savings and robust language modeling.
Submitted 18 January, 2024;
originally announced January 2024.
-
A probabilistic forecast methodology for volatile electricity prices in the Australian National Electricity Market
Authors:
Cameron Cornell,
Nam Trong Dinh,
S. Ali Pourmousavi
Abstract:
The South Australia region of the Australian National Electricity Market (NEM) displays some of the highest levels of price volatility observed in modern electricity markets. This paper outlines an approach to probabilistic forecasting under these extreme conditions, including spike filtration and several post-processing steps. We propose using quantile regression as an ensemble tool for probabilistic forecasting, with our combined forecasts achieving superior results compared to all constituent models. Within our ensemble framework, we demonstrate that averaging models with varying training length periods leads to a more adaptive model and increased prediction accuracy. The applicability of the final model is evaluated by comparing our median forecasts with the point forecasts available from the Australian NEM operator, with our model outperforming these NEM forecasts by a significant margin.
Submitted 12 December, 2023; v1 submitted 13 November, 2023;
originally announced November 2023.
-
On the Financial Consequences of Simplified Battery Sizing Models without Considering Operational Details
Authors:
Nam Trong Dinh,
Sahand Karimi-Arpanahi,
S. Ali Pourmousavi,
Mingyu Guo,
Julian Lemos-Vinasco,
Jon A. R. Liisberg
Abstract:
Optimal battery sizing studies tend to overly simplify the practical aspects of battery operation within the battery sizing framework. Such assumptions may lead to a suboptimal battery capacity, resulting in significant financial losses for a battery project that could last more than a decade. In this paper, we compare the most common existing sizing methods in the literature with a battery sizing model that incorporates the practical operation of a battery, that is, receding horizon operation. Consequently, we quantify the financial losses caused by the suboptimal capacities obtained by these models for a realistic case study related to community battery storage (CBS). We develop the case study by constructing a mathematical framework for the CBS and local end users. Our results show that existing sizing methods can lead to financial losses of up to 22%.
Submitted 3 October, 2023;
originally announced October 2023.
-
Low-rank Adaptation of Large Language Model Rescoring for Parameter-Efficient Speech Recognition
Authors:
Yu Yu,
Chao-Han Huck Yang,
Jari Kolehmainen,
Prashanth G. Shivakumar,
Yile Gu,
Sungho Ryu,
Roger Ren,
Qi Luo,
Aditya Gourav,
I-Fan Chen,
Yi-Chieh Liu,
Tuan Dinh,
Ankur Gandhe,
Denis Filimonov,
Shalini Ghosh,
Andreas Stolcke,
Ariya Rastow,
Ivan Bulyko
Abstract:
We propose a neural language modeling system based on low-rank adaptation (LoRA) for speech recognition output rescoring. Although pretrained language models (LMs) like BERT have shown superior performance in second-pass rescoring, the high computational cost of scaling up the pretraining stage and adapting the pretrained models to specific domains limits their practical use in rescoring. Here we present a method based on low-rank decomposition to train a rescoring BERT model and adapt it to new domains using only a fraction (0.08%) of the pretrained parameters. These inserted matrices are optimized through a discriminative training objective along with a correlation-based regularization loss. The proposed low-rank adaptation Rescore-BERT (LoRB) architecture is evaluated on LibriSpeech and internal datasets, with training times decreased by factors between 3.6 and 5.4.
Submitted 10 October, 2023; v1 submitted 26 September, 2023;
originally announced September 2023.
-
Modelling Irrational Behaviour of Residential End Users using Non-Stationary Gaussian Processes
Authors:
Nam Trong Dinh,
Sahand Karimi-Arpanahi,
Rui Yuan,
S. Ali Pourmousavi,
Mingyu Guo,
Jon A. R. Liisberg,
Julian Lemos-Vinasco
Abstract:
Demand response (DR) plays a critical role in ensuring efficient electricity consumption and optimal use of network assets. Yet, existing DR models often overlook a crucial element, the irrational behaviour of electricity end users. In this work, we propose a price-responsive model that incorporates key aspects of end-user irrationality, specifically loss aversion, time inconsistency, and bounded rationality. To this end, we first develop a framework that uses Multiple Seasonal-Trend decomposition using Loess (MSTL) and non-stationary Gaussian processes to model the randomness in the electricity consumption by residential consumers. The impact of this model is then evaluated through a community battery storage (CBS) business model. Additionally, we apply a chance-constrained optimisation model for CBS operation that deals with the unpredictability of the end-user irrationality. Our simulations using real-world data show that the proposed DR model provides a more realistic estimate of end-user price-responsive behaviour when considering irrationality. Compared to a deterministic model that cannot fully take into account the irrational behaviour of end users, the chance-constrained CBS operation model yields an additional 19% revenue. Lastly, the business model reduces the electricity costs of solar end users by 11%.
Submitted 26 March, 2024; v1 submitted 16 September, 2023;
originally announced September 2023.
-
End-to-End Evaluation for Low-Latency Simultaneous Speech Translation
Authors:
Christian Huber,
Tu Anh Dinh,
Carlos Mullov,
Ngoc Quan Pham,
Thai Binh Nguyen,
Fabian Retkowski,
Stefan Constantin,
Enes Yavuz Ugan,
Danni Liu,
Zhaolin Li,
Sai Koneru,
Jan Niehues,
Alexander Waibel
Abstract:
The challenge of low-latency speech translation has recently drawn significant interest in the research community, as shown by several publications and shared tasks. Therefore, it is essential to evaluate these different approaches in realistic scenarios. However, currently only specific aspects of the systems are evaluated and often it is not possible to compare different approaches.
In this work, we propose the first framework to perform and evaluate the various aspects of low-latency speech translation under realistic conditions. The evaluation is carried out in an end-to-end fashion. This includes the segmentation of the audio as well as the run-time of the different components.
Secondly, we compare different approaches to low-latency speech translation using this framework. We evaluate models with the option to revise the output as well as methods with fixed output. Furthermore, we directly compare state-of-the-art cascaded as well as end-to-end systems. Finally, the framework allows automatic evaluation of translation quality as well as latency, and also provides a web interface to show the low-latency model outputs to the user.
Submitted 17 July, 2024; v1 submitted 7 August, 2023;
originally announced August 2023.
-
Federated Deep Reinforcement Learning-based Bitrate Adaptation for Dynamic Adaptive Streaming over HTTP
Authors:
Phuong L. Vo,
Nghia T. Nguyen,
Long Luu,
Canh T. Dinh,
Nguyen H. Tran,
Tuan-Anh Le
Abstract:
In video streaming over HTTP, the bitrate adaptation selects the quality of video chunks depending on the current network condition. Some previous works have applied deep reinforcement learning (DRL) algorithms to determine the chunk's bitrate from the observed states to maximize the quality-of-experience (QoE). However, to build an intelligent model that can predict in various environments, such as 3G, 4G, Wi-Fi, etc., the states observed from these environments must be sent to a server for training centrally. In this work, we integrate federated learning (FL) into DRL-based rate adaptation to train a model appropriate for different environments. The clients in the proposed framework train their model locally and send only the updated weights to the server. The simulations show that our federated DRL-based rate adaptations, called FDRLABR with different DRL algorithms, such as deep Q-learning, advantage actor-critic, and proximal policy optimization, yield better performance than the traditional bitrate adaptation methods in various environments.
Submitted 27 June, 2023;
originally announced June 2023.
-
CroCoDai: A Stablecoin for Cross-Chain Commerce
Authors:
Daniël Reijsbergen,
Bretislav Hajek,
Tien Tuan Anh Dinh,
Jussi Keppo,
Henry F. Korth,
Anwitaman Datta
Abstract:
Decentralized Finance (DeFi), in which digital assets are exchanged without trusted intermediaries, has grown rapidly in value in recent years. The global DeFi ecosystem is fragmented into multiple blockchains, fueling the demand for cross-chain commerce. Existing approaches for cross-chain transactions, e.g., bridges and cross-chain deals, achieve atomicity by locking assets in escrow. However, locking up assets increases the financial risks for the participants, especially due to price fluctuations and the long latency of cross-chain transactions. Stablecoins, which are pegged to a non-volatile asset such as the US dollar, help mitigate the risk associated with price fluctuations. However, existing stablecoin designs are tied to individual blockchain platforms, and trusted parties or complex protocols are needed to exchange stablecoin tokens between blockchains.
Our goal is to design a practical stablecoin for cross-chain commerce. Realizing this goal requires addressing two challenges. The first challenge is to support a large and growing number of blockchains efficiently. The second challenge is to be resilient to price fluctuations and blockchain platform failures. We present CroCoDai to address these challenges. We also present three prototype implementations of our stablecoin system, and show that it incurs small execution overhead.
Submitted 14 October, 2024; v1 submitted 16 June, 2023;
originally announced June 2023.
-
PIEChain -- A Practical Blockchain Interoperability Framework
Authors:
Daniël Reijsbergen,
Aung Maw,
Jingchi Zhang,
Tien Tuan Anh Dinh,
Anwitaman Datta
Abstract:
A plethora of different blockchain platforms have emerged in recent years, but many of them operate in silos. As such, there is a need for reliable cross-chain communication to enable blockchain interoperability. Blockchain interoperability is challenging because transactions can typically not be reverted - as such, if one transaction is committed then the protocol must ensure that all related transactions are committed as well. Existing interoperability approaches, e.g., Cosmos and Polkadot, are limited in the sense that they only support interoperability between their own subchains, or require intrusive changes to existing blockchains. To overcome this limitation, we propose PIEChain, a general, Kafka-based cross-chain communication framework. We utilize PIEChain for a practical case study: a cross-chain auction in which users who hold tokens on multiple chains bid for a ticket sold on another chain. PIEChain is the first publicly available, practical implementation of a general framework for cross-chain communication.
Submitted 16 June, 2023;
originally announced June 2023.
-
KIT's Multilingual Speech Translation System for IWSLT 2023
Authors:
Danni Liu,
Thai Binh Nguyen,
Sai Koneru,
Enes Yavuz Ugan,
Ngoc-Quan Pham,
Tuan-Nam Nguyen,
Tu Anh Dinh,
Carlos Mullov,
Alexander Waibel,
Jan Niehues
Abstract:
Many existing speech translation benchmarks focus on native-English speech in high-quality recording conditions, which often do not match the conditions in real-life use-cases. In this paper, we describe our speech translation system for the multilingual track of IWSLT 2023, which evaluates translation quality on scientific conference talks. The test condition features accented input speech and terminology-dense content. The task requires translation into 10 languages of varying amounts of resources. In the absence of training data from the target domain, we use a retrieval-based approach (kNN-MT) for effective adaptation (+0.8 BLEU for speech translation). We also use adapters to easily integrate incremental training data from data augmentation, and show that it matches the performance of re-training. We observe that cascaded systems are more easily adaptable towards specific target domains, due to their separate modules. Our cascaded speech system substantially outperforms its end-to-end counterpart on scientific talk translation, although their performance remains similar on TED talks.
Submitted 12 July, 2023; v1 submitted 8 June, 2023;
originally announced June 2023.
-
Large Language Models of Code Fail at Completing Code with Potential Bugs
Authors:
Tuan Dinh,
Jinman Zhao,
Samson Tan,
Renato Negrinho,
Leonard Lausen,
Sheng Zha,
George Karypis
Abstract:
Large language models of code (Code-LLMs) have recently brought tremendous advances to code completion, a fundamental feature of programming assistance and code intelligence. However, most existing works ignore the possible presence of bugs in the code context for generation, which are inevitable in software development. Therefore, we introduce and study the buggy-code completion problem, inspired by the realistic scenario of real-time code suggestion where the code context contains potential bugs -- anti-patterns that can become bugs in the completed program. To systematically study the task, we introduce two datasets: one with synthetic bugs derived from semantics-altering operator changes (buggy-HumanEval) and one with realistic bugs derived from user submissions to coding problems (buggy-FixEval). We find that the presence of potential bugs significantly degrades the generation performance of the high-performing Code-LLMs. For instance, the passing rates of CODEGEN-2B-MONO on test cases of buggy-HumanEval drop more than 50% given a single potential bug in the context. Finally, we investigate several post-hoc methods for mitigating the adverse effect of potential bugs and find that there remains a significant gap in post-mitigation performance.
Submitted 30 November, 2023; v1 submitted 6 June, 2023;
originally announced June 2023.
-
Perturbation-based QE: An Explainable, Unsupervised Word-level Quality Estimation Method for Blackbox Machine Translation
Authors:
Tu Anh Dinh,
Jan Niehues
Abstract:
Quality Estimation (QE) is the task of predicting the quality of Machine Translation (MT) system output, without using any gold-standard translation references. State-of-the-art QE models are supervised: they require human-labeled quality of some MT system output on some datasets for training, making them domain-dependent and MT-system-dependent. There has been research on unsupervised QE, which requires glass-box access to the MT systems, or parallel MT data to generate synthetic errors for training QE models. In this paper, we present Perturbation-based QE - a word-level Quality Estimation approach that works simply by analyzing MT system output on perturbed input source sentences. Our approach is unsupervised, explainable, and can evaluate any type of blackbox MT systems, including the currently prominent large language models (LLMs) with opaque internal processes. For language directions with no labeled QE data, our approach has similar or better performance than the zero-shot supervised approach on the WMT21 shared task. Our approach is better at detecting gender bias and word-sense-disambiguation errors in translation than supervised QE, indicating its robustness to out-of-domain usage. The performance gap is larger when detecting errors on a nontraditional translation-prompting LLM, indicating that our approach is more generalizable to different MT systems. We give examples demonstrating our approach's explainability power, where it shows which input source words have influence on a certain MT output word.
Submitted 13 July, 2023; v1 submitted 12 May, 2023;
originally announced May 2023.
-
Designing Compact Repair Groups for Reed-Solomon Codes
Authors:
Thi Xinh Dinh,
Serdar Boztas,
Son Hoang Dau,
Emanuele Viterbo
Abstract:
Motivated by the application of Reed-Solomon codes to recently emerging decentralized storage systems such as Storj and Filebase/Sia, we study the problem of designing compact repair groups for recovering multiple failures in a decentralized manner. Here, compactness means that the corresponding trace repair schemes of these groups of helpers can be generated from a single or a few seed repair schemes, thus saving the time and space required for finding and storing them. The goal is to design compact repair groups that can tolerate as many failures as possible. It turns out that the maximum number of failures a collection of repair groups can tolerate equals the size of a minimum hitting set of a collection of subsets of the finite field $\mathbb{F}_{q^{\ell}}$ minus one. When the repair groups for each symbol are generated from a single subspace, we establish a pair of asymptotically tight lower and upper bounds on the size of such a minimum hitting set. Using Burnside's Lemma and the Möbius inversion formula, we determine a number of subspaces that together attain the upper bound on the minimum hitting set size when the repair groups are generated from multiple subspaces.
Submitted 11 May, 2023;
originally announced May 2023.
-
Development of a Vision System to Enhance the Reliability of the Pick-and-Place Robot for Autonomous Testing of Camera Module used in Smartphones
Authors:
Hoang-Anh Phan,
Duy Nam Bui,
Tuan Nguyen Dinh,
Bao-Anh Hoang,
An Nguyen Ngoc,
Dong Tran Huu Quoc,
Ha Tran Thi Thuy,
Tung Thanh Bui,
Van Nguyen Thi Thanh
Abstract:
Pick-and-place robots are commonly used in modern industrial manufacturing. For complex devices/parts like camera modules used in smartphones, which contain optical parts, electrical components and interfacing connectors, the placement operation may not be absolutely accurate, which may damage the device under test during the mechanical movement to make good contact for electrical function inspection. In this paper, we propose an effective vision system, comprising hardware and an algorithm, to enhance the reliability of the pick-and-place robot for autonomous memory testing of camera modules. With limited hardware based on a camera and a Raspberry Pi, and using a simplified image processing algorithm based on histogram information, the vision system can confirm the presence of the camera modules in the feeding tray and the placement accuracy of the camera module in the test socket. As a result, the system can work with more flexibility and avoid damaging the device under test. The system was experimentally evaluated by testing approximately 2000 camera modules under stable lighting conditions. Experimental results demonstrate that the system achieves an accuracy of more than 99.92%. With its simplicity and effectiveness, the proposed vision system can be considered a useful solution for use in industrial pick-and-place systems.
Submitted 8 May, 2023;
originally announced May 2023.
-
Multiobjective Logistics Optimization for Automated ATM Cash Replenishment Process
Authors:
Bui Tien Thanh,
Dinh Van Tuan,
Tuan Anh Chi,
Nguyen Van Dai,
Nguyen Tai Quang Dinh,
Nguyen Thu Thuy,
Nguyen Thi Xuan Hoa
Abstract:
In the digital transformation era, integrating digital technology into every aspect of banking operations improves process automation, cost efficiency, and service levels. Although logistics for ATM cash is a crucial task that impacts operating costs and consumer satisfaction, there has been little effort to enhance it. Specifically, in Vietnam, with a market of more than 20,000 ATMs nationally, research and technological solutions that can resolve this issue remain scarce. In this paper, we generalize the vehicle routing problem for ATM cash replenishment, propose a mathematical model, and offer a tool to evaluate various situations. When evaluated on the simulated dataset, our proposed model and method produced encouraging results, with the benefit of cutting ATM cash operating costs.
Submitted 22 July, 2023; v1 submitted 23 April, 2023;
originally announced April 2023.
-
Multi-User Cooperation for Covert Communication Under Quasi-Static Fading
Authors:
Jinyoung Lee,
Duc Trung Dinh,
Hyeonsik Yeom,
Si-Hyeon Lee,
Jeongseok Ha
Abstract:
This work studies a covert communication scheme for an uplink multi-user scenario in which some users are opportunistically selected to help a covert user. In particular, the selected users emit interfering signals via an orthogonal resource dedicated to the covert user together with signals for their own communications using orthogonal resources allocated to the selected users, which helps the covert user hide the presence of the covert communication. For the covert communication scheme, we carry out extensive analysis and find system parameters in closed form. The analytic derivation of the system parameters allows one to find the optimal combination of system parameters by performing a simple one-dimensional search. In addition, the analytic results elucidate relations among the system parameters. In particular, we prove that the optimal strategy for the non-covert users is an on-off scheme with equal transmit power. The theoretical results derived in this work are confirmed by comparing them with numerical results obtained with exhaustive searches. Finally, we show that the results of this work can be utilized in versatile ways by presenting a design of covert communication that takes energy efficiency into account.
Submitted 10 April, 2023; v1 submitted 6 April, 2023;
originally announced April 2023.
-
Mining compact high utility sequential patterns
Authors:
Tai Dinh,
Philippe Fournier-Viger,
Huynh Van Hong
Abstract:
High utility sequential pattern mining (HUSPM) aims to mine all patterns that yield a high utility (profit) in a sequence dataset. HUSPM is useful for several applications such as market basket analysis, marketing, and website clickstream analysis. In these applications, users may also consider high utility patterns frequently appearing in the dataset to obtain more fruitful information. However, this task is computationally expensive since algorithms may generate a combinatorially explosive number of candidates that may be redundant or of low importance. To reduce complexity and obtain a compact set of frequent high utility sequential patterns (FHUSPs), this paper proposes an algorithm named CHUSP for mining closed frequent high utility sequential patterns (CHUSPs). Such patterns keep a concise representation while preserving the same expressive power as the complete set of FHUSPs. The proposed algorithm relies on a CHUS data structure to maintain information during mining. It uses three pruning strategies to eliminate low-utility and non-frequent patterns early, thereby reducing the search space. An extensive experimental evaluation was performed on six real-life datasets to evaluate the performance of CHUSP in terms of execution time, memory usage, and the number of generated patterns. Experimental results show that CHUSP can efficiently discover the compact set of CHUSPs under different user-defined thresholds.
Submitted 22 February, 2023;
originally announced February 2023.
-
Composable Ledgers for Distributed Synchronic Web Archiving
Authors:
Thien-Nam Dinh,
Nicholas Pattengale
Abstract:
The Synchronic Web is a highly scalable notary infrastructure that provides tamper-evident data provenance for historical web data. In this document, we describe the applicability of this infrastructure for web archiving across three envisioned stages of adoption. We codify the core mechanism enabling the value proposition: a procedure for splitting and merging cryptographic information fluidly across blockchain-backed ledgers. Finally, we present preliminary performance results that indicate the feasibility of our approach for modern web archiving scales.
Submitted 10 February, 2023;
originally announced February 2023.
-
The Synchronic Web
Authors:
Thien-Nam Dinh,
Nicholas Pattengale,
Steven Elliott
Abstract:
The Synchronic Web is a distributed network for securing data provenance on the World Wide Web. By enabling clients around the world to freely commit digital information into a single shared view of history, it provides a foundational basis of truth on which to build decentralized and scalable trust across the Internet. Its core cryptographical capability allows mutually distrusting parties to create and verify statements of the following form: "I commit to this information--and only this information--at this moment in time." The backbone of the Synchronic Web infrastructure is a simple, small, and semantic-free blockchain that is accessible to any Internet-enabled entity. The infrastructure is maintained by a permissioned network of well-known servers, called notaries, and accessed by a permissionless group of clients, called ledgers. Through an evolving stack of flexible and composable semantic specifications, the parties cooperate to generate synchronic commitments over arbitrary data. When integrated with existing infrastructures, adapted to diverse domains, and scaled across the breadth of cyberspace, the Synchronic Web provides a ubiquitous mechanism to lock the world's data into unique points in discrete time and digital space.
Submitted 10 June, 2024; v1 submitted 25 January, 2023;
originally announced January 2023.
-
Predict+Optimize Problem in Renewable Energy Scheduling
Authors:
Christoph Bergmeir,
Frits de Nijs,
Evgenii Genov,
Abishek Sriramulu,
Mahdi Abolghasemi,
Richard Bean,
John Betts,
Quang Bui,
Nam Trong Dinh,
Nils Einecke,
Rasul Esmaeilbeigi,
Scott Ferraro,
Priya Galketiya,
Robert Glasgow,
Rakshitha Godahewa,
Yanfei Kang,
Steffen Limmer,
Luis Magdalena,
Pablo Montero-Manso,
Daniel Peralta,
Yogesh Pipada Sunil Kumar,
Alejandro Rosales-Pérez,
Julian Ruddick,
Akylas Stratigakos,
Peter Stuckey
, et al. (3 additional authors not shown)
Abstract:
Predict+Optimize frameworks integrate forecasting and optimization to address real-world challenges such as renewable energy scheduling, where variability and uncertainty are critical factors. This paper benchmarks solutions from the IEEE-CIS Technical Challenge on Predict+Optimize for Renewable Energy Scheduling, focusing on forecasting renewable production and demand and optimizing energy cost. The competition attracted 49 participants in total. The top-ranked method employed stochastic optimization using LightGBM ensembles, and achieved at least a 2% reduction in energy costs compared to deterministic approaches, demonstrating that the most accurate point forecast does not necessarily guarantee the best performance in downstream optimization. The published data and problem setting establish a benchmark for further research into integrated forecasting-optimization methods for energy systems, highlighting the importance of considering forecast uncertainty in optimization models to achieve cost-effective and reliable energy management. The novelty of this work lies in its comprehensive evaluation of Predict+Optimize methodologies applied to a real-world renewable energy scheduling problem, providing insights into the scalability, generalizability, and effectiveness of the proposed solutions. Potential applications extend beyond energy systems to any domain requiring integrated forecasting and optimization, such as supply chain management, transportation planning, and financial portfolio optimization.
Submitted 14 April, 2025; v1 submitted 20 December, 2022;
originally announced December 2022.
-
QC-StyleGAN -- Quality Controllable Image Generation and Manipulation
Authors:
Dat Viet Thanh Nguyen,
Phong Tran The,
Tan M. Dinh,
Cuong Pham,
Anh Tuan Tran
Abstract:
The introduction of high-quality image generation models, particularly the StyleGAN family, provides a powerful tool to synthesize and manipulate images. However, existing models are built upon high-quality (HQ) data as desired outputs, making them unfit for in-the-wild low-quality (LQ) images, which are common inputs for manipulation. In this work, we bridge this gap by proposing a novel GAN structure that allows for generating images with controllable quality. The network can synthesize various image degradations and restore the sharp image via a quality control code. Our proposed QC-StyleGAN can directly edit LQ images without altering their quality by applying GAN inversion and manipulation techniques. It also provides, for free, an image restoration solution that can handle various degradations, including noise, blur, compression artifacts, and their mixtures. Finally, we demonstrate numerous other applications such as image degradation synthesis, transfer, and interpolation. The code is available at https://github.com/VinAIResearch/QC-StyleGAN.
Submitted 7 December, 2022; v1 submitted 2 December, 2022;
originally announced December 2022.
-
F2SD: A dataset for end-to-end group detection algorithms
Authors:
Giang Hoang,
Tuan Nguyen Dinh,
Tung Cao Hoang,
Son Le Duy,
Keisuke Hihara,
Yumeka Utada,
Akihiko Torii,
Naoki Izumi,
Long Tran Quoc
Abstract:
The lack of large-scale datasets has been impeding the advance of deep learning approaches to the problem of F-formation detection. Moreover, most research works on this problem rely on input sensor signals of object location and orientation rather than image signals. To address this, we develop a new, large-scale dataset of simulated images for F-formation detection, called F-formation Simulation Dataset (F2SD). F2SD contains nearly 60,000 images simulated from GTA-5, with bounding boxes and orientation information on images, making it useful for a wide variety of modelling approaches. It is also closer to practical scenarios, where three-dimensional location and orientation information are costly to record. It is challenging to construct such a large-scale simulated dataset while keeping it realistic. Furthermore, the available research utilizes conventional methods to detect groups. They do not detect groups directly from the image. In this work, we propose (1) a large-scale simulation dataset F2SD and a pipeline for F-formation simulation, (2) a first-ever end-to-end baseline model for the task, and experiments on our simulation dataset.
Submitted 20 November, 2022;
originally announced November 2022.
-
Optimal activity and battery scheduling algorithm using load and solar generation forecasts
Authors:
Yogesh Pipada Sunil Kumar,
Rui Yuan,
Nam Trong Dinh,
S. Ali Pourmousavi
Abstract:
Energy usage optimal scheduling has attracted great attention in the power system community, where various methodologies have been proposed. However, in real-world applications, the optimal scheduling problems require reliable energy forecasting, which is scarcely discussed as a joint solution to the scheduling problem. The 5th IEEE Computational Intelligence Society (IEEE-CIS) competition raised a practical problem of decreasing the electricity bill by scheduling building activities, where forecasting the solar energy generation and building consumption is a necessity. To solve this problem, we propose a technical sequence for tackling the solar PV and demand forecast and optimal scheduling problems, where solar generation prediction methods and an optimal university lectures scheduling algorithm are proposed.
Submitted 24 October, 2022;
originally announced October 2022.
-
TAP: Transparent and Privacy-Preserving Data Services
Authors:
Daniel Reijsbergen,
Aung Maw,
Zheng Yang,
Tien Tuan Anh Dinh,
Jianying Zhou
Abstract:
Users today expect more security from services that handle their data. In addition to traditional data privacy and integrity requirements, they expect transparency, i.e., that the service's processing of the data is verifiable by users and trusted auditors. Our goal is to build a multi-user system that provides data privacy, integrity, and transparency for a large number of operations, while achieving practical performance.
To this end, we first identify the limitations of existing approaches that use authenticated data structures. We find that they fall into two categories: 1) those that hide each user's data from other users, but have a limited range of verifiable operations (e.g., CONIKS, Merkle2, and Proofs of Liabilities), and 2) those that support a wide range of verifiable operations, but make all data publicly visible (e.g., IntegriDB and FalconDB). We then present TAP to address the above limitations. The key component of TAP is a novel tree data structure that supports efficient result verification, and relies on independent audits that use zero-knowledge range proofs to show that the tree is constructed correctly without revealing user data. TAP supports a broad range of verifiable operations, including quantiles and sample standard deviations. We conduct a comprehensive evaluation of TAP, and compare it against two state-of-the-art baselines, namely IntegriDB and Merkle2, showing that the system is practical at scale.
Submitted 20 October, 2022;
originally announced October 2022.
-
Efficient Hamiltonian Reduction for Quantum Annealing on SatCom Beam Placement Problem
Authors:
Thinh Q. Dinh,
Son Hoang Dau,
Eva Lagunas,
Symeon Chatzinotas
Abstract:
Beam Placement (BP) is a well-known problem in Low-Earth Orbit (LEO) satellite communication (SatCom) systems, which can be modelled as an NP-hard clique cover problem. Recently, quantum computing has emerged as a novel technology which revolutionizes how to solve challenging optimization problems by formulating Quadratic Unconstrained Binary Optimization (QUBO), then preparing Hamiltonians as inputs for quantum computers. In this paper, we study how to use quantum computing to solve BP problems. However, due to limited hardware resources, existing quantum computers are unable to tackle large optimization spaces. Therefore, we propose an efficient Hamiltonian Reduction method that allows quantum processors to solve large BP instances encountered in LEO systems. We conduct our simulations on real quantum computers (D-Wave Advantage) using a real dataset of vessel locations in the US. Numerical results show that our algorithm outperforms commercialized solutions of D-Wave by allowing existing quantum annealers to solve 17.5 times larger BP instances while maintaining high solution quality. Although quantum computing cannot theoretically overcome the hardness of BP problems, this work contributes early efforts to applying quantum computing in satellite optimization problems, especially applications formulated as clique cover/graph coloring problems.
Submitted 12 October, 2022;
originally announced October 2022.
-
Res-Dense Net for 3D Covid Chest CT-scan classification
Authors:
Quoc-Huy Trinh,
Minh-Van Nguyen,
Thien-Phuc Nguyen Dinh
Abstract:
One of the most contentious areas of research in Medical Image Preprocessing is 3D CT-scan. With the rapid spread of COVID-19, the function of CT-scan in properly and swiftly diagnosing the disease has become critical. It has a positive impact on infection prevention. There are many tasks that diagnose illness through CT-scan images, including COVID-19 detection. In this paper, we propose a method that uses a Stacking Deep Neural Network to detect COVID-19 from a series of 3D CT-scan images. In our method, we experiment with two backbones, DenseNet 121 and ResNet 101. This method achieves competitive performance on some evaluation metrics.
Submitted 9 August, 2022;
originally announced August 2022.
-
E2EG: End-to-End Node Classification Using Graph Topology and Text-based Node Attributes
Authors:
Tu Anh Dinh,
Jeroen den Boef,
Joran Cornelisse,
Paul Groth
Abstract:
Node classification utilizing text-based node attributes has many real-world applications, ranging from prediction of paper topics in academic citation graphs to classification of user characteristics in social media networks. State-of-the-art node classification frameworks, such as GIANT, use a two-stage pipeline: first embedding the text attributes of graph nodes, then feeding the resulting embeddings into a node classification model. In this paper, we eliminate these two stages and develop an end-to-end node classification model that builds upon GIANT, called End-to-End-GIANT (E2EG). The tandem utilization of a main and an auxiliary classification objective in our approach results in a more robust model, enabling the BERT backbone to be switched out for a distilled encoder with a 25% - 40% reduction in the number of parameters. Moreover, the model's end-to-end nature increases ease of use, as it avoids the need to chain multiple models for node classification. Compared to a GIANT+MLP baseline on the ogbn-arxiv and ogbn-products datasets, E2EG obtains slightly better accuracy in the transductive setting (+0.5%), while reducing model training time by up to 40%. Our model is also applicable in the inductive setting, outperforming GIANT+MLP by up to +2.23%.
Submitted 26 September, 2023; v1 submitted 9 August, 2022;
originally announced August 2022.
-
GlassDB: An Efficient Verifiable Ledger Database System Through Transparency
Authors:
Cong Yue,
Tien Tuan Anh Dinh,
Zhongle Xie,
Meihui Zhang,
Gang Chen,
Beng Chin Ooi,
Xiaokui Xiao
Abstract:
Verifiable ledger databases protect data history against malicious tampering. Existing systems, such as blockchains and certificate transparency, are based on transparency logs -- a simple abstraction allowing users to verify that a log maintained by an untrusted server is append-only. They expose a simple key-value interface. Building a practical database from transparency logs, on the other hand, remains a challenge.
In this paper, we explore the design space of verifiable ledger databases along three dimensions: abstraction, threat model, and performance. We survey existing systems and identify their two limitations, namely, the lack of transaction support and the inferior efficiency. We then present GlassDB, a distributed database that addresses these limitations under a practical threat model. GlassDB inherits the verifiability of transparency logs, but supports transactions and offers high performance. It extends a ledger-like key-value store with a data structure for efficient proofs, and adds a concurrency control mechanism for transactions. GlassDB batches independent operations from concurrent transactions when updating the core data structures. In addition, we design a new benchmark for evaluating verifiable ledger databases, by extending YCSB and TPC-C benchmarks. Using this benchmark, we compare GlassDB against four baselines: reimplemented versions of three verifiable databases, and a verifiable map backed by a transparency log. Experimental results demonstrate that GlassDB is an efficient, transactional, and verifiable ledger database.
Submitted 19 February, 2023; v1 submitted 2 July, 2022;
originally announced July 2022.