-
MEIM: Multi-partition Embedding Interaction Beyond Block Term Format for Efficient and Expressive Link Prediction
Authors:
Hung Nghiep Tran,
Atsuhiro Takasu
Abstract:
Knowledge graph embedding aims to predict the missing relations between entities in knowledge graphs. Tensor-decomposition-based models, such as ComplEx, provide a good trade-off between efficiency and expressiveness, which is crucial because of the large size of real-world knowledge graphs. The recent multi-partition embedding interaction (MEI) model subsumes these models by using the block term tensor format and provides a systematic solution for the trade-off. However, MEI has several drawbacks, some of which are carried over from its subsumed tensor-decomposition-based models. In this paper, we address these drawbacks and introduce the Multi-partition Embedding Interaction iMproved beyond block term format (MEIM) model, with an independent core tensor for ensemble effects and soft orthogonality for max-rank mapping, in addition to multi-partition embedding. MEIM improves expressiveness while remaining highly efficient, helping it to outperform strong baselines and achieve state-of-the-art results on difficult link prediction benchmarks using fairly small embedding sizes. The source code is released at https://github.com/tranhungnghiep/MEIM-KGE.
Submitted 4 October, 2022; v1 submitted 30 September, 2022;
originally announced September 2022.
-
ur-iw-hnt at GermEval 2021: An Ensembling Strategy with Multiple BERT Models
Authors:
Hoai Nam Tran,
Udo Kruschwitz
Abstract:
This paper describes our approach (ur-iw-hnt) for the Shared Task of GermEval2021 to identify toxic, engaging, and fact-claiming comments. We submitted three runs using an ensembling strategy based on majority (hard) voting over multiple BERT models of three different types: German-based, Twitter-based, and multilingual models. All ensemble models outperform single models, while BERTweet is the best individual model in every subtask. Twitter-based models perform better than GermanBERT models, and multilingual models perform worse, but only by a small margin.
Submitted 5 October, 2021;
originally announced October 2021.
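
The majority (hard) voting step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `majority_vote`, the binary labels, and the three toy prediction lists are assumptions made for illustration, and an odd number of models is assumed so that binary votes cannot tie.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists by hard voting.

    predictions: list of lists, one inner list per model, all of equal length.
    Returns one label per comment: the label chosen by the most models.
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    n_items = len(predictions[0])
    if any(len(p) != n_items for p in predictions):
        raise ValueError("all models must predict the same number of items")

    voted = []
    # zip(*predictions) groups the votes of all models for each comment.
    for votes in zip(*predictions):
        label, _count = Counter(votes).most_common(1)[0]
        voted.append(label)
    return voted


# Hypothetical outputs of three models on four comments (1 = toxic, 0 = not).
german_model = [1, 0, 1, 0]
twitter_model = [1, 1, 0, 0]
multilingual_model = [0, 1, 1, 0]

ensemble = majority_vote([german_model, twitter_model, multilingual_model])
print(ensemble)  # [1, 1, 1, 0]
```

Each ensemble label follows whichever class at least two of the three models agree on, so a single model's error is outvoted when the other two are correct.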
-
Multi-Partition Embedding Interaction with Block Term Format for Knowledge Graph Completion
Authors:
Hung Nghiep Tran,
Atsuhiro Takasu
Abstract:
Knowledge graph completion is an important task that aims to predict the missing relational link between entities. Knowledge graph embedding methods perform this task by representing entities and relations as embedding vectors and modeling their interactions to compute the matching score of each triple. Previous work has usually treated each embedding as a whole and has modeled the interactions between these whole embeddings, potentially making the model excessively expensive or requiring specially designed interaction mechanisms. In this work, we propose the multi-partition embedding interaction (MEI) model with block term format to systematically address this problem. MEI divides each embedding into a multi-partition vector to efficiently restrict the interactions. Each local interaction is modeled with the Tucker tensor format and the full interaction is modeled with the block term tensor format, enabling MEI to control the trade-off between expressiveness and computational cost, learn the interaction mechanisms from data automatically, and achieve state-of-the-art performance on the link prediction task. In addition, we theoretically study the parameter efficiency problem and derive a simple empirically verified criterion for optimal parameter trade-off. We also apply the framework of MEI to provide a new generalized explanation for several specially designed interaction mechanisms in previous models. The source code is released at https://github.com/tranhungnghiep/MEI-KGE.
Submitted 1 October, 2022; v1 submitted 29 June, 2020;
originally announced June 2020.
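
As a rough illustration of the scoring idea in this abstract, the sketch below splits each embedding into partitions, scores each partition with a Tucker-style trilinear product, and sums the local scores in block term fashion. It is a simplified reading of the abstract rather than the released MEI code: the function names, the index order of the core tensor, and the choice to pass one core per partition are all assumptions.

```python
def tucker_score(core, h, t, r):
    """Local interaction: sum over i, j, k of core[i][j][k] * h[i] * t[j] * r[k]."""
    return sum(
        core[i][j][k] * h[i] * t[j] * r[k]
        for i in range(len(h))
        for j in range(len(t))
        for k in range(len(r))
    )


def split(vec, n_parts):
    """Split a flat embedding into n_parts equal-sized partitions."""
    size, remainder = divmod(len(vec), n_parts)
    if remainder:
        raise ValueError("embedding size must be divisible by n_parts")
    return [vec[p * size:(p + 1) * size] for p in range(n_parts)]


def block_term_score(cores, head, tail, rel, n_parts):
    """Full interaction: sum of the local Tucker scores over all partitions.

    cores: one core tensor (nested lists) per partition; pass the same
    object n_parts times to share a single core (an assumption of this sketch).
    """
    if len(cores) != n_parts:
        raise ValueError("need exactly one core tensor per partition")
    heads, tails, rels = split(head, n_parts), split(tail, n_parts), split(rel, n_parts)
    return sum(
        tucker_score(core, h, t, r)
        for core, h, t, r in zip(cores, heads, tails, rels)
    )


# Toy example: 2 partitions of size 1, so each core is a 1x1x1 tensor.
toy_cores = [[[[2.0]]], [[[3.0]]]]
score = block_term_score(toy_cores, [1.0, 2.0], [1.0, 1.0], [1.0, 1.0], n_parts=2)
print(score)  # 8.0
```

Restricting interactions to matching partitions is what keeps the cost down in this sketch: with K partitions of size c, the cores hold K·c³ entries in total, compared with (K·c)³ for a single Tucker core over the whole embeddings.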
-
Exploring Scholarly Data by Semantic Query on Knowledge Graph Embedding Space
Authors:
Hung Nghiep Tran,
Atsuhiro Takasu
Abstract:
The trends of open science have enabled several open scholarly datasets which include millions of papers and authors. Managing, exploring, and utilizing such large and complicated datasets effectively is challenging. In recent years, the knowledge graph has emerged as a universal data format for representing knowledge about heterogeneous entities and their relationships. The knowledge graph can be modeled by knowledge graph embedding methods, which represent entities and relations as embedding vectors in semantic space, then model the interactions between these embedding vectors. However, the semantic structures in the knowledge graph embedding space are not well studied, so knowledge graph embedding methods are usually used only for knowledge graph completion and not for data representation and analysis. In this paper, we propose to analyze these semantic structures based on the well-studied word embedding space and use them to support data exploration. We also define semantic queries, which are algebraic operations between the embedding vectors in the knowledge graph embedding space, to solve queries such as similarity and analogy between the entities in the original datasets. We then design a general framework for data exploration by semantic queries and discuss solutions to some traditional scholarly data exploration tasks. We also propose some new interesting tasks that can be solved based on the uncanny semantic structures of the embedding space.
Submitted 3 November, 2022; v1 submitted 17 September, 2019;
originally announced September 2019.
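
The similarity and analogy queries described in this abstract can be sketched as simple vector algebra. The example below borrows the cosine-similarity and vector-offset conventions from word embeddings, which is an assumption here, since the abstract does not give the exact operations; the entity names and two-dimensional vectors in `toy` are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query_vec, embeddings, exclude=()):
    """Similarity query: name of the entity whose vector is closest to query_vec."""
    candidates = [
        (name, cosine(query_vec, vec))
        for name, vec in embeddings.items()
        if name not in exclude
    ]
    return max(candidates, key=lambda pair: pair[1])[0]


def analogy(a, b, c, embeddings):
    """Analogy query 'a is to b as c is to ?' using the offset vec(b) - vec(a) + vec(c)."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    return most_similar(target, embeddings, exclude={a, b, c})


# Hypothetical 2-dimensional entity embeddings.
toy = {
    "author_A": [1.0, 0.0],
    "paper_A": [1.0, 1.0],
    "author_B": [3.0, 0.0],
    "paper_B": [3.0, 1.0],
    "venue_X": [0.0, 1.0],
}

print(analogy("author_A", "paper_A", "author_B", toy))  # paper_B
```

The analogy query excludes its three input entities from the candidates, so the answer is always a different entity from the ones named in the question.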
-
Analyzing Knowledge Graph Embedding Methods from a Multi-Embedding Interaction Perspective
Authors:
Hung Nghiep Tran,
Atsuhiro Takasu
Abstract:
The knowledge graph is a popular format for representing knowledge, with many applications to semantic search engines, question-answering systems, and recommender systems. Real-world knowledge graphs are usually incomplete, so knowledge graph embedding methods, such as Canonical decomposition/Parallel factorization (CP), DistMult, and ComplEx, have been proposed to address this issue. These methods represent entities and relations as embedding vectors in semantic space and predict the links between them. The embedding vectors themselves contain rich semantic information and can be used in other applications such as data analysis. However, the mechanisms in these models and the embedding vectors themselves vary greatly, making it difficult to understand and compare them. Given this lack of understanding, we risk using them ineffectively or incorrectly, particularly for complicated models, such as CP, with two role-based embedding vectors, or the state-of-the-art ComplEx model, with complex-valued embedding vectors. In this paper, we propose a multi-embedding interaction mechanism as a new approach to uniting and generalizing these models. We derive them theoretically via this mechanism and provide empirical analyses and comparisons between them. We also propose a new multi-embedding model based on quaternion algebra and show that it achieves promising results using popular benchmarks. Source code is available on GitHub at https://github.com/tranhungnghiep/AnalyzeKGE.
Submitted 25 April, 2023; v1 submitted 27 March, 2019;
originally announced March 2019.
-
A Potential Approach to Overcome Data Limitation in Scientific Publication Recommendation
Authors:
Hung Nghiep Tran,
Tin Huynh,
Kiem Hoang
Abstract:
Data are essential for experiments on scientific publication recommendation methods, but it is difficult to build ground truth data. A naturally promising solution is to use the publications that researchers reference to build their ground truth data. Unfortunately, this approach has not been explored in the literature, so its applicability is still a gap in our knowledge. In this research, we systematically study this approach through theoretical and empirical analyses. In general, the results show that this approach is reasonable and has many advantages. However, the empirical analysis shows both positive and negative results. We conclude that, in some situations, this is a useful alternative approach toward overcoming data limitation. Based on this approach, we build and publish a dataset in the computer science domain to help advance other research.
Submitted 15 October, 2015;
originally announced October 2015.
-
Partitioning Algorithms for Improving Efficiency of Topic Modeling Parallelization
Authors:
Hung Nghiep Tran,
Atsuhiro Takasu
Abstract:
Topic modeling is a very powerful technique in data analysis and data mining but it is generally slow. Many parallelization approaches have been proposed to speed up the learning process. However, they are usually not very efficient because of the many kinds of overhead, especially the load-balancing problem. We address this problem by proposing three partitioning algorithms, which either run more quickly or achieve better load balance than current partitioning algorithms. These algorithms can easily be extended to improve parallelization efficiency on other topic models similar to LDA, e.g., Bag of Timestamps, which is an extension of LDA with time information. We evaluate these algorithms on two popular datasets, NIPS and NYTimes. We also build a dataset containing over 1,000,000 scientific publications in the computer science domain from 1951 to 2010 to experiment with Bag of Timestamps parallelization, which we design to demonstrate the proposed algorithms' extensibility. The results strongly confirm the advantages of these algorithms.
Submitted 14 October, 2015;
originally announced October 2015.
-
SciRecSys: A Recommendation System for Scientific Publication by Discovering Keyword Relationships
Authors:
Vu Le Anh,
Vo Hoang Hai,
Hung Nghiep Tran,
Jason J. Jung
Abstract:
In this work, we propose a new approach for discovering various relationships among keywords across scientific publications based on a Markov Chain model. This is an important problem since keywords are the basic elements for representing abstract objects such as documents, user profiles, topics, and many others. Our model is very effective since it combines four important factors in scientific publications: content, publicity, impact, and randomness. In particular, a recommendation system (called SciRecSys) is presented to help users efficiently find relevant articles.
Submitted 27 February, 2015;
originally announced February 2015.
-
Author Name Disambiguation by Using Deep Neural Network
Authors:
Hung Nghiep Tran,
Tin Huynh,
Tien Do
Abstract:
Author name ambiguity decreases the quality and reliability of information retrieved from digital libraries. Existing methods have tried to solve this problem by predefining a feature set based on expert knowledge for a specific dataset. In this paper, we propose a new approach which uses a deep neural network to learn features automatically from data. Additionally, we propose a general system architecture for author name disambiguation on any dataset. In this research, we evaluate the proposed method on a dataset containing Vietnamese author names. The results show that this method significantly outperforms other methods that use a predefined feature set. The proposed method achieves 99.31% accuracy. The prediction error rate decreases from 1.83% to 0.69%, i.e., by 1.14 percentage points, or 62.3% in relative terms, compared with other methods that use a predefined feature set (Table 3).
Submitted 28 July, 2017; v1 submitted 27 February, 2015;
originally announced February 2015.