Search | arXiv e-print repository

arXiv:2001.04484 [pdf, other]

On the Replicability of Combining Word Embeddings and Retrieval Models

Authors: Luca Papariello, Alexandros Bampoulidis, Mihai Lupu

Abstract: We replicate recent experiments attempting to demonstrate an attractive hypothesis about the use of the Fisher kernel framework and mixture models for aggregating word embeddings towards document representations and the use of these representations in document classification, clustering, and retrieval. Specifically, the hypothesis was that the use of a mixture model of von Mises-Fisher (VMF) distributions instead of Gaussian distributions would be beneficial because of the focus on cosine distances of both VMF and the vector space model traditionally used in information retrieval. Previous experiments had validated this hypothesis. Our replication was not able to validate it, despite a large parameter scan space.

Submitted 13 January, 2020; originally announced January 2020.

arXiv:1902.09897 [pdf, ps, other]

An Abstract View on the De-anonymization Process

Authors: Alexandros Bampoulidis, Mihai Lupu

Abstract: Over the recent years, the availability of datasets containing personal, but anonymized information has been continuously increasing. Extensive research has revealed that such datasets are vulnerable to privacy breaches: being able to reveal sensitive information about individuals through de-anonymization methods. Here, we provide a taxonomy of the research in de-anonymization.

Submitted 26 February, 2019; originally announced February 2019.

arXiv:1711.06196 [pdf, other]

Addressing Cross-Lingual Word Sense Disambiguation on Low-Density Languages: Application to Persian

Authors: Navid Rekabsaz, Mihai Lupu, Allan Hanbury, Andres Duque

Abstract: We explore the use of unsupervised methods in Cross-Lingual Word Sense Disambiguation (CL-WSD) with the application of English to Persian. Our proposed approach targets the languages with scarce resources (low-density) by exploiting word embedding and semantic similarity of the words in context. We evaluate the approach on a recent evaluation benchmark and compare it with the state-of-the-art unsupervised system (CO-Graph). The results show that our approach outperforms both the standard baseline and the CO-Graph system in both of the task evaluation metrics (Out-Of-Five and Best result).

Submitted 21 March, 2018; v1 submitted 16 November, 2017; originally announced November 2017.

arXiv:1707.06598 [pdf, other]

Toward Incorporation of Relevant Documents in word2vec

Authors: Navid Rekabsaz, Bhaskar Mitra, Mihai Lupu, Allan Hanbury

Abstract: Recent advances in neural word embedding provide significant benefit to various information retrieval tasks. However, as shown by recent studies, adapting the embedding models for the needs of IR tasks can bring considerable further improvements. The embedding models in general define the term relatedness by exploiting the terms' co-occurrences in short-window contexts. An alternative (and well-studied) approach in IR for related terms to a query is using local information, i.e. a set of top-retrieved documents. In view of these two methods of term relatedness, in this work, we report our study on incorporating the local information of the query in the word embeddings. One main challenge in this direction is that the dense vectors of word embeddings and their estimation of term-to-term relatedness remain difficult to interpret and hard to analyze. As an alternative, explicit word representations propose vectors whose dimensions are easily interpretable, and recent methods show competitive performance to the dense vectors. We introduce a neural-based explicit representation, rooted in the conceptual ideas of the word2vec Skip-Gram model. The method provides interpretable explicit vectors while keeping the effectiveness of the Skip-Gram model. The evaluation of various explicit representations on word association collections shows that the newly proposed method outperforms the state-of-the-art explicit representations when tasked with ranking highly similar terms. Based on the introduced explicit representation, we discuss our approaches on integrating local documents in globally-trained embedding models and discuss the preliminary results.

Submitted 4 April, 2018; v1 submitted 20 July, 2017; originally announced July 2017.

Comments: Neu-IR Workshop at the ACM Conference on Research and Development in Information Retrieval (NeuIR-SIGIR 2017)

arXiv:1703.05123 [pdf, other]

Character-based Neural Embeddings for Tweet Clustering

Authors: Svitlana Vakulenko, Lyndon Nixon, Mihai Lupu

Abstract: In this paper we show how the performance of tweet clustering can be improved by leveraging character-based neural networks. The proposed approach overcomes the limitations related to the vocabulary explosion in the word-based models and allows for the seamless processing of the multilingual content. Our evaluation results and code are available on-line at https://github.com/vendi12/tweet2vec_clustering

Submitted 16 March, 2017; v1 submitted 15 March, 2017; originally announced March 2017.

Comments: Accepted at the SocialNLP 2017 workshop held in conjunction with EACL 2017, April 3, 2017, Valencia, Spain

arXiv:1702.01978 [pdf, other]

doi 10.18653/v1/P17-1157

Volatility Prediction using Financial Disclosures Sentiments with Word Embedding-based IR Models

Authors: Navid Rekabsaz, Mihai Lupu, Artem Baklanov, Allan Hanbury, Alexander Duer, Linda Anderson

Abstract: Volatility prediction--an essential concept in financial markets--has recently been addressed using sentiment analysis methods. We investigate the sentiment of annual disclosures of companies in stock markets to forecast volatility. We specifically explore the use of recent Information Retrieval (IR) term weighting models that are effectively extended by related terms using word embeddings. In parallel to textual information, factual market data have been widely used as the mainstream approach to forecast market risk. We therefore study different fusion methods to combine text and market data resources. Our word embedding-based approach significantly outperforms state-of-the-art methods. In addition, we investigate the characteristics of the reports of the companies in different financial sectors.

Submitted 28 September, 2017; v1 submitted 7 February, 2017; originally announced February 2017.

arXiv:1606.06086 [pdf, other]

Uncertainty in Neural Network Word Embedding: Exploration of Threshold for Similarity

Authors: Navid Rekabsaz, Mihai Lupu, Allan Hanbury

Abstract: Word embedding, especially with its recent developments, promises a quantification of the similarity between terms. However, it is not clear to what extent this similarity value can be genuinely meaningful and useful for subsequent tasks. We explore how the similarity score obtained from the models is really indicative of term relatedness. We first observe and quantify the uncertainty factor of the word embedding models regarding the similarity value. Based on this factor, we introduce a general threshold on various dimensions which effectively filters the highly related terms. Our evaluation on four information retrieval collections supports the effectiveness of our approach as the results of the introduced threshold are significantly better than the baseline while being equal to or statistically indistinguishable from the optimal results.

Submitted 4 April, 2018; v1 submitted 20 June, 2016; originally announced June 2016.

Comments: Neu-IR Workshop at the ACM Conference on Research and Development in Information Retrieval (NeuIR-SIGIR 2016)

Showing 1–7 of 7 results for author: Lupu, M