Leveraging Wikidata's edit history in knowledge graph refinement tasks

Authors: Alejandro Gonzalez-Hevia, Daniel Gayo-Avello

Abstract: Knowledge graphs have been adopted in many diverse fields for a variety of purposes. Most of those applications rely on valid and complete data to deliver their results, pressing the need to improve the quality of knowledge graphs. A number of solutions have been proposed to that end, ranging from rule-based approaches to the use of probabilistic methods, but there is an element that has not been considered yet: the edit history of the graph. In the case of collaborative knowledge graphs (e.g., Wikidata), those edits represent the process in which the community reaches some kind of fuzzy and distributed consensus over the information that best represents each entity, and can hold potentially interesting information to be used by knowledge graph refinement methods. In this paper, we explore the use of edit history information from Wikidata to improve the performance of type prediction methods. To do that, we have first built a JSON dataset containing the edit history of every instance from the 100 most important classes in Wikidata. This edit history information is then explored and analyzed, with a focus on its potential applicability in knowledge graph refinement tasks. Finally, we propose and evaluate two new methods to leverage this edit history information in knowledge graph embedding models for type prediction tasks. Our results show an improvement in one of the proposed methods against current approaches, showing the potential of using edit information in knowledge graph refinement tasks and opening new promising research lines within the field.

Submitted 27 October, 2022; originally announced October 2022.

Comments: 18 pages, 7 figures. Submitted to the Journal of Web Semantics

ACM Class: H.3; H.4; I.2

arXiv:1611.08144 [pdf, other]

How I Stopped Worrying about the Twitter Archive at the Library of Congress and Learned to Build a Little One for Myself

Authors: Daniel Gayo-Avello

Abstract: Twitter is among the commonest sources of data employed in social media research, mainly because of its convenient APIs to collect tweets. However, most researchers do not have access to the expensive Firehose and Twitter Historical Archive, and they must rely on data collected with free APIs whose representativeness has been questioned. In 2010 the Library of Congress announced an agreement with Twitter to provide researchers access to the whole Twitter Archive. However, such a task proved to be daunting and, at the moment of this writing, no researcher has had the opportunity to access such materials. Still, there have been experiences that proved that smaller searchable archives are feasible and, therefore, amenable for academics to build with relatively few resources. In this paper I describe my efforts to build one such archive, covering the first three years of Twitter (actually from March 2006 to July 2009) and containing 1.48 billion tweets. If you carefully follow my directions you may have your very own little Twitter Historical Archive and you may forget about paying for historical tweets. Please note that to achieve that you should be proficient in some programming language, knowledgeable about Twitter APIs, and have some basic knowledge of ElasticSearch; moreover, you may very well get disappointed by the quality of the contents of the final dataset.

Submitted 24 November, 2016; originally announced November 2016.

Comments: 22 pages, 13 figures

arXiv:1510.00618 [pdf]

Automatic Taxonomy Extraction from Query Logs with no Additional Sources of Information

Authors: Miguel Fernandez-Fernandez, Daniel Gayo-Avello

Abstract: Search engine logs store detailed information on Web users' interactions. Thus, as more and more people use search engines on a daily basis, important trails of users' common knowledge are being recorded in those files. Previous research has shown that it is possible to extract concept taxonomies from full text documents, while other scholars have proposed methods to obtain similar queries from query logs. We propose a mixture of both lines of research, that is, mining query logs not to find related queries nor query hierarchies, but actual term taxonomies that could be used to improve search engine effectiveness and efficiency. As a result, in this study we have developed a method that combines lexical heuristics with a supervised classification model to successfully extract hyponymy relations from specialization search patterns revealed from log missions, with no additional sources of information, and in a language-independent way.

Submitted 5 October, 2015; v1 submitted 2 October, 2015; originally announced October 2015.

Comments: 21 pages, 4 figures, 5 tables. Old (2012) unpublished manuscript

arXiv:1206.5851 [pdf]

doi 10.1177/0894439313493979

A meta-analysis of state-of-the-art electoral prediction from Twitter data

Authors: Daniel Gayo-Avello

Abstract: Electoral prediction from Twitter data is an appealing research topic. It seems relatively straightforward and the prevailing view is overly optimistic. This is problematic because while simple approaches are assumed to be good enough, core problems are not addressed. Thus, this paper aims to (1) provide a balanced and critical review of the state of the art; (2) cast light on the presumed predictive power of Twitter data; and (3) depict a roadmap to push forward the field. Hence, a scheme to characterize Twitter prediction methods is proposed. It covers every aspect from data collection to performance evaluation, through data processing and vote inference. Using that scheme, prior research is analyzed and organized to explain the main approaches taken up to date but also their weaknesses. This is the first meta-analysis of the whole body of research regarding electoral prediction from Twitter data. It reveals that its presumed predictive power regarding electoral prediction has been rather exaggerated: although social media may provide a glimpse of electoral outcomes, current research does not provide strong evidence that it can replace traditional polls. Finally, future lines of research along with a set of requirements they must fulfill are provided.

Submitted 25 June, 2012; originally announced June 2012.

Comments: 19 pages, 3 tables

ACM Class: H.2.8; H.3.5; H.4.3; I.2.7; I.5.4; J.4; K.4.1

Journal ref: Social Science Computer Review, August 23, 2013, 0894439313493979

arXiv:1204.6441 [pdf, ps, other]

"I Wanted to Predict Elections with Twitter and all I got was this Lousy Paper" -- A Balanced Survey on Election Prediction using Twitter Data

Authors: Daniel Gayo-Avello

Abstract: Predicting X from Twitter is a popular fad within the Twitter research subculture. It seems both appealing and relatively easy. Among such studies, electoral prediction is maybe the most attractive, and at this moment there is a growing body of literature on such a topic. This is not only an interesting research problem but, above all, it is extremely difficult. However, most of the authors seem to be more interested in claiming positive results than in providing sound and reproducible methods. It is also especially worrisome that many recent papers seem to only acknowledge those studies supporting the idea of Twitter predicting elections, instead of conducting a balanced literature review showing both sides of the matter. After reading many such papers I have decided to write such a survey myself. Hence, in this paper, every study relevant to the matter of electoral prediction using social media is commented on. From this review it can be concluded that the predictive power of Twitter regarding elections has been greatly exaggerated, and that hard research problems still lie ahead.

Submitted 28 April, 2012; originally announced April 2012.

Comments: 13 pages, no figures. Annotated bibliography of 25 papers regarding electoral prediction from Twitter data

arXiv:1012.5913 [pdf, ps, other]

doi 10.1145/1995966.1995991

All liaisons are dangerous when all your friends are known to us

Authors: Daniel Gayo-Avello

Abstract: Online Social Networks (OSNs) are used by millions of users worldwide. Academically speaking, there is little doubt about the usefulness of demographic studies conducted on OSNs and, hence, methods to label unknown users from small labeled samples are very useful. However, from the general public's point of view, this can be a serious privacy concern. Thus, both topics are tackled in this paper: First, a new algorithm to perform user profiling in social networks is described, and its performance is reported and discussed. Secondly, the experiments --conducted on information usually considered sensitive-- reveal that by just publicizing one's contacts privacy is at risk and, thus, measures to minimize privacy leaks due to social graph data mining are outlined.

Submitted 29 December, 2010; originally announced December 2010.

Comments: 10 pages, 5 tables

ACM Class: G.2.2; I.5.2; K.4.1

arXiv:1012.2057 [pdf, ps, other]

doi 10.1209/0295-5075/94/38001

De retibus socialibus et legibus momenti

Authors: Daniel Gayo-Avello, David J. Brenes, Diego Fernández-Fernández, María E. Fernández-Menéndez, Rodrigo García-Suárez

Abstract: Online Social Networks (OSNs) are a cutting edge topic. Almost everybody --users, marketers, brands, companies, and researchers-- is approaching OSNs to better understand them and take advantage of their benefits. Maybe one of the key concepts underlying OSNs is that of influence, which is highly related, although not entirely identical, to those of popularity and centrality. Influence is, according to Merriam-Webster, "the capacity of causing an effect in indirect or intangible ways". Hence, in the context of OSNs, it has been proposed to analyze the clicks received by promoted URLs in order to check for any positive correlation between the number of visits and different "influence" scores. Such an evaluation methodology is used in this paper to compare a number of those techniques with a new method first described here. That new method is a simple and rather elegant solution which tackles influence in OSNs by applying a physical metaphor.

Submitted 14 February, 2011; v1 submitted 9 December, 2010; originally announced December 2010.

Comments: Changes made for third revision: Brief description of the dataset employed added to Introduction. Minor changes to the description of preparation of the bit.ly datasets. Minor changes to the captions of Tables 1 and 3. Brief addition in the Conclusions section (future line of work added). Added references 16 and 18. Some typos and grammar polished

Journal ref: 2011 EPL 94 38001

arXiv:1005.5516 [pdf, ps, other]

On the Fly Query Entity Decomposition Using Snippets

Authors: David J. Brenes, Daniel Gayo-Avello, Rodrigo Garcia

Abstract: One of the most important issues in Information Retrieval is inferring the intents underlying users' queries. Thus, any tool to enrich or to better contextualize queries can prove extremely valuable. Entity extraction, provided it is done fast, can be one such tool. Such techniques usually rely on a prior training phase involving large datasets. That training is costly, especially in environments which are increasingly moving towards real time scenarios where latency to retrieve fresh information should be minimal. In this paper an `on-the-fly' query decomposition method is proposed. It uses snippets which are mined by means of a naïve statistical algorithm. An initial evaluation of such a method is provided, in addition to a discussion on its applicability to different scenarios.

Submitted 6 June, 2010; v1 submitted 30 May, 2010; originally announced May 2010.

Comments: Extended version of paper submitted to CERI 2010

arXiv:1004.0816 [pdf]

doi 10.1016/j.ipm.2013.06.003

Nepotistic Relationships in Twitter and their Impact on Rank Prestige Algorithms

Authors: Daniel Gayo-Avello

Abstract: Micro-blogging services such as Twitter allow anyone to publish anything, anytime. Needless to say, many of the available contents can be dismissed as babble or spam. However, given the number and diversity of users, some valuable pieces of information should arise from the stream of tweets. Thus, such services can develop into valuable sources of up-to-date information (the so-called real-time web) provided a way to find the most relevant/trustworthy/authoritative users is available. Hence, this makes a highly pertinent question for which graph centrality methods can provide an answer. In this paper the author offers a comprehensive survey of feasible algorithms for ranking users in social networks, he examines their vulnerabilities to linking malpractice in such networks, and suggests an objective criterion against which to compare such algorithms. Additionally, he suggests a first step towards "desensitizing" prestige algorithms against cheating by spammers and other abusive users.

Submitted 18 October, 2012; v1 submitted 6 April, 2010; originally announced April 2010.

Comments: 40 pages, 17 tables, 14 figures. Paper has been restructured, new section "3.2. The importance of reciprocal linking in Twitter spam" was added, experiments with verified accounts in addition to spammers have been conducted to show performance with relevant users and not only regarding spam demotion

Journal ref: Information Processing & Management Volume 49, Issue 6, November 2013, Pages 1250-1280

arXiv:0911.3979 [pdf]

Making the road by searching - A search engine based on Swarm Information Foraging

Authors: Daniel Gayo-Avello, David J. Brenes

Abstract: Search engines are nowadays one of the most important entry points for Internet users and a central tool to solve most of their information needs. Still, there exists a substantial number of user searches which obtain unsatisfactory results. Needless to say, several lines of research aim to increase the relevancy of the results users retrieve. In this paper the authors frame this problem within the much broader (and older) one of information overload. They argue that users' dissatisfaction with search engines is a currently common manifestation of such a problem, and propose a different angle from which to tackle it. As will be discussed, their approach shares goals with a current hot research topic (namely, learning to rank for information retrieval) but, unlike the techniques commonly applied in that field, their technique cannot be exactly considered machine learning and, additionally, it can be used to change the search engine's response in real time, driven by the users' behavior. Their proposal adapts concepts from Swarm Intelligence (in particular, Ant Algorithms) from an Information Foraging point of view. It will be shown that the technique is not only feasible, but also an elegant solution to the stated problem; what's more, it achieves promising results, both increasing the performance of a major search engine for informational queries, and substantially reducing the time users require to answer complex information needs.

Submitted 20 November, 2009; originally announced November 2009.

arXiv:cs/0411074 [pdf]

Building Chinese Lexicons from Scratch by Unsupervised Short Document Self-Segmentation

Authors: Daniel Gayo-Avello

Abstract: Chinese text segmentation is a well-known and difficult problem. On the one hand, there is no simple notion of "word" in the Chinese language, making it really hard to implement rule-based systems to segment written texts; thus lexicons and statistical information are usually employed to achieve such a task. On the other hand, any piece of Chinese text usually includes segments present neither in the lexicons nor in the training data. Even worse, such unseen sequences can be segmented into a number of totally unrelated words, making later processing phases difficult. For instance, using a lexicon-based system the sequence ???(Baluozuo, Barroso, current president-designate of the European Commission) can be segmented into ?(ba, to hope, to wish) and ??(luozuo, an undefined word), changing completely the meaning of the sentence. A new and extremely simple algorithm specially suited to work over short Chinese documents is introduced. This new algorithm performs text "self-segmentation", producing results comparable to those achieved by native speakers without using either lexicons or any statistical information beyond that obtained from the input text. Furthermore, it is really robust for finding new "words", especially proper nouns, and it is well suited to build lexicons from scratch. Some preliminary results are provided in addition to examples of its employment.

Submitted 19 November, 2004; originally announced November 2004.

Comments: 9 pages, 3 figures, 2 tables

Showing 1–11 of 11 results for author: Gayo-Avello, D