Prediction of Reposting on X

Authors: Ziming Xu, Shi Zhou, Vasileios Lampos, Ingemar J. Cox

Abstract: There have been considerable efforts to predict a user's reposting behaviour on X (formerly Twitter) using machine learning models. The problem has previously been cast as a supervised classification task, where Tweets are randomly assigned to a test or training set. The random assignment helps to ensure that the test and training sets are drawn from the same distribution. In practice, we would like to predict users' reposting behaviour for a set of messages related to a new, previously unseen, topic (defined by a hashtag). In this case, the problem becomes an out-of-distribution generalisation classification task. Experimental results reveal that while existing algorithms, which predominantly use features derived from the content of Tweet messages, perform well when the training and test distributions are the same, these algorithms perform much worse when the test set is out of distribution. We then show that if the message features are supplemented or replaced with features derived from users' profile and past behaviour, the out-of-distribution prediction is greatly improved, with the F1 score increasing from 0.24 to 0.70. Our experimental results suggest that a significant component of reposting behaviour can be predicted based on users' profile and past behaviour, and is independent of the content of messages.

Submitted 21 May, 2025; originally announced May 2025.

arXiv:2401.15061 [pdf, other]

Digital-analog hybrid matrix multiplication processor for optical neural networks

Authors: Xiansong Meng, Deming Kong, Kwangwoong Kim, Qiuchi Li, Po Dong, Ingemar J. Cox, Christina Lioma, Hao Hu

Abstract: The computational demands of modern AI have spurred interest in optical neural networks (ONNs) which offer the potential benefits of increased speed and lower power consumption. However, current ONNs face various challenges, most significantly a limited calculation precision (typically around 4 bits) and the requirement for high-resolution signal format converters (digital-to-analogue conversions (DACs) and analogue-to-digital conversions (ADCs)). These challenges are inherent to their analog computing nature and pose significant obstacles in practical implementation. Here, we propose a digital-analog hybrid optical computing architecture for ONNs, which utilizes digital optical inputs in the form of binary words. By introducing the logic levels and decisions based on thresholding, the calculation precision can be significantly enhanced. The DACs for input data can be removed and the resolution of the ADCs can be greatly reduced. This can increase the operating speed at a high calculation precision and facilitate the compatibility with microelectronics. To validate our approach, we have fabricated a proof-of-concept photonic chip and built up a hybrid optical processor (HOP) system for neural network applications. We have demonstrated an unprecedented 16-bit calculation precision for high-definition image processing, with a pixel error rate (PER) as low as $1.8\times10^{-3}$ at a signal-to-noise ratio (SNR) of 18.2 dB. We have also implemented a convolutional neural network for handwritten digit recognition that shows the same accuracy as the one achieved by a desktop computer. The concept of the digital-analog hybrid optical computing architecture offers a methodology that could potentially be applied to various ONN implementations and may intrigue new research into efficient and accurate domain-specific optical computing architectures for neural networks.

Submitted 26 January, 2024; originally announced January 2024.

arXiv:2212.09306 [pdf, other]

E-NER -- An Annotated Named Entity Recognition Corpus of Legal Text

Authors: Ting Wai Terence Au, Ingemar J. Cox, Vasileios Lampos

Abstract: Identifying named entities such as a person, location or organization, in documents can highlight key information to readers. Training Named Entity Recognition (NER) models requires an annotated data set, which can be a time-consuming labour-intensive task. Nevertheless, there are publicly available NER data sets for general English. Recently there has been interest in developing NER for legal text. However, prior work and experimental results reported here indicate that there is a significant degradation in performance when NER methods trained on a general English data set are applied to legal text. We describe a publicly available legal NER data set, called E-NER, based on legal company filings available from the US Securities and Exchange Commission's EDGAR data set. Training a number of different NER algorithms on the general English CoNLL-2003 corpus but testing on our test collection confirmed significant degradations in accuracy, as measured by the F1-score, of between 29.4% and 60.4%, compared to training and testing on the E-NER collection.

Submitted 19 December, 2022; originally announced December 2022.

Comments: 5 pages, 3 figures, submitted to NLLP workshop in EMNLP 2022

arXiv:2105.12433 [pdf, other]

Estimating the Uncertainty of Neural Network Forecasts for Influenza Prevalence Using Web Search Activity

Authors: Michael Morris, Peter Hayes, Ingemar J. Cox, Vasileios Lampos

Abstract: Influenza is an infectious disease with the potential to become a pandemic, and hence, forecasting its prevalence is an important undertaking for planning an effective response. Research has found that web search activity can be used to improve influenza models. Neural networks (NN) can provide state-of-the-art forecasting accuracy but do not commonly incorporate uncertainty in their estimates, something essential for using them effectively during decision making. In this paper, we demonstrate how Bayesian Neural Networks (BNNs) can be used to both provide a forecast and a corresponding uncertainty without significant loss in forecasting accuracy compared to traditional NNs. Our method accounts for two sources of uncertainty: data and model uncertainty, arising due to measurement noise and model specification, respectively. Experiments are conducted using 14 years of data for England, assessing the model's accuracy over the last 4 flu seasons in this dataset. We evaluate the performance of different models including competitive baselines with conventional metrics as well as error functions that incorporate uncertainty estimates. Our empirical analysis indicates that considering both sources of uncertainty simultaneously is superior to considering either one separately. We also show that a BNN with recurrent layers that models both sources of uncertainty yields superior accuracy for these metrics for forecasting horizons greater than 7 days.

Submitted 26 May, 2021; originally announced May 2021.

arXiv:2007.11821 [pdf, other]

Providing early indication of regional anomalies in COVID19 case counts in England using search engine queries

Authors: Elad Yom-Tov, Vasileios Lampos, Ingemar J. Cox, Michael Edelstein

Abstract: COVID19 was first reported in England at the end of January 2020, and by mid-June over 150,000 cases were reported. We assume that, similarly to influenza-like illnesses, people who suffer from COVID19 may query for their symptoms prior to accessing the medical system (or in lieu of it). Therefore, we analyzed searches to Bing from users in England, identifying cases where unexpected rises in relevant symptom searches occurred at specific areas of the country. Our analysis shows that searches for "fever" and "cough" were the most correlated with future case counts, with searches preceding case counts by 16-17 days. Unexpected rises in search patterns were predictive of future case counts multiplying by 2.5 or more within a week, reaching an Area Under Curve (AUC) of 0.64. Similar rises in mortality were predicted with an AUC of approximately 0.61 at a lead time of 3 weeks. Thus, our metric provided Public Health England with an indication which could be used to plan the response to COVID19 and could possibly be utilized to detect regional anomalies of other pathogens.

Submitted 23 July, 2020; originally announced July 2020.

arXiv:2007.02603 [pdf]

Go local: The key to controlling the COVID-19 pandemic in the post lockdown era

Authors: Isabel Bennett, Jobie Budd, Erin M. Manning, Ed Manley, Mengdie Zhuang, Ingemar J. Cox, Michael Short, Anne M. Johnson, Deenan Pillay, Rachel A. McKendry

Abstract: The UK government announced its first wave of lockdown easing on 10 May 2020, two months after the non-pharmaceutical measures to reduce the spread of COVID-19 were first introduced on 23 March 2020. Analysis of reported case rate data from Public Health England and aggregated and anonymised crowd level mobility data shows variability across local authorities in the UK. A locality-based approach to lockdown easing is needed, enabling local public health and associated health and social care services to rapidly respond to emerging hotspots of infection. National level data will hide an increasing heterogeneity of COVID-19 infections and mobility, and new ways of real-time data presentation to the public are required. Data sources (including mobile) allow for faster visualisation than more traditional data sources, and are part of a wider trend towards near real-time analysis of outbreaks needed for timely, targeted local public health interventions. Real time data visualisation may give early warnings of unusual levels of activity which warrant further investigation by local public health authorities.

Submitted 6 July, 2020; originally announced July 2020.

Comments: 6 pages, 3 figures

arXiv:2003.08086 [pdf, other]

doi 10.1038/s41746-021-00384-w

Tracking COVID-19 using online search

Authors: Vasileios Lampos, Maimuna S. Majumder, Elad Yom-Tov, Michael Edelstein, Simon Moura, Yohhei Hamada, Molebogeng X. Rangaka, Rachel A. McKendry, Ingemar J. Cox

Abstract: Previous research has demonstrated that various properties of infectious diseases can be inferred from online search behaviour. In this work we use time series of online search query frequencies to gain insights about the prevalence of COVID-19 in multiple countries. We first develop unsupervised modelling techniques based on associated symptom categories identified by the United Kingdom's National Health Service and Public Health England. We then attempt to minimise an expected bias in these signals caused by public interest -- as opposed to infections -- using the proportion of news media coverage devoted to COVID-19 as a proxy indicator. Our analysis indicates that models based on online searches precede the reported confirmed cases and deaths by 16.7 (10.2 - 23.2) and 22.1 (17.4 - 26.9) days, respectively. We also investigate transfer learning techniques for mapping supervised models from countries where the spread of disease has progressed extensively to countries that are in earlier phases of their respective epidemic curves. Furthermore, we compare time series of online search activity against confirmed COVID-19 cases or deaths jointly across multiple countries, uncovering interesting querying patterns, including the finding that rarer symptoms are better predictors than common ones. Finally, we show that web searches improve the short-term forecasting accuracy of autoregressive models for COVID-19 deaths. Our work provides evidence that online search data can be used to develop complementary public health surveillance methods to help inform the COVID-19 response in conjunction with more established approaches.

Submitted 10 February, 2021; v1 submitted 18 March, 2020; originally announced March 2020.

Comments: Published in Nature Digital Medicine. Please note that the published version differs from this preprint

Journal ref: Nature Digital Medicine 4, 17 (2021)

arXiv:1802.06833 [pdf, other]

Seasonal Web Search Query Selection for Influenza-Like Illness (ILI) Estimation

Authors: Niels Dalum Hansen, Kåre Mølbak, Ingemar J. Cox, Christina Lioma

Abstract: Influenza-like illness (ILI) estimation from web search data is an important web analytics task. The basic idea is to use the frequencies of queries in web search logs that are correlated with past ILI activity as features when estimating current ILI activity. It has been noted that since influenza is seasonal, this approach can lead to spurious correlations with features/queries that also exhibit seasonality, but have no relationship with ILI. Spurious correlations can, in turn, degrade performance. To address this issue, we propose modeling the seasonal variation in ILI activity and selecting queries that are correlated with the residual of the seasonal model and the observed ILI signal. Experimental results show that re-ranking queries obtained by Google Correlate based on their correlation with the residual strongly favours ILI-related queries.

Submitted 19 February, 2018; originally announced February 2018.

arXiv:1702.07326 [pdf, other]

Time-Series Adaptive Estimation of Vaccination Uptake Using Web Search Queries

Authors: Niels Dalum Hansen, Kåre Mølbak, Ingemar J. Cox, Christina Lioma

Abstract: Estimating vaccination uptake is an integral part of ensuring public health. It was recently shown that vaccination uptake can be estimated automatically from web data, instead of slowly collected clinical records or population surveys. All prior work in this area assumes that features of vaccination uptake collected from the web are temporally regular. We present the first ever method to remove this assumption from vaccination uptake estimation: our method dynamically adapts to temporal fluctuations in time series web data used to estimate vaccination uptake. We show our method to outperform the state of the art compared to competitive baselines that use not only web data but also curated clinical data. This performance improvement is more pronounced for vaccines whose uptake has been irregular due to negative media attention (HPV-1 and HPV-2), problems in vaccine supply (DiTeKiPol), and targeted at children of 12 years old (whose vaccination is more irregular compared to younger children).

Submitted 23 February, 2017; originally announced February 2017.

arXiv:1608.06253 [pdf, other]

Multi-Dueling Bandits and Their Application to Online Ranker Evaluation

Authors: Brian Brost, Yevgeny Seldin, Ingemar J. Cox, Christina Lioma

Abstract: New ranking algorithms are continually being developed and refined, necessitating the development of efficient methods for evaluating these rankers. Online ranker evaluation focuses on the challenge of efficiently determining, from implicit user feedback, which ranker out of a finite set of rankers is the best. Online ranker evaluation can be modeled by dueling bandits, a mathematical model for online learning under limited feedback from pairwise comparisons. Comparison of pairs of rankers is performed by interleaving their result sets and examining which documents users click on. The dueling bandits model addresses the key issue of which pair of rankers to compare at each iteration, thereby providing a solution to the exploration-exploitation trade-off. Recently, methods for simultaneously comparing more than two rankers have been developed. However, the question of which rankers to compare at each iteration was left open. We address this question by proposing a generalization of the dueling bandits model that uses simultaneous comparisons of an unrestricted number of rankers. We evaluate our algorithm on synthetic data and several standard large-scale online ranker evaluation datasets. Our experimental results show that the algorithm yields orders of magnitude improvement in performance compared to state-of-the-art dueling bandit algorithms.

Submitted 22 August, 2016; originally announced August 2016.

arXiv:1608.00788 [pdf, other]

An Improved Multileaving Algorithm for Online Ranker Evaluation

Authors: Brian Brost, Ingemar J. Cox, Yevgeny Seldin, Christina Lioma

Abstract: Online ranker evaluation is a key challenge in information retrieval. An important task in the online evaluation of rankers is using implicit user feedback for inferring preferences between rankers. Interleaving methods have been found to be efficient and sensitive, i.e. they can quickly detect even small differences in quality. It has recently been shown that multileaving methods exhibit similar sensitivity but can be more efficient than interleaving methods. This paper presents empirical results demonstrating that existing multileaving methods either do not scale well with the number of rankers, or, more problematically, can produce results which substantially differ from evaluation measures like NDCG. The latter problem is caused by the fact that they do not correctly account for the similarities that can occur between rankers being multileaved. We propose a new multileaving method for handling this problem and demonstrate that it substantially outperforms existing methods, in some cases reducing errors by as much as 50%.

Submitted 2 August, 2016; originally announced August 2016.

arXiv:1409.7291 [pdf, other]

doi 10.1038/srep09924

Optimizing Hybrid Spreading in Metapopulations

Authors: Changwang Zhang, Shi Zhou, Joel C. Miller, Ingemar J. Cox, Benjamin M. Chain

Abstract: Epidemic spreading phenomena are ubiquitous in nature and society. Examples include the spreading of diseases, information, and computer viruses. Epidemics can spread by local spreading, where infected nodes can only infect a limited set of direct target nodes and global spreading, where an infected node can infect every other node. In reality, many epidemics spread using a hybrid mixture of both types of spreading. In this study we develop a theoretical framework for studying hybrid epidemics, and examine the optimum balance between spreading mechanisms in terms of achieving the maximum outbreak size. We show the existence of critically hybrid epidemics where neither spreading mechanism alone can cause a noticeable spread but a combination of the two spreading mechanisms would produce an enormous outbreak. Our results provide new strategies for maximising beneficial epidemics and estimating the worst outcome of damaging hybrid epidemics.

Submitted 31 March, 2015; v1 submitted 25 September, 2014; originally announced September 2014.

Journal ref: Scientific Reports. 2015 Apr 29;5:9924

arXiv:1307.4980 [pdf, other]

Multi-keyword multi-click advertisement option contracts for sponsored search

Authors: Bowei Chen, Jun Wang, Ingemar J. Cox, Mohan S. Kankanhalli

Abstract: In sponsored search, advertisement (abbreviated ad) slots are usually sold by a search engine to an advertiser through an auction mechanism in which advertisers bid on keywords. In theory, auction mechanisms have many desirable economic properties. However, keyword auctions have a number of limitations including: the uncertainty in payment prices for advertisers; the volatility in the search engine's revenue; and the weak loyalty between advertiser and search engine. In this paper we propose a special ad option that alleviates these problems. In our proposal, an advertiser can purchase an option from a search engine in advance by paying an upfront fee, known as the option price. He then has the right, but no obligation, to purchase among the pre-specified set of keywords at the fixed cost-per-clicks (CPCs) for a specified number of clicks in a specified period of time. The proposed option is closely related to a special exotic option in finance that contains multiple underlying assets (multi-keyword) and is also multi-exercisable (multi-click). This novel structure has many benefits: advertisers can have reduced uncertainty in advertising; the search engine can improve the advertisers' loyalty as well as obtain a stable and increased expected revenue over time. Since the proposed ad option can be implemented in conjunction with the existing keyword auctions, the option price and corresponding fixed CPCs must be set such that there is no arbitrage between the two markets. Option pricing methods are discussed and our experimental results validate the development. Compared to keyword auctions, a search engine can have an increased expected revenue by selling an ad option.

Submitted 9 December, 2015; v1 submitted 18 July, 2013; originally announced July 2013.

Comments: Chen, Bowei and Wang, Jun and Cox, Ingemar J. and Kankanhalli, Mohan S. (2015) Multi-keyword multi-click advertisement option contracts for sponsored search. ACM Transactions on Intelligent Systems and Technology, 7 (1). pp. 1-29. ISSN: 2157-6904

arXiv:1303.3229 [pdf, other]

doi 10.1016/j.ijmedinf.2013.01.005

FindZebra: A search engine for rare diseases

Authors: Radu Dragusin, Paula Petcu, Christina Lioma, Birger Larsen, Henrik L. Jørgensen, Ingemar J. Cox, Lars Kai Hansen, Peter Ingwersen, Ole Winther

Abstract: Background: The web has become a primary information resource about illnesses and treatments for both medical and non-medical users. Standard web search is by far the most common interface for such information. It is therefore of interest to find out how well web search engines work for diagnostic queries and what factors contribute to successes and failures. Among diseases, rare (or orphan) diseases represent an especially challenging and thus interesting class to diagnose as each is rare, diverse in symptoms and usually has scattered resources associated with it. Methods: We use an evaluation approach for web search engines for rare disease diagnosis which includes 56 real life diagnostic cases, state-of-the-art evaluation measures, and curated information resources. In addition, we introduce FindZebra, a specialized (vertical) rare disease search engine. FindZebra is powered by open source search technology and uses curated freely available online medical information. Results: FindZebra outperforms Google Search in both default setup and customised to the resources used by FindZebra. We extend FindZebra with specialized functionalities exploiting medical ontological information and UMLS medical concepts to demonstrate different ways of displaying the retrieved results to medical experts. Conclusions: Our results indicate that a specialized search engine can improve the diagnostic quality without compromising the ease of use of the currently widely popular web search engines. The proposed evaluation approach can be valuable for future development and benchmarking. The FindZebra search engine is available at http://www.findzebra.com/.

Submitted 13 March, 2013; originally announced March 2013.

Journal ref: International Journal of Medical Informatics, Available online 23 February 2013, ISSN 1386-5056

arXiv:0903.0687 [pdf, ps, other]

doi 10.1007/978-3-319-54241-6_1

Second-Order Assortative Mixing in Social Networks

Authors: Shi Zhou, Ingemar J. Cox, Lars K. Hansen

Abstract: In a social network, the number of links of a node, or node degree, is often assumed as a proxy for the node's importance or prominence within the network. It is known that social networks exhibit the (first-order) assortative mixing, i.e. if two nodes are connected, they tend to have similar node degrees, suggesting that people tend to mix with those of comparable prominence. In this paper, we report the second-order assortative mixing in social networks. If two nodes are connected, we measure the degree correlation between their most prominent neighbours, rather than between the two nodes themselves. We observe very strong second-order assortative mixing in social networks, often significantly stronger than the first-order assortative mixing. This suggests that if two people interact in a social network, then the importance of the most prominent person each knows is very likely to be the same. This is also true if we measure the average prominence of neighbours of the two people. This property is weaker or negative in non-social networks. We investigate a number of possible explanations for this property. However, none of them was found to provide an adequate explanation. We therefore conclude that second-order assortative mixing is a new property of social networks.

Submitted 23 October, 2017; v1 submitted 3 March, 2009; originally announced March 2009.

Comments: Cite as: Zhou S., Cox I.J., Hansen L.K. (2017) Second-Order Assortative Mixing in Social Networks. In: Goncalves B., Menezes R., Sinatra R., Zlatic V. (eds) Complex Networks VIII. CompleNet 2017. Springer Proceedings in Complexity. Springer, Cham. https://doi.org/10.1007/978-3-319-54241-6_1

arXiv:cs/0703043 [pdf, other]

A Comparison of On-Line Computer Science Citation Databases

Authors: Vaclav Petricek, Ingemar J. Cox, Hui Han, Isaac G. Councill, C. Lee Giles

Abstract: This paper examines the difference and similarities between the two on-line computer science citation databases DBLP and CiteSeer. The database entries in DBLP are inserted manually while the CiteSeer entries are obtained autonomously via a crawl of the Web and automatic processing of user submissions. CiteSeer's autonomous citation database can be considered a form of self-selected on-line survey. It is important to understand the limitations of such databases, particularly when citation information is used to assess the performance of authors, institutions and funding bodies. We show that the CiteSeer database contains considerably fewer single author papers. This bias can be modeled by an exponential process with intuitive explanation. The model permits us to predict that the DBLP database covers approximately 24% of the entire literature of Computer Science. CiteSeer is also biased against low-cited papers. Despite their difference, both databases exhibit similar and significantly different citation distributions compared with previous analysis of the Physics community. In both databases, we also observe that the number of authors per paper has been increasing over time.

Submitted 9 March, 2007; originally announced March 2007.

Comments: ECDL 2005

ACM Class: H.3.7

Showing 1–16 of 16 results for author: Cox, I J