-
Prediction of Reposting on X
Authors:
Ziming Xu,
Shi Zhou,
Vasileios Lampos,
Ingemar J. Cox
Abstract:
There have been considerable efforts to predict a user's reposting behaviour on X (formerly Twitter) using machine learning models. The problem has previously been cast as a supervised classification task, where Tweets are randomly assigned to a test or training set. The random assignment helps to ensure that the test and training sets are drawn from the same distribution. In practice, we would like to predict users' reposting behaviour for a set of messages related to a new, previously unseen, topic (defined by a hashtag). In this case, the problem becomes an out-of-distribution generalisation classification task.
Experimental results reveal that while existing algorithms, which predominantly use features derived from the content of Tweet messages, perform well when the training and test distributions are the same, these algorithms perform much worse when the test set is out of distribution. We then show that if the message features are supplemented or replaced with features derived from users' profile and past behaviour, the out-of-distribution prediction is greatly improved, with the F1 score increasing from 0.24 to 0.70. Our experimental results suggest that a significant component of reposting behaviour can be predicted based on users' profile and past behaviour, and is independent of the content of messages.
Submitted 21 May, 2025;
originally announced May 2025.
-
Sonnet: Spectral Operator Neural Network for Multivariable Time Series Forecasting
Authors:
Yuxuan Shu,
Vasileios Lampos
Abstract:
Multivariable time series forecasting methods can integrate information from exogenous variables, leading to significant prediction accuracy gains. The transformer architecture has been widely applied in various time series forecasting models due to its ability to capture long-range sequential dependencies. However, a naïve application of transformers often struggles to effectively model complex relationships among variables over time. To mitigate this, we propose a novel architecture, namely the Spectral Operator Neural Network (Sonnet). Sonnet applies learnable wavelet transformations to the input and incorporates spectral analysis using the Koopman operator. Its predictive skill relies on the Multivariable Coherence Attention (MVCA), an operation that leverages spectral coherence to model variable dependencies. Our empirical analysis shows that Sonnet yields the best performance on 34 out of 47 forecasting tasks with an average mean absolute error (MAE) reduction of 1.1% against the most competitive baseline (different per task). We further show that MVCA, when put in place of the naïve attention used in various deep learning models, can remedy its deficiencies, reducing MAE by 10.7% on average in the most challenging forecasting tasks.
Submitted 21 May, 2025;
originally announced May 2025.
-
Machine-generated text detection prevents language model collapse
Authors:
George Drayson,
Emine Yilmaz,
Vasileios Lampos
Abstract:
As Large Language Models (LLMs) become increasingly prevalent, their generated outputs are proliferating across the web, risking a future where machine-generated content dilutes human-authored text. Since online data is the primary resource for LLM pre-training, subsequent models could be trained on an unknown portion of synthetic samples. This will lead to model collapse, a degenerative process whereby LLMs reinforce their own errors, converge to a low variance output distribution, and ultimately yield a declining performance. In this study, we investigate the impact of decoding strategy on model collapse, analysing the text characteristics at each model generation, the similarity to human references, and the resulting model performance. Using the decoding strategies that lead to the most significant degradation, we evaluate model collapse in more realistic scenarios where the origin of the data (human or synthetic) is unknown. We train a machine-generated text detector and propose an importance sampling approach to alleviate model collapse. Our method is validated on two LLM variants (GPT-2 and SmolLM2), across a range of model sizes (124M to 1.7B), on the open-ended text generation task. We demonstrate that it can not only prevent model collapse but also improve performance when sufficient human-authored samples are present. Source code: github.com/GeorgeDrayson/model_collapse.
Submitted 19 May, 2025; v1 submitted 21 February, 2025;
originally announced February 2025.
-
DeformTime: Capturing Variable Dependencies with Deformable Attention for Time Series Forecasting
Authors:
Yuxuan Shu,
Vasileios Lampos
Abstract:
In multivariable time series (MTS) forecasting, existing state-of-the-art deep learning approaches tend to focus on autoregressive formulations and often overlook the potential of using exogenous variables in enhancing the prediction of the target endogenous variable. To address this limitation, we present DeformTime, a neural network architecture that attempts to capture correlated temporal patterns from the input space, and hence, improve forecasting accuracy. It deploys two core operations performed by deformable attention blocks (DABs): learning dependencies across variables from different time steps (variable DAB), and preserving temporal dependencies in data from previous time steps (temporal DAB). Input data transformation is explicitly designed to enhance learning from the deformed series of information while passing through a DAB. We conduct extensive experiments on 6 MTS data sets, using previously established benchmarks as well as challenging infectious disease modelling tasks with more exogenous variables. The results demonstrate that DeformTime improves accuracy against previous competitive methods across the vast majority of MTS forecasting tasks, reducing the mean absolute error by 7.2% on average. Notably, performance gains remain consistent across longer forecasting horizons.
Submitted 1 April, 2025; v1 submitted 11 June, 2024;
originally announced June 2024.
-
Unsupervised hard Negative Augmentation for contrastive learning
Authors:
Yuxuan Shu,
Vasileios Lampos
Abstract:
We present Unsupervised hard Negative Augmentation (UNA), a method that generates synthetic negative instances based on the term frequency-inverse document frequency (TF-IDF) retrieval model. UNA uses TF-IDF scores to ascertain the perceived importance of terms in a sentence and then produces negative samples by replacing terms according to that importance. Our experiments demonstrate that models trained with UNA improve the overall performance in semantic textual similarity tasks. Additional performance gains are obtained when combining UNA with the paraphrasing augmentation. Further results show that our method is compatible with different backbone models. Ablation studies also support the choice of having a TF-IDF-driven control on negative augmentation.
Submitted 4 January, 2024;
originally announced January 2024.
-
E-NER -- An Annotated Named Entity Recognition Corpus of Legal Text
Authors:
Ting Wai Terence Au,
Ingemar J. Cox,
Vasileios Lampos
Abstract:
Identifying named entities, such as a person, location or organization, in documents can highlight key information to readers. Training Named Entity Recognition (NER) models requires an annotated data set, the creation of which can be a time-consuming, labour-intensive task. Nevertheless, there are publicly available NER data sets for general English. Recently there has been interest in developing NER for legal text. However, prior work and experimental results reported here indicate that there is a significant degradation in performance when NER methods trained on a general English data set are applied to legal text. We describe a publicly available legal NER data set, called E-NER, based on legal company filings available from the US Securities and Exchange Commission's EDGAR data set. Training a number of different NER algorithms on the general English CoNLL-2003 corpus but testing on our test collection confirmed significant degradations in accuracy, as measured by the F1-score, of between 29.4% and 60.4%, compared to training and testing on the E-NER collection.
Submitted 19 December, 2022;
originally announced December 2022.
-
Estimating the Uncertainty of Neural Network Forecasts for Influenza Prevalence Using Web Search Activity
Authors:
Michael Morris,
Peter Hayes,
Ingemar J. Cox,
Vasileios Lampos
Abstract:
Influenza is an infectious disease with the potential to become a pandemic, and hence, forecasting its prevalence is an important undertaking for planning an effective response. Research has found that web search activity can be used to improve influenza models. Neural networks (NN) can provide state-of-the-art forecasting accuracy but do not commonly incorporate uncertainty in their estimates, something essential for using them effectively during decision making. In this paper, we demonstrate how Bayesian Neural Networks (BNNs) can be used to both provide a forecast and a corresponding uncertainty without significant loss in forecasting accuracy compared to traditional NNs. Our method accounts for two sources of uncertainty: data and model uncertainty, arising due to measurement noise and model specification, respectively. Experiments are conducted using 14 years of data for England, assessing the model's accuracy over the last 4 flu seasons in this dataset. We evaluate the performance of different models including competitive baselines with conventional metrics as well as error functions that incorporate uncertainty estimates. Our empirical analysis indicates that considering both sources of uncertainty simultaneously is superior to considering either one separately. We also show that a BNN with recurrent layers that models both sources of uncertainty yields superior accuracy for these metrics for forecasting horizons greater than 7 days.
Submitted 26 May, 2021;
originally announced May 2021.
-
Providing early indication of regional anomalies in COVID19 case counts in England using search engine queries
Authors:
Elad Yom-Tov,
Vasileios Lampos,
Ingemar J. Cox,
Michael Edelstein
Abstract:
COVID19 was first reported in England at the end of January 2020, and by mid-June over 150,000 cases were reported. We assume that, similarly to influenza-like illnesses, people who suffer from COVID19 may query for their symptoms prior to accessing the medical system (or in lieu of it). Therefore, we analyzed searches to Bing from users in England, identifying cases where unexpected rises in relevant symptom searches occurred at specific areas of the country. Our analysis shows that searches for "fever" and "cough" were the most correlated with future case counts, with searches preceding case counts by 16-17 days. Unexpected rises in search patterns were predictive of future case counts multiplying by 2.5 or more within a week, reaching an Area Under Curve (AUC) of 0.64. Similar rises in mortality were predicted with an AUC of approximately 0.61 at a lead time of 3 weeks. Thus, our metric provided Public Health England with an indication which could be used to plan the response to COVID19 and could possibly be utilized to detect regional anomalies of other pathogens.
Submitted 23 July, 2020;
originally announced July 2020.
-
Tracking COVID-19 using online search
Authors:
Vasileios Lampos,
Maimuna S. Majumder,
Elad Yom-Tov,
Michael Edelstein,
Simon Moura,
Yohhei Hamada,
Molebogeng X. Rangaka,
Rachel A. McKendry,
Ingemar J. Cox
Abstract:
Previous research has demonstrated that various properties of infectious diseases can be inferred from online search behaviour. In this work we use time series of online search query frequencies to gain insights about the prevalence of COVID-19 in multiple countries. We first develop unsupervised modelling techniques based on associated symptom categories identified by the United Kingdom's National Health Service and Public Health England. We then attempt to minimise an expected bias in these signals caused by public interest -- as opposed to infections -- using the proportion of news media coverage devoted to COVID-19 as a proxy indicator. Our analysis indicates that models based on online searches precede the reported confirmed cases and deaths by 16.7 (10.2 - 23.2) and 22.1 (17.4 - 26.9) days, respectively. We also investigate transfer learning techniques for mapping supervised models from countries where the spread of disease has progressed extensively to countries that are in earlier phases of their respective epidemic curves. Furthermore, we compare time series of online search activity against confirmed COVID-19 cases or deaths jointly across multiple countries, uncovering interesting querying patterns, including the finding that rarer symptoms are better predictors than common ones. Finally, we show that web searches improve the short-term forecasting accuracy of autoregressive models for COVID-19 deaths. Our work provides evidence that online search data can be used to develop complementary public health surveillance methods to help inform the COVID-19 response in conjunction with more established approaches.
Submitted 10 February, 2021; v1 submitted 18 March, 2020;
originally announced March 2020.
-
Assessing public health interventions using Web content
Authors:
Vasileios Lampos
Abstract:
Public health interventions are a fundamental tool for mitigating the spread of an infectious disease. However, it is not always possible to obtain a conclusive estimate for the impact of an intervention, especially in situations where the effects are fragmented in population parts that are under-represented within traditional public health surveillance schemes. To this end, online user activity can be used as a complementary sensor to establish alternative measures. Here, we provide a summary of our research on formulating statistical frameworks for assessing public health interventions based on data from social media and search engines (Lampos et al., 2015 [20]; Wagner et al., 2017 [37]). Our methodology has been applied in two real-world case studies: the 2013/14 and 2014/15 flu vaccination campaigns in England, where school-age children were vaccinated in a number of locations aiming to reduce the overall transmission of the virus. Disease models from online data combined with historical patterns of disease prevalence across different areas allowed us to quantify the impact of the intervention. In addition, a qualitative evaluation of our impact estimates demonstrated that they were in line with independent assessments from public health authorities.
Submitted 21 December, 2017;
originally announced December 2017.
-
Flu Detector: Estimating influenza-like illness rates from online user-generated content
Authors:
Vasileios Lampos
Abstract:
We provide a brief technical description of an online platform for disease monitoring called Flu Detector (fludetector.cs.ucl.ac.uk). Flu Detector, in its current version (v.0.5), uses either Twitter or Google search data in conjunction with statistical Natural Language Processing models to estimate the rate of influenza-like illness in the population of England. Its back-end is a live service that collects online data, utilises modern technologies for large-scale text processing, and finally applies statistical inference models that are trained offline. The front-end visualises the various disease rate estimates. Notably, the models based on Google data achieve a high level of accuracy with respect to the most recent four flu seasons in England (2012/13 to 2015/16). This highlights Flu Detector as having great potential to become a complementary source to the domestic traditional flu surveillance schemes.
Submitted 11 December, 2016;
originally announced December 2016.
-
Analysing Mood Patterns in the United Kingdom through Twitter Content
Authors:
Vasileios Lampos,
Thomas Lansdall-Welfare,
Ricardo Araya,
Nello Cristianini
Abstract:
Social Media offer a vast amount of geo-located and time-stamped textual content directly generated by people. This information can be analysed to obtain insights about the general state of a large population of users and to address scientific questions from a diversity of disciplines. In this work, we estimate temporal patterns of mood variation through the use of emotionally loaded words contained in Twitter messages, possibly reflecting underlying circadian and seasonal rhythms in the mood of the users. We present a method for computing mood scores from text using affective word taxonomies, and apply it to millions of tweets collected in the United Kingdom during the seasons of summer and winter. Our analysis results in the detection of strong and statistically significant circadian patterns for all the investigated mood types. Seasonal variation does not seem to register any important divergence in the signals, but a periodic oscillation within a 24-hour period is identified for each mood type. The main common characteristic for all emotions is their mid-morning peak; however, their mood score patterns differ in the evenings.
Submitted 19 April, 2013;
originally announced April 2013.
-
Detecting Events and Patterns in Large-Scale User Generated Textual Streams with Statistical Learning Methods
Authors:
Vasileios Lampos
Abstract:
A vast amount of textual web streams is influenced by events or phenomena emerging in the real world. The social web forms an excellent modern paradigm, where unstructured user generated content is published on a regular basis and on most occasions is freely distributed. The present Ph.D. Thesis deals with the problem of inferring information - or patterns in general - about events emerging in real life based on the contents of this textual stream. We show that it is possible to extract valuable information about social phenomena, such as an epidemic or even rainfall rates, by automatic analysis of the content published in Social Media, and in particular Twitter, using Statistical Machine Learning methods. An important intermediate task regards the formation and identification of features which characterise a target event; we select and use those textual features in several linear, non-linear and hybrid inference approaches, achieving good performance in terms of the applied loss function. By further examining this rich data set, we also propose methods for extracting various types of mood signals revealing how affective norms - at least within the social web's population - evolve during the day and how significant events emerging in the real world influence them. Lastly, we present some preliminary findings showing several spatiotemporal characteristics of this textual information as well as the potential of using it to tackle tasks such as the prediction of voting intentions.
Submitted 13 August, 2012;
originally announced August 2012.
-
On voting intentions inference from Twitter content: a case study on UK 2010 General Election
Authors:
Vasileios Lampos
Abstract:
This report presents preliminary work on voting intention inference from Social Media, such as Twitter. Our case study is the UK 2010 General Election and we focus on predicting the percentages of voting intention polls (conducted by YouGov) for the three major political parties - Conservatives, Labour and Liberal Democrats - during a 5-month period before the election date (May 6, 2010). We form three methodologies for extracting positive or negative sentiment from tweets, which build on each other, and then propose two supervised models for turning sentiment into voting intention percentages. Interestingly, when the content of tweets is enriched by attaching synonymous words, a significant improvement in inference performance is achieved, reaching a mean absolute error of 4.34% +/- 2.13%; in that case, the predictions are also shown to be statistically significant. The presented methods should be considered as work-in-progress; limitations and suggestions for future work appear in the final section of this script.
Submitted 21 May, 2012; v1 submitted 2 April, 2012;
originally announced April 2012.