-
An analysis of the NCAA college football playoff team selections using an Elo ratings model
Authors:
Benjamin Lucas
Abstract:
In December 2023 the Florida State Seminoles became the first Power 5 school to have an undefeated season and miss selection for the College Football Playoff (CFP). To assess this decision, we employed an Elo ratings model to rank the teams and found that the selection committee's decision was justified and that Florida State was not one of the four best teams in college football in that season (ranking only 11th!). We extended this analysis to all other years of the CFP and found that the top four teams by Elo ratings differ greatly from the four teams selected in almost every year of the CFP's existence. Furthermore, we found that there have been more egregious non-selections, including in 2022, when Alabama was ranked first by Elo ratings and was not selected. The analysis suggests that the current criteria are too subjective and that a ratings model should be implemented to provide transparency for the sport, its teams, and its fans.
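For readers unfamiliar with Elo ratings, the following is a minimal sketch of the standard Elo update, not the paper's actual model. The starting rating of 1500, the K-factor of 20, the 400-point scale, and the team names are illustrative assumptions; the abstract does not state the parameters used, nor whether the model adjusts for home advantage or margin of victory.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Standard Elo expected score of A against B (a value in 0..1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def update(rating_a, rating_b, score_a, k=20.0):
    """Return both ratings after one game; score_a is 1.0 if A won, 0.0 if A lost."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


def rank_teams(games, initial=1500.0, k=20.0):
    """Process games in order and return team names sorted by final rating.

    games is an iterable of (team_a, team_b, a_won) tuples. The initial
    rating and K-factor are assumed values, not taken from the paper.
    """
    ratings = {}
    for team_a, team_b, a_won in games:
        ra = ratings.get(team_a, initial)
        rb = ratings.get(team_b, initial)
        ratings[team_a], ratings[team_b] = update(ra, rb, 1.0 if a_won else 0.0, k)
    return sorted(ratings, key=ratings.get, reverse=True)


# Hypothetical three-team season: A beats B and C, B beats C.
season = [("A", "B", True), ("A", "C", True), ("B", "C", True)]
ranking = rank_teams(season)
```

Taking the first four entries of such a ranking would give a ratings-based playoff field that could be compared with the committee's selections.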
Submitted 6 March, 2024;
originally announced March 2024.
-
A spatiotemporal machine learning approach to forecasting COVID-19 incidence at the county level in the USA
Authors:
Benjamin Lucas,
Behzad Vahedi,
Morteza Karimzadeh
Abstract:
With COVID-19 affecting every country globally and changing everyday life, the ability to forecast the spread of the disease is more important than in any previous epidemic. The conventional methods of disease-spread modeling, compartmental models, are based on the assumption of spatiotemporal homogeneity of the spread of the virus, which may cause forecasting to underperform, especially at high spatial resolutions. In this paper we approach the forecasting task with an alternative technique: spatiotemporal machine learning. We present COVID-LSTM, a data-driven model based on a Long Short-Term Memory deep learning architecture for forecasting COVID-19 incidence at the county level in the US. We use the weekly number of new positive cases as temporal input, and hand-engineered spatial features from Facebook movement and connectedness datasets to capture the spread of the disease in time and space. COVID-LSTM outperforms the COVID-19 Forecast Hub's Ensemble model (COVIDhub-ensemble) on our 17-week evaluation period, making it the first model to be more accurate than the COVIDhub-ensemble over one or more forecast periods. Over the 4-week forecast horizon, our model is on average 50 cases per county more accurate than the COVIDhub-ensemble. We highlight that the underutilization of data-driven forecasting of disease spread prior to COVID-19 is likely due to the lack of sufficient data available for previous diseases, in addition to the recent advances in machine learning methods for spatiotemporal forecasting. We discuss the impediments to the wider uptake of data-driven forecasting, and whether it is likely that more deep learning-based models will be used in the future.
Submitted 18 August, 2022; v1 submitted 24 September, 2021;
originally announced September 2021.
-
A Bayesian-inspired, deep learning-based, semi-supervised domain adaptation technique for land cover mapping
Authors:
Benjamin Lucas,
Charlotte Pelletier,
Daniel Schmidt,
Geoffrey I. Webb,
François Petitjean
Abstract:
Land cover maps are a vital input variable to many types of environmental research and management. While they can be produced automatically by machine learning techniques, these techniques require substantial training data to achieve high levels of accuracy, and such data are not always available. One technique researchers use when labelled training data are scarce is domain adaptation (DA) -- where data from an alternate region, known as the source domain, are used to train a classifier and this model is adapted to map the study region, or target domain. The scenario we address in this paper is known as semi-supervised DA, where some labelled samples are available in the target domain. In this paper we present Sourcerer, a Bayesian-inspired, deep learning-based, semi-supervised DA technique for producing land cover maps from satellite image time series (SITS) data. The technique takes a convolutional neural network trained on a source domain and then trains further on the available target domain with a novel regularizer applied to the model weights. The regularizer adjusts the degree to which the model is modified to fit the target data, limiting the degree of change when the target data are few in number and increasing it as target data quantity increases. Our experiments on Sentinel-2 time series images compare Sourcerer with two state-of-the-art semi-supervised domain adaptation techniques and four baseline models. We show that on two different source-target domain pairings Sourcerer outperforms all other methods for any quantity of labelled target data available. In fact, the results on the more difficult target domain show that the starting accuracy of Sourcerer (when no labelled target data are available), 74.2%, is greater than the next-best state-of-the-art method trained on 20,000 labelled target instances.
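The behaviour of the regularizer described above can be illustrated with a small sketch. This is not Sourcerer's actual regularizer, whose functional form the abstract does not give. It assumes a squared-distance penalty that pulls the fine-tuned weights toward the source-trained weights, with a strength that shrinks as the number of labelled target samples grows. The function names, the `scale` parameter, and the decay formula are all hypothetical.

```python
def regularization_strength(n_target, scale=1000.0):
    """Hypothetical schedule: strong pull toward the source model when
    target data are few, weaker pull as n_target grows."""
    return scale / (scale + n_target)


def source_anchored_penalty(weights, source_weights, n_target, scale=1000.0):
    """Squared distance from the source weights, scaled by the schedule above.

    In fine-tuning, this term would be added to the usual classification
    loss, so that large departures from the source model are costly when
    little target data is available.
    """
    lam = regularization_strength(n_target, scale)
    return lam * sum((w - s) ** 2 for w, s in zip(weights, source_weights))


source = [0.5, -1.0, 2.0]   # weights learned on the source domain (toy values)
adapted = [0.7, -0.8, 1.5]  # candidate weights after fitting target data

few = source_anchored_penalty(adapted, source, n_target=10)
many = source_anchored_penalty(adapted, source, n_target=20000)
```

The same weight change is penalised much more heavily with 10 labelled target samples than with 20,000, which matches the qualitative behaviour the abstract describes.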
Submitted 10 March, 2021; v1 submitted 25 May, 2020;
originally announced May 2020.
-
InceptionTime: Finding AlexNet for Time Series Classification
Authors:
Hassan Ismail Fawaz,
Benjamin Lucas,
Germain Forestier,
Charlotte Pelletier,
Daniel F. Schmidt,
Jonathan Weber,
Geoffrey I. Webb,
Lhassane Idoumghar,
Pierre-Alain Muller,
François Petitjean
Abstract:
This paper brings deep learning to the forefront of research into Time Series Classification (TSC). TSC is the area of machine learning tasked with the categorization (or labelling) of time series. The last few decades of work in this area have led to significant progress in the accuracy of classifiers, with the state of the art now represented by the HIVE-COTE algorithm. While extremely accurate, HIVE-COTE cannot be applied to many real-world datasets because of its high training time complexity of O(N^2 * T^4) for a dataset with N time series of length T. For example, it takes HIVE-COTE more than 8 days to learn from a small dataset with N = 1500 time series of short length T = 46. Meanwhile deep learning has received enormous attention because of its high accuracy and scalability. Recent approaches to deep learning for TSC have been scalable, but less accurate than HIVE-COTE. We introduce InceptionTime - an ensemble of deep Convolutional Neural Network (CNN) models, inspired by the Inception-v4 architecture. Our experiments show that InceptionTime is on par with HIVE-COTE in terms of accuracy while being much more scalable: not only can it learn from 1,500 time series in one hour but it can also learn from 8M time series in 13 hours, a quantity of data that is fully out of reach of HIVE-COTE.
Submitted 5 December, 2020; v1 submitted 11 September, 2019;
originally announced September 2019.
-
Proximity Forest: An effective and scalable distance-based classifier for time series
Authors:
Benjamin Lucas,
Ahmed Shifaz,
Charlotte Pelletier,
Lachlan O'Neill,
Nayyar Zaidi,
Bart Goethals,
Francois Petitjean,
Geoffrey I. Webb
Abstract:
Research into the classification of time series has made enormous progress in the last decade. The UCR time series archive has played a significant role in challenging and guiding the development of new learners for time series classification. The largest dataset in the UCR archive holds only 10 thousand time series, which may explain why the primary research focus has been on creating algorithms that have high accuracy on relatively small datasets.
This paper introduces Proximity Forest, an algorithm that learns accurate models from datasets with millions of time series, and classifies a time series in milliseconds. The models are ensembles of highly randomized Proximity Trees. Whereas conventional decision trees branch on attribute values (and usually perform poorly on time series), Proximity Trees branch on the proximity of time series to one exemplar time series or another, allowing us to leverage the decades of work on developing relevant measures for time series. Proximity Forest gains both efficiency and accuracy by stochastic selection of both exemplars and similarity measures.
Our work is motivated by recent time series applications that provide orders of magnitude more time series than the UCR benchmarks. Our experiments demonstrate that Proximity Forest is highly competitive on the UCR archive: it ranks among the most accurate classifiers while being significantly faster. We demonstrate on a 1M time series Earth observation dataset that Proximity Forest retains this accuracy on datasets that are many orders of magnitude greater than those in the UCR repository, while learning its models at least 100,000 times faster than the current state-of-the-art models, Elastic Ensemble and COTE.
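The branching rule of a Proximity Tree, as described above, can be sketched for a single node as follows. This is an illustrative sketch rather than the authors' implementation: drawing one exemplar per class is an assumption, the function names are hypothetical, and Euclidean and Manhattan distances stand in for the time series similarity measures that the abstract refers to but does not list.

```python
import random


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def make_random_split(series, labels, measures, rng):
    """Draw one similarity measure and one exemplar per class at random.

    Choosing one exemplar per class is an assumption of this sketch.
    """
    measure = rng.choice(measures)
    exemplars = {}
    for label in sorted(set(labels)):
        candidates = [s for s, l in zip(series, labels) if l == label]
        exemplars[label] = rng.choice(candidates)
    return measure, exemplars


def route(query, measure, exemplars):
    """Send the query down the branch whose exemplar is nearest to it."""
    return min(exemplars, key=lambda label: measure(query, exemplars[label]))


# Toy data: two well-separated classes of length-3 series.
train = [[0, 0, 0], [0, 1, 0], [5, 5, 5], [5, 6, 5]]
train_labels = ["low", "low", "high", "high"]

rng = random.Random(0)
measure, exemplars = make_random_split(train, train_labels, [euclidean, manhattan], rng)
branch = route([0, 0, 1], measure, exemplars)
```

A full tree would apply such splits recursively to the training series routed to each branch, and a forest would combine the votes of many such randomized trees.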
Submitted 12 December, 2018; v1 submitted 31 August, 2018;
originally announced August 2018.