-
Cast vote records: A database of ballots from the 2020 U.S. Election
Authors:
Shiro Kuriwaki,
Mason Reece,
Samuel Baltz,
Aleksandra Conevska,
Joseph R. Loffredo,
Can Mutlu,
Taran Samarth,
Kevin E. Acevedo Jetter,
Zachary Djanogly Garai,
Kate Murray,
Shigeo Hirano,
Jeffrey B. Lewis,
James M. Snyder Jr.,
Charles H. Stewart III
Abstract:
Ballots are the core records of elections. Electronic records of actual ballots cast (cast vote records) are available to the public in some jurisdictions. However, they have been released in a variety of formats and have not been independently evaluated. Here we introduce a database of cast vote records from the 2020 U.S. general election. We downloaded publicly available unstandardized cast vote records, standardized them into a multi-state database, and extensively compared their totals to certified election results. Our release includes vote records for President, Governor, U.S. Senate and House, and state upper and lower chambers -- covering 42.7 million voters in 20 states who voted for more than 2,204 candidates. This database serves as a uniquely granular administrative dataset for studying voting behavior and election administration. Using this data, we show that in battleground states, 1.9 percent of solid Republicans (as defined by their congressional and state legislative voting) in our database split their ticket for Joe Biden, while 1.2 percent of solid Democrats split their ticket for Donald Trump.
Submitted 24 October, 2024;
originally announced November 2024.
-
Wastewater-based Epidemiology for COVID-19 Surveillance and Beyond: A Survey
Authors:
Chen Chen,
Yunfan Wang,
Gursharn Kaur,
Aniruddha Adiga,
Baltazar Espinoza,
Srinivasan Venkatramanan,
Andrew Warren,
Bryan Lewis,
Justin Crow,
Rekha Singh,
Alexandra Lorentz,
Denise Toney,
Madhav Marathe
Abstract:
The COVID-19 pandemic has imposed tremendous pressure on public health systems and socioeconomic ecosystems over the past years. To alleviate its social impact, it is important to proactively track the prevalence of COVID-19 within communities. The traditional way to estimate disease prevalence relies on reported clinical test data or surveys. However, the coverage of clinical tests is often limited, and testing can be labor-intensive and depends on reliable and timely results and on consistent diagnostic and reporting criteria. Recent studies revealed that patients who are diagnosed with COVID-19 often undergo fecal shedding of SARS-CoV-2 virus into wastewater, which makes wastewater-based epidemiology for COVID-19 surveillance a promising approach to complement traditional clinical testing. In this paper, we survey the existing literature regarding wastewater-based epidemiology for COVID-19 surveillance and summarize the current advances in the area. Specifically, we cover the key aspects of wastewater sampling and sample testing, and present a comprehensive and organized summary of wastewater data analytical methods. Finally, we outline open challenges in current wastewater-based COVID-19 surveillance studies, aiming to encourage new ideas to advance the development of effective wastewater-based surveillance systems for general infectious diseases.
Submitted 23 September, 2024; v1 submitted 22 March, 2024;
originally announced March 2024.
-
Privacy Violations in Election Results
Authors:
Shiro Kuriwaki,
Jeffrey B. Lewis,
Michael Morse
Abstract:
After an election, should election officials release a copy of each anonymous ballot? Some policymakers have championed public disclosure to counter distrust, but others worry that it might undermine ballot secrecy. We introduce the term vote revelation to refer to the linkage of a vote on an anonymous ballot to the voter's name in the public voter file, and detail how such revelation could theoretically occur. Using the 2020 election in Maricopa County, Arizona, as a case study, we show that the release of individual ballot records would lead to no revelation of any vote choice for 99.83% of voters as compared to 99.95% under Maricopa's current practice of reporting aggregate results by precinct and method of voting. Further, revelation is overwhelmingly concentrated among the few voters who cast provisional ballots or federal-only ballots. We discuss the potential benefits of transparency, compare remedies to reduce or eliminate privacy violations, and highlight the privacy-transparency tradeoff inherent in all election reporting.
Submitted 14 March, 2025; v1 submitted 8 August, 2023;
originally announced August 2023.
-
Examining Deep Learning Models with Multiple Data Sources for COVID-19 Forecasting
Authors:
Lijing Wang,
Aniruddha Adiga,
Srinivasan Venkatramanan,
Jiangzhuo Chen,
Bryan Lewis,
Madhav Marathe
Abstract:
The COVID-19 pandemic represents the most significant public health disaster since the 1918 influenza pandemic. During pandemics such as COVID-19, timely and reliable spatio-temporal forecasting of epidemic dynamics is crucial. Deep learning-based time series models for forecasting have recently gained popularity and have been successfully used for epidemic forecasting. Here we focus on the design and analysis of deep learning-based models for COVID-19 forecasting. We implement multiple recurrent neural network-based deep learning models and combine them using the stacking ensemble technique. To incorporate the effects of multiple factors in COVID-19 spread, we consider multiple data sources, such as COVID-19 confirmed case counts, death counts, and testing data, for better predictions. To overcome the sparsity of training data and to address the dynamic correlation of the disease, we propose clustering-based training for high-resolution forecasting. The methods help us identify similar trends among certain groups of regions due to various spatio-temporal effects. We examine the proposed method for forecasting weekly new confirmed COVID-19 cases at the county, state, and country levels. A comprehensive comparison between different time series models in the COVID-19 context is conducted and analyzed. The results show that simple deep learning models can achieve comparable or better performance when compared with more complicated models. We are currently integrating our methods as a part of the weekly forecasts that we provide to state and federal authorities.
Submitted 23 November, 2020; v1 submitted 27 October, 2020;
originally announced October 2020.
-
Calibrating a Stochastic Agent Based Model Using Quantile-based Emulation
Authors:
Arindam Fadikar,
Dave Higdon,
Jiangzhuo Chen,
Brian Lewis,
Srini Venkatramanan,
Madhav Marathe
Abstract:
In a number of cases, the Quantile Gaussian Process (QGP) has proven effective in emulating stochastic, univariate computer model output (Plumlee and Tuo, 2014). In this paper, we develop an approach that uses this emulation approach within a Bayesian model calibration framework to calibrate an agent-based model of an epidemic. In addition, this approach is extended to handle the multivariate nature of the model output, which gives a time series of the count of infected individuals. The basic modeling approach is adapted from Higdon et al. (2008), using a basis representation to capture the multivariate model output. The approach is motivated with an example taken from the 2015 Ebola Challenge workshop, which simulated an Ebola epidemic to evaluate methodology.
Submitted 1 December, 2017;
originally announced December 2017.
-
Numerical tolerance for spectral decompositions of random matrices
Authors:
Avanti Athreya,
Michael Kane,
Bryan Lewis,
Zachary Lubberts,
Vince Lyzinski,
Youngser Park,
Carey E. Priebe,
Minh Tang
Abstract:
We precisely quantify the impact of statistical error in the quality of a numerical approximation to a random matrix eigendecomposition, and under mild conditions, we use this to introduce an optimal numerical tolerance for residual error in spectral decompositions of random matrices. We demonstrate that terminating an eigendecomposition algorithm when the numerical error and statistical error are of the same order results in computational savings with no loss of accuracy. We also repair a flaw in a ubiquitous termination condition, one in wide employ in several computational linear algebra implementations. We illustrate the practical consequences of our stopping criterion with an analysis of simulated and real networks. Our theoretical results and real-data examples establish that the tradeoff between statistical and numerical error is of significant import for data science.
Submitted 30 January, 2020; v1 submitted 1 August, 2016;
originally announced August 2016.
-
Efficient Thresholded Correlation using Truncated Singular Value Decomposition
Authors:
James Baglama,
Michael Kane,
Bryan Lewis,
Alex Poliakov
Abstract:
Efficiently computing a subset of a correlation matrix consisting of values above a specified threshold is important to many practical applications. Real-world problems in genomics, machine learning, finance, and other applications can produce correlation matrices too large to explicitly form and tractably compute. Often, only values corresponding to highly-correlated vectors are of interest, and those values typically make up a small fraction of the overall correlation matrix. We present a method based on the singular value decomposition (SVD) and its relationship to the data covariance structure that can efficiently compute thresholded subsets of very large correlation matrices.
Submitted 11 March, 2016; v1 submitted 22 December, 2015;
originally announced December 2015.
-
Scatter Matrix Concordance: A Diagnostic for Regressions on Subsets of Data
Authors:
Michael J. Kane,
Bryan Lewis,
Sekhar Tatikonda,
Simon Urbanek
Abstract:
Linear regression models depend directly on the design matrix and its properties. Techniques that efficiently estimate model coefficients by partitioning rows of the design matrix are increasingly popular for large-scale problems because they fit well with modern parallel computing architectures. We propose a simple measure of concordance between a design matrix and a subset of its rows that estimates how well a subset captures the variance-covariance structure of a larger data set. We illustrate the use of this measure in a heuristic method for selecting row partition sizes that balance statistical and computational efficiency goals in real-world problems.
Submitted 12 July, 2015;
originally announced July 2015.