-
Learning Representational Disparities
Authors:
Pavan Ravishankar,
Rushabh Shah,
Daniel B. Neill
Abstract:
We propose a fair machine learning algorithm to model interpretable differences between observed and desired human decision-making, with the latter aimed at reducing disparity in a downstream outcome impacted by the human decision. Prior work learns fair representations without considering the outcome in the decision-making process. We model the outcome disparities as arising due to the different representations of the input seen by the observed and desired decision-maker, which we term representational disparities. Our goal is to learn interpretable representational disparities which could potentially be corrected by specific nudges to the human decision, mitigating disparities in the downstream outcome; we frame this as a multi-objective optimization problem using a neural network. Under reasonable simplifying assumptions, we prove that our neural network model of the representational disparity learns interpretable weights that fully mitigate the outcome disparity. We validate objectives and interpret results using real-world German Credit, Adult, and Heritage Health datasets.
Submitted 23 May, 2025;
originally announced May 2025.
-
Be Intentional About Fairness!: Fairness, Size, and Multiplicity in the Rashomon Set
Authors:
Gordon Dai,
Pavan Ravishankar,
Rachel Yuan,
Daniel B. Neill,
Emily Black
Abstract:
When selecting a model from a set of equally performant models, how much unfairness can you really reduce? Is it important to be intentional about fairness when choosing among this set, or is arbitrarily choosing among the set of "good" models good enough? Recent work has highlighted that the phenomenon of model multiplicity -- where multiple models with nearly identical predictive accuracy exist for the same task -- has both positive and negative implications for fairness, from strengthening the enforcement of civil rights law in AI systems to showcasing arbitrariness in AI decision-making. Despite the enormous implications of model multiplicity, there is little work that explores the properties of sets of equally accurate models, or Rashomon sets, in general. In this paper, we present five main theoretical and methodological contributions which help us to understand the relatively unexplored properties of the Rashomon set, in particular with regard to fairness. Our contributions include methods for efficiently sampling models from this set and techniques for identifying the fairest models according to key fairness metrics such as statistical parity. We also derive the probability that an individual's prediction will be flipped within the Rashomon set, as well as expressions for the set's size and the distribution of error tolerance used across models. These results lead to policy-relevant takeaways, such as the importance of intentionally looking for fair models within the Rashomon set, and understanding which individuals or groups may be more susceptible to arbitrary decisions.
Submitted 26 January, 2025;
originally announced January 2025.
-
Auditing Predictive Models for Intersectional Biases
Authors:
Kate S. Boxer,
Edward McFowland III,
Daniel B. Neill
Abstract:
Predictive models that satisfy group fairness criteria in aggregate for members of a protected class, but do not guarantee subgroup fairness, could produce biased predictions for individuals at the intersection of two or more protected classes. To address this risk, we propose Conditional Bias Scan (CBS), a flexible auditing framework for detecting intersectional biases in classification models. CBS identifies the subgroup for which there is the most significant bias against the protected class, as compared to the equivalent subgroup in the non-protected class, and can incorporate multiple commonly used fairness definitions for both probabilistic and binarized predictions. We show that this methodology can detect previously unidentified intersectional and contextual biases in the COMPAS pre-trial risk assessment tool and has higher bias detection power compared to similar methods that audit for subgroup fairness.
Submitted 22 June, 2023;
originally announced June 2023.
-
Insufficiently Justified Disparate Impact: A New Criterion for Subgroup Fairness
Authors:
Neil Menghani,
Edward McFowland III,
Daniel B. Neill
Abstract:
In this paper, we develop a new criterion, "insufficiently justified disparate impact" (IJDI), for assessing whether recommendations (binarized predictions) made by an algorithmic decision support tool are fair. Our novel, utility-based IJDI criterion evaluates false positive and false negative error rate imbalances, identifying statistically significant disparities between groups which are present even when adjusting for group-level differences in base rates. We describe a novel IJDI-Scan approach which can efficiently identify the intersectional subpopulations, defined across multiple observed attributes of the data, with the most significant IJDI. To evaluate IJDI-Scan's performance, we conduct experiments on both simulated and real-world data, including recidivism risk assessment and credit scoring. Further, we implement and evaluate approaches to mitigating IJDI for the detected subpopulations in these domains.
Submitted 19 June, 2023;
originally announced June 2023.
-
Provable Detection of Propagating Sampling Bias in Prediction Models
Authors:
Pavan Ravishankar,
Qingyu Mo,
Edward McFowland III,
Daniel B. Neill
Abstract:
With an increased focus on incorporating fairness in machine learning models, it becomes imperative not only to assess and mitigate bias at each stage of the machine learning pipeline but also to understand the downstream impacts of bias across stages. Here we consider a general, but realistic, scenario in which a predictive model is learned from (potentially biased) training data, and model predictions are assessed post-hoc for fairness by some auditing method. We provide a theoretical analysis of how a specific form of data bias, differential sampling bias, propagates from the data stage to the prediction stage. Unlike prior work, we evaluate the downstream impacts of data biases quantitatively rather than qualitatively and prove theoretical guarantees for detection. Under reasonable assumptions, we quantify how the amount of bias in the model predictions varies as a function of the amount of differential sampling bias in the data, and at what point this bias becomes provably detectable by the auditor. Through experiments on two criminal justice datasets -- the well-known COMPAS dataset and historical data from NYPD's stop and frisk policy -- we demonstrate that the theoretical results hold in practice even when our assumptions are relaxed.
Submitted 13 February, 2023;
originally announced February 2023.
-
Calibrated Nonparametric Scan Statistics for Anomalous Pattern Detection in Graphs
Authors:
Chunpai Wang,
Daniel B. Neill,
Feng Chen
Abstract:
We propose a new approach, the calibrated nonparametric scan statistic (CNSS), for more accurate detection of anomalous patterns in large-scale, real-world graphs. Scan statistics identify connected subgraphs that are interesting or unexpected through maximization of a likelihood ratio statistic; in particular, nonparametric scan statistics (NPSSs) identify subgraphs with a higher than expected proportion of individually significant nodes. However, we show that recently proposed NPSS methods are miscalibrated, failing to account for the maximization of the statistic over the multiplicity of subgraphs. This results in both reduced detection power for subtle signals, and low precision of the detected subgraph even for stronger signals. Thus we develop a new statistical approach to recalibrate NPSSs, correctly adjusting for multiple hypothesis testing and taking the underlying graph structure into account. While the recalibration, based on randomization testing, is computationally expensive, we propose both an efficient (approximate) algorithm and new, closed-form lower bounds (on the expected maximum proportion of significant nodes for subgraphs of a given size, under the null hypothesis of no anomalous patterns). These advances, along with the integration of recent core-tree decomposition methods, enable CNSS to scale to large real-world graphs, with substantial improvement in the accuracy of detected subgraphs. Extensive experiments on both semi-synthetic and real-world datasets validate the effectiveness of our proposed methods in comparison with state-of-the-art counterparts.
Submitted 26 June, 2022;
originally announced June 2022.
-
Positional Encoder Graph Neural Networks for Geographic Data
Authors:
Konstantin Klemmer,
Nathan Safir,
Daniel B. Neill
Abstract:
Graph neural networks (GNNs) provide a powerful and scalable solution for modeling continuous spatial data. However, they often rely on Euclidean distances to construct the input graphs. This assumption can be improbable in many real-world settings, where the spatial structure is more complex and explicitly non-Euclidean (e.g., road networks). Here, we propose PE-GNN, a new framework that incorporates spatial context and correlation explicitly into the models. Building on recent advances in geospatial auxiliary task learning and semantic spatial embeddings, our proposed method (1) learns a context-aware vector encoding of the geographic coordinates and (2) predicts spatial autocorrelation in the data in parallel with the main task. On spatial interpolation and regression tasks, we show the effectiveness of our approach, improving performance over different state-of-the-art GNN approaches. We observe that our approach not only vastly improves over the GNN baselines, but can match Gaussian processes, the most commonly utilized method for spatial interpolation problems.
Submitted 15 February, 2023; v1 submitted 19 November, 2021;
originally announced November 2021.
-
SPATE-GAN: Improved Generative Modeling of Dynamic Spatio-Temporal Patterns with an Autoregressive Embedding Loss
Authors:
Konstantin Klemmer,
Tianlin Xu,
Beatrice Acciaio,
Daniel B. Neill
Abstract:
From ecology to atmospheric sciences, many academic disciplines deal with data characterized by intricate spatio-temporal complexities, the modeling of which often requires specialized approaches. Generative models of these data are of particular interest, as they enable a range of impactful downstream applications like simulation or creating synthetic training data. Recent work has highlighted the potential of generative adversarial nets (GANs) for generating spatio-temporal data. A new GAN algorithm COT-GAN, inspired by the theory of causal optimal transport (COT), was proposed in an attempt to better tackle this challenge. However, the task of learning more complex spatio-temporal patterns requires additional knowledge of their specific data structures. In this study, we propose a novel loss objective combined with COT-GAN based on an autoregressive embedding to reinforce the learning of spatio-temporal dynamics. We devise SPATE (spatio-temporal association), a new metric measuring spatio-temporal autocorrelation by using the deviance of observations from their expected values. We compute SPATE for real and synthetic data samples and use it to compute an embedding loss that considers space-time interactions, nudging the GAN to learn outputs that are faithful to the observed dynamics. We test this new objective on a diverse set of complex spatio-temporal patterns: turbulent flows, log-Gaussian Cox processes and global weather data. We show that our novel embedding loss improves performance without any changes to the architecture of the COT-GAN backbone, highlighting our model's increased capacity for capturing autoregressive structures. We also contextualize our work with respect to recent advances in physics-informed deep learning and interdisciplinary work connecting neural networks with geographic and geophysical sciences.
Submitted 30 September, 2021;
originally announced September 2021.
-
Policing Chronic and Temporary Hot Spots of Violent Crime: A Controlled Field Experiment
Authors:
Dylan J. Fitzpatrick,
Wilpen L. Gorr,
Daniel B. Neill
Abstract:
Hot-spot-based policing programs aim to deter crime through increased proactive patrols at high-crime locations. While most hot spot programs target easily identified chronic hot spots, we introduce models for predicting temporary hot spots to address effectiveness and equity objectives for crime prevention, and present findings from a crossover experiment evaluating application of hot spot predictions to prevent serious violent crime in Pittsburgh, PA. Over a 12-month experimental period, the Pittsburgh Bureau of Police assigned uniformed patrol officers to weekly predicted chronic and temporary hot spots of serious violent crimes comprising 0.5 percent of the city's area. We find statistically and practically significant reductions in serious violent crime counts within treatment hot spots as compared to control hot spots, with an overall reduction of 25.3 percent in the FBI-classified Part 1 Violent (P1V) crimes of homicide, rape, robbery, and aggravated assault, and a 39.7 percent reduction of African-American and other non-white victims of P1V crimes. We find that temporary hot spots increase spatial dispersion of patrols and have a greater percentage reduction in P1V crimes than chronic hot spots but a smaller total number of crimes prevented. Only foot patrols, not car patrols, had statistically significant crime reductions in hot spots. We find no evidence of crime displacement; instead, we find weakly statistically significant spillover of crime prevention benefits to adjacent areas. In addition, we find no evidence that the community-oriented hot spot patrols produced over-policing arrests of minority or other populations.
Submitted 11 November, 2020;
originally announced November 2020.
-
Auxiliary-task learning for geographic data with autoregressive embeddings
Authors:
Konstantin Klemmer,
Daniel B. Neill
Abstract:
Machine learning is gaining popularity in a broad range of areas working with geographic data, such as ecology or atmospheric sciences. Here, data often exhibit spatial effects, which can be difficult to learn for neural networks. In this study, we propose SXL, a method for embedding information on the autoregressive nature of spatial data directly into the learning process using auxiliary tasks. We utilize the local Moran's I, a popular measure of local spatial autocorrelation, to "nudge" the model to learn the direction and magnitude of local spatial effects, complementing the learning of the primary task. We further introduce a novel expansion of Moran's I to multiple resolutions, thus capturing spatial interactions over longer and shorter distances simultaneously. The novel multi-resolution Moran's I can be constructed easily and as a multi-dimensional tensor offers seamless integration into existing machine learning frameworks. Throughout a range of experiments using real-world data, we highlight how our method consistently improves the training of neural networks in unsupervised and supervised learning tasks. In generative spatial modeling experiments, we propose a novel loss for auxiliary task GANs utilizing task uncertainty weights. Our proposed method outperforms domain-specific spatial interpolation benchmarks, highlighting its potential for downstream applications. This study bridges expertise from geographic information science and machine learning, showing how this integration of disciplines can help to address domain-specific challenges. The code for our experiments is available on GitHub: https://github.com/konstantinklemmer/sxl.
Submitted 19 August, 2021; v1 submitted 18 June, 2020;
originally announced June 2020.
-
Modeling Rape Reporting Delays Using Spatial, Temporal and Social Features
Authors:
Konstantin Klemmer,
Daniel B. Neill,
Stephen A. Jarvis
Abstract:
We present a novel approach to estimate the delay observed between the occurrence and reporting of rape crimes. We explore spatial, temporal and social effects in sparse aggregated (area-level) and high-dimensional disaggregated (event-level) data for New York and Los Angeles. Focusing on inference, we apply Gradient Boosting and Random Forests to assess predictor importance, as well as Gaussian Processes to model spatial disparities in reporting times. Our results highlight differences and similarities between the two cities. We identify at-risk populations and communities which may be targeted with focused policies and interventions to support rape victims, apprehend perpetrators, and prevent future crimes.
Submitted 21 November, 2018; v1 submitted 9 November, 2018;
originally announced November 2018.
-
Change Surfaces for Expressive Multidimensional Changepoints and Counterfactual Prediction
Authors:
William Herlands,
Daniel B. Neill,
Hannes Nickisch,
Andrew Gordon Wilson
Abstract:
Identifying changes in model parameters is fundamental in machine learning and statistics. However, standard changepoint models are limited in expressiveness, often addressing unidimensional problems and assuming instantaneous changes. We introduce change surfaces as a multidimensional and highly expressive generalization of changepoints. We provide a model-agnostic formalization of change surfaces, illustrating how they can provide variable, heterogeneous, and non-monotonic rates of change across multiple dimensions. Additionally, we show how change surfaces can be used for counterfactual prediction. As a concrete instantiation of the change surface framework, we develop Gaussian Process Change Surfaces (GPCS). We demonstrate counterfactual prediction with Bayesian posterior mean and credible sets, as well as massive scalability by introducing novel methods for additive non-separable kernels. Using two large spatio-temporal datasets we employ GPCS to discover and characterize complex changes that can provide scientific and policy relevant insights. Specifically, we analyze twentieth century measles incidence across the United States and discover previously unknown heterogeneous changes after the introduction of the measles vaccine. Additionally, we apply the model to requests for lead testing kits in New York City, discovering distinct spatial and demographic patterns.
Submitted 30 October, 2018; v1 submitted 28 October, 2018;
originally announced October 2018.
-
Gaussian Process Subset Scanning for Anomalous Pattern Detection in Non-iid Data
Authors:
William Herlands,
Edward McFowland III,
Andrew Gordon Wilson,
Daniel B. Neill
Abstract:
Identifying anomalous patterns in real-world data is essential for understanding where, when, and how systems deviate from their expected dynamics. Yet methods that separately consider the anomalousness of each individual data point have low detection power for subtle, emerging irregularities. Additionally, recent detection techniques based on subset scanning make strong independence assumptions and suffer degraded performance in correlated data. We introduce methods for identifying anomalous patterns in non-iid data by combining Gaussian processes with novel log-likelihood ratio statistic and subset scanning techniques. Our approaches are powerful, interpretable, and can integrate information across multiple data streams. We illustrate their performance on numeric simulations and three open source spatiotemporal datasets of opioid overdose deaths, 311 calls, and storm reports.
Submitted 4 April, 2018;
originally announced April 2018.
-
Machine Learning for Drug Overdose Surveillance
Authors:
Daniel B. Neill,
William Herlands
Abstract:
We describe two recently proposed machine learning approaches for discovering emerging trends in fatal accidental drug overdoses. The Gaussian Process Subset Scan enables early detection of emerging patterns in spatio-temporal data, accounting for both the non-iid nature of the data and the fact that detecting subtle patterns requires integration of information across multiple spatial areas and multiple time steps. We apply this approach to 17 years of county-aggregated data for monthly opioid overdose deaths in the New York City metropolitan area, showing clear advantages in the utility of discovered patterns as compared to typical anomaly detection approaches.
To detect and characterize emerging overdose patterns that differentially affect a subpopulation of the data, including geographic, demographic, and behavioral patterns (e.g., which combinations of drugs are involved), we apply the Multidimensional Tensor Scan to 8 years of case-level overdose data from Allegheny County, PA. We discover previously unidentified overdose patterns which reveal unusual demographic clusters, show impacts of drug legislation, and demonstrate potential for early detection and targeted intervention. These approaches to early detection of overdose patterns can inform prevention and response efforts, as well as understanding the effects of policy changes.
Submitted 6 October, 2017;
originally announced October 2017.
-
Graph Structure Learning from Unlabeled Data for Event Detection
Authors:
Sriram Somanchi,
Daniel B. Neill
Abstract:
Processes such as disease propagation and information diffusion often spread over some latent network structure which must be learned from observation. Given a set of unlabeled training examples representing occurrences of an event type of interest (e.g., a disease outbreak), our goal is to learn a graph structure that can be used to accurately detect future events of that type. Motivated by new theoretical results on the consistency of constrained and unconstrained subset scans, we propose a novel framework for learning graph structure from unlabeled data by comparing the most anomalous subsets detected with and without the graph constraints. Our framework uses the mean normalized log-likelihood ratio score to measure the quality of a graph structure, and efficiently searches for the highest-scoring graph structure. Using simulated disease outbreaks injected into real-world Emergency Department data from Allegheny County, we show that our method learns a structure similar to the true underlying graph, but enables faster and more accurate detection.
Submitted 5 January, 2017;
originally announced January 2017.
-
Identifying Significant Predictive Bias in Classifiers
Authors:
Zhe Zhang,
Daniel B. Neill
Abstract:
We present a novel subset scan method to detect whether a probabilistic binary classifier has statistically significant bias -- over- or under-predicting the risk -- for some subgroup, and identify the characteristics of this subgroup. This form of model checking and goodness-of-fit test provides a way to interpretably detect the presence of classifier bias or regions of poor classifier fit. This allows consideration of not just subgroups of a priori interest or small dimensions, but the space of all possible subgroups of features. To address the difficulty of considering these exponentially many possible subgroups, we use subset scan and parametric bootstrap-based methods. Extending this method, we can penalize the complexity of the detected subgroup and also identify subgroups with high classification errors. We demonstrate these methods and find interesting results on the COMPAS crime recidivism and credit delinquency data.
Submitted 4 July, 2017; v1 submitted 24 November, 2016;
originally announced November 2016.
-
Semantic Scan: Detecting Subtle, Spatially Localized Events in Text Streams
Authors:
Abhinav Maurya,
Kenton Murray,
Yandong Liu,
Chris Dyer,
William W. Cohen,
Daniel B. Neill
Abstract:
Early detection and precise characterization of emerging topics in text streams can be highly useful in applications such as timely and targeted public health interventions and discovering evolving regional business trends. Many methods have been proposed for detecting emerging events in text streams using topic modeling. However, these methods have numerous shortcomings that make them unsuitable for rapid detection of locally emerging events on massive text streams. In this paper, we describe Semantic Scan (SS), which has been developed specifically to overcome these shortcomings in detecting new spatially compact events in text streams.
Semantic Scan integrates novel contrastive topic modeling with online document assignment and principled likelihood ratio-based spatial scanning to identify emerging events with unexpected patterns of keywords hidden in text streams. This enables more timely and accurate detection and characterization of anomalous, spatially localized emerging events. Semantic Scan does not require manual intervention or labeled training data, and is robust to noise in real-world text data since it identifies anomalous text patterns that occur in a cluster of new documents rather than an anomaly in a single new document.
We compare Semantic Scan to alternative state-of-the-art methods such as Topics over Time, Online LDA, and Labeled LDA on two real-world tasks: (i) a disease surveillance task monitoring free-text Emergency Department chief complaints in Allegheny County, and (ii) an emerging business trend detection task based on Yelp reviews. On both tasks, we find that Semantic Scan provides significantly better event detection and characterization accuracy than competing approaches, while providing up to an order of magnitude speedup.
Submitted 13 February, 2016;
originally announced February 2016.