-
Persistent Classification: A New Approach to Stability of Data and Adversarial Examples
Authors:
Brian Bell,
Michael Geyer,
David Glickenstein,
Keaton Hamm,
Carlos Scheidegger,
Amanda Fernandez,
Juston Moore
Abstract:
There are a number of hypotheses underlying the existence of adversarial examples for classification problems. These include the high-dimensionality of the data, high codimension in the ambient space of the data manifolds of interest, and that the structure of machine learning models may encourage classifiers to develop decision boundaries close to data points. This article proposes a new framework for studying adversarial examples that does not depend directly on the distance to the decision boundary. Similarly to the smoothed classifier literature, we define a (natural or adversarial) data point to be $(γ,σ)$-stable if the probability of the same classification is at least $γ$ for points sampled in a Gaussian neighborhood of the point with a given standard deviation $σ$. We focus on studying the differences between persistence metrics along interpolants of natural and adversarial points. We show that adversarial examples have significantly lower persistence than natural examples for large neural networks in the context of the MNIST and ImageNet datasets. We connect this lack of persistence with decision boundary geometry by measuring angles of interpolants with respect to decision boundaries. Finally, we connect this approach with robustness by developing a manifold alignment gradient metric and demonstrating the increase in robustness that can be achieved when training with the addition of this metric.
Submitted 11 April, 2024;
originally announced April 2024.
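The $(γ,σ)$-stability definition in the abstract above can be illustrated with a small Monte Carlo sketch. This is not the authors' implementation: the function names, the sample count, and the toy classifier are assumptions introduced only for illustration, and the estimate is only as good as the number of samples drawn.

```python
import numpy as np

def estimate_stability(classify, x, sigma, n_samples=1000, seed=0):
    """Estimate the fraction of Gaussian neighbors of x that keep x's label.

    `classify` is assumed to map an array of shape (n, d) to n labels.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    base_label = classify(x[None, :])[0]
    # Sample points from an isotropic Gaussian centered at x.
    noise = rng.normal(0.0, sigma, size=(n_samples,) + x.shape)
    labels = classify(x[None, :] + noise)
    return float(np.mean(labels == base_label))

def is_stable(classify, x, gamma, sigma, **kwargs):
    """Check (gamma, sigma)-stability using the Monte Carlo estimate."""
    return estimate_stability(classify, x, sigma, **kwargs) >= gamma

def toy_classifier(points):
    # Hypothetical stand-in for a trained model: label by sign of coordinate 0.
    return (points[:, 0] > 0).astype(int)

if __name__ == "__main__":
    far = [5.0, 0.0]    # far from the toy decision boundary
    near = [0.01, 0.0]  # almost on the toy decision boundary
    print(is_stable(toy_classifier, far, gamma=0.9, sigma=0.1))
    print(is_stable(toy_classifier, near, gamma=0.9, sigma=1.0))
```

A point far from the boundary keeps its label under small noise, while a point near the boundary loses it for roughly half of the samples, which is the kind of contrast the persistence metric is meant to capture.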
-
Reducing Access Disparities in Networks using Edge Augmentation
Authors:
Ashkan Bashardoust,
Sorelle A. Friedler,
Carlos E. Scheidegger,
Blair D. Sullivan,
Suresh Venkatasubramanian
Abstract:
In social networks, a node's position is a form of social capital. Better-positioned members not only benefit from (faster) access to diverse information, but innately have more potential influence on information spread. Structural biases often arise from network formation, and can lead to significant disparities in information access based on position. Further, processes such as link recommendation can exacerbate this inequality by relying on network structure to augment connectivity.
We argue that one can understand and quantify this social capital through the lens of information flow in the network. We consider the setting where all nodes may be sources of distinct information, and a node's (dis)advantage deems its ability to access all information available on the network. We introduce three new measures of advantage (broadcast, influence, and control), which are quantified in terms of position in the network using access signatures -- vectors that represent a node's ability to share information. We then consider the problem of improving equity by making interventions to increase the access of the least-advantaged nodes. We argue that edge augmentation is most appropriate for mitigating bias in the network structure, and frame a budgeted intervention problem for maximizing minimum pairwise access.
Finally, we propose heuristic strategies for selecting edge augmentations and empirically evaluate their performance on a corpus of real-world social networks. We demonstrate that a small number of interventions significantly increase the broadcast measure of access for the least-advantaged nodes (over 5 times more than random), and also improve the minimum influence. Additional analysis shows that these interventions can also dramatically shrink the gap in advantage between nodes (over 82%) and reduce disparities between their access signatures.
Submitted 15 September, 2022;
originally announced September 2022.
-
Traveler: Navigating Task Parallel Traces for Performance Analysis
Authors:
Sayef Azad Sakin,
Alex Bigelow,
R. Tohid,
Connor Scully-Allison,
Carlos Scheidegger,
Steven R. Brandt,
Christopher Taylor,
Kevin A. Huck,
Hartmut Kaiser,
Katherine E. Isaacs
Abstract:
Understanding the behavior of software in execution is a key step in identifying and fixing performance issues. This is especially important in high performance computing contexts where even minor performance tweaks can translate into large savings in terms of computational resource use. To aid performance analysis, developers may collect an execution trace - a chronological log of program activity during execution. As traces represent the full history, developers can discover a wide array of possibly previously unknown performance issues, making them an important artifact for exploratory performance analysis. However, interactive trace visualization is difficult due to issues of data size and complexity of meaning. Traces represent nanosecond-level events across many parallel processes, meaning the collected data is often large and difficult to explore. The rise of asynchronous task parallel programming paradigms complicates the relation between events and their probable cause. To address these challenges, we conduct a continuing design study in collaboration with high performance computing researchers. We develop diverse and hierarchical ways to navigate and represent execution trace data in support of their trace analysis tasks. Through an iterative design process, we developed Traveler, an integrated visualization platform for task parallel traces. Traveler provides multiple linked interfaces to help navigate trace data from multiple contexts. We evaluate the utility of Traveler through feedback from users and a case study, finding that integrating multiple modes of navigation in our design supported performance analysis tasks and led to the discovery of previously unknown behavior in a distributed array library.
Submitted 3 September, 2022; v1 submitted 29 July, 2022;
originally announced August 2022.
-
UnProjection: Leveraging Inverse-Projections for Visual Analytics of High-Dimensional Data
Authors:
Mateus Espadoto,
Gabriel Appleby,
Ashley Suh,
Dylan Cashman,
Mingwei Li,
Carlos Scheidegger,
Erik W Anderson,
Remco Chang,
Alexandru C Telea
Abstract:
Projection techniques are often used to visualize high-dimensional data, allowing users to better understand the overall structure of multi-dimensional spaces on a 2D screen. Although many such methods exist, comparatively little work has been done on generalizable methods of inverse-projection -- the process of mapping the projected points, or more generally, the projection space back to the original high-dimensional space. In this paper we present NNInv, a deep learning technique with the ability to approximate the inverse of any projection or mapping. NNInv learns to reconstruct high-dimensional data from any arbitrary point on a 2D projection space, giving users the ability to interact with the learned high-dimensional representation in a visual analytics system. We provide an analysis of the parameter space of NNInv, and offer guidance in selecting these parameters. We extend validation of the effectiveness of NNInv through a series of quantitative and qualitative analyses. We then demonstrate the method's utility by applying it to three visualization tasks: interactive instance interpolation, classifier agreement, and gradient visualization.
Submitted 2 November, 2021;
originally announced November 2021.
-
Comparing Deep Neural Nets with UMAP Tour
Authors:
Mingwei Li,
Carlos Scheidegger
Abstract:
Neural networks should be interpretable to humans. In particular, there is a growing interest in concepts learned in a layer and similarity between layers. In this work, a tool, UMAP Tour, is built to visually inspect and compare internal behavior of real-world neural network models using well-aligned, instance-level representations. The method used in the visualization also implies a new similarity measure between neural network layers. Using the visual tool and the similarity measure, we find concepts learned in state-of-the-art models and dissimilarities between them, such as GoogLeNet and ResNet.
Submitted 18 October, 2021;
originally announced October 2021.
-
STFT-LDA: An Algorithm to Facilitate the Visual Analysis of Building Seismic Responses
Authors:
Zhenge Zhao,
Danilo Motta,
Matthew Berger,
Joshua A. Levine,
Ismail B. Kuzucu,
Robert B. Fleischman,
Afonso Paiva,
Carlos Scheidegger
Abstract:
Civil engineers use numerical simulations of a building's responses to seismic forces to understand the nature of building failures, the limitations of building codes, and how to determine the latter to prevent the former. Such simulations generate large ensembles of multivariate, multiattribute time series. Comprehensive understanding of this data requires techniques that support the multivariate nature of the time series and can compare behaviors that are both periodic and non-periodic across multiple time scales and multiple time series themselves. In this paper, we present a novel technique to extract such patterns from time series generated from simulations of seismic responses. The core of our approach is the use of topic modeling, where topics correspond to interpretable and discriminative features of the earthquakes. We transform the raw time series data into a time series of topics, and use this visual summary to compare temporal patterns in earthquakes, query earthquakes via the topics across arbitrary time scales, and enable details on demand by linking the topic visualization with the original earthquake data. We show, through a surrogate task and an expert study, that this technique allows analysts to more easily identify recurring patterns in such time series. By integrating this technique in a prototype system, we show how it enables novel forms of visual interaction.
Submitted 1 September, 2021;
originally announced September 2021.
-
Human-in-the-loop Extraction of Interpretable Concepts in Deep Learning Models
Authors:
Zhenge Zhao,
Panpan Xu,
Carlos Scheidegger,
Liu Ren
Abstract:
The interpretation of deep neural networks (DNNs) has become a key topic as more and more people apply them to solve various problems and make critical decisions. Concept-based explanations have recently become a popular approach for post-hoc interpretation of DNNs. However, identifying human-understandable visual concepts that affect model decisions is a challenging task that is not easily addressed with automatic approaches. We present a novel human-in-the-loop approach to generate user-defined concepts for model interpretation and diagnostics. Central to our proposal is the use of active learning, where human knowledge and feedback are combined to train a concept extractor with very little human labeling effort. We integrate this process into an interactive system, ConceptExtract. Through two case studies, we show how our approach helps analyze model behavior and extract human-friendly concepts for different machine learning tasks and datasets and how to use these concepts to understand the predictions, compare model performance and make suggestions for model refinement. Quantitative experiments show that our active learning approach can accurately extract meaningful visual concepts. More importantly, by identifying visual concepts that negatively affect model performance, we develop the corresponding data augmentation strategy that consistently improves model performance.
Submitted 8 August, 2021;
originally announced August 2021.
-
Information access representations and social capital in networks
Authors:
Ashkan Bashardoust,
Hannah C. Beilinson,
Sorelle A. Friedler,
Jiajie Ma,
Jade Rousseau,
Carlos E. Scheidegger,
Blair D. Sullivan,
Nasanbayar Ulzii-Orshikh,
Suresh Venkatasubramanian
Abstract:
Social network position confers power and social capital. In the setting of online social networks that have massive reach, creating mathematical representations of social capital is an important step towards understanding how network position can differentially confer advantage to different groups and how network position can itself be a source of advantage. In this paper, we use well established models for information flow on networks as a base to propose a formal descriptor of the network position of a node as represented by its information access. Combining these descriptors allows a full representation of social capital across the network. Using real-world networks, we demonstrate that this representation allows the identification of differences between groups based on network specific measures of inequality of access.
Submitted 16 October, 2023; v1 submitted 23 October, 2020;
originally announced October 2020.
-
Problems with Shapley-value-based explanations as feature importance measures
Authors:
I. Elizabeth Kumar,
Suresh Venkatasubramanian,
Carlos Scheidegger,
Sorelle Friedler
Abstract:
Game-theoretic formulations of feature importance have become popular as a way to "explain" machine learning models. These methods define a cooperative game between the features of a model and distribute influence among these input elements using some form of the game's unique Shapley values. Justification for these methods rests on two pillars: their desirable mathematical properties, and their applicability to specific motivations for explanations. We show that mathematical problems arise when Shapley values are used for feature importance and that the solutions to mitigate these necessarily induce further complexity, such as the need for causal reasoning. We also draw on additional literature to argue that Shapley values do not provide explanations which suit human-centric goals of explainability.
Submitted 30 June, 2020; v1 submitted 25 February, 2020;
originally announced February 2020.
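For readers unfamiliar with the quantity discussed in the abstract above, the following sketch computes exact Shapley values for a small cooperative game by averaging each player's marginal contribution over all orderings. It is a generic textbook construction, not code from the paper; the toy value function `toy_value` and the player names are hypothetical, and full enumeration is only feasible for a handful of players.

```python
from itertools import permutations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values by enumerating every ordering of the players.

    `value` is assumed to map a frozenset of players to a number,
    with value(frozenset()) == 0.
    """
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when joining the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}

def toy_value(coalition):
    # Hypothetical game: "a" and "b" are only useful together; "c" is additive.
    joint = 10.0 if {"a", "b"} <= coalition else 0.0
    solo = 2.0 if "c" in coalition else 0.0
    return joint + solo

if __name__ == "__main__":
    phi = shapley_values(["a", "b", "c"], toy_value)
    print(phi)  # {'a': 5.0, 'b': 5.0, 'c': 2.0}
```

In this toy game the interacting players "a" and "b" split their joint payoff evenly. When features play the role of players, how the value of a partial coalition is defined is itself a modeling choice, which is where the issues raised in the abstract enter.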
-
Anteater: Interactive Visualization of Program Execution Values in Context
Authors:
Rebecca Faust,
Katherine Isaacs,
William Z. Bernstein,
Michael Sharp,
Carlos Scheidegger
Abstract:
Debugging is famously one of the hardest parts of programming. In this paper, we tackle the question: what does a debugging environment look like when we take interactive visualization as a central design principle? We introduce Anteater, an interactive visualization system for tracing and exploring the execution of Python programs. Existing systems often have visualization components built on top of an existing infrastructure. In contrast, Anteater's organization of trace data enables an intermediate representation which can be leveraged to automatically synthesize a variety of visualizations and interactions. These interactive visualizations help with tasks such as discovering important structures in the execution and understanding and debugging unexpected behaviors. To assess the utility of Anteater, we conducted a participant study where programmers completed tasks on their own Python programs using Anteater. Finally, we discuss limitations and where further research is needed.
Submitted 26 February, 2024; v1 submitted 5 July, 2019;
originally announced July 2019.
-
Disentangling Influence: Using Disentangled Representations to Audit Model Predictions
Authors:
Charles T. Marx,
Richard Lanas Phillips,
Sorelle A. Friedler,
Carlos Scheidegger,
Suresh Venkatasubramanian
Abstract:
Motivated by the need to audit complex and black box models, there has been extensive research on quantifying how data features influence model predictions. Feature influence can be direct (a direct influence on model outcomes) and indirect (model outcomes are influenced via proxy features). Feature influence can also be expressed in aggregate over the training or test data or locally with respect to a single point. Current research has typically focused on one of each of these dimensions. In this paper, we develop disentangled influence audits, a procedure to audit the indirect influence of features. Specifically, we show that disentangled representations provide a mechanism to identify proxy features in the dataset, while allowing an explicit computation of feature influence on either individual outcomes or aggregate-level outcomes. We show through both theory and experiments that disentangled influence audits can both detect proxy features and show, for each individual or in aggregate, which of these proxy features affects the classifier being audited the most. In this respect, our method is more powerful than existing methods for ascertaining feature influence.
Submitted 20 June, 2019;
originally announced June 2019.
-
Gaps in Information Access in Social Networks
Authors:
Benjamin Fish,
Ashkan Bashardoust,
danah boyd,
Sorelle A. Friedler,
Carlos Scheidegger,
Suresh Venkatasubramanian
Abstract:
The study of influence maximization in social networks has largely ignored disparate effects these algorithms might have on the individuals contained in the social network. Individuals may place a high value on receiving information, e.g. job openings or advertisements for loans. While well-connected individuals at the center of the network are likely to receive the information that is being distributed through the network, poorly connected individuals are systematically less likely to receive the information, producing a gap in access to the information between individuals. In this work, we study how best to spread information in a social network while minimizing this access gap. We propose to use the maximin social welfare function as an objective function, where we maximize the minimum probability of receiving the information under an intervention. We prove that in this setting this welfare function constrains the access gap whereas maximizing the expected number of nodes reached does not. We also investigate the difficulties of using the maximin, and present hardness results and analysis for standard greedy strategies. Finally, we investigate practical ways of optimizing for the maximin, and give empirical evidence that a simple greedy-based strategy works well in practice.
Submitted 5 March, 2019;
originally announced March 2019.
-
Assessing the Local Interpretability of Machine Learning Models
Authors:
Dylan Slack,
Sorelle A. Friedler,
Carlos Scheidegger,
Chitradeep Dutta Roy
Abstract:
The increasing adoption of machine learning tools has led to calls for accountability via model interpretability. But what does it mean for a machine learning model to be interpretable by humans, and how can this be assessed? We focus on two definitions of interpretability that have been introduced in the machine learning literature: simulatability (a user's ability to run a model on a given input) and "what if" local explainability (a user's ability to correctly determine a model's prediction under local changes to the input, given knowledge of the model's original prediction). Through a user study with 1,000 participants, we test whether humans perform well on tasks that mimic the definitions of simulatability and "what if" local explainability on models that are typically considered locally interpretable. To track the relative interpretability of models, we employ a simple metric, the runtime operation count on the simulatability task. We find evidence that as the number of operations increases, participant accuracy on the local interpretability tasks decreases. In addition, this evidence is consistent with the common intuition that decision trees and logistic regression models are interpretable and are more interpretable than neural networks.
Submitted 2 August, 2019; v1 submitted 9 February, 2019;
originally announced February 2019.
-
Fairness in representation: quantifying stereotyping as a representational harm
Authors:
Mohsen Abbasi,
Sorelle A. Friedler,
Carlos Scheidegger,
Suresh Venkatasubramanian
Abstract:
While harms of allocation have been increasingly studied as part of the subfield of algorithmic fairness, harms of representation have received considerably less attention. In this paper, we formalize two notions of stereotyping and show how they manifest in later allocative harms within the machine learning pipeline. We also propose mitigation strategies and demonstrate their effectiveness on synthetic datasets.
Submitted 28 January, 2019;
originally announced January 2019.
-
NeuralCubes: Deep Representations for Visual Data Exploration
Authors:
Zhe Wang,
Dylan Cashman,
Mingwei Li,
Jixian Li,
Matthew Berger,
Joshua A. Levine,
Remco Chang,
Carlos Scheidegger
Abstract:
Visual exploration of large multidimensional datasets has seen tremendous progress in recent years, allowing users to express rich data queries that produce informative visual summaries, all in real time. Techniques based on data cubes are some of the most promising approaches. However, these techniques usually require a large memory footprint for large datasets. To tackle this problem, we present NeuralCubes: neural networks that predict results for aggregate queries, similar to data cubes. NeuralCubes learns a function that takes as input a given query, for instance, a geographic region and temporal interval, and outputs the result of the query. The learned function serves as a real-time, low-memory approximator for aggregation queries. NeuralCubes models are small enough to be sent to the client side (e.g. the web browser for a web-based application) for evaluation, enabling data exploration of large datasets without database/network connection. We demonstrate the effectiveness of NeuralCubes through extensive experiments on a variety of datasets and discuss how NeuralCubes opens up opportunities for new types of visualization and interaction.
Submitted 10 July, 2019; v1 submitted 27 August, 2018;
originally announced August 2018.
-
Homology-Preserving Dimensionality Reduction via Manifold Landmarking and Tearing
Authors:
Lin Yan,
Yaodong Zhao,
Paul Rosen,
Carlos Scheidegger,
Bei Wang
Abstract:
Dimensionality reduction is an integral part of data visualization. It is a process that obtains a structure preserving low-dimensional representation of the high-dimensional data. Two common criteria can be used to achieve a dimensionality reduction: distance preservation and topology preservation. Inspired by recent work in topological data analysis, we are on the quest for a dimensionality reduction technique that achieves the criterion of homology preservation, a generalized version of topology preservation. Specifically, we are interested in using topology-inspired manifold landmarking and manifold tearing to aid such a process and evaluate their effectiveness.
Submitted 21 June, 2018;
originally announced June 2018.
-
A comparative study of fairness-enhancing interventions in machine learning
Authors:
Sorelle A. Friedler,
Carlos Scheidegger,
Suresh Venkatasubramanian,
Sonam Choudhary,
Evan P. Hamilton,
Derek Roth
Abstract:
Computers are increasingly used to make decisions that have significant impact in people's lives. Often, these predictions can affect different population subgroups disproportionately. As a result, the issue of fairness has received much recent interest, and a number of fairness-enhanced classifiers and predictors have appeared in the literature. This paper seeks to study the following questions: how do these different techniques fundamentally compare to one another, and what accounts for the differences? Specifically, we seek to bring attention to many under-appreciated aspects of such fairness-enhancing interventions. Concretely, we present the results of an open benchmark we have developed that lets us compare a number of different algorithms under a variety of fairness measures, and a large number of existing datasets. We find that although different algorithms tend to prefer specific formulations of fairness preservations, many of these measures strongly correlate with one another. In addition, we find that fairness-preserving algorithms tend to be sensitive to fluctuations in dataset composition (simulated in our benchmark by varying training-test splits), indicating that fairness interventions might be more brittle than previously thought.
Submitted 12 February, 2018;
originally announced February 2018.
-
Persistent Homology Guided Force-Directed Graph Layouts
Authors:
Ashley Suh,
Mustafa Hajij,
Bei Wang,
Carlos Scheidegger,
Paul Rosen
Abstract:
Graphs are commonly used to encode relationships among entities, yet their abstractness makes them difficult to analyze. Node-link diagrams are popular for drawing graphs, and force-directed layouts provide a flexible method for node arrangements that use local relationships in an attempt to reveal the global shape of the graph. However, clutter and overlap of unrelated structures can lead to confusing graph visualizations. This paper leverages the persistent homology features of an undirected graph as derived information for interactive manipulation of force-directed layouts. We first discuss how to efficiently extract 0-dimensional persistent homology features from both weighted and unweighted undirected graphs. We then introduce the interactive persistence barcode used to manipulate the force-directed graph layout. In particular, the user adds and removes contracting and repulsing forces generated by the persistent homology features, eventually selecting the set of persistent homology features that most improve the layout. Finally, we demonstrate the utility of our approach across a variety of synthetic and real datasets.
Submitted 4 October, 2019; v1 submitted 15 December, 2017;
originally announced December 2017.
-
DimReader: Axis lines that explain non-linear projections
Authors:
Rebecca Faust,
David Glickenstein,
Carlos Scheidegger
Abstract:
Non-linear dimensionality reduction (NDR) methods such as LLE and t-SNE are popular with visualization researchers and experienced data analysts, but present serious problems of interpretation. In this paper, we present DimReader, a technique that recovers readable axes from such techniques. DimReader is based on analyzing infinitesimal perturbations of the dataset with respect to variables of interest. The perturbations define exactly how we want to change each point in the original dataset and we measure the effect that these changes have on the projection. The recovered axes are in direct analogy with the axis lines (grid lines) of traditional scatterplots. We also present methods for discovering perturbations on the input data that change the projection the most. The calculation of the perturbations is efficient and easily integrated into programs written in modern programming languages. We present results of DimReader on a variety of NDR methods and datasets both synthetic and real-life, and show how it can be used to compare different NDR methods. Finally, we discuss limitations of our proposal and situations where further research is needed.
Submitted 30 July, 2018; v1 submitted 3 October, 2017;
originally announced October 2017.
-
Visual Detection of Structural Changes in Time-Varying Graphs Using Persistent Homology
Authors:
Mustafa Hajij,
Bei Wang,
Carlos Scheidegger,
Paul Rosen
Abstract:
Topological data analysis is an emerging area in exploratory data analysis and data mining. Its main tool, persistent homology, has become a popular technique to study the structure of complex, high-dimensional data. In this paper, we propose a novel method using persistent homology to quantify structural changes in time-varying graphs. Specifically, we transform each instance of the time-varying graph into a metric space, extract topological features using persistent homology, and compare those features over time. We provide a visualization that assists in time-varying graph exploration and helps to identify patterns of behavior within the data. To validate our approach, we conduct several case studies on real-world data sets and show how our method can find cyclic patterns, deviations from those patterns, and one-time events in time-varying graphs. We also examine whether a persistence-based similarity measure, used as a graph metric, satisfies a set of well-established, desirable properties for graph metrics.
Submitted 2 October, 2017; v1 submitted 20 July, 2017;
originally announced July 2017.
-
Runaway Feedback Loops in Predictive Policing
Authors:
Danielle Ensign,
Sorelle A. Friedler,
Scott Neville,
Carlos Scheidegger,
Suresh Venkatasubramanian
Abstract:
Predictive policing systems are increasingly used to determine how to allocate police across a city in order to best prevent crime. Discovered crime data (e.g., arrest counts) are used to help update the model, and the process is repeated. Such systems have been empirically shown to be susceptible to runaway feedback loops, where police are repeatedly sent back to the same neighborhoods regardless of the true crime rate.
In response, we develop a mathematical model of predictive policing that proves why this feedback loop occurs, show empirically that this model exhibits such problems, and demonstrate how to change the inputs to a predictive policing system (in a black-box manner) so the runaway feedback loop does not occur, allowing the true crime rate to be learned. Our results are quantitative: we can establish a link (in our model) between the degree to which runaway feedback causes problems and the disparity in crime rates between areas. Moreover, we can also demonstrate the way in which reported incidents of crime (those reported by residents) and discovered incidents of crime (i.e., those directly observed by police officers dispatched as a result of the predictive policing algorithm) interact: in brief, while reported incidents can attenuate the degree of runaway feedback, they cannot entirely remove it without the interventions we suggest.
Submitted 21 December, 2017; v1 submitted 29 June, 2017;
originally announced June 2017.
-
On the (im)possibility of fairness
Authors:
Sorelle A. Friedler,
Carlos Scheidegger,
Suresh Venkatasubramanian
Abstract:
What does it mean for an algorithm to be fair? Different papers use different notions of algorithmic fairness, and although these appear internally consistent, they also seem mutually incompatible. We present a mathematical setting in which the distinctions in previous papers can be made formal. In addition to characterizing the spaces of inputs (the "observed" space) and outputs (the "decision" space), we introduce the notion of a construct space: a space that captures unobservable, but meaningful variables for the prediction.
We show that in order to prove desirable properties of the entire decision-making process, different mechanisms for fairness require different assumptions about the nature of the mapping from construct space to decision space. The results in this paper imply that future treatments of algorithmic fairness should more explicitly state assumptions about the relationship between constructs and observations.
Submitted 23 September, 2016;
originally announced September 2016.
-
Auditing Black-box Models for Indirect Influence
Authors:
Philip Adler,
Casey Falk,
Sorelle A. Friedler,
Gabriel Rybeck,
Carlos Scheidegger,
Brandon Smith,
Suresh Venkatasubramanian
Abstract:
Data-trained predictive models see widespread use, but for the most part they are used as black boxes which output a prediction or score. It is therefore hard to acquire a deeper understanding of model behavior, and in particular how different features influence the model prediction. This is important when interpreting the behavior of complex models, or asserting that certain problematic attributes (like race or gender) are not unduly influencing decisions.
In this paper, we present a technique for auditing black-box models, which lets us study the extent to which existing models take advantage of particular features in the dataset, without knowing how the models work. Our work focuses on the problem of indirect influence: how some features might indirectly influence outcomes via other, related features. As a result, we can find attribute influences even in cases where, upon further direct examination of the model, the attribute is not referred to by the model at all.
Our approach does not require the black-box model to be retrained. This is important if (for example) the model is only accessible via an API, and contrasts our work with other methods that investigate feature influence like feature selection. We present experimental evidence for the effectiveness of our procedure using a variety of publicly available datasets and models. We also validate our procedure using techniques from interpretable learning and feature selection, as well as against other black-box auditing procedures.
Submitted 30 November, 2016; v1 submitted 22 February, 2016;
originally announced February 2016.
-
Towards Understanding Enjoyment and Flow in Information Visualization
Authors:
Bahador Saket,
Carlos Scheidegger,
Stephen Kobourov
Abstract:
Traditionally, evaluation studies in information visualization have measured effectiveness by assessing performance time and accuracy. More recently, there has been a concerted effort to understand aspects beyond time and errors. In this paper we study enjoyment, which, while arguably not the primary goal of visualization, has been shown to impact performance and memorability. Different models of enjoyment have been proposed in psychology, education and gaming; yet there is no standard approach to evaluate and measure enjoyment in visualization. In this paper we relate the flow model of Csikszentmihalyi to Munzner's nested model of visualization evaluation and previous work in the area. We suggest that, even though previous papers tackled individual elements of flow, in order to understand what specifically makes a visualization enjoyable, it might be necessary to measure all specific elements.
Submitted 2 March, 2015;
originally announced March 2015.
-
Certifying and removing disparate impact
Authors:
Michael Feldman,
Sorelle Friedler,
John Moeller,
Carlos Scheidegger,
Suresh Venkatasubramanian
Abstract:
What does it mean for an algorithm to be biased? In U.S. law, unintentional bias is encoded via disparate impact, which occurs when a selection process has widely different outcomes for different groups, even as it appears to be neutral. This legal determination hinges on a definition of a protected class (ethnicity, gender, religious practice) and an explicit description of the process.
When the process is implemented using computers, determining disparate impact (and hence bias) is harder. It might not be possible to disclose the process. In addition, even if the process is open, it might be hard to elucidate in a legal setting how the algorithm makes its decisions. Instead of requiring access to the algorithm, we propose making inferences based on the data the algorithm uses.
We make four contributions to this problem. First, we link the legal notion of disparate impact to a measure of classification accuracy that, while known, has received relatively little attention. Second, we propose a test for disparate impact based on analyzing the information leakage of the protected class from the other data attributes. Third, we describe methods by which data might be made unbiased. Finally, we present empirical evidence supporting the effectiveness of our test for disparate impact and our approach for both masking bias and preserving relevant information in the data. Interestingly, our approach resembles some actual selection practices that have recently received legal scrutiny.
Submitted 15 July, 2015; v1 submitted 11 December, 2014;
originally announced December 2014.
-
Vector Field k-Means: Clustering Trajectories by Fitting Multiple Vector Fields
Authors:
Nivan Ferreira,
James T. Klosowski,
Carlos Scheidegger,
Claudio Silva
Abstract:
Scientists study trajectory data to understand trends in movement patterns, such as human mobility for traffic analysis and urban planning. There is a pressing need for scalable and efficient techniques for analyzing this data and discovering the underlying patterns. In this paper, we introduce a novel technique which we call vector-field $k$-means.
The central idea of our approach is to use vector fields to induce a similarity notion between trajectories. Other clustering algorithms seek a representative trajectory that best describes each cluster, much like $k$-means identifies a representative "center" for each cluster. Vector-field $k$-means, on the other hand, recognizes that in all but the simplest examples, no single trajectory adequately describes a cluster. Our approach is based on the premise that movement trends in trajectory data can be modeled as flows within multiple vector fields, and the vector field itself is what defines each of the clusters. We also show how vector-field $k$-means connects techniques for scalar field design on meshes and $k$-means clustering.
We present an algorithm that finds a locally optimal clustering of trajectories into vector fields, and demonstrate how vector-field $k$-means can be used to mine patterns from trajectory data. We present experimental evidence of its effectiveness and efficiency using several datasets, including historical hurricane data, GPS tracks of people and vehicles, and anonymous call records from a large phone company. We compare our results to previous trajectory clustering techniques, and find that our algorithm performs faster in practice than the current state-of-the-art in trajectory clustering, in some examples by a large margin.
Submitted 31 August, 2012; v1 submitted 28 August, 2012;
originally announced August 2012.