PMDG: Privacy for Multi-Perspective Process Mining through Data Generalization
Authors:
Ryan Hildebrant,
Stephan A. Fahrenkrog-Petersen,
Matthias Weidlich,
Shangping Ren
Abstract:
Anonymization of event logs facilitates process mining while protecting sensitive information of process stakeholders. Existing techniques, however, focus on the privatization of the control-flow. Other process perspectives, such as roles, resources, and objects, are neglected or subject to randomization, which breaks the dependencies between the perspectives. Hence, existing techniques are not suited for advanced process mining tasks, e.g., social network mining or predictive monitoring. To address this gap, we propose PMDG, a framework to ensure privacy for multi-perspective process mining through data generalization. It provides group-based privacy guarantees for an event log, while preserving the characteristic dependencies between the control-flow and further process perspectives. Unlike existing privatization techniques that rely on data suppression or noise insertion, PMDG adopts data generalization: a technique where the activities and attribute values referenced in events are generalized into more abstract ones, to obtain equivalence classes that are sufficiently large from a privacy point of view. We demonstrate empirically that PMDG outperforms state-of-the-art anonymization techniques when mining handovers and predicting outcomes.
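The idea of generalizing values until equivalence classes are large enough can be illustrated with a minimal sketch. This is not PMDG's algorithm; the hierarchy `HIERARCHY`, the toy log, and the helper names `generalize` and `smallest_class` are hypothetical and chosen only for illustration. The sketch assumes a trace is a tuple of (activity, resource) events and that two traces fall into the same equivalence class when they are identical after generalization.

```python
from collections import Counter

# Hypothetical generalization hierarchy (not from the paper):
# each concrete resource maps to a more abstract role.
HIERARCHY = {
    "Nurse A": "Nurse", "Nurse B": "Nurse",
    "Dr. X": "Physician", "Dr. Y": "Physician",
}

def generalize(trace, hierarchy):
    """Replace the resource of every (activity, resource) event by its
    generalized value; values without an entry are left unchanged."""
    return tuple((act, hierarchy.get(res, res)) for act, res in trace)

def smallest_class(traces):
    """Size of the smallest equivalence class, i.e. the smallest group
    of identical traces in the log."""
    return min(Counter(traces).values())

# Toy event log: three traces, each unique before generalization.
log = [
    (("Register", "Nurse A"), ("Treat", "Dr. X")),
    (("Register", "Nurse B"), ("Treat", "Dr. Y")),
    (("Register", "Nurse A"), ("Treat", "Dr. Y")),
]
generalized = [generalize(t, HIERARCHY) for t in log]

print(smallest_class(log))          # every trace is its own class
print(smallest_class(generalized))  # all traces share one class
```

In this toy log the activity-to-role dependency (nurses register, physicians treat) survives generalization, which is the kind of cross-perspective dependency the abstract says randomization would break.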
Submitted 1 May, 2023;
originally announced May 2023.
Towards Better Bounds for Finding Quasi-Identifiers
Authors:
Ryan Hildebrant,
Quoc-Tung Le,
Duy-Hoang Ta,
Hoa T. Vu
Abstract:
We revisit the problem of finding small $ε$-separation keys introduced by Motwani and Xu (2008). In this problem, the input is $m$-dimensional tuples $x_1,x_2,\ldots,x_n$. The goal is to find a small subset of coordinates that separates at least $(1-ε){n \choose 2}$ pairs of tuples. They provided a fast algorithm that runs on $Θ(m/ε)$ tuples sampled uniformly at random. We show that the sample size can be improved to $Θ(m/\sqrt{ε})$. Our algorithm also enjoys a faster running time. To obtain this result, we provide upper and lower bounds on the sample size to solve the following decision problem. Given a subset of coordinates $A$, reject if $A$ separates fewer than $(1-ε){n \choose 2}$ pairs, and accept if $A$ separates all pairs. The algorithm must be correct with probability at least $1-δ$ for all $A$. We show that for algorithms based on sampling:
- $Θ(m/\sqrt{ε})$ samples are sufficient and necessary so that $δ\leq e^{-m}$ and
- $Ω(\sqrt{\frac{\log m}{ε}})$ samples are necessary so that $δ$ is a constant.
Our analysis is based on a constrained version of the balls-into-bins problem. We believe our analysis may be of independent interest. We also study a related problem that asks for the following sketching algorithm: with given parameters $α,k$ and $ε$, the algorithm takes a subset of coordinates $A$ of size at most $k$ and returns an estimate of the number of unseparated pairs in $A$ up to a $(1\pmε)$ factor if it is at least $α{n \choose 2}$. We show that even for constant $α$ and success probability, such a sketching algorithm must use $Ω(mk \log ε^{-1})$ bits of space; on the other hand, uniform sampling yields a sketch of size $Θ(\frac{mk \log m}{αε^2})$ for this purpose.
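The sampling-based decision problem described above can be sketched as follows. This is an illustrative tester, not the authors' algorithm: the function names `unseparated_pairs` and `sample_test` are hypothetical, sampling without replacement is an assumption, and the sample size is left as a parameter rather than set to the bounds stated in the abstract. The tester accepts a coordinate subset `coords` exactly when it separates every pair within the sample, so a subset that separates all pairs of the full input is always accepted.

```python
import random
from collections import Counter

def unseparated_pairs(tuples, coords):
    """Count pairs of tuples that agree on every coordinate in coords.
    Tuples with the same projection form a group of size v, which
    contributes v * (v - 1) / 2 unseparated pairs."""
    groups = Counter(tuple(t[c] for c in coords) for t in tuples)
    return sum(v * (v - 1) // 2 for v in groups.values())

def sample_test(tuples, coords, sample_size, rng):
    """Accept (return True) if coords separates all pairs in a uniform
    sample drawn without replacement (an assumption of this sketch)."""
    sample = rng.sample(tuples, min(sample_size, len(tuples)))
    return unseparated_pairs(sample, coords) == 0

# Toy input: coordinate 0 is unique per tuple, coordinate 1 is binary.
data = [(i, i % 2) for i in range(10)]
rng = random.Random(0)

print(unseparated_pairs(data, [0]))   # coordinate 0 separates all pairs
print(unseparated_pairs(data, [1]))   # two groups of five tuples each
print(sample_test(data, [0], 3, rng))
print(sample_test(data, [1], 3, rng))
```

Grouping by projection avoids enumerating all ${n \choose 2}$ pairs explicitly; the same counting routine applied to a uniform sample is one simple way to estimate the number of unseparated pairs, in the spirit of the sampling-based sketch mentioned in the abstract.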
Submitted 13 April, 2023; v1 submitted 24 November, 2022;
originally announced November 2022.