-
Prink: $k_s$-Anonymization for Streaming Data in Apache Flink
Authors:
Philip Groneberg,
Saskia Nuñez von Voigt,
Thomas Janke,
Louis Loechel,
Karl Wolf,
Elias Grünewald,
Frank Pallas
Abstract:
In this paper, we present Prink, a novel and practically applicable concept and fully implemented prototype for $k_s$-anonymizing data streams in real-world application architectures. Building upon the pre-existing, yet rudimentary CASTLE scheme, Prink for the first time introduces semantics-aware $k_s$-anonymization of non-numerical (such as categorical or hierarchically generalizable) streaming data in an information loss-optimized manner. In addition, it provides native integration into Apache Flink, one of the prevailing frameworks for enterprise-grade stream data processing in numerous application domains.
Our contributions advance the previously established state of the art for privacy-guarantee-providing anonymization of streaming data in that they 1) allow non-numerical data to be included in the anonymization process, 2) provide discrete datapoints instead of aggregates, thereby facilitating flexible data use, 3) are applicable in real-world system contexts with minimal integration effort, and 4) are experimentally shown to incur acceptable performance overheads and information loss in realistic settings. With these characteristics, Prink provides an anonymization approach that is practically feasible for a broad variety of real-world, enterprise-grade stream processing applications and environments.
Submitted 19 May, 2025;
originally announced May 2025.
-
An applied Perspective: Estimating the Differential Identifiability Risk of an Exemplary SOEP Data Set
Authors:
Jonas Allmann,
Saskia Nuñez von Voigt,
Florian Tschorsch
Abstract:
Using real-world study data usually requires contractual agreements where research results may only be published in anonymized form. Requiring formal privacy guarantees, such as differential privacy, could be helpful for data-driven projects to comply with data protection. However, deploying differential privacy in consumer use cases raises the need to explain its underlying mechanisms and the resulting privacy guarantees. In this paper, we thoroughly review and extend an existing privacy metric. We show how to compute this risk metric efficiently for a set of basic statistical queries. Our empirical analysis based on an extensive, real-world scientific data set expands the knowledge on how to compute risks under realistic conditions, while presenting more challenges than solutions.
Submitted 4 July, 2024;
originally announced July 2024.
-
Illustrating an Effective Workflow for Accelerated Materials Discovery
Authors:
Mrinalini Mulukutla,
A. Nicole Person,
Sven Voigt,
Lindsey Kuettner,
Branden Kappes,
Danial Khatamsaz,
Robert Robinson,
Daniel Salas,
Wenle Xu,
Daniel Lewis,
Hongkyu Eoh,
Kailu Xiao,
Haoren Wang,
Jaskaran Singh Saini,
Raj Mahat,
Trevor Hastings,
Matthew Skokan,
Vahid Attari,
Michael Elverud,
James D. Paramore,
Brady Butler,
Kenneth Vecchio,
Surya R. Kalidindi,
Douglas Allaire,
Ibrahim Karaman
, et al. (4 additional authors not shown)
Abstract:
Algorithmic materials discovery is a multi-disciplinary domain that integrates insights from specialists in alloy design, synthesis, characterization, experimental methodologies, computational modeling, and optimization. Central to this effort is a robust data management system paired with an interactive work platform. This platform should empower users to not only access others' data but also integrate their analyses, paving the way for sophisticated data pipelines. To realize this vision, there is a need for an integrative collaboration platform, streamlined data sharing and analysis tools, and efficient communication channels. Such a collaborative mechanism should transcend geographical barriers, facilitating remote interaction and fostering a challenge-response dynamic. In this paper, we present our ongoing efforts in addressing the critical challenges related to an accelerated Materials Discovery Framework as a part of the High-Throughput Materials Discovery for Extreme Conditions Initiative. Our BIRDSHOT Center has successfully harnessed various tools and strategies, including the utilization of cloud-based storage, a standardized sample naming convention, a structured file system, the implementation of sample travelers, a robust sample tracking method, and the incorporation of knowledge graphs for efficient data management. Additionally, we present the development of a data collection platform, reinforcing seamless collaboration among our team members. In summary, this paper provides an illustration and insight into the various elements of an efficient and effective workflow within an accelerated materials discovery framework while highlighting the dynamic and adaptable nature of the data management tools and sharing platforms.
Submitted 21 May, 2024;
originally announced May 2024.
-
From Theory to Comprehension: A Comparative Study of Differential Privacy and $k$-Anonymity
Authors:
Saskia Nuñez von Voigt,
Luise Mehner,
Florian Tschorsch
Abstract:
The notion of $\varepsilon$-differential privacy is a widely used concept of providing quantifiable privacy to individuals. However, it is unclear how to explain the level of privacy protection provided by a differential privacy mechanism with a set $\varepsilon$. In this study, we focus on users' comprehension of the privacy protection provided by a differential privacy mechanism. To do so, we study three variants of explaining the privacy protection provided by differential privacy: (1) the original mathematical definition; (2) $\varepsilon$ translated into a specific privacy risk; and (3) an explanation using the randomized response technique. We compare users' comprehension of privacy protection employing these explanatory models with their comprehension of privacy protection of $k$-anonymity as baseline comprehensibility. Our findings suggest that participants' comprehension of differential privacy protection is enhanced by the privacy risk model and the randomized response-based model. Moreover, our results confirm our intuition that privacy protection provided by $k$-anonymity is more comprehensible.
Submitted 5 April, 2024;
originally announced April 2024.
-
Towards Standardized Mobility Reports with User-Level Privacy
Authors:
Alexandra Kapp,
Saskia Nuñez von Voigt,
Helena Mihaljević,
Florian Tschorsch
Abstract:
The importance of human mobility analyses is growing in both research and practice, especially as applications for urban planning and mobility rely on them. Aggregate statistics and visualizations play an essential role as building blocks of data explorations and summary reports, the latter being increasingly released to third parties such as municipal administrations or in the context of citizen participation. However, such explorations already pose a threat to privacy as they reveal potentially sensitive location information, and thus should not be shared without further privacy measures.
There is a substantial gap between state-of-the-art research on privacy methods and their utilization in practice. We thus conceptualize a standardized mobility report with differential privacy guarantees and implement it as open-source software to enable a privacy-preserving exploration of key aspects of mobility data in an easily accessible way. Moreover, we evaluate the benefits of limiting user contributions using three data sets relevant to research and practice. Our results show that even a strong limit on user contribution alters the original geospatial distribution only within a comparatively small range, while significantly reducing the error introduced by adding noise to achieve privacy guarantees.
Submitted 19 September, 2022;
originally announced September 2022.
-
"Am I Private and If So, how Many?" - Communicating Privacy Guarantees of Differential Privacy with Risk Communication Formats
Authors:
Daniel Franzen,
Saskia Nuñez von Voigt,
Peter Sörries,
Florian Tschorsch,
Claudia Müller-Birn
Abstract:
Decisions about sharing personal information are not trivial, since there are many legitimate and important purposes for such data collection, but often the collected data can reveal sensitive information about individuals. Privacy-preserving technologies, such as differential privacy (DP), can be employed to protect the privacy of individuals and, furthermore, provide mathematically sound guarantees on the maximum privacy risk. However, they can only support informed privacy decisions, if individuals understand the provided privacy guarantees. This article proposes a novel approach for communicating privacy guarantees to support individuals in their privacy decisions when sharing data. For this, we adopt risk communication formats from the medical domain in conjunction with a model for privacy guarantees of DP to create quantitative privacy risk notifications. We conducted a crowd-sourced study with 343 participants to evaluate how well our notifications conveyed the privacy risk information and how confident participants were about their own understanding of the privacy risk. Our findings suggest that these new notifications can communicate the objective information similarly well to currently used qualitative notifications, but left individuals less confident in their understanding. We also discovered that several of our notifications and the currently used qualitative notification disadvantage individuals with low numeracy: these individuals appear overconfident compared to their actual understanding of the associated privacy risks and are, therefore, less likely to seek the needed additional information before an informed decision. The promising results allow for multiple directions in future research, for example, adding visual aids or tailoring privacy risk communication to characteristics of the individuals.
Submitted 23 August, 2022;
originally announced August 2022.
-
"Am I Private and If So, how Many?" -- Using Risk Communication Formats for Making Differential Privacy Understandable
Authors:
Daniel Franzen,
Saskia Nuñez von Voigt,
Peter Sörries,
Florian Tschorsch,
Claudia Müller-Birn
Abstract:
Mobility data is essential for cities and communities to identify areas for necessary improvement. Data collected by mobility providers already contains all the information necessary, but privacy of the individuals needs to be preserved. Differential privacy (DP) defines a mathematical property which guarantees that certain limits of privacy are preserved while sharing such data, but its functionality and privacy protection are difficult to explain to laypeople. In this paper, we adapt risk communication formats in conjunction with a model for the privacy risks of DP. The results are privacy notifications which explain the risk to an individual's privacy when using DP, rather than DP's functionality. We evaluate these novel privacy communication formats in a crowdsourced study. We find that they perform similarly to the best-performing DP communications used currently in terms of objective understanding, but did not make our participants as confident in their understanding. We also discovered an influence, similar to the Dunning-Kruger effect, of statistical numeracy on the effectiveness of some of our privacy communication formats and the DP communication format used currently. These results generate hypotheses in multiple directions, for example, toward the use of risk visualization to improve the understandability of our formats or toward adaptive user interfaces which tailor the risk communication to the characteristics of the reader.
Submitted 22 June, 2023; v1 submitted 8 April, 2022;
originally announced April 2022.
-
Self-Determined Reciprocal Recommender System with Strong Privacy Guarantees
Authors:
S. Nuñez von Voigt,
E. Daniel,
F. Tschorsch
Abstract:
Recommender systems are widely used. Usually, recommender systems are based on a centralized client-server architecture. However, this approach implies drawbacks regarding the privacy of users. In this paper, we propose a distributed reciprocal recommender system with strong, self-determined privacy guarantees, i.e., local differential privacy. More precisely, users randomize their profiles locally and exchange them via a peer-to-peer network. Recommendations are then computed and ranked locally by estimating similarities between profiles. We evaluate recommendation accuracy of a job recommender system and demonstrate that our method provides acceptable utility under strong privacy requirements.
Submitted 14 July, 2021;
originally announced July 2021.
-
Privacy and Confidentiality in Process Mining -- Threats and Research Challenges
Authors:
Gamal Elkoumy,
Stephan A. Fahrenkrog-Petersen,
Mohammadreza Fani Sani,
Agnes Koschmider,
Felix Mannhardt,
Saskia Nuñez von Voigt,
Majid Rafiei,
Leopold von Waldthausen
Abstract:
Privacy and confidentiality are very important prerequisites for applying process mining in order to comply with regulations and keep company secrets. This paper provides a foundation for future research on privacy-preserving and confidential process mining techniques. Main threats are identified and related to a motivating application scenario in a hospital context as well as to the current body of work on privacy and confidentiality in process mining. A newly developed conceptual model structures the discussion that existing techniques leave room for improvement. This results in a number of important research challenges that should be addressed by future process mining research.
Submitted 1 June, 2021;
originally announced June 2021.
-
Signal Processing Challenges and Examples for {\it in-situ} Transmission Electron Microscopy
Authors:
Josh Kacher,
Yao Xie,
Sven P. Voigt,
Shixiang Zhu,
Henry Yuchi,
Jordan Key,
Surya R. Kalidindi
Abstract:
Transmission Electron Microscopy (TEM) is a powerful tool for imaging material structure and characterizing material chemistry. Recent advances in data collection technology for TEM have enabled high-volume and high-resolution data collection at a microsecond frame rate. Taking advantage of these advances in data collection rates requires the development and application of data processing tools, including image analysis, feature extraction, and streaming data processing techniques. In this paper, we highlight a few areas in materials science that have benefited from combining signal processing and statistical analysis with data collection capabilities in TEM and present a future outlook on opportunities of integrating signal processing with automated TEM data analysis.
Submitted 20 August, 2021; v1 submitted 17 April, 2021;
originally announced April 2021.
-
Every Query Counts: Analyzing the Privacy Loss of Exploratory Data Analyses
Authors:
Saskia Nuñez von Voigt,
Mira Pauli,
Johanna Reichert,
Florian Tschorsch
Abstract:
An exploratory data analysis is an essential step for every data analyst to gain insights, evaluate data quality and (if required) select a machine learning model for further processing. While privacy-preserving machine learning is on the rise, more often than not this initial analysis is not counted towards the privacy budget. In this paper, we quantify the privacy loss for basic statistical functions and highlight the importance of taking it into account when calculating the privacy-loss budget of a machine learning approach.
Submitted 27 August, 2020;
originally announced August 2020.
-
Quantifying the Re-identification Risk of Event Logs for Process Mining
Authors:
S. Nuñez von Voigt,
S. A. Fahrenkrog-Petersen,
D. Janssen,
A. Koschmider,
F. Tschorsch,
F. Mannhardt,
O. Landsiedel,
M. Weidlich
Abstract:
Event logs recorded during the execution of business processes constitute a valuable source of information. Applying process mining techniques to them, event logs may reveal the actual process execution and enable reasoning on quantitative or qualitative process properties. However, event logs often contain sensitive information that could be related to individual process stakeholders through background information and cross-correlation. We therefore argue that, when publishing event logs, the risk of such re-identification attacks must be considered. In this paper, we show how to quantify the re-identification risk with measures for the individual uniqueness in event logs. We also report on a large-scale study that explored the individual uniqueness in a collection of publicly available event logs. Our results suggest that potentially up to all of the cases in an event log may be re-identified, which highlights the importance of privacy-preserving techniques in process mining.
Submitted 19 June, 2020; v1 submitted 24 March, 2020;
originally announced March 2020.
-
Building Trust Takes Time: Limits to Arbitrage for Blockchain-Based Assets
Authors:
Nikolaus Hautsch,
Christoph Scheuch,
Stefan Voigt
Abstract:
A blockchain replaces central counterparties with time-consuming consensus protocols to record the transfer of ownership. This settlement latency slows cross-exchange trading, exposing arbitrageurs to price risk. Off-chain settlement, instead, exposes arbitrageurs to costly default risk. We show with Bitcoin network and order book data that cross-exchange price differences coincide with periods of high settlement latency, asset flows chase arbitrage opportunities, and price differences across exchanges with low default risk are smaller. Blockchain-based trading thus faces a dilemma: Reliable consensus protocols require time-consuming settlement latency, leading to arbitrage limits. Circumventing such arbitrage costs is possible only by reinstalling trusted intermediation, which mitigates default risk.
Submitted 19 October, 2023; v1 submitted 3 December, 2018;
originally announced December 2018.
-
Large-Scale Portfolio Allocation Under Transaction Costs and Model Uncertainty
Authors:
Nikolaus Hautsch,
Stefan Voigt
Abstract:
We theoretically and empirically study portfolio optimization under transaction costs and establish a link between turnover penalization and covariance shrinkage with the penalization governed by transaction costs. We show how the ex ante incorporation of transaction costs shifts optimal portfolios towards regularized versions of efficient allocations. The regulatory effect of transaction costs is studied in an econometric setting incorporating parameter uncertainty and optimally combining predictive distributions resulting from high-frequency and low-frequency data. In an extensive empirical study, we illustrate that turnover penalization is more effective than commonly employed shrinkage methods and is crucial in order to construct empirically well-performing portfolios.
Submitted 21 June, 2018; v1 submitted 19 September, 2017;
originally announced September 2017.