-
Realistic Image-to-Image Machine Unlearning via Decoupling and Knowledge Retention
Authors:
Ayush K. Varshney,
Vicenç Torra
Abstract:
Machine Unlearning allows participants to remove their data from a trained machine learning model in order to preserve their privacy, and security. However, the machine unlearning literature for generative models is rather limited. The literature for image-to-image generative model (I2I model) considers minimizing the distance between Gaussian noise and the output of I2I model for forget samples a…
▽ More
Machine Unlearning allows participants to remove their data from a trained machine learning model in order to preserve their privacy, and security. However, the machine unlearning literature for generative models is rather limited. The literature for image-to-image generative model (I2I model) considers minimizing the distance between Gaussian noise and the output of I2I model for forget samples as machine unlearning. However, we argue that the machine learning model performs fairly well on unseen data i.e., a retrained model will be able to catch generic patterns in the data and hence will not generate an output which is equivalent to Gaussian noise. In this paper, we consider that the model after unlearning should treat forget samples as out-of-distribution (OOD) data, i.e., the unlearned model should no longer recognize or encode the specific patterns found in the forget samples. To achieve this, we propose a framework which decouples the model parameters with gradient ascent, ensuring that forget samples are OOD for unlearned model with theoretical guarantee. We also provide $(ε, δ)$-unlearning guarantee for model updates with gradient ascent. The unlearned model is further fine-tuned on the remaining samples to maintain its performance. We also propose an attack model to ensure that the unlearned model has effectively removed the influence of forget samples. Extensive empirical evaluation on two large-scale datasets, ImageNet-1K and Places365 highlights the superiority of our approach. To show comparable performance with retrained model, we also show the comparison of a simple AutoEncoder on various baselines on CIFAR-10 dataset.
△ Less
Submitted 6 February, 2025;
originally announced February 2025.
-
Unlearning Clients, Features and Samples in Vertical Federated Learning
Authors:
Ayush K. Varshney,
Konstantinos Vandikas,
Vicenç Torra
Abstract:
Federated Learning (FL) has emerged as a prominent distributed learning paradigm. Within the scope of privacy preservation, information privacy regulations such as GDPR entitle users to request the removal (or unlearning) of their contribution from a service that is hosting the model. For this purpose, a server hosting an ML model must be able to unlearn certain information in cases such as copyri…
▽ More
Federated Learning (FL) has emerged as a prominent distributed learning paradigm. Within the scope of privacy preservation, information privacy regulations such as GDPR entitle users to request the removal (or unlearning) of their contribution from a service that is hosting the model. For this purpose, a server hosting an ML model must be able to unlearn certain information in cases such as copyright infringement or security issues that can make the model vulnerable or impact the performance of a service based on that model. While most unlearning approaches in FL focus on Horizontal FL (HFL), where clients share the feature space and the global model, Vertical FL (VFL) has received less attention from the research community. VFL involves clients (passive parties) sharing the sample space among them while not having access to the labels. In this paper, we explore unlearning in VFL from three perspectives: unlearning clients, unlearning features, and unlearning samples. To unlearn clients and features we introduce VFU-KD which is based on knowledge distillation (KD) while to unlearn samples, VFU-GA is introduced which is based on gradient ascent. To provide evidence of approximate unlearning, we utilize Membership Inference Attack (MIA) to audit the effectiveness of our unlearning approach. Our experiments across six tabular datasets and two image datasets demonstrate that VFU-KD and VFU-GA achieve performance comparable to or better than both retraining from scratch and the benchmark R2S method in many cases, with improvements of $(0-2\%)$. In the remaining cases, utility scores remain comparable, with a modest utility loss ranging from $1-5\%$. Unlike existing methods, VFU-KD and VFU-GA require no communication between active and passive parties during unlearning. However, they do require the active party to store the previously communicated embeddings.
△ Less
Submitted 23 January, 2025;
originally announced January 2025.
-
Efficient Federated Unlearning under Plausible Deniability
Authors:
Ayush K. Varshney,
Vicenç Torra
Abstract:
Privacy regulations like the GDPR in Europe and the CCPA in the US allow users the right to remove their data ML applications. Machine unlearning addresses this by modifying the ML parameters in order to forget the influence of a specific data point on its weights. Recent literature has highlighted that the contribution from data point(s) can be forged with some other data points in the dataset wi…
▽ More
Privacy regulations like the GDPR in Europe and the CCPA in the US allow users the right to remove their data ML applications. Machine unlearning addresses this by modifying the ML parameters in order to forget the influence of a specific data point on its weights. Recent literature has highlighted that the contribution from data point(s) can be forged with some other data points in the dataset with probability close to one. This allows a server to falsely claim unlearning without actually modifying the model's parameters. However, in distributed paradigms such as FL, where the server lacks access to the dataset and the number of clients are limited, claiming unlearning in such cases becomes a challenge. This paper introduces an efficient way to achieve federated unlearning, by employing a privacy model which allows the FL server to plausibly deny the client's participation in the training up to a certain extent. We demonstrate that the server can generate a Proof-of-Deniability, where each aggregated update can be associated with at least x number of client updates. This enables the server to plausibly deny a client's participation. However, in the event of frequent unlearning requests, the server is required to adopt an unlearning strategy and, accordingly, update its model parameters. We also perturb the client updates in a cluster in order to avoid inference from an honest but curious server. We show that the global model satisfies differential privacy after T number of communication rounds. The proposed methodology has been evaluated on multiple datasets in different privacy settings. The experimental results show that our framework achieves comparable utility while providing a significant reduction in terms of memory (30 times), as well as retraining time (1.6-500769 times). The source code for the paper is available.
△ Less
Submitted 13 October, 2024;
originally announced October 2024.
-
Concept Drift Detection using Ensemble of Integrally Private Models
Authors:
Ayush K. Varshney,
Vicenc Torra
Abstract:
Deep neural networks (DNNs) are one of the most widely used machine learning algorithm. DNNs requires the training data to be available beforehand with true labels. This is not feasible for many real-world problems where data arrives in the streaming form and acquisition of true labels are scarce and expensive. In the literature, not much focus has been given to the privacy prospect of the streami…
▽ More
Deep neural networks (DNNs) are one of the most widely used machine learning algorithm. DNNs requires the training data to be available beforehand with true labels. This is not feasible for many real-world problems where data arrives in the streaming form and acquisition of true labels are scarce and expensive. In the literature, not much focus has been given to the privacy prospect of the streaming data, where data may change its distribution frequently. These concept drifts must be detected privately in order to avoid any disclosure risk from DNNs. Existing privacy models use concept drift detection schemes such ADWIN, KSWIN to detect the drifts. In this paper, we focus on the notion of integrally private DNNs to detect concept drifts. Integrally private DNNs are the models which recur frequently from different datasets. Based on this, we introduce an ensemble methodology which we call 'Integrally Private Drift Detection' (IPDD) method to detect concept drift from private models. Our IPDD method does not require labels to detect drift but assumes true labels are available once the drift has been detected. We have experimented with binary and multi-class synthetic and real-world data. Our experimental results show that our methodology can privately detect concept drift, has comparable utility (even better in some cases) with ADWIN and outperforms utility from different levels of differentially private models. The source code for the paper is available \hyperlink{https://github.com/Ayush-Umu/Concept-drift-detection-Using-Integrally-private-models}{here}.
△ Less
Submitted 7 June, 2024;
originally announced June 2024.
-
The transport problem for non-additive measures
Authors:
Vicenç Torra
Abstract:
Non-additive measures, also known as fuzzy measures, capacities, and monotonic games, are increasingly used in different fields. Applications have been built within computer science and artificial intelligence related to e.g. decision making, image processing, machine learning for both classification, and regression. Tools for measure identification have been built. In short, as non-additive measu…
▽ More
Non-additive measures, also known as fuzzy measures, capacities, and monotonic games, are increasingly used in different fields. Applications have been built within computer science and artificial intelligence related to e.g. decision making, image processing, machine learning for both classification, and regression. Tools for measure identification have been built. In short, as non-additive measures are more general than additive ones (i.e., than probabilities), they have better modeling capabilities allowing to model situations and problems that cannot be modeled by the latter. See e.g. the application of non-additive measures and the Choquet integral to model both Ellsberg paradox and Allais paradox.
Because of that, there is an increasing need to analyze non-additive measures. The need for distances and similarities to compare them is no exception. Some work has been done for defining $f$-divergence for them. In this work we tackle the problem of defining the optimal transport problem for non-additive measures. Distances for pairs of probability distributions based on the optimal transport are extremely used in practical applications, and they are being studied extensively for their mathematical properties. We consider that it is necessary to provide appropriate definitions with a similar flavour, and that generalize the standard ones, for non-additive measures.
We provide definitions based on the Möbius transform, but also based on the $(\max, +)$-transform that we consider that has some advantages. We will discuss in this paper the problems that arise to define the transport problem for non-additive measures, and discuss ways to solve them. In this paper we provide the definitions of the optimal transport problem, and prove some properties.
△ Less
Submitted 8 December, 2022; v1 submitted 22 November, 2022;
originally announced November 2022.
-
Literature Review of the Recent Trends and Applications in various Fuzzy Rule based systems
Authors:
Ayush K. Varshney,
Vicenç Torra
Abstract:
Fuzzy rule based systems (FRBSs) is a rule-based system which uses linguistic fuzzy variables as antecedents and consequent to represent human understandable knowledge. They have been applied to various applications and areas throughout the soft computing literature. However, FRBSs suffers from many drawbacks such as uncertainty representation, high number of rules, interpretability loss, high com…
▽ More
Fuzzy rule based systems (FRBSs) is a rule-based system which uses linguistic fuzzy variables as antecedents and consequent to represent human understandable knowledge. They have been applied to various applications and areas throughout the soft computing literature. However, FRBSs suffers from many drawbacks such as uncertainty representation, high number of rules, interpretability loss, high computational time for learning etc. To overcome these issues with FRBSs, there exists many extensions of FRBSs. This paper presents an overview and literature review of recent trends on various types and prominent areas of fuzzy systems (FRBSs) namely genetic fuzzy system (GFS), hierarchical fuzzy system (HFS), neuro fuzzy system (NFS), evolving fuzzy system (eFS), FRBSs for big data, FRBSs for imbalanced data, interpretability in FRBSs and FRBSs which use cluster centroids as fuzzy rules. The review is for years 2010-2021. This paper also highlights important contributions, publication statistics and current trends in the field. The paper also addresses several open research areas which need further attention from the FRBSs research community.
△ Less
Submitted 16 May, 2023; v1 submitted 15 September, 2022;
originally announced September 2022.
-
Next Generation Resilient Cyber-Physical Systems
Authors:
Michel Barbeau,
Georg Carle,
Joaquin Garcia-Alfaro,
Vicenç Torra
Abstract:
Cyber-Physical Systems (CPS) consist of distributed engineered environments where the monitoring and surveillance tasks are governed by tightly integrated computing, communication and control technologies. CPS are omnipresent in our everyday life. Hacking and failures of such systems have impact on critical services with potentially significant and lasting consequences. In this paper, we review wh…
▽ More
Cyber-Physical Systems (CPS) consist of distributed engineered environments where the monitoring and surveillance tasks are governed by tightly integrated computing, communication and control technologies. CPS are omnipresent in our everyday life. Hacking and failures of such systems have impact on critical services with potentially significant and lasting consequences. In this paper, we review which requirements a CPS must meet to address the challenges of tomorrow. Two key challenges are understanding and reinforcing the resilience of CPS.
△ Less
Submitted 8 November, 2019; v1 submitted 20 July, 2019;
originally announced July 2019.
-
Privacy by design in big data: An overview of privacy enhancing technologies in the era of big data analytics
Authors:
Giuseppe D'Acquisto,
Josep Domingo-Ferrer,
Panayiotis Kikiras,
Vicenç Torra,
Yves-Alexandre de Montjoye,
Athena Bourka
Abstract:
The extensive collection and processing of personal information in big data analytics has given rise to serious privacy concerns, related to wide scale electronic surveillance, profiling, and disclosure of private data. To reap the benefits of analytics without invading the individuals' private sphere, it is essential to draw the limits of big data processing and integrate data protection safeguar…
▽ More
The extensive collection and processing of personal information in big data analytics has given rise to serious privacy concerns, related to wide scale electronic surveillance, profiling, and disclosure of private data. To reap the benefits of analytics without invading the individuals' private sphere, it is essential to draw the limits of big data processing and integrate data protection safeguards in the analytics value chain. ENISA, with the current report, supports this approach and the position that the challenges of technology (for big data) should be addressed by the opportunities of technology (for privacy).
We first explain the need to shift from "big data versus privacy" to "big data with privacy". In this respect, the concept of privacy by design is key to identify the privacy requirements early in the big data analytics value chain and in subsequently implementing the necessary technical and organizational measures.
After an analysis of the proposed privacy by design strategies in the different phases of the big data value chain, we review privacy enhancing technologies of special interest for the current and future big data landscape. In particular, we discuss anonymization, the "traditional" analytics technique, the emerging area of encrypted search and privacy preserving computations, granular access control mechanisms, policy enforcement and accountability, as well as data provenance issues. Moreover, new transparency and access tools in big data are explored, together with techniques for user empowerment and control.
Achieving "big data with privacy" is no easy task and a lot of research and implementation is still needed. Yet, it remains a possible task, as long as all the involved stakeholders take the necessary steps to integrate privacy and data protection safeguards in the heart of big data, by design and by default.
△ Less
Submitted 18 December, 2015;
originally announced December 2015.
-
The effect of constraints on information loss and risk for clustering and modification based graph anonymization methods
Authors:
David F. Nettleton,
Vicenc Torra,
Anton Dries
Abstract:
In this paper we present a novel approach for anonymizing Online Social Network graphs which can be used in conjunction with existing perturbation approaches such as clustering and modification. The main insight of this paper is that by imposing additional constraints on which nodes can be selected we can reduce the information loss with respect to key structural metrics, while maintaining an acce…
▽ More
In this paper we present a novel approach for anonymizing Online Social Network graphs which can be used in conjunction with existing perturbation approaches such as clustering and modification. The main insight of this paper is that by imposing additional constraints on which nodes can be selected we can reduce the information loss with respect to key structural metrics, while maintaining an acceptable risk. We present and evaluate two constraints, 'local1' and 'local2' which select the most similar subgraphs within the same community while excluding some key structural nodes. To this end, we introduce a novel distance metric based on local subgraph characteristics and which is calibrated using an isomorphism matcher. Empirical testing is conducted with three real OSN datasets, six information loss measures, five adversary queries as risk measures, and different levels of k-anonymity. The results show that overall, the methods with constraints give the best results for information loss and risk of disclosure.
△ Less
Submitted 1 July, 2014; v1 submitted 2 January, 2014;
originally announced January 2014.
-
Evolutionary Algorithm for Graph Anonymization
Authors:
Jordi Casas-Roma,
Jordi Herrera-Joancomartí,
Vicenç Torra
Abstract:
In recent years there has been a significant increase in the use of graphs as a tool for representing information. It is very important to preserve the privacy of users when one wants to publish this information, especially in the case of social graphs. In this case, it is essential to implement an anonymization process in the data in order to preserve users' privacy. In this paper we present an a…
▽ More
In recent years there has been a significant increase in the use of graphs as a tool for representing information. It is very important to preserve the privacy of users when one wants to publish this information, especially in the case of social graphs. In this case, it is essential to implement an anonymization process in the data in order to preserve users' privacy. In this paper we present an algorithm for graph anonymization, called Evolutionary Algorithm for Graph Anonymization (EAGA), based on edge modifications to preserve the k-anonymity model.
△ Less
Submitted 26 March, 2014; v1 submitted 1 October, 2013;
originally announced October 2013.
-
A formalization of re-identification in terms of compatible probabilities
Authors:
Vicenç Torra,
Klara Stokes
Abstract:
Re-identification algorithms are used in data privacy to measure disclosure risk. They model the situation in which an adversary attacks a published database by means of linking the information of this adversary with the database.
In this paper we formalize this type of algorithm in terms of true probabilities and compatible belief functions. The purpose of this work is to leave aside as re-iden…
▽ More
Re-identification algorithms are used in data privacy to measure disclosure risk. They model the situation in which an adversary attacks a published database by means of linking the information of this adversary with the database.
In this paper we formalize this type of algorithm in terms of true probabilities and compatible belief functions. The purpose of this work is to leave aside as re-identification algorithms those algorithms that do not satisfy a minimum requirement.
△ Less
Submitted 21 January, 2013;
originally announced January 2013.
-
Reidentification and k-anonymity: a model for disclosure risk in graphs
Authors:
Klara Stokes,
Vicenç Torra
Abstract:
In this article we provide a formal framework for reidentification in general. We define n-confusion as a concept for modelling the anonymity of a database table and we prove that n-confusion is a generalization of k- anonymity. After a short survey on the different available definitions of k- anonymity for graphs we provide a new definition for k-anonymous graph, which we consider to be the corre…
▽ More
In this article we provide a formal framework for reidentification in general. We define n-confusion as a concept for modelling the anonymity of a database table and we prove that n-confusion is a generalization of k- anonymity. After a short survey on the different available definitions of k- anonymity for graphs we provide a new definition for k-anonymous graph, which we consider to be the correct definition. We provide a description of the k-anonymous graphs, both for the regular and the non-regular case. We also introduce the more flexible concept of (k,l)-anonymous graph. Our definition of (k,l)-anonymous graph is meant to replace a previous definition of (k, l)-anonymous graph, which we here prove to have severe weaknesses. Finally we provide a set of algorithms for k-anonymization of graphs.
△ Less
Submitted 12 March, 2012; v1 submitted 8 December, 2011;
originally announced December 2011.