-
A Weighted Loss Approach to Robust Federated Learning under Data Heterogeneity
Authors:
Johan Erbani,
Sonia Ben Mokhtar,
Pierre-Edouard Portier,
Elod Egyed-Zsigmond,
Diana Nurbakova
Abstract:
Federated learning (FL) is a machine learning paradigm that enables multiple data holders to collaboratively train a machine learning model without sharing their training data with external parties. In this paradigm, workers locally update a model and share with a central server their updated gradients (or model parameters). While FL seems appealing from a privacy perspective, it opens a number of…
▽ More
Federated learning (FL) is a machine learning paradigm that enables multiple data holders to collaboratively train a machine learning model without sharing their training data with external parties. In this paradigm, workers locally update a model and share with a central server their updated gradients (or model parameters). While FL seems appealing from a privacy perspective, it opens a number of threats from a security perspective as (Byzantine) participants can contribute poisonous gradients (or model parameters) harming model convergence. Byzantine-resilient FL addresses this issue by ensuring that the training proceeds as if Byzantine participants were absent. Towards this purpose, common strategies ignore outlier gradients during model aggregation, assuming that Byzantine gradients deviate more from honest gradients than honest gradients do from each other. However, in heterogeneous settings, honest gradients may differ significantly, making it difficult to distinguish honest outliers from Byzantine ones. In this paper, we introduce the Worker Label Alignement Loss (WoLA), a weighted loss that aligns honest worker gradients despite data heterogeneity, which facilitates the identification of Byzantines' gradients. This approach significantly outperforms state-of-the-art methods in heterogeneous settings. In this paper, we provide both theoretical insights and empirical evidence of its effectiveness.
△ Less
Submitted 12 June, 2025; v1 submitted 11 June, 2025;
originally announced June 2025.
-
Dropout-Robust Mechanisms for Differentially Private and Fully Decentralized Mean Estimation
Authors:
César Sabater,
Sonia Ben Mokhtar,
Jan Ramon
Abstract:
Achieving differentially private computations in decentralized settings poses significant challenges, particularly regarding accuracy, communication cost, and robustness against information leakage. While cryptographic solutions offer promise, they often suffer from high communication overhead or require centralization in the presence of network failures. Conversely, existing fully decentralized a…
▽ More
Achieving differentially private computations in decentralized settings poses significant challenges, particularly regarding accuracy, communication cost, and robustness against information leakage. While cryptographic solutions offer promise, they often suffer from high communication overhead or require centralization in the presence of network failures. Conversely, existing fully decentralized approaches typically rely on relaxed adversarial models or pairwise noise cancellation, the latter suffering from substantial accuracy degradation if parties unexpectedly disconnect. In this work, we propose IncA, a new protocol for fully decentralized mean estimation, a widely used primitive in data-intensive processing. Our protocol, which enforces differential privacy, requires no central orchestration and employs low-variance correlated noise, achieved by incrementally injecting sensitive information into the computation. First, we theoretically demonstrate that, when no parties permanently disconnect, our protocol achieves accuracy comparable to that of a centralized setting-already an improvement over most existing decentralized differentially private techniques. Second, we empirically show that our use of low-variance correlated noise significantly mitigates the accuracy loss experienced by existing techniques in the presence of dropouts.
△ Less
Submitted 4 June, 2025;
originally announced June 2025.
-
GRANITE : a Byzantine-Resilient Dynamic Gossip Learning Framework
Authors:
Yacine Belal,
Mohamed Maouche,
Sonia Ben Mokhtar,
Anthony Simonet-Boulogne
Abstract:
Gossip Learning (GL) is a decentralized learning paradigm where users iteratively exchange and aggregate models with a small set of neighboring peers. Recent GL approaches rely on dynamic communication graphs built and maintained using Random Peer Sampling (RPS) protocols. Thanks to graph dynamics, GL can achieve fast convergence even over extremely sparse topologies. However, the robustness of GL…
▽ More
Gossip Learning (GL) is a decentralized learning paradigm where users iteratively exchange and aggregate models with a small set of neighboring peers. Recent GL approaches rely on dynamic communication graphs built and maintained using Random Peer Sampling (RPS) protocols. Thanks to graph dynamics, GL can achieve fast convergence even over extremely sparse topologies. However, the robustness of GL over dy- namic graphs to Byzantine (model poisoning) attacks remains unaddressed especially when Byzantine nodes attack the RPS protocol to scale up model poisoning. We address this issue by introducing GRANITE, a framework for robust learning over sparse, dynamic graphs in the presence of a fraction of Byzantine nodes. GRANITE relies on two key components (i) a History-aware Byzantine-resilient Peer Sampling protocol (HaPS), which tracks previously encountered identifiers to reduce adversarial influence over time, and (ii) an Adaptive Probabilistic Threshold (APT), which leverages an estimate of Byzantine presence to set aggregation thresholds with formal guarantees. Empirical results confirm that GRANITE maintains convergence with up to 30% Byzantine nodes, improves learning speed via adaptive filtering of poisoned models and obtains these results in up to 9 times sparser graphs than dictated by current theory.
△ Less
Submitted 24 April, 2025;
originally announced April 2025.
-
COoL-TEE: Client-TEE Collaboration for Resilient Distributed Search
Authors:
Matthieu Bettinger,
Etienne Rivière,
Sonia Ben Mokhtar,
Anthony Simonet-Boulogne
Abstract:
Current marketplaces rely on search mechanisms with distributed systems but centralized governance, making them vulnerable to attacks, failures, censorship and biases. While search mechanisms with more decentralized governance (e.g., DeSearch) have been recently proposed, these are still exposed to information head-start attacks (IHS) despite the use of Trusted Execution Environments (TEEs). These…
▽ More
Current marketplaces rely on search mechanisms with distributed systems but centralized governance, making them vulnerable to attacks, failures, censorship and biases. While search mechanisms with more decentralized governance (e.g., DeSearch) have been recently proposed, these are still exposed to information head-start attacks (IHS) despite the use of Trusted Execution Environments (TEEs). These attacks allow malicious users to gain a head-start over other users for the discovery of new assets in the market, which give them an unfair advantage in asset acquisition. We propose COoL-TEE, a TEE-based provider selection mechanism for distributed search, running in single- or multi-datacenter environments, that is resilient to information head-start attacks. COoL-TEE relies on a Client-TEE collaboration, which enables clients to distinguish between slow providers and malicious ones. Performance evaluations in single- and multi-datacenter environments show that, using COoL-TEE, malicious users respectively gain only up to 2% and 7% of assets more than without IHS, while they can claim 20% or more on top of their fair share in the same conditions with DeSearch.
△ Less
Submitted 24 March, 2025;
originally announced March 2025.
-
Reliability is Blind: Collective Incentives for Decentralized Computing Marketplaces without Individual Behavior Information
Authors:
Henry Mont,
Matthieu Bettinger,
Sonia Ben Mokhtar,
Anthony Simonet-Boulogne
Abstract:
In decentralized cloud computing marketplaces, ensuring fair and efficient interactions among asset providers and end-users is crucial. A key concern is meeting agreed-upon service-level objectives like the service's reliability. In this decentralized context, traditional mechanisms often fail to address the complexity of task failures, due to limited available and trustworthy insights into these…
▽ More
In decentralized cloud computing marketplaces, ensuring fair and efficient interactions among asset providers and end-users is crucial. A key concern is meeting agreed-upon service-level objectives like the service's reliability. In this decentralized context, traditional mechanisms often fail to address the complexity of task failures, due to limited available and trustworthy insights into these independent actors' individual behavior. This paper proposes a collective incentive mechanism that blindly punishes all involved parties when a task fails. Based on ruin theory, we show that Collective Incentives improve behavior in the marketplace by creating a disincentive for faults and misbehavior even when the parties at fault are unknown, in turn leading to a more robust marketplace. Simulations for small and large pools of marketplace assets show that Collective Incentives enable to meet or exceed a reliability target, i.e., the success-rate of tasks run using marketplace assets, by eventually discarding failure-prone assets while preserving reliable ones.
△ Less
Submitted 24 March, 2025;
originally announced March 2025.
-
Secure Federated Graph-Filtering for Recommender Systems
Authors:
Julien Nicolas,
César Sabater,
Mohamed Maouche,
Sonia Ben Mokhtar,
Mark Coates
Abstract:
Recommender systems often rely on graph-based filters, such as normalized item-item adjacency matrices and low-pass filters. While effective, the centralized computation of these components raises concerns about privacy, security, and the ethical use of user data. This work proposes two decentralized frameworks for securely computing these critical graph components without centralizing sensitive i…
▽ More
Recommender systems often rely on graph-based filters, such as normalized item-item adjacency matrices and low-pass filters. While effective, the centralized computation of these components raises concerns about privacy, security, and the ethical use of user data. This work proposes two decentralized frameworks for securely computing these critical graph components without centralizing sensitive information. The first approach leverages lightweight Multi-Party Computation and distributed singular vector computations to privately compute key graph filters. The second extends this framework by incorporating low-rank approximations, enabling a trade-off between communication efficiency and predictive performance. Empirical evaluations on benchmark datasets demonstrate that the proposed methods achieve comparable accuracy to centralized state-of-the-art systems while ensuring data confidentiality and maintaining low communication costs. Our results highlight the potential for privacy-preserving decentralized architectures to bridge the gap between utility and user data protection in modern recommender systems.
△ Less
Submitted 28 January, 2025;
originally announced January 2025.
-
TEE-based Key-Value Stores: a Survey
Authors:
Aghiles Ait Messaoud,
Sonia Ben Mokhtar,
Anthony Simonet-Boulogne
Abstract:
Key-Value Stores (KVSs) are No-SQL databases that store data as key-value pairs and have gained popularity due to their simplicity, scalability, and fast retrieval capabilities. However, storing sensitive data in KVSs requires strong security properties to prevent data leakage and unauthorized tampering. While software (SW)-based encryption techniques are commonly used to maintain data confidentia…
▽ More
Key-Value Stores (KVSs) are No-SQL databases that store data as key-value pairs and have gained popularity due to their simplicity, scalability, and fast retrieval capabilities. However, storing sensitive data in KVSs requires strong security properties to prevent data leakage and unauthorized tampering. While software (SW)-based encryption techniques are commonly used to maintain data confidentiality and integrity, they suffer from several drawbacks. They strongly assume trust in the hosting system stack and do not secure data during processing unless using performance-heavy techniques (e.g., homomorphic encryption). Alternatively, Trusted Execution Environments (TEEs) provide a solution that enforces the confidentiality and integrity of code and data at the CPU level, allowing users to build trusted applications in an untrusted environment. They also secure data in use by providing an encapsulated processing environment called enclave. Nevertheless, TEEs come with their own set of drawbacks, including performance issues due to memory size limitations and CPU context switching. This paper examines the state of the art in TEE-based confidential KVSs and highlights common design strategies used in KVSs to leverage TEE security features while overcoming their inherent limitations. This work aims to provide a comprehensive understanding of the use of TEEs in KVSs and to identify research directions for future work.
△ Less
Submitted 6 January, 2025;
originally announced January 2025.
-
Scrutinizing the Vulnerability of Decentralized Learning to Membership Inference Attacks
Authors:
Ousmane Touat,
Jezekael Brunon,
Yacine Belal,
Julien Nicolas,
Mohamed Maouche,
César Sabater,
Sonia Ben Mokhtar
Abstract:
The primary promise of decentralized learning is to allow users to engage in the training of machine learning models in a collaborative manner while keeping their data on their premises and without relying on any central entity. However, this paradigm necessitates the exchange of model parameters or gradients between peers. Such exchanges can be exploited to infer sensitive information about train…
▽ More
The primary promise of decentralized learning is to allow users to engage in the training of machine learning models in a collaborative manner while keeping their data on their premises and without relying on any central entity. However, this paradigm necessitates the exchange of model parameters or gradients between peers. Such exchanges can be exploited to infer sensitive information about training data, which is achieved through privacy attacks (e.g Membership Inference Attacks -- MIA). In order to devise effective defense mechanisms, it is important to understand the factors that increase/reduce the vulnerability of a given decentralized learning architecture to MIA. In this study, we extensively explore the vulnerability to MIA of various decentralized learning architectures by varying the graph structure (e.g number of neighbors), the graph dynamics, and the aggregation strategy, across diverse datasets and data distributions. Our key finding, which to the best of our knowledge we are the first to report, is that the vulnerability to MIA is heavily correlated to (i) the local model mixing strategy performed by each node upon reception of models from neighboring nodes and (ii) the global mixing properties of the communication graph. We illustrate these results experimentally using four datasets and by theoretically analyzing the mixing properties of various decentralized architectures. Our paper draws a set of lessons learned for devising decentralized learning systems that reduce by design the vulnerability to MIA.
△ Less
Submitted 6 February, 2025; v1 submitted 17 December, 2024;
originally announced December 2024.
-
Differentially private and decentralized randomized power method
Authors:
Julien Nicolas,
César Sabater,
Mohamed Maouche,
Sonia Ben Mokhtar,
Mark Coates
Abstract:
The randomized power method has gained significant interest due to its simplicity and efficient handling of large-scale spectral analysis and recommendation tasks. However, its application to large datasets containing personal information (e.g., web interactions, search history, personal tastes) raises critical privacy problems. This paper addresses these issues by proposing enhanced privacy-prese…
▽ More
The randomized power method has gained significant interest due to its simplicity and efficient handling of large-scale spectral analysis and recommendation tasks. However, its application to large datasets containing personal information (e.g., web interactions, search history, personal tastes) raises critical privacy problems. This paper addresses these issues by proposing enhanced privacy-preserving variants of the method. First, we propose a variant that reduces the amount of the noise required in current techniques to achieve Differential Privacy (DP). More precisely, we refine the privacy analysis so that the Gaussian noise variance no longer grows linearly with the target rank, achieving the same DP guarantees with strictly less noise. Second, we adapt our method to a decentralized framework in which data is distributed among multiple users. The decentralized protocol strengthens privacy guarantees with no accuracy penalty and a low computational and communication overhead. Our results include the provision of tighter convergence bounds for both the centralized and decentralized versions, and an empirical comparison with previous work using real recommendation datasets.
△ Less
Submitted 12 June, 2025; v1 submitted 4 November, 2024;
originally announced November 2024.
-
Inferring Communities of Interest in Collaborative Learning-based Recommender Systems
Authors:
Yacine Belal,
Sonia Ben Mokhtar,
Mohamed Maouche,
Anthony Simonet-Boulogne
Abstract:
Collaborative-learning-based recommender systems, such as those employing Federated Learning (FL) and Gossip Learning (GL), allow users to train models while keeping their history of liked items on their devices. While these methods were seen as promising for enhancing privacy, recent research has shown that collaborative learning can be vulnerable to various privacy attacks. In this paper, we pro…
▽ More
Collaborative-learning-based recommender systems, such as those employing Federated Learning (FL) and Gossip Learning (GL), allow users to train models while keeping their history of liked items on their devices. While these methods were seen as promising for enhancing privacy, recent research has shown that collaborative learning can be vulnerable to various privacy attacks. In this paper, we propose a novel attack called Community Inference Attack (CIA), which enables an adversary to identify community members based on a set of target items. What sets CIA apart is its efficiency: it operates at low computational cost by eliminating the need for training surrogate models. Instead, it uses a comparison-based approach, inferring sensitive information by comparing users' models rather than targeting any specific individual model. To evaluate the effectiveness of CIA, we conduct experiments on three real-world recommendation datasets using two recommendation models under both Federated and Gossip-like settings. The results demonstrate that CIA can be up to 10 times more accurate than random guessing. Additionally, we evaluate two mitigation strategies: Differentially Private Stochastic Gradient Descent (DP-SGD) and a Share less policy, which involves sharing fewer, less sensitive model parameters. Our findings suggest that the Share less strategy offers a better privacy-utility trade-off, especially in GL.
△ Less
Submitted 15 April, 2025; v1 submitted 15 June, 2023;
originally announced June 2023.
-
Survey of Federated Learning Models for Spatial-Temporal Mobility Applications
Authors:
Yacine Belal,
Sonia Ben Mokhtar,
Hamed Haddadi,
Jaron Wang,
Afra Mashhadi
Abstract:
Federated learning involves training statistical models over edge devices such as mobile phones such that the training data is kept local. Federated Learning (FL) can serve as an ideal candidate for training spatial temporal models that rely on heterogeneous and potentially massive numbers of participants while preserving the privacy of highly sensitive location data. However, there are unique cha…
▽ More
Federated learning involves training statistical models over edge devices such as mobile phones such that the training data is kept local. Federated Learning (FL) can serve as an ideal candidate for training spatial temporal models that rely on heterogeneous and potentially massive numbers of participants while preserving the privacy of highly sensitive location data. However, there are unique challenges involved with transitioning existing spatial temporal models to decentralized learning. In this survey paper, we review the existing literature that has proposed FL-based models for predicting human mobility, traffic prediction, community detection, location-based recommendation systems, and other spatial-temporal tasks. We describe the metrics and datasets these works have been using and create a baseline of these approaches in comparison to the centralized settings. Finally, we discuss the challenges of applying spatial-temporal models in a decentralized setting and by highlighting the gaps in the literature we provide a road map and opportunities for the research community.
△ Less
Submitted 8 February, 2024; v1 submitted 9 May, 2023;
originally announced May 2023.
-
Shielding Federated Learning Systems against Inference Attacks with ARM TrustZone
Authors:
Aghiles Ait Messaoud,
Sonia Ben Mokhtar,
Vlad Nitu,
Valerio Schiavoni
Abstract:
Federated Learning (FL) opens new perspectives for training machine learning models while keeping personal data on the users premises. Specifically, in FL, models are trained on the users devices and only model updates (i.e., gradients) are sent to a central server for aggregation purposes. However, the long list of inference attacks that leak private data from gradients, published in the recent y…
▽ More
Federated Learning (FL) opens new perspectives for training machine learning models while keeping personal data on the users premises. Specifically, in FL, models are trained on the users devices and only model updates (i.e., gradients) are sent to a central server for aggregation purposes. However, the long list of inference attacks that leak private data from gradients, published in the recent years, have emphasized the need of devising effective protection mechanisms to incentivize the adoption of FL at scale. While there exist solutions to mitigate these attacks on the server side, little has been done to protect users from attacks performed on the client side. In this context, the use of Trusted Execution Environments (TEEs) on the client side are among the most proposing solutions. However, existing frameworks (e.g., DarkneTZ) require statically putting a large portion of the machine learning model into the TEE to effectively protect against complex attacks or a combination of attacks. We present GradSec, a solution that allows protecting in a TEE only sensitive layers of a machine learning model, either statically or dynamically, hence reducing both the TCB size and the overall training time by up to 30% and 56%, respectively compared to state-of-the-art competitors.
△ Less
Submitted 15 October, 2022; v1 submitted 11 August, 2022;
originally announced August 2022.
-
PEPPER: Empowering User-Centric Recommender Systems over Gossip Learning
Authors:
Yacine Belal,
Aurélien Bellet,
Sonia Ben Mokhtar,
Vlad Nitu
Abstract:
Recommender systems are proving to be an invaluable tool for extracting user-relevant content helping users in their daily activities (e.g., finding relevant places to visit, content to consume, items to purchase). However, to be effective, these systems need to collect and analyze large volumes of personal data (e.g., location check-ins, movie ratings, click rates .. etc.), which exposes users to…
▽ More
Recommender systems are proving to be an invaluable tool for extracting user-relevant content helping users in their daily activities (e.g., finding relevant places to visit, content to consume, items to purchase). However, to be effective, these systems need to collect and analyze large volumes of personal data (e.g., location check-ins, movie ratings, click rates .. etc.), which exposes users to numerous privacy threats. In this context, recommender systems based on Federated Learning (FL) appear to be a promising solution for enforcing privacy as they compute accurate recommendations while keeping personal data on the users' devices. However, FL, and therefore FL-based recommender systems, rely on a central server that can experience scalability issues besides being vulnerable to attacks. To remedy this, we propose PEPPER, a decentralized recommender system based on gossip learning principles. In PEPPER, users gossip model updates and aggregate them asynchronously. At the heart of PEPPER reside two key components: a personalized peer-sampling protocol that keeps in the neighborhood of each node, a proportion of nodes that have similar interests to the former and a simple yet effective model aggregation function that builds a model that is better suited to each user. Through experiments on three real datasets implementing two use cases: a location check-in recommendation and a movie recommendation, we demonstrate that our solution converges up to 42% faster than with other decentralized solutions providing up to 9% improvement on average performance metric such as hit ratio and up to 21% improvement on long tail performance compared to decentralized competitors.
△ Less
Submitted 9 August, 2022;
originally announced August 2022.
-
SplitBFT: Improving Byzantine Fault Tolerance Safety Using Trusted Compartments
Authors:
Ines Messadi,
Markus Horst Becker,
Kai Bleeke,
Leander Jehl,
Sonia Ben Mokhtar,
Rüdiger Kapitza
Abstract:
Byzantine fault-tolerant agreement (BFT) in a partially synchronous system usually requires 3f + 1 nodes to tolerate f faulty replicas. Due to their high throughput and finality property BFT algorithms build the core of recent permissioned blockchains. As a complex and resource-demanding infrastructure, multiple cloud providers have started offering Blockchain-as-a-Service. This eases the deployme…
▽ More
Byzantine fault-tolerant agreement (BFT) in a partially synchronous system usually requires 3f + 1 nodes to tolerate f faulty replicas. Due to their high throughput and finality property BFT algorithms build the core of recent permissioned blockchains. As a complex and resource-demanding infrastructure, multiple cloud providers have started offering Blockchain-as-a-Service. This eases the deployment of permissioned blockchains but places the cloud provider in a central controlling position, thereby questioning blockchains' fault tolerance and decentralization properties and their underlying BFT algorithm. This paper presents SplitBFT, a new way to utilize trusted execution technology (TEEs), such as Intel SGX, to harden the safety and confidentiality guarantees of BFT systems thereby strengthening the trust in could-based deployments of permissioned blockchains. Deviating from standard assumptions, SplitBFT acknowledges that code protected by trusted execution may fail. We address this by splitting and isolating the core logic of BFT protocols into multiple compartments resulting in a more resilient architecture. We apply SplitBFT to the traditional practical byzantine fault tolerance algorithm (PBFT) and evaluate it using SGX. Our results show that SplitBFT adds only a reasonable overhead compared to the non-compartmentalized variant.
△ Less
Submitted 24 May, 2022; v1 submitted 18 May, 2022;
originally announced May 2022.
-
Enhancing Robustness of On-line Learning Models on Highly Noisy Data
Authors:
Zilong Zhao,
Robert Birke,
Rui Han,
Bogdan Robu,
Sara Bouchenak,
Sonia Ben Mokhtar,
Lydia Y. Chen
Abstract:
Classification algorithms have been widely adopted to detect anomalies for various systems, e.g., IoT, cloud and face recognition, under the common assumption that the data source is clean, i.e., features and labels are correctly set. However, data collected from the wild can be unreliable due to careless annotations or malicious data transformation for incorrect anomaly detection. In this paper,…
▽ More
Classification algorithms have been widely adopted to detect anomalies for various systems, e.g., IoT, cloud and face recognition, under the common assumption that the data source is clean, i.e., features and labels are correctly set. However, data collected from the wild can be unreliable due to careless annotations or malicious data transformation for incorrect anomaly detection. In this paper, we extend a two-layer on-line data selection framework: Robust Anomaly Detector (RAD) with a newly designed ensemble prediction where both layers contribute to the final anomaly detection decision. To adapt to the on-line nature of anomaly detection, we consider additional features of conflicting opinions of classifiers, repetitive cleaning, and oracle knowledge. We on-line learn from incoming data streams and continuously cleanse the data, so as to adapt to the increasing learning capacity from the larger accumulated data set. Moreover, we explore the concept of oracle learning that provides additional information of true labels for difficult data points. We specifically focus on three use cases, (i) detecting 10 classes of IoT attacks, (ii) predicting 4 classes of task failures of big data jobs, and (iii) recognising 100 celebrities faces. Our evaluation results show that RAD can robustly improve the accuracy of anomaly detection, to reach up to 98.95% for IoT device attacks (i.e., +7%), up to 85.03% for cloud task failures (i.e., +14%) under 40% label noise, and for its extension, it can reach up to 77.51% for face recognition (i.e., +39%) under 30% label noise. The proposed RAD and its extensions are general and can be applied to different anomaly detection algorithms.
△ Less
Submitted 19 March, 2021;
originally announced March 2021.
-
RAD: On-line Anomaly Detection for Highly Unreliable Data
Authors:
Zilong Zhao,
Robert Birke,
Rui Han,
Bogdan Robu,
Sara Bouchenak,
Sonia Ben Mokhtar,
Lydia Y. Chen
Abstract:
Classification algorithms have been widely adopted to detect anomalies for various systems, e.g., IoT, cloud and face recognition, under the common assumption that the data source is clean, i.e., features and labels are correctly set. However, data collected from the wild can be unreliable due to careless annotations or malicious data transformation for incorrect anomaly detection. In this paper,…
▽ More
Classification algorithms have been widely adopted to detect anomalies for various systems, e.g., IoT, cloud and face recognition, under the common assumption that the data source is clean, i.e., features and labels are correctly set. However, data collected from the wild can be unreliable due to careless annotations or malicious data transformation for incorrect anomaly detection. In this paper, we present a two-layer on-line learning framework for robust anomaly detection (RAD) in the presence of unreliable anomaly labels, where the first layer is to filter out the suspicious data, and the second layer detects the anomaly patterns from the remaining data. To adapt to the on-line nature of anomaly detection, we extend RAD with additional features of repetitively cleaning, conflicting opinions of classifiers, and oracle knowledge. We on-line learn from the incoming data streams and continuously cleanse the data, so as to adapt to the increasing learning capacity from the larger accumulated data set. Moreover, we explore the concept of oracle learning that provides additional information of true labels for difficult data points. We specifically focus on three use cases, (i) detecting 10 classes of IoT attacks, (ii) predicting 4 classes of task failures of big data jobs, (iii) recognising 20 celebrities faces. Our evaluation results show that RAD can robustly improve the accuracy of anomaly detection, to reach up to 98% for IoT device attacks (i.e., +11%), up to 84% for cloud task failures (i.e., +20%) under 40% noise, and up to 74% for face recognition (i.e., +28%) under 30% noisy labels. The proposed RAD is general and can be applied to different anomaly detection algorithms.
△ Less
Submitted 11 November, 2019;
originally announced November 2019.
-
X-Search: Revisiting Private Web Search using Intel SGX
Authors:
Sonia Ben Mokhtar,
Antoine Boutet,
Pascal Felber,
Marcelo Pasin,
Rafael Pires,
Valerio Schiavoni
Abstract:
The exploitation of user search queries by search engines is at the heart of their economic model. As consequence, offering private Web search functionalities is essential to the users who care about their privacy. Nowadays, there exists no satisfactory approach to enable users to access search engines in a privacy-preserving way. Existing solutions are either too costly due to the heavy use of cr…
▽ More
The exploitation of user search queries by search engines is at the heart of their economic model. As consequence, offering private Web search functionalities is essential to the users who care about their privacy. Nowadays, there exists no satisfactory approach to enable users to access search engines in a privacy-preserving way. Existing solutions are either too costly due to the heavy use of cryptographic mechanisms (e.g., private information retrieval protocols), subject to attacks (e.g., Tor, TrackMeNot, GooPIR) or rely on weak adversarial models (e.g., PEAS). This paper introduces X-Search , a novel private Web search mechanism building on the disruptive Software Guard Extensions (SGX) proposed by Intel. We compare X-Search to its closest competitors, Tor and PEAS, using a dataset of real web search queries. Our evaluation shows that: (1) X-Search offers stronger privacy guarantees than its competitors as it operates under a stronger adversarial model; (2) it better resists state-of-the-art re-identification attacks; and (3) from the performance perspective, X-Search outperforms its competitors both in terms of latency and throughput by orders of magnitude.
△ Less
Submitted 4 May, 2018;
originally announced May 2018.
-
CYCLOSA: Decentralizing Private Web Search Through SGX-Based Browser Extensions
Authors:
Rafael Pires,
David Goltzsche,
Sonia Ben Mokhtar,
Sara Bouchenak,
Antoine Boutet,
Pascal Felber,
Rüdiger Kapitza,
Marcelo Pasin,
Valerio Schiavoni
Abstract:
By regularly querying Web search engines, users (unconsciously) disclose large amounts of their personal data as part of their search queries, among which some might reveal sensitive information (e.g. health issues, sexual, political or religious preferences). Several solutions exist to allow users querying search engines while improving privacy protection. However, these solutions suffer from a n…
▽ More
By regularly querying Web search engines, users (unconsciously) disclose large amounts of their personal data as part of their search queries, among which some might reveal sensitive information (e.g. health issues, sexual, political or religious preferences). Several solutions exist to allow users querying search engines while improving privacy protection. However, these solutions suffer from a number of limitations: some are subject to user re-identification attacks, while others lack scalability or are unable to provide accurate results. This paper presents CYCLOSA, a secure, scalable and accurate private Web search solution. CYCLOSA improves security by relying on trusted execution environments (TEEs) as provided by Intel SGX. Further, CYCLOSA proposes a novel adaptive privacy protection solution that reduces the risk of user re- identification. CYCLOSA sends fake queries to the search engine and dynamically adapts their count according to the sensitivity of the user query. In addition, CYCLOSA meets scalability as it is fully decentralized, spreading the load for distributing fake queries among other nodes. Finally, CYCLOSA achieves accuracy of Web search as it handles the real query and the fake queries separately, in contrast to other existing solutions that mix fake and real query results.
△ Less
Submitted 27 July, 2018; v1 submitted 3 May, 2018;
originally announced May 2018.
-
Adaptive Location Privacy with ALP
Authors:
Vincent Primault,
Antoine Boutet,
Sonia Ben Mokhtar,
Lionel Brunie
Abstract:
With the increasing amount of mobility data being collected on a daily basis by location-based services (LBSs) comes a new range of threats for users, related to the over-sharing of their location information. To deal with this issue, several location privacy protection mechanisms (LPPMs) have been proposed in the past years. However, each of these mechanisms comes with different configuration par…
▽ More
With the increasing amount of mobility data being collected on a daily basis by location-based services (LBSs) comes a new range of threats for users, related to the over-sharing of their location information. To deal with this issue, several location privacy protection mechanisms (LPPMs) have been proposed in the past years. However, each of these mechanisms comes with different configuration parameters that have a direct impact both on the privacy guarantees offered to the users and on the resulting utility of the protected data. In this context, it can be difficult for non-expert system designers to choose the appropriate configuration to use. Moreover, these mechanisms are generally configured once for all, which results in the same configuration for every protected piece of information. However, not all users have the same behaviour, and even the behaviour of a single user is likely to change over time. To address this issue, we present in this paper ALP, a new framework enabling the dynamic configuration of LPPMs. ALP can be used in two scenarios: (1) offline, where ALP enables a system designer to choose and automatically tune the most appropriate LPPM for the protection of a given dataset; (2) online, where ALP enables the user of a crowd sensing application to protect consecutive batches of her geolocated data by automatically tuning an existing LPPM to fulfil a set of privacy and utility objectives. We evaluate ALP on both scenarios with two real-life mobility datasets and two state-of-the-art LPPMs. Our experiments show that the adaptive LPPM configurations found by ALP outperform both in terms of privacy and utility a set of static configurations manually fixed by a system designer.
△ Less
Submitted 23 September, 2016;
originally announced September 2016.
-
Time Distortion Anonymization for the Publication of Mobility Data with High Utility
Authors:
Vincent Primault,
Sonia Ben Mokhtar,
Cédric Lauradoux,
Lionel Brunie
Abstract:
An increasing amount of mobility data is being collected every day by different means, such as mobile applications or crowd-sensing campaigns. This data is sometimes published after the application of simple anonymization techniques (e.g., putting an identifier instead of the users' names), which might lead to severe threats to the privacy of the participating users. Literature contains more sophi…
▽ More
An increasing amount of mobility data is being collected every day by different means, such as mobile applications or crowd-sensing campaigns. This data is sometimes published after the application of simple anonymization techniques (e.g., putting an identifier instead of the users' names), which might lead to severe threats to the privacy of the participating users. Literature contains more sophisticated anonymization techniques, often based on adding noise to the spatial data. However, these techniques either compromise the privacy if the added noise is too little or the utility of the data if the added noise is too strong. We investigate in this paper an alternative solution, which builds on time distortion instead of spatial distortion. Specifically, our contribution lies in (1) the introduction of the concept of time distortion to anonymize mobility datasets (2) Promesse, a protection mechanism implementing this concept (3) a practical study of Promesse compared to two representative spatial distortion mechanisms, namely Wait For Me, which enforces k-anonymity, and Geo-Indistinguishability, which enforces differential privacy. We evaluate our mechanism practically using three real-life datasets. Our results show that time distortion reduces the number of points of interest that can be retrieved by an adversary to under 3 %, while the introduced spatial error is almost null and the distortion introduced on the results of range queries is kept under 13 % on average.
△ Less
Submitted 2 July, 2015;
originally announced July 2015.
-
Privacy-preserving Publication of Mobility Data with High Utility
Authors:
Vincent Primault,
Sonia Ben Mokhtar,
Lionel Brunie
Abstract:
An increasing amount of mobility data is being collected every day by different means, e.g., by mobile phone operators. This data is sometimes published after the application of simple anonymization techniques, which might lead to severe privacy threats. We propose in this paper a new solution whose novelty is twofold. Firstly, we introduce an algorithm designed to hide places where a user stops d…
▽ More
An increasing amount of mobility data is being collected every day by different means, e.g., by mobile phone operators. This data is sometimes published after the application of simple anonymization techniques, which might lead to severe privacy threats. We propose in this paper a new solution whose novelty is twofold. Firstly, we introduce an algorithm designed to hide places where a user stops during her journey (namely points of interest), by enforcing a constant speed along her trajectory. Secondly, we leverage places where users meet to take a chance to swap their trajectories and therefore confuse an attacker.
△ Less
Submitted 30 June, 2015;
originally announced June 2015.
-
Differentially Private Location Privacy in Practice
Authors:
Vincent Primault,
Sonia Ben Mokhtar,
Cedric Lauradoux,
Lionel Brunie
Abstract:
With the wide adoption of handheld devices (e.g. smartphones, tablets) a large number of location-based services (also called LBSs) have flourished providing mobile users with real-time and contextual information on the move. Accounting for the amount of location information they are given by users, these services are able to track users wherever they go and to learn sensitive information about th…
▽ More
With the wide adoption of handheld devices (e.g. smartphones, tablets) a large number of location-based services (also called LBSs) have flourished providing mobile users with real-time and contextual information on the move. Accounting for the amount of location information they are given by users, these services are able to track users wherever they go and to learn sensitive information about them (e.g. their points of interest including home, work, religious or political places regularly visited). A number of solutions have been proposed in the past few years to protect users location information while still allowing them to enjoy geo-located services. Among the most robust solutions are those that apply the popular notion of differential privacy to location privacy (e.g. Geo-Indistinguishability), promising strong theoretical privacy guarantees with a bounded accuracy loss. While these theoretical guarantees are attracting, it might be difficult for end users or practitioners to assess their effectiveness in the wild. In this paper, we carry on a practical study using real mobility traces coming from two different datasets, to assess the ability of Geo-Indistinguishability to protect users' points of interest (POIs). We show that a curious LBS collecting obfuscated location information sent by mobile users is still able to infer most of the users POIs with a reasonable both geographic and semantic precision. This precision depends on the degree of obfuscation applied by Geo-Indistinguishability. Nevertheless, the latter also has an impact on the overhead incurred on mobile devices resulting in a privacy versus overhead trade-off. Finally, we show in our study that POIs constitute a quasi-identifier for mobile users and that obfuscating them using Geo-Indistinguishability is not sufficient as an attacker is able to re-identify at least 63% of them despite a high degree of obfuscation.
△ Less
Submitted 28 October, 2014;
originally announced October 2014.