-
A Manually Annotated Image-Caption Dataset for Detecting Children in the Wild
Authors:
Klim Kireev,
Ana-Maria Creţu,
Raphael Meier,
Sarah Adel Bargal,
Elissa Redmiles,
Carmela Troncoso
Abstract:
Platforms and the law regulate digital content depicting minors (defined as individuals under 18 years of age) differently from other types of content. Given the sheer amount of content that needs to be assessed, machine learning-based automation tools are commonly used to detect content depicting minors. To our knowledge, no dataset or benchmark currently exists for detecting these identification methods in a multi-modal environment. To fill this gap, we release the Image-Caption Children in the Wild Dataset (ICCWD), an image-caption dataset aimed at benchmarking tools that detect depictions of minors. Our dataset is richer than previous child image datasets, containing images of children in a variety of contexts, including fictional depictions and partially visible bodies. ICCWD contains 10,000 image-caption pairs manually labeled to indicate the presence or absence of a child in the image. To demonstrate the possible utility of our dataset, we use it to benchmark three different detectors, including a commercial age estimation system applied to images. Our results suggest that child detection is a challenging task, with the best method achieving a 75.3% true positive rate. We hope the release of our dataset will aid in the design of better minor detection methods in a wide range of scenarios.
Submitted 11 June, 2025;
originally announced June 2025.
-
A Low-Cost Privacy-Preserving Digital Wallet for Humanitarian Aid Distribution
Authors:
Eva Luvison,
Sylvain Chatel,
Justinas Sukaitis,
Vincent Graf Narbel,
Carmela Troncoso,
Wouter Lueks
Abstract:
Humanitarian organizations distribute aid to people affected by armed conflicts or natural disasters. Digitalization has the potential to increase the efficiency and fairness of aid-distribution systems, and recent work by Wang et al. has shown that these benefits are possible without creating privacy harms for aid recipients. However, their work only provides a solution for one particular aid-distribution scenario in which aid recipients receive a pre-defined set of goods. Yet, in many situations it is desirable to enable recipients to decide which items they need at each moment to satisfy their specific needs. We formalize these needs into functional, deployment, security, and privacy requirements, and design a privacy-preserving digital wallet for aid distribution. Our smart-card-based solution enables aid recipients to spend a pre-defined budget at different vendors to obtain the items that they need. We prove our solution's security and privacy properties, and show it is practical at scale.
Submitted 21 October, 2024;
originally announced October 2024.
-
Attack-Aware Noise Calibration for Differential Privacy
Authors:
Bogdan Kulynych,
Juan Felipe Gomez,
Georgios Kaissis,
Flavio du Pin Calmon,
Carmela Troncoso
Abstract:
Differential privacy (DP) is a widely used approach for mitigating privacy risks when training machine learning models on sensitive data. DP mechanisms add noise during training to limit the risk of information leakage. The scale of the added noise is critical, as it determines the trade-off between privacy and utility. The standard practice is to select the noise scale to satisfy a given privacy budget $\varepsilon$. This privacy budget is in turn interpreted in terms of operational attack risks, such as accuracy, sensitivity, and specificity of inference attacks aimed to recover information about the training data records. We show that first calibrating the noise scale to a privacy budget $\varepsilon$, and then translating $\varepsilon$ to attack risk leads to overly conservative risk assessments and unnecessarily low utility. Instead, we propose methods to directly calibrate the noise scale to a desired attack risk level, bypassing the step of choosing $\varepsilon$. For a given notion of attack risk, our approach significantly decreases noise scale, leading to increased utility at the same level of privacy. We empirically demonstrate that calibrating noise to attack sensitivity/specificity, rather than $\varepsilon$, when training privacy-preserving ML models substantially improves model accuracy for the same risk level. Our work provides a principled and practical way to improve the utility of privacy-preserving ML without compromising on privacy. The code is available at https://github.com/Felipe-Gomez/riskcal
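To illustrate the "translate $\varepsilon$ to attack risk" step that the abstract describes as standard practice, here is a minimal sketch of the generic hypothesis-testing bound for $(\varepsilon, \delta)$-DP: at a given false positive rate, no attack can exceed a certain true positive rate (sensitivity). This is not the paper's calibration method and does not use the riskcal API; the function name `max_attack_tpr` and its parameters are hypothetical, chosen only for this example.

```python
import math


def max_attack_tpr(epsilon: float, delta: float, fpr: float) -> float:
    """Upper bound on attack sensitivity (TPR) at a fixed false positive rate.

    Uses the standard hypothesis-testing view of (epsilon, delta)-DP:
        TPR <= exp(epsilon) * FPR + delta
        TPR <= 1 - exp(-epsilon) * (1 - delta - FPR)
    Hypothetical helper for illustration, not taken from the paper's code.
    """
    if epsilon < 0 or not 0.0 <= delta <= 1.0 or not 0.0 <= fpr <= 1.0:
        raise ValueError("need epsilon >= 0 and delta, fpr in [0, 1]")
    bound_a = math.exp(epsilon) * fpr + delta
    bound_b = 1.0 - math.exp(-epsilon) * (1.0 - delta - fpr)
    # A rate is always in [0, 1], so clamp the tighter of the two bounds.
    return max(0.0, min(1.0, bound_a, bound_b))


if __name__ == "__main__":
    for eps in (0.5, 1.0, 4.0):
        print(eps, max_attack_tpr(eps, delta=1e-5, fpr=0.01))
```

Because this bound must hold for every mechanism satisfying the same $(\varepsilon, \delta)$ guarantee, it can overstate the risk of one specific mechanism, which is consistent with the abstract's point that the two-step route is conservative.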
Submitted 7 November, 2024; v1 submitted 2 July, 2024;
originally announced July 2024.
-
Characterizing and Detecting Propaganda-Spreading Accounts on Telegram
Authors:
Klim Kireev,
Yevhen Mykhno,
Carmela Troncoso,
Rebekah Overdorf
Abstract:
Information-based attacks on social media, such as disinformation campaigns and propaganda, are emerging cybersecurity threats. The security community has focused on countering these threats on social media platforms like X and Reddit. However, they also appear in instant-messaging social media platforms such as WhatsApp, Telegram, and Signal. In these platforms, information-based attacks primarily happen in groups and channels, requiring manual moderation efforts by channel administrators. We collect, label, and analyze a large dataset of more than 17 million Telegram comments and messages. Our analysis uncovers two independent, coordinated networks that spread pro-Russian and pro-Ukrainian propaganda, garnering replies from real users. We propose a novel mechanism for detecting propaganda that capitalizes on the relationship between legitimate user messages and propaganda replies and is tailored to the information that Telegram makes available to moderators. Our method is faster, cheaper, and has a detection rate (97.6%) 11.6 percentage points higher than human moderators after seeing only one message from an account. It remains effective despite evolving propaganda.
Submitted 12 June, 2024;
originally announced June 2024.
-
SINBAD: Saliency-informed detection of breakage caused by ad blocking
Authors:
Saiid El Hajj Chehade,
Sandra Siby,
Carmela Troncoso
Abstract:
Privacy-enhancing blocking tools based on filter-list rules tend to break legitimate functionality. Filter-list maintainers could benefit from automated breakage detection tools that allow them to proactively fix problematic rules before deploying them to millions of users. We introduce SINBAD, an automated breakage detector that improves the accuracy over the state of the art by 20%, and is the first to detect dynamic breakage and breakage caused by style-oriented filter rules. The success of SINBAD is rooted in three innovations: (1) the use of user-reported breakage issues in forums that enable the creation of a high-quality dataset for training in which only breakage that users perceive as an issue is included; (2) the use of 'web saliency' to automatically identify user-relevant regions of a website on which to prioritize automated interactions aimed at triggering breakage; and (3) the analysis of webpages via subtrees which enables fine-grained identification of problematic filter rules.
Submitted 8 May, 2024;
originally announced May 2024.
-
Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks
Authors:
Dario Pasquini,
Martin Strohmeier,
Carmela Troncoso
Abstract:
We introduce a new family of prompt injection attacks, termed Neural Exec. Unlike known attacks that rely on handcrafted strings (e.g., "Ignore previous instructions and..."), we show that it is possible to conceptualize the creation of execution triggers as a differentiable search problem and use learning-based methods to autonomously generate them.
Our results demonstrate that a motivated adversary can forge triggers that are not only drastically more effective than current handcrafted ones but also exhibit inherent flexibility in shape, properties, and functionality. In this direction, we show that an attacker can design and generate Neural Execs capable of persisting through multi-stage preprocessing pipelines, such as in the case of Retrieval-Augmented Generation (RAG)-based applications. More critically, our findings show that attackers can produce triggers that deviate markedly in form and shape from any known attack, sidestepping existing blacklist-based detection and sanitation approaches.
Submitted 2 May, 2024; v1 submitted 6 March, 2024;
originally announced March 2024.
-
On the Conflict of Robustness and Learning in Collaborative Machine Learning
Authors:
Mathilde Raynal,
Carmela Troncoso
Abstract:
Collaborative Machine Learning (CML) allows participants to jointly train a machine learning model while keeping their training data private. In many scenarios where CML is seen as the solution to privacy issues, such as health-related applications, safety is also a primary concern. To ensure that CML processes produce models that output correct and reliable decisions even in the presence of potentially untrusted participants, researchers propose to use robust aggregators to filter out malicious contributions that negatively influence the training process. In this work, we formalize the two prevalent forms of robust aggregators in the literature. We then show that neither can provide the intended protection: either they use distance-based metrics that cannot reliably identify malicious inputs to training; or use metrics based on the behavior of the loss function which create a conflict with the ability of CML participants to learn, i.e., they cannot eliminate the risk of compromise without preventing learning.
Submitted 26 July, 2024; v1 submitted 21 February, 2024;
originally announced February 2024.
-
The Fundamental Limits of Least-Privilege Learning
Authors:
Theresa Stadler,
Bogdan Kulynych,
Michael C. Gastpar,
Nicolas Papernot,
Carmela Troncoso
Abstract:
The promise of least-privilege learning -- to find feature representations that are useful for a learning task but prevent inference of any sensitive information unrelated to this task -- is highly appealing. However, so far this concept has only been stated informally. It thus remains an open question whether and how we can achieve this goal. In this work, we provide the first formalisation of the least-privilege principle for machine learning and characterise its feasibility. We prove that there is a fundamental trade-off between a representation's utility for a given task and its leakage beyond the intended task: it is not possible to learn representations that have high utility for the intended task but at the same time prevent inference of any attribute other than the task label itself. This trade-off holds under realistic assumptions on the data distribution and regardless of the technique used to learn the feature mappings that produce these representations. We empirically validate this result for a wide range of learning techniques, model architectures, and datasets.
Submitted 26 June, 2024; v1 submitted 19 February, 2024;
originally announced February 2024.
-
Janus: Safe Biometric Deduplication for Humanitarian Aid Distribution
Authors:
Kasra EdalatNejad,
Wouter Lueks,
Justinas Sukaitis,
Vincent Graf Narbel,
Massimo Marelli,
Carmela Troncoso
Abstract:
Humanitarian organizations provide aid to people in need. To use their limited budget efficiently, their distribution processes must ensure that legitimate recipients cannot receive more aid than they are entitled to. Thus, it is essential that recipients can register at most once per aid program. Taking the International Committee of the Red Cross's aid distribution registration process as a use case, we identify the requirements to detect double registration without creating new risks for aid recipients. We then design Janus, which combines privacy-enhancing technologies with biometrics to prevent double registration in a safe manner. Janus does not create plaintext biometric databases and reveals only one bit of information at registration time (whether the user registering is present in the database or not). We implement and evaluate three instantiations of Janus based on secure multiparty computation, somewhat homomorphic encryption, and trusted execution environments. We demonstrate that they support the privacy, accuracy, and performance needs of humanitarian organizations. We compare Janus with existing alternatives and show it is the first system that provides the accuracy our scenario requires while providing strong protection.
Submitted 5 August, 2023;
originally announced August 2023.
-
Transferable Adversarial Robustness for Categorical Data via Universal Robust Embeddings
Authors:
Klim Kireev,
Maksym Andriushchenko,
Carmela Troncoso,
Nicolas Flammarion
Abstract:
Research on adversarial robustness is primarily focused on image and text data. Yet, many scenarios in which lack of robustness can result in serious risks, such as fraud detection, medical diagnosis, or recommender systems, often do not rely on images or text but instead on tabular data. Adversarial robustness in tabular data poses two serious challenges. First, tabular datasets often contain categorical features, and therefore cannot be tackled directly with existing optimization procedures. Second, in the tabular domain, algorithms that are not based on deep networks are widely used and offer great performance, but algorithms to enhance robustness are tailored to neural networks (e.g., adversarial training).
In this paper, we tackle both challenges. We present a method that allows us to train adversarially robust deep networks for tabular data and to transfer this robustness to other classifiers via universal robust embeddings tailored to categorical data. These embeddings, created using a bilevel alternating minimization framework, can be transferred to boosted trees or random forests making them robust without the need for adversarial training while preserving their high accuracy on tabular data. We show that our methods outperform existing techniques within a practical threat model suitable for tabular data.
Submitted 13 December, 2023; v1 submitted 6 June, 2023;
originally announced June 2023.
-
Not Yet Another Digital ID: Privacy-preserving Humanitarian Aid Distribution
Authors:
Boya Wang,
Wouter Lueks,
Justinas Sukaitis,
Vincent Graf Narbel,
Carmela Troncoso
Abstract:
Humanitarian aid-distribution programs help bring physical goods to people in need. Traditional paper-based solutions to support aid distribution do not scale to large populations and are hard to secure. Existing digital solutions solve these issues, at the cost of collecting large amounts of personal information. This lack of privacy can endanger recipients' safety and harm their dignity. In collaboration with the International Committee of the Red Cross, we build a safe digital aid-distribution system. We first systematize the requirements such a system should satisfy. We then propose a decentralized solution based on the use of tokens that fulfills the needs of humanitarian organizations. It provides scalability and strong accountability, and, by design, guarantees the recipients' privacy. We provide two instantiations of our design, on a smart card and on a smartphone. We formally prove the security and privacy properties of these solutions, and empirically show that they can operate at scale.
Submitted 19 May, 2023; v1 submitted 30 March, 2023;
originally announced March 2023.
-
Can Decentralized Learning be more robust than Federated Learning?
Authors:
Mathilde Raynal,
Dario Pasquini,
Carmela Troncoso
Abstract:
Decentralized Learning (DL) is a peer-to-peer learning approach that allows a group of users to jointly train a machine learning model. To ensure correctness, DL should be robust, i.e., Byzantine users must not be able to tamper with the result of the collaboration. In this paper, we introduce two new attacks against DL where a Byzantine user can: make the network converge to an arbitrary model of their choice, and exclude an arbitrary user from the learning process. We demonstrate our attacks' efficiency against Self-Centered Clipping, the state-of-the-art robust DL protocol. Finally, we show that the capabilities decentralization grants to Byzantine users result in decentralized learning always providing less robustness than federated learning.
Submitted 7 March, 2023;
originally announced March 2023.
-
Arbitrary Decisions are a Hidden Cost of Differentially Private Training
Authors:
Bogdan Kulynych,
Hsiang Hsu,
Carmela Troncoso,
Flavio P. Calmon
Abstract:
Mechanisms used in privacy-preserving machine learning often aim to guarantee differential privacy (DP) during model training. Practical DP-ensuring training methods use randomization when fitting model parameters to privacy-sensitive data (e.g., adding Gaussian noise to clipped gradients). We demonstrate that such randomization incurs predictive multiplicity: for a given input example, the output predicted by equally-private models depends on the randomness used in training. Thus, for a given input, the predicted output can vary drastically if a model is re-trained, even if the same training dataset is used. The predictive-multiplicity cost of DP training has not been studied, and is currently neither audited for nor communicated to model designers and stakeholders. We derive a bound on the number of re-trainings required to estimate predictive multiplicity reliably. We analyze--both theoretically and through extensive experiments--the predictive-multiplicity cost of three DP-ensuring algorithms: output perturbation, objective perturbation, and DP-SGD. We demonstrate that the degree of predictive multiplicity rises as the level of privacy increases, and is unevenly distributed across individuals and demographic groups in the data. Because randomness used to ensure DP during training explains predictions for some examples, our results highlight a fundamental challenge to the justifiability of decisions supported by differentially private models in high-stakes settings. We conclude that practitioners should audit the predictive multiplicity of their DP-ensuring algorithms before deploying them in applications of individual-level consequence.
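The randomization step mentioned in the abstract (adding Gaussian noise to clipped gradients) can be sketched as follows. This is a simplified, standard-library-only illustration of one DP-SGD-style update, not the authors' experimental code; the function name `noisy_clipped_mean` and all parameter names are hypothetical, and no privacy accounting is performed.

```python
import math
import random


def noisy_clipped_mean(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-example gradient to L2 norm <= clip_norm, sum them,
    add Gaussian noise with std = noise_multiplier * clip_norm to each
    coordinate, and average over the batch. Illustrative sketch only."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down only gradients whose norm exceeds the clipping bound.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            total[i] += g * scale
    sigma = noise_multiplier * clip_norm
    n = len(per_example_grads)
    return [(t + rng.gauss(0.0, sigma)) / n for t in total]


if __name__ == "__main__":
    grads = [[3.0, 4.0], [0.6, 0.8]]
    # Same data, different training randomness: the updates differ.
    print(noisy_clipped_mean(grads, 1.0, 1.0, random.Random(0)))
    print(noisy_clipped_mean(grads, 1.0, 1.0, random.Random(1)))
```

Running the update with two seeds on identical data yields different results, which is the source of the run-to-run variation in predictions that the abstract calls predictive multiplicity.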
Submitted 15 May, 2023; v1 submitted 28 February, 2023;
originally announced February 2023.
-
Universal Neural-Cracking-Machines: Self-Configurable Password Models from Auxiliary Data
Authors:
Dario Pasquini,
Giuseppe Ateniese,
Carmela Troncoso
Abstract:
We introduce the concept of "universal password model" -- a password model that, once pre-trained, can automatically adapt its guessing strategy based on the target system. To achieve this, the model does not need to access any plaintext passwords from the target credentials. Instead, it exploits users' auxiliary information, such as email addresses, as a proxy signal to predict the underlying password distribution. Specifically, the model uses deep learning to capture the correlation between the auxiliary data of a group of users (e.g., users of a web application) and their passwords. It then exploits those patterns to create a tailored password model for the target system at inference time. No further training steps, targeted data collection, or prior knowledge of the community's password distribution is required. Besides improving over current password strength estimation techniques and attacks, the model enables any end-user (e.g., system administrators) to autonomously generate tailored password models for their systems without the often unworkable requirements of collecting suitable training data and fitting the underlying machine learning model. Ultimately, our framework enables the democratization of well-calibrated password models to the community, addressing a major challenge in the deployment of password security solutions at scale.
Submitted 13 March, 2024; v1 submitted 18 January, 2023;
originally announced January 2023.
-
Adversarial Robustness for Tabular Data through Cost and Utility Awareness
Authors:
Klim Kireev,
Bogdan Kulynych,
Carmela Troncoso
Abstract:
Many safety-critical applications of machine learning, such as fraud or abuse detection, use data in tabular domains. Adversarial examples can be particularly damaging for these applications. Yet, existing works on adversarial robustness primarily focus on machine-learning models in image and text domains. We argue that, due to the differences between tabular data and images or text, existing threat models are not suitable for tabular domains. These models do not capture that the costs of an attack could be more significant than imperceptibility, or that the adversary could assign different values to the utility obtained from deploying different adversarial examples. We demonstrate that, due to these differences, the attack and defense methods used for images and text cannot be directly applied to tabular settings. We address these issues by proposing new cost and utility-aware threat models that are tailored to the adversarial capabilities and constraints of attackers targeting tabular domains. We introduce a framework that enables us to design attack and defense mechanisms that result in models protected against cost and utility-aware adversaries, for example, adversaries constrained by a certain financial budget. We show that our approach is effective on three datasets corresponding to applications for which adversarial examples can have economic and social implications.
Submitted 24 February, 2023; v1 submitted 27 August, 2022;
originally announced August 2022.
-
COOKIEGRAPH: Understanding and Detecting First-Party Tracking Cookies
Authors:
Shaoor Munir,
Sandra Siby,
Umar Iqbal,
Steven Englehardt,
Zubair Shafiq,
Carmela Troncoso
Abstract:
As third-party cookie blocking is becoming the norm in browsers, advertisers and trackers have started to use first-party cookies for tracking. We conduct a differential measurement study on 10K websites with third-party cookies allowed and blocked. This study reveals that first-party cookies are used to store and exfiltrate identifiers to known trackers even when third-party cookies are blocked.
As opposed to third-party cookie blocking, outright first-party cookie blocking is not practical because it would result in major functionality breakage. We propose CookieGraph, a machine learning-based approach that can accurately and robustly detect first-party tracking cookies. CookieGraph detects first-party tracking cookies with 90.20% accuracy, outperforming the state-of-the-art CookieBlock approach by 17.75%. We show that CookieGraph is fully robust against cookie name manipulation while CookieBlock's accuracy drops by 15.68%. While blocking all first-party cookies results in major breakage on 32% of the sites with SSO logins, and CookieBlock reduces it to 10%, we show that CookieGraph does not cause any major breakage on these sites.
Our deployment of CookieGraph shows that first-party tracking cookies are used on 93.43% of the 10K websites. We also find that first-party tracking cookies are set by fingerprinting scripts. The most prevalent first-party tracking cookies are set by major advertising entities such as Google, Facebook, and TikTok.
Submitted 27 November, 2023; v1 submitted 25 August, 2022;
originally announced August 2022.
-
Verifiable Encodings for Secure Homomorphic Analytics
Authors:
Sylvain Chatel,
Christian Knabenhans,
Apostolos Pyrgelis,
Carmela Troncoso,
Jean-Pierre Hubaux
Abstract:
Homomorphic encryption, which enables the execution of arithmetic operations directly on ciphertexts, is a promising solution for protecting the privacy of cloud-delegated computations on sensitive data. However, the correctness of the computation result is not ensured. We propose two error detection encodings and build authenticators that enable practical client-verification of cloud-based homomorphic computations under different trade-offs and without compromising on the features of the encryption algorithm. Our authenticators operate on top of trending fully homomorphic encryption schemes over the integers that are based on ring learning with errors. We implement our solution in VERITAS, a ready-to-use system for verification of outsourced computations executed over encrypted data. We show that, contrary to prior work, VERITAS supports verification of any homomorphic operation, and we demonstrate its practicality for various applications, such as ride-hailing, genomic-data analysis, encrypted search, and machine-learning training and inference.
Submitted 4 June, 2024; v1 submitted 28 July, 2022;
originally announced July 2022.
-
Private Collection Matching Protocols
Authors:
Kasra EdalatNejad,
Mathilde Raynal,
Wouter Lueks,
Carmela Troncoso
Abstract:
We introduce Private Collection Matching (PCM) problems, in which a client aims to determine whether a collection of sets owned by a server matches their interests. Existing privacy-preserving cryptographic primitives cannot solve PCM problems efficiently without harming privacy. We propose a modular framework that enables designers to build privacy-preserving PCM systems that output one bit: whether a collection of server sets matches the client's set. The communication cost of our protocols scales linearly with the size of the client's set and is independent of the number of server elements. We demonstrate the potential of our framework by designing and implementing novel solutions for two real-world PCM problems: determining whether a dataset has chemical compounds of interest, and determining whether a document collection has relevant documents. Our evaluation shows that we offer a privacy gain with respect to existing works at a reasonable communication and computation cost.
Submitted 14 December, 2022; v1 submitted 14 June, 2022;
originally announced June 2022.
-
On the (In)security of Peer-to-Peer Decentralized Machine Learning
Authors:
Dario Pasquini,
Mathilde Raynal,
Carmela Troncoso
Abstract:
In this work, we carry out the first in-depth privacy analysis of Decentralized Learning -- a collaborative machine learning framework aimed at addressing the main limitations of federated learning. We introduce a suite of novel attacks for both passive and active decentralized adversaries. We demonstrate that, contrary to what is claimed by decentralized learning proposers, decentralized learning does not offer any security advantage over federated learning. Rather, it increases the attack surface, enabling any user in the system to perform privacy attacks such as gradient inversion, and even gain full control over honest users' local model. We also show that, given the state of the art in protections, privacy-preserving configurations of decentralized learning require fully connected networks, losing any practical advantage over the federated setup and therefore completely defeating the objective of the decentralized approach.
Submitted 10 November, 2023; v1 submitted 17 May, 2022;
originally announced May 2022.
-
You get PADDING, everybody gets PADDING! You get privacy? Evaluating practical QUIC website fingerprinting protections for the masses
Authors:
Sandra Siby,
Ludovic Barman,
Christopher Wood,
Marwan Fayed,
Nick Sullivan,
Carmela Troncoso
Abstract:
Website fingerprinting (WF) is a well-known threat to users' web privacy. New internet standards, such as QUIC, include padding to support defenses against WF. Previous work only analyzes the effectiveness of defenses when users are behind a VPN. Yet, this is not how most users browse the Internet. In this paper, we provide a comprehensive evaluation of QUIC-padding-based defenses against WF when users directly browse the web. We confirm previous claims that network-layer padding cannot provide good protection against powerful adversaries capable of observing all traffic traces. We further demonstrate that such padding is ineffective even against adversaries with constraints on traffic visibility and processing power. At the application layer, we show that defenses need to be deployed by both first and third parties, and that they can only thwart traffic analysis in limited situations. We identify challenges to deploying effective WF defenses and provide recommendations to address them.
Submitted 15 December, 2022; v1 submitted 15 March, 2022;
originally announced March 2022.
-
Bugs in our Pockets: The Risks of Client-Side Scanning
Authors:
Hal Abelson,
Ross Anderson,
Steven M. Bellovin,
Josh Benaloh,
Matt Blaze,
Jon Callas,
Whitfield Diffie,
Susan Landau,
Peter G. Neumann,
Ronald L. Rivest,
Jeffrey I. Schiller,
Bruce Schneier,
Vanessa Teague,
Carmela Troncoso
Abstract:
Our increasing reliance on digital technology for personal, economic, and government affairs has made it essential to secure the communications and devices of private citizens, businesses, and governments. This has led to pervasive use of cryptography across society. Despite its evident advantages, law enforcement and national security agencies have argued that the spread of cryptography has hindered access to evidence and intelligence. Some in industry and government now advocate a new technology to access targeted data: client-side scanning (CSS). Instead of weakening encryption or providing law enforcement with backdoor keys to decrypt communications, CSS would enable on-device analysis of data in the clear. If targeted information were detected, its existence and, potentially, its source, would be revealed to the agencies; otherwise, little or no information would leave the client device. Its proponents claim that CSS is a solution to the encryption versus public safety debate: it offers privacy -- in the sense of unimpeded end-to-end encryption -- and the ability to successfully investigate serious crime. In this report, we argue that CSS neither guarantees efficacious crime prevention nor prevents surveillance. Indeed, the effect is the opposite. CSS by its nature creates serious security and privacy risks for all society while the assistance it can provide for law enforcement is at best problematic. There are multiple ways in which client-side scanning can fail, can be evaded, and can be abused.
Submitted 14 October, 2021;
originally announced October 2021.
-
WebGraph: Capturing Advertising and Tracking Information Flows for Robust Blocking
Authors:
Sandra Siby,
Umar Iqbal,
Steven Englehardt,
Zubair Shafiq,
Carmela Troncoso
Abstract:
Millions of web users directly depend on ad and tracker blocking tools to protect their privacy. However, existing ad and tracker blockers fall short because of their reliance on trivially susceptible advertising and tracking content. In this paper, we first demonstrate that the state-of-the-art machine learning based ad and tracker blockers, such as AdGraph, are susceptible to adversarial evasions deployed in the real world. Second, we introduce WebGraph, the first graph-based machine learning blocker that detects ads and trackers based on their action rather than their content. By building features around the actions that are fundamental to advertising and tracking - storing an identifier in the browser, or sharing an identifier with another tracker - WebGraph performs nearly as well as prior approaches, but is significantly more robust to adversarial evasions. In particular, we show that WebGraph achieves comparable accuracy to AdGraph, while significantly decreasing the success rate of an adversary from near-perfect under AdGraph to around 8% under WebGraph. Finally, we show that WebGraph remains robust to a more sophisticated adversary that uses evasion techniques beyond those currently deployed on the web.
Submitted 17 August, 2021; v1 submitted 23 July, 2021;
originally announced July 2021.
-
Preliminary Analysis of Potential Harms in the Luca Tracing System
Authors:
Theresa Stadler,
Wouter Lueks,
Katharina Kohls,
Carmela Troncoso
Abstract:
In this document, we analyse the potential harms a large-scale deployment of the Luca system might cause to individuals, venues, and communities. The Luca system is a digital presence tracing system designed to provide health departments with the contact information necessary to alert individuals who have visited a location at the same time as a SARS-CoV-2-positive person. Multiple regional health departments in Germany have announced their plans to deploy the Luca system for the purpose of presence tracing. The system's developers suggest its use across various types of venues: from bars and restaurants to public and private events, such as religious or political gatherings, weddings, and birthday parties. Recently, an extension to include schools and other educational facilities was discussed in public. Our analysis of the potential harms of the system is based on the publicly available Luca Security Concept, which describes the system's security architecture and its planned protection mechanisms. The Security Concept furthermore provides a set of claims about the system's security and privacy properties. Besides an analysis of harms, our analysis includes a validation of these claims.
Submitted 22 March, 2021;
originally announced March 2021.
-
Towards a common performance and effectiveness terminology for digital proximity tracing applications
Authors:
Justus Benzler,
Dan Bogdanov,
Göran Kirchner,
Wouter Lueks,
Raquel Lucas,
Rui Oliveira,
Bart Preneel,
Marcel Salathe,
Carmela Troncoso,
Viktor von Wyl
Abstract:
Digital proximity tracing (DPT) for SARS-CoV-2 pandemic mitigation is a complex intervention with the primary goal to notify app users about possible risk exposures to infected persons. Policymakers and DPT operators need to know whether their system works as expected in terms of speed or yield (performance) and whether DPT is making an effective contribution to pandemic mitigation (also in comparison to and beyond established mitigation measures, particularly manual contact tracing). Thereby, performance and effectiveness are not to be confused. Not only are there conceptual differences but also diverse data requirements. This article describes differences between performance and effectiveness measures and attempts to develop a terminology and classification system for DPT evaluation. We discuss key aspects for critical assessments of whether the integration of additional data measurements into DPT apps - beyond what is required to fulfill its primary notification role - may facilitate an understanding of performance and effectiveness of planned and deployed DPT apps. Therefore, the terminology and a classification matrix may offer some guidance to DPT system operators regarding which measurements to prioritize. DPT developers and operators may also make conscious decisions to integrate measures for epidemic monitoring but should be aware that this introduces a secondary purpose to DPT that is not part of the original DPT design. Ultimately, the integration of further information for epidemic monitoring into DPT involves a trade-off between data granularity and linkage on the one hand, and privacy on the other. Decision-makers should be aware of the trade-off and take it into account when planning and developing DPT notification and monitoring systems or intending to assess the added value of DPT relative to existing contact tracing systems.
Submitted 23 December, 2020;
originally announced December 2020.
-
Synthetic Data -- Anonymisation Groundhog Day
Authors:
Theresa Stadler,
Bristena Oprisanu,
Carmela Troncoso
Abstract:
Synthetic data has been advertised as a silver-bullet solution to privacy-preserving data publishing that addresses the shortcomings of traditional anonymisation techniques. The promise is that synthetic data drawn from generative models preserves the statistical properties of the original dataset but, at the same time, provides perfect protection against privacy attacks. In this work, we present the first quantitative evaluation of the privacy gain of synthetic data publishing and compare it to that of previous anonymisation techniques.
Our evaluation of a wide range of state-of-the-art generative models demonstrates that synthetic data either does not prevent inference attacks or does not retain data utility. In other words, we empirically show that synthetic data does not provide a better tradeoff between privacy and utility than traditional anonymisation techniques.
Furthermore, in contrast to traditional anonymisation, the privacy-utility tradeoff of synthetic data publishing is hard to predict. Because it is impossible to predict what signals a synthetic dataset will preserve and what information will be lost, synthetic data leads to a highly variable privacy gain and unpredictable utility loss. In summary, we find that synthetic data is far from the holy grail of privacy-preserving data publishing.
Submitted 24 January, 2022; v1 submitted 13 November, 2020;
originally announced November 2020.
-
Bayes Security: A Not So Average Metric
Authors:
Konstantinos Chatzikokolakis,
Giovanni Cherubin,
Catuscia Palamidessi,
Carmela Troncoso
Abstract:
Security system designers favor worst-case security metrics, such as those derived from differential privacy (DP), due to the strong guarantees they provide. On the downside, these guarantees result in a high penalty on the system's performance. In this paper, we study Bayes security, a security metric inspired by the cryptographic advantage. Similarly to DP, Bayes security i) is independent of an adversary's prior knowledge; ii) captures the worst-case scenario for the two most vulnerable secrets (e.g., data records); and iii) is easy to compose, facilitating security analyses. Additionally, Bayes security iv) can be consistently estimated in a black-box manner, contrary to DP, which is useful when a formal analysis is not feasible; and v) provides a better utility-security trade-off in high-security regimes because it quantifies the risk for a specific threat model as opposed to threat-agnostic metrics such as DP. We formulate a theory around Bayes security, and we provide a thorough comparison with respect to well-known metrics, identifying the scenarios where Bayes security is advantageous for designers.
Submitted 20 February, 2024; v1 submitted 6 November, 2020;
originally announced November 2020.
-
Privacy Engineering Meets Software Engineering. On the Challenges of Engineering Privacy by Design
Authors:
Blagovesta Kostova,
Seda Gürses,
Carmela Troncoso
Abstract:
Current-day software development relies heavily on the use of service architectures and on agile iterative development methods to design, implement, and deploy systems. These practices result in systems made up of multiple services that introduce new data flows and evolving designs that escape the control of a single designer. Academic privacy engineering literature typically abstracts away such conditions of software production in order to achieve generalizable results. Yet, through a systematic study of the literature, we show that proposed solutions inevitably make assumptions about software architectures, development methods and scope of designer control that are misaligned with current practices. These misalignments are likely to pose an obstacle to operationalizing privacy engineering solutions in the wild. Specifically, we identify important limitations in the approaches that researchers take to design and evaluate privacy enhancing technologies which ripple to proposals for privacy engineering methodologies. Based on our analysis, we delineate research and actions needed to re-align research with practice, changes that serve as a precondition for the operationalization of academic privacy results in common software engineering practices.
Submitted 16 July, 2020;
originally announced July 2020.
-
DatashareNetwork: A Decentralized Privacy-Preserving Search Engine for Investigative Journalists
Authors:
Kasra EdalatNejad,
Wouter Lueks,
Julien Pierre Martin,
Soline Ledésert,
Anne L'Hôte,
Bruno Thomas,
Laurent Girod,
Carmela Troncoso
Abstract:
Investigative journalists collect large numbers of digital documents during their investigations. These documents can greatly benefit other journalists' work. However, many of these documents contain sensitive information. Hence, possessing such documents can endanger reporters, their stories, and their sources. Consequently, many documents are used only for single, local, investigations.
We present DatashareNetwork, a decentralized and privacy-preserving search system that enables journalists worldwide to find documents via a dedicated network of peers. DatashareNetwork combines well-known anonymous authentication mechanisms and anonymous communication primitives, a novel asynchronous messaging system, and a novel multi-set private set intersection protocol (MS-PSI) into a *decentralized peer-to-peer private document search engine*. We prove that DatashareNetwork is secure; and show using a prototype implementation that it scales to thousands of users and millions of documents.
Submitted 30 July, 2020; v1 submitted 29 May, 2020;
originally announced May 2020.
-
Decentralized Privacy-Preserving Proximity Tracing
Authors:
Carmela Troncoso,
Mathias Payer,
Jean-Pierre Hubaux,
Marcel Salathé,
James Larus,
Edouard Bugnion,
Wouter Lueks,
Theresa Stadler,
Apostolos Pyrgelis,
Daniele Antonioli,
Ludovic Barman,
Sylvain Chatel,
Kenneth Paterson,
Srdjan Čapkun,
David Basin,
Jan Beutel,
Dennis Jackson,
Marc Roeschlin,
Patrick Leu,
Bart Preneel,
Nigel Smart,
Aysajan Abidin,
Seda Gürses,
Michael Veale,
Cas Cremers
, et al. (9 additional authors not shown)
Abstract:
This document describes and analyzes a system for secure and privacy-preserving proximity tracing at large scale. This system, referred to as DP3T, provides a technological foundation to help slow the spread of SARS-CoV-2 by simplifying and accelerating the process of notifying people who might have been exposed to the virus so that they can take appropriate measures to break its transmission chain. The system aims to minimise privacy and security risks for individuals and communities and guarantee the highest level of data protection. The goal of our proximity tracing system is to determine who has been in close physical proximity to a COVID-19 positive person and thus exposed to the virus, without revealing the contact's identity or where the contact occurred. To achieve this goal, users run a smartphone app that continually broadcasts an ephemeral, pseudo-random ID representing the user's phone and also records the pseudo-random IDs observed from smartphones in close proximity. When a patient is diagnosed with COVID-19, she can upload pseudo-random IDs previously broadcast from her phone to a central server. Prior to the upload, all data remains exclusively on the user's phone. Other users' apps can use data from the server to locally estimate whether the device's owner was exposed to the virus through close-range physical proximity to a COVID-19 positive person who has uploaded their data. In case the app detects a high risk, it will inform the user.
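The flow described in this abstract (broadcast ephemeral IDs, record observed IDs, upload after diagnosis, match locally) can be illustrated with a minimal Python sketch. It is not the DP3T design or its reference implementation: the `Phone` class, its method names, the 16-byte ID length, and the use of a plain set as the "central server" are all assumptions made for illustration, and the sketch only counts matching IDs rather than estimating risk.

```python
import secrets


class Phone:
    """Hypothetical model of one user's app state; data stays local until upload."""

    def __init__(self):
        self.broadcast_ids = []    # ephemeral IDs this phone has broadcast
        self.observed_ids = set()  # ephemeral IDs heard from nearby phones

    def new_ephemeral_id(self):
        # 16 random bytes is an assumed ID length, chosen for illustration only.
        eph_id = secrets.token_bytes(16)
        self.broadcast_ids.append(eph_id)
        return eph_id

    def observe(self, eph_id):
        # Record an ID received from a phone in close proximity.
        self.observed_ids.add(eph_id)

    def upload_after_diagnosis(self, server):
        # A diagnosed user publishes only the IDs their own phone broadcast.
        server.update(self.broadcast_ids)

    def exposure_count(self, server):
        # Matching happens locally, against the published IDs.
        return len(self.observed_ids & server)


server = set()  # stands in for the central server's published ID list
alice, bob, carol = Phone(), Phone(), Phone()

bob.observe(alice.new_ephemeral_id())    # Alice and Bob were in proximity
alice.new_ephemeral_id()                 # a broadcast nobody recorded
carol.observe(bob.new_ephemeral_id())    # Carol only met Bob

alice.upload_after_diagnosis(server)
print(bob.exposure_count(server))    # Bob recorded one of Alice's IDs
print(carol.exposure_count(server))  # Carol recorded none of them
```

In this sketch the server never learns which phones observed which IDs, since each phone compares the published list against its own local record.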
Submitted 25 May, 2020;
originally announced May 2020.
-
VoteAgain: A scalable coercion-resistant voting system
Authors:
Wouter Lueks,
Iñigo Querejeta-Azurmendi,
Carmela Troncoso
Abstract:
The strongest threat model for voting systems considers coercion resistance: protection against coercers that force voters to modify their votes, or to abstain. Existing remote voting systems either do not provide this property; require an expensive tallying phase; or burden users with the need to store cryptographic key material and with the responsibility to deceive their coercers. We propose VoteAgain, a scalable voting scheme that relies on the revoting paradigm to provide coercion resistance. VoteAgain uses a novel deterministic ballot padding mechanism to ensure that coercers cannot see whether a vote has been replaced. This mechanism ensures tallies take quasilinear time, making VoteAgain the first revoting scheme that can handle elections with millions of voters. We prove that VoteAgain provides ballot privacy, coercion resistance, and verifiability; and we demonstrate its scalability using a prototype implementation of all cryptographic primitives.
△ Less
Submitted 1 June, 2020; v1 submitted 22 May, 2020;
originally announced May 2020.
-
zksk: A Library for Composable Zero-Knowledge Proofs
Authors:
Wouter Lueks,
Bogdan Kulynych,
Jules Fasquelle,
Simon Le Bail-Collet,
Carmela Troncoso
Abstract:
Zero-knowledge proofs are an essential building block in many privacy-preserving systems. However, implementing these proofs is tedious and error-prone. In this paper, we present zksk, a well-documented Python library for defining and computing sigma protocols: the most popular class of zero-knowledge proofs. In zksk, proofs compose: programmers can convert smaller proofs into building blocks that then can be combined into bigger proofs. zksk features a modern Python-based domain-specific language. This makes it possible to define proofs without learning a new custom language, and to benefit from the rich Python syntax and ecosystem. The library is available at https://github.com/spring-epfl/zksk
Submitted 10 November, 2019; v1 submitted 6 November, 2019;
originally announced November 2019.
-
Filter Design for Delay-Based Anonymous Communications
Authors:
Simon Oya,
Fernando Pérez-González,
Carmela Troncoso
Abstract:
In this work, we address the problem of designing delay-based anonymous communication systems. We consider a timed mix where an eavesdropper wants to learn the communication pattern of the users, and study how the mix must delay the messages so as to increase the adversary's estimation error. We show the connection between this problem and a MIMO system where we want to design the coloring filter that worsens the adversary's estimation of the MIMO channel matrix. We obtain theoretical solutions for the optimal filter against short-term and long-term adversaries, evaluate them with experiments, and show how some properties of filters can be used in the implementation of timed mixes. This opens the door to the application of previously known filter design techniques to anonymous communication systems.
Submitted 22 October, 2019;
originally announced October 2019.
-
Understanding the Effects of Real-World Behavior in Statistical Disclosure Attacks
Authors:
Simon Oya,
Carmela Troncoso,
Fernando Pérez-González
Abstract:
High-latency anonymous communication systems prevent passive eavesdroppers from inferring communicating partners with certainty. However, disclosure attacks allow an adversary to recover users' behavioral profiles when communications are persistent. Understanding how the system parameters affect the privacy of the users against such attacks is crucial. Earlier work in the area analyzes the performance of disclosure attacks in controlled scenarios, where a certain model about the users' behavior is assumed. In this paper, we analyze the profiling accuracy of one of the most efficient disclosure attacks, the least squares disclosure attack, in realistic scenarios. We generate real traffic observations from datasets of different nature and find that the models considered in previous work do not fit this realistic behavior. We relax previous hypotheses on the behavior of the users and extend previous performance analyses, validating our results with real data and providing new insights into the parameters that affect the protection of the users in the real world.
Submitted 22 October, 2019;
originally announced October 2019.
-
A Least Squares Approach to the Static Traffic Analysis of High-Latency Anonymous Communication Systems
Authors:
Fernando Pérez-González,
Carmela Troncoso,
Simon Oya
Abstract:
Mixes, relaying routers that hide the relation between incoming and outgoing messages, are the main building block of high-latency anonymous communication networks. A number of so-called disclosure attacks have been proposed to effectively de-anonymize traffic sent through these channels. Yet, the dependence of their success on the system parameters is not well understood. We propose the Least Squares Disclosure Attack (LSDA), in which user profiles are estimated by solving a least squares problem. We show that LSDA is not only suitable for the analysis of threshold mixes, but can be easily extended to attack pool mixes. Furthermore, contrary to previous heuristic-based attacks, our approach allows us to analytically derive expressions that characterize the profiling error of LSDA with respect to the system parameters. We empirically demonstrate that LSDA recovers users' profiles with greater accuracy than its statistical predecessors and verify that our analysis closely predicts actual performance.
Submitted 17 October, 2019;
originally announced October 2019.
-
Meet the Family of Statistical Disclosure Attacks
Authors:
Simon Oya,
Carmela Troncoso,
Fernando Pérez-González
Abstract:
Disclosure attacks aim at revealing communication patterns in anonymous communication systems, such as conversation partners or frequency. In this paper, we propose a framework to compare between the members of the statistical disclosure attack family. We compare different variants of the Statistical Disclosure Attack (SDA) in the literature, together with two new methods; as well as show their re…
▽ More
Disclosure attacks aim at revealing communication patterns in anonymous communication systems, such as conversation partners or frequency. In this paper, we propose a framework to compare between the members of the statistical disclosure attack family. We compare different variants of the Statistical Disclosure Attack (SDA) in the literature, together with two new methods; as well as show their relation with the Least Squares Disclosure Attack (LSDA).
We empirically explore the performance of the attacks with respect to the different parameters of the system. Our experiments show that i) our proposals considerably improve the state-of-the-art SDA and ii) confirm that LSDA outperforms the SDA family when the adversary has enough observations of the system.
Submitted 16 October, 2019;
originally announced October 2019.
-
Encrypted DNS --> Privacy? A Traffic Analysis Perspective
Authors:
Sandra Siby,
Marc Juarez,
Claudia Diaz,
Narseo Vallina-Rodriguez,
Carmela Troncoso
Abstract:
Virtually every connection to an Internet service is preceded by a DNS lookup which is performed without any traffic-level protection, thus enabling manipulation, redirection, surveillance, and censorship. To address these issues, large organizations such as Google and Cloudflare are deploying recently standardized protocols, such as DNS-over-TLS (DoT) and DNS-over-HTTPS (DoH), that encrypt DNS traffic between end users and recursive resolvers. In this paper, we examine whether encrypting DNS traffic can protect users from traffic analysis-based monitoring and censoring. We propose a novel feature set to perform the attacks, as those used to attack HTTPS or Tor traffic are not suitable for the characteristics of DNS. We show that traffic analysis enables the identification of domains with high accuracy in closed and open world settings, using 124 times less data than attacks on HTTPS flows. We find that factors such as location, resolver, platform, or client do reduce the attacks' performance, but they are far from completely stopping them. Our results indicate that DNS-based censorship is still possible on encrypted DNS traffic. In fact, we demonstrate that the standardized padding schemes are not effective. Yet, Tor -- which does not effectively mitigate traffic analysis attacks on web traffic -- is a good defense against DoH traffic analysis.
Submitted 6 October, 2019; v1 submitted 23 June, 2019;
originally announced June 2019.
-
Disparate Vulnerability to Membership Inference Attacks
Authors:
Bogdan Kulynych,
Mohammad Yaghini,
Giovanni Cherubin,
Michael Veale,
Carmela Troncoso
Abstract:
A membership inference attack (MIA) against a machine-learning model enables an attacker to determine whether a given data record was part of the model's training data or not. In this paper, we provide an in-depth study of the phenomenon of disparate vulnerability against MIAs: unequal success rate of MIAs against different population subgroups. We first establish necessary and sufficient conditions for MIAs to be prevented, both on average and for population subgroups, using a notion of distributional generalization. Second, we derive connections of disparate vulnerability to algorithmic fairness and to differential privacy. We show that fairness can only prevent disparate vulnerability against limited classes of adversaries. Differential privacy bounds disparate vulnerability but can significantly reduce the accuracy of the model. We show that estimating disparate vulnerability to MIAs by naïvely applying existing attacks can lead to overestimation. We then establish which attacks are suitable for estimating disparate vulnerability, and provide a statistical framework for doing so reliably. We conduct experiments on synthetic and real-world data finding statistically significant evidence of disparate vulnerability in realistic settings. The code is available at https://github.com/spring-epfl/disparate-vulnerability
Submitted 16 September, 2021; v1 submitted 2 June, 2019;
originally announced June 2019.
-
Measuring Membership Privacy on Aggregate Location Time-Series
Authors:
Apostolos Pyrgelis,
Carmela Troncoso,
Emiliano De Cristofaro
Abstract:
While location data is extremely valuable for various applications, disclosing it prompts serious threats to individuals' privacy. To limit such concerns, organizations often provide analysts with aggregate time-series that indicate, e.g., how many people are in a location at a time interval, rather than raw individual traces. In this paper, we perform a measurement study to understand Membership Inference Attacks (MIAs) on aggregate location time-series, where an adversary tries to infer whether a specific user contributed to the aggregates.
We find that the volume of contributed data, as well as the regularity and particularity of users' mobility patterns, play a crucial role in the attack's success. We experiment with a wide range of defenses based on generalization, hiding, and perturbation, and evaluate their ability to thwart the attack vis-a-vis the utility loss they introduce for various mobility analytics tasks.
Our results show that some defenses fail across the board, while others work for specific tasks on aggregate location time-series. For instance, suppressing small counts can be used for ranking hotspots, data generalization for forecasting traffic, hotspot discovery, and map inference, while sampling is effective for location labeling and anomaly detection when the dataset is sparse. Differentially private techniques provide reasonable accuracy only in very specific settings, e.g., discovering hotspots and forecasting their traffic, and more so when using weaker privacy notions like crowd-blending privacy. Overall, our measurements show that there does not exist a unique generic defense that can preserve the utility of the analytics for arbitrary applications, and provide useful insights regarding the disclosure of sanitized aggregate location time-series.
Submitted 27 April, 2020; v1 submitted 20 February, 2019;
originally announced February 2019.
-
On (The Lack Of) Location Privacy in Crowdsourcing Applications
Authors:
Spyros Boukoros,
Mathias Humbert,
Stefan Katzenbeisser,
Carmela Troncoso
Abstract:
Crowdsourcing enables application developers to benefit from large and diverse datasets at a low cost. Specifically, mobile crowdsourcing (MCS) leverages users' devices as sensors to perform geo-located data collection. The collection of geolocated data raises serious privacy concerns for users. Yet, despite the large body of research on location privacy-preserving mechanisms (LPPMs), MCS developers implement little to no protection for data collection or publication. To understand this mismatch, we study the performance of existing LPPMs on publicly available data from two mobile crowdsourcing projects. Our results show that well-established defenses are either not applicable or offer little protection in the MCS setting. Additionally, they have a much stronger impact on applications' utility than foreseen in the literature. This is because existing LPPMs, designed with location-based services (LBSs) in mind, are optimized for utility functions based on users' locations, while MCS utility functions depend on the values (e.g., measurements) associated with those locations. We finally outline possible research avenues to facilitate the development of new location privacy solutions that fit the needs of MCS so that the increasing number of such applications does not jeopardize their users' privacy.
△ Less
Submitted 5 June, 2019; v1 submitted 15 January, 2019;
originally announced January 2019.
-
Questioning the assumptions behind fairness solutions
Authors:
Rebekah Overdorf,
Bogdan Kulynych,
Ero Balsa,
Carmela Troncoso,
Seda Gürses
Abstract:
In addition to their benefits, optimization systems can have negative economic, moral, social, and political effects on populations as well as their environments. Frameworks like fairness have been proposed to aid service providers in addressing subsequent bias and discrimination during data collection and algorithm design. However, recent reports of neglect, unresponsiveness, and malevolence cast doubt on whether service providers can effectively implement fairness solutions. These reports invite us to revisit assumptions made about the service providers in fairness solutions. Namely, that service providers have (i) the incentives or (ii) the means to mitigate optimization externalities. Moreover, the environmental impact of these systems suggests that we need (iii) novel frameworks that consider systems other than algorithmic decision-making and recommender systems, and (iv) solutions that go beyond removing related algorithmic biases. Going forward, we propose Protective Optimization Technologies that enable optimization subjects to defend against negative consequences of optimization systems.
△ Less
Submitted 27 November, 2018;
originally announced November 2018.
-
Evading classifiers in discrete domains with provable optimality guarantees
Authors:
Bogdan Kulynych,
Jamie Hayes,
Nikita Samarin,
Carmela Troncoso
Abstract:
Machine-learning models for security-critical applications such as bot, malware, or spam detection, operate in constrained discrete domains. These applications would benefit from having provable guarantees against adversarial examples. The existing literature on provable adversarial robustness of models, however, exclusively focuses on robustness to gradient-based attacks in domains such as images. These attacks model the adversarial cost, e.g., amount of distortion applied to an image, as a $p$-norm. We argue that this approach is not well-suited to model adversarial costs in constrained domains where not all examples are feasible.
We introduce a graphical framework that (1) generalizes existing attacks in discrete domains, (2) can accommodate complex cost functions beyond $p$-norms, including financial cost incurred when attacking a classifier, and (3) efficiently produces valid adversarial examples with guarantees of minimal adversarial cost. These guarantees directly translate into a notion of adversarial robustness that takes into account domain constraints and the adversary's capabilities. We show how our framework can be used to evaluate security by crafting adversarial examples that evade a Twitter-bot detection classifier with provably minimal number of changes; and to build privacy defenses by crafting adversarial examples that evade a privacy-invasive website-fingerprinting classifier.
△ Less
Submitted 1 July, 2019; v1 submitted 25 October, 2018;
originally announced October 2018.
-
Rethinking Location Privacy for Unknown Mobility Behaviors
Authors:
Simon Oya,
Carmela Troncoso,
Fernando Pérez-González
Abstract:
Location Privacy-Preserving Mechanisms (LPPMs) in the literature largely consider that users' data available for training wholly characterizes their mobility patterns. Thus, they hardwire this information in their designs and evaluate their privacy properties with these same data. In this paper, we aim to understand the impact of this decision on the level of privacy these LPPMs may offer in real life when the users' mobility data may be different from the data used in the design phase. Our results show that, in many cases, training data does not capture users' behavior accurately and, thus, the level of privacy provided by the LPPM is often overestimated. To address this gap between theory and practice, we propose to use blank-slate models for LPPM design. Contrary to the hardwired approach, which assumes known users' behavior, blank-slate models learn the users' behavior from the queries to the service provider. We leverage this blank-slate approach to develop a new family of LPPMs, which we call Profile Estimation-Based LPPMs. Using real data, we empirically show that our proposal outperforms optimal state-of-the-art mechanisms designed on sporadic hardwired models. On non-sporadic location privacy scenarios, our method is only better if the usage of the location privacy service is not continuous. It is our hope that eliminating the need to bootstrap the mechanisms with training data and ensuring that the mechanisms are lightweight and easy to compute will help foster the integration of location privacy protections in deployed systems.
△ Less
Submitted 23 May, 2019; v1 submitted 12 September, 2018;
originally announced September 2018.
-
Tandem: Securing Keys by Using a Central Server While Preserving Privacy
Authors:
Wouter Lueks,
Brinda Hampiholi,
Greg Alpár,
Carmela Troncoso
Abstract:
Users' devices, e.g., smartphones or laptops, are typically incapable of securely storing and processing cryptographic keys. We present Tandem, a novel set of protocols for securing cryptographic keys with support from a central server. Tandem uses one-time-use key-share tokens to preserve users' privacy with respect to a malicious central server. Additionally, Tandem enables users to block their keys if they lose their device, and it enables the server to limit how often an adversary can use an unblocked key. We prove Tandem's security and privacy properties, apply Tandem to attribute-based credentials, and implement a Tandem proof of concept to show that it causes little overhead.
△ Less
Submitted 13 July, 2020; v1 submitted 10 September, 2018;
originally announced September 2018.
-
POTs: Protective Optimization Technologies
Authors:
Bogdan Kulynych,
Rebekah Overdorf,
Carmela Troncoso,
Seda Gürses
Abstract:
Algorithmic fairness aims to address the economic, moral, social, and political impact that digital systems have on populations through solutions that can be applied by service providers. Fairness frameworks do so, in part, by mapping these problems to a narrow definition and assuming the service providers can be trusted to deploy countermeasures. Not surprisingly, these decisions limit fairness frameworks' ability to capture a variety of harms caused by systems.
We characterize fairness limitations using concepts from requirements engineering and from social sciences. We show that the focus on algorithms' inputs and outputs misses harms that arise from systems interacting with the world; that the focus on bias and discrimination omits broader harms on populations and their environments; and that relying on service providers excludes scenarios where they are not cooperative or intentionally adversarial.
We propose Protective Optimization Technologies (POTs). POTs provide means for affected parties to address the negative impacts of systems in the environment, expanding avenues for political contestation. POTs intervene from outside the system, do not require service providers to cooperate, and can serve to correct, shift, or expose harms that systems impose on populations and their environments. We illustrate the potential and limitations of POTs in two case studies: countering road congestion caused by traffic-beating applications, and recalibrating credit scoring for loan applicants.
Submitted 26 January, 2020; v1 submitted 7 June, 2018;
originally announced June 2018.
-
Under the Underground: Predicting Private Interactions in Underground Forums
Authors:
Rebekah Overdorf,
Carmela Troncoso,
Rachel Greenstadt,
Damon McCoy
Abstract:
Underground forums where users discuss, buy, and sell illicit services and goods facilitate a better understanding of the economy and organization of cybercriminals. Prior work has shown that private interactions in particular provide a wealth of information about the cybercriminal ecosystem. Yet, those messages are seldom available to analysts, except when there is a leak. To address this problem, we propose a supervised machine-learning-based method able to predict which public threads will generate private messages, after a partial leak of such messages has occurred. To the best of our knowledge, we are the first to develop a solution to overcome the barrier posed by limited to no information on private activity for underground forum analysis. Additionally, we propose an automated method for labeling posts, significantly reducing the cost of our approach in the presence of real unlabeled data. This method can be tuned to focus on the likelihood of users receiving private messages, or threads triggering private interactions. We evaluate the performance of our methods using data from three real forum leaks. Our results show that public information can indeed be used to predict private activity, although prediction models do not transfer well between forums. We also find that neither the length of the leak period nor the time between the leak and the prediction has a significant impact on our technique's performance, and that NLP features dominate the prediction power.
△ Less
Submitted 11 May, 2018;
originally announced May 2018.
-
TARANET: Traffic-Analysis Resistant Anonymity at the NETwork layer
Authors:
Chen Chen,
Daniele E. Asoni,
Adrian Perrig,
David Barrera,
George Danezis,
Carmela Troncoso
Abstract:
Modern low-latency anonymity systems, no matter whether constructed as an overlay or implemented at the network layer, offer limited security guarantees against traffic analysis. On the other hand, high-latency anonymity systems offer strong security guarantees at the cost of computational overhead and long delays, which are excessive for interactive applications. We propose TARANET, an anonymity system that implements protection against traffic analysis at the network layer, and limits the incurred latency and overhead. In TARANET's setup phase, traffic analysis is thwarted by mixing. In the data transmission phase, end hosts and ASes coordinate to shape traffic into constant-rate transmission using packet splitting. Our prototype implementation shows that TARANET can forward anonymous traffic at over 50 Gbps using commodity hardware.
△ Less
Submitted 23 February, 2018;
originally announced February 2018.
-
Feature importance scores and lossless feature pruning using Banzhaf power indices
Authors:
Bogdan Kulynych,
Carmela Troncoso
Abstract:
Understanding the influence of features in machine learning is crucial to interpreting models and selecting the best features for classification. In this work we propose the use of principles from coalitional game theory to reason about importance of features. In particular, we propose the use of the Banzhaf power index as a measure of influence of features on the outcome of a classifier. We show that features having Banzhaf power index of zero can be losslessly pruned without damage to classifier accuracy. Computing the power indices does not require having access to data samples. However, if samples are available, the indices can be empirically estimated. We compute Banzhaf power indices for a neural network classifier on real-life data, and compare the results with gradient-based feature saliency, and coefficients of a logistic regression model with $L_1$ regularization.
△ Less
Submitted 3 December, 2017; v1 submitted 14 November, 2017;
originally announced November 2017.
-
Is Geo-Indistinguishability What You Are Looking for?
Authors:
Simon Oya,
Carmela Troncoso,
Fernando Pérez-González
Abstract:
Since its proposal in 2013, geo-indistinguishability has been consolidated as a formal notion of location privacy, generating a rich body of literature building on this idea. A problem with most of these follow-up works is that they blindly rely on geo-indistinguishability to provide location privacy, ignoring the numerical interpretation of this privacy guarantee. In this paper, we provide an alternative formulation of geo-indistinguishability as an adversary error, and use it to show that the privacy vs. utility trade-off that can be obtained is not as appealing as implied by the literature. We also show that although geo-indistinguishability guarantees a lower bound on the adversary's error, this comes at the cost of achieving poorer performance than other noise generation mechanisms in terms of average error, and enabling the possibility of exposing obfuscated locations that are useless from the quality of service point of view.
△ Less
Submitted 19 September, 2017;
originally announced September 2017.
-
Knock Knock, Who's There? Membership Inference on Aggregate Location Data
Authors:
Apostolos Pyrgelis,
Carmela Troncoso,
Emiliano De Cristofaro
Abstract:
Aggregate location data is often used to support smart services and applications, e.g., generating live traffic maps or predicting visits to businesses. In this paper, we present the first study on the feasibility of membership inference attacks on aggregate location time-series. We introduce a game-based definition of the adversarial task, and cast it as a classification problem where machine learning can be used to distinguish whether or not a target user is part of the aggregates.
We empirically evaluate the power of these attacks on both raw and differentially private aggregates using two mobility datasets. We find that membership inference is a serious privacy threat, and show how its effectiveness depends on the adversary's prior knowledge, the characteristics of the underlying location data, as well as the number of users and the timeframe on which aggregation is performed. Although differentially private mechanisms can indeed reduce the extent of the attacks, they also yield a significant loss in utility. Moreover, a strategic adversary mimicking the behavior of the defense mechanism can greatly limit the protection they provide. Overall, our work presents a novel methodology geared to evaluate membership inference on aggregate location data in real-world settings and can be used by providers to assess the quality of privacy protection before data release or by regulators to detect violations.
△ Less
Submitted 29 November, 2017; v1 submitted 21 August, 2017;
originally announced August 2017.
-
ClaimChain: Improving the Security and Privacy of In-band Key Distribution for Messaging
Authors:
Bogdan Kulynych,
Wouter Lueks,
Marios Isaakidis,
George Danezis,
Carmela Troncoso
Abstract:
The social demand for email end-to-end encryption is barely supported by mainstream service providers. Autocrypt is a new community-driven open specification for e-mail encryption that attempts to respond to this demand. In Autocrypt the encryption keys are attached directly to messages, and thus the encryption can be implemented by email clients without any collaboration of the providers. The decentralized nature of this in-band key distribution, however, makes it prone to man-in-the-middle attacks and can leak the social graph of users. To address this problem we introduce ClaimChain, a cryptographic construction for privacy-preserving authentication of public keys. Users store claims about their identities and keys, as well as their beliefs about others, in ClaimChains. These chains form authenticated decentralized repositories that enable users to prove the authenticity of both their keys and the keys of their contacts. ClaimChains are encrypted, and therefore protect the stored information, such as keys and contact identities, from prying eyes. At the same time, ClaimChain implements mechanisms to provide strong non-equivocation properties, discouraging malicious actors from distributing conflicting or inauthentic claims. We implemented ClaimChain and we show that it offers reasonable performance, low overhead, and authenticity guarantees.
△ Less
Submitted 12 October, 2018; v1 submitted 19 July, 2017;
originally announced July 2017.