Showing 1–10 of 10 results for author: Meeus, M

  1. arXiv:2502.14921  [pdf, other]

    cs.CL cs.CR cs.LG

    The Canary's Echo: Auditing Privacy Risks of LLM-Generated Synthetic Text

    Authors: Matthieu Meeus, Lukas Wutschitz, Santiago Zanella-Béguelin, Shruti Tople, Reza Shokri

    Abstract: How much information about training samples can be gleaned from synthetic data generated by Large Language Models (LLMs)? Overlooking the subtleties of information flow in synthetic data generation pipelines can lead to a false sense of privacy. In this paper, we design membership inference attacks (MIAs) that target data used to fine-tune pre-trained LLMs that are then used to synthesize data, pa…

    Submitted 19 February, 2025; originally announced February 2025.

  2. arXiv:2412.07633  [pdf, other]

    cs.CL

    ChocoLlama: Lessons Learned From Teaching Llamas Dutch

    Authors: Matthieu Meeus, Anthony Rathé, François Remy, Pieter Delobelle, Jens-Joris Decorte, Thomas Demeester

    Abstract: While Large Language Models (LLMs) have shown remarkable capabilities in natural language understanding and generation, their performance often lags in lower-resource, non-English languages due to biases in the training data. In this work, we explore strategies for adapting the primarily English LLMs (Llama-2 and Llama-3) to Dutch, a language spoken by 30 million people worldwide yet often underre…

    Submitted 10 December, 2024; originally announced December 2024.

  3. arXiv:2406.17975  [pdf, ps, other]

    cs.CL cs.CR cs.LG

    SoK: Membership Inference Attacks on LLMs are Rushing Nowhere (and How to Fix It)

    Authors: Matthieu Meeus, Igor Shilov, Shubham Jain, Manuel Faysse, Marek Rei, Yves-Alexandre de Montjoye

    Abstract: Whether LLMs memorize their training data and what this means, from measuring privacy leakage to detecting copyright violations, has become a rapidly growing area of research. In the last few months, more than 10 new methods have been proposed to perform Membership Inference Attacks (MIAs) against LLMs. Contrary to traditional MIAs which rely on fixed (but randomized) records or models, these method…

    Submitted 7 March, 2025; v1 submitted 25 June, 2024; originally announced June 2024.

    Comments: IEEE Conference on Secure and Trustworthy Machine Learning (SaTML 2025)

  4. arXiv:2405.15523  [pdf, other]

    cs.CL cs.LG

    The Mosaic Memory of Large Language Models

    Authors: Igor Shilov, Matthieu Meeus, Yves-Alexandre de Montjoye

    Abstract: As Large Language Models (LLMs) become widely adopted, understanding how they learn from, and memorize, training data becomes crucial. Memorization in LLMs is widely assumed to only occur as a result of sequences being repeated in the training data. Instead, we show that LLMs memorize by assembling information from similar sequences, a phenomenon we call mosaic memory. We show major LLMs to exhibit…

    Submitted 15 May, 2025; v1 submitted 24 May, 2024; originally announced May 2024.

  5. arXiv:2405.15423  [pdf, other]

    cs.LG cs.CR

    Lost in the Averages: A New Specific Setup to Evaluate Membership Inference Attacks Against Machine Learning Models

    Authors: Florent Guépin, Nataša Krčo, Matthieu Meeus, Yves-Alexandre de Montjoye

    Abstract: Membership Inference Attacks (MIAs) are widely used to evaluate the propensity of a machine learning (ML) model to memorize an individual record and the privacy risk releasing the model poses. MIAs are commonly evaluated similarly to ML models: the MIA is performed on a test set of models trained on datasets unseen during training, which are sampled from a larger pool, $D_{eval}$. The MIA is evalu…

    Submitted 24 May, 2024; originally announced May 2024.

  6. arXiv:2402.09363  [pdf, other]

    cs.CL cs.CR

    Copyright Traps for Large Language Models

    Authors: Matthieu Meeus, Igor Shilov, Manuel Faysse, Yves-Alexandre de Montjoye

    Abstract: Questions of fair use of copyright-protected content to train Large Language Models (LLMs) are being actively debated. Document-level inference has been proposed as a new task: inferring from black-box access to the trained model whether a piece of content has been seen during training. SOTA methods however rely on naturally occurring memorization of (part of) the content. While very effective aga…

    Submitted 4 June, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

    Comments: 41st International Conference on Machine Learning (ICML 2024)

  7. arXiv:2310.15007  [pdf, other]

    cs.CL cs.CR cs.LG

    Did the Neurons Read your Book? Document-level Membership Inference for Large Language Models

    Authors: Matthieu Meeus, Shubham Jain, Marek Rei, Yves-Alexandre de Montjoye

    Abstract: With large language models (LLMs) poised to become embedded in our daily lives, questions are starting to be raised about the data they learned from. These questions range from potential bias or misinformation LLMs could retain from their training data to questions of copyright and fair use of human-generated text. However, while these questions emerge, developers of the recent state-of-the-art LL…

    Submitted 15 July, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

    Comments: Accepted at 33rd USENIX Security Symposium (USENIX Security 2024)

  8. arXiv:2307.01701  [pdf, other]

    cs.CR cs.AI

    Synthetic is all you need: removing the auxiliary data assumption for membership inference attacks against synthetic data

    Authors: Florent Guépin, Matthieu Meeus, Ana-Maria Cretu, Yves-Alexandre de Montjoye

    Abstract: Synthetic data is emerging as one of the most promising solutions to share individual-level data while safeguarding privacy. While membership inference attacks (MIAs), based on shadow modeling, have become the standard to evaluate the privacy of synthetic data, they currently assume the attacker to have access to an auxiliary dataset sampled from a similar distribution as the training dataset. Thi…

    Submitted 21 September, 2023; v1 submitted 4 July, 2023; originally announced July 2023.

    Journal ref: ESORICS 2023 workshop Data Privacy Management (DPM) 2023

  9. Achilles' Heels: Vulnerable Record Identification in Synthetic Data Publishing

    Authors: Matthieu Meeus, Florent Guépin, Ana-Maria Cretu, Yves-Alexandre de Montjoye

    Abstract: Synthetic data is seen as the most promising solution to share individual-level data while preserving privacy. Shadow modeling-based Membership Inference Attacks (MIAs) have become the standard approach to evaluate the privacy risk of synthetic data. While very effective, they require a large number of datasets to be created and models trained to evaluate the risk posed by a single record. The pri…

    Submitted 21 September, 2023; v1 submitted 17 June, 2023; originally announced June 2023.

    Journal ref: Computer Security ESORICS 2023

  10. arXiv:1406.7562  [pdf]

    cs.SI physics.soc-ph

    When none of us perform better than all of us together: the role of analogical decision rules in groups

    Authors: Nicoleta Meslec, Petru Curseu, Marius Meeus, Oana Fodor

    Abstract: During social interactions, groups develop collective competencies that (ideally) should assist groups to outperform average standalone individual members (weak cognitive synergy) or the best performing member in the group (strong cognitive synergy). In two experimental studies we manipulate the type of decision rule used in group decision-making (identify the best vs. collaborative), and the way…

    Submitted 29 June, 2014; originally announced June 2014.

    Report number: ci-2014/35