
Showing 1–2 of 2 results for author: Atanasov, D

Searching in archive cs.
  1. arXiv:2409.12914  [pdf, other]

    cs.LG cs.CL

    Evaluating Defences against Unsafe Feedback in RLHF

    Authors: Domenic Rosati, Giles Edkins, Harsh Raj, David Atanasov, Subhabrata Majumdar, Janarthanan Rajendran, Frank Rudzicz, Hassan Sajjad

    Abstract: While there has been progress towards aligning Large Language Models (LLMs) with human values and ensuring safe behaviour at inference time, safety guards can easily be removed when fine tuned on unsafe and harmful datasets. While this setting has been treated extensively, another popular training paradigm, learning from unsafe feedback with reinforcement learning, has previously been unexplored.…

    Submitted 25 February, 2025; v1 submitted 19 September, 2024; originally announced September 2024.

  2. arXiv:2405.14577  [pdf, other]

    cs.CL cs.LG

    Representation Noising: A Defence Mechanism Against Harmful Finetuning

    Authors: Domenic Rosati, Jan Wehner, Kai Williams, Łukasz Bartoszcze, David Atanasov, Robie Gonzales, Subhabrata Majumdar, Carsten Maple, Hassan Sajjad, Frank Rudzicz

    Abstract: Releasing open-source large language models (LLMs) presents a dual-use risk since bad actors can easily fine-tune these models for harmful purposes. Even without the open release of weights, weight stealing and fine-tuning APIs make closed models vulnerable to harmful fine-tuning attacks (HFAs). While safety measures like preventing jailbreaks and improving safety guardrails are important, such me…

    Submitted 30 October, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

    Comments: Published in NeurIPS 2024