Showing 1–3 of 3 results for author: Kniejski, M

Searching in archive cs.
  1. arXiv:2506.12484  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization

    Authors: Filip Sondej, Yushi Yang, Mikołaj Kniejski, Marcel Windys

    Abstract: Language models can retain dangerous knowledge and skills even after extensive safety fine-tuning, posing both misuse and misalignment risks. Recent studies show that even specialized unlearning methods can be easily reversed. To address this, we systematically evaluate many existing and novel components of unlearning methods and identify ones crucial for irreversible unlearning. We introduce Di…

    Submitted 30 June, 2025; v1 submitted 14 June, 2025; originally announced June 2025.

  2. arXiv:2506.05533  [pdf, other]

    cs.CV cs.HC

    Personalized Interpretability -- Interactive Alignment of Prototypical Parts Networks

    Authors: Tomasz Michalski, Adam Wróbel, Andrea Bontempelli, Jakub Luśtyk, Mikolaj Kniejski, Stefano Teso, Andrea Passerini, Bartosz Zieliński, Dawid Rymarczyk

    Abstract: Concept-based interpretable neural networks have gained significant attention due to their intuitive and easy-to-understand explanations based on case-based reasoning, such as "this bird looks like those sparrows". However, a major limitation is that these explanations may not always be comprehensible to users due to concept inconsistency, where multiple visual features are inappropriately mixed (…

    Submitted 5 June, 2025; originally announced June 2025.

    Comments: 20 pages, 11 figures

  3. arXiv:2502.19145  [pdf, ps, other]

    cs.AI cs.MA

    Multi-Agent Security Tax: Trading Off Security and Collaboration Capabilities in Multi-Agent Systems

    Authors: Pierre Peigne-Lefebvre, Mikolaj Kniejski, Filip Sondej, Matthieu David, Jason Hoelscher-Obermaier, Christian Schroeder de Witt, Esben Kran

    Abstract: As AI agents are increasingly adopted to collaborate on complex objectives, ensuring the security of autonomous multi-agent systems becomes crucial. We develop simulations of agents collaborating on shared objectives to study these security risks and security trade-offs. We focus on scenarios where an attacker compromises one agent, using it to steer the entire system toward misaligned outcomes by…

    Submitted 4 June, 2025; v1 submitted 26 February, 2025; originally announced February 2025.

    Comments: Accepted to AAAI 2025 Conference