Showing 1–21 of 21 results for author: Abdelnabi, S

  1. arXiv:2506.09956  [pdf, ps, other]

    cs.CR cs.AI

    LLMail-Inject: A Dataset from a Realistic Adaptive Prompt Injection Challenge

    Authors: Sahar Abdelnabi, Aideen Fay, Ahmed Salem, Egor Zverev, Kai-Chieh Liao, Chi-Huang Liu, Chun-Chih Kuo, Jannis Weigend, Danyael Manlangit, Alex Apostolov, Haris Umair, João Donato, Masayuki Kawakita, Athar Mahboob, Tran Huu Bach, Tsun-Han Chiang, Myeongjin Cho, Hajin Choi, Byeonghyeon Kim, Hyeonjin Lee, Benjamin Pannell, Conor McCauley, Mark Russinovich, Andrew Paverd, Giovanni Cherubin

    Abstract: Indirect Prompt Injection attacks exploit the inherent limitation of Large Language Models (LLMs) to distinguish between instructions and data in their inputs. Despite numerous defense proposals, the systematic evaluation against adaptive adversaries remains limited, even when successful attacks can have wide security and privacy implications, and many real-world LLM-based applications remain vuln…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: Dataset at: https://huggingface.co/datasets/microsoft/llmail-inject-challenge

  2. arXiv:2506.04245  [pdf, ps, other]

    cs.AI cs.CL cs.LG

    Contextual Integrity in LLMs via Reasoning and Reinforcement Learning

    Authors: Guangchen Lan, Huseyin A. Inan, Sahar Abdelnabi, Janardhan Kulkarni, Lukas Wutschitz, Reza Shokri, Christopher G. Brinton, Robert Sim

    Abstract: As the era of autonomous agents making decisions on behalf of users unfolds, ensuring contextual integrity (CI) -- what is the appropriate information to share while carrying out a certain task -- becomes a central question to the field. We posit that CI demands a form of reasoning where the agent needs to reason about the context in which it is operating. To test this, we first prompt LLMs to rea…

    Submitted 29 May, 2025; originally announced June 2025.

    ACM Class: I.2.6; I.2.7

  3. arXiv:2505.14617  [pdf, ps, other]

    cs.CL cs.CY

    Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models

    Authors: Sahar Abdelnabi, Ahmed Salem

    Abstract: Reasoning-focused large language models (LLMs) sometimes alter their behavior when they detect that they are being evaluated, an effect analogous to the Hawthorne phenomenon, which can lead them to optimize for test-passing performance or to comply more readily with harmful prompts if real-world consequences appear absent. We present the first quantitative study of how such "test awareness" impact…

    Submitted 26 May, 2025; v1 submitted 20 May, 2025; originally announced May 2025.

  4. arXiv:2502.19649  [pdf, other]

    cs.LG cs.CL

    Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models

    Authors: Jan Wehner, Sahar Abdelnabi, Daniel Tan, David Krueger, Mario Fritz

    Abstract: Representation Engineering (RepE) is a novel paradigm for controlling the behavior of LLMs. Unlike traditional approaches that modify inputs or fine-tune the model, RepE directly manipulates the model's internal representations. As a result, it may offer more effective, interpretable, data-efficient, and flexible control over models' behavior. We present the first comprehensive survey of RepE for…

    Submitted 12 March, 2025; v1 submitted 26 February, 2025; originally announced February 2025.

  5. arXiv:2502.04512  [pdf, other]

    cs.AI

    Safety is Essential for Responsible Open-Ended Systems

    Authors: Ivaxi Sheth, Jan Wehner, Sahar Abdelnabi, Ruta Binkyte, Mario Fritz

    Abstract: AI advancements have been significantly driven by a combination of foundation models and curiosity-driven learning aimed at increasing capability and adaptability. A growing area of interest within this field is Open-Endedness - the ability of AI systems to continuously and autonomously generate novel and diverse artifacts or solutions. This has become relevant for accelerating scientific discover…

    Submitted 10 February, 2025; v1 submitted 6 February, 2025; originally announced February 2025.

    Comments: 12 pages

  6. arXiv:2502.01822  [pdf, other]

    cs.CR cs.CY

    Firewalls to Secure Dynamic LLM Agentic Networks

    Authors: Sahar Abdelnabi, Amr Gomaa, Eugene Bagdasarian, Per Ola Kristensson, Reza Shokri

    Abstract: LLM agents will likely communicate on behalf of users with other entity-representing agents on tasks involving long-horizon plans with interdependent goals. Current work neglects these agentic networks and their challenges. We identify required properties for agent communication: proactivity, adaptability, privacy (sharing only task-necessary information), and security (preserving integrity and ut…

    Submitted 26 May, 2025; v1 submitted 3 February, 2025; originally announced February 2025.

  7. arXiv:2409.02604  [pdf, other]

    cs.LG stat.ME

    Hypothesizing Missing Causal Variables with LLMs

    Authors: Ivaxi Sheth, Sahar Abdelnabi, Mario Fritz

    Abstract: Scientific discovery is a catalyst for human intellectual advances, driven by the cycle of hypothesis generation, experimental design, data evaluation, and iterative assumption refinement. This process, while crucial, is expensive and heavily dependent on the domain knowledge of scientists to generate hypotheses and navigate the scientific cycle. Central to this is causality, the ability to establ…

    Submitted 4 September, 2024; originally announced September 2024.

    Comments: Code - https://github.com/ivaxi0s/hypothesizing-causal-variable-llm

  8. arXiv:2406.07954  [pdf, other]

    cs.CR cs.AI

    Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition

    Authors: Edoardo Debenedetti, Javier Rando, Daniel Paleka, Silaghi Fineas Florin, Dragos Albastroiu, Niv Cohen, Yuval Lemberg, Reshmi Ghosh, Rui Wen, Ahmed Salem, Giovanni Cherubin, Santiago Zanella-Beguelin, Robin Schmid, Victor Klemm, Takahiro Miki, Chenhao Li, Stefan Kraft, Mario Fritz, Florian Tramèr, Sahar Abdelnabi, Lea Schönherr

    Abstract: Large language model systems face important security risks from maliciously crafted messages that aim to overwrite the system's original instructions or leak private data. To study this problem, we organized a capture-the-flag competition at IEEE SaTML 2024, where the flag is a secret string in the LLM system prompt. The competition was organized in two phases. In the first phase, teams developed…

    Submitted 12 June, 2024; originally announced June 2024.

  9. arXiv:2406.00799  [pdf, other]

    cs.CR cs.CL cs.CY

    Get my drift? Catching LLM Task Drift with Activation Deltas

    Authors: Sahar Abdelnabi, Aideen Fay, Giovanni Cherubin, Ahmed Salem, Mario Fritz, Andrew Paverd

    Abstract: LLMs are commonly used in retrieval-augmented applications to execute user instructions based on data from external sources. For example, modern search engines use LLMs to answer queries based on relevant search results; email plugins summarize emails by processing their content through an LLM. However, the potentially untrusted provenance of these data sources can lead to prompt injection attacks…

    Submitted 6 March, 2025; v1 submitted 2 June, 2024; originally announced June 2024.

    Comments: SaTML 2025

  10. arXiv:2403.06833  [pdf, other]

    cs.LG cs.CL

    Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?

    Authors: Egor Zverev, Sahar Abdelnabi, Soroush Tabesh, Mario Fritz, Christoph H. Lampert

    Abstract: Instruction-tuned Large Language Models (LLMs) show impressive results in numerous practical applications, but they lack essential safety features that are common in other areas of computer science, particularly an explicit separation of instructions and data. This makes them vulnerable to manipulations such as indirect prompt injections and generally unsuitable for safety-critical tasks. Surprisi…

    Submitted 31 January, 2025; v1 submitted 11 March, 2024; originally announced March 2024.

    Comments: Published as a conference paper at ICLR 2025, GitHub: https://github.com/egozverev/Shold-It-Be-Executed-Or-Processed. 10 pages main text, 30 pages in total

  11. arXiv:2402.11005  [pdf, other]

    cs.CL cs.AI

    A Theory of LLM Sampling: Part Descriptive and Part Prescriptive

    Authors: Sarath Sivaprasad, Pramod Kaushik, Sahar Abdelnabi, Mario Fritz

    Abstract: Large Language Models (LLMs) are increasingly utilized in autonomous decision-making, where they sample options from vast action spaces. However, the heuristics that guide this sampling process remain under-explored. We study this sampling behavior and show that these underlying heuristics resemble those of human decision-making: comprising a descriptive component (reflecting statistical norm) and…

    Submitted 18 April, 2025; v1 submitted 16 February, 2024; originally announced February 2024.

  12. arXiv:2309.17234  [pdf, other]

    cs.CL cs.CY cs.LG

    Cooperation, Competition, and Maliciousness: LLM-Stakeholders Interactive Negotiation

    Authors: Sahar Abdelnabi, Amr Gomaa, Sarath Sivaprasad, Lea Schönherr, Mario Fritz

    Abstract: There is a growing interest in using Large Language Models (LLMs) in multi-agent systems to tackle interactive real-world tasks that require effective collaboration and assessing complex situations. Yet, we still have a limited understanding of LLMs' communication and decision-making abilities in multi-agent setups. The fundamental task of negotiation spans many key features of communication, suc…

    Submitted 10 June, 2024; v1 submitted 29 September, 2023; originally announced September 2023.

    Comments: Updated version with major additions (new experiments, evaluation, and attacks)

  13. From Attachments to SEO: Click Here to Learn More about Clickbait PDFs!

    Authors: Giada Stivala, Sahar Abdelnabi, Andrea Mengascini, Mariano Graziano, Mario Fritz, Giancarlo Pellegrino

    Abstract: Clickbait PDFs are PDF documents that do not embed malware but trick victims into visiting malicious web pages leading to attacks like password theft or drive-by download. While recent reports indicate a surge of clickbait PDFs, prior works have largely neglected this new threat, considering PDFs only as accessories of email phishing campaigns. This paper investigates the landscape of clickbait…

    Submitted 22 December, 2023; v1 submitted 2 August, 2023; originally announced August 2023.

    Comments: Corrected symbols in Table 1

  14. arXiv:2306.04883  [pdf]

    cs.CR

    From Bad to Worse: Using Private Data to Propagate Disinformation on Online Platforms with a Greater Efficiency

    Authors: Protik Bose Pranto, Waqar Hassan Khan, Sahar Abdelnabi, Rebecca Weil, Mario Fritz, Rakibul Hasan

    Abstract: We outline a planned experiment to investigate if personal data (e.g., demographics and behavioral patterns) can be used to selectively expose individuals to disinformation such that an adversary can spread disinformation more efficiently compared to broadcasting the same information to everyone. This mechanism, if effective, will have devastating consequences as modern technologies collect and in…

    Submitted 7 June, 2023; originally announced June 2023.

  15. arXiv:2302.12173  [pdf, other]

    cs.CR cs.AI cs.CL cs.CY

    Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

    Authors: Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, Mario Fritz

    Abstract: Large Language Models (LLMs) are increasingly being integrated into various applications. The functionalities of recent LLMs can be flexibly modulated via natural language prompts. This renders them susceptible to targeted adversarial prompting, e.g., Prompt Injection (PI) attacks enable attackers to override original instructions and employed controls. So far, it was assumed that the user is dire…

    Submitted 5 May, 2023; v1 submitted 23 February, 2023; originally announced February 2023.

  16. arXiv:2209.03755  [pdf, other]

    cs.CR cs.CL cs.CY cs.LG

    Fact-Saboteurs: A Taxonomy of Evidence Manipulation Attacks against Fact-Verification Systems

    Authors: Sahar Abdelnabi, Mario Fritz

    Abstract: Mis- and disinformation are a substantial global threat to our security and safety. To cope with the scale of online misinformation, researchers have been working on automating fact-checking by retrieving and verifying against relevant evidence. However, despite many advances, a comprehensive evaluation of the possible attack vectors against such systems is still lacking. Particularly, the automat…

    Submitted 16 June, 2023; v1 submitted 7 September, 2022; originally announced September 2022.

  17. arXiv:2112.00061  [pdf, other]

    cs.CV cs.CL cs.CY cs.LG

    Open-Domain, Content-based, Multi-modal Fact-checking of Out-of-Context Images via Online Resources

    Authors: Sahar Abdelnabi, Rakibul Hasan, Mario Fritz

    Abstract: Misinformation is now a major problem due to its potential high risks to our core democratic and societal values and orders. Out-of-context misinformation is one of the easiest and effective ways used by adversaries to spread viral false stories. In this threat, a real image is re-purposed to support other narratives by misrepresenting its context and/or elements. The internet is being used as the…

    Submitted 20 March, 2022; v1 submitted 30 November, 2021; originally announced December 2021.

    Comments: CVPR'22

  18. arXiv:2102.05104  [pdf, other]

    cs.LG cs.CR cs.CV

    "What's in the box?!": Deflecting Adversarial Attacks by Randomly Deploying Adversarially-Disjoint Models

    Authors: Sahar Abdelnabi, Mario Fritz

    Abstract: Machine learning models are now widely deployed in real-world applications. However, the existence of adversarial examples has been long considered a real threat to such models. While numerous defenses aiming to improve the robustness have been proposed, many have been shown ineffective. As these vulnerabilities are still nowhere near being eliminated, we propose an alternative deployment-based de…

    Submitted 9 March, 2021; v1 submitted 9 February, 2021; originally announced February 2021.

  19. arXiv:2009.03015  [pdf, other]

    cs.CR cs.CL cs.CY cs.LG

    Adversarial Watermarking Transformer: Towards Tracing Text Provenance with Data Hiding

    Authors: Sahar Abdelnabi, Mario Fritz

    Abstract: Recent advances in natural language generation have introduced powerful language models with high-quality output text. However, this raises concerns about the potential misuse of such models for malicious purposes. In this paper, we study natural language watermarking as a defense to help better mark and trace the provenance of text. We introduce the Adversarial Watermarking Transformer (AWT) with…

    Submitted 29 March, 2021; v1 submitted 7 September, 2020; originally announced September 2020.

    ACM Class: I.2.7

  20. arXiv:2007.08457  [pdf, other]

    cs.CR cs.CV cs.CY cs.GR cs.LG

    Artificial Fingerprinting for Generative Models: Rooting Deepfake Attribution in Training Data

    Authors: Ning Yu, Vladislav Skripniuk, Sahar Abdelnabi, Mario Fritz

    Abstract: Photorealistic image generation has reached a new level of quality due to the breakthroughs of generative adversarial networks (GANs). Yet, the dark side of such deepfakes, the malicious use of generated media, raises concerns about visual misinformation. While existing research work on deepfake detection demonstrates high accuracy, it is subject to advances in generation techniques and adversaria…

    Submitted 17 March, 2022; v1 submitted 16 July, 2020; originally announced July 2020.

    Comments: Accepted to ICCV'21 as Oral

  21. arXiv:1909.00300  [pdf, other]

    cs.CR cs.CV cs.LG

    VisualPhishNet: Zero-Day Phishing Website Detection by Visual Similarity

    Authors: Sahar Abdelnabi, Katharina Krombholz, Mario Fritz

    Abstract: Phishing websites are still a major threat in today's Internet ecosystem. Despite numerous previous efforts, similarity-based detection methods do not offer sufficient protection for the trusted websites - in particular against unseen phishing pages. This paper contributes VisualPhishNet, a new similarity-based phishing detection framework, based on a triplet Convolutional Neural Network (CNN). Vi…

    Submitted 5 July, 2020; v1 submitted 31 August, 2019; originally announced September 2019.