-
Fun-tuning: Characterizing the Vulnerability of Proprietary LLMs to Optimization-based Prompt Injection Attacks via the Fine-Tuning Interface
Authors:
Andrey Labunets,
Nishit V. Pandya,
Ashish Hooda,
Xiaohan Fu,
Earlence Fernandes
Abstract:
We surface a new threat to closed-weight Large Language Models (LLMs) that enables an attacker to compute optimization-based prompt injections. Specifically, we characterize how an attacker can leverage the loss-like information returned from the remote fine-tuning interface to guide the search for adversarial prompts. The fine-tuning interface is hosted by an LLM vendor and allows developers to fine-tune LLMs for their tasks, thus providing utility, but also exposes enough information for an attacker to compute adversarial prompts. Through an experimental analysis, we characterize the loss-like values returned by the Gemini fine-tuning API and demonstrate that they provide a useful signal for discrete optimization of adversarial prompts using a greedy search algorithm. Using the PurpleLlama prompt injection benchmark, we demonstrate attack success rates between 65% and 82% on Google's Gemini family of LLMs. These attacks exploit the classic utility-security tradeoff: the fine-tuning interface provides a useful feature for developers but also exposes the LLMs to powerful attacks.
Submitted 9 May, 2025; v1 submitted 16 January, 2025;
originally announced January 2025.
-
An Empirical Analysis on the Use and Reporting of National Security Letters
Authors:
Alex Bellon,
Miro Haller,
Andrey Labunets,
Enze Liu,
Stefan Savage
Abstract:
Government investigatory and surveillance powers are important tools for examining crime and protecting public safety. However, since these tools must be employed in secret, it can be challenging to identify abuses or changes in use that could be of significant public interest. In this paper, we evaluate this phenomenon in the context of National Security Letters (NSLs). NSLs are a form of legal process that empowers parts of the United States federal government to request certain pieces of information for national security purposes. After initial concerns about the lack of public oversight, Congress worked to increase transparency by requiring government agencies to publish aggregated statistics on NSL usage and by allowing the private sector to report information on NSLs in transparency reports. The implicit goal is that these transparency mechanisms should deter large-scale abuse by making it visible. We evaluate how well these mechanisms work by carefully analyzing the full range of publicly available data related to NSL use. Our findings suggest that they may not lead to the desired public scrutiny: the published information lacks structure and context, so collecting and parsing it requires significant manual effort. Moreover, we discovered mistakes (subsequently fixed after we reported them to the ODNI), which suggests a lack of active auditing. Taken together, our case study of NSLs provides insights and suggestions for the successful construction of transparency mechanisms that enable effective public auditing.
Submitted 1 February, 2025; v1 submitted 5 March, 2024;
originally announced March 2024.
-
Re-purposing Perceptual Hashing based Client Side Scanning for Physical Surveillance
Authors:
Ashish Hooda,
Andrey Labunets,
Tadayoshi Kohno,
Earlence Fernandes
Abstract:
Content scanning systems employ perceptual hashing algorithms to scan user content for illegal material, such as child pornography or terrorist recruitment flyers. Perceptual hashing algorithms help determine whether two images are visually similar while preserving the privacy of the input images. Several efforts from industry and academia propose to conduct content scanning on client devices such as smartphones due to the impending rollout of end-to-end encryption that will make server-side content scanning difficult. However, these proposals have met with strong criticism because of the potential for the technology to be misused and re-purposed. Our work informs this conversation by experimentally characterizing the potential for one type of misuse: attackers manipulating the content scanning system to perform physical surveillance on target locations. Our contributions are threefold: (1) we offer a definition of physical surveillance in the context of client-side image scanning systems; (2) we experimentally characterize this risk and create a surveillance algorithm that achieves physical surveillance rates of >40% by poisoning 5% of the perceptual hash database; (3) we experimentally study the trade-off between the robustness of client-side image scanning systems and surveillance, showing that more robust detection of illegal material leads to increased potential for physical surveillance.
Submitted 8 December, 2022;
originally announced December 2022.
-
Overview of Abusive and Threatening Language Detection in Urdu at FIRE 2021
Authors:
Maaz Amjad,
Alisa Zhila,
Grigori Sidorov,
Andrey Labunets,
Sabur Butta,
Hamza Imam Amjad,
Oxana Vitman,
Alexander Gelbukh
Abstract:
With the growth of social media platform influence, the effect of their misuse becomes more and more impactful. The importance of automatic detection of threatening and abusive language cannot be overestimated. However, most of the existing studies and state-of-the-art methods focus on English as the target language, with limited work on low- and medium-resource languages. In this paper, we present two shared tasks of abusive and threatening language detection for the Urdu language, which has more than 170 million speakers worldwide. Both are posed as binary classification tasks where participating systems are required to classify tweets in Urdu into two classes, namely: (i) Abusive and Non-Abusive for the first task, and (ii) Threatening and Non-Threatening for the second. We present two manually annotated datasets containing tweets labelled as (i) Abusive and Non-Abusive, and (ii) Threatening and Non-Threatening. The abusive dataset contains 2400 annotated tweets in the train part and 1100 annotated tweets in the test part. The threatening dataset contains 6000 annotated tweets in the train part and 3950 annotated tweets in the test part. We also provide logistic regression and BERT-based baseline classifiers for both tasks. In this shared task, 21 teams from six countries registered for participation (India, Pakistan, China, Malaysia, United Arab Emirates, and Taiwan). Of these, 10 teams submitted runs for Subtask A (Abusive Language Detection), 9 teams submitted runs for Subtask B (Threatening Language Detection), and 7 teams submitted technical reports. The best performing system achieved an F1-score of 0.880 for Subtask A and 0.545 for Subtask B. For both subtasks, an m-BERT-based transformer model showed the best performance.
Submitted 14 July, 2022;
originally announced July 2022.