Showing 1–2 of 2 results for author: Shlomi, T

Search v0.5.6 released 2020-02-24

arXiv:2305.18456 [pdf, other]

cs.LG cs.AI cs.CR cs.CY

Baselines for Identifying Watermarked Large Language Models

Authors: Leonard Tang, Gavin Uberti, Tom Shlomi

Abstract: We consider the emerging problem of identifying the presence and use of watermarking schemes in widely used, publicly hosted, closed source large language models (LLMs). We introduce a suite of baseline algorithms for identifying watermarks in LLMs that rely on analyzing distributions of output tokens and logits generated by watermarked and unmarked LLMs. Notably, watermarked LLMs tend to produce distributions that diverge qualitatively and identifiably from standard models. Furthermore, we investigate the identifiability of watermarks at varying strengths and consider the tradeoffs of each of our identification mechanisms with respect to watermarking scenario. Along the way, we formalize the specific problem of identifying watermarks in LLMs, as well as LLM watermarks and watermark detection in general, providing a framework and foundations for studying them.

Submitted 29 May, 2023; originally announced May 2023.

arXiv:2303.05593 [pdf, other]

cs.LG cs.AI cs.CR

Learning the Wrong Lessons: Inserting Trojans During Knowledge Distillation

Authors: Leonard Tang, Tom Shlomi, Alexander Cai

Abstract: In recent years, knowledge distillation has become a cornerstone of efficiently deployed machine learning, with labs and industries using knowledge distillation to train models that are inexpensive and resource-optimized. Trojan attacks have contemporaneously gained significant prominence, revealing fundamental vulnerabilities in deep learning models. Given the widespread use of knowledge distillation, in this work we seek to exploit the unlabelled data knowledge distillation process to embed Trojans in a student model without introducing conspicuous behavior in the teacher. We ultimately devise a Trojan attack that effectively reduces student accuracy, does not alter teacher performance, and is efficiently constructible in practice.

Submitted 9 March, 2023; originally announced March 2023.

Comments: ICLR 2023 Workshop on Backdoor Attacks and Defenses in Machine Learning
