In-Training Defenses against Emergent Misalignment in Language Models

Kaczér, David; Jørgenvåg, Magnus; Vetter, Clemens; Flek, Lucie; Mai, Florian

arXiv:2508.06249 (cs)

[Submitted on 8 Aug 2025]

Abstract: Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EMA): even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. Even when model weights are hidden behind a fine-tuning API, this gives attackers inadvertent access to a broadly misaligned model in a way that can be hard to detect from the fine-tuning data alone. We present the first systematic study of in-training safeguards against EMA that are practical for providers who expose fine-tuning via an API. We investigate four training regularization interventions: (i) KL-divergence regularization toward a safe reference model, (ii) $\ell_2$ distance in feature space, (iii) projection onto a safe subspace (SafeLoRA), and (iv) interleaving a small number of safe training examples from a general instruct-tuning dataset. First, we evaluate each method's effect on emergent misalignment across four malicious, EMA-inducing tasks. Second, we assess each method's impact on benign tasks. We conclude with a discussion of open questions in emergent misalignment research.
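
To make intervention (i) concrete, the following is a minimal sketch of a KL-regularized fine-tuning loss for a single next-token prediction. It is not the authors' implementation: the function names, the `kl_weight` value, and the direction of the divergence (fine-tuned model relative to the reference model) are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(log_p, log_q):
    """KL(p || q) computed from log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def kl_regularized_loss(tuned_logits, reference_logits, target_index, kl_weight=0.1):
    """Task cross-entropy plus a penalty for drifting from the reference model.

    Assumptions (not taken from the paper): the penalty is
    KL(tuned || reference), and kl_weight is a hypothetical hyperparameter.
    """
    log_tuned = log_softmax(tuned_logits)
    log_reference = log_softmax(reference_logits)
    task_loss = -log_tuned[target_index]  # cross-entropy on the fine-tuning target
    penalty = kl_divergence(log_tuned, log_reference)
    return task_loss + kl_weight * penalty
```

When the fine-tuned and reference logits agree, the penalty vanishes and the loss reduces to ordinary cross-entropy; the further the fine-tuned distribution moves from the reference, the larger the added term becomes.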

Comments:	Under review
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2508.06249 [cs.LG]
	(or arXiv:2508.06249v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2508.06249

Submission history

From: David Kaczér
[v1] Fri, 8 Aug 2025 12:10:28 UTC (959 KB)
