The Hidden Space of Safety: Understanding Preference-Tuned LLMs in Multilingual context

Verma, Nikhil; Bharadwaj, Manasa

arXiv:2504.02708 (cs)

[Submitted on 3 Apr 2025]

Abstract: Alignment tuning has enabled large language models to excel in reasoning, instruction-following, and minimizing harmful generations. However, despite their widespread deployment, these models exhibit a monolingual bias, raising concerns about the effectiveness of alignment across languages. Current alignment methods predominantly focus on English, leaving it unclear how alignment mechanisms generalize to multilingual settings. To address this, we conduct a systematic analysis of distributional shifts in the embedding space of LLMs before and after alignment, uncovering the impact of alignment on model behavior across diverse languages. We leverage the alignment-induced separation in safety space as a quantitative tool to measure how alignment enforces safety constraints. Our study evaluates seven LLMs using balanced toxicity datasets and parallel text-detoxification benchmarks, revealing substantial disparities in the latent representation space between high-resource and low-resource languages. These findings underscore the need for language-specific fine-tuning to ensure fair, reliable, and robust multilingual alignment. Our insights provide a foundation for developing truly safe multilingual LLMs, emphasizing the urgency of addressing alignment gaps in underrepresented languages.
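
To make the idea of "alignment-induced separation" concrete, the following is a minimal sketch of one way such a separation could be quantified. It is not the authors' method: the abstract does not state which metric the paper uses. The function names (`separation_score`, `synthetic_embeddings`), the centroid-distance ratio, and the synthetic Gaussian vectors standing in for hidden states are all assumptions made purely for illustration.

```python
import math
import random


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def separation_score(toxic, benign):
    """Hypothetical metric: distance between the two class centroids
    divided by the mean distance of every point to its own centroid.
    Larger values mean the two classes are further apart relative to
    their spread."""
    c_toxic, c_benign = centroid(toxic), centroid(benign)
    between = euclidean(c_toxic, c_benign)
    within = (
        sum(euclidean(v, c_toxic) for v in toxic)
        + sum(euclidean(v, c_benign) for v in benign)
    ) / (len(toxic) + len(benign))
    return between / within if within > 0 else float("inf")


def synthetic_embeddings(n, dim, center, spread, rng):
    """Assumed stand-in for model hidden states: n Gaussian vectors."""
    return [[rng.gauss(center, spread) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)

# Illustrative data only: before alignment the two classes overlap,
# after alignment they are pushed apart.
base_toxic = synthetic_embeddings(50, 8, 0.0, 1.0, rng)
base_benign = synthetic_embeddings(50, 8, 0.2, 1.0, rng)
aligned_toxic = synthetic_embeddings(50, 8, -2.0, 1.0, rng)
aligned_benign = synthetic_embeddings(50, 8, 2.0, 1.0, rng)

base_score = separation_score(base_toxic, base_benign)
aligned_score = separation_score(aligned_toxic, aligned_benign)
print(f"base: {base_score:.2f}  aligned: {aligned_score:.2f}")
```

In a real analysis, the synthetic vectors would be replaced by embeddings extracted from a base model and its preference-tuned counterpart for toxic and non-toxic inputs in each language, and the score would be compared language by language.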

Comments:	14 pages, 11 Figures, 2 Tables, currently under review at ACL 2025
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2504.02708 [cs.CL]
	(or arXiv:2504.02708v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2504.02708

Submission history

From: Nikhil Verma
[v1] Thu, 3 Apr 2025 15:46:46 UTC (13,945 KB)
