MoSE: Hierarchical Self-Distillation Enhances Early Layer Embeddings

Gurioli, Andrea; Pennino, Federico; Monteiro, João; Gabbrielli, Maurizio

arXiv:2503.03008 (cs)

[Submitted on 4 Mar 2025 (v1), last revised 19 May 2025 (this version, v2)]

Abstract: Deploying language models often requires navigating accuracy vs. performance trade-offs to meet latency constraints while preserving utility. Traditional model distillation reduces size but incurs substantial costs through training separate models. We introduce ModularStarEncoder (MoSE), a 1-billion-parameter multi-exit encoder for code retrieval and classification that employs a novel Self-Distillation mechanism. This approach significantly enhances lower-layer representations, enabling flexible deployment of different model portions with favorable performance trade-offs. Our architecture improves text-to-code and code-to-code search by targeting specific encoder layers as exit heads, where higher layers guide earlier ones during training, improving intermediate representations at minimal additional cost. We further enhance MoSE with a repository-level contextual loss that maximizes training context window utilization. Additionally, we release a new dataset created through code translation that extends text-to-code benchmarks with cross-language code-to-code pairs. Evaluations demonstrate the effectiveness of Self-Distillation as a principled approach to trading inference cost for accuracy across various code understanding tasks.
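
The multi-exit idea in the abstract can be illustrated with a minimal sketch. It assumes (this is not stated in the abstract) that each exit layer receives its own contrastive InfoNCE loss over paired query/code embeddings and that the training objective is a weighted sum of those per-layer losses over a shared encoder. The function names `info_nce` and `multi_exit_loss`, the exit layers 4 and 36, the weights, and the temperature are all hypothetical choices for illustration, not the paper's actual configuration.

```python
import math


def info_nce(queries, keys, temperature=0.1):
    """Mean InfoNCE loss: queries[i] is paired with keys[i]; other keys act as negatives."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i against every key, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over the logits.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(q)


def multi_exit_loss(exit_embeddings, weights, temperature=0.1):
    """Weighted sum of per-exit losses (assumed objective, not taken from the paper).

    exit_embeddings maps an exit layer index to a (queries, keys) pair of
    embedding lists pooled at that layer; weights maps the same indices to floats.
    """
    per_layer = {
        layer: info_nce(q, k, temperature)
        for layer, (q, k) in exit_embeddings.items()
    }
    total = sum(weights[layer] * loss for layer, loss in per_layer.items())
    return total, per_layer


# Toy usage with hypothetical exit layers 4 and 36 and 2-dimensional embeddings.
aligned = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = ([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
total, per_layer = multi_exit_loss({4: swapped, 36: aligned}, {4: 0.25, 36: 1.0})
print(per_layer, total)
```

Because every exit shares the same lower layers in such a setup, gradients from the deeper exits also flow through the earlier ones, which is one plausible reading of "higher layers guide earlier ones". At inference time, a deployment under a tight latency budget would run the encoder only up to a chosen exit layer and use that layer's embedding.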

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE)
Cite as:	arXiv:2503.03008 [cs.CL]
	(or arXiv:2503.03008v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2503.03008

Submission history

From: Andrea Gurioli
[v1] Tue, 4 Mar 2025 21:08:17 UTC (1,413 KB)
[v2] Mon, 19 May 2025 13:39:47 UTC (1,591 KB)
