CMLFormer: A Dual Decoder Transformer with Switching Point Learning for Code-Mixed Language Modeling

Baral, Aditeya; Ajith, Allen George; Nayak, Roshan; Bhanja, Mrityunjay Abhijeet

arXiv:2505.12587 (cs)

[Submitted on 19 May 2025]

Abstract: Code-mixed languages, characterized by frequent within-sentence language transitions, present structural challenges that standard language models fail to address. In this work, we propose CMLFormer, an enhanced multi-layer dual-decoder Transformer with a shared encoder and synchronized decoder cross-attention, designed to model the linguistic and semantic dynamics of code-mixed text. CMLFormer is pre-trained on an augmented Hinglish corpus annotated with switching points and translations, using multiple new objectives specifically aimed at capturing switching behavior, cross-lingual structure, and code-mixing complexity. Our experiments show that CMLFormer improves F1 score, precision, and accuracy over other approaches on the HASOC-2021 benchmark under select pre-training setups. Attention analyses further show that it can identify and attend to switching points, validating its sensitivity to code-mixed structure. These results demonstrate the effectiveness of CMLFormer's architecture and multi-task pre-training strategy for modeling code-mixed languages.
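As a rough illustration of what a "switching point" can mean in code-mixed text, the sketch below marks every token position where the language differs from that of the preceding token. The abstract does not specify the paper's annotation format, so the per-token language tags (`"hi"`, `"en"`), the example sentence, and the function name `switching_points` are all hypothetical and are not taken from CMLFormer itself.

```python
from typing import List


def switching_points(lang_tags: List[str]) -> List[int]:
    """Return each index i where token i has a different language tag than token i-1.

    Assumption: one language tag per token; this tagging scheme is
    illustrative only and is not the paper's annotation format.
    """
    return [
        i
        for i in range(1, len(lang_tags))
        if lang_tags[i] != lang_tags[i - 1]
    ]


# Hypothetical Hinglish sentence with hand-assigned language tags.
tokens = ["mujhe", "yeh", "movie", "really", "pasand", "aayi"]
tags = ["hi", "hi", "en", "en", "hi", "hi"]

for i in switching_points(tags):
    print(f"switch at position {i}: {tokens[i - 1]!r} -> {tokens[i]!r}")
```

In this toy sentence the language changes twice, once entering the English span and once leaving it. Labels of this kind are one plausible form for a per-token supervision signal, though the actual pre-training objectives are described only in the paper.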

Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2505.12587 [cs.CL]
	(or arXiv:2505.12587v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2505.12587

Submission history

From: Aditeya Baral
[v1] Mon, 19 May 2025 00:50:49 UTC (12,328 KB)
