PALO: A Polyglot Large Multimodal Model for 5B People

Maaz, Muhammad; Rasheed, Hanoona; Shaker, Abdelrahman; Khan, Salman; Cholakal, Hisham; Anwer, Rao M.; Baldwin, Tim; Felsberg, Michael; Khan, Fahad S.

arXiv:2402.14818 (cs)

[Submitted on 22 Feb 2024 (v1), last revised 5 Mar 2024 (this version, v2)]

Abstract: In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called PALO. PALO offers visual reasoning capabilities in 10 major languages (English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese) that together span ~5B people (65% of the world population). We use a semi-automated translation approach to adapt the multimodal instruction dataset from English to the target languages with a fine-tuned Large Language Model, which ensures high linguistic fidelity while remaining scalable because little manual effort is required. Incorporating diverse instruction sets boosts overall performance across multiple languages, especially underrepresented ones such as Hindi, Arabic, Bengali, and Urdu. The resulting models are trained at three scales (1.7B, 7B, and 13B parameters) to demonstrate generalization and scalability, and we observe substantial improvements over strong baselines. We also propose the first multilingual multimodal benchmark on which forthcoming approaches can evaluate their vision-language reasoning capabilities across languages. Code: this https URL.

Comments:	Technical Report of PALO
Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2402.14818 [cs.CL]
	(or arXiv:2402.14818v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2402.14818

Submission history

From: Muhammad Maaz
[v1] Thu, 22 Feb 2024 18:59:58 UTC (25,805 KB)
[v2] Tue, 5 Mar 2024 11:22:07 UTC (25,805 KB)