-
Llama-3-Nanda-10B-Chat: An Open Generative Large Language Model for Hindi
Authors:
Monojit Choudhury,
Shivam Chauhan,
Rocktim Jyoti Das,
Dhruv Sahnan,
Xudong Han,
Haonan Li,
Aaryamonvikram Singh,
Alok Anil Jadhav,
Utkarsh Agarwal,
Mukund Choudhary,
Debopriyo Banerjee,
Fajri Koto,
Junaid Bhat,
Awantika Shukla,
Samujjwal Ghosh,
Samta Kamboj,
Onkar Pandit,
Lalit Pradhan,
Rahul Pal,
Sunil Sahu,
Soundar Doraiswamy,
Parvez Mullah,
Ali El Filali,
Neha Sengupta,
Gokul Ramakrishnan, et al. (5 additional authors not shown)
Abstract:
Developing high-quality large language models (LLMs) for moderately resourced languages presents unique challenges in data availability, model adaptation, and evaluation. We introduce Llama-3-Nanda-10B-Chat, or Nanda for short, a state-of-the-art Hindi-centric instruction-tuned generative LLM, designed to push the boundaries of open-source Hindi language models. Built upon Llama-3-8B, Nanda incorporates continuous pre-training with expanded transformer blocks, leveraging the Llama Pro methodology. A key challenge was the limited availability of high-quality Hindi text data; we addressed this through rigorous data curation, augmentation, and strategic bilingual training, balancing Hindi and English corpora to optimize cross-linguistic knowledge transfer. With 10 billion parameters, Nanda stands among the top-performing open-source Hindi and multilingual models of similar scale, demonstrating significant advantages over many existing models. We provide an in-depth discussion of training strategies, fine-tuning techniques, safety alignment, and evaluation metrics, demonstrating how these approaches enabled Nanda to achieve state-of-the-art results. By open-sourcing Nanda, we aim to advance research in Hindi LLMs and support a wide range of real-world applications across academia, industry, and public services.
Submitted 8 April, 2025;
originally announced April 2025.
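The block-expansion idea named in the abstract (Llama Pro) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: `ToyBlock` is a scalar stand-in for a transformer block, and the choices of 32 original blocks and a group size of 4 are hypothetical rather than taken from the paper. The sketch shows only the core mechanism: a copied block whose output projection is zeroed acts as an identity map, so the expanded model initially computes the same function as the original, and only the new blocks are marked trainable.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class ToyBlock:
    """Toy residual block: y = x + out_scale * (inner_scale * x)."""
    inner_scale: float        # stands in for the attention/MLP weights
    out_scale: float = 1.0    # stands in for the output projection
    trainable: bool = False   # original blocks stay frozen in this sketch

    def __call__(self, x: float) -> float:
        return x + self.out_scale * (self.inner_scale * x)


def expand_blocks(blocks: List[ToyBlock], group_size: int) -> List[ToyBlock]:
    """After each group of `group_size` blocks, append a copy of the group's
    last block with a zeroed output projection (an identity map at init)."""
    if len(blocks) % group_size != 0:
        raise ValueError("number of blocks must be divisible by group_size")
    expanded: List[ToyBlock] = []
    for start in range(0, len(blocks), group_size):
        group = blocks[start:start + group_size]
        expanded.extend(group)
        expanded.append(replace(group[-1], out_scale=0.0, trainable=True))
    return expanded


def forward(blocks: List[ToyBlock], x: float) -> float:
    """Apply the blocks in order."""
    for block in blocks:
        x = block(x)
    return x


# Hypothetical sizes: 32 original blocks, one new block per group of 4.
original = [ToyBlock(inner_scale=0.01 * (i + 1)) for i in range(32)]
expanded = expand_blocks(original, group_size=4)
```

Because the added blocks start as identities, continued pre-training can begin from the base model's behaviour and adapt only the new capacity.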
-
Prot42: a Novel Family of Protein Language Models for Target-aware Protein Binder Generation
Authors:
Mohammad Amaan Sayeed,
Engin Tekin,
Maryam Nadeem,
Nancy A. ElNaker,
Aahan Singh,
Natalia Vassilieva,
Boulbaba Ben Amor
Abstract:
Unlocking the next generation of biotechnology and therapeutic innovation demands overcoming the inherent complexity and resource-intensity of conventional protein engineering methods. Recent GenAI-powered computational techniques often rely on the availability of the target protein's 3D structures and specific binding sites to generate high-affinity binders, constraints exhibited by models such as AlphaProteo and RFdiffusion. In this work, we explore the use of Protein Language Models (pLMs) for high-affinity binder generation. We introduce Prot42, a novel family of Protein Language Models (pLMs) pretrained on vast amounts of unlabeled protein sequences. By capturing deep evolutionary, structural, and functional insights through an advanced auto-regressive, decoder-only architecture inspired by breakthroughs in natural language processing, Prot42 dramatically expands the capabilities of computational protein design based on language only. Remarkably, our models handle sequences up to 8,192 amino acids, significantly surpassing standard limitations and enabling precise modeling of large proteins and complex multi-domain sequences. Demonstrating powerful practical applications, Prot42 excels in generating high-affinity protein binders and sequence-specific DNA-binding proteins. Our innovative models are publicly available, offering the scientific community an efficient and precise computational toolkit for rapid protein engineering.
Submitted 18 May, 2025; v1 submitted 6 April, 2025;
originally announced April 2025.
-
Gene42: Long-Range Genomic Foundation Model With Dense Attention
Authors:
Kirill Vishniakov,
Boulbaba Ben Amor,
Engin Tekin,
Nancy A. ElNaker,
Karthik Viswanathan,
Aleksandr Medvedev,
Aahan Singh,
Maryam Nadeem,
Mohammad Amaan Sayeed,
Praveenkumar Kanithi,
Tiago Magalhaes,
Natalia Vassilieva,
Dwarikanath Mahapatra,
Marco Pimentel,
and Shadab Khan
Abstract:
We introduce Gene42, a novel family of Genomic Foundation Models (GFMs) designed to manage context lengths of up to 192,000 base pairs (bp) at a single-nucleotide resolution. Gene42 models utilize a decoder-only (LLaMA-style) architecture with a dense self-attention mechanism. Initially trained on fixed-length sequences of 4,096 bp, our models underwent continuous pretraining to extend the context length to 192,000 bp. This iterative extension allowed for the comprehensive processing of large-scale genomic data and the capture of intricate patterns and dependencies within the human genome. Gene42 is the first dense attention model capable of handling such extensive long context lengths in genomics, challenging state-space models that often rely on convolutional operators among other mechanisms. Our pretrained models exhibit notably low perplexity values and high reconstruction accuracy, highlighting their strong ability to model genomic data. Extensive experiments on various genomic benchmarks have demonstrated state-of-the-art performance across multiple tasks, including biotype classification, regulatory region identification, chromatin profiling prediction, variant pathogenicity prediction, and species classification. The models are publicly available at huggingface.co/inceptionai.
Submitted 20 March, 2025;
originally announced March 2025.
-
Chem42: a Family of chemical Language Models for Target-aware Ligand Generation
Authors:
Aahan Singh,
Engin Tekin,
Maryam Nadeem,
Nancy A. ElNaker,
Mohammad Amaan Sayeed,
Natalia Vassilieva,
Boulbaba Ben Amor
Abstract:
Revolutionizing drug discovery demands more than just understanding molecular interactions - it requires generative models that can design novel ligands tailored to specific biological targets. While chemical Language Models (cLMs) have made strides in learning molecular properties, most fail to incorporate target-specific insights, restricting their ability to drive de-novo ligand generation. Chem42, a cutting-edge family of generative chemical Language Models, is designed to bridge this gap. By integrating atomic-level interactions with multimodal inputs from Prot42, a complementary protein Language Model, Chem42 achieves a sophisticated cross-modal representation of molecular structures, interactions, and binding patterns. This innovative framework enables the creation of structurally valid, synthetically accessible ligands with enhanced target specificity. Evaluations across diverse protein targets confirm that Chem42 surpasses existing approaches in chemical validity, target-aware design, and predicted binding affinity. By reducing the search space of viable drug candidates, Chem42 could accelerate the drug discovery pipeline, offering a powerful generative AI tool for precision medicine. Our Chem42 models set a new benchmark in molecule property prediction, conditional molecule generation, and target-aware ligand design. The models are publicly available at huggingface.co/inceptionai.
Submitted 11 June, 2025; v1 submitted 20 March, 2025;
originally announced March 2025.
-
Llama-3.1-Sherkala-8B-Chat: An Open Large Language Model for Kazakh
Authors:
Fajri Koto,
Rituraj Joshi,
Nurdaulet Mukhituly,
Yuxia Wang,
Zhuohan Xie,
Rahul Pal,
Daniil Orel,
Parvez Mullah,
Diana Turmakhan,
Maiya Goloburda,
Mohammed Kamran,
Samujjwal Ghosh,
Bokang Jia,
Jonibek Mansurov,
Mukhammed Togmanov,
Debopriyo Banerjee,
Nurkhan Laiyk,
Akhmed Sakip,
Xudong Han,
Ekaterina Kochmar,
Alham Fikri Aji,
Aaryamonvikram Singh,
Alok Anil Jadhav,
Satheesh Katipomu,
Samta Kamboj, et al. (10 additional authors not shown)
Abstract:
Llama-3.1-Sherkala-8B-Chat, or Sherkala-Chat (8B) for short, is a state-of-the-art instruction-tuned open generative large language model (LLM) designed for Kazakh. Sherkala-Chat (8B) aims to enhance the inclusivity of LLM advancements for Kazakh speakers. Adapted from the LLaMA-3.1-8B model, Sherkala-Chat (8B) is trained on 45.3B tokens across Kazakh, English, Russian, and Turkish. With 8 billion parameters, it demonstrates strong knowledge and reasoning abilities in Kazakh, significantly outperforming existing open Kazakh and multilingual models of similar scale while achieving competitive performance in English. We release Sherkala-Chat (8B) as an open-weight instruction-tuned model and provide a detailed overview of its training, fine-tuning, safety alignment, and evaluation, aiming to advance research and support diverse real-world applications.
Submitted 3 March, 2025;
originally announced March 2025.
-
Crystal: Illuminating LLM Abilities on Language and Code
Authors:
Tianhua Tao,
Junbo Li,
Bowen Tan,
Hongyi Wang,
William Marshall,
Bhargav M Kanakiya,
Joel Hestness,
Natalia Vassilieva,
Zhiqiang Shen,
Eric P. Xing,
Zhengzhong Liu
Abstract:
Large Language Models (LLMs) specializing in code generation (which are also often referred to as code LLMs), e.g., StarCoder and Code Llama, play increasingly critical roles in various software development scenarios. It is also crucial for code LLMs to possess both code generation and natural language abilities for many specific applications, such as code snippet retrieval using natural language or code explanations. The intricate interaction between acquiring language and coding skills complicates the development of strong code LLMs. Furthermore, there is a lack of thorough prior studies on the LLM pretraining strategy that mixes code and natural language. In this work, we propose a pretraining strategy to enhance the integration of natural language and coding capabilities within a single LLM. Specifically, it includes two phases of training with appropriately adjusted code/language ratios. The resulting model, Crystal, demonstrates remarkable capabilities in both domains. Specifically, it has natural language and coding performance comparable to that of Llama 2 and Code Llama, respectively. Crystal exhibits better data efficiency, using 1.4 trillion tokens compared to the more than 2 trillion tokens used by Llama 2 and Code Llama. We verify our pretraining strategy by analyzing the training process and observe consistent improvements in most benchmarks. We also adopted a typical application adaptation phase with a code-centric data mixture, only to find that it did not lead to enhanced performance or training efficiency, underlining the importance of a carefully designed data recipe. To foster research within the community, we commit to open-sourcing every detail of the pretraining, including our training datasets, code, logs, and 136 checkpoints throughout the training.
Submitted 6 November, 2024;
originally announced November 2024.
-
Bilingual Adaptation of Monolingual Foundation Models
Authors:
Gurpreet Gosal,
Yishi Xu,
Gokul Ramakrishnan,
Rituraj Joshi,
Avraham Sheinin,
Zhiming Chen,
Biswajit Mishra,
Natalia Vassilieva,
Joel Hestness,
Neha Sengupta,
Sunil Kumar Sahu,
Bokang Jia,
Onkar Pandit,
Satheesh Katipomu,
Samta Kamboj,
Samujjwal Ghosh,
Rahul Pal,
Parvez Mullah,
Soundar Doraiswamy,
Mohamed El Karim Chami,
Preslav Nakov
Abstract:
We present an efficient method for adapting a monolingual Large Language Model (LLM) to another language, addressing challenges of catastrophic forgetting and tokenizer limitations. We focus this study on adapting Llama 2 to Arabic. Our two-stage approach begins with expanding the vocabulary and training only the embeddings matrix, followed by full model continual pre-training on a bilingual corpus. By continually pre-training on a mix of Arabic and English corpora, the model retains its proficiency in English while acquiring capabilities in Arabic. Our approach results in significant improvements in Arabic and slight enhancements in English, demonstrating cost-effective cross-lingual transfer. We perform ablations on embedding initialization techniques, data mix ratios, and learning rates and release a detailed training recipe. To demonstrate generalizability of this approach we also adapted Llama 3 8B to Arabic and Llama 2 13B to Hindi.
Submitted 25 July, 2024; v1 submitted 13 July, 2024;
originally announced July 2024.
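The two-stage recipe described in the abstract (expand the vocabulary and train only the embeddings, then continue pre-training the full model) can be sketched on toy data. The initialization shown here, which sets each new token's row to the mean of the rows of the old-vocabulary pieces it was previously split into, is an assumption chosen for illustration; the abstract says several initialization techniques were ablated but does not name them. All function names and the `<new_token>` entry are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def expand_embeddings(
    embeddings: List[Vector],
    old_pieces_for_new_token: Dict[str, List[int]],
) -> Dict[str, int]:
    """Append one row per new token, initialized to the mean of the rows of
    the old-vocabulary pieces it used to be split into.
    Returns a mapping from new token to its row index."""
    new_ids: Dict[str, int] = {}
    for token, piece_ids in old_pieces_for_new_token.items():
        embeddings.append(mean_vector([embeddings[i] for i in piece_ids]))
        new_ids[token] = len(embeddings) - 1
    return new_ids


def trainable_parameters(stage: int) -> Dict[str, bool]:
    """Stage 1 updates only the embedding matrix; stage 2 updates everything."""
    return {"embeddings": True, "transformer_blocks": stage >= 2}


# Toy 4-token vocabulary with 2-dimensional embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
# Hypothetical new token that the old tokenizer split into pieces 0 and 3.
new_ids = expand_embeddings(embeddings, {"<new_token>": [0, 3]})
```

Freezing the transformer blocks in stage 1 lets the new rows settle into the existing representation space before the full model is updated on the bilingual corpus.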
-
Med42 -- Evaluating Fine-Tuning Strategies for Medical LLMs: Full-Parameter vs. Parameter-Efficient Approaches
Authors:
Clément Christophe,
Praveen K Kanithi,
Prateek Munjal,
Tathagata Raha,
Nasir Hayat,
Ronnie Rajan,
Ahmed Al-Mahrooqi,
Avani Gupta,
Muhammad Umar Salman,
Gurpreet Gosal,
Bhargav Kanakiya,
Charles Chen,
Natalia Vassilieva,
Boulbaba Ben Amor,
Marco AF Pimentel,
Shadab Khan
Abstract:
This study presents a comprehensive analysis and comparison of two predominant fine-tuning methodologies - full-parameter fine-tuning and parameter-efficient tuning - within the context of medical Large Language Models (LLMs). We developed and refined a series of LLMs, based on the Llama-2 architecture, specifically designed to enhance medical knowledge retrieval, reasoning, and question-answering capabilities. Our experiments systematically evaluate the effectiveness of these tuning strategies across various well-known medical benchmarks. Notably, our medical LLM Med42 showed an accuracy level of 72% on the US Medical Licensing Examination (USMLE) datasets, setting a new standard in performance for openly available medical LLMs. Through this comparative analysis, we aim to identify the most effective and efficient method for fine-tuning LLMs in the medical domain, thereby contributing significantly to the advancement of AI-driven healthcare applications.
Submitted 23 April, 2024;
originally announced April 2024.
-
BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model
Authors:
Nolan Dey,
Daria Soboleva,
Faisal Al-Khateeb,
Bowen Yang,
Ribhu Pathria,
Hemant Khachane,
Shaheer Muhammad,
Zhiming Chen,
Robert Myers,
Jacob Robert Steeves,
Natalia Vassilieva,
Marvin Tom,
Joel Hestness
Abstract:
We introduce the Bittensor Language Model, called "BTLM-3B-8K", a new state-of-the-art 3 billion parameter open-source language model. BTLM-3B-8K was trained on 627B tokens from the SlimPajama dataset with a mixture of 2,048 and 8,192 context lengths. BTLM-3B-8K outperforms all existing 3B parameter models by 2-5.5% across downstream tasks. BTLM-3B-8K is even competitive with some 7B parameter models. Additionally, BTLM-3B-8K provides excellent long context performance, outperforming MPT-7B-8K and XGen-7B-8K on tasks up to 8,192 context length. We trained the model on a cleaned and deduplicated SlimPajama dataset; aggressively tuned the μP hyperparameters and schedule; used ALiBi position embeddings; and adopted the SwiGLU nonlinearity.
On Hugging Face, the most popular models have 7B parameters, indicating that users prefer the quality-size ratio of 7B models. Compacting the 7B parameter model to one with 3B parameters, with little performance impact, is an important milestone. BTLM-3B-8K needs only 3GB of memory with 4-bit precision and takes 2.5x less inference compute than 7B models, helping to open up access to a powerful language model on mobile and edge devices. BTLM-3B-8K is available under an Apache 2.0 license on Hugging Face: https://huggingface.co/cerebras/btlm-3b-8k-base.
Submitted 20 September, 2023;
originally announced September 2023.
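The ALiBi position embeddings mentioned in the abstract replace learned or sinusoidal positions with a per-head linear penalty on attention logits. The following is a minimal sketch of the general technique, not of BTLM's implementation: the head count of 8 and the sequence length of 4 are illustrative assumptions, and the slope formula shown covers only power-of-two head counts.

```python
from typing import List


def alibi_slopes(num_heads: int) -> List[float]:
    """Per-head slopes 2^(-8*h/num_heads) for h = 1..num_heads."""
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch handles power-of-two head counts only")
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]


def alibi_bias(slope: float, seq_len: int) -> List[List[float]]:
    """Causal bias added to attention logits: -slope * (query_pos - key_pos).
    Future positions (key_pos > query_pos) are masked with -inf."""
    return [
        [-slope * (q - k) if k <= q else float("-inf") for k in range(seq_len)]
        for q in range(seq_len)
    ]


# Illustrative sizes only; the abstract does not state BTLM's head count.
slopes = alibi_slopes(8)
bias = alibi_bias(slopes[0], seq_len=4)
```

Since the bias depends only on the distance between query and key, the same function applies unchanged at any sequence length, which is relevant to training on a mixture of 2,048 and 8,192 context lengths.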
-
SlimPajama-DC: Understanding Data Combinations for LLM Training
Authors:
Zhiqiang Shen,
Tianhua Tao,
Liqun Ma,
Willie Neiswanger,
Zhengzhong Liu,
Hongyi Wang,
Bowen Tan,
Joel Hestness,
Natalia Vassilieva,
Daria Soboleva,
Eric Xing
Abstract:
This paper aims to understand the impacts of various data combinations (e.g., web text, Wikipedia, GitHub, books) on the pretraining of large language models using SlimPajama. SlimPajama is a rigorously deduplicated, multi-source dataset, which has been refined and further deduplicated to 627B tokens from the extensive 1.2T token RedPajama dataset contributed by Together. We term our research SlimPajama-DC, an empirical analysis designed to uncover fundamental characteristics and best practices associated with employing SlimPajama in the training of large language models. During our research with SlimPajama, two pivotal observations emerged: (1) Global deduplication vs. local deduplication. We analyze and discuss how global (across different dataset sources) and local (within a single dataset source) deduplications affect the performance of trained models. (2) Proportions of highly-deduplicated multi-source datasets in the combination. To study this, we construct six configurations on the SlimPajama dataset and train individual ones using the 1.3B Cerebras-GPT model with ALiBi and SwiGLU. Our best configuration outperforms the 1.3B model trained on RedPajama using the same number of training tokens by a significant margin. All our 1.3B models are trained on a Cerebras 16× CS-2 cluster with a total of 80 PFLOP/s in bf16 mixed precision. We further extend our discoveries (such as the finding that increasing data diversity is crucial after global deduplication) to a 7B model with large batch-size training. Our SlimPajama-DC models are available at: https://huggingface.co/MBZUAI-LLM/SlimPajama-DC and the separate SlimPajama-DC datasets are available at: https://huggingface.co/datasets/MBZUAI-LLM/SlimPajama-627B-DC.
Submitted 9 May, 2024; v1 submitted 19 September, 2023;
originally announced September 2023.
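The distinction between local and global deduplication drawn in the abstract can be made concrete with a small sketch. This is a simplification under stated assumptions: exact matching on a normalized SHA-256 fingerprint stands in for whatever matching the actual pipeline uses, and the two sources and their documents are invented toy data.

```python
import hashlib
from typing import Dict, List, Set


def fingerprint(doc: str) -> str:
    """Exact-match fingerprint after case and whitespace normalization."""
    normalized = " ".join(doc.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def keep_unseen(docs: List[str], seen: Set[str]) -> List[str]:
    """Keep documents whose fingerprint is not in `seen`, updating `seen`."""
    kept = []
    for doc in docs:
        key = fingerprint(doc)
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


def local_dedup(sources: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Deduplicate within each source independently (fresh `seen` per source)."""
    return {name: keep_unseen(docs, set()) for name, docs in sources.items()}


def global_dedup(sources: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Deduplicate across all sources (one shared `seen`), first copy wins."""
    seen: Set[str] = set()
    return {name: keep_unseen(docs, seen) for name, docs in sources.items()}


# Invented toy corpus: "Hello world" appears in both sources.
sources = {
    "web": ["The cat sat.", "the  cat sat.", "Hello world"],
    "wiki": ["Hello world", "Unique article"],
}
local = local_dedup(sources)
global_ = global_dedup(sources)
```

Local deduplication leaves the cross-source copy of "Hello world" in place, while global deduplication removes it, which also shifts the relative proportions of the sources in the final mix.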
-
Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models
Authors:
Neha Sengupta,
Sunil Kumar Sahu,
Bokang Jia,
Satheesh Katipomu,
Haonan Li,
Fajri Koto,
William Marshall,
Gurpreet Gosal,
Cynthia Liu,
Zhiming Chen,
Osama Mohammed Afzal,
Samta Kamboj,
Onkar Pandit,
Rahul Pal,
Lalit Pradhan,
Zain Muhammad Mujahid,
Massa Baali,
Xudong Han,
Sondos Mahmoud Bsharat,
Alham Fikri Aji,
Zhiqiang Shen,
Zhengzhong Liu,
Natalia Vassilieva,
Joel Hestness,
Andy Hock, et al. (7 additional authors not shown)
Abstract:
We introduce Jais and Jais-chat, new state-of-the-art Arabic-centric foundation and instruction-tuned open generative large language models (LLMs). The models are based on the GPT-3 decoder-only architecture and are pretrained on a mixture of Arabic and English texts, including source code in various programming languages. With 13 billion parameters, they demonstrate better knowledge and reasoning capabilities in Arabic than any existing open Arabic and multilingual models by a sizable margin, based on extensive evaluation. Moreover, the models are competitive in English compared to English-centric open models of similar size, despite being trained on much less English data. We provide a detailed description of the training, the tuning, the safety alignment, and the evaluation of the models. We release two open versions of the model -- the foundation Jais model, and an instruction-tuned Jais-chat variant -- with the aim of promoting research on Arabic LLMs. Available at https://huggingface.co/inception-mbzuai/jais-13b-chat
Submitted 29 September, 2023; v1 submitted 30 August, 2023;
originally announced August 2023.