Showing 1–7 of 7 results for author: Chowdhury, N R

  1. arXiv:2504.02507  [pdf, other]

    cs.LG cs.CL

    ZClip: Adaptive Spike Mitigation for LLM Pre-Training

    Authors: Abhay Kumar, Louis Owen, Nilabhra Roy Chowdhury, Fabian Güra

    Abstract: Training large language models (LLMs) presents numerous challenges, including gradient instability and loss spikes. These phenomena can lead to catastrophic divergence, requiring costly checkpoint restoration and data batch skipping. Traditional gradient clipping techniques, such as constant or norm-based methods, fail to address these issues effectively due to their reliance on fixed thresholds o…

    Submitted 3 April, 2025; originally announced April 2025.

  2. arXiv:2503.22329  [pdf]

    cs.CL

    A Refined Analysis of Massive Activations in LLMs

    Authors: Louis Owen, Nilabhra Roy Chowdhury, Abhay Kumar, Fabian Güra

    Abstract: Motivated in part by their relevance for low-precision training and quantization, massive activations in large language models (LLMs) have recently emerged as a topic of interest. However, existing analyses are limited in scope, and generalizability across architectures is unclear. This paper helps address some of these gaps by conducting an analysis of massive activations across a broad range of…

    Submitted 28 March, 2025; originally announced March 2025.

  3. arXiv:2503.17500  [pdf]

    cs.LG cs.CL stat.ML

    Variance Control via Weight Rescaling in LLM Pre-training

    Authors: Louis Owen, Abhay Kumar, Nilabhra Roy Chowdhury, Fabian Güra

    Abstract: The outcome of Large Language Model (LLM) pre-training strongly depends on weight initialization and variance control strategies. Although the importance of initial variance control has been well documented in neural networks in general, the literature on initialization and management of its growth during LLM pre-training, specifically, is somewhat sparse. In this paper, we introduce the Layer Ind…

    Submitted 21 March, 2025; originally announced March 2025.

  4. arXiv:2407.14885  [pdf, other]

    cs.CL cs.CV

    Falcon2-11B Technical Report

    Authors: Quentin Malartic, Nilabhra Roy Chowdhury, Ruxandra Cojocaru, Mugariya Farooq, Giulia Campesan, Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Ankit Singh, Maksim Velikanov, Basma El Amel Boussaha, Mohammed Al-Yafeai, Hamza Alobeidli, Leen Al Qadi, Mohamed El Amine Seddik, Kirill Fedyanin, Reda Alami, Hakim Hacid

    Abstract: We introduce Falcon2-11B, a foundation model trained on over five trillion tokens, and its multimodal counterpart, Falcon2-11B-vlm, which is a vision-to-text model. We report our findings during the training of the Falcon2-11B which follows a multi-stage approach where the early stages are distinguished by their context length and a final stage where we use a curated, high-quality dataset. Additio…

    Submitted 20 July, 2024; originally announced July 2024.

  5. arXiv:2405.16646  [pdf, other]

    cs.LG

    A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts

    Authors: Mohammed Nowaz Rabbani Chowdhury, Meng Wang, Kaoutar El Maghraoui, Naigang Wang, Pin-Yu Chen, Christopher Carothers

    Abstract: The sparsely gated mixture of experts (MoE) architecture sends different inputs to different subnetworks, i.e., experts, through trainable routers. MoE reduces the training computation significantly for large models, but its deployment can be still memory or computation expensive for some downstream tasks. Model pruning is a popular approach to reduce inference computation, but its application in…

    Submitted 30 May, 2024; v1 submitted 26 May, 2024; originally announced May 2024.

    Journal ref: The 41st International Conference on Machine Learning, ICML 2024

  6. arXiv:2306.04073  [pdf, other]

    cs.LG

    Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient for Convolutional Neural Networks

    Authors: Mohammed Nowaz Rabbani Chowdhury, Shuai Zhang, Meng Wang, Sijia Liu, Pin-Yu Chen

    Abstract: In deep learning, mixture-of-experts (MoE) activates one or few experts (sub-networks) on a per-sample or per-token basis, resulting in significant computation reduction. The recently proposed patch-level routing in MoE (pMoE) divides each input into $n$ patches (or tokens) and sends $l$ patches ($l\ll n$) to each expert through prioritized routing. pMoE has demonstrated gr…

    Submitted 6 June, 2023; originally announced June 2023.

    Journal ref: The 40th International Conference on Machine Learning (ICML), 2023

  7. arXiv:2208.00296  [pdf, other]

    cs.LG

    ANOVA-based Automatic Attribute Selection and a Predictive Model for Heart Disease Prognosis

    Authors: Mohammed Nowshad Ruhani Chowdhury, Wandong Zhang, Thangarajah Akilan

    Abstract: Studies show that cardiovascular diseases (CVDs) are malignant for human health. Thus, it is important to have an efficient way of CVD prognosis. In response to this, the healthcare industry has adopted machine learning-based smart solutions to alleviate the manual process of CVD prognosis. Thus, this work proposes an information fusion technique that combines key attributes of a pers…

    Submitted 30 July, 2022; originally announced August 2022.