Scaling Laws for Fine-Grained Mixture of Experts

Authors: Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, Marek Cygan, Sebastian Jaszczur

Abstract: Mixture of Experts (MoE) models have emerged as a primary solution for reducing the computational cost of Large Language Models. In this work, we analyze their scaling properties, incorporating an expanded range of variables. Specifically, we introduce a new hyperparameter, granularity, whose adjustment enables precise control over the size of the experts. Building on this, we establish scaling laws for fine-grained MoE, taking into account the number of training tokens, model size, and granularity. Leveraging these laws, we derive the optimal training configuration for a given computational budget. Our findings not only show that MoE models consistently outperform dense Transformers but also highlight that the efficiency gap between dense and MoE models widens as we scale up the model size and training budget. Furthermore, we demonstrate that the common practice of setting the size of experts in MoE to mirror the feed-forward layer is not optimal at almost any computational budget.

Submitted 12 February, 2024; originally announced February 2024.

arXiv:2401.04081

MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts

Authors: Maciej Pióro, Kamil Ciebiera, Krystian Król, Jan Ludziejewski, Michał Krutul, Jakub Krajewski, Szymon Antoniak, Piotr Miłoś, Marek Cygan, Sebastian Jaszczur

Abstract: State Space Models (SSMs) have become serious contenders in the field of sequential modeling, challenging the dominance of Transformers. At the same time, Mixture of Experts (MoE) has significantly improved Transformer-based Large Language Models, including recent state-of-the-art open models. We propose that to unlock the potential of SSMs for scaling, they should be combined with MoE. We showcase this on Mamba, a recent SSM-based model that achieves remarkable performance. Our model, MoE-Mamba, outperforms both Mamba and baseline Transformer-MoE. In particular, MoE-Mamba reaches the same performance as Mamba in $2.35\times$ fewer training steps while preserving the inference performance gains of Mamba against Transformer.

Submitted 26 February, 2024; v1 submitted 8 January, 2024; originally announced January 2024.

arXiv:2310.15961

Mixture of Tokens: Continuous MoE through Cross-Example Aggregation

Authors: Szymon Antoniak, Michał Krutul, Maciej Pióro, Jakub Krajewski, Jan Ludziejewski, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Marek Cygan, Sebastian Jaszczur

Abstract: Mixture of Experts (MoE) models based on Transformer architecture are pushing the boundaries of language and vision tasks. The allure of these models lies in their ability to substantially increase the parameter count without a corresponding increase in FLOPs. Most widely adopted MoE models are discontinuous with respect to their parameters - often referred to as sparse. At the same time, existing continuous MoE designs either lag behind their sparse counterparts or are incompatible with autoregressive decoding. Motivated by the observation that the adaptation of fully continuous methods has been an overarching trend in deep learning, we develop Mixture of Tokens (MoT), a simple, continuous architecture that is capable of scaling the number of parameters similarly to sparse MoE models. Unlike conventional methods, MoT assigns mixtures of tokens from different examples to each expert. This architecture is fully compatible with autoregressive training and generation. Our best models not only achieve a 3x increase in training speed over dense Transformer models in language pretraining but also match the performance of state-of-the-art MoE architectures. Additionally, a close connection between MoT and MoE is demonstrated through a novel technique we call transition tuning.

Submitted 24 September, 2024; v1 submitted 24 October, 2023; originally announced October 2023.

arXiv:1811.03415

Credibility of Automatic Appraisal of Domain Names

Authors: Karol Król, Artur Strzelecki, Dariusz Zdonek

Abstract: Both domain names and entire websites are increasingly frequently treated as assets, the value of which can be appraised. The objective of the present thesis was to verify the credibility of domain name appraisals obtained using generally available web applications in an automated, algorithmic way. In the conclusions section, it was mentioned that the terms domain name appraisal and website appraisal are frequently equated. It was also shown that algorithms used in the tested applications consider parameters characterising websites. Thus, they cannot be used to verify the value of domain names themselves. Moreover, during the analysis of the pattern of operation of the appraisal websites it was noticed that they were not made available with domain name or website appraisals in mind. Their objective was to acquire and intercept online traffic. Such applications also left cookie files on recipients' devices, which were then used by advertising systems based on the re-marketing concept.

Submitted 29 October, 2018; originally announced November 2018.

Comments: 4 pages, 3 tables

arXiv:1501.04434

"`They brought in the horrible key ring thing!" Analysing the Usability of Two-Factor Authentication in UK Online Banking

Authors: Kat Krol, Eleni Philippou, Emiliano De Cristofaro, M. Angela Sasse

Abstract: To prevent password breaches and guessing attacks, banks increasingly turn to two-factor authentication (2FA), requiring users to present at least one more factor, such as a one-time password generated by a hardware token or received via SMS, besides a password. We can expect some solutions -- especially those adding a token -- to create extra work for users, but little research has investigated usability, user acceptance, and perceived security of deployed 2FA. This paper presents an in-depth study of 2FA usability with 21 UK online banking customers, 16 of whom had accounts with more than one bank. We collected a rich set of qualitative and quantitative data through two rounds of semi-structured interviews, and an authentication diary over an average of 11 days. Our participants reported a wide range of usability issues, especially with the use of hardware tokens, showing that the mental and physical workload involved shapes how they use online banking. Key targets for improvements are (i) the reduction in the number of authentication steps, and (ii) removing features that do not add any security but negatively affect the user experience.

Submitted 19 January, 2015; originally announced January 2015.

Comments: To appear in NDSS Workshop on Usable Security (USEC 2015)

Showing 1–5 of 5 results for author: Król, K