-
Decoupled Relative Learning Rate Schedules
Authors:
Jan Ludziejewski,
Jan Małaśnicki,
Maciej Pióro,
Michał Krutul,
Kamil Ciebiera,
Maciej Stefaniak,
Jakub Krajewski,
Piotr Sankowski,
Marek Cygan,
Kamil Adamczewski,
Sebastian Jaszczur
Abstract:
In this work, we introduce a novel approach for optimizing LLM training by adjusting learning rates across weights of different components in Transformer models. Traditional methods often apply a uniform learning rate across all network layers, potentially overlooking the unique dynamics of each part. Remarkably, our relative learning rates method, RLRS, accelerates the training process by up to $23\%$, particularly in complex models such as Mixture of Experts (MoE). Hyperparameters of RLRS can be efficiently tuned on smaller models and then effectively reused on models up to $27\times$ larger. This simple and effective method results in a substantial reduction in training time and computational resources, offering a practical and scalable solution for optimizing large-scale neural networks.
Submitted 4 July, 2025;
originally announced July 2025.
-
Projected Compression: Trainable Projection for Efficient Transformer Compression
Authors:
Maciej Stefaniak,
Michał Krutul,
Jan Małaśnicki,
Maciej Pióro,
Jakub Krajewski,
Sebastian Jaszczur,
Marek Cygan,
Kamil Adamczewski,
Jan Ludziejewski
Abstract:
Large language models have steadily increased in size to achieve improved performance; however, this growth has also led to greater inference time and computational demands. Consequently, there is rising interest in model size reduction methods. To address this issue, we propose Projected Compression, a novel model compression technique that reduces model weights by utilizing projection modules. Specifically, we first train additional trainable projection weights and preserve access to all the original model parameters. Subsequently, these projections are merged into a lower-dimensional product matrix, resulting in a reduced-size standard Transformer-based model. Unlike alternative approaches that require additional computational overhead, our method matches the base model's per-token computation step in FLOPs. Experimental results show that Projected Compression outperforms the comparable hard pruning and retraining approach on higher quality models. Moreover, the performance margin scales well with the number of tokens.
Submitted 27 June, 2025;
originally announced June 2025.
-
A Survey on Hypothesis Generation for Scientific Discovery in the Era of Large Language Models
Authors:
Atilla Kaan Alkan,
Shashwat Sourav,
Maja Jablonska,
Simone Astarita,
Rishabh Chakrabarty,
Nikhil Garuda,
Pranav Khetarpal,
Maciej Pióro,
Dimitrios Tanoglidis,
Kartheik G. Iyer,
Mugdha S. Polimera,
Michael J. Smith,
Tirthankar Ghosal,
Marc Huertas-Company,
Sandor Kruk,
Kevin Schawinski,
Ioana Ciucă
Abstract:
Hypothesis generation is a fundamental step in scientific discovery, yet it is increasingly challenged by information overload and disciplinary fragmentation. Recent advances in Large Language Models (LLMs) have sparked growing interest in their potential to enhance and automate this process. This paper presents a comprehensive survey of hypothesis generation with LLMs by (i) reviewing existing methods, from simple prompting techniques to more complex frameworks, and proposing a taxonomy that categorizes these approaches; (ii) analyzing techniques for improving hypothesis quality, such as novelty boosting and structured reasoning; (iii) providing an overview of evaluation strategies; and (iv) discussing key challenges and future directions, including multimodal integration and human-AI collaboration. Our survey aims to serve as a reference for researchers exploring LLMs for hypothesis generation.
Submitted 7 April, 2025;
originally announced April 2025.
-
Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient
Authors:
Jan Ludziejewski,
Maciej Pióro,
Jakub Krajewski,
Maciej Stefaniak,
Michał Krutul,
Jan Małaśnicki,
Marek Cygan,
Piotr Sankowski,
Kamil Adamczewski,
Piotr Miłoś,
Sebastian Jaszczur
Abstract:
Mixture of Experts (MoE) architectures have significantly increased computational efficiency in both research and real-world applications of large-scale machine learning models. However, their scalability and efficiency under memory constraints remain relatively underexplored. In this work, we present joint scaling laws for dense and MoE models, incorporating key factors such as the number of active parameters, dataset size, and the number of experts. Our findings provide a principled framework for selecting the optimal MoE configuration under fixed memory and compute budgets. Surprisingly, we show that MoE models can be more memory-efficient than dense models, contradicting conventional wisdom. To derive and validate the theoretical predictions of our scaling laws, we conduct over 280 experiments with up to 2.7B active parameters and up to 5B total parameters. These results offer actionable insights for designing and deploying MoE models in practical large-scale training scenarios.
Submitted 19 February, 2025; v1 submitted 7 February, 2025;
originally announced February 2025.
-
State Soup: In-Context Skill Learning, Retrieval and Mixing
Authors:
Maciej Pióro,
Maciej Wołczyk,
Razvan Pascanu,
Johannes von Oswald,
João Sacramento
Abstract:
A new breed of gated-linear recurrent neural networks has reached state-of-the-art performance on a range of sequence modeling problems. Such models naturally handle long sequences efficiently, as the cost of processing a new input is independent of sequence length. Here, we explore another advantage of these stateful sequence models, inspired by the success of model merging through parameter interpolation. Building on parallels between fine-tuning and in-context learning, we investigate whether we can treat internal states as task vectors that can be stored, retrieved, and then linearly combined, exploiting the linearity of recurrence. We study this form of fast model merging on Mamba-2.8b, a pretrained recurrent model, and present preliminary evidence that simple linear state interpolation methods suffice to improve next-token perplexity as well as downstream in-context learning task performance.
Submitted 12 June, 2024;
originally announced June 2024.
-
Scaling Laws for Fine-Grained Mixture of Experts
Authors:
Jakub Krajewski,
Jan Ludziejewski,
Kamil Adamczewski,
Maciej Pióro,
Michał Krutul,
Szymon Antoniak,
Kamil Ciebiera,
Krystian Król,
Tomasz Odrzygóźdź,
Piotr Sankowski,
Marek Cygan,
Sebastian Jaszczur
Abstract:
Mixture of Experts (MoE) models have emerged as a primary solution for reducing the computational cost of Large Language Models. In this work, we analyze their scaling properties, incorporating an expanded range of variables. Specifically, we introduce a new hyperparameter, granularity, whose adjustment enables precise control over the size of the experts. Building on this, we establish scaling laws for fine-grained MoE, taking into account the number of training tokens, model size, and granularity. Leveraging these laws, we derive the optimal training configuration for a given computational budget. Our findings not only show that MoE models consistently outperform dense Transformers but also highlight that the efficiency gap between dense and MoE models widens as we scale up the model size and training budget. Furthermore, we demonstrate that the common practice of setting the size of experts in MoE to mirror the feed-forward layer is not optimal at almost any computational budget.
Submitted 12 February, 2024;
originally announced February 2024.
-
MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
Authors:
Maciej Pióro,
Kamil Ciebiera,
Krystian Król,
Jan Ludziejewski,
Michał Krutul,
Jakub Krajewski,
Szymon Antoniak,
Piotr Miłoś,
Marek Cygan,
Sebastian Jaszczur
Abstract:
State Space Models (SSMs) have become serious contenders in the field of sequential modeling, challenging the dominance of Transformers. At the same time, Mixture of Experts (MoE) has significantly improved Transformer-based Large Language Models, including recent state-of-the-art open models. We propose that to unlock the potential of SSMs for scaling, they should be combined with MoE. We showcase this on Mamba, a recent SSM-based model that achieves remarkable performance. Our model, MoE-Mamba, outperforms both Mamba and baseline Transformer-MoE. In particular, MoE-Mamba reaches the same performance as Mamba in $2.35\times$ fewer training steps while preserving the inference performance gains of Mamba against Transformer.
Submitted 26 February, 2024; v1 submitted 8 January, 2024;
originally announced January 2024.
-
Mixture of Tokens: Continuous MoE through Cross-Example Aggregation
Authors:
Szymon Antoniak,
Michał Krutul,
Maciej Pióro,
Jakub Krajewski,
Jan Ludziejewski,
Kamil Ciebiera,
Krystian Król,
Tomasz Odrzygóźdź,
Marek Cygan,
Sebastian Jaszczur
Abstract:
Mixture of Experts (MoE) models based on Transformer architecture are pushing the boundaries of language and vision tasks. The allure of these models lies in their ability to substantially increase the parameter count without a corresponding increase in FLOPs. Most widely adopted MoE models are discontinuous with respect to their parameters - often referred to as sparse. At the same time, existing continuous MoE designs either lag behind their sparse counterparts or are incompatible with autoregressive decoding. Motivated by the observation that the adaptation of fully continuous methods has been an overarching trend in deep learning, we develop Mixture of Tokens (MoT), a simple, continuous architecture that is capable of scaling the number of parameters similarly to sparse MoE models. Unlike conventional methods, MoT assigns mixtures of tokens from different examples to each expert. This architecture is fully compatible with autoregressive training and generation. Our best models not only achieve a 3x increase in training speed over dense Transformer models in language pretraining but also match the performance of state-of-the-art MoE architectures. Additionally, a close connection between MoT and MoE is demonstrated through a novel technique we call transition tuning.
Submitted 24 September, 2024; v1 submitted 24 October, 2023;
originally announced October 2023.
-
Efficient Single-Image Depth Estimation on Mobile Devices, Mobile AI & AIM 2022 Challenge: Report
Authors:
Andrey Ignatov,
Grigory Malivenko,
Radu Timofte,
Lukasz Treszczotko,
Xin Chang,
Piotr Ksiazek,
Michal Lopuszynski,
Maciej Pioro,
Rafal Rudnicki,
Maciej Smyl,
Yujie Ma,
Zhenyu Li,
Zehui Chen,
Jialei Xu,
Xianming Liu,
Junjun Jiang,
XueChao Shi,
Difan Xu,
Yanan Li,
Xiaotao Wang,
Lei Lei,
Ziyu Zhang,
Yicheng Wang,
Zilong Huang,
Guozhong Luo
, et al. (14 additional authors not shown)
Abstract:
Various depth estimation models are now widely used on many mobile and IoT devices for image segmentation, bokeh effect rendering, object tracking and many other mobile tasks. Thus, it is crucial to have efficient and accurate depth estimation models that can run fast on low-power mobile chipsets. In this Mobile AI challenge, the target was to develop deep learning-based single image depth estimation solutions that can achieve real-time performance on IoT platforms and smartphones. For this, the participants used a large-scale RGB-to-depth dataset that was collected with the ZED stereo camera, capable of generating depth maps for objects located at up to 50 meters. The runtime of all models was evaluated on the Raspberry Pi 4 platform, where the developed solutions were able to generate VGA resolution depth maps at up to 27 FPS while achieving high-fidelity results. All models developed in the challenge are also compatible with any Android or Linux-based mobile devices; their detailed description is provided in this paper.
Submitted 7 November, 2022;
originally announced November 2022.
-
A Light Signalling Approach to Node Grouping for Massive MIMO IoT Networks
Authors:
Emma Fitzgerald,
Michał Pióro,
Harsh Tataria,
Gilles Callebaut,
Sara Gunnarsson,
Liesbet Van der Perre
Abstract:
Massive MIMO is a leading technology to connect very large numbers of energy-constrained nodes, as it offers both extensive spatial multiplexing and large array gain. A challenge resides in partitioning the many nodes into groups that can communicate simultaneously such that the mutual interference is minimized. We here propose node partitioning strategies that do not require full channel state information, but rather are based on nodes' respective directional channel properties. In our considered scenarios, these typically have a time constant that is far larger than the coherence time of the channel. We developed both an optimal and an approximation algorithm to partition users based on directional channel properties, and evaluated them numerically. Our results show that both algorithms, despite using only these directional channel properties, achieve similar performance in terms of the minimum signal-to-interference-plus-noise ratio for any user, compared with a reference method using full channel knowledge. In particular, we demonstrate that grouping nodes with related directional properties is to be avoided. We hence realise a simple partitioning method requiring minimal information to be collected from the nodes, and where this information typically remains stable over a long term, thus promoting their autonomy and energy efficiency.
Submitted 16 June, 2022; v1 submitted 11 May, 2020;
originally announced May 2020.
-
Efficient Pilot Allocation for URLLC Traffic in 5G Industrial IoT Networks
Authors:
Emma Fitzgerald,
Michał Pióro
Abstract:
In this paper we address the problem of resource allocation for alarm traffic in industrial Internet of Things networks using massive MIMO. We formulate the general problem of how to allocate pilot signals to alarm traffic such that delivery is guaranteed, while also minimising the number of pilots reserved for alarms, thus maximising the channel resources available for other traffic, such as industrial control traffic. We present an algorithm that fulfils these requirements, and evaluate its performance both analytically and through a simulation study. For realistic alarm traffic characteristics, on average our algorithm can deliver alarms within two time slots (of duration equal to the 5G transmission time interval) using fewer than 1.5 pilots per slot, and even in the worst case it uses around 3.5 pilots in any given slot, with delivery guaranteed in an average of approximately four slots.
Submitted 25 September, 2019;
originally announced September 2019.
-
Network Lifetime Maximization in Wireless Mesh Networks for Machine-to-Machine Communication
Authors:
Emma Fitzgerald,
Michał Pióro,
Artur Tomaszewski
Abstract:
In this paper we present new optimization formulations for maximizing the network lifetime in wireless mesh networks performing data aggregation and dissemination for machine-to-machine communication in the Internet of Things. We focus on heterogeneous networks in which multiple applications co-exist and nodes may take on different roles for different applications. Moreover, we address network reconfiguration as a means to increase the network lifetime, in keeping with the current trend towards software defined networks and network function virtualization. To test our optimization formulations, we conducted a numerical study using randomly-generated mesh networks from 10 to 30 nodes, and showed that the network lifetime can be increased using network reconfiguration by up to 75% over a single, minimal-energy configuration. Further, our solutions are feasible to implement in practical scenarios: only a few configurations are needed, thus requiring little storage for a standalone network, and the synchronization and signalling needed to switch configurations are low relative to each configuration's operating time.
Submitted 14 August, 2019;
originally announced August 2019.
-
Massive MIMO Optimization with Compatible Sets
Authors:
Emma Fitzgerald,
Michał Pióro,
Fredrik Tufvesson
Abstract:
Massive multiple-input multiple-output (MIMO) is expected to be a vital component in future 5G systems. As such, there is a need for new modeling in order to investigate the performance of massive MIMO not only at the physical layer, but also higher up the networking stack. In this paper, we present general optimization models for massive MIMO, based on mixed-integer programming and compatible sets, with both maximum ratio combining and zero forcing precoding schemes. We then apply our models to the case of joint device scheduling and power control for heterogeneous devices and traffic demands, in contrast to existing power control schemes that consider only homogeneous users and saturated scenarios. Our results show that substantial benefits in terms of energy usage can be achieved without sacrificing throughput, and that both signalling overhead and the complexity of end devices can be reduced by abrogating the need for uplink power control through efficient scheduling.
Submitted 26 March, 2019; v1 submitted 19 March, 2019;
originally announced March 2019.
-
Semi-Distributed Demand Response Solutions for Smart Homes
Authors:
Rim Kaddah,
Daniel Kofman,
Fabien Mathieu,
Michal Pioro
Abstract:
The Internet of Things (IoT) paradigm brings an opportunity for advanced Demand Response (DR) solutions. It enables visibility and control over the various appliances that may consume, store or generate energy within a home. It has been shown that centralized control over the appliances of a set of households leads to efficient DR mechanisms; unfortunately, such solutions raise privacy and scalability issues. In this chapter we propose an approach that deals with these issues. Specifically, we introduce a scalable two-level control system in which a centralized controller allocates power to each house on one side, and each household implements a local DR solution on the other. Limited feedback to the centralized controller makes it possible to enhance performance with little impact on privacy. The solution is proposed for the general framework of capacity markets.
Submitted 30 November, 2017;
originally announced November 2017.
-
Optimization of Free Space Optical Wireless Network for Cellular Backhauling
Authors:
Yuan Li,
Nikolaos Pappas,
Vangelis Angelakis,
Michal Pióro,
Di Yuan
Abstract:
With densification of nodes in cellular networks, free space optic (FSO) connections are becoming an appealing low cost and high rate alternative to copper and fiber as the backhaul solution for wireless communication systems. To ensure a reliable cellular backhaul, provisions for redundant, disjoint paths between the nodes must be made in the design phase. This paper aims at finding a cost-effective solution to upgrade the cellular backhaul with pre-deployed optical fibers using FSO links and mirror components. Since the quality of the FSO links depends on several factors, such as transmission distance, power, and weather conditions, we adopt an elaborate formulation to calculate link reliability. We present a novel integer linear programming model to approach optimal FSO backhaul design, guaranteeing $K$-disjoint paths connecting each node pair. Next, we derive a column generation method for a path-oriented mathematical formulation. Applying the method in a sequential manner enables high computational scalability. We use realistic scenarios to demonstrate that our approaches efficiently provide optimal or near-optimal solutions, and thereby allow for accurately dealing with the trade-off between cost and reliability.
Submitted 10 June, 2014;
originally announced June 2014.