
Dense Backpropagation Improves Training for Sparse Mixture-of-Experts

Authors: Ashwinee Panda, Vatsal Baherwani, Zain Sarwar, Benjamin Therien, Supriyo Chakraborty, Tom Goldstein

Abstract: Mixture of Experts (MoE) pretraining is more scalable than dense Transformer pretraining, because MoEs learn to route inputs to a sparse set of their feedforward parameters. However, this means that MoEs only receive a sparse backward update, leading to training instability and suboptimal performance. We present a lightweight approximation method that gives the MoE router a dense gradient update while continuing to sparsely activate its parameters. Our method, which we refer to as Default MoE, substitutes missing expert activations with default outputs consisting of an exponential moving average of expert outputs previously seen over the course of training. This allows the router to receive signals from every expert for each token, leading to significant improvements in training performance. Our Default MoE outperforms standard TopK routing in a variety of settings without requiring significant computational overhead. Code: https://github.com/vatsal0/default-moe.

Submitted 17 April, 2025; v1 submitted 16 April, 2025; originally announced April 2025.
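
The substitution described in the Default MoE abstract above can be illustrated with a small forward-pass sketch. This is not the authors' implementation (their code is at the linked repository); it is a toy, pure-Python illustration under stated assumptions. The class name `DefaultMoESketch`, the decay value `beta`, the zero initialization of the defaults, and the choice to weight every expert by its softmax probability are all hypothetical. The sketch only shows how a layer output can depend on every router probability while only the top-k experts are actually evaluated.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class DefaultMoESketch:
    """Toy top-k MoE layer that fills unselected experts with EMA defaults.

    Hypothetical illustration only; names and hyperparameters are assumptions.
    """

    def __init__(self, experts, dim, top_k=1, beta=0.9):
        self.experts = experts  # list of callables: vector -> vector
        self.top_k = top_k
        self.beta = beta  # EMA decay (assumed value)
        # One default output vector per expert, initialized to zeros (assumption).
        self.defaults = [[0.0] * dim for _ in experts]

    def forward(self, x, router_logits):
        probs = softmax(router_logits)
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        active = set(ranked[: self.top_k])
        out = [0.0] * len(x)
        for i, expert in enumerate(self.experts):
            if i in active:
                # Only the selected experts are actually computed.
                y = expert(x)
                # Fold the fresh output into that expert's moving average.
                self.defaults[i] = [
                    self.beta * d + (1.0 - self.beta) * v
                    for d, v in zip(self.defaults[i], y)
                ]
            else:
                # Stand-in for the missing activation of an unselected expert.
                y = self.defaults[i]
            # Every expert contributes a term weighted by its router
            # probability, so the output depends on all router probabilities.
            out = [o + probs[i] * v for o, v in zip(out, y)]
        return out, active


# Usage: two toy experts, one active per token.
toy_experts = [lambda x: [2.0 * v for v in x], lambda x: [-v for v in x]]
toy_layer = DefaultMoESketch(toy_experts, dim=2, top_k=1, beta=0.5)
toy_out, toy_active = toy_layer.forward([1.0, 1.0], [1.0, 0.0])
print(toy_out, toy_active)
```

In an autograd framework, the terms contributed by the default outputs are what would let gradient reach the router weights of unselected experts; the pure-Python version above only demonstrates the forward computation.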

arXiv:2503.05029 [pdf, other]

Continual Pre-training of MoEs: How robust is your router?

Authors: Benjamin Thérien, Charles-Étienne Joseph, Zain Sarwar, Ashwinee Panda, Anirban Das, Shi-Xiong Zhang, Stephen Rawls, Sambit Sahu, Eugene Belilovsky, Irina Rish

Abstract: Sparsely-activated Mixture of Experts (MoE) transformers are promising architectures for foundation models. Compared to dense transformers that require the same amount of floating point operations (FLOPs) per forward pass, MoEs benefit from improved sample efficiency at training time and achieve much stronger performance. Many closed-source and open-source frontier language models have thus adopted an MoE architecture. Naturally, practitioners will want to extend the capabilities of these models with large amounts of newly collected data without completely re-training them. Prior work has shown that a simple combination of replay and learning rate re-warming and re-decaying can enable the continual pre-training (CPT) of dense decoder-only transformers with minimal performance degradation compared to full re-training. In the case of decoder-only MoE transformers, however, it is unclear how the routing algorithm will impact continual pre-training performance: 1) do the MoE transformer's routers exacerbate forgetting relative to a dense model?; 2) do the routers maintain a balanced load on previous distributions after CPT?; 3) are the same strategies applied to dense models sufficient to continually pre-train MoE LLMs? In what follows, we conduct a large-scale (>2B parameter switch and DeepSeek MoE LLMs trained for 600B tokens) empirical study across four MoE transformers to answer these questions. Our results establish a surprising robustness to distribution shifts for both Sinkhorn-Balanced and Z-and-Aux-loss-balanced routing algorithms, even in MoEs continually pre-trained without replay. Moreover, we show that MoE LLMs maintain their sample efficiency (relative to a FLOP-matched dense model) during CPT and that they can match the performance of a fully re-trained MoE at a fraction of the cost.

Submitted 6 March, 2025; originally announced March 2025.

arXiv:2410.08432 [pdf, other]

MYCROFT: Towards Effective and Efficient External Data Augmentation

Authors: Zain Sarwar, Van Tran, Arjun Nitin Bhagoji, Nick Feamster, Ben Y. Zhao, Supriyo Chakraborty

Abstract: Machine learning (ML) models often require large amounts of data to perform well. When the available data is limited, model trainers may need to acquire more data from external sources. Often, useful data is held by private entities who are hesitant to share their data due to proprietary and privacy concerns. This makes it challenging and expensive for model trainers to acquire the data they need to improve model performance. To address this challenge, we propose Mycroft, a data-efficient method that enables model trainers to evaluate the relative utility of different data sources while working with a constrained data-sharing budget. By leveraging feature space distances and gradient matching, Mycroft identifies small but informative data subsets from each owner, allowing model trainers to maximize performance with minimal data exposure. Experimental results across four tasks in two domains show that Mycroft converges rapidly to the performance of the full-information baseline, where all data is shared. Moreover, Mycroft is robust to noise and can effectively rank data owners by utility. Mycroft can pave the way for democratized training of high performance ML models.

Submitted 10 October, 2024; originally announced October 2024.

Comments: 10 pages, 3 figures, 3 tables

arXiv:2310.16191 [pdf, other]

Can Virtual Reality Protect Users from Keystroke Inference Attacks?

Authors: Zhuolin Yang, Zain Sarwar, Iris Hwang, Ronik Bhaskar, Ben Y. Zhao, Haitao Zheng

Abstract: Virtual Reality (VR) has gained popularity by providing immersive and interactive experiences without geographical limitations. It also provides a sense of personal privacy through physical separation. In this paper, we show that despite assumptions of enhanced privacy, VR is unable to shield its users from side-channel attacks that steal private information. Ironically, this vulnerability arises from VR's greatest strength, its immersive and interactive nature. We demonstrate this by designing and implementing a new set of keystroke inference attacks in shared virtual environments, where an attacker (VR user) can recover the content typed by another VR user by observing their avatar. While the avatar displays noisy telemetry of the user's hand motion, an intelligent attacker can use that data to recognize typed keys and reconstruct typed content, without knowing the keyboard layout or gathering labeled data. We evaluate the proposed attacks using IRB-approved user studies across multiple VR scenarios. For 13 out of 15 tested users, our attacks accurately recognize 86%-98% of typed keys, and the recovered content retains up to 98% of the meaning of the original typed content. We also discuss potential defenses.

Submitted 24 October, 2023; originally announced October 2023.

Comments: Accepted by USENIX 2024

arXiv:2210.09421 [pdf, other]

Deepfake Text Detection: Limitations and Opportunities

Authors: Jiameng Pu, Zain Sarwar, Sifat Muhammad Abdullah, Abdullah Rehman, Yoonjin Kim, Parantapa Bhattacharya, Mobin Javed, Bimal Viswanath

Abstract: Recent advances in generative models for language have enabled the creation of convincing synthetic text or deepfake text. Prior work has demonstrated the potential for misuse of deepfake text to mislead content consumers. Therefore, deepfake text detection, the task of discriminating between human and machine-generated text, is becoming increasingly critical. Several defenses have been proposed for deepfake text detection. However, we lack a thorough understanding of their real-world applicability. In this paper, we collect deepfake text from 4 online services powered by Transformer-based tools to evaluate the generalization ability of the defenses on content in the wild. We develop several low-cost adversarial attacks, and investigate the robustness of existing defenses against an adaptive attacker. We find that many defenses show significant degradation in performance under our evaluation scenarios compared to their original claimed performance. Our evaluation shows that tapping into the semantic information in the text content is a promising approach for improving the robustness and generalization performance of deepfake text detection schemes.

Submitted 17 October, 2022; originally announced October 2022.

Comments: Accepted to IEEE S&P 2023; First two authors contributed equally to this work; 18 pages, 7 figures

arXiv:2110.11011 [pdf, other]

doi 10.1142/S0218271821501285

Thermodynamics of Bardeen regular black hole with generalized uncertainty principle

Authors: Areeba Merriam, M. Zain Sarwar

Abstract: This study explores the emission of massive charged spin-1 particles from the background of Bardeen regular spacetime by the semi-classical method used to study the Hawking radiation spectrum. We employed the Hamilton-Jacobi method and WKB approximation technique with the suitable form of the wave function to solve the Proca field equation. We calculated the tunneling probability of outgoing spin-1 particles and the corresponding thermodynamic temperature. Furthermore, we obtained the modified thermodynamic quantities like temperature, entropy as well as heat capacity by utilizing the quadratic form of generalized uncertainty principle (GUP) and minimal length. In the end, we investigated the local stability as well as phase transitions of the Bardeen black hole in the context of GUP-modified heat capacity.

Submitted 21 October, 2021; originally announced October 2021.

Comments: 13 pages, 4 figures

arXiv:1910.07718 [pdf]

Multimetric Event-driven System for Long-Term Wireless Sensor Operation in SHM Application

Authors: Muhammad Zohaib Sarwar, Muhammad Rakeh Saleem, Jong-Woong Park, Do-Soo Moon, Dong Joo Kim

Abstract: Wireless sensor networks (WSNs) are promising solutions for large infrastructure monitoring because of their ease of installation, computing and communication capability, and cost-effectiveness. Long-term structural health monitoring (SHM), however, is still a challenge because it requires continuous data acquisition for the detection of random events such as earthquakes and structural collapse. To achieve long-term operation, it is necessary to reduce the power consumption of sensor nodes designed to capture random events and, thus, enhance structural safety. In this paper, we present an event-based sensing system design based on an ultra-low-power microcontroller with a programmable event-detection mechanism to allow continuous monitoring; the device is triggered by vibration, strain, or a timer and has a programmed threshold, resulting in ultra-low-power consumption of the sensor node. Furthermore, the proposed system can be easily reconfigured to any existing wireless sensor platform to enable ultra-low power operation. For validation, the proposed system was integrated with a commercial wireless platform to allow strain, acceleration, and time-based triggering with programmed thresholds and current consumptions of 7.43 and 0.85 mA in active and inactive modes, respectively.

Submitted 17 October, 2019; originally announced October 2019.

Comments: 10 pages, 9 figures, 3 tables, journal paper

Showing 1–7 of 7 results for author: Sarwar, Z