-
Spark Transformer: Reactivating Sparsity in FFN and Attention
Authors:
Chong You,
Kan Wu,
Zhipeng Jia,
Lin Chen,
Srinadh Bhojanapalli,
Jiaxian Guo,
Utku Evci,
Jan Wassenberg,
Praneeth Netrapalli,
Jeremiah J. Willcock,
Suvinay Subramanian,
Felix Chern,
Alek Andreev,
Shreya Pathak,
Felix Yu,
Prateek Jain,
David E. Culler,
Henry M. Levy,
Sanjiv Kumar
Abstract:
The discovery of the lazy neuron phenomenon in trained Transformers, where the vast majority of neurons in their feed-forward networks (FFN) are inactive for each token, has spurred tremendous interest in activation sparsity for enhancing large model efficiency. While notable progress has been made in translating such sparsity to wall-time benefits, modern Transformers have moved away from the ReLU activation function crucial to this phenomenon. Existing efforts on re-introducing activation sparsity often degrade model quality, increase parameter count, or complicate or slow down training. Sparse attention, the application of sparse activation to the attention mechanism, often faces similar challenges.
This paper introduces the Spark Transformer, a novel architecture that achieves a high level of activation sparsity in both FFN and the attention mechanism while maintaining model quality, parameter count, and standard training procedures. Our method realizes sparsity via top-k masking for explicit control over sparsity level. Crucially, we introduce statistical top-k, a hardware-accelerator-friendly, linear-time approximate algorithm that avoids costly sorting and mitigates significant training slowdown from standard top-k operators. Furthermore, Spark Transformer reallocates existing FFN parameters and attention key embeddings to form a low-cost predictor for identifying activated entries. This design not only mitigates quality loss from enforced sparsity, but also enhances wall-time benefit. Pretrained with the Gemma-2 recipe, Spark Transformer demonstrates competitive performance on standard benchmarks while exhibiting significant sparsity: only 8% of FFN neurons are activated, and each token attends to a maximum of 256 tokens. This sparsity translates to a 2.5x reduction in FLOPs, leading to decoding wall-time speedups of up to 1.79x on CPU and 1.40x on GPU.
Submitted 6 June, 2025;
originally announced June 2025.
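The Spark Transformer abstract above mentions a linear-time, sort-free approximation to top-k masking. One way such an approximation could work is sketched below, as an illustration only. It assumes the activations are roughly Gaussian, so that a threshold can be estimated from their mean and standard deviation; the abstract does not say whether the paper does this. The function name `statistical_top_k_mask` and the hard zeroing of entries below the threshold are hypothetical choices for this sketch.

```python
from statistics import NormalDist, fmean, pstdev


def statistical_top_k_mask(values, k):
    """Keep approximately the k largest entries of `values`, zeroing the rest.

    Illustrative assumption: the entries are roughly Gaussian, so the
    (1 - k/n) quantile can be estimated from the mean and standard
    deviation in linear time, with no sorting.
    """
    n = len(values)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < len(values)")
    mean = fmean(values)
    std = pstdev(values, mu=mean)
    # Under the Gaussian assumption, a fraction k/n of entries lies above this value.
    threshold = mean + std * NormalDist().inv_cdf(1.0 - k / n)
    return [v if v > threshold else 0.0 for v in values]
```

Because the threshold is an estimate, the number of surviving entries is only approximately k; an exact top-k would need a selection or sorting step, which is the cost this kind of approximation tries to avoid.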
-
Concorde: Fast and Accurate CPU Performance Modeling with Compositional Analytical-ML Fusion
Authors:
Arash Nasr-Esfahany,
Mohammad Alizadeh,
Victor Lee,
Hanna Alam,
Brett W. Coon,
David Culler,
Vidushi Dadu,
Martin Dixon,
Henry M. Levy,
Santosh Pandey,
Parthasarathy Ranganathan,
Amir Yazdanbakhsh
Abstract:
Cycle-level simulators such as gem5 are widely used in microarchitecture design, but they are prohibitively slow for large-scale design space explorations. We present Concorde, a new methodology for learning fast and accurate performance models of microarchitectures. Unlike existing simulators and learning approaches that emulate each instruction, Concorde predicts the behavior of a program based on compact performance distributions that capture the impact of different microarchitectural components. It derives these performance distributions using simple analytical models that estimate bounds on performance induced by each microarchitectural component, providing a simple yet rich representation of a program's performance characteristics across a large space of microarchitectural parameters. Experiments show that Concorde is more than five orders of magnitude faster than a reference cycle-level simulator, with about 2% average Cycles-Per-Instruction (CPI) prediction error across a range of SPEC, open-source, and proprietary benchmarks. This enables rapid design-space exploration and performance sensitivity analyses that are currently infeasible, e.g., in about an hour, we conducted a first-of-its-kind fine-grained performance attribution to different microarchitectural components across a diverse set of programs, requiring nearly 150 million CPI evaluations.
Submitted 29 March, 2025;
originally announced March 2025.
-
EchoLM: Accelerating LLM Serving with Real-time Knowledge Distillation
Authors:
Yifan Yu,
Yu Gan,
Lillian Tsai,
Nikhil Sarda,
Jiaming Shen,
Yanqi Zhou,
Arvind Krishnamurthy,
Fan Lai,
Henry M. Levy,
David Culler
Abstract:
Large language models (LLMs) have excelled in various applications, yet serving them at scale is challenging due to their substantial resource demands and high latency. Our real-world studies reveal that over 60% of user requests to LLMs have semantically similar counterparts, suggesting the potential for knowledge sharing among requests. However, naively caching and reusing past responses leads to large quality degradation. In this paper, we introduce EchoLM, an in-context caching system that leverages historical requests as examples to guide response generation, enabling selective offloading of requests to more efficient LLMs. However, enabling this real-time knowledge transfer leads to intricate tradeoffs between response quality, latency, and system throughput at scale. For a new request, EchoLM identifies similar, high-utility examples and efficiently prepends them to the input for better response. At scale, EchoLM adaptively routes requests to LLMs of varying capabilities, accounting for response quality and serving loads. EchoLM employs a cost-aware cache replay mechanism to improve example quality and coverage offline, maximizing cache utility and runtime efficiency. Evaluations on millions of open-source requests demonstrate that EchoLM has a throughput improvement of 1.4-5.9x while reducing latency by 28-71% without hurting response quality on average.
Submitted 24 January, 2025; v1 submitted 22 January, 2025;
originally announced January 2025.
-
Tide: A Split OS Architecture for Control Plane Offloading
Authors:
Jack Tigar Humphries,
Neel Natu,
Kostis Kaffes,
Stanko Novaković,
Paul Turner,
Hank Levy,
David Culler,
Christos Kozyrakis
Abstract:
The end of Moore's Law is driving cloud providers to offload virtualization and the network data plane to SmartNICs to improve compute efficiency. Even though individual OS control plane tasks consume up to 5% of cycles across the fleet, they remain on the host CPU because they are tightly intertwined with OS mechanisms. Moreover, offloading puts the slow PCIe interconnect in the critical path of OS decisions.
We propose Tide, a new split OS architecture that separates OS control plane policies from mechanisms and offloads the control plane policies onto a SmartNIC. Tide has a new host-SmartNIC communication API, state synchronization mechanism, and communication mechanisms that overcome the PCIe bottleneck, even for μs-scale workloads. Tide frees up host compute for applications and unlocks new optimization opportunities, including machine learning-driven policies, scheduling on the network I/O path, and reducing on-host interference. We demonstrate that Tide enables OS control planes that are competitive with on-host performance for the most difficult μs-scale workloads. Tide outperforms on-host control planes for memory management (saving 16 host cores), Stubby network RPCs (saving 8 cores), and GCE virtual machine management (11.2% performance improvement).
Submitted 20 October, 2024; v1 submitted 30 August, 2024;
originally announced August 2024.
-
Lovelock: Towards Smart NIC-hosted Clusters
Authors:
Seo Jin Park,
Ramesh Govindan,
Kai Shen,
David Culler,
Fatma Özcan,
Geon-Woo Kim,
Hank Levy
Abstract:
Traditional cluster designs were originally server-centric, and have evolved recently to support hardware acceleration and storage disaggregation. In applications that leverage acceleration, the server CPU orchestrates computation and data movement, and data-intensive applications stress the memory bandwidth. Applications that leverage disaggregation can be adversely affected by the increased PCIe and network bandwidth resulting from disaggregation. In this paper, we advocate for a specialized cluster design for important data-intensive applications, such as analytics, query processing, and ML training. This design, Lovelock, replaces each server in a cluster with one or more headless smart NICs. Because smart NICs are significantly cheaper than servers on bandwidth, the resulting cluster can run these applications without adversely impacting performance, while obtaining cost and energy savings.
Submitted 22 September, 2023;
originally announced September 2023.
-
MAGE: Nearly Zero-Cost Virtual Memory for Secure Computation
Authors:
Sam Kumar,
David E. Culler,
Raluca Ada Popa
Abstract:
Secure Computation (SC) is a family of cryptographic primitives for computing on encrypted data in single-party and multi-party settings. SC is being increasingly adopted by industry for a variety of applications. A significant obstacle to using SC for practical applications is the memory overhead of the underlying cryptography. We develop MAGE, an execution engine for SC that efficiently runs SC computations that do not fit in memory. We observe that, due to their intended security guarantees, SC schemes are inherently oblivious -- their memory access patterns are independent of the input data. Using this property, MAGE calculates the memory access pattern ahead of time and uses it to produce a memory management plan. This formulation of memory management, which we call memory programming, is a generalization of paging that allows MAGE to provide a highly efficient virtual memory abstraction for SC. MAGE outperforms the OS virtual memory system by up to an order of magnitude, and in many cases, runs SC computations that do not fit in memory at nearly the same speed as if the underlying machines had unbounded physical memory to fit the entire computation.
Submitted 27 October, 2022; v1 submitted 23 June, 2021;
originally announced June 2021.
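The MAGE abstract above describes computing a memory access pattern ahead of time and turning it into a memory management plan. A minimal sketch of that general idea follows. It assumes the access trace is fully known in advance and uses Belady's farthest-in-future eviction policy as a stand-in planner; the abstract does not state which planning algorithm MAGE uses. The function name `plan_evictions` and the tuple layout of the plan are hypothetical.

```python
def plan_evictions(trace, num_frames):
    """Build an eviction plan for a page-access trace that is known in advance.

    Returns a list of (step, evicted_page_or_None, loaded_page) tuples, one
    per page fault. Uses Belady's farthest-in-future policy (an assumption
    made for this sketch).
    """
    # For each position, record the next position at which the same page is used.
    next_use = [0] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    resident = {}  # page -> step of its next access
    plan = []
    for i, page in enumerate(trace):
        if page not in resident:
            evicted = None
            if len(resident) >= num_frames:
                # Evict the resident page whose next use is farthest away.
                evicted = max(resident, key=resident.get)
                del resident[evicted]
            plan.append((i, evicted, page))
        resident[page] = next_use[i]
    return plan
```

Since the whole plan is produced before execution begins, a runtime could in principle issue the loads and evictions ahead of the accesses that need them, which is only possible because the access pattern does not depend on the input data.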
-
Mr. Plotter: Unifying Data Reduction Techniques in Storage and Visualization Systems
Authors:
Sam Kumar,
Michael P Andersen,
David E. Culler
Abstract:
As the rate of data collection continues to grow rapidly, developing visualization tools that scale to immense data sets is a serious and ever-increasing challenge. Existing approaches generally seek to decouple storage and visualization systems, performing just-in-time data reduction to transparently avoid overloading the visualizer. We present a new architecture in which the visualizer and data store are tightly coupled. Unlike systems that read raw data from storage, the performance of our system scales linearly with the size of the final visualization, essentially independent of the size of the data. Thus, it scales to massive data sets while supporting interactive performance (sub-100 ms query latency). This enables a new class of visualization clients that automatically manage data, quickly and transparently requesting data from the underlying database without requiring the user to explicitly initiate queries. It lays a groundwork for supporting truly interactive exploration of big data and opens new directions for research on scalable information visualization systems.
Submitted 23 June, 2021;
originally announced June 2021.
-
10 Years Later: Cloud Computing is Closing the Performance Gap
Authors:
Giulia Guidi,
Marquita Ellis,
Aydin Buluc,
Katherine Yelick,
David Culler
Abstract:
Can cloud computing infrastructures provide HPC-competitive performance for scientific applications broadly? Despite prolific related literature, this question remains open. Answers are crucial for designing future systems and democratizing high-performance computing. We present a multi-level approach to investigate the performance gap between HPC and cloud computing, isolating different variables that contribute to this gap. Our experiments are divided into (i) hardware and system microbenchmarks and (ii) user application proxies. The results show that today's high-end cloud computing can deliver HPC-competitive performance not only for computationally intensive applications but also for memory- and communication-intensive applications - at least at modest scales - thanks to the high-speed memory systems and interconnects and dedicated batch scheduling now available on some cloud platforms.
Submitted 5 March, 2021; v1 submitted 1 November, 2020;
originally announced November 2020.
-
CoVista: A Unified View on Privacy Sensitive Mobile Contact Tracing Effort
Authors:
David Culler,
Prabal Dutta,
Gabe Fierro,
Joseph E. Gonzalez,
Nathan Pemberton,
Johann Schleier-Smith,
K. Shankari,
Alvin Wan,
Thomas Zachariah
Abstract:
Governments around the world have become increasingly frustrated with tech giants dictating public health policy. The software created by Apple and Google enables individuals to track their own potential exposure through collated exposure notifications. However, the same software prohibits location tracking, denying key information needed by public health officials for robust contact tracing. This information is needed to treat and isolate COVID-19 positive people, identify transmission hotspots, and protect against continued spread of infection. In this article, we present two simple ideas, the lighthouse and the covid-commons, that address the needs of public health authorities while preserving the privacy-sensitive goals of the Apple and Google exposure notification protocols.
Submitted 27 May, 2020;
originally announced May 2020.
-
JEDI: Many-to-Many End-to-End Encryption and Key Delegation for IoT
Authors:
Sam Kumar,
Yuncong Hu,
Michael P Andersen,
Raluca Ada Popa,
David E. Culler
Abstract:
As the Internet of Things (IoT) emerges over the next decade, developing secure communication for IoT devices is of paramount importance. Achieving end-to-end encryption for large-scale IoT systems, like smart buildings or smart cities, is challenging because multiple principals typically interact indirectly via intermediaries, meaning that the recipient of a message is not known in advance. This paper proposes JEDI (Joining Encryption and Delegation for IoT), a many-to-many end-to-end encryption protocol for IoT. JEDI encrypts and signs messages end-to-end, while conforming to the decoupled communication model typical of IoT systems. JEDI's keys support expiry and fine-grained access to data, common in IoT. Furthermore, JEDI allows principals to delegate their keys, restricted in expiry or scope, to other principals, thereby granting access to data and managing access control in a scalable, distributed way. Through careful protocol design and implementation, JEDI can run across the spectrum of IoT devices, including ultra low-power deeply embedded sensors severely constrained in CPU, memory, and energy consumption. We apply JEDI to an existing IoT messaging system and demonstrate that its overhead is modest.
Submitted 3 March, 2020; v1 submitted 30 May, 2019;
originally announced May 2019.
-
Performant TCP for Low-Power Wireless Networks
Authors:
Sam Kumar,
Michael P Andersen,
Hyung-Sin Kim,
David E. Culler
Abstract:
Low-power and lossy networks (LLNs) enable diverse applications integrating many resource-constrained embedded devices, often requiring interconnectivity with existing TCP/IP networks as part of the Internet of Things. But TCP has received little attention in LLNs due to concerns about its overhead and performance, leading to LLN-specific protocols that require specialized gateways for interoperability. We present a systematic study of a well-designed TCP stack in IEEE 802.15.4-based LLNs, based on the TCP protocol logic in FreeBSD. Through careful implementation and extensive experiments, we show that modern low-power sensor platforms are capable of running full-scale TCP and that TCP, counter to common belief, performs well despite the lossy nature of LLNs. By carefully studying the interaction between the transport and link layers, we identify subtle but important modifications to both, achieving TCP goodput within 25% of an upper bound (5-40x higher than prior results) and low-power operation commensurate to CoAP in a practical LLN application scenario. This suggests that a TCP-based transport layer, seamlessly interoperable with existing TCP/IP networks, is viable and performant in LLNs.
Submitted 28 February, 2020; v1 submitted 6 November, 2018;
originally announced November 2018.
-
A Berkeley View of Systems Challenges for AI
Authors:
Ion Stoica,
Dawn Song,
Raluca Ada Popa,
David Patterson,
Michael W. Mahoney,
Randy Katz,
Anthony D. Joseph,
Michael Jordan,
Joseph M. Hellerstein,
Joseph E. Gonzalez,
Ken Goldberg,
Ali Ghodsi,
David Culler,
Pieter Abbeel
Abstract:
With the increasing commoditization of computer vision, speech recognition and machine translation systems and the widespread deployment of learning-based back-end technologies such as digital advertising and intelligent infrastructures, AI (Artificial Intelligence) has moved from research labs to production. These changes have been made possible by unprecedented levels of data and computation, by methodological advances in machine learning, by innovations in systems software and architectures, and by the broad accessibility of these technologies.
The next generation of AI systems promises to accelerate these developments and increasingly impact our lives via frequent interactions and making (often mission-critical) decisions on our behalf, often in highly personalized contexts. Realizing this promise, however, raises daunting challenges. In particular, we need AI systems that make timely and safe decisions in unpredictable environments, that are robust against sophisticated adversaries, and that can process ever-increasing amounts of data across organizations and individuals without compromising confidentiality. These challenges will be exacerbated by the end of Moore's Law, which will constrain the amount of data these technologies can store and process. In this paper, we propose several open research directions in systems, architectures, and security that can address these challenges and help unlock AI's potential to improve lives and society.
Submitted 15 December, 2017;
originally announced December 2017.
-
Energy-Efficient Building HVAC Control Using Hybrid System LBMPC
Authors:
Anil Aswani,
Neal Master,
Jay Taneja,
Andrew Krioukov,
David Culler,
Claire Tomlin
Abstract:
Improving the energy efficiency of heating, ventilation, and air-conditioning (HVAC) systems has the potential to realize large economic and societal benefits. This paper concerns the system identification of a hybrid system model of a building-wide HVAC system and its subsequent control using a hybrid system formulation of learning-based model predictive control (LBMPC). Here, the learning refers to model updates to the hybrid system model that incorporate the heating effects due to occupancy, solar effects, outside air temperature (OAT), and equipment, in addition to integrator dynamics inherently present in low-level control. Though we make significant modeling simplifications, the controller that uses this model is able to experimentally achieve a large reduction in energy usage without any degradation in occupant comfort. It is in this way that we justify the modeling simplifications that we have made. We conclude by presenting results from experiments on our building HVAC testbed, which show an average of 1.5 MWh of energy savings per day (p = 0.002) with a 95% confidence interval of 1.0 MWh to 2.1 MWh of energy savings.
Submitted 20 April, 2012;
originally announced April 2012.