Search | arXiv e-print repository

TensorQC: Towards Scalable Distributed Quantum Computing via Tensor Networks

Abstract: A quantum processing unit (QPU) must contain a large number of high quality qubits to produce accurate results for problems at useful scales. In contrast, most scientific and industry classical computation workloads happen in parallel on distributed systems, which rely on copying data across multiple cores. Unfortunately, copying quantum data is theoretically prohibited due to the quantum non-clon… ▽ More A quantum processing unit (QPU) must contain a large number of high quality qubits to produce accurate results for problems at useful scales. In contrast, most scientific and industry classical computation workloads happen in parallel on distributed systems, which rely on copying data across multiple cores. Unfortunately, copying quantum data is theoretically prohibited due to the quantum non-cloning theory. Instead, quantum circuit cutting techniques cut a large quantum circuit into multiple smaller subcircuits, distribute the subcircuits on parallel QPUs and reconstruct the results with classical computing. Such techniques make distributed hybrid quantum computing (DHQC) a possibility but also introduce an exponential classical co-processing cost in the number of cuts and easily become intractable. This paper presents TensorQC, which leverages classical tensor networks to bring an exponential runtime advantage over state-of-the-art parallelization post-processing techniques. As a result, this paper demonstrates running benchmarks that are otherwise intractable for a standalone QPU and prior circuit cutting techniques. Specifically, this paper runs six realistic benchmarks using QPUs available nowadays and a single GPU, and reduces the QPU size and quality requirements by more than $10\times$ over purely quantum platforms. △ Less

Submitted 5 February, 2025; originally announced February 2025.

Comments: 14 pages, 14 figures

arXiv:2312.10244 [pdf, other]

Muchisim: A Simulation Framework for Design Exploration of Multi-Chip Manycore Systems

Authors: Marcelo Orenes-Vera, Esin Tureci, Margaret Martonosi, David Wentzlaff

Abstract: The design space exploration of scaled-out manycores for communication-intensive applications (e.g., graph analytics and sparse linear algebra) is hampered due to either lack of scalability or accuracy of existing frameworks at simulating data-dependent execution patterns. This paper presents MuchiSim, a novel parallel simulator designed to address these challenges when exploring the design space… ▽ More The design space exploration of scaled-out manycores for communication-intensive applications (e.g., graph analytics and sparse linear algebra) is hampered due to either lack of scalability or accuracy of existing frameworks at simulating data-dependent execution patterns. This paper presents MuchiSim, a novel parallel simulator designed to address these challenges when exploring the design space of distributed multi-chiplet manycore architectures. We evaluate MuchiSim at simulating systems with up to a million interconnected processing units (PUs) while modeling data movement and communication cycle by cycle. In addition to performance, MuchiSim reports the energy, area, and cost of the simulated system. It also comes with a benchmark application suite and two data visualization tools. MuchiSim supports various parallelization strategies and communication primitives such as task-based parallelization and message passing, making it highly relevant for architectures with software-managed coherence and distributed memory. Via a case study, we show that MuchiSim helps users explore the balance between memory and computation units and the constraints related to chiplet integration and inter-chip communication. MuchiSim enables evaluating new techniques or design parameters for systems at scales that are more realistic for modern parallel systems, opening the gate for further research in this area. △ Less

Submitted 22 April, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

Comments: In Proceedings of the 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)

arXiv:2312.09733 [pdf, other]

doi 10.1016/j.future.2024.04.060

Quantum-centric Supercomputing for Materials Science: A Perspective on Challenges and Future Directions

Authors: Yuri Alexeev, Maximilian Amsler, Paul Baity, Marco Antonio Barroca, Sanzio Bassini, Torey Battelle, Daan Camps, David Casanova, Young Jai Choi, Frederic T. Chong, Charles Chung, Chris Codella, Antonio D. Corcoles, James Cruise, Alberto Di Meglio, Jonathan Dubois, Ivan Duran, Thomas Eckl, Sophia Economou, Stephan Eidenbenz, Bruce Elmegreen, Clyde Fare, Ismael Faro, Cristina Sanz Fernández, Rodrigo Neumann Barros Ferreira , et al. (102 additional authors not shown)

Abstract: Computational models are an essential tool for the design, characterization, and discovery of novel materials. Hard computational tasks in materials science stretch the limits of existing high-performance supercomputing centers, consuming much of their simulation, analysis, and data resources. Quantum computing, on the other hand, is an emerging technology with the potential to accelerate many of… ▽ More Computational models are an essential tool for the design, characterization, and discovery of novel materials. Hard computational tasks in materials science stretch the limits of existing high-performance supercomputing centers, consuming much of their simulation, analysis, and data resources. Quantum computing, on the other hand, is an emerging technology with the potential to accelerate many of the computational tasks needed for materials science. In order to do that, the quantum technology must interact with conventional high-performance computing in several ways: approximate results validation, identification of hard problems, and synergies in quantum-centric supercomputing. In this paper, we provide a perspective on how quantum-centric supercomputing can help address critical computational problems in materials science, the challenges to face in order to solve representative use cases, and new suggested directions. △ Less

Submitted 19 September, 2024; v1 submitted 14 December, 2023; originally announced December 2023.

Comments: 65 pages, 15 figures; comments welcome

Journal ref: Future Generation Computer Systems, Volume 160, November 2024, Pages 666-710

arXiv:2311.15810 [pdf, other]

Tascade: Hardware Support for Atomic-free, Asynchronous and Efficient Reduction Trees

Authors: Marcelo Orenes-Vera, Esin Tureci, David Wentzlaff, Margaret Martonosi

Abstract: Graph search and sparse data-structure traversal workloads contain challenging irregular memory patterns on global data structures that need to be modified atomically. Distributed processing of these workloads has relied on server threads operating on their own data copies that are merged upon global synchronization. As parallelism increases within each server, the communication challenges that ar… ▽ More Graph search and sparse data-structure traversal workloads contain challenging irregular memory patterns on global data structures that need to be modified atomically. Distributed processing of these workloads has relied on server threads operating on their own data copies that are merged upon global synchronization. As parallelism increases within each server, the communication challenges that arose in distributed systems a decade ago are now being encountered within large manycore servers. Prior work has achieved scalability for sparse applications up to thousands of PUs on-chip, but does not scale further due to increasing communication distances and load-imbalance across PUs. To address these challenges we propose Tascade, a hardware-software co-design that offers support for storage-efficient data-private reductions as well as asynchronous and opportunistic reduction trees. Tascade introduces an execution model along with supporting hardware design that allows coalescing of data updates regionally and merges the data from these regions through cascaded updates. Together, Tascade innovations minimize communication and increase work balance in task-based parallelization schemes and scales up to a million PUs. We evaluate six applications and four datasets to provide a detailed analysis of Tascade's performance, power, and traffic-reduction gains over prior work. Our parallelization of Breadth-First-Search with RMAT-26 across a million PUs -- the largest of the literature -- reaches over 7600 GTEPS. △ Less

Submitted 22 April, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

arXiv:2311.15443 [pdf, other]

DCRA: A Distributed Chiplet-based Reconfigurable Architecture for Irregular Applications

Authors: Marcelo Orenes-Vera, Esin Tureci, Margaret Martonosi, David Wentzlaff

Abstract: In recent years, the growing demand to process large graphs and sparse datasets has led to increased research efforts to develop hardware- and software-based architectural solutions to accelerate them. While some of these approaches achieve scalable parallelization with up to thousands of cores, adaptation of these proposals by the industry remained slow. To help solve this dissonance, we identifi… ▽ More In recent years, the growing demand to process large graphs and sparse datasets has led to increased research efforts to develop hardware- and software-based architectural solutions to accelerate them. While some of these approaches achieve scalable parallelization with up to thousands of cores, adaptation of these proposals by the industry remained slow. To help solve this dissonance, we identified a set of questions and considerations that current research has not considered deeply. Starting from a tile-based architecture, we put forward a Distributed Chiplet-based Reconfigurable Architecture (DCRA) for irregular applications that carefully consider fabrication constraints that made prior work either hard or costly to implement or too rigid to be applied. We identify and study pre-silicon, package-time and compile-time configurations that help optimize DCRA for different deployments and target metrics. To enable that, we propose a practical path for manufacturing chip packages by composing variable numbers of DCRA and memory dies, with a software-configurable Torus network to connect them. We evaluate six applications and four datasets, with several configurations and memory technologies, to provide a detailed analysis of the performance, power, and cost of DCRA as a compute node for scale-out sparse data processing. Finally, we present our findings and discuss how DCRA's framework for design exploration can help guide architects to build scalable and cost-efficient systems for irregular applications. △ Less

Submitted 25 February, 2024; v1 submitted 26 November, 2023; originally announced November 2023.

arXiv:2309.09437 [pdf, other]

Using LLMs to Facilitate Formal Verification of RTL

Authors: Marcelo Orenes-Vera, Margaret Martonosi, David Wentzlaff

Abstract: Formal property verification (FPV) has existed for decades and has been shown to be effective at finding intricate RTL bugs. However, formal properties, such as those written as SystemVerilog Assertions (SVA), are time-consuming and error-prone to write, even for experienced users. Prior work has attempted to lighten this burden by raising the abstraction level so that SVA is generated from high-l… ▽ More Formal property verification (FPV) has existed for decades and has been shown to be effective at finding intricate RTL bugs. However, formal properties, such as those written as SystemVerilog Assertions (SVA), are time-consuming and error-prone to write, even for experienced users. Prior work has attempted to lighten this burden by raising the abstraction level so that SVA is generated from high-level specifications. However, this does not eliminate the manual effort of reasoning and writing about the detailed hardware behavior. Motivated by the increased need for FPV in the era of heterogeneous hardware and the advances in large language models (LLMs), we set out to explore whether LLMs can capture RTL behavior and generate correct SVA properties. First, we design an FPV-based evaluation framework that measures the correctness and completeness of SVA. Then, we evaluate GPT4 iteratively to craft the set of syntax and semantic rules needed to prompt it toward creating better SVA. We extend the open-source AutoSVA framework by integrating our improved GPT4-based flow to generate safety properties, in addition to facilitating their existing flow for liveness properties. Lastly, our use cases evaluate (1) the FPV coverage of GPT4-generated SVA on complex open-source RTL and (2) using generated SVA to prompt GPT4 to create RTL from scratch. Through these experiments, we find that GPT4 can generate correct SVA even for flawed RTL, without mirroring design errors. Particularly, it generated SVA that exposed a bug in the RISC-V CVA6 core that eluded the prior work's evaluation. △ Less

Submitted 28 September, 2023; v1 submitted 17 September, 2023; originally announced September 2023.

Comments: This work has been submitted to the IEEE for possible publication

arXiv:2305.03243 [pdf, other]

Microarchitectures for Heterogeneous Superconducting Quantum Computers

Authors: Samuel Stein, Sara Sussman, Teague Tomesh, Charles Guinn, Esin Tureci, Sophia Fuhui Lin, Wei Tang, James Ang, Srivatsan Chakram, Ang Li, Margaret Martonosi, Fred T. Chong, Andrew A. Houck, Isaac L. Chuang, Michael Austin DeMarco

Abstract: Noisy Intermediate-Scale Quantum Computing (NISQ) has dominated headlines in recent years, with the longer-term vision of Fault-Tolerant Quantum Computation (FTQC) offering significant potential albeit at currently intractable resource costs and quantum error correction (QEC) overheads. For problems of interest, FTQC will require millions of physical qubits with long coherence times, high-fidelity… ▽ More Noisy Intermediate-Scale Quantum Computing (NISQ) has dominated headlines in recent years, with the longer-term vision of Fault-Tolerant Quantum Computation (FTQC) offering significant potential albeit at currently intractable resource costs and quantum error correction (QEC) overheads. For problems of interest, FTQC will require millions of physical qubits with long coherence times, high-fidelity gates, and compact sizes to surpass classical systems. Just as heterogeneous specialization has offered scaling benefits in classical computing, it is likewise gaining interest in FTQC. However, systematic use of heterogeneity in either hardware or software elements of FTQC systems remains a serious challenge due to the vast design space and variable physical constraints. This paper meets the challenge of making heterogeneous FTQC design practical by introducing HetArch, a toolbox for designing heterogeneous quantum systems, and using it to explore heterogeneous design scenarios. Using a hierarchical approach, we successively break quantum algorithms into smaller operations (akin to classical application kernels), thus greatly simplifying the design space and resulting tradeoffs. Specializing to superconducting systems, we then design optimized heterogeneous hardware composed of varied superconducting devices, abstracting physical constraints into design rules that enable devices to be assembled into standard cells optimized for specific operations. Finally, we provide a heterogeneous design space exploration framework which reduces the simulation burden by a factor of 10^4 or more and allows us to characterize optimal design points. We use these techniques to design superconducting quantum modules for entanglement distillation, error correction, and code teleportation, reducing error rates by 2.6x, 10.7x, and 3.0x compared to homogeneous systems. △ Less

Submitted 4 May, 2023; originally announced May 2023.

arXiv:2304.09389 [pdf, other]

Massive Data-Centric Parallelism in the Chiplet Era

Authors: Marcelo Orenes-Vera, Esin Tureci, David Wentzlaff, Margaret Martonosi

Abstract: Recent works have introduced task-based parallelization schemes to accelerate graph search and sparse data-structure traversal, where some solutions scale up to thousands of processing units (PUs) on a single chip. However parallelizing these memory-intensive workloads across millions of cores requires a scalable communication scheme as well as designing a cost-efficient computing node that makes… ▽ More Recent works have introduced task-based parallelization schemes to accelerate graph search and sparse data-structure traversal, where some solutions scale up to thousands of processing units (PUs) on a single chip. However parallelizing these memory-intensive workloads across millions of cores requires a scalable communication scheme as well as designing a cost-efficient computing node that makes multi-node systems practical, which have not been addressed in previous research. To address these challenges, we propose a task-oriented scalable chiplet architecture for distributed execution (Tascade), a multi-node system design that we evaluate with up to 256 distributed chips -- over a million PUs. We introduce an execution model that scales to this level via proxy regions and selective cascading, which reduce overall communication and improve load balancing. In addition, package-time reconfiguration of our chiplet-based design enables creating chip products that optimized post-silicon for different target metrics, such as time-to-solution, energy, or cost. We evaluate six applications and four datasets, with several configurations and memory technologies to provide a detailed analysis of the performance, power, and cost of data-centric execution at a massive scale. Our parallelization of Breadth-First-Search with RMAT-26 across a million PUs -- the largest of the literature -- reaches 3021 GTEPS. △ Less

Submitted 11 August, 2023; v1 submitted 18 April, 2023; originally announced April 2023.

arXiv:2212.06167 [pdf, other]

Architectures for Multinode Superconducting Quantum Computers

Authors: James Ang, Gabriella Carini, Yanzhu Chen, Isaac Chuang, Michael Austin DeMarco, Sophia E. Economou, Alec Eickbusch, Andrei Faraon, Kai-Mei Fu, Steven M. Girvin, Michael Hatridge, Andrew Houck, Paul Hilaire, Kevin Krsulich, Ang Li, Chenxu Liu, Yuan Liu, Margaret Martonosi, David C. McKay, James Misewich, Mark Ritter, Robert J. Schoelkopf, Samuel A. Stein, Sara Sussman, Hong X. Tang , et al. (8 additional authors not shown)

Abstract: Many proposals to scale quantum technology rely on modular or distributed designs where individual quantum processors, called nodes, are linked together to form one large multinode quantum computer (MNQC). One scalable method to construct an MNQC is using superconducting quantum systems with optical interconnects. However, a limiting factor of these machines will be internode gates, which may be t… ▽ More Many proposals to scale quantum technology rely on modular or distributed designs where individual quantum processors, called nodes, are linked together to form one large multinode quantum computer (MNQC). One scalable method to construct an MNQC is using superconducting quantum systems with optical interconnects. However, a limiting factor of these machines will be internode gates, which may be two to three orders of magnitude noisier and slower than local operations. Surmounting the limitations of internode gates will require a range of techniques, including improvements in entanglement generation, the use of entanglement distillation, and optimized software and compilers, and it remains unclear how improvements to these components interact to affect overall system performance, what performance from each is required, or even how to quantify the performance of each. In this paper, we employ a `co-design' inspired approach to quantify overall MNQC performance in terms of hardware models of internode links, entanglement distillation, and local architecture. In the case of superconducting MNQCs with microwave-to-optical links, we uncover a tradeoff between entanglement generation and distillation that threatens to degrade performance. We show how to navigate this tradeoff, lay out how compilers should optimize between local and internode gates, and discuss when noisy quantum links have an advantage over purely classical links. Using these results, we introduce a roadmap for the realization of early MNQCs which illustrates potential improvements to the hardware and software of MNQCs and outlines criteria for evaluating the landscape, from progress in entanglement generation and quantum memory to dedicated algorithms such as distributed quantum phase estimation. While we focus on superconducting devices with optical interconnects, our approach is general across MNQC implementations. △ Less

Submitted 12 December, 2022; originally announced December 2022.

Comments: 23 pages, white paper

arXiv:2207.13219 [pdf, other]

doi 10.1109/HPCA56546.2023.10071089

Dalorex: A Data-Local Program Execution and Architecture for Memory-bound Applications

Authors: Marcelo Orenes-Vera, Esin Tureci, David Wentzlaff, Margaret Martonosi

Abstract: Applications with low data reuse and frequent irregular memory accesses, such as graph or sparse linear algebra workloads, fail to scale well due to memory bottlenecks and poor core utilization. While prior work with prefetching, decoupling, or pipelining can mitigate memory latency and improve core utilization, memory bottlenecks persist due to limited off-chip bandwidth. Approaches doing process… ▽ More Applications with low data reuse and frequent irregular memory accesses, such as graph or sparse linear algebra workloads, fail to scale well due to memory bottlenecks and poor core utilization. While prior work with prefetching, decoupling, or pipelining can mitigate memory latency and improve core utilization, memory bottlenecks persist due to limited off-chip bandwidth. Approaches doing processing in-memory (PIM) with Hybrid Memory Cube (HMC) overcome bandwidth limitations but fail to achieve high core utilization due to poor task scheduling and synchronization overheads. Moreover, the high memory-per-core ratio available with HMC limits strong scaling. We introduce Dalorex, a hardware-software co-design that achieves high parallelism and energy efficiency, demonstrating strong scaling with >16,000 cores when processing graph and sparse linear algebra workloads. Over the prior work in PIM, both using 256 cores, Dalorex improves performance and energy consumption by two orders of magnitude through (1) a tile-based distributed-memory architecture where each processing tile holds an equal amount of data, and all memory operations are local; (2) a task-based parallel programming model where tasks are executed by the processing unit that is co-located with the target data; (3) a network design optimized for irregular traffic, where all communication is one-way, and messages do not contain routing metadata; (4) novel traffic-aware task scheduling hardware that maintains high core utilization; and (5) a data placement strategy that improves work balance. This work proposes architectural and software innovations to provide the greatest scalability to date for running graph algorithms while still being programmable for other domains. △ Less

Submitted 4 May, 2023; v1 submitted 26 July, 2022; originally announced July 2022.

Comments: In Proceedings of the 29th IEEE Symposium on High-Performance Computer Architecture (HPCA-29)

Journal ref: 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)

arXiv:2207.00933 [pdf, other]

ScaleQC: A Scalable Framework for Hybrid Computation on Quantum and Classical Processors

Authors: Wei Tang, Margaret Martonosi

Abstract: Quantum processing unit (QPU) has to satisfy highly demanding quantity and quality requirements on its qubits to produce accurate results for problems at useful scales. Furthermore, classical simulations of quantum circuits generally do not scale. Instead, quantum circuit cutting techniques cut and distribute a large quantum circuit into multiple smaller subcircuits feasible for less powerful QPUs… ▽ More Quantum processing unit (QPU) has to satisfy highly demanding quantity and quality requirements on its qubits to produce accurate results for problems at useful scales. Furthermore, classical simulations of quantum circuits generally do not scale. Instead, quantum circuit cutting techniques cut and distribute a large quantum circuit into multiple smaller subcircuits feasible for less powerful QPUs. However, the classical post-processing incurred from the cutting introduces runtime and memory bottlenecks. Our tool, called ScaleQC, addresses the bottlenecks by developing novel algorithmic techniques including (1) a quantum states merging framework that quickly locates the solution states of large quantum circuits; (2) an automatic solver that cuts complex quantum circuits to fit on less powerful QPUs; and (3) a tensor network based post-processing that minimizes the classical overhead. Our experiments demonstrate both QPU requirement advantages over the purely quantum platforms, and runtime advantages over the purely classical platforms for benchmarks up to 1000 qubits. △ Less

Submitted 2 July, 2022; originally announced July 2022.

Comments: 12 pages, 13 figures

arXiv:2205.05836 [pdf, other]

Cutting Quantum Circuits to Run on Quantum and Classical Platforms

Authors: Wei Tang, Margaret Martonosi

Abstract: Quantum computing (QC) offers a new computing paradigm that has the potential to provide significant speedups over classical computing. Each additional qubit doubles the size of the computational state space available to a quantum algorithm. Such exponentially expanding reach underlies QC's power, but at the same time puts demanding requirements on the quantum processing units (QPU) hardware. On t… ▽ More Quantum computing (QC) offers a new computing paradigm that has the potential to provide significant speedups over classical computing. Each additional qubit doubles the size of the computational state space available to a quantum algorithm. Such exponentially expanding reach underlies QC's power, but at the same time puts demanding requirements on the quantum processing units (QPU) hardware. On the other hand, purely classical simulations of quantum circuits on either central processing unit (CPU) or graphics processing unit (GPU) scale poorly as they quickly become bottlenecked by runtime and memory. This paper introduces CutQC, a scalable hybrid computing approach that distributes a large quantum circuit onto quantum (QPU) and classical platforms (CPU or GPU) for co-processing. CutQC demonstrates evaluation of quantum circuits that are larger than the limit of QPU or classical simulation, and achieves much higher quantum circuit evaluation fidelity than the large NISQ devices achieve in real-system runs. △ Less

Submitted 11 May, 2022; originally announced May 2022.

Comments: 10 pages, 5 figures

arXiv:2203.12713 [pdf, other]

Optimized Quantum Program Execution Ordering to Mitigate Errors in Simulations of Quantum Systems

Authors: Teague Tomesh, Kaiwen Gui, Pranav Gokhale, Yunong Shi, Frederic T. Chong, Margaret Martonosi, Martin Suchara

Abstract: Simulating the time evolution of a physical system at quantum mechanical levels of detail -- known as Hamiltonian Simulation (HS) -- is an important and interesting problem across physics and chemistry. For this task, algorithms that run on quantum computers are known to be exponentially faster than classical algorithms; in fact, this application motivated Feynman to propose the construction of qu… ▽ More Simulating the time evolution of a physical system at quantum mechanical levels of detail -- known as Hamiltonian Simulation (HS) -- is an important and interesting problem across physics and chemistry. For this task, algorithms that run on quantum computers are known to be exponentially faster than classical algorithms; in fact, this application motivated Feynman to propose the construction of quantum computers. Nonetheless, there are challenges in reaching this performance potential. Prior work has focused on compiling circuits (quantum programs) for HS with the goal of maximizing either accuracy or gate cancellation. Our work proposes a compilation strategy that simultaneously advances both goals. At a high level, we use classical optimizations such as graph coloring and travelling salesperson to order the execution of quantum programs. Specifically, we group together mutually commuting terms in the Hamiltonian (a matrix characterizing the quantum mechanical system) to improve the accuracy of the simulation. We then rearrange the terms within each group to maximize gate cancellation in the final quantum circuit. These optimizations work together to improve HS performance and result in an average 40% reduction in circuit depth. This work advances the frontier of HS which in turn can advance physical and chemical modeling in both basic and applied sciences. △ Less

Submitted 23 March, 2022; originally announced March 2022.

Comments: 13 pages, 7 figures, Awarded Best Paper during the IEEE International Conference on Rebooting Computing (ICRC) 2021

arXiv:2202.11045 [pdf, other]

SupermarQ: A Scalable Quantum Benchmark Suite

Authors: Teague Tomesh, Pranav Gokhale, Victory Omole, Gokul Subramanian Ravi, Kaitlin N. Smith, Joshua Viszlai, Xin-Chuan Wu, Nikos Hardavellas, Margaret R. Martonosi, Frederic T. Chong

Abstract: The emergence of quantum computers as a new computational paradigm has been accompanied by speculation concerning the scope and timeline of their anticipated revolutionary changes. While quantum computing is still in its infancy, the variety of different architectures used to implement quantum computations make it difficult to reliably measure and compare performance. This problem motivates our in… ▽ More The emergence of quantum computers as a new computational paradigm has been accompanied by speculation concerning the scope and timeline of their anticipated revolutionary changes. While quantum computing is still in its infancy, the variety of different architectures used to implement quantum computations make it difficult to reliably measure and compare performance. This problem motivates our introduction of SupermarQ, a scalable, hardware-agnostic quantum benchmark suite which uses application-level metrics to measure performance. SupermarQ is the first attempt to systematically apply techniques from classical benchmarking methodology to the quantum domain. We define a set of feature vectors to quantify coverage, select applications from a variety of domains to ensure the suite is representative of real workloads, and collect benchmark results from the IBM, IonQ, and AQT@LBNL platforms. Looking forward, we envision that quantum benchmarking will encompass a large cross-community effort built on open source, constantly evolving benchmark suites. We introduce SupermarQ as an important step in this direction. △ Less

Submitted 27 April, 2022; v1 submitted 22 February, 2022; originally announced February 2022.

Comments: 17 pages, 4 figures, Awarded Best Paper during the 28th IEEE International Symposium on High-Performance Computer Architecture (HPCA-28), Seoul, South Korea

arXiv:2109.06132 [pdf, other]

doi 10.1145/3485508

Specifying and Testing GPU Workgroup Progress Models

Authors: Tyler Sorensen, Lucas F. Salvador, Harmit Raval, Hugues Evrard, John Wickerson, Margaret Martonosi, Alastair F. Donaldson

Abstract: As GPU availability has increased and programming support has matured, a wider variety of applications are being ported to these platforms. Many parallel applications contain fine-grained synchronization idioms; as such, their correct execution depends on a degree of relative forward progress between threads (or thread groups). Unfortunately, many GPU programming specifications say almost nothing… ▽ More As GPU availability has increased and programming support has matured, a wider variety of applications are being ported to these platforms. Many parallel applications contain fine-grained synchronization idioms; as such, their correct execution depends on a degree of relative forward progress between threads (or thread groups). Unfortunately, many GPU programming specifications say almost nothing about relative forward progress guarantees between workgroups. Although prior work has proposed a spectrum of plausible progress models for GPUs, cross-vendor specifications have yet to commit to any model. This work is a collection of tools experimental data to aid specification designers when considering forward progress guarantees in programming frameworks. As a foundation, we formalize a small parallel programming language that captures the essence of fine-grained synchronization. We then provide a means of formally specifying a progress model, and develop a termination oracle that decides whether a given program is guaranteed to eventually terminate with respect to a given progress model. Next, we formalize a constraint for concurrent programs that require relative forward progress to terminate. Using this constraint, we synthesize a large set of 483 progress litmus tests. Combined with the termination oracle, this allows us to determine the expected status of each litmus test -- i.e. whether it is guaranteed eventual termination -- under various progress models. We present a large experimental campaign running the litmus tests across 8 GPUs from 5 different vendors. Our results highlight that GPUs have significantly different termination behaviors under our test suite. Most notably, we find that Apple and ARM GPUs do not support the linear occupancy-bound model, an intuitive progress model defined by prior work and hypothesized to describe the workgroup schedulers of existing GPUs. △ Less

Submitted 13 September, 2021; originally announced September 2021.

Comments: OOPSLA 2021

arXiv:2107.07532 [pdf, other]

Divide and Conquer for Combinatorial Optimization and Distributed Quantum Computation

Authors: Teague Tomesh, Zain H. Saleem, Michael A. Perlin, Pranav Gokhale, Martin Suchara, Margaret Martonosi

Abstract: Scaling the size of monolithic quantum computer systems is a difficult task. As the number of qubits within a device increases, a number of factors contribute to decreases in yield and performance. To meet this challenge, distributed architectures composed of many networked quantum computers have been proposed as a viable path to scalability. Such systems will need algorithms and compilers that ar… ▽ More Scaling the size of monolithic quantum computer systems is a difficult task. As the number of qubits within a device increases, a number of factors contribute to decreases in yield and performance. To meet this challenge, distributed architectures composed of many networked quantum computers have been proposed as a viable path to scalability. Such systems will need algorithms and compilers that are tailored to their distributed architectures. In this work we introduce the Quantum Divide and Conquer Algorithm (QDCA), a hybrid variational approach to mapping large combinatorial optimization problems onto distributed quantum architectures. This is achieved through the combined use of graph partitioning and quantum circuit cutting. The QDCA, an example of application-compiler co-design, alters the structure of the variational ansatz to tame the exponential compilation overhead incurred by quantum circuit cutting. The result of this cross-layer co-design is a highly flexible algorithm which can be tuned to the amount of classical or quantum computational resources that are available, and can be applied to both near- and long-term distributed quantum architectures. We simulate the QDCA on instances of the Maximum Independent Set problem and find that it is able to outperform similar classical algorithms. We also evaluate an 8-qubit QDCA ansatz on a superconducting quantum computer and show that circuit cutting can help to mitigate the effects of noise. Our work demonstrates how many small-scale quantum computers can work together to solve problems $85\%$ larger than their own qubit count, motivating the development and potential of large-scale distributed quantum computing. △ Less

Submitted 12 October, 2023; v1 submitted 15 July, 2021; originally announced July 2021.

Comments: 12 pages, 10 figures, 1 table. This paper was published in the 2023 International Conference on Quantum Computing and Engineering (QCE) where it was awarded a Best Paper award

arXiv:2106.15490 [pdf, other]

doi 10.1109/ISCA52012.2021.00071

Designing calibration and expressivity-efficient instruction sets for quantum computing

Authors: Prakash Murali, Lingling Lao, Margaret Martonosi, Dan Browne

Abstract: Near-term quantum computing (QC) systems have limited qubit counts, high gate (instruction) error rates, and typically support a minimal instruction set having one type of two-qubit gate (2Q). To reduce program instruction counts and improve application expressivity, vendors have proposed, and shown proof-of-concept demonstrations of richer instruction sets such as XY gates (Rigetti) and fSim gate… ▽ More Near-term quantum computing (QC) systems have limited qubit counts, high gate (instruction) error rates, and typically support a minimal instruction set having one type of two-qubit gate (2Q). To reduce program instruction counts and improve application expressivity, vendors have proposed, and shown proof-of-concept demonstrations of richer instruction sets such as XY gates (Rigetti) and fSim gates (Google). These instruction sets comprise of families of 2Q gate types parameterized by continuous qubit rotation angles. However, having such a large number of gate types is problematic because each gate type has to be calibrated periodically, across the full system, to obtain high fidelity implementations. This results in substantial recurring calibration overheads even on current systems which use only a few gate types. Our work aims to navigate this tradeoff between application expressivity and calibration overhead, and identify what instructions vendors should implement to get the best expressivity with acceptable calibration time. We develop NuOp, a flexible compilation pass based on numerical optimization, to efficiently decompose application operations into arbitrary hardware gate types. Using NuOp and four important quantum applications, we study the instruction set proposals of Rigetti and Google, with realistic noise simulations and a calibration model. Our experiments show that implementing 4-8 types of 2Q gates is sufficient to attain nearly the same expressivity as a full continuous gate family, while reducing the calibration overhead by two orders of magnitude. With several vendors proposing rich gate families as means to higher fidelity, our work has potential to provide valuable instruction set design guidance for near-term QC systems. △ Less

Submitted 29 June, 2021; originally announced June 2021.

Comments: 14 pages, 11 figures

Journal ref: 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA):846-859

arXiv:2104.04003 [pdf, other]

AutoSVA: Democratizing Formal Verification of RTL Module Interactions

Authors: Marcelo Orenes-Vera, Aninda Manocha, David Wentzlaff, Margaret Martonosi

Abstract: Modern SoC design relies on the ability to separately verify IP blocks relative to their own specifications. Formal verification (FV) using SystemVerilog Assertions (SVA) is an effective method to exhaustively verify blocks at unit-level. Unfortunately, FV has a steep learning curve and requires engineering effort that discourages hardware designers from using it during RTL module development. We… ▽ More Modern SoC design relies on the ability to separately verify IP blocks relative to their own specifications. Formal verification (FV) using SystemVerilog Assertions (SVA) is an effective method to exhaustively verify blocks at unit-level. Unfortunately, FV has a steep learning curve and requires engineering effort that discourages hardware designers from using it during RTL module development. We propose AutoSVA, a framework to automatically generate FV testbenches that verify liveness and safety of control logic involved in module interactions. We demonstrate AutoSVA's effectiveness and efficiency on deadlock-critical modules of widely-used open-source hardware projects. △ Less

Submitted 8 April, 2021; originally announced April 2021.

arXiv:2103.17226 [pdf, other]

Logical Abstractions for Noisy Variational Quantum Algorithm Simulation

Authors: Yipeng Huang, Steven Holtzen, Todd Millstein, Guy Van den Broeck, Margaret Martonosi

Abstract: Due to the unreliability and limited capacity of existing quantum computer prototypes, quantum circuit simulation continues to be a vital tool for validating next generation quantum computers and for studying variational quantum algorithms, which are among the leading candidates for useful quantum computation. Existing quantum circuit simulators do not address the common traits of variational algo… ▽ More Due to the unreliability and limited capacity of existing quantum computer prototypes, quantum circuit simulation continues to be a vital tool for validating next generation quantum computers and for studying variational quantum algorithms, which are among the leading candidates for useful quantum computation. Existing quantum circuit simulators do not address the common traits of variational algorithms, namely: 1) their ability to work with noisy qubits and operations, 2) their repeated execution of the same circuits but with different parameters, and 3) the fact that they sample from circuit final wavefunctions to drive a classical optimization routine. We present a quantum circuit simulation toolchain based on logical abstractions targeted for simulating variational algorithms. Our proposed toolchain encodes quantum amplitudes and noise probabilities in a probabilistic graphical model, and it compiles the circuits to logical formulas that support efficient repeated simulation of and sampling from quantum circuits for different parameters. Compared to state-of-the-art state vector and density matrix quantum circuit simulators, our simulation approach offers greater performance when sampling from noisy circuits with at least eight to 20 qubits and with around 12 operations on each qubit, making the approach ideal for simulating near-term variational quantum algorithms. And for simulating noise-free shallow quantum circuits with 32 qubits, our simulation approach offers a $66\times$ reduction in sampling cost versus quantum circuit simulation techniques based on tensor network contraction. △ Less

Submitted 31 March, 2021; originally announced March 2021.

Comments: ASPLOS '21, April 19-23, 2021, Virtual, USA

arXiv:2012.14968 [pdf, other]

Optimizing IoT and Web Traffic Using Selective Edge Compression

Authors: Themis Melissaris, Kelly Shaw, Margaret Martonosi

Abstract: Internet of Things (IoT) devices and applications are generating and communicating vast quantities of data, and the rate of data collection is increasing rapidly. These high communication volumes are challenging for energy-constrained, data-capped, wireless mobile devices and networked sensors. Compression is commonly used to reduce web traffic, to save energy, and to make network transfers faster… ▽ More Internet of Things (IoT) devices and applications are generating and communicating vast quantities of data, and the rate of data collection is increasing rapidly. These high communication volumes are challenging for energy-constrained, data-capped, wireless mobile devices and networked sensors. Compression is commonly used to reduce web traffic, to save energy, and to make network transfers faster. If not used judiciously, however, compression can hurt performance. This work proposes and evaluates mechanisms that employ selective compression at the network's edge, based on data characteristics and network conditions. This approach (i) improves the performance of network transfers in IoT environments, while (ii) providing significant data savings. We demonstrate that our library speeds up web transfers by an average of 2.18x and 2.03x under fixed and dynamically changing network conditions respectively. Furthermore, it also provides consistent data savings, compacting data down to 19% of the original data size. △ Less

Submitted 29 December, 2020; originally announced December 2020.

Comments: 10 pages

arXiv:2012.02333 [pdf]

doi 10.1145/3445814.3446758

CutQC: Using Small Quantum Computers for Large Quantum Circuit Evaluations

Authors: Wei Tang, Teague Tomesh, Martin Suchara, Jeffrey Larson, Margaret Martonosi

Abstract: Quantum computing (QC) is a new paradigm offering the potential of exponential speedups over classical computing for certain computational problems. Each additional qubit doubles the size of the computational state space available to a QC algorithm. This exponential scaling underlies QC's power, but today's Noisy Intermediate-Scale Quantum (NISQ) devices face significant engineering challenges in… ▽ More Quantum computing (QC) is a new paradigm offering the potential of exponential speedups over classical computing for certain computational problems. Each additional qubit doubles the size of the computational state space available to a QC algorithm. This exponential scaling underlies QC's power, but today's Noisy Intermediate-Scale Quantum (NISQ) devices face significant engineering challenges in scalability. The set of quantum circuits that can be reliably run on NISQ devices is limited by their noisy operations and low qubit counts. This paper introduces CutQC, a scalable hybrid computing approach that combines classical computers and quantum computers to enable evaluation of quantum circuits that cannot be run on classical or quantum computers alone. CutQC cuts large quantum circuits into smaller subcircuits, allowing them to be executed on smaller quantum devices. Classical postprocessing can then reconstruct the output of the original circuit. This approach offers significant runtime speedup compared with the only viable current alternative--purely classical simulations--and demonstrates evaluation of quantum circuits that are larger than the limit of QC or classical simulation. Furthermore, in real-system runs, CutQC achieves much higher quantum circuit evaluation fidelity using small prototype quantum computers than the state-of-the-art large NISQ devices achieve. Overall, this hybrid approach allows users to leverage classical and quantum computing resources to evaluate quantum programs far beyond the reach of either one alone. △ Less

Submitted 18 March, 2021; v1 submitted 3 December, 2020; originally announced December 2020.

Comments: 14 pages, 12 figures, In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '21), April 19-23, 2021, Virtual, USA. ACM, New York, NY, USA

arXiv:2011.00028 [pdf, other]

doi 10.1109/JPROC.2020.2994765

Resource-Efficient Quantum Computing by Breaking Abstractions

Authors: Yunong Shi, Pranav Gokhale, Prakash Murali, Jonathan M. Baker, Casey Duckering, Yongshan Ding, Natalie C. Brown, Christopher Chamberland, Ali Javadi Abhari, Andrew W. Cross, David I. Schuster, Kenneth R. Brown, Margaret Martonosi, Frederic T. Chong

Abstract: Building a quantum computer that surpasses the computational power of its classical counterpart is a great engineering challenge. Quantum software optimizations can provide an accelerated pathway to the first generation of quantum computing applications that might save years of engineering effort. Current quantum software stacks follow a layered approach similar to the stack of classical computers… ▽ More Building a quantum computer that surpasses the computational power of its classical counterpart is a great engineering challenge. Quantum software optimizations can provide an accelerated pathway to the first generation of quantum computing applications that might save years of engineering effort. Current quantum software stacks follow a layered approach similar to the stack of classical computers, which was designed to manage the complexity. In this review, we point out that greater efficiency of quantum computing systems can be achieved by breaking the abstractions between these layers. We review several works along this line, including two hardware-aware compilation optimizations that break the quantum Instruction Set Architecture (ISA) abstraction and two error-correction/information-processing schemes that break the qubit abstraction. Last, we discuss several possible future directions. △ Less

Submitted 30 October, 2020; originally announced November 2020.

Comments: Invited paper by Proceedings of IEEE special issue

Journal ref: in Proceedings of the IEEE, vol. 108, no. 8, pp. 1353-1370, Aug. 2020

arXiv:2008.03578 [pdf, other]

doi 10.1109/ISCA45697.2020.00076

TransForm: Formally Specifying Transistency Models and Synthesizing Enhanced Litmus Tests

Authors: Naorin Hossain, Caroline Trippel, Margaret Martonosi

Abstract: Memory consistency models (MCMs) specify the legal ordering and visibility of shared memory accesses in a parallel program. Traditionally, instruction set architecture (ISA) MCMs assume that relevant program-visible memory ordering behaviors only result from shared memory interactions that take place between user-level program instructions. This assumption fails to account for virtual memory (VM)… ▽ More Memory consistency models (MCMs) specify the legal ordering and visibility of shared memory accesses in a parallel program. Traditionally, instruction set architecture (ISA) MCMs assume that relevant program-visible memory ordering behaviors only result from shared memory interactions that take place between user-level program instructions. This assumption fails to account for virtual memory (VM) implementations that may result in additional shared memory interactions between user-level program instructions and both 1) system-level operations (e.g., address remappings and translation lookaside buffer invalidations initiated by system calls) and 2) hardware-level operations (e.g., hardware page table walks and dirty bit updates) during a user-level program's execution. These additional shared memory interactions can impact the observable memory ordering behaviors of user-level programs. Thus, memory transistency models (MTMs) have been coined as a superset of MCMs to additionally articulate VM-aware consistency rules. However, no prior work has enabled formal MTM specifications, nor methods to support their automated analysis. To fill the above gap, this paper presents the TransForm framework. First, TransForm features an axiomatic vocabulary for formally specifying MTMs. Second, TransForm includes a synthesis engine to support the automated generation of litmus tests enhanced with MTM features (i.e., enhanced litmus tests, or ELTs) when supplied with a TransForm MTM specification. As a case study, we formally define an estimated MTM for Intel x86 processors, called x86t_elt, that is based on observations made by an ELT-based evaluation of an Intel x86 MTM implementation from prior work and available public documentation. Given x86t_elt and a synthesis bound as input, TransForm's synthesis engine successfully produces a set of ELTs including relevant ELTs from prior work. △ Less

Submitted 11 August, 2020; v1 submitted 8 August, 2020; originally announced August 2020.

Comments: *This is an updated version of the TransForm paper that features updated results reflecting performance optimizations and software bug fixes. 14 pages, 11 figures, Proceedings of the 47th Annual International Symposium on Computer Architecture (ISCA)

arXiv:2004.08539 [pdf, other]

doi 10.1109/ISCA45697.2020.00054

SQUARE: Strategic Quantum Ancilla Reuse for Modular Quantum Programs via Cost-Effective Uncomputation

Authors: Yongshan Ding, Xin-Chuan Wu, Adam Holmes, Ash Wiseth, Diana Franklin, Margaret Martonosi, Frederic T. Chong

Abstract: Compiling high-level quantum programs to machines that are size constrained (i.e. limited number of quantum bits) and time constrained (i.e. limited number of quantum operations) is challenging. In this paper, we present SQUARE (Strategic QUantum Ancilla REuse), a compilation infrastructure that tackles allocation and reclamation of scratch qubits (called ancilla) in modular quantum programs. At i… ▽ More Compiling high-level quantum programs to machines that are size constrained (i.e. limited number of quantum bits) and time constrained (i.e. limited number of quantum operations) is challenging. In this paper, we present SQUARE (Strategic QUantum Ancilla REuse), a compilation infrastructure that tackles allocation and reclamation of scratch qubits (called ancilla) in modular quantum programs. At its core, SQUARE strategically performs uncomputation to create opportunities for qubit reuse. Current Noisy Intermediate-Scale Quantum (NISQ) computers and forward-looking Fault-Tolerant (FT) quantum computers have fundamentally different constraints such as data locality, instruction parallelism, and communication overhead. Our heuristic-based ancilla-reuse algorithm balances these considerations and fits computations into resource-constrained NISQ or FT quantum machines, throttling parallelism when necessary. To precisely capture the workload of a program, we propose an improved metric, the "active quantum volume," and use this metric to evaluate the effectiveness of our algorithm. Our results show that SQUARE improves the average success rate of NISQ applications by 1.47X. Surprisingly, the additional gates for uncomputation create ancilla with better locality, and result in substantially fewer swap gates and less gate noise overall. SQUARE also achieves an average reduction of 1.5X (and up to 9.6X) in active quantum volume for FT machines. △ Less

Submitted 25 June, 2020; v1 submitted 18 April, 2020; originally announced April 2020.

Comments: 14 pages, 10 figures

arXiv:2004.07415 [pdf, other]

The MosaicSim Simulator (Full Technical Report)

Authors: Opeoluwa Matthews, Aninda Manocha, Davide Giri, Marcelo Orenes-Vera, Esin Tureci, Tyler Sorensen, Tae Jun Ham, Juan L. Aragón, Luca P. Carloni, Margaret Martonosi

Abstract: As Moore's Law has slowed and Dennard Scaling has ended, architects are increasingly turning to heterogeneous parallelism and domain-specific hardware-software co-designs. These trends present new challenges for simulation-based performance assessments that are central to early-stage architectural exploration. Simulators must be lightweight to support rich heterogeneous combinations of general pur… ▽ More As Moore's Law has slowed and Dennard Scaling has ended, architects are increasingly turning to heterogeneous parallelism and domain-specific hardware-software co-designs. These trends present new challenges for simulation-based performance assessments that are central to early-stage architectural exploration. Simulators must be lightweight to support rich heterogeneous combinations of general purpose cores and specialized processing units. They must also support agile exploration of hardware-software co-design, i.e. changes in the programming model, compiler, ISA, and specialized hardware. To meet these challenges, we introduce MosaicSim, a lightweight, modular simulator for heterogeneous systems, offering accuracy and agility designed specifically for hardware-software co-design explorations. By integrating the LLVM toolchain, MosaicSim enables efficient modeling of instruction dependencies and flexible additions across the stack. Its modularity also allows the composition and integration of different hardware components. We first demonstrate that MosaicSim captures architectural bottlenecks in applications, and accurately models both scaling trends in a multicore setting and accelerator behavior. We then present two case-studies where MosaicSim enables straightforward design space explorations for emerging systems, i.e. data science application acceleration and heterogeneous parallel architectures. △ Less

Submitted 15 April, 2020; originally announced April 2020.

Comments: This is a full technical report on the MosaicSim simulator. This version is a variation of the original ISPASS publication with additions describing the accuracy of MosaicSim's memory hierarchy performance modeling and additional hardware features, e.g. branch predictors. This technical report will be maintained as the MosaicSim developers continue to augment the simulator with more features

arXiv:2004.04706 [pdf, other]

Architecting Noisy Intermediate-Scale Trapped Ion Quantum Computers

Authors: Prakash Murali, Dripto M. Debroy, Kenneth R. Brown, Margaret Martonosi

Abstract: Trapped ions (TI) are a leading candidate for building Noisy Intermediate-Scale Quantum (NISQ) hardware. TI qubits have fundamental advantages over other technologies such as superconducting qubits, including high qubit quality, coherence and connectivity. However, current TI systems are small in size, with 5-20 qubits and typically use a single trap architecture which has fundamental scalability… ▽ More Trapped ions (TI) are a leading candidate for building Noisy Intermediate-Scale Quantum (NISQ) hardware. TI qubits have fundamental advantages over other technologies such as superconducting qubits, including high qubit quality, coherence and connectivity. However, current TI systems are small in size, with 5-20 qubits and typically use a single trap architecture which has fundamental scalability limitations. To progress towards the next major milestone of 50-100 qubits, a modular architecture termed the Quantum Charge Coupled Device (QCCD) has been proposed. In a QCCD-based TI device, small traps are connected through ion shuttling. While the basic hardware components for such devices have been demonstrated, building a 50-100 qubit system is challenging because of a wide range of design possibilities for trap sizing, communication topology and gate implementations and the need to match diverse application resource requirements. Towards realizing QCCD systems with 50-100 qubits, we perform an extensive architectural study evaluating the key design choices of trap sizing, communication topology and operation implementation methods. We built a design toolflow which takes a QCCD architecture's parameters as input, along with a set of applications and realistic hardware performance models. Our toolflow maps the applications onto the target device and simulates their execution to compute metrics such as application run time, reliability and device noise rates. Using six applications and several hardware design points, we show that trap sizing and communication topology choices can impact application reliability by up to three orders of magnitude. Microarchitectural gate implementation choices influence reliability by another order of magnitude. From these studies, we provide concrete recommendations to tune these choices to achieve highly reliable and performant application executions. △ Less

Submitted 9 April, 2020; originally announced April 2020.

Comments: Published in ISCA 2020 https://www.iscaconf.org/isca2020/program/ (please cite the ISCA version)

arXiv:2003.04892 [pdf, other]

RealityCheck: Bringing Modularity, Hierarchy, and Abstraction to Automated Microarchitectural Memory Consistency Verification

Authors: Yatin A. Manerkar, Daniel Lustig, Margaret Martonosi

Abstract: Modern SoCs are heterogeneous parallel systems comprised of components developed by distinct teams and possibly even different vendors. The memory consistency model (MCM) of processors in such SoCs specifies the ordering rules which constrain the values that can be read by load instructions in parallel programs running on such systems. The implementation of required MCM orderings can span componen… ▽ More Modern SoCs are heterogeneous parallel systems comprised of components developed by distinct teams and possibly even different vendors. The memory consistency model (MCM) of processors in such SoCs specifies the ordering rules which constrain the values that can be read by load instructions in parallel programs running on such systems. The implementation of required MCM orderings can span components which may be designed and implemented by many different teams. Ideally, each team would be able to specify the orderings enforced by their components independently and then connect them together when conducting MCM verification. However, no prior automated approach for formal hardware MCM verification provided this. To bring automated hardware MCM verification in line with the realities of the design process, we present RealityCheck, a methodology and tool for automated formal MCM verification of modular microarchitectural ordering specifications. RealityCheck allows users to specify their designs as a hierarchy of distinct modules connected to each other rather than a single flat specification. It can then automatically verify litmus test programs against these modular specifications. RealityCheck also provides support for abstraction, which enables scalable verification by breaking up the verification of the entire design into smaller verification problems. We present results for verifying litmus tests on 7 different designs using RealityCheck. These include in-order and out-of-order pipelines, a non-blocking cache, and a heterogeneous processor. Our case studies cover the TSO and RISC-V (RVWMO) weak memory models. RealityCheck is capable of verifying 98 RVWMO litmus tests in under 4 minutes each, and its capability for abstraction enables up to a 32.1% reduction in litmus test verification time for RVWMO. △ Less

Submitted 9 March, 2020; originally announced March 2020.

arXiv:2001.05983 [pdf, other]

Term Grouping and Travelling Salesperson for Digital Quantum Simulation

Authors: Kaiwen Gui, Teague Tomesh, Pranav Gokhale, Yunong Shi, Frederic T. Chong, Margaret Martonosi, Martin Suchara

Abstract: Digital simulation of quantum dynamics by evaluating the time evolution of a Hamiltonian is the initially proposed application of quantum computing. The large number of quantum gates required for emulating the complete second quantization form of the Hamiltonian, however, makes such an approach unsuitable for near-term devices with limited gate fidelities that cause high physical errors. In additi… ▽ More Digital simulation of quantum dynamics by evaluating the time evolution of a Hamiltonian is the initially proposed application of quantum computing. The large number of quantum gates required for emulating the complete second quantization form of the Hamiltonian, however, makes such an approach unsuitable for near-term devices with limited gate fidelities that cause high physical errors. In addition, Trotter error caused by noncommuting terms can accumulate and harm the overall circuit fidelity, thus causing algorithmic errors. In this paper, we propose a new term ordering strategy, max-commute-tsp (MCTSP), that simultaneously mitigates both algorithmic and physical errors. First, we improve the Trotter fidelity compared with previously proposed optimization by reordering Pauli terms and partitioning them into commuting families. We demonstrate the practicality of this method by constructing and evaluating quantum circuits that simulate different molecular Hamiltonians, together with theoretical explanations for the fidelity improvements from our term grouping method. Second, we describe a new gate cancellation technique that reduces the high gate counts by formulating the gate cancellation problem as a travelling salesperson problem, together with benchmarking experiments. Finally, we also provide benchmarking results that demonstrate the combined advantage of max-commute-tsp to mitigate both physical and algorithmic errors via quantum circuit simulation under realistic noise models. △ Less

Submitted 12 March, 2021; v1 submitted 16 January, 2020; originally announced January 2020.

arXiv:2001.02826 [pdf, other]

doi 10.1145/3373376.3378477

Software Mitigation of Crosstalk on Noisy Intermediate-Scale Quantum Computers

Authors: Prakash Murali, David C. McKay, Margaret Martonosi, Ali Javadi-Abhari

Abstract: Crosstalk is a major source of noise in Noisy Intermediate-Scale Quantum (NISQ) systems and is a fundamental challenge for hardware design. When multiple instructions are executed in parallel, crosstalk between the instructions can corrupt the quantum state and lead to incorrect program execution. Our goal is to mitigate the application impact of crosstalk noise through software techniques. This r… ▽ More Crosstalk is a major source of noise in Noisy Intermediate-Scale Quantum (NISQ) systems and is a fundamental challenge for hardware design. When multiple instructions are executed in parallel, crosstalk between the instructions can corrupt the quantum state and lead to incorrect program execution. Our goal is to mitigate the application impact of crosstalk noise through software techniques. This requires (i) accurate characterization of hardware crosstalk, and (ii) intelligent instruction scheduling to serialize the affected operations. Since crosstalk characterization is computationally expensive, we develop optimizations which reduce the characterization overhead. On three 20-qubit IBMQ systems, we demonstrate two orders of magnitude reduction in characterization time (compute time on the QC device) compared to all-pairs crosstalk measurements. Informed by these characterization, we develop a scheduler that judiciously serializes high crosstalk instructions balancing the need to mitigate crosstalk and exponential decoherence errors from serialization. On real-system runs on three IBMQ systems, our scheduler improves the error rate of application circuits by up to 5.6x, compared to the IBM instruction scheduler and offers near-optimal crosstalk mitigation in practice. In a broader picture, the difficulty of mitigating crosstalk has recently driven QC vendors to move towards sparser qubit connectivity or disabling nearby operations entirely in hardware, which can be detrimental to performance. Our work makes the case for software mitigation of crosstalk errors. △ Less

Submitted 8 January, 2020; originally announced January 2020.

Comments: To appear in ASPLOS 2020

arXiv:1907.13623 [pdf, other]

Minimizing State Preparations in Variational Quantum Eigensolver by Partitioning into Commuting Families

Authors: Pranav Gokhale, Olivia Angiuli, Yongshan Ding, Kaiwen Gui, Teague Tomesh, Martin Suchara, Margaret Martonosi, Frederic T. Chong

Abstract: Variational quantum eigensolver (VQE) is a promising algorithm suitable for near-term quantum machines. VQE aims to approximate the lowest eigenvalue of an exponentially sized matrix in polynomial time. It minimizes quantum resource requirements both by co-processing with a classical processor and by structuring computation into many subproblems. Each quantum subproblem involves a separate state p… ▽ More Variational quantum eigensolver (VQE) is a promising algorithm suitable for near-term quantum machines. VQE aims to approximate the lowest eigenvalue of an exponentially sized matrix in polynomial time. It minimizes quantum resource requirements both by co-processing with a classical processor and by structuring computation into many subproblems. Each quantum subproblem involves a separate state preparation terminated by the measurement of one Pauli string. However, the number of such Pauli strings scales as $N^4$ for typical problems of interest--a daunting growth rate that poses a serious limitation for emerging applications such as quantum computational chemistry. We introduce a systematic technique for minimizing requisite state preparations by exploiting the simultaneous measurability of partitions of commuting Pauli strings. Our work encompasses algorithms for efficiently approximating a MIN-COMMUTING-PARTITION, as well as a synthesis tool for compiling simultaneous measurement circuits. For representative problems, we achieve 8-30x reductions in state preparations, with minimal overhead in measurement circuit cost. We demonstrate experimental validation of our techniques by estimating the ground state energy of deuteron on an IBM Q 20-qubit machine. We also investigate the underlying statistics of simultaneous measurement and devise an adaptive strategy for mitigating harmful covariance terms. △ Less

Submitted 31 July, 2019; originally announced July 2019.

arXiv:1905.11349 [pdf, other]

Full-Stack, Real-System Quantum Computer Studies: Architectural Comparisons and Design Insights

Authors: Prakash Murali, Norbert Matthias Linke, Margaret Martonosi, Ali Javadi Abhari, Nhung Hong Nguyen, Cinthia Huerta Alderete

Abstract: In recent years, Quantum Computing (QC) has progressed to the point where small working prototypes are available for use. Termed Noisy Intermediate-Scale Quantum (NISQ) computers, these prototypes are too small for large benchmarks or even for Quantum Error Correction, but they do have sufficient resources to run small benchmarks, particularly if compiled with optimizations to make use of scarce q… ▽ More In recent years, Quantum Computing (QC) has progressed to the point where small working prototypes are available for use. Termed Noisy Intermediate-Scale Quantum (NISQ) computers, these prototypes are too small for large benchmarks or even for Quantum Error Correction, but they do have sufficient resources to run small benchmarks, particularly if compiled with optimizations to make use of scarce qubits and limited operation counts and coherence times. QC has not yet, however, settled on a particular preferred device implementation technology, and indeed different NISQ prototypes implement qubits with very different physical approaches and therefore widely-varying device and machine characteristics. Our work performs a full-stack, benchmark-driven hardware-software analysis of QC systems. We evaluate QC architectural possibilities, software-visible gates, and software optimizations to tackle fundamental design questions about gate set choices, communication topology, the factors affecting benchmark performance and compiler optimizations. In order to answer key cross-technology and cross-platform design questions, our work has built the first top-to-bottom toolflow to target different qubit device technologies, including superconducting and trapped ion qubits which are the current QC front-runners. We use our toolflow, TriQ, to conduct {\em real-system} measurements on 7 running QC prototypes from 3 different groups, IBM, Rigetti, and University of Maryland. From these real-system experiences at QC's hardware-software interface, we make observations about native and software-visible gates for different QC technologies, communication topologies, and the value of noise-aware compilation even on lower-noise platforms. This is the largest cross-platform real-system QC study performed thus far; its results have the potential to inform both QC device and compiler design going forward. △ Less

Submitted 11 June, 2019; v1 submitted 27 May, 2019; originally announced May 2019.

Comments: Preprint of a publication in ISCA 2019

arXiv:1905.09721 [pdf, other]

doi 10.1145/3307650.3322213

Statistical Assertions for Validating Patterns and Finding Bugs in Quantum Programs

Authors: Yipeng Huang, Margaret Martonosi

Abstract: In support of the growing interest in quantum computing experimentation, programmers need new tools to write quantum algorithms as program code. Compared to debugging classical programs, debugging quantum programs is difficult because programmers have limited ability to probe the internal states of quantum programs; those states are difficult to interpret even when observations exist; and programm… ▽ More In support of the growing interest in quantum computing experimentation, programmers need new tools to write quantum algorithms as program code. Compared to debugging classical programs, debugging quantum programs is difficult because programmers have limited ability to probe the internal states of quantum programs; those states are difficult to interpret even when observations exist; and programmers do not yet have guidelines for what to check for when building quantum programs. In this work, we present quantum program assertions based on statistical tests on classical observations. These allow programmers to decide if a quantum program state matches its expected value in one of classical, superposition, or entangled types of states. We extend an existing quantum programming language with the ability to specify quantum assertions, which our tool then checks in a quantum program simulator. We use these assertions to debug three benchmark quantum programs in factoring, search, and chemistry. We share what types of bugs are possible, and lay out a strategy for using quantum programming patterns to place assertions and prevent bugs. △ Less

Submitted 20 May, 2019; originally announced May 2019.

Comments: In The 46th Annual International Symposium on Computer Architecture (ISCA '19). arXiv admin note: text overlap with arXiv:1811.05447

arXiv:1904.11528 [pdf, other]

doi 10.1016/j.micpro.2019.02.007

Resource Optimized Quantum Architectures for Surface Code Implementations of Magic-State Distillation

Authors: Adam Holmes, Yongshan Ding, Ali Javadi-Abhari, Diana Franklin, Margaret Martonosi, Frederic T. Chong

Abstract: Quantum computers capable of solving classically intractable problems are under construction, and intermediate-scale devices are approaching completion. Current efforts to design large-scale devices require allocating immense resources to error correction, with the majority dedicated to the production of high-fidelity ancillary states known as magic-states. Leading techniques focus on dedicating a… ▽ More Quantum computers capable of solving classically intractable problems are under construction, and intermediate-scale devices are approaching completion. Current efforts to design large-scale devices require allocating immense resources to error correction, with the majority dedicated to the production of high-fidelity ancillary states known as magic-states. Leading techniques focus on dedicating a large, contiguous region of the processor as a single "magic-state distillation factory" responsible for meeting the magic-state demands of applications. In this work we design and analyze a set of optimized factory architectural layouts that divide a single factory into spatially distributed factories located throughout the processor. We find that distributed factory architectures minimize the space-time volume overhead imposed by distillation. Additionally, we find that the number of distributed components in each optimal configuration is sensitive to application characteristics and underlying physical device error rates. More specifically, we find that the rate at which T-gates are demanded by an application has a significant impact on the optimal distillation architecture. We develop an optimization procedure that discovers the optimal number of factory distillation rounds and number of output magic states per factory, as well as an overall system architecture that interacts with the factories. This yields between a 10x and 20x resource reduction compared to commonly accepted single factory designs. Performance is analyzed across representative application classes such as quantum simulation and quantum chemistry. △ Less

Submitted 25 April, 2019; originally announced April 2019.

Comments: 16 pages, 14 figures

arXiv:1903.10541 [pdf]

Next Steps in Quantum Computing: Computer Science's Role

Authors: Margaret Martonosi, Martin Roetteler

Abstract: The computing ecosystem has always had deep impacts on society and technology and profoundly changed our lives in myriads of ways. Despite decades of impressive Moore's Law performance scaling and other growth in the computing ecosystem there are nonetheless still important potential applications of computing that remain out of reach of current or foreseeable conventional computer systems. Specifi… ▽ More The computing ecosystem has always had deep impacts on society and technology and profoundly changed our lives in myriads of ways. Despite decades of impressive Moore's Law performance scaling and other growth in the computing ecosystem there are nonetheless still important potential applications of computing that remain out of reach of current or foreseeable conventional computer systems. Specifically, there are computational applications whose complexity scales super-linearly, even exponentially, with the size of their input data such that the computation time or memory requirements for these problems become intractably large to solve for useful data input sizes. Such problems can have memory requirements that exceed what can be built on the most powerful supercomputers, and/or runtimes on the order of tens of years or more. Quantum computing (QC) is viewed by many as a possible future option for tackling these high-complexity or seemingly-intractable problems by complementing classical computing with a fundamentally different compute paradigm. There is a huge gap between the problems for which a quantum computer might be useful and what we can currently build, program, and run. The goal of the QC research community is to close the gap such that useful algorithms can be run in practical amounts of time on reliable real-world QC hardware. In particular, the goal of this Computing Community Consortium (CCC) workshop was to articulate the central role that the computer science (CS) research communities plays in closing this gap. CS researchers bring invaluable expertise in the design of programming languages, in techniques for systems building, scalability and verification, and in architectural approaches that can bring practical QC from the future to the present. This report introduces issues and recommendations across a range of technical and non-technical topic areas. △ Less

Submitted 25 March, 2019; originally announced March 2019.

Comments: A Computing Community Consortium (CCC) workshop report, 24 pages

Report number: ccc2018report_3

arXiv:1903.03276 [pdf, ps, other]

doi 10.1016/j.micpro.2019.02.005

Formal Constraint-based Compilation for Noisy Intermediate-Scale Quantum Systems

Authors: Prakash Murali, Ali Javadi-Abhari, Frederic T. Chong, Margaret Martonosi

Abstract: Noisy, intermediate-scale quantum (NISQ) systems are expected to have a few hundred qubits, minimal or no error correction, limited connectivity and limits on the number of gates that can be performed within the short coherence window of the machine. The past decade's research on quantum programming languages and compilers is directed towards large systems with thousands of qubits. For near term q… ▽ More Noisy, intermediate-scale quantum (NISQ) systems are expected to have a few hundred qubits, minimal or no error correction, limited connectivity and limits on the number of gates that can be performed within the short coherence window of the machine. The past decade's research on quantum programming languages and compilers is directed towards large systems with thousands of qubits. For near term quantum systems, it is crucial to design tool flows which make efficient use of the hardware resources without sacrificing the ease and portability of a high-level programming environment. In this paper, we present a compiler for the Scaffold quantum programming language in which aggressive optimization specifically targets NISQ machines with hundreds of qubits. Our compiler extracts gates from a Scaffold program, and formulates a constrained optimization problem which considers both program characteristics and machine constraints. Using the Z3 SMT solver, the compiler maps program qubits to hardware qubits, schedules gates, and inserts CNOT routing operations while optimizing the overall execution time. The output of the optimization is used to produce target code in the OpenQASM language, which can be executed on existing quantum hardware such as the 16-qubit IBM machine. Using real and synthetic benchmarks, we show that it is feasible to synthesize near-optimal compiled code for current and small NISQ systems. For large programs and machine sizes, the SMT optimization approach can be used to synthesize compiled code that is guaranteed to finish within the coherence window of the machine. △ Less

Submitted 7 March, 2019; originally announced March 2019.

Comments: Invited paper in Special Issue on Quantum Computer Architecture: a full-stack overview, Microprocessors and Microsystems

Journal ref: Microprocessors and Microsystems 2019

arXiv:1901.11054 [pdf, other]

Noise-Adaptive Compiler Mappings for Noisy Intermediate-Scale Quantum Computers

Authors: Prakash Murali, Jonathan M. Baker, Ali Javadi Abhari, Frederic T. Chong, Margaret Martonosi

Abstract: A massive gap exists between current quantum computing (QC) prototypes, and the size and scale required for many proposed QC algorithms. Current QC implementations are prone to noise and variability which affect their reliability, and yet with less than 80 quantum bits (qubits) total, they are too resource-constrained to implement error correction. The term Noisy Intermediate-Scale Quantum (NISQ)… ▽ More A massive gap exists between current quantum computing (QC) prototypes, and the size and scale required for many proposed QC algorithms. Current QC implementations are prone to noise and variability which affect their reliability, and yet with less than 80 quantum bits (qubits) total, they are too resource-constrained to implement error correction. The term Noisy Intermediate-Scale Quantum (NISQ) refers to these current and near-term systems of 1000 qubits or less. Given NISQ's severe resource constraints, low reliability, and high variability in physical characteristics such as coherence time or error rates, it is of pressing importance to map computations onto them in ways that use resources efficiently and maximize the likelihood of successful runs. This paper proposes and evaluates backend compiler approaches to map and optimize high-level QC programs to execute with high reliability on NISQ systems with diverse hardware characteristics. Our techniques all start from an LLVM intermediate representation of the quantum program (such as would be generated from high-level QC languages like Scaffold) and generate QC executables runnable on the IBM Q public QC machine. We then use this framework to implement and evaluate several optimal and heuristic mapping methods. These methods vary in how they account for the availability of dynamic machine calibration data, the relative importance of various noise parameters, the different possible routing strategies, and the relative importance of compile-time scalability versus runtime success. Using real-system measurements, we show that fine grained spatial and temporal variations in hardware parameters can be exploited to obtain an average $2.9$x (and up to $18$x) improvement in program success rate over the industry standard IBM Qiskit compiler. △ Less

Submitted 30 January, 2019; originally announced January 2019.

Comments: To appear in ASPLOS'19

arXiv:1811.05447 [pdf, other]

doi 10.4230/OASIcs.PLATEAU.2018.4

QDB: From Quantum Algorithms Towards Correct Quantum Programs

Authors: Yipeng Huang, Margaret Martonosi

Abstract: With the advent of small-scale prototype quantum computers, researchers can now code and run quantum algorithms that were previously proposed but not fully implemented. In support of this growing interest in quantum computing experimentation, programmers need new tools and techniques to write and debug QC code. In this work, we implement a range of QC algorithms and programs in order to discover w… ▽ More With the advent of small-scale prototype quantum computers, researchers can now code and run quantum algorithms that were previously proposed but not fully implemented. In support of this growing interest in quantum computing experimentation, programmers need new tools and techniques to write and debug QC code. In this work, we implement a range of QC algorithms and programs in order to discover what types of bugs occur and what defenses against those bugs are possible in QC programs. We conduct our study by running small-sized QC programs in QC simulators in order to replicate published results in QC implementations. Where possible, we cross-validate results from programs written in different QC languages for the same problems and inputs. Drawing on this experience, we provide a taxonomy for QC bugs, and we propose QC language features that would aid in writing correct code. △ Less

Submitted 15 November, 2018; v1 submitted 13 November, 2018; originally announced November 2018.

Journal ref: http://drops.dagstuhl.de/opus/volltexte/2019/10196/

arXiv:1809.01302 [pdf, other]

doi 10.1109/MICRO.2018.00072

Magic-State Functional Units: Mapping and Scheduling Multi-Level Distillation Circuits for Fault-Tolerant Quantum Architectures

Authors: Yongshan Ding, Adam Holmes, Ali Javadi-Abhari, Diana Franklin, Margaret Martonosi, Frederic T. Chong

Abstract: Quantum computers have recently made great strides and are on a long-term path towards useful fault-tolerant computation. A dominant overhead in fault-tolerant quantum computation is the production of high-fidelity encoded qubits, called magic states, which enable reliable error-corrected computation. We present the first detailed designs of hardware functional units that implement space-time opti… ▽ More Quantum computers have recently made great strides and are on a long-term path towards useful fault-tolerant computation. A dominant overhead in fault-tolerant quantum computation is the production of high-fidelity encoded qubits, called magic states, which enable reliable error-corrected computation. We present the first detailed designs of hardware functional units that implement space-time optimized magic-state factories for surface code error-corrected machines. Interactions among distant qubits require surface code braids (physical pathways on chip) which must be routed. Magic-state factories are circuits comprised of a complex set of braids that is more difficult to route than quantum circuits considered in previous work [1]. This paper explores the impact of scheduling techniques, such as gate reordering and qubit renaming, and we propose two novel mapping techniques: braid repulsion and dipole moment braid rotation. We combine these techniques with graph partitioning and community detection algorithms, and further introduce a stitching algorithm for mapping subgraphs onto a physical machine. Our results show a factor of 5.64 reduction in space-time volume compared to the best-known previous designs for magic-state factories. △ Less

Submitted 4 September, 2018; originally announced September 2018.

Comments: 13 pages, 10 figures

arXiv:1802.03802 [pdf, other]

MeltdownPrime and SpectrePrime: Automatically-Synthesized Attacks Exploiting Invalidation-Based Coherence Protocols

Authors: Caroline Trippel, Daniel Lustig, Margaret Martonosi

Abstract: The recent Meltdown and Spectre attacks highlight the importance of automated verification techniques for identifying hardware security vulnerabilities. We have developed a tool for synthesizing microarchitecture-specific programs capable of producing any user-specified hardware execution pattern of interest. Our tool takes two inputs: a formal description of (i) a microarchitecture in a domain-sp… ▽ More The recent Meltdown and Spectre attacks highlight the importance of automated verification techniques for identifying hardware security vulnerabilities. We have developed a tool for synthesizing microarchitecture-specific programs capable of producing any user-specified hardware execution pattern of interest. Our tool takes two inputs: a formal description of (i) a microarchitecture in a domain-specific language, and (ii) a microarchitectural execution pattern of interest, e.g. a threat pattern. All programs synthesized by our tool are capable of producing the specified execution pattern on the supplied microarchitecture. We used our tool to specify a hardware execution pattern common to Flush+Reload attacks and automatically synthesized security litmus tests representative of those that have been publicly disclosed for conducting Meltdown and Spectre attacks. We also formulated a Prime+Probe threat pattern, enabling our tool to synthesize a new variant of each---MeltdownPrime and SpectrePrime. Both of these new exploits use Prime+Probe approaches to conduct the timing attack. They are both also novel in that they are 2-core attacks which leverage the cache line invalidation mechanism in modern cache coherence protocols. These are the first proposed Prime+Probe variants of Meltdown and Spectre. But more importantly, both Prime attacks exploit invalidation-based coherence protocols to achieve the same level of precision as a Flush+Reload attack. While mitigation techniques in software (e.g., barriers that prevent speculation) will likely be the same for our Prime variants as for original Spectre and Meltdown, we believe that hardware protection against them will be distinct. As a proof of concept, we implemented SpectrePrime as a C program and ran it on an Intel x86 processor, averaging about the same accuracy as Spectre over 100 runs---97.9% for Spectre and 99.95% for SpectrePrime. △ Less

Submitted 11 February, 2018; originally announced February 2018.

arXiv:1708.09283 [pdf, other]

doi 10.1145/3123939.3123949

Optimized Surface Code Communication in Superconducting Quantum Computers

Authors: Ali Javadi-Abhari, Pranav Gokhale, Adam Holmes, Diana Franklin, Kenneth R. Brown, Margaret Martonosi, Frederic T. Chong

Abstract: Quantum computing (QC) is at the cusp of a revolution. Machines with 100 quantum bits (qubits) are anticipated to be operational by 2020 [googlemachine,gambetta2015building], and several-hundred-qubit machines are around the corner. Machines of this scale have the capacity to demonstrate quantum supremacy, the tipping point where QC is faster than the fastest classical alternative for a particular… ▽ More Quantum computing (QC) is at the cusp of a revolution. Machines with 100 quantum bits (qubits) are anticipated to be operational by 2020 [googlemachine,gambetta2015building], and several-hundred-qubit machines are around the corner. Machines of this scale have the capacity to demonstrate quantum supremacy, the tipping point where QC is faster than the fastest classical alternative for a particular problem. Because error correction techniques will be central to QC and will be the most expensive component of quantum computation, choosing the lowest-overhead error correction scheme is critical to overall QC success. This paper evaluates two established quantum error correction codes---planar and double-defect surface codes---using a set of compilation, scheduling and network simulation tools. In considering scalable methods for optimizing both codes, we do so in the context of a full microarchitectural and compiler analysis. Contrary to previous predictions, we find that the simpler planar codes are sometimes more favorable for implementation on superconducting quantum computers, especially under conditions of high communication congestion. △ Less

Submitted 30 August, 2017; originally announced August 2017.

Comments: 14 pages, 9 figures, The 50th Annual IEEE/ACM International Symposium on Microarchitecture

arXiv:1611.01507 [pdf, other]

Counterexamples and Proof Loophole for the C/C++ to POWER and ARMv7 Trailing-Sync Compiler Mappings

Authors: Yatin A. Manerkar, Caroline Trippel, Daniel Lustig, Michael Pellauer, Margaret Martonosi

Abstract: The C and C++ high-level languages provide programmers with atomic operations for writing high-performance concurrent code. At the assembly language level, C and C++ atomics get mapped down to individual instructions or combinations of instructions by compilers, depending on the ordering guarantees and synchronization instructions provided by the underlying architecture. These compiler mappings mu… ▽ More The C and C++ high-level languages provide programmers with atomic operations for writing high-performance concurrent code. At the assembly language level, C and C++ atomics get mapped down to individual instructions or combinations of instructions by compilers, depending on the ordering guarantees and synchronization instructions provided by the underlying architecture. These compiler mappings must uphold the ordering guarantees provided by C/C++ atomics or the compiled program will not behave according to the C/C++ memory model. In this paper we discuss two counterexamples to the well-known trailing-sync compiler mappings for the Power and ARMv7 architectures that were previously thought to be proven correct. In addition to the counterexamples, we discuss the loophole in the proof of the mappings that allowed the incorrect mappings to be proven correct. We also discuss the current state of compilers and architectures in relation to the bug. △ Less

Submitted 16 November, 2016; v1 submitted 4 November, 2016; originally announced November 2016.

arXiv:1609.06756 [pdf]

21st Century Computer Architecture

Authors: Mark D. Hill, Sarita Adve, Luis Ceze, Mary Jane Irwin, David Kaeli, Margaret Martonosi, Josep Torrellas, Thomas F. Wenisch, David Wood, Katherine Yelick

Abstract: Because most technology and computer architecture innovations were (intentionally) invisible to higher layers, application and other software developers could reap the benefits of this progress without engaging in it. Higher performance has both made more computationally demanding applications feasible (e.g., virtual assistants, computer vision) and made less demanding applications easier to devel… ▽ More Because most technology and computer architecture innovations were (intentionally) invisible to higher layers, application and other software developers could reap the benefits of this progress without engaging in it. Higher performance has both made more computationally demanding applications feasible (e.g., virtual assistants, computer vision) and made less demanding applications easier to develop by enabling higher-level programming abstractions (e.g., scripting languages and reusable components). Improvements in computer system cost-effectiveness enabled value creation that could never have been imagined by the field's founders (e.g., distributed web search sufficiently inexpensive so as to be covered by advertising links). The wide benefits of computer performance growth are clear. Recently, Danowitz et al. apportioned computer performance growth roughly equally between technology and architecture, with architecture credited with ~80x improvement since 1985. As semiconductor technology approaches its "end-of-the-road" (see below), computer architecture will need to play an increasing role in enabling future ICT innovation. But instead of asking, "How can I make my chip run faster?," architects must now ask, "How can I enable the 21st century infrastructure, from sensors to clouds, adding value from performance to privacy, but without the benefit of near-perfect technology scaling?". The challenges are many, but with appropriate investment, opportunities abound. Underlying these opportunities is a common theme that future architecture innovations will require the engagement of and investments from innovators in other ICT layers. △ Less

Submitted 21 September, 2016; originally announced September 2016.

Comments: A Computing Community Consortium (CCC) white paper, 16 pages

arXiv:1608.07547 [pdf, other]

doi 10.1145/3037697.3037719

TriCheck: Memory Model Verification at the Trisection of Software, Hardware, and ISA

Authors: Caroline Trippel, Yatin A. Manerkar, Daniel Lustig, Michael Pellauer, Margaret Martonosi

Abstract: Memory consistency models (MCMs) which govern inter-module interactions in a shared memory system, are a significant, yet often under-appreciated, aspect of system design. MCMs are defined at the various layers of the hardware-software stack, requiring thoroughly verified specifications, compilers, and implementations at the interfaces between layers. Current verification techniques evaluate segme… ▽ More Memory consistency models (MCMs) which govern inter-module interactions in a shared memory system, are a significant, yet often under-appreciated, aspect of system design. MCMs are defined at the various layers of the hardware-software stack, requiring thoroughly verified specifications, compilers, and implementations at the interfaces between layers. Current verification techniques evaluate segments of the system stack in isolation, such as proving compiler mappings from a high-level language (HLL) to an ISA or proving validity of a microarchitectural implementation of an ISA. This paper makes a case for full-stack MCM verification and provides a toolflow, TriCheck, capable of verifying that the HLL, compiler, ISA, and implementation collectively uphold MCM requirements. The work showcases TriCheck's ability to evaluate a proposed ISA MCM in order to ensure that each layer and each mapping is correct and complete. Specifically, we apply TriCheck to the open source RISC-V ISA, seeking to verify accurate, efficient, and legal compilations from C11. We uncover under-specifications and potential inefficiencies in the current RISC-V ISA documentation and identify possible solutions for each. As an example, we find that a RISC-V-compliant microarchitecture allows 144 outcomes forbidden by C11 to be observed out of 1,701 litmus tests examined. Overall, this paper demonstrates the necessity of full-stack verification for detecting MCM-related bugs in the hardware-software stack. △ Less

Submitted 8 February, 2017; v1 submitted 26 August, 2016; originally announced August 2016.

Comments: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems

arXiv:1507.01902 [pdf, other]

doi 10.1016/j.parco.2014.12.001

ScaffCC: Scalable Compilation and Analysis of Quantum Programs

Authors: Ali JavadiAbhari, Shruti Patil, Daniel Kudrow, Jeff Heckey, Alexey Lvov, Frederic T. Chong, Margaret Martonosi

Abstract: We present ScaffCC, a scalable compilation and analysis framework based on LLVM, which can be used for compiling quantum computing applications at the logical level. Drawing upon mature compiler technologies, we discuss similarities and differences between compilation of classical and quantum programs, and adapt our methods to optimizing the compilation time and output for the quantum case. Our wo… ▽ More We present ScaffCC, a scalable compilation and analysis framework based on LLVM, which can be used for compiling quantum computing applications at the logical level. Drawing upon mature compiler technologies, we discuss similarities and differences between compilation of classical and quantum programs, and adapt our methods to optimizing the compilation time and output for the quantum case. Our work also integrates a reversible-logic synthesis tool in the compiler to facilitate coding of quantum circuits. Lastly, we present some useful quantum program analysis scenarios and discuss their implications, specifically with an elaborate discussion of timing analysis for critical path estimation. Our work focuses on bridging the gap between high-level quantum algorithm specifi- cations and low-level physical implementations, while providing good scalability to larger and more interesting problems △ Less

Submitted 7 July, 2015; originally announced July 2015.

Comments: Journal of Parallel Computing (PARCO)

Journal ref: Parallel Comput. 45, C (June 2015), 2-17

arXiv:0801.0094 [pdf, other]

doi 10.1103/PhysRevA.77.052315

An analytical error model for quantum computer simulation

Authors: Eric Chi, Stephen A. Lyon, Margaret Martonosi

Abstract: Quantum computers (QCs) must implement quantum error correcting codes (QECCs) to protect their logical qubits from errors, and modeling the effectiveness of QECCs on QCs is an important problem for evaluating the QC architecture. The previously developed Monte Carlo (MC) error models may take days or weeks of execution to produce an accurate result due to their random sampling approach. We prese… ▽ More Quantum computers (QCs) must implement quantum error correcting codes (QECCs) to protect their logical qubits from errors, and modeling the effectiveness of QECCs on QCs is an important problem for evaluating the QC architecture. The previously developed Monte Carlo (MC) error models may take days or weeks of execution to produce an accurate result due to their random sampling approach. We present an alternative analytical error model that generates, over the course of executing the quantum program, a probability tree of the QC's error states. By calculating the fidelity of the quantum program directly, this error model has the potential for enormous speedups over the MC model when applied to small yet useful problem sizes. We observe a speedup on the order of 1,000X when accuracy is required, and we evaluate the scaling properties of this new analytical error model. △ Less

Submitted 2 January, 2008; v1 submitted 29 December, 2007; originally announced January 2008.

Showing 1–45 of 45 results for author: Martonosi, M