-
Pushing the Boundary of Quantum Advantage in Hard Combinatorial Optimization with Probabilistic Computers
Authors:
Shuvro Chowdhury,
Navid Anjum Aadit,
Andrea Grimaldi,
Eleonora Raimondo,
Atharva Raut,
P. Aaron Lott,
Johan H. Mentink,
Marek M. Rams,
Federico Ricci-Tersenghi,
Massimo Chiappini,
Luke S. Theogarajan,
Tathagata Srimani,
Giovanni Finocchio,
Masoud Mohseni,
Kerem Y. Camsari
Abstract:
Recent demonstrations on specialized benchmarks have reignited excitement for quantum computers, yet whether they can deliver an advantage for practical real-world problems remains an open question. Here, we show that probabilistic computers (p-computers), when co-designed with hardware to implement powerful Monte Carlo algorithms, can surpass state-of-the-art quantum annealers [King et al., Nature (2023), https://www.nature.com/articles/s41586-023-05867-2] in solving certain hard optimization problems. We focus on two key algorithms: discrete-time simulated quantum annealing (DT-SQA) and adaptive parallel tempering (APT), both applied to 3D spin glasses. For DT-SQA, we find that increasing the number of replicas improves residual energy scaling, while parallelizing fewer replicas across independent runs also achieves comparable scaling. Both strategies align with the theoretical expectations from extreme value theory. In addition, APT outperforms DT-SQA when supported by non-local isoenergetic cluster moves. Finite-size scaling analysis suggests a universal behavior that explains the superior performance of APT over both DT-SQA and quantum annealing. We show that these algorithms are readily implementable in modern hardware thanks to the mature semiconductor technology. Unlike software simulations, replicas can be monolithically housed on a single chip and a large number of spins can be updated in parallel and asynchronously, similar to a quantum annealer. We project that custom Field Programmable Gate Arrays (FPGA) or specialized chips leveraging massive parallelism can further accelerate these algorithms by orders of magnitude, while drastically improving energy efficiency. Our results raise the bar for a practical quantum advantage in optimization and present p-computers as scalable, energy-efficient hardware for real-world optimization problems.
Submitted 7 April, 2025; v1 submitted 13 March, 2025;
originally announced March 2025.
-
Scalable Connectivity for Ising Machines: Dense to Sparse
Authors:
M Mahmudul Hasan Sajeeb,
Navid Anjum Aadit,
Shuvro Chowdhury,
Tong Wu,
Cesely Smith,
Dhruv Chinmay,
Atharva Raut,
Kerem Y. Camsari,
Corentin Delacour,
Tathagata Srimani
Abstract:
In recent years, hardware implementations of Ising machines have emerged as a viable alternative to quantum computing for solving hard optimization problems among other applications. Unlike quantum hardware, dense connectivity can be achieved in classical systems. However, we show that dense connectivity leads to severe frequency slowdowns and interconnect congestion scaling unfavorably with system sizes. As a scalable solution, we propose a systematic sparsification method for dense graphs by introducing copy nodes to limit the number of neighbors per graph node. In addition to solving interconnect congestion, this approach enables constant frequency scaling where all spins in a network can be updated in constant time. On the other hand, sparsification introduces new difficulties, such as constraint-breaking between copied spins and increased convergence times to solve optimization problems, especially if exact ground states are sought. Relaxing the exact solution requirements, we find that the overheads in convergence times are milder. We demonstrate these ideas by designing probabilistic bit Ising machines using ASAP7 (a predictive 7nm FinFET technology model) process design kits as well as Field Programmable Gate Array (FPGA)-based implementations. Finally, we show how formulating problems in naturally sparse networks (e.g., by invertible logic) sidesteps challenges introduced by sparsification methods. Our results are applicable to a broad family of Ising machines using different hardware implementations.
Submitted 2 June, 2025; v1 submitted 2 March, 2025;
originally announced March 2025.
-
How to Build a Quantum Supercomputer: Scaling from Hundreds to Millions of Qubits
Authors:
Masoud Mohseni,
Artur Scherer,
K. Grace Johnson,
Oded Wertheim,
Matthew Otten,
Navid Anjum Aadit,
Yuri Alexeev,
Kirk M. Bresniker,
Kerem Y. Camsari,
Barbara Chapman,
Soumitra Chatterjee,
Gebremedhin A. Dagnew,
Aniello Esposito,
Farah Fahim,
Marco Fiorentino,
Archit Gajjar,
Abdullah Khalid,
Xiangzhou Kong,
Bohdan Kulchytskyy,
Elica Kyoseva,
Ruoyu Li,
P. Aaron Lott,
Igor L. Markov,
Robert F. McDermott,
Giacomo Pedretti, et al. (16 additional authors not shown)
Abstract:
In the span of four decades, quantum computation has evolved from an intellectual curiosity to a potentially realizable technology. Today, small-scale demonstrations have become possible for quantum algorithmic primitives on hundreds of physical qubits and proof-of-principle error-correction on a single logical qubit. Nevertheless, despite significant progress and excitement, the path toward a full-stack scalable technology is largely unknown. There are significant outstanding quantum hardware, fabrication, software architecture, and algorithmic challenges that are either unresolved or overlooked. These issues could seriously undermine the arrival of utility-scale quantum computers for the foreseeable future. Here, we provide a comprehensive review of these scaling challenges. We show how the road to scaling could be paved by adopting existing semiconductor technology to build much higher-quality qubits, employing system engineering approaches, and performing distributed quantum computation within heterogeneous high-performance computing infrastructures. These opportunities for research and development could unlock certain promising applications, in particular, efficient quantum simulation/learning of quantum data generated by natural or engineered quantum systems. To estimate the true cost of such promises, we provide a detailed resource and sensitivity analysis for classically hard quantum chemistry calculations on surface-code error-corrected quantum computers given current, target, and desired hardware specifications based on superconducting qubits, accounting for a realistic distribution of errors. Furthermore, we argue that, to tackle industry-scale classical optimization and machine learning problems in a cost-effective manner, heterogeneous quantum-probabilistic computing with custom-designed accelerators should be considered as a complementary path toward scalability.
Submitted 31 January, 2025; v1 submitted 15 November, 2024;
originally announced November 2024.
-
All-to-all reconfigurability with sparse and higher-order Ising machines
Authors:
Srijan Nikhar,
Sidharth Kannan,
Navid Anjum Aadit,
Shuvro Chowdhury,
Kerem Y. Camsari
Abstract:
Domain-specific hardware to solve computationally hard optimization problems has generated tremendous excitement. Here, we evaluate probabilistic bit (p-bit) based Ising Machines (IM) on the 3-regular 3-Exclusive OR Satisfiability (3R3X) problem, a representative hard optimization problem. We first introduce a multiplexed architecture that emulates all-to-all network functionality while maintaining highly parallelized chromatic Gibbs sampling. We implement this architecture in single Field-Programmable Gate Arrays (FPGA) and show that running the adaptive parallel tempering algorithm demonstrates competitive algorithmic and prefactor advantages over alternative IMs by D-Wave, Toshiba, and Fujitsu. We also implement higher-order interactions that lead to better prefactors without changing algorithmic scaling for the XORSAT problem. Even though FPGA implementations of p-bits are still not quite as fast as the best possible greedy algorithms accelerated on Graphics Processing Units (GPU), scaled magnetic versions of p-bit IMs could lead to orders of magnitude improvements over the state of the art for generic optimization.
Submitted 26 September, 2024; v1 submitted 21 November, 2023;
originally announced December 2023.
-
CMOS + stochastic nanomagnets: heterogeneous computers for probabilistic inference and learning
Authors:
Nihal Sanjay Singh,
Keito Kobayashi,
Qixuan Cao,
Kemal Selcuk,
Tianrui Hu,
Shaila Niazi,
Navid Anjum Aadit,
Shun Kanai,
Hideo Ohno,
Shunsuke Fukami,
Kerem Y. Camsari
Abstract:
Extending Moore's law by augmenting complementary-metal-oxide semiconductor (CMOS) transistors with emerging nanotechnologies (X) has become increasingly important. One important class of problems involves sampling-based Monte Carlo algorithms used in probabilistic machine learning, optimization, and quantum simulation. Here, we combine stochastic magnetic tunnel junction (sMTJ)-based probabilistic bits (p-bits) with Field Programmable Gate Arrays (FPGA) to create an energy-efficient CMOS + X (X = sMTJ) prototype. This setup shows how asynchronously driven CMOS circuits controlled by sMTJs can perform probabilistic inference and learning by leveraging the algorithmic update-order-invariance of Gibbs sampling. We show how the stochasticity of sMTJs can augment low-quality random number generators (RNG). Detailed transistor-level comparisons reveal that sMTJ-based p-bits can replace up to 10,000 CMOS transistors while dissipating two orders of magnitude less energy. Integrated versions of our approach can advance probabilistic computing involving deep Boltzmann machines and other energy-based learning algorithms with extremely high throughput and energy efficiency.
Submitted 23 February, 2024; v1 submitted 12 April, 2023;
originally announced April 2023.
-
Training Deep Boltzmann Networks with Sparse Ising Machines
Authors:
Shaila Niazi,
Navid Anjum Aadit,
Masoud Mohseni,
Shuvro Chowdhury,
Yao Qin,
Kerem Y. Camsari
Abstract:
The slowing down of Moore's law has driven the development of unconventional computing paradigms, such as specialized Ising machines tailored to solve combinatorial optimization problems. In this paper, we show a new application domain for probabilistic bit (p-bit) based Ising machines by training deep generative AI models with them. Using sparse, asynchronous, and massively parallel Ising machines, we train deep Boltzmann networks in a hybrid probabilistic-classical computing setup. We use the full MNIST and Fashion MNIST (FMNIST) datasets without any downsampling and a reduced version of the CIFAR-10 dataset in hardware-aware network topologies implemented in moderately sized Field Programmable Gate Arrays (FPGA). For MNIST, our machine using only 4,264 nodes (p-bits) and about 30,000 parameters achieves the same classification accuracy (90%) as an optimized software-based restricted Boltzmann Machine (RBM) with approximately 3.25 million parameters. Similar results follow for FMNIST and CIFAR-10. Additionally, the sparse deep Boltzmann network can generate new handwritten digits and fashion products, a task the 3.25 million parameter RBM fails at despite achieving the same accuracy. Our hybrid computer achieves a measured 50 to 64 billion probabilistic flips per second, which is at least an order of magnitude faster than superficially similar Graphics and Tensor Processing Unit (GPU/TPU) based implementations. The massively parallel architecture can comfortably perform the contrastive divergence algorithm (CD-n) with up to n = 10 million sweeps per update, beyond the capabilities of existing software implementations. These results demonstrate the potential of using Ising machines for traditionally hard-to-train deep generative Boltzmann networks, with further possible improvement in nanodevice-based realizations.
Submitted 23 January, 2024; v1 submitted 19 March, 2023;
originally announced March 2023.
-
A full-stack view of probabilistic computing with p-bits: devices, architectures and algorithms
Authors:
Shuvro Chowdhury,
Andrea Grimaldi,
Navid Anjum Aadit,
Shaila Niazi,
Masoud Mohseni,
Shun Kanai,
Hideo Ohno,
Shunsuke Fukami,
Luke Theogarajan,
Giovanni Finocchio,
Supriyo Datta,
Kerem Y. Camsari
Abstract:
The transistor celebrated its 75th birthday in 2022. The scaling of the transistor defined by Moore's Law continues, albeit at a slower pace. Meanwhile, computing demands and energy consumption required by modern artificial intelligence (AI) algorithms have skyrocketed. As an alternative to scaling transistors for general-purpose computing, the integration of transistors with unconventional technologies has emerged as a promising path for domain-specific computing. In this article, we provide a full-stack review of probabilistic computing with p-bits as a representative example of the energy-efficient and domain-specific computing movement. We argue that p-bits could be used to build energy-efficient probabilistic systems, tailored for probabilistic algorithms and applications. From hardware, architecture, and algorithmic perspectives, we outline the main applications of probabilistic computers ranging from probabilistic machine learning and AI to combinatorial optimization and quantum simulation. Combining emerging nanodevices with the existing CMOS ecosystem will lead to probabilistic computers with orders of magnitude improvements in energy efficiency and probabilistic sampling, potentially unlocking previously unexplored regimes for powerful probabilistic algorithms.
Submitted 16 March, 2023; v1 submitted 13 February, 2023;
originally announced February 2023.
-
Physics-inspired Ising Computing with Ring Oscillator Activated p-bits
Authors:
Navid Anjum Aadit,
Andrea Grimaldi,
Giovanni Finocchio,
Kerem Y. Camsari
Abstract:
The nearing end of Moore's Law has been driving the development of domain-specific hardware tailored to solve a special set of problems. Along these lines, probabilistic computing with inherently stochastic building blocks (p-bits) has shown significant promise, particularly in the context of hard optimization and statistical sampling problems. p-bits have been proposed and demonstrated in different hardware substrates ranging from small-scale stochastic magnetic tunnel junctions (sMTJs) in asynchronous architectures to large-scale CMOS in synchronous architectures. Here, we design and implement a truly asynchronous and medium-scale p-computer (with ≈ 800 p-bits) that closely emulates the asynchronous dynamics of sMTJs in Field Programmable Gate Arrays (FPGAs). Using hard instances of the planted Ising glass problem on the Chimera lattice, we evaluate the performance of the asynchronous architecture against an ideal, synchronous design that performs parallelized (chromatic) exact Gibbs sampling. We find that despite the lack of any careful synchronization, the asynchronous design achieves parallelism with algorithmic scaling comparable to that of the ideal, carefully tuned and parallelized synchronous design. Our results highlight the promise of massively scaled p-computers with millions of free-running p-bits made out of nanoscale building blocks such as stochastic magnetic tunnel junctions.
Submitted 15 May, 2022;
originally announced May 2022.
-
Massively Parallel Probabilistic Computing with Sparse Ising Machines
Authors:
Navid Anjum Aadit,
Andrea Grimaldi,
Mario Carpentieri,
Luke Theogarajan,
John M. Martinis,
Giovanni Finocchio,
Kerem Y. Camsari
Abstract:
Inspired by the developments in quantum computing, building domain-specific classical hardware to solve computationally hard problems has received increasing attention. Here, by introducing systematic sparsification techniques, we demonstrate a massively parallel architecture: the sparse Ising Machine (sIM). Exploiting sparsity, sIM achieves ideal parallelism: its key figure of merit - flips per second - scales linearly with the number of probabilistic bits (p-bits) in the system. This makes sIM up to 6 orders of magnitude faster than a CPU implementing standard Gibbs sampling. Compared to optimized implementations in TPUs and GPUs, sIM delivers 5-18x speedup in sampling. In benchmark problems such as integer factorization, sIM can reliably factor semiprimes up to 32 bits, far larger than previous attempts from D-Wave and other probabilistic solvers. Strikingly, sIM beats competition-winning SAT solvers (by 4-700x in runtime to reach 95% accuracy) in solving 3SAT problems. Even when sampling is made inexact using faster clocks, sIM can find the correct ground state with further speedup. The problem encoding and sparsification techniques we introduce can be applied to other Ising Machines (classical and quantum) and the architecture we present can be used for scaling the demonstrated 5,000-10,000 p-bits to 1,000,000 or more through analog CMOS or nanodevices.
Submitted 21 February, 2022; v1 submitted 5 October, 2021;
originally announced October 2021.
-
Dense Pruning of Pointwise Convolutions in the Frequency Domain
Authors:
Mark Buckler,
Neil Adit,
Yuwei Hu,
Zhiru Zhang,
Adrian Sampson
Abstract:
Depthwise separable convolutions and frequency-domain convolutions are two recent ideas for building efficient convolutional neural networks. They are seemingly incompatible: the vast majority of operations in depthwise separable CNNs are in pointwise convolutional layers, but pointwise layers use 1x1 kernels, which do not benefit from frequency transformation. This paper unifies these two ideas by transforming the activations, not the kernels. Our key insights are that 1) pointwise convolutions commute with frequency transformation and thus can be computed in the frequency domain without modification, 2) each channel within a given layer has a different level of sensitivity to frequency domain pruning, and 3) each channel's sensitivity to frequency pruning is approximately monotonic with respect to frequency. We leverage this knowledge by proposing a new technique which wraps each pointwise layer in a discrete cosine transform (DCT) which is truncated to selectively prune coefficients above a given threshold as per the needs of each channel. To learn which frequencies should be pruned from which channels, we introduce a novel learned parameter which specifies each channel's pruning threshold. We add a new regularization term which incentivizes the model to decrease the number of retained frequencies while still maintaining task accuracy. Unlike weight pruning techniques which rely on sparse operators, our contiguous frequency band pruning results in fully dense computation. We apply our technique to MobileNetV2 and in the process reduce computation time by 22% and incur <1% accuracy degradation.
Submitted 16 September, 2021;
originally announced September 2021.
-
Dagger: Accelerating RPCs in Cloud Microservices Through Tightly-Coupled Reconfigurable NICs
Authors:
Nikita Lazarev,
Shaojie Xiang,
Neil Adit,
Zhiru Zhang,
Christina Delimitrou
Abstract:
The ongoing shift of cloud services from monolithic designs to microservices creates high demand for efficient and high performance datacenter networking stacks, optimized for fine-grained workloads. Commodity networking systems based on software stacks and peripheral NICs introduce high overheads when it comes to delivering small messages.
We present Dagger, a hardware acceleration fabric for cloud RPCs based on FPGAs, where the accelerator is closely coupled with the host processor over a configurable memory interconnect. The three key design principles of Dagger are: (1) offloading the entire RPC stack to an FPGA-based NIC, (2) leveraging memory interconnects instead of PCIe buses as the interface with the host CPU, and (3) making the acceleration fabric reconfigurable, so it can accommodate the diverse needs of microservices. We show that the combination of these principles significantly improves the efficiency and performance of cloud RPC systems while preserving their generality. Dagger achieves 1.3-3.8x higher per-core RPC throughput compared to both highly optimized software stacks and systems using specialized RDMA adapters. It also scales up to 84 Mrps with 8 threads on 4 CPU cores, while maintaining state-of-the-art µs-scale tail latency. We also demonstrate that large third-party applications, like memcached and MICA KVS, can be easily ported to Dagger with minimal changes to their codebase, bringing their median and tail KVS access latency down to 2.8-3.5 µs and 5.4-7.8 µs, respectively. Finally, we show that Dagger is beneficial for multi-tier end-to-end microservices with different threading models by evaluating it using an 8-tier application implementing a flight check-in service.
Submitted 2 June, 2021;
originally announced June 2021.
-
Dagger: Towards Efficient RPCs in Cloud Microservices with Near-Memory Reconfigurable NICs
Authors:
Nikita Lazarev,
Neil Adit,
Shaojie Xiang,
Zhiru Zhang,
Christina Delimitrou
Abstract:
Cloud applications are increasingly relying on hundreds of loosely coupled microservices to complete user requests that meet an application's end-to-end QoS requirements. Communication time between services accounts for a large fraction of the end-to-end latency and can introduce performance unpredictability and QoS violations. This paper presents our early work on Dagger, a hardware acceleration platform for networking, designed specifically with the unique qualities of microservices in mind. The Dagger architecture relies on an FPGA-based NIC, closely coupled with the processor over a configurable memory interconnect, designed to offload and accelerate RPC stacks. Unlike the traditional cloud systems that use PCIe links as the NIC I/O interface, we leverage memory-interconnected FPGAs as networking devices to provide the efficiency, transparency, and programmability needed for fine-grained microservices. We show that this considerably improves CPU utilization and performance for cloud RPCs.
Submitted 11 September, 2020; v1 submitted 16 July, 2020;
originally announced July 2020.