-
Recurrent CircuitSAT Sampling for Sequential Circuits
Authors:
Arash Ardakani,
Kevin He,
John Wawrzynek
Abstract:
In this work, we introduce a novel GPU-accelerated circuit satisfiability (CircuitSAT) sampling technique for sequential circuits. This work is motivated by the requirement in constrained random verification (CRV) to generate input stimuli to validate the functionality of digital hardware circuits. A major challenge in CRV is generating inputs for sequential circuits, along with the appropriate number of clock cycles required to meet design constraints. Traditional approaches often use Boolean satisfiability (SAT) samplers to generate inputs by unrolling state transitions over a fixed number of clock cycles. However, these methods do not guarantee that a solution exists for the given number of cycles. Consequently, producing input stimuli together with the required clock cycles is essential for thorough testing and verification. Our approach converts the logical constraints and temporal behavior of sequential circuits into a recurrent CircuitSAT problem, optimized via gradient descent to efficiently explore a diverse set of valid solutions, including their associated number of clock cycles. By operating directly on the circuit structure, our method reinterprets the sampling process as a supervised multi-output regression task. This differentiable framework enables independent element-wise operations on each tensor element, facilitating parallel execution during learning. As a result, we achieve GPU-accelerated sampling with substantial runtime improvements (up to 105.1x) over state-of-the-art heuristic samplers. We demonstrate the effectiveness of our method through extensive evaluations on circuit problems from the ISCAS-89 and ITC'99 benchmark suites.
Submitted 3 March, 2025; v1 submitted 28 February, 2025;
originally announced February 2025.
-
High-Throughput SAT Sampling
Authors:
Arash Ardakani,
Minwoo Kang,
Kevin He,
Qijing Huang,
John Wawrzynek
Abstract:
In this work, we present a novel technique for GPU-accelerated Boolean satisfiability (SAT) sampling. Unlike conventional sampling algorithms that directly operate on conjunctive normal form (CNF), our method transforms the logical constraints of SAT problems by factoring their CNF representations into simplified multi-level, multi-output Boolean functions. It then leverages gradient-based optimization to guide the search for a diverse set of valid solutions. Our method operates directly on the circuit structure of refactored SAT instances, reinterpreting the SAT problem as a supervised multi-output regression task. This differentiable technique enables independent bit-wise operations on each tensor element, allowing parallel execution of learning processes. As a result, we achieve GPU-accelerated sampling with significant runtime improvements ranging from $33.6\times$ to $523.6\times$ over state-of-the-art heuristic samplers. We demonstrate the superior performance of our sampling method through an extensive evaluation on $60$ instances from a public domain benchmark suite utilized in previous studies.
Submitted 11 February, 2025;
originally announced February 2025.
-
DEMOTIC: A Differentiable Sampler for Multi-Level Digital Circuits
Authors:
Arash Ardakani,
Minwoo Kang,
Kevin He,
Qijing Huang,
Vighnesh Iyer,
Suhong Moon,
John Wawrzynek
Abstract:
Efficient sampling of satisfying formulas for circuit satisfiability (CircuitSAT), a well-known NP-complete problem, is essential in modern front-end applications for thorough testing and verification of digital circuits. Generating such samples is a hard computational problem due to the inherent complexity of digital circuits, size of the search space, and resource constraints involved in the process. Addressing these challenges has prompted the development of specialized algorithms that heavily rely on heuristics. However, these heuristic-based approaches frequently encounter scalability issues when tasked with sampling from a larger number of solutions, primarily due to their sequential nature. Different from such heuristic algorithms, we propose a novel differentiable sampler for multi-level digital circuits, called DEMOTIC, that utilizes gradient descent (GD) to solve the CircuitSAT problem and obtain a wide range of valid and distinct solutions. DEMOTIC leverages the circuit structure of the problem instance to learn valid solutions using GD by re-framing the CircuitSAT problem as a supervised multi-output regression task. This differentiable approach allows bit-wise operations to be performed independently on each element of a tensor, enabling parallel execution of learning operations, and accordingly, GPU-accelerated sampling with significant runtime improvements compared to state-of-the-art heuristic samplers. We demonstrate the superior runtime performance of DEMOTIC in the sampling task across various CircuitSAT instances from the ISCAS-85 benchmark suite. Specifically, DEMOTIC outperforms the state-of-the-art sampler by more than two orders of magnitude in most cases.
Submitted 11 February, 2025;
originally announced February 2025.
-
Chip Placement with Diffusion Models
Authors:
Vint Lee,
Minh Nguyen,
Leena Elzeiny,
Chun Deng,
Pieter Abbeel,
John Wawrzynek
Abstract:
Macro placement is a vital step in digital circuit design that defines the physical location of large collections of components, known as macros, on a 2D chip. Because key performance metrics of the chip are determined by the placement, optimizing it is crucial. Existing learning-based methods typically fall short because of their reliance on reinforcement learning (RL), which is slow and struggles to generalize, requiring online training on each new circuit. Instead, we train a diffusion model capable of placing new circuits zero-shot, using guided sampling in lieu of RL to optimize placement quality. To enable such models to train at scale, we design a capable yet efficient architecture for the denoising model and propose a novel algorithm to generate large synthetic datasets for pre-training. To allow zero-shot transfer to real circuits, we empirically study the design decisions of our dataset generation algorithm and identify several key factors enabling generalization. When trained on our synthetic data, our models generate high-quality placements on unseen, realistic circuits, achieving competitive performance on placement benchmarks compared to state-of-the-art methods.
Submitted 10 June, 2025; v1 submitted 16 July, 2024;
originally announced July 2024.
-
CoSA: Scheduling by Constrained Optimization for Spatial Accelerators
Authors:
Qijing Huang,
Minwoo Kang,
Grace Dinh,
Thomas Norell,
Aravind Kalaiah,
James Demmel,
John Wawrzynek,
Yakun Sophia Shao
Abstract:
Recent advances in Deep Neural Networks (DNNs) have led to active development of specialized DNN accelerators, many of which feature a large number of processing elements laid out spatially, together with a multi-level memory hierarchy and flexible interconnect. While DNN accelerators can take advantage of data reuse and achieve high peak throughput, they also expose a large number of runtime parameters to the programmers who need to explicitly manage how computation is scheduled both spatially and temporally. In fact, different scheduling choices can lead to wide variations in performance and efficiency, motivating the need for a fast and efficient search strategy to navigate the vast scheduling space.
To address this challenge, we present CoSA, a constrained-optimization-based approach for scheduling DNN accelerators. As opposed to existing approaches that either rely on designers' heuristics or iterative methods to navigate the search space, CoSA expresses scheduling decisions as a constrained-optimization problem that can be deterministically solved using mathematical optimization techniques. Specifically, CoSA leverages the regularities in DNN operators and hardware to formulate the DNN scheduling space into a mixed-integer programming (MIP) problem with algorithmic and architectural constraints, which can be solved to automatically generate a highly efficient schedule in one shot. We demonstrate that CoSA-generated schedules significantly outperform state-of-the-art approaches by a geometric mean of up to 2.5x across a wide range of DNN networks while improving the time-to-solution by 90x.
Submitted 5 May, 2021;
originally announced May 2021.
-
HAO: Hardware-aware neural Architecture Optimization for Efficient Inference
Authors:
Zhen Dong,
Yizhao Gao,
Qijing Huang,
John Wawrzynek,
Hayden K. H. So,
Kurt Keutzer
Abstract:
Automatic algorithm-hardware co-design for DNNs has shown great success in improving the performance of DNNs on FPGAs. However, this process remains challenging due to the intractable search space of neural network architectures and hardware accelerator implementations. Unlike existing hardware-aware neural architecture search (NAS) algorithms that rely solely on expensive learning-based approaches, our work incorporates integer programming into the search algorithm to prune the design space. Given a set of hardware resource constraints, our integer programming formulation directly outputs the optimal accelerator configuration for mapping a DNN subgraph that minimizes latency. We use an accuracy predictor for different DNN subgraphs with different quantization schemes and generate accuracy-latency Pareto frontiers. With low computational cost, our algorithm can generate quantized networks that achieve state-of-the-art accuracy and hardware performance on a Xilinx Zynq (ZU3EG) FPGA for image classification on the ImageNet dataset. The solution found by our algorithm achieves 72.5% top-1 accuracy on ImageNet at a frame rate of 50, which is 60% faster than MnasNet and 135% faster than FBNet with comparable accuracy.
Submitted 26 April, 2021;
originally announced April 2021.
-
CoDeNet: Efficient Deployment of Input-Adaptive Object Detection on Embedded FPGAs
Authors:
Zhen Dong,
Dequan Wang,
Qijing Huang,
Yizhao Gao,
Yaohui Cai,
Tian Li,
Bichen Wu,
Kurt Keutzer,
John Wawrzynek
Abstract:
Deploying deep learning models on embedded systems has been challenging due to limited computing resources. The majority of existing work focuses on accelerating image classification, while other fundamental vision problems, such as object detection, have not been adequately addressed. Compared with image classification, detection problems are more sensitive to the spatial variance of objects and therefore require specialized convolutions to aggregate spatial information. To address this need, recent work introduces dynamic deformable convolution to augment regular convolutions. However, this leads to inefficient memory accesses of inputs with existing hardware. In this work, we harness the flexibility of FPGAs to develop a novel object detection pipeline with deformable convolutions. We show the speed-accuracy tradeoffs for a set of algorithm modifications including irregular-access versus limited-range and fixed-shape. We then Co-Design a Network CoDeNet with the modified deformable convolution and quantize it to 4-bit weights and 8-bit activations. With our high-efficiency implementation, our solution reaches 26.9 frames per second with a tiny model size of 0.76 MB while achieving 61.7 AP50 on the standard object detection dataset, Pascal VOC. With our higher-accuracy implementation, our model reaches 67.1 AP50 on Pascal VOC with only 2.9 MB of parameters, 20.9x smaller but 10% more accurate than Tiny-YOLO.
Submitted 25 January, 2021; v1 submitted 12 June, 2020;
originally announced June 2020.
-
ProTuner: Tuning Programs with Monte Carlo Tree Search
Authors:
Ameer Haj-Ali,
Hasan Genc,
Qijing Huang,
William Moses,
John Wawrzynek,
Krste Asanović,
Ion Stoica
Abstract:
We explore applying the Monte Carlo Tree Search (MCTS) algorithm in a notoriously difficult task: tuning programs for high-performance deep learning and image processing. We build our framework on top of Halide and show that MCTS can outperform the state-of-the-art beam-search algorithm. Unlike beam search, which is guided by greedy intermediate performance comparisons between partial and less meaningful schedules, MCTS compares complete schedules and looks ahead before making any intermediate scheduling decision. We further explore modifications to the standard MCTS algorithm as well as combining real execution time measurements with the cost model. Our results show that MCTS can outperform beam search on a suite of 16 real benchmarks.
Submitted 27 May, 2020;
originally announced May 2020.
-
AutoPhase: Juggling HLS Phase Orderings in Random Forests with Deep Reinforcement Learning
Authors:
Qijing Huang,
Ameer Haj-Ali,
William Moses,
John Xiang,
Ion Stoica,
Krste Asanovic,
John Wawrzynek
Abstract:
The performance of the code a compiler generates depends on the order in which it applies the optimization passes. Choosing a good order, often referred to as the phase-ordering problem, is an NP-hard problem. As a result, existing solutions rely on a variety of heuristics. In this paper, we evaluate a new technique to address the phase-ordering problem: deep reinforcement learning. To this end, we implement AutoPhase: a framework that takes a program and uses deep reinforcement learning to find a sequence of compilation passes that minimizes its execution time. Without loss of generality, we construct this framework in the context of the LLVM compiler toolchain and target high-level synthesis programs. We use random forests to quantify the correlation between the effectiveness of a given pass and the program's features. This helps us reduce the search space by avoiding phase orderings that are unlikely to improve the performance of a given program. We compare the performance of AutoPhase to state-of-the-art algorithms that address the phase-ordering problem. In our evaluation, we show that AutoPhase improves circuit performance by 28% when compared to using the -O3 compiler flag, and achieves competitive results compared to the state-of-the-art solutions, while requiring fewer samples. Furthermore, unlike existing state-of-the-art solutions, our deep reinforcement learning solution shows promising results in generalizing to real benchmarks and 12,874 different randomly generated programs, after training on a hundred randomly generated programs.
Submitted 4 March, 2020; v1 submitted 2 March, 2020;
originally announced March 2020.
-
Algorithm-hardware Co-design for Deformable Convolution
Authors:
Qijing Huang,
Dequan Wang,
Yizhao Gao,
Yaohui Cai,
Zhen Dong,
Bichen Wu,
Kurt Keutzer,
John Wawrzynek
Abstract:
FPGAs provide a flexible and efficient platform to accelerate rapidly-changing algorithms for computer vision. The majority of existing work focuses on accelerating image classification, while other fundamental vision problems, including object detection and instance segmentation, have not been adequately addressed. Compared with image classification, detection problems are more sensitive to the spatial variance of objects, and therefore, require specialized convolutions to aggregate spatial information. To address this, recent work proposes dynamic deformable convolution to augment regular convolutions. Regular convolutions process a fixed grid of pixels across all the spatial locations in an image, while dynamic deformable convolutions may access arbitrary pixels in the image and the access pattern is input-dependent and varies per spatial location. These properties lead to inefficient memory accesses of inputs with existing hardware. In this work, we first investigate the overhead of the deformable convolution on embedded FPGA SoCs, and then show the accuracy-latency tradeoffs for a set of algorithm modifications including full versus depthwise, fixed-shape, and limited-range. These modifications benefit the energy efficiency for embedded devices in general as they reduce the compute complexity. We then build an efficient object detection network with modified deformable convolutions and quantize the network using state-of-the-art quantization methods. We implement a unified hardware engine on FPGA to support all the operations in the network. Preliminary experiments show that little accuracy is compromised and speedup can be achieved with our co-design optimization for the deformable convolution.
Submitted 18 February, 2020;
originally announced February 2020.
-
AutoPhase: Compiler Phase-Ordering for High Level Synthesis with Deep Reinforcement Learning
Authors:
Ameer Haj-Ali,
Qijing Huang,
William Moses,
John Xiang,
Ion Stoica,
Krste Asanovic,
John Wawrzynek
Abstract:
The performance of the code generated by a compiler depends on the order in which the optimization passes are applied. In high-level synthesis, the quality of the generated circuit relates directly to the code generated by the front-end compiler. Choosing a good order--often referred to as the phase-ordering problem--is an NP-hard problem. In this paper, we evaluate a new technique to address the phase-ordering problem: deep reinforcement learning. We implement a framework in the context of the LLVM compiler to optimize the ordering for HLS programs and compare the performance of deep reinforcement learning to state-of-the-art algorithms that address the phase-ordering problem. Overall, our framework runs one to two orders of magnitude faster than these algorithms, and achieves a 16% improvement in circuit performance over the -O3 compiler flag.
Submitted 3 April, 2019; v1 submitted 14 January, 2019;
originally announced January 2019.
-
Synetgy: Algorithm-hardware Co-design for ConvNet Accelerators on Embedded FPGAs
Authors:
Yifan Yang,
Qijing Huang,
Bichen Wu,
Tianjun Zhang,
Liang Ma,
Giulio Gambardella,
Michaela Blott,
Luciano Lavagno,
Kees Vissers,
John Wawrzynek,
Kurt Keutzer
Abstract:
Using FPGAs to accelerate ConvNets has attracted significant attention in recent years. However, FPGA accelerator design has not leveraged the latest progress of ConvNets. As a result, key application characteristics such as frames-per-second (FPS) are ignored in favor of simply counting GOPs, and results on accuracy, which is critical to application success, are often not even reported. In this work, we adopt an algorithm-hardware co-design approach to develop a ConvNet accelerator called Synetgy and a novel ConvNet model called DiracDeltaNet$^{\dagger}$. Both the accelerator and ConvNet are tailored to FPGA requirements. DiracDeltaNet, as the name suggests, is a ConvNet with only $1\times 1$ convolutions while spatial convolutions are replaced by more efficient shift operations. DiracDeltaNet achieves competitive accuracy on ImageNet (88.7% top-5), but with 42$\times$ fewer parameters and 48$\times$ fewer OPs than VGG16. We further quantize DiracDeltaNet's weights to 4 bits and activations to 4 bits, with less than 1% accuracy loss. These quantizations exploit the nature of FPGA hardware well. In short, DiracDeltaNet's small model size, low computational OP count, low precision and simplified operators allow us to co-design a highly customized computing unit for an FPGA. We implement the computing units for DiracDeltaNet on an Ultra96 SoC system through high-level synthesis. Our accelerator's final top-5 accuracy of 88.1% on ImageNet is higher than that of all previously reported embedded FPGA accelerators. In addition, the accelerator reaches an inference speed of 66.3 FPS on the ImageNet classification task, surpassing prior works with similar accuracy by at least 11.6$\times$.
Submitted 10 May, 2020; v1 submitted 21 November, 2018;
originally announced November 2018.
-
Proceedings of the 3rd International Workshop on Overlay Architectures for FPGAs (OLAF 2017)
Authors:
Hayden Kwok-Hay So,
John Wawrzynek
Abstract:
The 3rd International Workshop on Overlay Architectures for FPGAs (OLAF 2017) was held on 22 Feb, 2017 as a co-located workshop at the 25th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA 2017). This year, the program committee selected 3 papers and 3 extended abstracts to be presented at the workshop, which are subsequently collected in this online volume.
Submitted 5 March, 2019; v1 submitted 27 April, 2017;
originally announced April 2017.
-
High Level Synthesis with a Dataflow Architectural Template
Authors:
Shaoyi Cheng,
John Wawrzynek
Abstract:
In this work, we present a new approach to high level synthesis (HLS), where high level functions are first mapped to an architectural template, before hardware synthesis is performed. As FPGA platforms are especially suitable for implementing streaming processing pipelines, we perform transformations on conventional high level programs where they are turned into multi-stage dataflow engines [1]. This target template naturally overlaps slow memory data accesses with computations and therefore has much better tolerance towards memory subsystem latency. Using a state-of-the-art HLS tool for the actual circuit generation, we observe up to 9x improvement in overall performance when the dataflow architectural template is used as an intermediate compilation target.
Submitted 21 June, 2016;
originally announced June 2016.
-
Proceedings of the 2nd International Workshop on Overlay Architectures for FPGAs (OLAF 2016)
Authors:
Hayden Kwok-Hay So,
John Wawrzynek
Abstract:
The 2nd International Workshop on Overlay Architectures for FPGAs (OLAF 2016) was held on 21 Mar, 2016 as a co-located workshop at the 24th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA 2016). This year, the program committee selected 6 papers and 3 extended abstracts to be presented at the workshop, which are subsequently collected in this online volume.
Submitted 26 May, 2016;
originally announced May 2016.