-
BRIDGES: Bridging Graph Modality and Large Language Models within EDA Tasks
Authors:
Wei Li,
Yang Zou,
Christopher Ellis,
Ruben Purdy,
Shawn Blanton,
José M. F. Moura
Abstract:
While many EDA tasks already involve graph-based data, existing LLMs in EDA primarily either represent graphs as sequential text, or simply ignore graph-structured data that might be beneficial like dataflow graphs of RTL code. Recent studies have found that LLM performance suffers when graphs are represented as sequential text, and using additional graph information significantly boosts performan…
▽ More
While many EDA tasks already involve graph-based data, existing LLMs in EDA primarily either represent graphs as sequential text, or simply ignore graph-structured data that might be beneficial like dataflow graphs of RTL code. Recent studies have found that LLM performance suffers when graphs are represented as sequential text, and using additional graph information significantly boosts performance. To address these challenges, we introduce BRIDGES, a framework designed to incorporate graph modality into LLMs for EDA tasks. BRIDGES integrates an automated data generation workflow, a solution that combines graph modality with LLM, and a comprehensive evaluation suite. First, we establish an LLM-driven workflow to generate RTL and netlist-level data, converting them into dataflow and netlist graphs with function descriptions. This workflow yields a large-scale dataset comprising over 500,000 graph instances and more than 1.5 billion tokens. Second, we propose a lightweight cross-modal projector that encodes graph representations into text-compatible prompts, enabling LLMs to effectively utilize graph data without architectural modifications. Experimental results demonstrate 2x to 10x improvements across multiple tasks compared to text-only baselines, including accuracy in design retrieval, type prediction and perplexity in function description, with negligible computational overhead (<1% model weights increase and <30% additional runtime overhead). Even without additional LLM finetuning, our results outperform text-only by a large margin. We plan to release BRIDGES, including the dataset, models, and training flow.
△ Less
Submitted 7 April, 2025;
originally announced April 2025.
-
Tempus Core: Area-Power Efficient Temporal-Unary Convolution Core for Low-Precision Edge DLAs
Authors:
Prabhu Vellaisamy,
Harideep Nair,
Thomas Kang,
Yichen Ni,
Haoyang Fan,
Bin Qi,
Jeff Chen,
Shawn Blanton,
John Paul Shen
Abstract:
The increasing complexity of deep neural networks (DNNs) poses significant challenges for edge inference deployment due to resource and power constraints of edge devices. Recent works on unary-based matrix multiplication hardware aim to leverage data sparsity and low-precision values to enhance hardware efficiency. However, the adoption and integration of such unary hardware into commercial deep l…
▽ More
The increasing complexity of deep neural networks (DNNs) poses significant challenges for edge inference deployment due to resource and power constraints of edge devices. Recent works on unary-based matrix multiplication hardware aim to leverage data sparsity and low-precision values to enhance hardware efficiency. However, the adoption and integration of such unary hardware into commercial deep learning accelerators (DLA) remain limited due to processing element (PE) array dataflow differences. This work presents Tempus Core, a convolution core with highly scalable unary-based PE array comprising of tub (temporal-unary-binary) multipliers that seamlessly integrates with the NVDLA (NVIDIA's open-source DLA for accelerating CNNs) while maintaining dataflow compliance and boosting hardware efficiency. Analysis across various datapath granularities shows that for INT8 precision in 45nm CMOS, Tempus Core's PE cell unit (PCU) yields 59.3% and 15.3% reductions in area and power consumption, respectively, over NVDLA's CMAC unit. Considering a 16x16 PE array in Tempus Core, area and power improves by 75% and 62%, respectively, while delivering 5x and 4x iso-area throughput improvements for INT8 and INT4 precisions. Post-place and route analysis of Tempus Core's PCU shows that the 16x4 PE array for INT4 precision in 45nm CMOS requires only 0.017 mm^2 die area and consumes only 6.2mW of total power. We demonstrate that area-power efficient unary-based hardware can be seamlessly integrated into conventional DLAs, paving the path for efficient unary hardware for edge AI inference.
△ Less
Submitted 25 December, 2024;
originally announced December 2024.
-
tubGEMM: Energy-Efficient and Sparsity-Effective Temporal-Unary-Binary Based Matrix Multiply Unit
Authors:
Prabhu Vellaisamy,
Harideep Nair,
Joseph Finn,
Manav Trivedi,
Albert Chen,
Anna Li,
Tsung-Han Lin,
Perry Wang,
Shawn Blanton,
John Paul Shen
Abstract:
General Matrix Multiplication (GEMM) is a ubiquitous compute kernel in deep learning (DL). To support energy-efficient edge-native processing, new GEMM hardware units have been proposed that operate on unary encoded bitstreams using much simpler hardware. Most unary approaches thus far focus on rate-based unary encoding of values and perform stochastic approximate computation. This work presents t…
▽ More
General Matrix Multiplication (GEMM) is a ubiquitous compute kernel in deep learning (DL). To support energy-efficient edge-native processing, new GEMM hardware units have been proposed that operate on unary encoded bitstreams using much simpler hardware. Most unary approaches thus far focus on rate-based unary encoding of values and perform stochastic approximate computation. This work presents tubGEMM, a novel matrix-multiply unit design that employs hybrid temporal-unary and binary (tub) encoding and performs exact (not approximate) GEMM. It intrinsically exploits dynamic value sparsity to improve energy efficiency. Compared to the current best unary design uGEMM, tubGEMM significantly reduces area, power, and energy by 89\%, 87\%, and 50\%, respectively. A tubGEMM design performing 128x128 matrix multiply on 8-bit integers, in commercial TSMC N5 (5nm) process node, consumes just 0.22 mm^2 die area, 417.72 mW power, and 8.86 uJ energy, assuming no sparsity. Typical sparsity in DL workloads (MobileNetv2, ResNet-50) reduces energy by more than 3x, and lowering precision to 4 and 2 bits further reduces it by 24x and 104x respectively.
△ Less
Submitted 23 December, 2024;
originally announced December 2024.
-
Commercial Evaluation of Zero-Skipping MAC Design for Bit Sparsity Exploitation in DL Inference
Authors:
Harideep Nair,
Prabhu Vellaisamy,
Tsung-Han Lin,
Perry Wang,
Shawn Blanton,
John Paul Shen
Abstract:
General Matrix Multiply (GEMM) units, consisting of multiply-accumulate (MAC) arrays, perform bulk of the computation in deep learning (DL). Recent work has proposed a novel MAC design, Bit-Pragmatic (PRA), capable of dynamically exploiting bit sparsity. This work presents OzMAC (Omit-zero-MAC), a modified re-implementation of PRA, but extends beyond earlier works by performing rigorous post-synth…
▽ More
General Matrix Multiply (GEMM) units, consisting of multiply-accumulate (MAC) arrays, perform bulk of the computation in deep learning (DL). Recent work has proposed a novel MAC design, Bit-Pragmatic (PRA), capable of dynamically exploiting bit sparsity. This work presents OzMAC (Omit-zero-MAC), a modified re-implementation of PRA, but extends beyond earlier works by performing rigorous post-synthesis evaluation against binary MAC design across multiple bitwidths and clock frequencies using TSMC N5 process node to assess commercial implementation potential. We demonstrate the existence of high bit sparsity in eight pretrained INT8 DL workloads and show that 8-bit OzMAC improves all three metrics of area, power, and energy significantly by 21%, 70%, and 28%, respectively. Similar improvements are achieved when scaling data precisions (4, 8, 16 bits) and clock frequencies (0.5 GHz, 1 GHz, 1.5 GHz). For the 8-bit OzMAC, scaling its frequency to normalize the throughput, it still achieves 30% improvement on both power and energy.
△ Less
Submitted 2 January, 2025; v1 submitted 29 February, 2024;
originally announced February 2024.