Showing 1–27 of 27 results for author: Khailany, B

Searching in archive cs.
  1. arXiv:2509.25149

    cs.CL cs.AI cs.LG

    Pretraining Large Language Models with NVFP4

    Authors: NVIDIA, Felix Abecassis, Anjulie Agrusa, Dong Ahn, Jonah Alben, Stefania Alborghetti, Michael Andersch, Sivakumar Arayandi, Alexis Bjorlin, Aaron Blakeman, Evan Briones, Ian Buck, Bryan Catanzaro, Jinhang Choi, Mike Chrzanowski, Eric Chung, Victor Cui, Steve Dai, Bita Darvish Rouhani, Carlo del Mundo, Deena Donia, Burc Eryilmaz, Henry Estela, Abhinav Goel, Oleg Goncharov, et al. (64 additional authors not shown)

    Abstract: Large Language Models (LLMs) today are powerful problem solvers across many domains, and they continue to get stronger as they scale in model size, training set size, and training set quality, as shown by extensive research and experimentation across the industry. Training a frontier model today requires on the order of tens to hundreds of yottaflops, which is a massive investment of time, compute…

    Submitted 29 September, 2025; originally announced September 2025.

  2. arXiv:2506.14074

    cs.LG cs.AR

    Comprehensive Verilog Design Problems: A Next-Generation Benchmark Dataset for Evaluating Large Language Models and Agents on RTL Design and Verification

    Authors: Nathaniel Pinckney, Chenhui Deng, Chia-Tung Ho, Yun-Da Tsai, Mingjie Liu, Wenfei Zhou, Brucek Khailany, Haoxing Ren

    Abstract: We present the Comprehensive Verilog Design Problems (CVDP) benchmark, a new dataset and infrastructure to advance LLM and agent research in hardware design and verification. CVDP includes 783 problems across 13 task categories, covering RTL generation, verification, debugging, specification alignment, and technical Q&A authored by experienced hardware engineers. Problems are offered in both non-a…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: 16 pages with appendix

  3. arXiv:2504.14152

    cs.AR cs.LG

    FGMP: Fine-Grained Mixed-Precision Weight and Activation Quantization for Hardware-Accelerated LLM Inference

    Authors: Coleman Hooper, Charbel Sakr, Ben Keller, Rangharajan Venkatesan, Kurt Keutzer, Sophia Shao, Brucek Khailany

    Abstract: Quantization is a powerful tool to improve large language model (LLM) inference efficiency by utilizing more energy-efficient low-precision datapaths and reducing memory footprint. However, accurately quantizing LLM weights and activations to low precision is challenging without degrading model accuracy. We propose fine-grained mixed precision (FGMP) quantization, a post-training mixed-precision q…

    Submitted 18 April, 2025; originally announced April 2025.

  4. arXiv:2504.01962

    cs.AR

    Marco: Configurable Graph-Based Task Solving and Multi-AI Agents Framework for Hardware Design

    Authors: Chia-Tung Ho, Jing Gong, Yunsheng Bai, Chenhui Deng, Haoxing Ren, Brucek Khailany

    Abstract: Hardware design presents numerous challenges stemming from its complexity and advancing technologies. These challenges result in longer turn-around-time (TAT) for optimizing performance, power, area, and cost (PPAC) during synthesis, verification, physical design, and reliability loops. Large Language Models (LLMs) have shown remarkable capacity to comprehend and generate natural language at a mas…

    Submitted 25 February, 2025; originally announced April 2025.

    Comments: 3 pages, 5 figures, 2 tables

  5. arXiv:2503.16681

    cs.GR cs.AI cs.AR

    GauRast: Enhancing GPU Triangle Rasterizers to Accelerate 3D Gaussian Splatting

    Authors: Sixu Li, Ben Keller, Yingyan Celine Lin, Brucek Khailany

    Abstract: 3D intelligence leverages rich 3D features and stands as a promising frontier in AI, with 3D rendering fundamental to many downstream applications. 3D Gaussian Splatting (3DGS), an emerging high-quality 3D rendering method, requires significant computation, making real-time execution on existing GPU-equipped edge devices infeasible. Previous efforts to accelerate 3DGS rely on dedicated accelerator…

    Submitted 10 April, 2025; v1 submitted 20 March, 2025; originally announced March 2025.

    Comments: DAC 2025

  6. arXiv:2502.05376

    cs.LG

    BCQ: Block Clustered Quantization for 4-bit (W4A4) LLM Inference

    Authors: Reena Elangovan, Charbel Sakr, Anand Raghunathan, Brucek Khailany

    Abstract: Post-training quantization (PTQ) is a promising approach to reducing the storage and computational requirements of large language models (LLMs) without additional training cost. Recent PTQ studies have primarily focused on quantizing only weights to sub-8-bits while maintaining activations at 8-bits or higher. Accurate sub-8-bit quantization for both weights and activations without relying on quan…

    Submitted 7 February, 2025; originally announced February 2025.

  7. arXiv:2501.15448

    cs.CV cs.AI cs.AR cs.LG

    SQ-DM: Accelerating Diffusion Models with Aggressive Quantization and Temporal Sparsity

    Authors: Zichen Fan, Steve Dai, Rangharajan Venkatesan, Dennis Sylvester, Brucek Khailany

    Abstract: Diffusion models have gained significant popularity in image generation tasks. However, generating high-quality content remains notably slow because it requires running model inference over many time steps. To accelerate these models, we propose to aggressively quantize both weights and activations, while simultaneously promoting significant activation sparsity. We further observe that the stated…

    Submitted 26 January, 2025; originally announced January 2025.

    Comments: 7 pages, 12 figures, 2 tables

  8. arXiv:2410.05437

    cs.LG

    ESPACE: Dimensionality Reduction of Activations for Model Compression

    Authors: Charbel Sakr, Brucek Khailany

    Abstract: We propose ESPACE, an LLM compression technique based on dimensionality reduction of activations. Unlike prior works on weight-centric tensor decomposition, ESPACE projects activations onto a pre-calibrated set of principal components. The activation-centrality of the approach enables retraining LLMs with no loss of expressivity; while at inference, weight decomposition is obtained as a byproduct…

    Submitted 7 October, 2024; originally announced October 2024.

    Comments: Published as a paper at NeurIPS 2024

  9. arXiv:2408.11053

    cs.AR cs.AI

    Revisiting VerilogEval: A Year of Improvements in Large-Language Models for Hardware Code Generation

    Authors: Nathaniel Pinckney, Christopher Batten, Mingjie Liu, Haoxing Ren, Brucek Khailany

    Abstract: The application of large-language models (LLMs) to digital hardware code generation is an emerging field, with most LLMs primarily trained on natural language and software code. Hardware code like Verilog constitutes a small portion of training data, and few hardware benchmarks exist. The open-source VerilogEval benchmark, released in November 2023, provided a consistent evaluation framework for L…

    Submitted 3 February, 2025; v1 submitted 20 August, 2024; originally announced August 2024.

    Comments: This paper revisits and improves the benchmark first presented in arXiv:2309.07544. Twenty-one pages, five figures

  10. arXiv:2408.08927

    cs.AI cs.CL

    VerilogCoder: Autonomous Verilog Coding Agents with Graph-based Planning and Abstract Syntax Tree (AST)-based Waveform Tracing Tool

    Authors: Chia-Tung Ho, Haoxing Ren, Brucek Khailany

    Abstract: Due to the growing complexity of modern Integrated Circuits (ICs), automating hardware design can prevent a significant amount of human error from the engineering process and result in less errors. Verilog is a popular hardware description language for designing and modeling digital systems; thus, Verilog generation is one of the emerging areas of research to facilitate the design process. In this…

    Submitted 5 March, 2025; v1 submitted 15 August, 2024; originally announced August 2024.

    Comments: Main paper 7 pages, references 1 page; this is the version accepted by AAAI 2025

  11. arXiv:2311.00176

    cs.CL

    ChipNeMo: Domain-Adapted LLMs for Chip Design

    Authors: Mingjie Liu, Teodor-Dumitru Ene, Robert Kirby, Chris Cheng, Nathaniel Pinckney, Rongjian Liang, Jonah Alben, Himyanshu Anand, Sanmitra Banerjee, Ismet Bayraktaroglu, Bonita Bhaskaran, Bryan Catanzaro, Arjun Chaudhuri, Sharon Clay, Bill Dally, Laura Dang, Parikshit Deshpande, Siddhanth Dhodhi, Sameer Halepete, Eric Hill, Jiashang Hu, Sumit Jain, Ankit Jindal, Brucek Khailany, George Kokai, et al. (17 additional authors not shown)

    Abstract: ChipNeMo aims to explore the applications of large language models (LLMs) for industrial chip design. Instead of directly deploying off-the-shelf commercial or open-source LLMs, we instead adopt the following domain adaptation techniques: domain-adaptive tokenization, domain-adaptive continued pretraining, model alignment with domain-specific instructions, and domain-adapted retrieval models. We e…

    Submitted 4 April, 2024; v1 submitted 31 October, 2023; originally announced November 2023.

    Comments: Updated results for ChipNeMo-70B model

  12. arXiv:2309.07544

    cs.LG cs.SE

    VerilogEval: Evaluating Large Language Models for Verilog Code Generation

    Authors: Mingjie Liu, Nathaniel Pinckney, Brucek Khailany, Haoxing Ren

    Abstract: The increasing popularity of large language models (LLMs) has paved the way for their application in diverse domains. This paper proposes a benchmarking framework tailored specifically for evaluating LLM performance in the context of Verilog code generation for hardware design and verification. We present a comprehensive evaluation dataset consisting of 156 problems from the Verilog instructional…

    Submitted 9 December, 2023; v1 submitted 14 September, 2023; originally announced September 2023.

    Comments: ICCAD 2023 Invited Paper. Prior version contained errors in the numbers reported for gpt-4 in Table II

  13. arXiv:2211.16749

    cs.LG cs.AI cs.AR

    HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression

    Authors: Jiaqi Gu, Ben Keller, Jean Kossaifi, Anima Anandkumar, Brucek Khailany, David Z. Pan

    Abstract: Transformers have attained superior performance in natural language processing and computer vision. Their self-attention and feedforward layers are overparameterized, limiting inference speed and energy efficiency. Tensor decomposition is a promising technique to reduce parameter redundancy by leveraging tensor algebraic properties to express the parameters in a factorized form. Prior efforts used…

    Submitted 30 November, 2022; originally announced November 2022.

    Comments: 9 pages. Accepted to NeurIPS ML for System Workshop 2022 (Spotlight)

  14. arXiv:2210.15765

    cs.LG

    An Adversarial Active Sampling-based Data Augmentation Framework for Manufacturable Chip Design

    Authors: Mingjie Liu, Haoyu Yang, Zongyi Li, Kumara Sastry, Saumyadip Mukhopadhyay, Selim Dogru, Anima Anandkumar, David Z. Pan, Brucek Khailany, Haoxing Ren

    Abstract: Lithography modeling is a crucial problem in chip design to ensure a chip design mask is manufacturable. It requires rigorous simulations of optical and chemical models that are computationally expensive. Recent developments in machine learning have provided alternative solutions in replacing the time-consuming lithography simulations with deep neural networks. However, the considerable accuracy d…

    Submitted 27 October, 2022; originally announced October 2022.

  15. arXiv:2207.04056

    cs.LG cs.AI

    Large Scale Mask Optimization Via Convolutional Fourier Neural Operator and Litho-Guided Self Training

    Authors: Haoyu Yang, Zongyi Li, Kumara Sastry, Saumyadip Mukhopadhyay, Anima Anandkumar, Brucek Khailany, Vivek Singh, Haoxing Ren

    Abstract: Machine learning techniques have been extensively studied for mask optimization problems, aiming at better mask printability, shorter turnaround time, better mask manufacturability, and so on. However, most of these researches are focusing on the initial solution generation of small design regions. To further realize the potential of machine learning techniques on mask optimization tasks, we prese…

    Submitted 8 July, 2022; originally announced July 2022.

    Comments: 9 pages, 10 figures, in preparation for journal submission

    ACM Class: J.6; B.7.2

  16. arXiv:2206.06501

    cs.LG

    Optimal Clipping and Magnitude-aware Differentiation for Improved Quantization-aware Training

    Authors: Charbel Sakr, Steve Dai, Rangharajan Venkatesan, Brian Zimmer, William J. Dally, Brucek Khailany

    Abstract: Data clipping is crucial in reducing noise in quantization operations and improving the achievable accuracy of quantization-aware training (QAT). Current practices rely on heuristics to set clipping threshold scalars and cannot be shown to be optimal. We propose Optimally Clipped Tensors And Vectors (OCTAV), a recursive algorithm to determine MSE-optimal clipping scalars. Derived from the fast New…

    Submitted 13 June, 2022; originally announced June 2022.

    Comments: Published as a spotlight paper at ICML 2022. Paper contains 16 pages, 5 figures, and 6 tables

  17. arXiv:2203.08616

    cs.OH cs.LG

    Generic Lithography Modeling with Dual-band Optics-Inspired Neural Networks

    Authors: Haoyu Yang, Zongyi Li, Kumara Sastry, Saumyadip Mukhopadhyay, Mark Kilgard, Anima Anandkumar, Brucek Khailany, Vivek Singh, Haoxing Ren

    Abstract: Lithography simulation is a critical step in VLSI design and optimization for manufacturability. Existing solutions for highly accurate lithography simulation with rigorous models are computationally expensive and slow, even when equipped with various approximation techniques. Recently, machine learning has provided alternative solutions for lithography simulation tasks such as coarse-grained edge…

    Submitted 12 March, 2022; originally announced March 2022.

    Comments: 9 pages, 9 figures; accepted at 59th Design Automation Conference

  18. arXiv:2203.06117

    cs.LG cs.DC

    GATSPI: GPU Accelerated Gate-Level Simulation for Power Improvement

    Authors: Yanqing Zhang, Haoxing Ren, Akshay Sridharan, Brucek Khailany

    Abstract: In this paper, we present GATSPI, a novel GPU accelerated logic gate simulator that enables ultra-fast power estimation for industry sized ASIC designs with millions of gates. GATSPI is written in PyTorch with custom CUDA kernels for ease of coding and maintainability. It achieves simulation kernel speedup of up to 1668X on a single-GPU system and up to 7412X on a multiple-GPU system when compared…

    Submitted 11 March, 2022; originally announced March 2022.

  19. arXiv:2107.07044

    cs.LG

    NVCell: Standard Cell Layout in Advanced Technology Nodes with Reinforcement Learning

    Authors: Haoxing Ren, Matthew Fojtik, Brucek Khailany

    Abstract: High quality standard cell layout automation in advanced technology nodes is still challenging in the industry today because of complex design rules. In this paper we introduce an automatic standard cell layout generator called NVCell that can generate layouts with equal or smaller area for over 90% of single row cells in an industry standard cell library on an advanced technology node. NVCell lev…

    Submitted 9 July, 2021; originally announced July 2021.

  20. arXiv:2106.13914

    cs.LG cs.AR

    LNS-Madam: Low-Precision Training in Logarithmic Number System using Multiplicative Weight Update

    Authors: Jiawei Zhao, Steve Dai, Rangharajan Venkatesan, Brian Zimmer, Mustafa Ali, Ming-Yu Liu, Brucek Khailany, Bill Dally, Anima Anandkumar

    Abstract: Representing deep neural networks (DNNs) in low-precision is a promising approach to enable efficient acceleration and memory reduction. Previous methods that train DNNs in low-precision typically keep a copy of weights in high-precision during the weight updates. Directly training with low-precision weights leads to accuracy degradation due to complex interactions between the low-precision number…

    Submitted 23 August, 2022; v1 submitted 25 June, 2021; originally announced June 2021.

  21. arXiv:2103.09301

    cs.AR

    Softermax: Hardware/Software Co-Design of an Efficient Softmax for Transformers

    Authors: Jacob R. Stevens, Rangharajan Venkatesan, Steve Dai, Brucek Khailany, Anand Raghunathan

    Abstract: Transformers have transformed the field of natural language processing. This performance is largely attributed to the use of stacked self-attention layers, each of which consists of matrix multiplies as well as softmax operations. As a result, unlike other neural networks, the softmax operation accounts for a significant fraction of the total run-time of Transformers. To address this, we propose S…

    Submitted 16 March, 2021; originally announced March 2021.

    Comments: To appear in Proceedings of the 58th Design Automation Conference (DAC '21)

  22. arXiv:2102.06326

    cs.LO cs.AR cs.FL

    Verifying High-Level Latency-Insensitive Designs with Formal Model Checking

    Authors: Steve Dai, Alicia Klinefelter, Haoxing Ren, Rangharajan Venkatesan, Ben Keller, Nathaniel Pinckney, Brucek Khailany

    Abstract: Latency-insensitive design mitigates increasing interconnect delay and enables productive component reuse in complex digital systems. This design style has been adopted in high-level design flows because untimed functional blocks connected through latency-insensitive interfaces provide a natural communication abstraction. However, latency-insensitive design with high-level languages also introduce…

    Submitted 11 February, 2021; originally announced February 2021.

  23. arXiv:2102.04503

    cs.LG cs.AR

    VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision Neural Network Inference

    Authors: Steve Dai, Rangharajan Venkatesan, Haoxing Ren, Brian Zimmer, William J. Dally, Brucek Khailany

    Abstract: Quantization enables efficient acceleration of deep neural networks by reducing model memory footprint and exploiting low-cost integer math hardware units. Quantization maps floating-point weights and activations in a trained model to low-bitwidth integer values using scale factors. Excessive quantization, reducing precision too aggressively, results in accuracy degradation. When scale factors are…

    Submitted 8 February, 2021; originally announced February 2021.

  24. arXiv:2012.10597

    cs.AR

    MAVIREC: ML-Aided Vectored IR-Drop Estimation and Classification

    Authors: Vidya A. Chhabria, Yanqing Zhang, Haoxing Ren, Ben Keller, Brucek Khailany, Sachin S. Sapatnekar

    Abstract: Vectored IR drop analysis is a critical step in chip signoff that checks the power integrity of an on-chip power delivery network. Due to the prohibitive runtimes of dynamic IR drop analysis, the large number of test patterns must be whittled down to a small subset of worst-case IR vectors. Unlike the traditional slow heuristic method that select a few vectors with incomplete coverage, MAVIREC use…

    Submitted 18 December, 2020; originally announced December 2020.

    Comments: 6 pages paper. This has been reviewed at Design Automation and Test Conference 2021 and has been accepted as a four page paper. This is a longer version of that

  25. PowerNet: Transferable Dynamic IR Drop Estimation via Maximum Convolutional Neural Network

    Authors: Zhiyao Xie, Haoxing Ren, Brucek Khailany, Ye Sheng, Santosh Santosh, Jiang Hu, Yiran Chen

    Abstract: IR drop is a fundamental constraint required by almost all chip designs. However, its evaluation usually takes a long time that hinders mitigation techniques for fixing its violations. In this work, we develop a fast dynamic IR drop estimation technique, named PowerNet, based on a convolutional neural network (CNN). It can handle both vector-based and vectorless IR analyses. Moreover, the proposed…

    Submitted 26 November, 2020; originally announced November 2020.

    Journal ref: 2020 Asia and South Pacific Design Automation Conference (ASP-DAC 2020)

  26. FIST: A Feature-Importance Sampling and Tree-Based Method for Automatic Design Flow Parameter Tuning

    Authors: Zhiyao Xie, Guan-Qi Fang, Yu-Hung Huang, Haoxing Ren, Yanqing Zhang, Brucek Khailany, Shao-Yun Fang, Jiang Hu, Yiran Chen, Erick Carvajal Barboza

    Abstract: Design flow parameters are of utmost importance to chip design quality and require a painfully long time to evaluate their effects. In reality, flow parameter tuning is usually performed manually based on designers' experience in an ad hoc manner. In this work, we introduce a machine learning-based automatic parameter tuning methodology that aims to find the best design quality with a limited numb…

    Submitted 26 November, 2020; originally announced November 2020.

    Journal ref: 2020 Asia and South Pacific Design Automation Conference (ASP-DAC 2020)

  27. arXiv:1708.04485

    cs.NE cs.AR cs.LG

    SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks

    Authors: Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, William J. Dally

    Abstract: Convolutional Neural Networks (CNNs) have emerged as a fundamental technology for machine learning. High performance and extreme energy efficiency are critical for deployments of CNNs in a wide range of situations, especially mobile platforms such as autonomous vehicles, cameras, and electronic personal assistants. This paper introduces the Sparse CNN (SCNN) accelerator architecture, which improve…

    Submitted 23 May, 2017; originally announced August 2017.