Hardware Architecture

Showing new listings for Friday, 16 May 2025

Total of 9 entries

[1] arXiv:2505.10060: Title: Basilisk: A 34 mm² End-to-End Open-Source 64-bit Linux-Capable RISC-V SoC in 130nm BiCMOS

Philippe Sauter, Thomas Benz, Paul Scheffler, Martin Povišer, Frank K. Gürkaynak, Luca Benini

Comments: 2 pages, 5 figures, accepted at Hot Chips 2025 as a poster submission

Subjects: Hardware Architecture (cs.AR)

End-to-end open-source electronic design automation (OSEDA) enables a collaborative approach to chip design conducive to supply chain diversification and zero-trust step-by-step design verification. However, existing end-to-end OSEDA flows have mostly been demonstrated on small designs and have not yet enabled large, industry-grade chips such as Linux-capable systems-on-chip (SoCs). This work presents Basilisk, the largest end-to-end open-source SoC to date. Basilisk's 34 mm², 2.7 MGE design features a 64-bit Linux-capable RISC-V core, a lightweight 124 MB/s DRAM controller, and extensive IO, including a USB 1.1 host, a video output, and a fully digital 62 Mb/s chip-to-chip (C2C) link. We implement Basilisk in IHP's open 130 nm BiCMOS technology, significantly improving on the state-of-the-art (SoA) OSEDA flow. Our enhancements of the Yosys-based synthesis flow improve design timing and area by 2.3x and 1.6x, respectively, while consuming significantly fewer system resources. By tuning OpenROAD place and route (P&R) to our design and technology, we decrease the die size by 12%. The fabricated Basilisk chip reaches 62 MHz at its nominal 1.2 V core voltage and up to 102 MHz at 1.64 V. It achieves a peak energy efficiency of 18.9 DP MFLOP/s/W at 0.88 V.

[2] arXiv:2505.10145: Title: An Integrated UVM-TLM Co-Simulation Framework for RISC-V Functional Verification and Performance Evaluation

Ruizhi Qiu, Yang Liu

Comments: 7 pages, 3 figures, This work is under consideration for conference publication

Subjects: Hardware Architecture (cs.AR); Performance (cs.PF)

The burgeoning RISC-V ecosystem necessitates efficient verification methodologies for complex processors. Traditional approaches often struggle to concurrently evaluate functional correctness and performance, or balance simulation speed with modeling accuracy. This paper introduces an integrated co-simulation framework leveraging Universal Verification Methodology (UVM) and Transaction-Level Modeling (TLM) for RISC-V processor validation. We present a configurable UVM-TLM model (vmodel) of a superscalar, out-of-order RISC-V core, featuring key microarchitectural modeling techniques such as credit-based pipeline flow control. This environment facilitates unified functional verification via co-simulation against the Spike ISA simulator and enables early-stage performance assessment using benchmarks like CoreMark, orchestrated within UVM. The methodology prioritizes integration, simulation efficiency, and acceptable fidelity for architectural exploration over cycle-level precision. Experimental results validate functional correctness and significant simulation speedup over RTL approaches, accelerating design iterations and enhancing verification coverage.

[3] arXiv:2505.10217 (cross-list from cs.OS): Title: Enabling Syscall Intercept for RISC-V

Petar Andrić, Aaron Call, Ramon Nou

Comments: RISC-V summit 2025 accepted

Subjects: Operating Systems (cs.OS); Hardware Architecture (cs.AR)

The European Union's technological sovereignty strategy centers around the RISC-V Instruction Set Architecture, with the European Processor Initiative leading efforts to build production-ready processors. Focusing on realizing a functional RISC-V ecosystem, the BZL initiative (this http URL) is working to create a software stack alongside the hardware. In this work, we detail the efforts made in porting a widely used syscall interception library, mainly used on AdHocFS (e.g., DAOS, GekkoFS), to RISC-V and how we overcame some of the limitations encountered.

[4] arXiv:2505.10248 (cross-list from cs.ET): Title: Scalable 28nm IC implementation of coupled oscillator network featuring tunable topology and complexity

S. Y. Neyaz, A. Ashok, M. Schiek, C. Grewing, A. Zambanini, S. van Waasen

Subjects: Emerging Technologies (cs.ET); Hardware Architecture (cs.AR)

Integrated circuit implementations of coupled oscillator networks have recently gained increased attention. The focus is usually on using these networks for analog computing, for example for solving computational optimization tasks. For use within analog computing, these networks are run close to critical dynamics. On the other hand, such networks are also used as an analogy of transport networks such as electrical power grids to answer the question of how exactly such critical dynamic states can be avoided. However, simulating large networks of coupled oscillators is computationally intensive, particularly for electronic ones. We have developed an integrated circuit using modified integrated Phase-Locked Loops (PLLs) that allows the topology as well as a complexity parameter of the network to be varied flexibly during operation. The proposed brain-inspired design employs a clustered architecture, with each cluster containing 7 PLLs featuring programmable coupling mechanisms. Additionally, the inclusion of a RISC-V processor enables future algorithmic implementations. Thus, we provide a practical alternative for large-scale network simulations both in the field of analog computing and transport network stability research.

[5] arXiv:2406.01698 (replaced): Title: Demystifying AI Platform Design for Distributed Inference of Next-Generation LLM models

Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Suvinay Subramanian, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna

Comments: 19 Pages, this https URL, this https URL

Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Large language models (LLMs) have shown remarkable performance across a wide range of applications, often outperforming human experts. However, deploying these gigantic models efficiently for diverse inference use cases requires carefully designed hardware platforms with ample computing, memory, and network resources. With constant innovation in LLM serving optimizations and model architecture evolving at breakneck speed, the hardware requirements to meet Service Level Objectives (SLOs) remain an open research question.
To answer the question, we present an analytical tool, GenZ, to efficiently navigate the relationship between diverse LLM model architectures (Dense, GQA, MoE, Mamba), LLM serving optimizations (chunking, speculative decoding, quantization), and AI platform design parameters. Our tool estimates LLM inference performance metrics for the given scenario. We have validated against real hardware platforms running various LLM models, achieving a max geomean error of this http URL use GenZ to identify compute, memory capacity, memory bandwidth, network latency, and network bandwidth requirements across diverse LLM inference use cases. We also study diverse architectural choices in use today (inspired by LLM serving platforms from several vendors) to help inform computer architects designing next-generation AI hardware accelerators and platforms. The trends and insights derived from GenZ can guide AI engineers deploying LLMs as well as computer architects designing next-generation hardware accelerators and platforms. Ultimately, this work sheds light on the platform design considerations for unlocking the full potential of large language models across a spectrum of applications. The source code is available at this https URL. Users can also try it at this https URL in a web browser without any setup.

[6] arXiv:2505.03780 (replaced): Title: GPU Performance Portability needs Autotuning

Burkhard Ringlein, Thomas Parnell, Radu Stoica

Comments: typos, fix grammatical mistakes

Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)

As LLMs grow in complexity, achieving state-of-the-art performance requires tight co-design across algorithms, software, and hardware. Today's reliance on a single dominant platform limits portability, creates vendor lock-in, and raises barriers for new AI hardware. In this work, we make the case for combining just-in-time (JIT) compilation with kernel parameter autotuning to enable portable LLM inference with state-of-the-art performance without code changes. Focusing on flash attention -- a widespread performance-critical LLM kernel -- we demonstrate that this approach explores up to 15x more kernel parameter configurations, produces significantly more diverse code across multiple dimensions, and even outperforms vendor-optimized implementations by up to 230%, all while reducing kernel code size by 70x and eliminating manual code optimizations. Our results highlight autotuning as a promising path to unlocking model portability across GPU vendors.

[7] arXiv:2505.07995 (replaced): Title: Spec2Assertion: Automatic Pre-RTL Assertion Generation using Large Language Models with Progressive Regularization

Fenghua Wu, Evan Pan, Rahul Kande, Michael Quinn, Aakash Tyagi, David Kebo Houngninou, Jeyavijayan Rajendran, Jiang Hu

Comments: 8 pages, 7 figures

Subjects: Hardware Architecture (cs.AR)

SystemVerilog Assertions (SVAs) play a critical role in detecting and debugging functional bugs in digital chip design. However, generating SVAs has traditionally been a manual, labor-intensive, and error-prone process. Recent advances in automatic assertion generation, particularly those using machine learning and large language models (LLMs), have shown promising potential, though most approaches remain in the early stages of development. In this work, we introduce Spec2Assertion, a new technique for automatically generating assertions from design specifications prior to RTL implementation. It leverages LLMs with progressive regularization and incorporates Chain-of-Thought (CoT) prompting to guide assertion synthesis. Additionally, we propose a new evaluation methodology that assesses assertion quality across a broad range of scenarios. Experiments on multiple benchmark designs show that Spec2Assertion generates 70% more syntax-correct assertions with 2x quality improvement on average compared to a recent state-of-the-art approach.

[8] arXiv:2505.06085 (replaced): Title: Assessing Tenstorrent's RISC-V MatMul Acceleration Capabilities

Hiari Pizzini Cavagna, Daniele Cesarini, Andrea Bartolini

Comments: Accepted to the Computational Aspects of Deep Learning Workshop at ISC High Performance 2025. To appear in the ISC High Performance 2025 Workshop Proceedings

Subjects: Performance (cs.PF); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)

The increasing demand for generative AI services such as Large Language Models (LLMs) has driven the need for specialized hardware architectures that optimize computational efficiency and energy consumption. This paper evaluates the performance of the Tenstorrent Grayskull e75 RISC-V accelerator for basic linear algebra kernels at reduced numerical precision, which are fundamental operations in LLM computations. We present a detailed characterization of Grayskull's execution model and of how grid size, matrix dimensions, data formats, and numerical precision impact computational efficiency. Furthermore, we compare Grayskull's performance against state-of-the-art architectures with tensor acceleration, including Intel Sapphire Rapids processors and two NVIDIA GPUs (V100 and A100). Whilst NVIDIA GPUs dominate raw performance, Grayskull demonstrates a competitive trade-off between power consumption and computational throughput, reaching a peak of 1.55 TFLOPs/Watt with BF16.

[9] arXiv:2505.06470 (replaced): Title: "vcd2df" -- Leveraging Data Science Insights for Hardware Security Research

Calvin Deutschbein, Jimmy Ostler, Hriday Raj

Comments: 6 pages, no figures, under submission at ACDSA 2025. Added co-author Hriday Raj during v2 as Hriday joined us for some Spark characterization as a domain specialist between initial drafts and reaching camera-readiness

Subjects: Cryptography and Security (cs.CR); Hardware Architecture (cs.AR)

In this work, we hope to expand the universe of security practitioners of open-source hardware by creating a bridge from hardware design languages (HDLs) to data science languages like Python and R through libraries that convert VCD (value change dump) files into data frames, the expected input type of modern data science tools. We show how insights can be derived in high-level languages from register transfer level (RTL) trace data. Additionally, we show a promising future direction in hardware security research leveraging the parallelism of the Spark DataFrame.
