The Open-Source BlackParrot-BedRock Cache Coherence System
Authors:
Mark Unruh Wyse
Abstract:
This dissertation revisits the topic of programmable cache coherence engines in the context of modern shared-memory multicore processors. First, the open-source BedRock cache coherence protocol is described. BedRock employs the canonical MOESIF coherence states and reduces implementation burden by eliminating transient coherence states from the protocol. The protocol's design complexity, concurren…
▽ More
This dissertation revisits the topic of programmable cache coherence engines in the context of modern shared-memory multicore processors. First, the open-source BedRock cache coherence protocol is described. BedRock employs the canonical MOESIF coherence states and reduces implementation burden by eliminating transient coherence states from the protocol. The protocol's design complexity, concurrency, and verification effort are analyzed and compared to a canonical directory-based invalidate coherence protocol. Second, the architecture and microarchitecture of three separate cache coherence directories implementing the BedRock protocol within the BlackParrot 64-bit RISC-V multicore processor, collectively called BlackParrot-BedRock (BP-BedRock), are described. A fixed-function coherence directory engine implementation provides a baseline design for performance and area comparisons. A microcode-programmable coherence directory implementation demonstrates the feasibility of implementing a programmable coherence engine capable of maintaining sufficient protocol processing performance. A hybrid fixed-function and programmable coherence directory blends the protocol processing performance of the fixed-function design with the programmable flexibility of the microcode-programmable design. Collectively, the BedRock coherence protocol and its three BP-BedRock implementations demonstrate the feasibility and challenges of including programmable logic within the coherence system of modern shared-memory multicore processors, paving the way for future research into the application- and system-level benefits of programmable coherence engines.
△ Less
Submitted 1 May, 2025;
originally announced May 2025.
The BlackParrot BedRock Cache Coherence System
Authors:
Mark Wyse,
Daniel Petrisko,
Farzam Gilani,
Yuan-Mao Chueh,
Paul Gao,
Dai Cheol Jung,
Sripathi Muralitharan,
Shashank Vijaya Ranga,
Mark Oskin,
Michael Taylor
Abstract:
This paper presents BP-BedRock, the open-source cache coherence protocol and system implemented within the BlackParrot 64-bit RISC-V multicore processor. BP-BedRock implements the BedRock directory-based MOESIF cache coherence protocol and includes two different open-source coherence protocol engines, one FSM-based and the other microcode programmable. Both coherence engines support coherent uncac…
▽ More
This paper presents BP-BedRock, the open-source cache coherence protocol and system implemented within the BlackParrot 64-bit RISC-V multicore processor. BP-BedRock implements the BedRock directory-based MOESIF cache coherence protocol and includes two different open-source coherence protocol engines, one FSM-based and the other microcode programmable. Both coherence engines support coherent uncacheable access to cacheable memory and L1-based atomic read-modify-write operations.
Fitted within the BlackParrot multicore, BP-BedRock has been silicon validated in a GlobalFoundries 12nm FinFET process and FPGA validated with both coherence engines in 8-core configurations, booting Linux and running off the shelf benchmarks. After describing BP-BedRock and the design of the two coherence engines, we study their performance by analyzing processing occupancy and running the Splash-3 benchmarks on the 8-core FPGA implementations. Careful design and coherence-specific ISA extensions enable the programmable controller to achieve performance within 1% of the fixed-function FSM controller on average (2.3% worst-case) as demonstrated on our FPGA test system. Analysis shows that the programmable coherence engine increases die area by only 4% in an ASIC process and increases logic utilization by only 6.3% on FPGA with one additional block RAM added per core.
△ Less
Submitted 11 November, 2022;
originally announced November 2022.