-
Rigidity of surfaces with nonpositive Euler characteristic by the second eigenvalue of the Jacobi operator
Authors:
Márcio Batista,
Marcos P. Cavalcante,
Abraão Mendes,
Ivaldo Nunes
Abstract:
In this paper, we investigate the spectral properties of the Jacobi operator for immersed surfaces with nonpositive Euler characteristic, extending previous results in the field. We first prove a sharp upper bound for the second eigenvalue of the Jacobi operator for compact surfaces with nonpositive Euler characteristic that are fully immersed in the Euclidean sphere, and then we classify all such surfaces attaining this upper bound. Furthermore, we demonstrate that totally geodesic tori maximize the second eigenvalue among all compact orientable surfaces with positive genus in the product space $\mathbb{S}^1(r) \times \mathbb{S}^2(s)$.
Submitted 28 May, 2025;
originally announced May 2025.
-
Correcting noisy quantum gates with shortcuts to adiabaticity
Authors:
Moallison F. Cavalcante,
Bariş Çakmak,
Marcus V. S. Bonança,
Sebastian Deffner
Abstract:
Unitary quantum gates constitute the building blocks of quantum computing in the circuit paradigm. In this work, we engineer a locally driven two-qubit Hamiltonian whose instantaneous ground-state dynamics generates the controlled-NOT (CNOT) quantum gate. In practice, quantum gates have to be implemented in finite time; hence, non-adiabatic and external noise effects debilitate gate fidelities. Here, we show that counterdiabatic control can restore gate performance with near-perfect fidelities even in open quantum systems subject to decoherence.
Submitted 26 May, 2025;
originally announced May 2025.
-
Roadmap on Quantum Thermodynamics
Authors:
Steve Campbell,
Irene D'Amico,
Mario A. Ciampini,
Janet Anders,
Natalia Ares,
Simone Artini,
Alexia Auffèves,
Lindsay Bassman Oftelie,
Laetitia P. Bettmann,
Marcus V. S. Bonança,
Thomas Busch,
Michele Campisi,
Moallison F. Cavalcante,
Luis A. Correa,
Eloisa Cuestas,
Ceren B. Dag,
Salambô Dago,
Sebastian Deffner,
Adolfo Del Campo,
Andreas Deutschmann-Olek,
Sandro Donadi,
Emery Doucet,
Cyril Elouard,
Klaus Ensslin,
Paul Erker,
et al. (44 additional authors not shown)
Abstract:
The last two decades have seen quantum thermodynamics become a well-established field of research in its own right. In that time, it has demonstrated a remarkably broad applicability, ranging from providing foundational advances in the understanding of how thermodynamic principles apply at the nano-scale and in the presence of quantum coherence, to providing a guiding framework for the development of efficient quantum devices. Exquisite levels of control have allowed state-of-the-art experimental platforms to explore energetics and thermodynamics at the smallest scales, which has in turn helped to drive theoretical advances. This Roadmap provides an overview of the recent developments across many of the field's sub-disciplines, assessing the key challenges and future prospects, and providing a guide for its near-term progress.
Submitted 28 April, 2025;
originally announced April 2025.
-
A Reliable, Time-Predictable Heterogeneous SoC for AI-Enhanced Mixed-Criticality Edge Applications
Authors:
Angelo Garofalo,
Alessandro Ottaviano,
Matteo Perotti,
Thomas Benz,
Yvan Tortorella,
Robert Balas,
Michael Rogenmoser,
Chi Zhang,
Luca Bertaccini,
Nils Wistoff,
Maicol Ciani,
Cyril Koenig,
Mattia Sinigaglia,
Luca Valente,
Paul Scheffler,
Manuel Eggimann,
Matheus Cavalcante,
Francesco Restuccia,
Alessandro Biondi,
Francesco Conti,
Frank K. Gurkaynak,
Davide Rossi,
Luca Benini
Abstract:
Next-generation mixed-criticality Systems-on-chip (SoCs) for robotics, automotive, and space must execute mixed-criticality AI-enhanced sensor processing and control workloads, ensuring reliable and time-predictable execution of critical tasks sharing resources with non-critical tasks, while also fitting within a sub-2W power envelope. To tackle these multi-dimensional challenges, in this brief, we present a 16nm, reliable, time-predictable heterogeneous SoC with multiple programmable accelerators. Within a 1.2W power envelope, the SoC integrates software-configurable hardware IPs to ensure predictable access to shared resources, such as the on-chip interconnect and memory system, leading to tight upper bounds on execution times of critical applications. To accelerate mixed-precision mission-critical AI, the SoC integrates a reliable multi-core accelerator achieving 304.9 GOPS peak performance at 1.6 TOPS/W energy efficiency. Non-critical, compute-intensive, floating-point workloads are accelerated by a dual-core vector cluster, achieving 121.8 GFLOPS at 1.1 TFLOPS/W and 106.8 GFLOPS/mm2.
Submitted 26 February, 2025;
originally announced February 2025.
-
The Korteweg-de Vries Equation on general star graphs
Authors:
Márcio Cavalcante,
José Marques Neto
Abstract:
In this paper, we establish local well-posedness for the Cauchy problem associated with the Korteweg-de Vries (KdV) equation on a general metric star graph. The graph comprises m + k semi-infinite edges: k negative half-lines and m positive half-lines, all joined at a common vertex. The choice of boundary conditions is compatible with the conditions determined by the semigroup theory. The crucial point in this work is to obtain the integral formula using the forcing operator method and the Fourier restriction method of Bourgain. This work extends the results obtained by Cavalcante for the specific case of the Y junction to a more general class of star graphs.
Submitted 11 February, 2025;
originally announced February 2025.
-
Emergence of $X$ states in a quantum impurity model
Authors:
Moallison F. Cavalcante,
Marcus V. S. Bonança,
Eduardo Miranda,
Sebastian Deffner
Abstract:
In the present work, we demonstrate the emergence of $X$ states in the long-time response of a locally perturbed many-body quantum impurity model. The emergence of the double-qubit state is heralded by the lack of decay of the response function as well as the out-of-time-order correlator, signifying the trapping of excitations and hence information in edge modes. Surprisingly, after carrying out a quantum information theory characterization, we show that such states exhibit genuine quantum correlations.
Submitted 3 May, 2025; v1 submitted 23 January, 2025;
originally announced January 2025.
-
Occamy: A 432-Core Dual-Chiplet Dual-HBM2E 768-DP-GFLOP/s RISC-V System for 8-to-64-bit Dense and Sparse Computing in 12nm FinFET
Authors:
Paul Scheffler,
Thomas Benz,
Viviane Potocnik,
Tim Fischer,
Luca Colagrande,
Nils Wistoff,
Yichao Zhang,
Luca Bertaccini,
Gianmarco Ottavi,
Manuel Eggimann,
Matheus Cavalcante,
Gianna Paulin,
Frank K. Gürkaynak,
Davide Rossi,
Luca Benini
Abstract:
ML and HPC applications increasingly combine dense and sparse memory access computations to maximize storage efficiency. However, existing CPUs and GPUs struggle to flexibly handle these heterogeneous workloads with consistently high compute efficiency. We present Occamy, a 432-Core, 768-DP-GFLOP/s, dual-HBM2E, dual-chiplet RISC-V system with a latency-tolerant hierarchical interconnect and in-core streaming units (SUs) designed to accelerate dense and sparse FP8-to-FP64 ML and HPC workloads. We implement Occamy's compute chiplets in 12 nm FinFET, and its passive interposer, Hedwig, in a 65 nm node. On dense linear algebra (LA), Occamy achieves a competitive FPU utilization of 89%. On stencil codes, Occamy reaches an FPU utilization of 83% and a technology-node-normalized compute density of 11.1 DP-GFLOP/s/mm2, leading state-of-the-art (SoA) processors by 1.7x and 1.2x, respectively. On sparse-dense LA, it achieves 42% FPU utilization and a normalized compute density of 5.95 DP-GFLOP/s/mm2, surpassing the SoA by 5.2x and 11x, respectively. On sparse-sparse LA, Occamy reaches a throughput of up to 187 GCOMP/s at 17.4 GCOMP/s/W and a compute density of 3.63 GCOMP/s/mm2. Finally, we reach up to 75% and 54% FPU utilization on dense (LLM) and graph-sparse (GCN) ML inference workloads. Occamy's RTL is freely available under a permissive open-source license.
Submitted 13 January, 2025;
originally announced January 2025.
-
New method of image processing via statistical analysis for application in intelligent systems
Authors:
Monalisa Cavalcante,
José Araújo,
José Holanda
Abstract:
Image processing has always been a topic of significant importance to society. Recently, this field has gained considerable prominence due to the development of intelligent systems. In this work, we present a new method of image processing that utilizes statistical analysis, specifically designed for applications in intelligent systems. We tested our method on a large collection of images to assess its effectiveness.
Submitted 24 December, 2024;
originally announced December 2024.
-
Spatzformer: An Efficient Reconfigurable Dual-Core RISC-V V Cluster for Mixed Scalar-Vector Workloads
Authors:
Matteo Perotti,
Michele Raeber,
Mattia Sinigaglia,
Matheus Cavalcante,
Davide Rossi,
Luca Benini
Abstract:
Multi-core vector processor architectures excel in handling computationally intensive vectorizable tasks but struggle to achieve optimal resource utilization when facing sequential and control tasks that cannot be vectorized. This work presents Spatzformer, the first reconfigurable RISC-V V (RVV) architecture developed from a baseline open-source dual-core cluster based on Snitch scalar cores augmented with compact Spatz vector units. Spatzformer operates in two distinct modes: split mode, working as a dual-core vector architecture to handle vectorizable tasks concurrently, and merge mode, where two vector units are driven by a single scalar core, allowing the remaining scalar core to handle non-vectorizable control tasks. We implement Spatzformer in a 12-nm technology node and characterize the cost of the added architectural reconfigurability. We show that merge mode accelerates mixed scalar-vector kernels by up to 1.8x compared to split mode. Moreover, it accelerates the vector kernels that require fine-grained synchronization (such as FFT) by up to 20% with respect to the baseline. The reconfigurability features do not degrade the architecture's maximum frequency (1.2GHz, TT, 0.8V, 25C) and have a negligible area impact (+1.4%), with a worst-case energy efficiency drop of only 7% with respect to the non-reconfigurable baseline.
Submitted 7 July, 2024;
originally announced July 2024.
-
Occamy: A 432-Core 28.1 DP-GFLOP/s/W 83% FPU Utilization Dual-Chiplet, Dual-HBM2E RISC-V-based Accelerator for Stencil and Sparse Linear Algebra Computations with 8-to-64-bit Floating-Point Support in 12nm FinFET
Authors:
Gianna Paulin,
Paul Scheffler,
Thomas Benz,
Matheus Cavalcante,
Tim Fischer,
Manuel Eggimann,
Yichao Zhang,
Nils Wistoff,
Luca Bertaccini,
Luca Colagrande,
Gianmarco Ottavi,
Frank K. Gürkaynak,
Davide Rossi,
Luca Benini
Abstract:
We present Occamy, a 432-core RISC-V dual-chiplet 2.5D system for efficient sparse linear algebra and stencil computations on FP64 and narrow (32-, 16-, 8-bit) SIMD FP data. Occamy features 48 clusters of RISC-V cores with custom extensions, two 64-bit host cores, and a latency-tolerant multi-chiplet interconnect and memory system with 32 GiB of HBM2E. It achieves leading-edge utilization on stencils (83 %), sparse-dense (42 %), and sparse-sparse (49 %) matrix multiply.
Submitted 21 June, 2024;
originally announced June 2024.
-
Stability of extremal domains for the first eigenvalue of the Laplacian operator
Authors:
Marcos P. Cavalcante,
Ivaldo Nunes
Abstract:
In this paper, we compute the second variation of the first Dirichlet eigenvalue on extremal domains in general Riemannian manifolds and establish a criterion for stability. We classify the stable extremal domains in the 2-sphere and higher-dimensional spheres when the boundary is minimal. Additionally, we establish topological bounds for stable domains in a general compact Riemannian surface, assuming either nonnegative total Gaussian curvature or small volume.
Submitted 28 July, 2024; v1 submitted 19 June, 2024;
originally announced June 2024.
-
Index estimates for harmonic Gauss maps
Authors:
Alcides de Carvalho,
Marcos P. Cavalcante,
Wagner Costa-Filho,
Darlan de Oliveira
Abstract:
Let $Σ$ denote a closed surface with constant mean curvature in $\mathbb{G}^3$, a 3-dimensional Lie group equipped with a bi-invariant metric. For such surfaces, there is a harmonic Gauss map which takes values in the unit sphere within the Lie algebra of $\mathbb{G}$. We prove that the energy index of the Gauss map of $Σ$ is bounded below by its topological genus. We also obtain index estimates in the case of complete noncompact surfaces.
Submitted 14 June, 2024;
originally announced June 2024.
-
First Eigenvalue of Jacobi operator and Rigidity Results for Constant Mean Curvature Hypersurfaces
Authors:
Marcio Batista,
Marcos P. Cavalcante,
Luiz R. Melo
Abstract:
In this paper, we obtain geometric upper bounds for the first eigenvalue $λ_1^J$ of the Jacobi operator for both closed hypersurfaces and compact hypersurfaces with boundary having constant mean curvature (CMC). As an application, we derive new rigidity results for the area of CMC hypersurfaces under suitable conditions on $λ_1^J$ and the curvature of the ambient space. We also address the Jacobi-Steklov problem, proving geometric upper bounds for its first eigenvalue $σ_1^J$ and deriving rigidity results related to the length of the boundary. Additionally, we present some results in higher dimensions related to the Yamabe invariants.
Submitted 28 May, 2024;
originally announced May 2024.
-
TeraPool-SDR: An 1.89TOPS 1024 RV-Cores 4MiB Shared-L1 Cluster for Next-Generation Open-Source Software-Defined Radios
Authors:
Yichao Zhang,
Marco Bertuletti,
Samuel Riedel,
Matheus Cavalcante,
Alessandro Vanelli-Coralli,
Luca Benini
Abstract:
Radio Access Network (RAN) workloads are rapidly scaling up in data processing intensity and throughput as the 5G (and beyond) standards grow in number of antennas and sub-carriers. Offering flexible Processing Elements (PEs), efficient memory access, and a productive parallel programming model, many-core clusters are a well-matched architecture for next-generation software-defined RANs, but staggering performance requirements demand a high number of PEs coupled with extreme Power, Performance and Area (PPA) efficiency. We present the architecture, design, and full physical implementation of TeraPool-SDR, a cluster for Software Defined Radio (SDR) with 1024 latency-tolerant, compact RV32 PEs, sharing a global view of a 4MiB, 4096-banked, L1 memory. We report various feasible configurations of TeraPool-SDR featuring an ultra-high bandwidth PE-to-L1-memory interconnect, clocked at 730MHz, 880MHz, and 924MHz (TT/0.80 V/25 °C) in 12nm FinFET technology. The TeraPool-SDR cluster achieves high energy efficiency on all SDR key kernels for 5G RANs: Fast Fourier Transform (93GOPS/W), Matrix-Multiplication (125GOPS/W), Channel Estimation (96GOPS/W), and Linear System Inversion (61GOPS/W). For all the kernels, it consumes less than 10W, in compliance with industry standards.
Submitted 8 May, 2024;
originally announced May 2024.
-
Nano-welding of quantum spin-$1/2$ chains at minimal dissipation
Authors:
Moallison F. Cavalcante,
Marcus V. S. Bonança,
Eduardo Miranda,
Sebastian Deffner
Abstract:
We consider the optimal control of switching on a coupling term between two quantum many-body systems. Specifically, we (i) quantify the energetic cost of establishing a weak junction between two quantum spin-$1/2$ chains in finite time $τ$ and (ii) identify the energetically optimal protocol to realize it. For linear driving protocols, we find that for long times the excess (irreversible) work scales as $τ^{-η}$, where $η=1, 2$ or a nonuniversal number depending on the phase of the chains. Interestingly, increasing a $J_z$ anisotropy in the chains suppresses the excess work, thus promoting quasi-adiabaticity. The general optimal control problem is solved by employing a Chebyshev ansatz. We find that the optimal control protocol is intimately sensitive to the chain phases.
Submitted 23 January, 2025; v1 submitted 15 April, 2024;
originally announced April 2024.
-
Dynamics of the Korteweg-de Vries equation on a balanced metric graph
Authors:
Jaime Angulo Pava,
Márcio Cavalcante
Abstract:
In this work, we establish local well-posedness for the Korteweg-de Vries model on a balanced star graph with a structure represented by semi-infinite edges, by considering a boundary condition of $δ$-type at the unique graph-vertex. Also, we extend the linear instability result of Angulo and Cavalcante (2021) to one of nonlinear instability. For the proof of the local well-posedness theory, the principal new ingredient is the use of the special solutions by Faminskii in the context of half-lines. As far as we are aware, this approach is being used for the first time in the context of star graphs and can potentially be applied to other boundary classes. In the case of the nonlinear instability result, the principal ingredients are the known linearized instability result and the fact that the data-to-solution map determined by the local theory is at least of class $C^2$.
Submitted 2 February, 2024;
originally announced February 2024.
-
MX: Enhancing RISC-V's Vector ISA for Ultra-Low Overhead, Energy-Efficient Matrix Multiplication
Authors:
Matteo Perotti,
Yichao Zhang,
Matheus Cavalcante,
Enis Mustafa,
Luca Benini
Abstract:
Dense Matrix Multiplication (MatMul) is arguably one of the most ubiquitous compute-intensive kernels, spanning linear algebra, DSP, graphics, and machine learning applications. Thus, MatMul optimization is crucial not only in high-performance processors but also in embedded low-power platforms. Several Instruction Set Architectures (ISAs) have recently included matrix extensions to improve MatMul performance and efficiency at the cost of added matrix register files and units. In this paper, we propose Matrix eXtension (MX), a lightweight approach that builds upon the open-source RISC-V Vector (RVV) ISA to boost MatMul energy efficiency. Instead of adding expensive dedicated hardware, MX uses the pre-existing vector register file and functional units to create a hybrid vector/matrix engine at a negligible area cost (< 3%), which comes from a compact near-FPU tile buffer for higher data reuse, and no clock frequency overhead. We implement MX on a compact and highly energy-optimized RVV processor and evaluate it in both a Dual- and 64-Core cluster in a 12-nm technology node. MX boosts the Dual-Core's energy efficiency by 10% for a double-precision 64x64x64 matrix multiplication with the same FPU utilization (~97%) and by 25% on the 64-Core cluster for the same benchmark on 32-bit data, with a 56% performance gain.
Submitted 8 January, 2024;
originally announced January 2024.
-
Ara2: Exploring Single- and Multi-Core Vector Processing with an Efficient RVV 1.0 Compliant Open-Source Processor
Authors:
Matteo Perotti,
Matheus Cavalcante,
Renzo Andri,
Lukas Cavigelli,
Luca Benini
Abstract:
Vector processing is highly effective in boosting processor performance and efficiency for data-parallel workloads. In this paper, we present Ara2, the first fully open-source vector processor to support the RISC-V V 1.0 frozen ISA. We evaluate Ara2's performance on a diverse set of data-parallel kernels for various problem sizes and vector-unit configurations, achieving an average functional-unit utilization of 95% on the most computationally intensive kernels. We pinpoint performance boosters and bottlenecks, including the scalar core, memories, and vector architecture, providing insights into the main vector architecture's performance drivers. Leveraging the openness of the design, we implement Ara2 in a 22nm technology, characterize its PPA metrics on various configurations (2-16 lanes), and analyze its microarchitecture and implementation bottlenecks. Ara2 achieves a state-of-the-art energy efficiency of 37.8 DP-GFLOPS/W (0.8V) and 1.35GHz of clock frequency (critical path: ~40 FO4 gates). Finally, we explore the performance and energy-efficiency trade-offs of multi-core vector processors: we find that multiple vector cores help overcome the scalar core issue-rate bound that limits short-vector performance. For example, a cluster of eight 2-lane Ara2 (16 FPUs) achieves more than 3x better performance than a 16-lane single-core Ara2 (16 FPUs) when executing a 32x32x32 matrix multiplication, with 1.5x improved energy efficiency.
Submitted 17 June, 2024; v1 submitted 13 November, 2023;
originally announced November 2023.
-
Geometric properties of extremal domains for the $p$-Laplacian operator
Authors:
Francisco G. Carvalho,
Marcos P. Cavalcante
Abstract:
In this paper, we explore the geometric properties of unbounded extremal domains for the $p$-Laplacian operator in both Euclidean and hyperbolic spaces. Assuming that the nonlinearity grows at least as the nonlinearity of the eigenvalue problem, we prove that these domains exhibit remarkable geometric properties and cannot be arbitrarily wide. In two dimensions, we prove that such domains with connected complements must necessarily be balls. In the hyperbolic space, we highlight the constraints on extremal domains and the geometry of their asymptotic boundaries.
Submitted 12 November, 2023;
originally announced November 2023.
-
Spatz: Clustering Compact RISC-V-Based Vector Units to Maximize Computing Efficiency
Authors:
Matteo Perotti,
Samuel Riedel,
Matheus Cavalcante,
Luca Benini
Abstract:
The ever-increasing computational and storage requirements of modern applications and the slowdown of technology scaling pose major challenges to designing and implementing efficient computer architectures. To mitigate the bottlenecks of typical processor-based architectures on both the instruction and data sides of the memory, we present Spatz, a compact 64-bit floating-point-capable vector processor based on RISC-V's Vector Extension Zve64d. Using Spatz as the main Processing Element (PE), we design an open-source dual-core vector processor architecture based on a modular and scalable cluster sharing a Scratchpad Memory (SCM). Unlike typical vector processors, whose Vector Register Files (VRFs) are hundreds of KiB large, we prove that Spatz can achieve peak energy efficiency with a latch-based VRF of only 2 KiB. An implementation of the Spatz-based cluster in GlobalFoundries' 12LPP process with eight double-precision Floating Point Units (FPUs) achieves an FPU utilization just 3.4% lower than the ideal upper bound on a double-precision, floating-point matrix multiplication. The cluster reaches 7.7 FMA/cycle, corresponding to 15.7 DP-GFLOPS and 95.7 DP-GFLOPS/W at 1 GHz and nominal operating conditions (TT, 0.80V, 25C), with more than 55% of the power spent on the FPUs. Furthermore, the optimally-balanced Spatz-based cluster reaches a 95.0% FPU utilization (7.6 FMA/cycle), 15.2 DP-GFLOPS, and 99.3 DP-GFLOPS/W (61% of the power spent in the FPU) on a 2D workload with a 7x7 kernel, resulting in an outstanding area/energy efficiency of 171 DP-GFLOPS/W/mm2. At equi-area, the computing cluster built upon compact vector processors reaches a 30% higher energy efficiency than a cluster with the same FPU count built upon scalar cores specialized for stream-based floating-point computation.
Submitted 9 January, 2025; v1 submitted 18 September, 2023;
originally announced September 2023.
-
PATRONoC: Parallel AXI Transport Reducing Overhead for Networks-on-Chip targeting Multi-Accelerator DNN Platforms at the Edge
Authors:
Vikram Jain,
Matheus Cavalcante,
Nazareno Bruschi,
Michael Rogenmoser,
Thomas Benz,
Andreas Kurth,
Davide Rossi,
Luca Benini,
Marian Verhelst
Abstract:
Emerging deep neural network (DNN) applications require high-performance multi-core hardware acceleration with large data bursts. Classical networks-on-chip (NoCs) use serial packet-based protocols that suffer from significant protocol translation overheads towards the endpoints. This paper proposes PATRONoC, an open-source fully AXI-compliant NoC fabric to better address the specific needs of multi-core DNN computing platforms. Evaluation of PATRONoC in a 2D-mesh topology shows 34% higher area efficiency compared to a state-of-the-art classical NoC at 1 GHz. PATRONoC's throughput outperforms a baseline NoC by 2-8X on uniform random traffic and provides a high aggregated throughput of up to 350 GiB/s on synthetic and DNN workload traffic.
Submitted 31 July, 2023;
originally announced August 2023.
-
Raman Response of the Charge Density Wave in Cuprate Superconductors
Authors:
Moallison F. Cavalcante,
S. Bag,
I. Paul,
A. Sacuto,
M. C. O. Aguiar,
M. Civelli
Abstract:
We study the Raman response, for $B_{1g}$ and $B_{2g}$ light-polarization symmetries, of the charge density wave phase appearing in the underdoped region of cuprate superconductors. We show that the $B_{2g}$ response provides a distinctive signature of the charge order, independently of the details of the electronic structure and of the concomitant presence of a pseudogap, in sharp contrast with the behavior of the $B_{1g}$ response. This accounts well for the Raman experimental results. We then clearly identify a charge density wave energy scale, and show that its doping dependence is eventually driven by the monotonic behavior of the pseudogap. This has also been pointed out in Raman experiments, and it is suggestive of a pseudogap ruling the multiple energy scales of the exotic phases appearing in the cuprate phase diagram.
Submitted 10 October, 2023; v1 submitted 19 May, 2023;
originally announced May 2023.
-
FlooNoC: A Multi-Tbps Wide NoC for Heterogeneous AXI4 Traffic
Authors:
Tim Fischer,
Michael Rogenmoser,
Matheus Cavalcante,
Frank K. Gürkaynak,
Luca Benini
Abstract:
Meeting the staggering bandwidth requirements of today's applications challenges the traditional narrow and serialized NoCs, which hit hard bounds on the maximum operating frequency. This paper proposes FlooNoC, an open-source, low-latency, fully AXI4-compatible NoC with wide physical channels for latency-tolerant high-bandwidth non-blocking transactions and decoupled latency-critical short messages. We demonstrate the feasibility of wide channels by integrating a 5x5 router and links within a 9-core compute cluster in 12 nm FinFET technology. Our NoC achieves a bandwidth of 629 Gbps per link while running at only 1.23 GHz (at 0.19 pJ/B/hop), with just 10% area overhead post layout.
Submitted 6 August, 2023; v1 submitted 15 May, 2023;
originally announced May 2023.
-
MemPool: A Scalable Manycore Architecture with a Low-Latency Shared L1 Memory
Authors:
Samuel Riedel,
Matheus Cavalcante,
Renzo Andri,
Luca Benini
Abstract:
Shared L1 memory clusters are a common architectural pattern (e.g., in GPGPUs) for building efficient and flexible multi-processing-element (PE) engines. However, it is a common belief that these tightly-coupled clusters would not scale beyond a few tens of PEs. In this work, we tackle scaling shared L1 clusters to hundreds of PEs while supporting a flexible and productive programming model and maintaining high efficiency. We present MemPool, a manycore system with 256 RV32IMAXpulpimg "Snitch" cores featuring application-tunable functional units. We designed and implemented an efficient low-latency PE to L1-memory interconnect, an optimized instruction path to ensure each PE's independent execution, and a powerful DMA engine and system interconnect to stream data in and out. MemPool is easy to program, with all the cores sharing a global view of a large, multi-banked, L1 scratchpad memory, accessible within at most five cycles in the absence of conflicts. We provide multiple runtimes to program MemPool at different abstraction levels and illustrate its versatility with a wide set of applications. MemPool runs at 600 MHz (60 gate delays) in typical conditions (TT/0.80 V/25 °C) in 22 nm FDX technology and achieves a performance of up to 229 GOPS or 180 GOPS/W with less than 2% of execution stalls.
Submitted 28 November, 2023; v1 submitted 30 March, 2023;
originally announced March 2023.
-
Quark: An Integer RISC-V Vector Processor for Sub-Byte Quantized DNN Inference
Authors:
MohammadHossein AskariHemmat,
Theo Dupuis,
Yoan Fournier,
Nizar El Zarif,
Matheus Cavalcante,
Matteo Perotti,
Frank Gurkaynak,
Luca Benini,
Francois Leduc-Primeau,
Yvon Savaria,
Jean-Pierre David
Abstract:
In this paper, we present Quark, an integer RISC-V vector processor specifically tailored for sub-byte DNN inference. Quark is implemented in GlobalFoundries' 22FDX FD-SOI technology. It is designed on top of Ara, an open-source 64-bit RISC-V vector processor. To accommodate sub-byte DNN inference, Quark extends Ara by adding specialized vector instructions to perform sub-byte quantized operations. We also remove the floating-point unit from Quark's lanes and use the CVA6 RISC-V scalar core for the re-scaling operations that are required in quantized neural network inference. This makes each lane of Quark 2 times smaller and 1.9 times more power efficient compared to those of Ara. In this paper, we show that Quark can run quantized models at sub-byte precision. Notably, we show that for 1-bit and 2-bit quantized models, Quark can accelerate computation of Conv2d over various ranges of inputs and kernel sizes.
Submitted 12 February, 2023;
originally announced February 2023.
-
On the fundamental tone of the $p$-Laplacian on Riemannian manifolds and applications
Authors:
Francisco G. de S. Carvalho,
Marcos Petrucio Cavalcante
Abstract:
We present a general lower bound for the fundamental tone of the $p$-Laplacian on Riemannian manifolds carrying a special kind of function. We then apply our result to negatively curved simply connected manifolds, a class of warped product manifolds, and a class of Riemannian submersions.
Submitted 28 January, 2023;
originally announced January 2023.
-
Index bounds for closed minimal surfaces in 3-manifolds with the Killing property
Authors:
Marcos P. Cavalcante,
Darlan F. de Oliveira,
Robson dos S. Silva
Abstract:
Let $Σ$ be a closed minimal surface immersed in a Riemannian 3-manifold carrying an orthonormal Killing frame. This class of ambient spaces includes Lie groups with a bi-invariant metric. In this paper, we prove that the sum of the Morse index and the nullity of $Σ$ is bounded from below by a constant times its genus.
Submitted 28 January, 2023;
originally announced January 2023.
-
HexaMesh: Scaling to Hundreds of Chiplets with an Optimized Chiplet Arrangement
Authors:
Patrick Iff,
Maciej Besta,
Matheus Cavalcante,
Tim Fischer,
Luca Benini,
Torsten Hoefler
Abstract:
2.5D integration is an important technique to tackle the growing cost of manufacturing chips in advanced technology nodes. This poses the challenge of providing high-performance inter-chiplet interconnects (ICIs). As the number of chiplets grows to tens or hundreds, it becomes infeasible to hand-optimize their arrangement in a way that maximizes the ICI performance. In this paper, we propose HexaMesh, an arrangement of chiplets that outperforms a grid arrangement both in theory (network diameter reduced by 42%; bisection bandwidth improved by 130%) and in practice (latency reduced by 19%; throughput improved by 34%). HexaMesh enables large-scale chiplet designs with high-performance ICIs.
Submitted 8 October, 2023; v1 submitted 25 November, 2022;
originally announced November 2022.
-
Sparse Hamming Graph: A Customizable Network-on-Chip Topology
Authors:
Patrick Iff,
Maciej Besta,
Matheus Cavalcante,
Tim Fischer,
Luca Benini,
Torsten Hoefler
Abstract:
Chips with hundreds to thousands of cores require scalable networks-on-chip (NoCs). Customization of the NoC topology is necessary to reach the diverse design goals of different chips. We introduce the sparse Hamming graph, a novel NoC topology with an adjustable cost-performance trade-off that is based on four NoC topology design principles we identified. To efficiently customize this topology, we develop a toolchain that leverages approximate floorplanning and link routing to deliver fast and accurate cost and performance predictions. We demonstrate how to use our methodology to achieve desired cost-performance trade-offs while outperforming established topologies in cost, performance, or both.
Submitted 28 June, 2023; v1 submitted 25 November, 2022;
originally announced November 2022.
-
Quench dynamics of the Kondo effect: transport across an impurity coupled to interacting wires
Authors:
Moallison F. Cavalcante,
Rodrigo G. Pereira,
Maria C. O. Aguiar
Abstract:
We study the real-time dynamics of the Kondo effect after a quantum quench in which a magnetic impurity is coupled to two metallic Hubbard chains. Using an effective field theory approach, we find that for noninteracting electrons the charge current across the impurity is given by a scaling function that involves the Kondo time. In the interacting case, we show that the Kondo time decreases with the strength of the repulsive interaction and the time dependence of the current reveals signatures of the Kondo effect in a Luttinger liquid. In addition, we verify that the relaxation of the impurity magnetization does not exhibit universal scaling behavior in the perturbative regime below the Kondo time. Our results highlight the role of nonequilibrium dynamics as a valuable tool in the study of quantum impurities in interacting systems.
Submitted 7 February, 2023; v1 submitted 4 November, 2022;
originally announced November 2022.
-
A "New Ara" for Vector Computing: An Open Source Highly Efficient RISC-V V 1.0 Vector Processor Design
Authors:
Matteo Perotti,
Matheus Cavalcante,
Nils Wistoff,
Renzo Andri,
Lukas Cavigelli,
Luca Benini
Abstract:
Vector architectures are gaining traction for highly efficient processing of data-parallel workloads, driven by all major ISAs (RISC-V, Arm, Intel), and boosted by landmark chips, like the Arm SVE-based Fujitsu A64FX, powering the TOP500 leader Fugaku. The RISC-V V extension has recently reached 1.0-Frozen status. Here, we present its first open-source implementation, discuss the new specification's impact on the micro-architecture of a lane-based design, and provide insights on performance-oriented design of coupled scalar-vector processors. Our system achieves comparable/better PPA than state-of-the-art vector engines that implement older RVV versions: 15% better area, 6% improved throughput, and FPU utilization >98.5% on crucial kernels.
Submitted 9 January, 2025; v1 submitted 17 October, 2022;
originally announced October 2022.
-
Soft Tiles: Capturing Physical Implementation Flexibility for Tightly-Coupled Parallel Processing Clusters
Authors:
Gianna Paulin,
Matheus Cavalcante,
Paul Scheffler,
Luca Bertaccini,
Yichao Zhang,
Frank Gürkaynak,
Luca Benini
Abstract:
Modern high-performance computing architectures (Multicore, GPU, Manycore) are based on tightly-coupled clusters of processing elements, physically implemented as rectangular tiles. Their size and aspect ratio strongly impact the achievable operating frequency and energy efficiency, but they should be as flexible as possible to achieve a high utilization for the top-level die floorplan. In this paper, we explore the flexibility range for a high-performance cluster of RISC-V cores with shared L1 memory used to build scalable accelerators, with the goal of establishing a hierarchical implementation methodology where clusters can be modeled as soft tiles to achieve optimal die utilization.
Submitted 2 September, 2022;
originally announced September 2022.
-
Spatz: A Compact Vector Processing Unit for High-Performance and Energy-Efficient Shared-L1 Clusters
Authors:
Matheus Cavalcante,
Domenic Wüthrich,
Matteo Perotti,
Samuel Riedel,
Luca Benini
Abstract:
While parallel architectures based on clusters of Processing Elements (PEs) sharing L1 memory are widespread, there is no consensus on how lean their PE should be. Architecting PEs as vector processors holds the promise to greatly reduce their instruction fetch bandwidth, mitigating the Von Neumann Bottleneck (VNB). However, due to their historical association with supercomputers, classical vector machines include micro-architectural tricks to improve the Instruction Level Parallelism (ILP), which increases their instruction fetch and decode energy overhead. In this paper, we explore for the first time vector processing as an option to build small and efficient PEs for large-scale shared-L1 clusters. We propose Spatz, a compact, modular 32-bit vector processing unit based on the integer embedded subset of the RISC-V Vector Extension version 1.0. A Spatz-based cluster with four Multiply-Accumulate Units (MACUs) needs only 7.9 pJ per 32-bit integer multiply-accumulate operation, 40% less energy than an equivalent cluster built with four Snitch scalar cores. We analyzed Spatz' performance by integrating it within MemPool, a large-scale many-core shared-L1 cluster. The Spatz-based MemPool system achieves up to 285 GOPS when running a 256x256 32-bit integer matrix multiplication, 70% more than the equivalent Snitch-based MemPool system. In terms of energy efficiency, the Spatz-based MemPool system achieves up to 266 GOPS/W when running the same kernel, more than twice the energy efficiency of the Snitch-based MemPool system, which reaches 128 GOPS/W. Those results show the viability of lean vector processors as high-performance and energy-efficient PEs for large-scale clusters with tightly-coupled L1 memory.
Submitted 16 July, 2022;
originally announced July 2022.
-
Stability of mKdV breathers on the half-line
Authors:
Miguel A. Alejo,
Márcio Cavalcante,
Adán J. Corcho
Abstract:
In this paper we study the stability problem for mKdV breathers on the left half-line. We are able to show that leftwards moving breathers, initially located far away from the origin, are strongly stable for the problem posed on the left half-line, when assuming homogeneous boundary conditions. The proof involves a Lyapunov functional which is almost conserved by the mKdV flow once we control some boundary terms which naturally arise.
Submitted 6 June, 2022;
originally announced June 2022.
-
MemPool-3D: Boosting Performance and Efficiency of Shared-L1 Memory Many-Core Clusters with 3D Integration
Authors:
Matheus Cavalcante,
Anthony Agnesina,
Samuel Riedel,
Moritz Brunion,
Alberto Garcia-Ortiz,
Dragomir Milojevic,
Francky Catthoor,
Sung Kyu Lim,
Luca Benini
Abstract:
Three-dimensional integrated circuits promise power, performance, and footprint gains compared to their 2D counterparts, thanks to drastic reductions in the interconnects' length through their smaller form factor. We can leverage the potential of 3D integration by enhancing MemPool, an open-source many-core design with 256 cores and a shared pool of L1 scratchpad memory connected with a low-latency interconnect. MemPool's baseline 2D design is severely limited by routing congestion and wire propagation delay, making the design ideal for 3D integration. In architectural terms, we increase MemPool's scratchpad memory capacity beyond the sweet spot for 2D designs, improving performance in a common digital signal processing kernel. We propose a 3D MemPool design that leverages a smart partitioning of the memory resources across two layers to balance the size and utilization of the stacked dies. In this paper, we explore the architectural and the technology parameter spaces by analyzing the power, performance, area, and energy efficiency of MemPool instances in 2D and 3D with 1 MiB, 2 MiB, 4 MiB, and 8 MiB of scratchpad memory in a commercial 28 nm technology node. We observe a performance gain of 9.1% when running a matrix multiplication on the MemPool-3D design with 4 MiB of scratchpad memory compared to the MemPool 2D counterpart. In terms of energy efficiency, we can implement the MemPool-3D instance with 4 MiB of L1 memory on an energy budget 15% smaller than its 2D counterpart, and even 3.7% smaller than the MemPool-2D instance with one-fourth of the L1 scratchpad memory capacity.
Submitted 2 December, 2021;
originally announced December 2021.
-
Controllability for Schrödinger type system with mixed dispersion on compact star graphs
Authors:
Roberto de A. Capistrano-Filho,
Márcio Cavalcante,
Fernando Gallego
Abstract:
In this work we are concerned with solutions to the linear Schrödinger type system with mixed dispersion, the so-called biharmonic Schrödinger equation. More precisely, we prove an exact control property for these solutions with the control in the energy space posed on an oriented star graph structure $\mathcal{G}$ for $T>T_{min}$, with $$T_{min}=\sqrt{ \frac{ \overline{L} (L^2+\pi^2)}{\pi^2\varepsilon(1- \overline{L} \varepsilon)}},$$ when the couplings and the controls appear only on the Neumann boundary conditions.
Submitted 28 September, 2021;
originally announced September 2021.
-
The nonlinear Quadratic Interactions of the Schrödinger type on the half-line
Authors:
Isnaldo Isaac Barbosa,
Márcio Cavalcante
Abstract:
In this work we study the initial boundary value problem associated with the coupled Schrödinger equations with quadratic nonlinearities, which appear in nonlinear optics, on the half-line. We obtain local well-posedness for data in Sobolev spaces with low regularity, by using a forcing problem on the full line, with a forcing term present, in order to apply the Fourier restriction method of Bourgain. The crucial point in this work is the new bilinear estimates on the classical Bourgain spaces $X^{s,b}$ with $b<\frac12$, jointly with bilinear estimates in adapted Bourgain spaces that will be used to treat the traces of the nonlinear part of the solution. Here the understanding of the dispersion relation is the key point in these estimates, where the set of regularity depends strongly on the constant $a$, which measures the scaling-diffraction magnitude indices.
Submitted 11 April, 2021;
originally announced April 2021.
-
Quench dynamics and relaxation of a spin coupled to interacting leads
Authors:
Helena Bragança,
Moallison F. Cavalcante,
R. G. Pereira,
Maria C. O. Aguiar
Abstract:
We study a quantum quench in which a magnetic impurity is suddenly coupled to Hubbard chains, whose low-energy physics is described by Tomonaga-Luttinger liquid theory. Using the time-dependent density-matrix renormalization-group (tDMRG) technique, we analyze the propagation of charge, spin and entanglement in the chains after the quench and relate the light-cone velocities to the dispersion of holons and spinons. We find that the local magnetization at the impurity site decays faster if we increase the interaction in the chains, even though the spin velocity decreases. We derive an analytical expression for the relaxation of the impurity magnetization which is in good agreement with the tDMRG results at intermediate timescales, providing valuable insight into the time evolution of the Kondo screening cloud in interacting systems.
Submitted 25 March, 2021; v1 submitted 22 January, 2021;
originally announced January 2021.
-
MemPool: A Shared-L1 Memory Many-Core Cluster with a Low-Latency Interconnect
Authors:
Matheus Cavalcante,
Samuel Riedel,
Antonio Pullini,
Luca Benini
Abstract:
A key challenge in scaling shared-L1 multi-core clusters towards many-core (more than 16 cores) configurations is to ensure low-latency and efficient access to the L1 memory. In this work we demonstrate that it is possible to scale up the shared-L1 architecture: We present MemPool, a 32-bit many-core system with 256 fast RV32IMA "Snitch" cores featuring application-tunable execution units, running at 700 MHz in typical conditions (TT/0.80 V/25 °C). MemPool is easy to program, with all the cores sharing a global view of a large L1 scratchpad memory pool, accessible within at most 5 cycles. In MemPool's physical-aware design, we emphasized the exploration, design, and optimization of the low-latency processor-to-L1-memory interconnect. We compare three candidate topologies, analyzing them in terms of latency, throughput, and back-end feasibility. The chosen topology keeps the average latency at fewer than 6 cycles, even for a heavy injected load of 0.33 request/core/cycle. We also propose a lightweight addressing scheme that maps each core's private data to a memory bank accessible within one cycle, which leads to performance gains of up to 20% in real-world signal processing benchmarks. The addressing scheme is also highly efficient in terms of energy consumption since requests to local banks consume only half of the energy required to access remote banks. Our design achieves competitive performance with respect to an ideal, non-implementable full-crossbar baseline.
Submitted 5 December, 2020;
originally announced December 2020.
-
An Open-Source Platform for High-Performance Non-Coherent On-Chip Communication
Authors:
Andreas Kurth,
Wolfgang Rönninger,
Thomas Benz,
Matheus Cavalcante,
Fabian Schuiki,
Florian Zaruba,
Luca Benini
Abstract:
On-chip communication infrastructure is a central component of modern systems-on-chip (SoCs), and it continues to gain importance as the number of cores, the heterogeneity of components, and the on-chip and off-chip bandwidth continue to grow. Decades of research on on-chip networks enabled cache-coherent shared-memory multiprocessors. However, communication fabrics that meet the needs of heterogeneous many-cores and accelerator-rich SoCs, which are not, or only partially, coherent, are a much less mature research area.
In this work, we present a modular, topology-agnostic, high-performance on-chip communication platform. The platform includes components to build and link subnetworks with customizable bandwidth and concurrency properties and adheres to a state-of-the-art, industry-standard protocol. We discuss microarchitectural trade-offs and timing/area characteristics of our modules and show that they can be composed to build high-bandwidth (e.g., 2.5 GHz and 1024 bit data width) end-to-end on-chip communication fabrics (not only network switches but also DMA engines and memory controllers) with high degrees of concurrency. We design and implement a state-of-the-art ML training accelerator, where our communication fabric scales to 1024 cores on a die, providing 32 TB/s cross-sectional bandwidth at only 24 ns round-trip latency between any two cores.
Submitted 11 November, 2021; v1 submitted 11 September, 2020;
originally announced September 2020.
-
Linear instability of stationary solutions for the Korteweg-de Vries equation on a star graph
Authors:
Jaime Angulo Pava,
Márcio Cavalcante
Abstract:
The aim of this work is to establish a linear instability criterion for stationary solutions of the Korteweg-de Vries model on a star graph with a structure represented by a finite collection of semi-infinite edges. By considering a boundary condition of $δ$-type interaction at the graph-vertex, we show that the continuous tail and bump profiles are linearly unstable in a balanced star graph. The use of the analytic perturbation theory of operators and the extension theory of symmetric operators is a fundamental piece in our stability analysis.
The arguments presented in this investigation have prospects for the study of the instability of stationary wave solutions of other nonlinear evolution equations on star graphs.
Submitted 22 June, 2020;
originally announced June 2020.
-
The cubic nonlinear fractional Schrödinger equation on the half-line
Authors:
Márcio Cavalcante,
Gerardo Huaroto
Abstract:
We study the cubic nonlinear fractional Schrödinger equation with Lévy indices $\frac{4}{3}<α< 2$ posed on the half-line. More precisely, we define the notion of a solution for this model and we obtain a local well-posedness result that is almost sharp with respect to known results on the full real line $\mathbb R$. Also, we prove for the same model that the solution of the nonlinear part is smoother than the initial data. To get our results we use the Colliander and Kenig approach based on the Riemann--Liouville fractional operator combined with the Fourier restriction method of Bourgain \cite{Bourgain3} and some ideas of the recent work of Erdogan, Gurel and Tzirakis \cite{tzirakis2}. The method applies to both focusing and defocusing nonlinearities. As a consequence of our analysis we prove a smoothing effect for the cubic nonlinear fractional Schrödinger equation posed on the full line $\mathbb R$ in the case of the low regularity assumption, which was pointed out in the recent work \cite{tzirakis2}.
Submitted 19 November, 2019;
originally announced November 2019.
-
Forcing operators on star graphs applied for the cubic fourth order Schrödinger equation
Authors:
Roberto de A. Capistrano Filho,
Márcio Cavalcante,
Fernando A. Gallego
Abstract:
In a recent article \textit{"Lower regularity solutions of the biharmonic Schrödinger equation in a quarter plane", to appear in Pacific Journal of Mathematics [15]}, the authors gave a starting point for the study of a series of problems concerning the initial boundary value problem and control theory of the biharmonic NLS in some non-standard domains. In this direction, this article presents answers to some questions left open in [15] concerning the study of the cubic fourth order Schrödinger equation in a star graph structure $\mathcal{G}$. Precisely, consider $\mathcal{G}$ composed of $N$ edges parameterized by half-lines $(0,+\infty)$ attached to a common vertex $ν$. With this structure the manuscript proposes to study the well-posedness of a dispersive model on star graphs with three appropriate vertex conditions by using the \textit{boundary forcing operator approach}. More precisely, we give a positive answer for the Cauchy problem in low regularity Sobolev spaces. We have noted that this approach seems very efficient, since it allows the use of tools of Harmonic Analysis, for instance the Fourier restriction method introduced by Bourgain, whereas with the other known standard methods for solving partial differential equations on star graphs it is more complicated to capture the dispersive smoothing effect in low regularity. The arguments presented in this work have prospects to be applied to other nonlinear dispersive equations in the context of star graphs with unbounded edges.
Submitted 10 August, 2020; v1 submitted 15 September, 2019;
originally announced September 2019.
-
Gap phenomena for constant mean curvature surfaces
Authors:
Ezequiel Barbosa,
Marcos P. Cavalcante,
Edno Pereira
Abstract:
In this paper, we prove gap results for constant mean curvature (CMC) surfaces. Firstly, we find a natural inequality for CMC surfaces which implies convexity of the distance function. We then show that if $Σ$ is a complete, properly embedded CMC surface in the Euclidean space satisfying this inequality, then $Σ$ is either a sphere or a right circular cylinder. Next, we show that if $Σ$ is a free boundary CMC surface in the Euclidean 3-ball satisfying the same inequality, then $Σ$ is either a totally umbilical disk or an annulus of revolution. These results complete the picture about gap theorems for CMC surfaces in the Euclidean 3-space. We also prove similar results in the hyperbolic space and in the upper hemisphere, and in higher dimensions.
Submitted 28 January, 2023; v1 submitted 26 August, 2019;
originally announced August 2019.
-
Ara: A 1 GHz+ Scalable and Energy-Efficient RISC-V Vector Processor with Multi-Precision Floating Point Support in 22 nm FD-SOI
Authors:
Matheus Cavalcante,
Fabian Schuiki,
Florian Zaruba,
Michael Schaffner,
Luca Benini
Abstract:
In this paper, we present Ara, a 64-bit vector processor based on the version 0.5 draft of RISC-V's vector extension, implemented in GlobalFoundries 22FDX FD-SOI technology. Ara's microarchitecture is scalable, as it is composed of a set of identical lanes, each containing part of the processor's vector register file and functional units. It achieves up to 97% FPU utilization when running a 256 x 256 double precision matrix multiplication on sixteen lanes. Ara runs at more than 1 GHz in the typical corner (TT/0.80V/25 °C), achieving a performance of up to 33 DP-GFLOPS. In terms of energy efficiency, Ara achieves up to 41 DP-GFLOPS/W under the same conditions, which is slightly superior to similar vector processors found in the literature. An analysis of several vectorizable linear algebra computation kernels for a range of different matrix and vector sizes gives insight into performance limitations and bottlenecks for vector processors and outlines directions to maintain high energy efficiency even for small matrix sizes where the vector architecture achieves suboptimal utilization of the available FPUs.
Submitted 27 October, 2019; v1 submitted 2 June, 2019;
originally announced June 2019.
-
Lower regularity solutions of the biharmonic Schrödinger equation in a quarter plane
Authors:
Roberto A. Capistrano-Filho,
Márcio Cavalcante,
Fernando A. Gallego
Abstract:
This paper deals with the initial-boundary value problem of the biharmonic cubic nonlinear Schrödinger equation in a quarter plane with inhomogeneous Dirichlet-Neumann boundary data. We prove local well-posedness in low regularity Sobolev spaces by introducing a Duhamel boundary forcing operator associated to the linear equation to construct solutions on the whole line. With this in hand, the energy and nonlinear estimates allow us to apply the Fourier restriction method, introduced by J. Bourgain, to get the main result of the article. Additionally, adaptations of this approach for the biharmonic cubic nonlinear Schrödinger equation on star graphs are also discussed.
Submitted 10 August, 2020; v1 submitted 23 December, 2018;
originally announced December 2018.
-
The halfspace theorem for minimal hypersurfaces in regions bounded by minimal cones
Authors:
Marcos Petrúcio Cavalcante,
Wagner Oliveira Costa-Filho
Abstract:
We prove that there are no minimal hypersurfaces properly immersed in any region of the Euclidean space bounded by unstable minimal cones. We also prove the analogous result for $r$-minimal hypersurfaces.
Submitted 8 October, 2018;
originally announced October 2018.
-
Well-posedness and long time behavior for the Schrödinger-Korteweg-de Vries interactions on the half-Line
Authors:
Márcio Cavalcante,
Adán Corcho
Abstract:
The initial-boundary value problem for the Schrödinger-Korteweg-de Vries system is considered on the left and right half-line for a wide class of initial-boundary data, including the energy regularity $H^1(\R^{\pm})\times H^1(\R^{\pm})$ for initial data. Assuming homogeneous boundary conditions, it is shown for positive coupling interactions that local solutions can be extended globally in time for initial data in the energy space. Furthermore, for negative coupling interactions the following result is proved for a certain class of regular initial data: if the respective solution does not exhibit finite time blow-up in $H^1(\R^-)\times H^1(\R^-)$, then the norm of the weighted space $L^2\big(\R^-,\, |x|dx\big)\times L^2\big(\R^-,\, |x|dx\big)$ blows up at infinite time with \textit{super-linear rate}. This is obtained by using a satisfactory algebraic manipulation of a new global virial type identity associated with the system.
Submitted 3 October, 2018;
originally announced October 2018.
-
Local well-posedness of the fifth-order KdV-type equations on the half-line
Authors:
Márcio Cavalcante,
Chulkwang Kwak
Abstract:
This paper is a continuation of the authors' previous work \cite{CK2018-1}. We extend the argument of \cite{CK2018-1} to fifth-order KdV-type equations with different nonlinearities, specifically, where the scaling argument does not hold. We establish the $X^{s,b}$ nonlinear estimates for $b < \frac12$, which are almost optimal compared to the standard $X^{s,b}$ nonlinear estimates for $b > \frac12$ \cite{CGL2010, JH2009}. As an immediate conclusion, we prove the local well-posedness of the initial-boundary value problem (IBVP) for fifth-order KdV-type equations on the right half-line and the left half-line.
Submitted 2 January, 2019; v1 submitted 16 August, 2018;
originally announced August 2018.
-
Vanishing theorems for the cohomology groups of free boundary hypersurfaces
Authors:
Marcos P. Cavalcante,
Abraão Mendes,
Feliciano Vitório
Abstract:
In this paper, we prove that there exists a universal constant $C$, depending only on positive integers $n\geq 3$ and $p\leq n-1$, such that if $M^n$ is a compact free boundary submanifold of dimension $n$ immersed in the Euclidean unit ball $\mathbb{B}^{n+k}$ whose size of the traceless second fundamental form is less than $C$, then the $p$th cohomology group of $M^n$ vanishes. Also, employing a different technique, we obtain a rigidity result for compact free boundary surfaces minimally immersed in the unit ball $\mathbb{B}^{2+k}$.
Submitted 22 November, 2018; v1 submitted 18 July, 2018;
originally announced July 2018.