Search | arXiv e-print repository

All-in-Memory Stochastic Computing using ReRAM

Authors: João Paulo C. de Lima, Mehran Shoushtari Moghadam, Sercan Aygun, Jeronimo Castrillon, M. Hassan Najafi, Asif Ali Khan

Abstract: As the demand for efficient, low-power computing in embedded and edge devices grows, traditional computing methods are becoming less effective for handling complex tasks. Stochastic computing (SC) offers a promising alternative by approximating complex arithmetic operations, such as addition and multiplication, using simple bitwise operations, like majority or AND, on random bit-streams. While SC… ▽ More As the demand for efficient, low-power computing in embedded and edge devices grows, traditional computing methods are becoming less effective for handling complex tasks. Stochastic computing (SC) offers a promising alternative by approximating complex arithmetic operations, such as addition and multiplication, using simple bitwise operations, like majority or AND, on random bit-streams. While SC operations are inherently fault-tolerant, their accuracy largely depends on the length and quality of the stochastic bit-streams (SBS). These bit-streams are typically generated by CMOS-based stochastic bit-stream generators that consume over 80% of the SC system's power and area. Current SC solutions focus on optimizing the logic gates but often neglect the high cost of moving the bit-streams between memory and processor. This work leverages the physics of emerging ReRAM devices to implement the entire SC flow in place: (1) generating low-cost true random numbers and SBSs, (2) conducting SC operations, and (3) converting SBSs back to binary. Considering the low reliability of ReRAM cells, we demonstrate how SC's robustness to errors copes with ReRAM's variability. Our evaluation shows significant improvements in throughput (1.39x, 2.16x) and energy consumption (1.15x, 2.8x) over state-of-the-art (CMOS- and ReRAM-based) solutions, respectively, with an average image quality drop of 5% across multiple SBS lengths and image processing tasks. △ Less

Submitted 11 April, 2025; originally announced April 2025.

Comments: 7 pages, 5 figures, To appear in DAC 2025

arXiv:2502.10167 [pdf, other]

Modeling and Simulating Emerging Memory Technologies: A Tutorial

Authors: Yun-Chih Chen, Tristan Seidl, Nils Hölscher, Christian Hakert, Minh Duy Truong, Jian-Jia Chen, João Paulo C. de Lima, Asif Ali Khan, Jeronimo Castrillon, Ali Nezhadi, Lokesh Siddhu, Hassan Nassar, Mahta Mayahinia, Mehdi Baradaran Tahoori, Jörg Henkel, Nils Wilbert, Stefan Wildermann, Jürgen Teich

Abstract: Non-volatile Memory (NVM) technologies present a promising alternative to traditional volatile memories such as SRAM and DRAM. Due to the limited availability of real NVM devices, simulators play a crucial role in architectural exploration and hardware-software co-design. This tutorial presents a simulation toolchain through four detailed case studies, showcasing its applicability to various domai… ▽ More Non-volatile Memory (NVM) technologies present a promising alternative to traditional volatile memories such as SRAM and DRAM. Due to the limited availability of real NVM devices, simulators play a crucial role in architectural exploration and hardware-software co-design. This tutorial presents a simulation toolchain through four detailed case studies, showcasing its applicability to various domains of system design, including hybrid main-memory and cache, compute-in-memory, and wear-leveling design. These case studies provide the reader with practical insights on customizing the toolchain for their specific research needs. The source code is open-sourced. △ Less

Submitted 10 March, 2025; v1 submitted 14 February, 2025; originally announced February 2025.

Comments: DFG Priority Program 2377 - Disruptive Memory Technologies

arXiv:2409.10136 [pdf, other]

Count2Multiply: Reliable In-Memory High-Radix Counting

Authors: João Paulo Cardoso de Lima, Benjamin Franklin Morris III, Asif Ali Khan, Jeronimo Castrillon, Alex K. Jones

Abstract: Computing-in-memory (CIM) has been demonstrated across various memory technologies, ranging from memristive crossbars performing analog dot-product computations to large-scale digital bitwise operations in commodity DRAM and other proposed non-volative memory technologies. However, current CIM solutions face latency and reliability challenges. CIM fidelity lags considerably behind access fidelity.… ▽ More Computing-in-memory (CIM) has been demonstrated across various memory technologies, ranging from memristive crossbars performing analog dot-product computations to large-scale digital bitwise operations in commodity DRAM and other proposed non-volative memory technologies. However, current CIM solutions face latency and reliability challenges. CIM fidelity lags considerably behind access fidelity. Furthermore, bulk-bitwise CIM, although highly parallelized, requires long latency for operations like multiplication and addition, due to their bit-serial computation. This paper presents Count2Multiply, a technology-agnostic digital CIM approach to perform multiplication, addition and other operations using high-radix, massively parallel counting enabled by CIM bulk-bitwise logic operations. Designed to meet fault tolerance requirements, Count2Multiply integrates traditional row-wise error correction codes, such as Hamming and BCH, to address the high error rates in existing CIM designs. We demonstrate Count2Multiply with a detailed application to CIM in conventional DRAM due to its ubiquity and high endurance. However, we note that the Count2Multiply architecture is compatible with other functionally complete CIM proposals. Compared to the state-of-the-art in-DRAM CIM method, Count2Multiply achieves up to 10x speedup, 8x higher GOPS/Watt, and 9.5x higher GOPS/area, while outperforming GPU for vector-matrix multiplications. △ Less

Submitted 24 April, 2025; v1 submitted 16 September, 2024; originally announced September 2024.

Comments: 13 pages

arXiv:2401.14428 [pdf, other]

The Landscape of Compute-near-memory and Compute-in-memory: A Research and Commercial Overview

Authors: Asif Ali Khan, João Paulo C. De Lima, Hamid Farzaneh, Jeronimo Castrillon

Abstract: In today's data-centric world, where data fuels numerous application domains, with machine learning at the forefront, handling the enormous volume of data efficiently in terms of time and energy presents a formidable challenge. Conventional computing systems and accelerators are continually being pushed to their limits to stay competitive. In this context, computing near-memory (CNM) and computing… ▽ More In today's data-centric world, where data fuels numerous application domains, with machine learning at the forefront, handling the enormous volume of data efficiently in terms of time and energy presents a formidable challenge. Conventional computing systems and accelerators are continually being pushed to their limits to stay competitive. In this context, computing near-memory (CNM) and computing-in-memory (CIM) have emerged as potentially game-changing paradigms. This survey introduces the basics of CNM and CIM architectures, including their underlying technologies and working principles. We focus particularly on CIM and CNM architectures that have either been prototyped or commercialized. While surveying the evolving CIM and CNM landscape in academia and industry, we discuss the potential benefits in terms of performance, energy, and cost, along with the challenges associated with these cutting-edge computing paradigms. △ Less

Submitted 24 January, 2024; originally announced January 2024.

arXiv:2401.12630 [pdf, other]

Full-Stack Optimization for CAM-Only DNN Inference

Authors: João Paulo C. de Lima, Asif Ali Khan, Luigi Carro, Jeronimo Castrillon

Abstract: The accuracy of neural networks has greatly improved across various domains over the past years. Their ever-increasing complexity, however, leads to prohibitively high energy demands and latency in von Neumann systems. Several computing-in-memory (CIM) systems have recently been proposed to overcome this, but trade-offs involving accuracy, hardware reliability, and scalability for large models rem… ▽ More The accuracy of neural networks has greatly improved across various domains over the past years. Their ever-increasing complexity, however, leads to prohibitively high energy demands and latency in von Neumann systems. Several computing-in-memory (CIM) systems have recently been proposed to overcome this, but trade-offs involving accuracy, hardware reliability, and scalability for large models remain a challenge. Additionally, for some CIM designs, the activation movement still requires considerable time and energy. This paper explores the combination of algorithmic optimizations for ternary weight neural networks and associative processors (APs) implemented using racetrack memory (RTM). We propose a novel compilation flow to optimize convolutions on APs by reducing their arithmetic intensity. By leveraging the benefits of RTM-based APs, this approach substantially reduces data transfers within the memory while addressing accuracy, energy efficiency, and reliability concerns. Concretely, our solution improves the energy efficiency of ResNet-18 inference on ImageNet by 7.5x compared to crossbar in-memory accelerators while retaining software accuracy. △ Less

Submitted 23 January, 2024; originally announced January 2024.

Comments: To be presented at DATE24

arXiv:2309.06418 [pdf, other]

C4CAM: A Compiler for CAM-based In-memory Accelerators

Authors: Hamid Farzaneh, João Paulo Cardoso de Lima, Mengyuan Li, Asif Ali Khan, Xiaobo Sharon Hu, Jeronimo Castrillon

Abstract: Machine learning and data analytics applications increasingly suffer from the high latency and energy consumption of conventional von Neumann architectures. Recently, several in-memory and near-memory systems have been proposed to remove this von Neumann bottleneck. Platforms based on content-addressable memories (CAMs) are particularly interesting due to their efficient support for the search-bas… ▽ More Machine learning and data analytics applications increasingly suffer from the high latency and energy consumption of conventional von Neumann architectures. Recently, several in-memory and near-memory systems have been proposed to remove this von Neumann bottleneck. Platforms based on content-addressable memories (CAMs) are particularly interesting due to their efficient support for the search-based operations that form the foundation for many applications, including K-nearest neighbors (KNN), high-dimensional computing (HDC), recommender systems, and one-shot learning among others. Today, these platforms are designed by hand and can only be programmed with low-level code, accessible only to hardware experts. In this paper, we introduce C4CAM, the first compiler framework to quickly explore CAM configurations and to seamlessly generate code from high-level TorchScript code. C4CAM employs a hierarchy of abstractions that progressively lowers programs, allowing code transformations at the most suitable abstraction level. Depending on the type and technology, CAM arrays exhibit varying latencies and power profiles. Our framework allows analyzing the impact of such differences in terms of system-level performance and energy consumption, and thus supports designers in selecting appropriate designs for a given application. △ Less

Submitted 12 September, 2023; originally announced September 2023.

Comments: 10 pages, 9 figures

arXiv:2308.04515 [pdf, other]

Toward unlabeled multi-view 3D pedestrian detection by generalizable AI: techniques and performance analysis

Authors: João Paulo Lima, Diego Thomas, Hideaki Uchiyama, Veronica Teichrieb

Abstract: We unveil how generalizable AI can be used to improve multi-view 3D pedestrian detection in unlabeled target scenes. One way to increase generalization to new scenes is to automatically label target data, which can then be used for training a detector model. In this context, we investigate two approaches for automatically labeling target data: pseudo-labeling using a supervised detector and automa… ▽ More We unveil how generalizable AI can be used to improve multi-view 3D pedestrian detection in unlabeled target scenes. One way to increase generalization to new scenes is to automatically label target data, which can then be used for training a detector model. In this context, we investigate two approaches for automatically labeling target data: pseudo-labeling using a supervised detector and automatic labeling using an untrained detector (that can be applied out of the box without any training). We adopt a training framework for optimizing detector models using automatic labeling procedures. This framework encompasses different training sets/modes and multi-round automatic labeling strategies. We conduct our analyses on the publicly-available WILDTRACK and MultiviewX datasets. We show that, by using the automatic labeling approach based on an untrained detector, we can obtain superior results than directly using the untrained detector or a detector trained with an existing labeled source dataset. It achieved a MODA about 4% and 1% better than the best existing unlabeled method when using WILDTRACK and MultiviewX as target datasets, respectively. △ Less

Submitted 8 August, 2023; originally announced August 2023.

Comments: Accepted to SIBGRAPI 2023

arXiv:2112.06735 [pdf, other]

doi 10.1140/epjb/s10051-022-00453-3

Unsupervised machine learning approaches to the $q$-state Potts model

Authors: Andrea Tirelli, Danyella O. Carvalho, Lucas A. Oliveira, J. P. Lima, Natanael C. Costa, Raimundo R. dos Santos

Abstract: In this paper with study phase transitions of the $q$-state Potts model, through a number of unsupervised machine learning techniques, namely Principal Component Analysis (PCA), $k$-means clustering, Uniform Manifold Approximation and Projection (UMAP), and Topological Data Analysis (TDA). Even though in all cases we are able to retrieve the correct critical temperatures $T_c(q)$, for $q = 3, 4$ a… ▽ More In this paper with study phase transitions of the $q$-state Potts model, through a number of unsupervised machine learning techniques, namely Principal Component Analysis (PCA), $k$-means clustering, Uniform Manifold Approximation and Projection (UMAP), and Topological Data Analysis (TDA). Even though in all cases we are able to retrieve the correct critical temperatures $T_c(q)$, for $q = 3, 4$ and $5$, results show that non-linear methods as UMAP and TDA are less dependent on finite size effects, while still being able to distinguish between first and second order phase transitions. This study may be considered as a benchmark for the use of different unsupervised machine learning algorithms in the investigation of phase transitions. △ Less

Submitted 18 March, 2022; v1 submitted 13 December, 2021; originally announced December 2021.

Comments: Added computation of critical exponents; exposition improved

arXiv:2104.05813 [pdf, other]

Generalizable Multi-Camera 3D Pedestrian Detection

Authors: João Paulo Lima, Rafael Roberto, Lucas Figueiredo, Francisco Simões, Veronica Teichrieb

Abstract: We present a multi-camera 3D pedestrian detection method that does not need to train using data from the target scene. We estimate pedestrian location on the ground plane using a novel heuristic based on human body poses and person's bounding boxes from an off-the-shelf monocular detector. We then project these locations onto the world ground plane and fuse them with a new formulation of a clique… ▽ More We present a multi-camera 3D pedestrian detection method that does not need to train using data from the target scene. We estimate pedestrian location on the ground plane using a novel heuristic based on human body poses and person's bounding boxes from an off-the-shelf monocular detector. We then project these locations onto the world ground plane and fuse them with a new formulation of a clique cover problem. We also propose an optional step for exploiting pedestrian appearance during fusion by using a domain-generalizable person re-identification model. We evaluated the proposed approach on the challenging WILDTRACK dataset. It obtained a MODA of 0.569 and an F-score of 0.78, superior to state-of-the-art generalizable detection techniques. △ Less

Submitted 12 April, 2021; originally announced April 2021.

Comments: Accepted to CVPRW 2021, LatinX in Computer Vision (LXCV) Workshop

arXiv:0902.2187 [pdf]

A Standalone Markerless 3D Tracker for Handheld Augmented Reality

Authors: Joao Paulo Lima, Veronica Teichrieb, Judith Kelner

Abstract: This paper presents an implementation of a markerless tracking technique targeted to the Windows Mobile Pocket PC platform. The primary aim of this work is to allow the development of standalone augmented reality applications for handheld devices based on natural feature tracking. In order to achieve this goal, a subset of two computer vision libraries was ported to the Pocket PC platform. They… ▽ More This paper presents an implementation of a markerless tracking technique targeted to the Windows Mobile Pocket PC platform. The primary aim of this work is to allow the development of standalone augmented reality applications for handheld devices based on natural feature tracking. In order to achieve this goal, a subset of two computer vision libraries was ported to the Pocket PC platform. They were also adapted to use fixed point math, with the purpose of improving the overall performance of the routines. The port of these libraries opens up the possibility of having other computer vision tasks being executed on mobile platforms. A model based tracking approach that relies on edge information was adopted. Since it does not require a high processing power, it is suitable for constrained devices such as handhelds. The OpenGL ES graphics library was used to perform computer vision tasks, taking advantage of existing graphics hardware acceleration. An augmented reality application was created using the implemented technique and evaluations were done regarding tracking performance and accuracy △ Less

Submitted 12 February, 2009; originally announced February 2009.

Showing 1–10 of 10 results for author: Lima, J P