-
DX100: A Programmable Data Access Accelerator for Indirection
Authors:
Alireza Khadem,
Kamalavasan Kamalakkannan,
Zhenyan Zhu,
Akash Poptani,
Yufeng Gu,
Jered Benjamin Dominguez-Trujillo,
Nishil Talati,
Daichi Fujiki,
Scott Mahlke,
Galen Shipman,
Reetuparna Das
Abstract:
Indirect memory accesses frequently appear in applications where memory bandwidth is a critical bottleneck. Prior indirect memory access proposals, such as indirect prefetchers, runahead execution, fetchers, and decoupled access/execute architectures, primarily focus on improving memory access latency by loading data ahead of computation but still rely on the DRAM controllers to reorder memory requests and enhance memory bandwidth utilization. DRAM controllers have limited visibility to future memory accesses due to the small capacity of request buffers and the restricted memory-level parallelism of conventional core and memory systems. We introduce DX100, a programmable data access accelerator for indirect memory accesses. DX100 is shared across cores to offload bulk indirect memory accesses and associated address calculation operations. DX100 reorders, interleaves, and coalesces memory requests to improve DRAM row-buffer hit rate and memory bandwidth utilization. DX100 provides a general-purpose ISA to support diverse access types, loop patterns, conditional accesses, and address calculations. To support this accelerator without significant programming efforts, we discuss a set of MLIR compiler passes that automatically transform legacy code to utilize DX100. Experimental evaluations on 12 benchmarks spanning scientific computing, database, and graph applications show that DX100 achieves performance improvements of 2.6x over a multicore baseline and 2.0x over the state-of-the-art indirect prefetcher.
Submitted 2 June, 2025; v1 submitted 29 May, 2025;
originally announced May 2025.
-
PIM Is All You Need: A CXL-Enabled GPU-Free System for Large Language Model Inference
Authors:
Yufeng Gu,
Alireza Khadem,
Sumanth Umesh,
Ning Liang,
Xavier Servot,
Onur Mutlu,
Ravi Iyer,
Reetuparna Das
Abstract:
Large Language Model (LLM) inference uses an autoregressive manner to generate one token at a time, which exhibits notably lower operational intensity compared to earlier Machine Learning (ML) models such as encoder-only transformers and Convolutional Neural Networks. At the same time, LLMs possess large parameter sizes and use key-value caches to store context information. Modern LLMs support context windows with up to 1 million tokens to generate versatile text, audio, and video content. A large key-value cache unique to each prompt requires a large memory capacity, limiting the inference batch size. Both low operational intensity and limited batch size necessitate a high memory bandwidth. However, contemporary hardware systems for ML model deployment, such as GPUs and TPUs, are primarily optimized for compute throughput. This mismatch challenges the efficient deployment of advanced LLMs and makes users pay for expensive compute resources that are poorly utilized for the memory-bound LLM inference tasks.
We propose CENT, a CXL-ENabled GPU-Free sysTem for LLM inference, which harnesses CXL memory expansion capabilities to accommodate substantial LLM sizes, and utilizes near-bank processing units to deliver high memory bandwidth, eliminating the need for expensive GPUs. CENT exploits a scalable CXL network to support peer-to-peer and collective communication primitives across CXL devices. We implement various parallelism strategies to distribute LLMs across these devices. Compared to GPU baselines with maximum supported batch sizes and similar average power, CENT achieves 2.3$\times$ higher throughput and consumes 2.9$\times$ less energy. CENT enhances the Total Cost of Ownership (TCO), generating 5.2$\times$ more tokens per dollar than GPUs.
Submitted 3 May, 2025; v1 submitted 11 February, 2025;
originally announced February 2025.
-
Multi-Dimensional Vector ISA Extension for Mobile In-Cache Computing
Authors:
Alireza Khadem,
Daichi Fujiki,
Hilbert Chen,
Yufeng Gu,
Nishil Talati,
Scott Mahlke,
Reetuparna Das
Abstract:
In-cache computing technology transforms existing caches into long-vector compute units and offers low-cost alternatives to building expensive vector engines for mobile CPUs. Unfortunately, existing long-vector Instruction Set Architecture (ISA) extensions, such as RISC-V Vector Extension (RVV) and Arm Scalable Vector Extension (SVE), provide only one-dimensional strided and random memory accesses. While this is sufficient for typical vector engines, it fails to effectively utilize the large Single Instruction, Multiple Data (SIMD) widths of in-cache vector engines. This is because mobile data-parallel kernels expose limited parallelism across a single dimension.
Based on our analysis of mobile vector kernels, we introduce a long-vector Multi-dimensional Vector ISA Extension (MVE) for mobile in-cache computing. MVE achieves high SIMD resource utilization and enables flexible programming by abstracting cache geometry and data layout. The proposed ISA features multi-dimensional strided and random memory accesses and efficient dimension-level masked execution to encode parallelism across multiple dimensions. Using a wide range of data-parallel mobile workloads, we demonstrate that MVE offers significant performance and energy reduction benefits of 2.9x and 8.8x, on average, compared to the SIMD units of a commercial mobile processor, at an area overhead of 3.6%.
Submitted 16 January, 2025;
originally announced January 2025.
-
Diagnosing Bipolar Disorder from 3-D Structural Magnetic Resonance Images Using a Hybrid GAN-CNN Method
Authors:
Masood Hamed Saghayan,
Mohammad Hossein Zolfagharnasab,
Ali Khadem,
Farzam Matinfar,
Hassan Rashidi
Abstract:
Bipolar Disorder (BD) is a psychiatric condition diagnosed by repetitive cycles of hypomania and depression. Since diagnosing BD relies on subjective behavioral assessments over a long period, a solid diagnosis based on objective criteria is not straightforward. The current study responded to the described obstacle by proposing a hybrid GAN-CNN model to diagnose BD from 3-D structural MRI (sMRI) images. The novelty of this study stems from diagnosing BD from sMRI samples rather than conventional datasets such as functional MRI (fMRI), electroencephalography (EEG), and behavioral symptoms while removing the data insufficiency usually encountered when dealing with sMRI samples. The impact of various augmentation ratios is also tested using 5-fold cross-validation. Based on the results, this study obtains an accuracy rate of 75.8%, a sensitivity of 60.3%, and a specificity of 82.5%, which are 3-5% higher than prior work while utilizing less than 6% sample counts. Next, it is demonstrated that a 2-D layer-based GAN generator can effectively reproduce complex 3-D brain samples, a more straightforward technique than manual image processing. Lastly, the optimum augmentation threshold for the current study using 172 sMRI samples is 50%, showing the applicability of the described method for larger sMRI datasets. In conclusion, it is established that data augmentation using GAN improves the accuracy of the CNN classifier using sMRI samples, thus developing more reliable decision support systems to assist practitioners in identifying BD patients more reliably and in a shorter period.
Submitted 11 October, 2023;
originally announced October 2023.
-
Vector-Processing for Mobile Devices: Benchmark and Analysis
Authors:
Alireza Khadem,
Daichi Fujiki,
Nishil Talati,
Scott Mahlke,
Reetuparna Das
Abstract:
Vector processing has become commonplace in today's CPU microarchitectures. Vector instructions improve performance and energy efficiency, which is crucial for resource-constrained mobile devices. The research community currently lacks a comprehensive benchmark suite to study the benefits of vector processing for mobile devices. This paper presents Swan, an extensive vector processing benchmark suite for mobile applications. Swan consists of a diverse set of data-parallel workloads from four commonly used mobile applications: operating system, web browser, audio/video messaging application, and PDF rendering engine. Using the Swan benchmark suite, we conduct a detailed analysis of the performance, power, and energy consumption of vectorized workloads, and show that: (a) Vectorized kernels increase the pressure on the cache hierarchy due to the higher rate of memory requests. (b) Vector processing is more beneficial for workloads with lower precision operations and higher cache hit rates. (c) Limited Instruction-Level Parallelism and strided memory accesses to multi-dimensional data structures prevent vector processing benefits from scaling with more SIMD functional units and wider registers. (d) Despite lower computation throughput than domain-specific accelerators, such as GPUs, vector processing outperforms these accelerators for kernels with lower operation counts. Finally, we show five common computation patterns in mobile data-parallel workloads that dominate the execution time.
Submitted 5 September, 2023;
originally announced September 2023.
-
Automatic diagnosis of schizophrenia and attention deficit hyperactivity disorder in rs-fMRI modality using convolutional autoencoder model and interval type-2 fuzzy regression
Authors:
Afshin Shoeibi,
Navid Ghassemi,
Marjane Khodatars,
Parisa Moridian,
Abbas Khosravi,
Assef Zare,
Juan M. Gorriz,
Amir Hossein Chale-Chale,
Ali Khadem,
U. Rajendra Acharya
Abstract:
Nowadays, many people worldwide suffer from brain disorders, and their health is in danger. So far, numerous methods have been proposed for the diagnosis of Schizophrenia (SZ) and attention deficit hyperactivity disorder (ADHD), among which functional magnetic resonance imaging (fMRI) modalities are known as a popular method among physicians. This paper presents an SZ and ADHD intelligent detection method of resting-state fMRI (rs-fMRI) modality using a new deep learning method. The University of California Los Angeles dataset, which contains the rs-fMRI modalities of SZ and ADHD patients, has been used for experiments. The FMRIB software library toolbox first performed preprocessing on rs-fMRI data. Then, a convolutional Autoencoder model with the proposed number of layers is used to extract features from rs-fMRI data. In the classification step, a new fuzzy method called interval type-2 fuzzy regression (IT2FR) is introduced and then optimized by genetic algorithm, particle swarm optimization, and gray wolf optimization (GWO) techniques. Also, the results of IT2FR methods are compared with multilayer perceptron, k-nearest neighbors, support vector machine, random forest, and decision tree, and adaptive neuro-fuzzy inference system methods. The experiment results show that the IT2FR method with the GWO optimization algorithm has achieved satisfactory results compared to other classifier methods. Finally, the proposed classification technique was able to provide 72.71% accuracy.
Submitted 14 November, 2022; v1 submitted 31 May, 2022;
originally announced May 2022.
-
Automatic Diagnosis of Schizophrenia in EEG Signals Using CNN-LSTM Models
Authors:
Afshin Shoeibi,
Delaram Sadeghi,
Parisa Moridian,
Navid Ghassemi,
Jonathan Heras,
Roohallah Alizadehsani,
Ali Khadem,
Yinan Kong,
Saeid Nahavandi,
Yu-Dong Zhang,
Juan M. Gorriz
Abstract:
Schizophrenia (SZ) is a mental disorder whereby due to the secretion of specific chemicals in the brain, the function of some brain regions is out of balance, leading to the lack of coordination between thoughts, actions, and emotions. This study provides various intelligent deep learning (DL)-based methods for automated SZ diagnosis via electroencephalography (EEG) signals. The obtained results are compared with those of conventional intelligent methods. To implement the proposed methods, the dataset of the Institute of Psychiatry and Neurology in Warsaw, Poland, has been used. First, EEG signals were divided into 25 s time frames and then were normalized by z-score or norm L2. In the classification step, two different approaches were considered for SZ diagnosis via EEG signals. In this step, the classification of EEG signals was first carried out by conventional machine learning methods, e.g., support vector machine, k-nearest neighbors, decision tree, naïve Bayes, random forest, extremely randomized trees, and bagging. Various proposed DL models, namely, long short-term memories (LSTMs), one-dimensional convolutional networks (1D-CNNs), and 1D-CNN-LSTMs, were used in the following. In this step, the DL models were implemented and compared with different activation functions. Among the proposed DL models, the CNN-LSTM architecture has had the best performance. In this architecture, the ReLU activation function with the z-score and L2-combined normalization was used. The proposed CNN-LSTM model has achieved an accuracy percentage of 99.25%, better than the results of most former studies in this field. It is worth mentioning that to perform all simulations, the k-fold cross-validation method with k = 5 has been used.
Submitted 1 December, 2021; v1 submitted 2 September, 2021;
originally announced September 2021.
-
CoDR: Computation and Data Reuse Aware CNN Accelerator
Authors:
Alireza Khadem,
Haojie Ye,
Trevor Mudge
Abstract:
Computation and Data Reuse is critical for the resource-limited Convolutional Neural Network (CNN) accelerators. This paper presents Universal Computation Reuse to exploit weight sparsity, repetition, and similarity simultaneously in a convolutional layer. Moreover, CoDR decreases the cost of weight memory access by proposing a customized Run-Length Encoding scheme and the number of memory accesses to the intermediate results by introducing an input and output stationary dataflow. Compared to two recent compressed CNN accelerators with the same area of 2.85 mm^2, CoDR decreases SRAM access by 5.08x and 7.99x, and consumes 3.76x and 6.84x less energy.
Submitted 20 April, 2021;
originally announced April 2021.
-
An overview of artificial intelligence techniques for diagnosis of Schizophrenia based on magnetic resonance imaging modalities: Methods, challenges, and future works
Authors:
Delaram Sadeghi,
Afshin Shoeibi,
Navid Ghassemi,
Parisa Moridian,
Ali Khadem,
Roohallah Alizadehsani,
Mohammad Teshnehlab,
Juan M. Gorriz,
Fahime Khozeimeh,
Yu-Dong Zhang,
Saeid Nahavandi,
U Rajendra Acharya
Abstract:
Schizophrenia (SZ) is a mental disorder that typically emerges in late adolescence or early adulthood. It reduces the life expectancy of patients by 15 years. Abnormal behavior, perception of emotions, social relationships, and reality perception are among its most significant symptoms. Past studies have revealed that SZ affects the temporal and anterior lobes of hippocampus regions of the brain. Also, increased volume of cerebrospinal fluid (CSF) and decreased volume of white and gray matter can be observed due to this disease. Magnetic resonance imaging (MRI) is the popular neuroimaging technique used to explore structural/functional brain abnormalities in SZ disorder, owing to its high spatial resolution. Various artificial intelligence (AI) techniques have been employed with advanced image/signal processing methods to accurately diagnose SZ. This paper presents a comprehensive overview of studies conducted on the automated diagnosis of SZ using MRI modalities. First, an AI-based computer-aided diagnosis system (CADS) for SZ diagnosis and its relevant sections are presented. Then, this section introduces the most important conventional machine learning (ML) and deep learning (DL) techniques in diagnosing SZ. A comprehensive comparison is also made between ML and DL studies in the discussion section. In the following, the most important challenges in diagnosing SZ are addressed. Future works in diagnosing SZ using AI techniques and MRI modalities are recommended in another section. Results, conclusion, and research findings are also presented at the end.
Submitted 10 May, 2022; v1 submitted 24 February, 2021;
originally announced March 2021.
-
Automated Detection and Forecasting of COVID-19 using Deep Learning Techniques: A Review
Authors:
Afshin Shoeibi,
Marjane Khodatars,
Mahboobeh Jafari,
Navid Ghassemi,
Delaram Sadeghi,
Parisa Moridian,
Ali Khadem,
Roohallah Alizadehsani,
Sadiq Hussain,
Assef Zare,
Zahra Alizadeh Sani,
Fahime Khozeimeh,
Saeid Nahavandi,
U. Rajendra Acharya,
Juan M. Gorriz
Abstract:
Coronavirus, or COVID-19, is a hazardous disease that has endangered the health of many people around the world by directly affecting the lungs. COVID-19 is a medium-sized, coated virus with a single-stranded RNA, and also has one of the largest RNA genomes and is approximately 120 nm. The X-Ray and computed tomography (CT) imaging modalities are widely used to obtain a fast and accurate medical diagnosis. Identifying COVID-19 from these medical images is extremely challenging as it is time-consuming and prone to human errors. Hence, artificial intelligence (AI) methodologies can be used to obtain consistent high performance. Among the AI methods, deep learning (DL) networks have gained popularity recently compared to conventional machine learning (ML). Unlike ML, all stages of feature extraction, feature selection, and classification are accomplished automatically in DL models. In this paper, a complete survey of studies on the application of DL techniques for COVID-19 diagnostic and segmentation of lungs is discussed, concentrating on works that used X-Ray and CT images. Additionally, a review of papers on the forecasting of coronavirus prevalence in different parts of the world with DL is presented. Lastly, the challenges faced in the detection of COVID-19 using DL techniques and directions for future research are discussed.
Submitted 10 February, 2024; v1 submitted 16 July, 2020;
originally announced July 2020.
-
Deep Learning for Neuroimaging-based Diagnosis and Rehabilitation of Autism Spectrum Disorder: A Review
Authors:
Marjane Khodatars,
Afshin Shoeibi,
Delaram Sadeghi,
Navid Ghassemi,
Mahboobeh Jafari,
Parisa Moridian,
Ali Khadem,
Roohallah Alizadehsani,
Assef Zare,
Yinan Kong,
Abbas Khosravi,
Saeid Nahavandi,
Sadiq Hussain,
U. Rajendra Acharya,
Michael Berk
Abstract:
Accurate diagnosis of Autism Spectrum Disorder (ASD) followed by effective rehabilitation is essential for the management of this disorder. Artificial intelligence (AI) techniques can aid physicians to apply automatic diagnosis and rehabilitation procedures. AI techniques comprise traditional machine learning (ML) approaches and deep learning (DL) techniques. Conventional ML methods employ various feature extraction and classification techniques, but in DL, the process of feature extraction and classification is accomplished intelligently and integrally. DL methods for diagnosis of ASD have been focused on neuroimaging-based approaches. Neuroimaging techniques are non-invasive disease markers potentially useful for ASD diagnosis. Structural and functional neuroimaging techniques provide physicians substantial information about the structure (anatomy and structural connectivity) and function (activity and functional connectivity) of the brain. Due to the intricate structure and function of the brain, proposing optimum procedures for ASD diagnosis with neuroimaging data without exploiting powerful AI techniques like DL may be challenging. In this paper, studies conducted with the aid of DL networks to distinguish ASD are investigated. Rehabilitation tools provided for supporting ASD patients utilizing DL networks are also assessed. Finally, we will present important challenges in the automated detection and rehabilitation of ASD and propose some future works.
Submitted 1 November, 2021; v1 submitted 2 July, 2020;
originally announced July 2020.
-
Design Challenges of Neural Network Acceleration Using Stochastic Computing
Authors:
Alireza Khadem
Abstract:
The enormous and ever-increasing complexity of state-of-the-art neural networks (NNs) has impeded the deployment of deep learning on resource-limited devices such as the Internet of Things (IoTs). Stochastic computing exploits the inherent amenability to approximation characteristic of NNs to reduce their energy and area footprint, two critical requirements of small embedded devices suitable for the IoTs. This report evaluates and compares two recently proposed stochastic-based NN designs, referred to as BISC (Binary Interfaced Stochastic Computing) by Sim and Lee, 2017, and ESL (Extended Stochastic Logic) by Canals et al., 2016. Using analysis and simulation, we compare three distinct implementations of these designs in terms of performance, power consumption, area, and accuracy. We also discuss the overall challenges faced in adopting stochastic computing for building NNs. We find that BISC outperforms the other architectures when executing the LeNet-5 NN model applied to the MNIST digit recognition dataset. Our analysis and simulation experiments indicate that this architecture is around 50X faster, occupies 5.7X and 2.9X less area, and consumes 7.8X and 1.8X less power than the two ESL architectures.
Submitted 8 June, 2020;
originally announced June 2020.