Search | arXiv e-print repository

doi 10.1145/3695053.3731109

Hybrid SLC-MLC RRAM Mixed-Signal Processing-in-Memory Architecture for Transformer Acceleration via Gradient Redistribution

Authors: Chang Eun Song, Priyansh Bhatnagar, Zihan Xia, Nam Sung Kim, Tajana Rosing, Mingu Kang

Abstract: Transformers, while revolutionary, face challenges due to their demanding computational cost and large data movement. To address this, we propose HyFlexPIM, a novel mixed-signal processing-in-memory (PIM) accelerator for inference that flexibly utilizes both single-level cell (SLC) and multi-level cell (MLC) RRAM technologies to trade off accuracy and efficiency. HyFlexPIM achieves efficient dual-mode operation by utilizing digital PIM for high-precision and write-intensive operations and analog PIM for highly parallel, low-precision computations. The analog PIM further distributes tasks between SLC and MLC PIM operations, where a single analog PIM module can be reconfigured to switch between two operations (SLC/MLC) with minimal overhead (<1% for area & energy). Critical weights are allocated to SLC RRAM for high accuracy, while less critical weights are assigned to MLC RRAM to maximize capacity, power, and latency efficiency. However, despite employing such a hybrid mechanism, brute-force mapping on hardware fails to deliver significant benefits due to the limited proportion of weights accelerated by the MLC and the noticeable degradation in accuracy. To maximize the potential of our hybrid hardware architecture, we propose an algorithm co-optimization technique, called gradient redistribution, which uses Singular Value Decomposition (SVD) to decompose and truncate matrices based on their importance, then fine-tunes them to concentrate significance into a small subset of weights. By doing so, only 5-10% of the weights have dominantly large gradients, making it favorable for HyFlexPIM by minimizing the use of expensive SLC RRAM while maximizing the use of efficient MLC RRAM. Our evaluation shows that HyFlexPIM significantly enhances computational throughput and energy efficiency, achieving up to 1.86X and 1.45X improvements, respectively, over state-of-the-art methods.
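The SLC/MLC allocation described in this abstract can be illustrated with a small sketch. It assumes that "critical" means "largest gradient magnitude" and that a fixed fraction of weights goes to SLC; the function name `split_weights_by_gradient` and the `slc_fraction` parameter are hypothetical and not taken from the paper, which may use a different criterion or granularity.

```python
def split_weights_by_gradient(gradients, slc_fraction=0.1):
    """Partition weight indices into an SLC set and an MLC set.

    Assumption (not from the paper): a weight is "critical" when its
    gradient magnitude ranks in the top `slc_fraction` of all weights.
    """
    if not 0.0 <= slc_fraction <= 1.0:
        raise ValueError("slc_fraction must be between 0 and 1")
    n_slc = round(len(gradients) * slc_fraction)
    # Rank indices from largest to smallest gradient magnitude.
    order = sorted(range(len(gradients)),
                   key=lambda i: abs(gradients[i]),
                   reverse=True)
    slc_indices = sorted(order[:n_slc])   # high-accuracy cells
    mlc_indices = sorted(order[n_slc:])   # dense, efficient cells
    return slc_indices, mlc_indices


# Toy gradients: two weights dominate, the rest are near zero.
grads = [0.01, -0.90, 0.02, 0.03, 0.75, -0.01, 0.02, 0.00, 0.04, -0.02]
slc, mlc = split_weights_by_gradient(grads, slc_fraction=0.2)
print(slc)       # indices assigned to SLC
print(len(mlc))  # number of weights assigned to MLC
```

When gradients are concentrated in a few weights, as the abstract reports after fine-tuning, a small `slc_fraction` would cover most of the gradient mass in a scheme like this.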

Submitted 20 May, 2025; originally announced June 2025.

Comments: Accepted by ISCA'25

arXiv:2505.15146 [pdf, ps, other]

lmgame-Bench: How Good are LLMs at Playing Games?

Authors: Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P. Xing, Ion Stoica, Tajana Rosing, Haojian Jin, Hao Zhang

Abstract: Playing video games requires perception, memory, and planning, exactly the faculties modern large language model (LLM) agents are expected to master. We study the major challenges in using popular video games to evaluate modern LLMs and find that directly dropping LLMs into games does not yield an effective evaluation, for three reasons -- brittle vision perception, prompt sensitivity, and potential data contamination. We introduce lmgame-Bench to turn games into reliable evaluations. lmgame-Bench features a suite of platformer, puzzle, and narrative games delivered through a unified Gym-style API and paired with lightweight perception and memory scaffolds, and is designed to stabilize prompt variance and remove contamination. Across 13 leading models, we show lmgame-Bench is challenging while still separating models well. Correlation analysis shows that every game probes a unique blend of capabilities often tested in isolation elsewhere. More interestingly, performing reinforcement learning on a single game from lmgame-Bench transfers both to unseen games and to external planning tasks. Our evaluation code is available at https://github.com/lmgame-org/GamingAgent/lmgame-bench.

Submitted 3 June, 2025; v1 submitted 21 May, 2025; originally announced May 2025.

arXiv:2505.05413 [pdf, other]

DPQ-HD: Post-Training Compression for Ultra-Low Power Hyperdimensional Computing

Authors: Nilesh Prasad Pandey, Shriniwas Kulkarni, David Wang, Onat Gungor, Flavio Ponzina, Tajana Rosing

Abstract: Hyperdimensional Computing (HDC) is emerging as a promising approach for edge AI, offering a balance between accuracy and efficiency. However, current HDC-based applications often rely on high-precision models and/or encoding matrices to achieve competitive performance, which imposes significant computational and memory demands, especially for ultra-low power devices. While recent efforts use techniques like precision reduction and pruning to increase the efficiency, most require retraining to maintain performance, making them expensive and impractical. To address this issue, we propose a novel Post Training Compression algorithm, Decomposition-Pruning-Quantization (DPQ-HD), which aims at compressing the end-to-end HDC system, achieving near floating point performance without the need for retraining. DPQ-HD reduces computational and memory overhead by uniquely combining the above three compression techniques and efficiently adapts to hardware constraints. Additionally, we introduce an energy-efficient inference approach that progressively evaluates similarity scores such as cosine similarity and performs early exit to reduce the computation, accelerating prediction inference while maintaining accuracy. We demonstrate that DPQ-HD achieves up to 20-100x reduction in memory for image and graph classification tasks with only a 1-2% drop in accuracy compared to uncompressed workloads. Lastly, we show that DPQ-HD outperforms the existing post-training compression methods and performs better than or on par with retraining-based state-of-the-art techniques, requiring significantly less overall optimization time (up to 100x) and faster inference (up to 56x) on a microcontroller.
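The progressive similarity evaluation with early exit mentioned in this abstract might look roughly like the sketch below. It is only an illustration under stated assumptions: hypervectors are bipolar (+1/-1), so the dot product ranks classes the same way cosine similarity does, and the exit rule (a fixed lead over the runner-up) is a hypothetical heuristic, not the criterion used by DPQ-HD. The names `predict_with_early_exit`, `chunk`, and `margin` are invented for this example.

```python
def predict_with_early_exit(query, class_vectors, chunk=4, margin=3):
    """Classify `query` by accumulating similarity chunk by chunk.

    Returns (predicted_class_index, dimensions_processed).
    Assumption: bipolar hypervectors of equal length, so partial dot
    products stand in for cosine similarity. The margin-based exit is a
    heuristic and can disagree with a full evaluation when `margin` is small.
    """
    scores = [0] * len(class_vectors)
    dims = len(query)
    for start in range(0, dims, chunk):
        end = min(start + chunk, dims)
        # Add this chunk's contribution to every class score.
        for c, vec in enumerate(class_vectors):
            scores[c] += sum(q * v for q, v in zip(query[start:end], vec[start:end]))
        ranked = sorted(scores, reverse=True)
        # Stop once the leader is far enough ahead of the runner-up.
        if len(ranked) > 1 and ranked[0] - ranked[1] >= margin:
            return scores.index(ranked[0]), end
    return scores.index(max(scores)), dims


query = [1, -1, 1, 1, -1, 1, -1, -1]
classes = [
    [1, -1, 1, 1, -1, 1, -1, -1],   # identical to the query
    [-1, 1, -1, -1, 1, -1, 1, 1],   # its negation
]
result = predict_with_early_exit(query, classes)
print(result)  # (class index, number of dimensions actually used)
```

In this toy case the decision is made after half of the dimensions, which is the kind of saving an early-exit scheme aims for.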

Submitted 8 May, 2025; originally announced May 2025.

arXiv:2504.13301 [pdf, other]

DYNAMITE: Dynamic Defense Selection for Enhancing Machine Learning-based Intrusion Detection Against Adversarial Attacks

Authors: Jing Chen, Onat Gungor, Zhengli Shang, Elvin Li, Tajana Rosing

Abstract: The rapid proliferation of the Internet of Things (IoT) has introduced substantial security vulnerabilities, highlighting the need for robust Intrusion Detection Systems (IDS). Machine learning-based intrusion detection systems (ML-IDS) have significantly improved threat detection capabilities; however, they remain highly susceptible to adversarial attacks. While numerous defense mechanisms have been proposed to enhance ML-IDS resilience, a systematic approach for selecting the most effective defense against a specific adversarial attack remains absent. To address this challenge, we propose Dynamite, a dynamic defense selection framework that enhances ML-IDS by intelligently identifying and deploying the most suitable defense using a machine learning-driven selection mechanism. Our results demonstrate that Dynamite achieves a 96.2% reduction in computational time compared to the Oracle, significantly decreasing computational overhead while preserving strong prediction performance. Dynamite also demonstrates an average F1-score improvement of 76.7% over random defense and 65.8% over the best static state-of-the-art defense.

Submitted 17 April, 2025; originally announced April 2025.

Comments: Accepted by the IEEE/ACM Workshop on the Internet of Safe Things (SafeThings 2025)

arXiv:2504.01921 [pdf, other]

Client Selection in Federated Learning with Data Heterogeneity and Network Latencies

Authors: Harsh Vardhan, Xiaofan Yu, Tajana Rosing, Arya Mazumdar

Abstract: Federated learning (FL) is a distributed machine learning paradigm where multiple clients conduct local training based on their private data, then the updated models are sent to a central server for global aggregation. The practical convergence of FL is challenged by multiple factors, with the primary hurdle being the heterogeneity among clients. This heterogeneity manifests as data heterogeneity concerning local data distribution and latency heterogeneity during model transmission to the server. While prior research has introduced various efficient client selection methods to alleviate the negative impacts of either of these heterogeneities individually, efficient methods to handle real-world settings where both these heterogeneities exist simultaneously do not exist. In this paper, we propose two novel theoretically optimal client selection schemes that can handle both these heterogeneities. Our methods involve solving simple optimization problems every round obtained by minimizing the theoretical runtime to convergence. Empirical evaluations on 9 datasets with non-iid data distributions, 2 practical delay distributions, and non-convex neural network models demonstrate that our algorithms are at least competitive with and up to 20 times better than the best existing baselines.

Submitted 2 April, 2025; originally announced April 2025.

arXiv:2503.07882 [pdf, other]

ReLATE: Resilient Learner Selection for Multivariate Time-Series Classification Against Adversarial Attacks

Authors: Cagla Ipek Kocal, Onat Gungor, Aaron Tartz, Tajana Rosing, Baris Aksanli

Abstract: Minimizing computational overhead in time-series classification, particularly in deep learning models, presents a significant challenge. This challenge is further compounded by adversarial attacks, emphasizing the need for resilient methods that ensure robust performance and efficient model selection. We introduce ReLATE, a framework that identifies robust learners based on dataset similarity, reduces computational overhead, and enhances resilience. ReLATE maintains multiple deep learning models in well-known adversarial attack scenarios, capturing model performance. ReLATE identifies the most analogous dataset to a given target using a similarity metric, then applies the optimal model from the most similar dataset. ReLATE reduces computational overhead by an average of 81.2%, enhancing adversarial resilience and streamlining robust model selection, all without sacrificing performance, within 4.2% of Oracle.
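The selection step in this abstract (find the most similar known dataset, then reuse its best model) can be sketched as follows. The abstract does not say which similarity metric or dataset descriptors ReLATE uses, so cosine similarity over made-up feature vectors is assumed here purely for illustration; `select_model`, `known_datasets`, and the model labels are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_model(target_features, known_datasets):
    """Return (nearest_dataset_name, its_best_model).

    `known_datasets` maps a dataset name to (feature_vector, best_model),
    where best_model was determined offline under adversarial evaluation.
    Assumption: datasets are summarised by fixed-length feature vectors.
    """
    nearest = max(known_datasets,
                  key=lambda name: cosine(target_features, known_datasets[name][0]))
    return nearest, known_datasets[nearest][1]


known_datasets = {
    "dataset_a": ([1.0, 0.0, 0.5], "cnn"),
    "dataset_b": ([0.0, 1.0, 0.5], "lstm"),
}
choice = select_model([0.9, 0.1, 0.4], known_datasets)
print(choice)
```

The computational saving comes from skipping training and evaluation of every candidate model on the new dataset; only the similarity lookup is performed.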

Submitted 10 March, 2025; originally announced March 2025.

Comments: Accepted by the AAAI-25 Workshop on Artificial Intelligence for Time Series Analysis (AI4TS)

arXiv:2502.15901 [pdf, other]

TS-OOD: Evaluating Time-Series Out-of-Distribution Detection and Prospective Directions for Progress

Authors: Onat Gungor, Amanda Sofie Rios, Nilesh Ahuja, Tajana Rosing

Abstract: Detecting out-of-distribution (OOD) data is a fundamental challenge in the deployment of machine learning models. From a security standpoint, this is particularly important because OOD test data can result in misleadingly confident yet erroneous predictions, which undermine the reliability of the deployed model. Although numerous models for OOD detection have been developed in computer vision and language, their adaptability to the time-series data domain remains limited and under-explored. Yet, time-series data is ubiquitous across manufacturing and security applications for which OOD is essential. This paper seeks to address this research gap by conducting a comprehensive analysis of modality-agnostic OOD detection algorithms. We evaluate over several multivariate time-series datasets, deep learning architectures, time-series specific data augmentations, and loss functions. Our results demonstrate that: 1) the majority of state-of-the-art OOD methods exhibit limited performance on time-series data, and 2) OOD methods based on deep feature modeling may offer greater advantages for time-series OOD detection, highlighting a promising direction for future time-series OOD detection algorithm development.

Submitted 21 February, 2025; originally announced February 2025.

Comments: Accepted for an oral presentation at AAAI-25 AI4TS

arXiv:2502.15285 [pdf, other]

Offload Rethinking by Cloud Assistance for Efficient Environmental Sound Recognition on LPWANs

Authors: Le Zhang, Quanling Zhao, Run Wang, Shirley Bian, Onat Gungor, Flavio Ponzina, Tajana Rosing

Abstract: Learning-based environmental sound recognition has emerged as a crucial method for ultra-low-power environmental monitoring in biological research and city-scale sensing systems. These systems usually operate under limited resources and are often powered by harvested energy in remote areas. Recent efforts in on-device sound recognition suffer from low accuracy due to resource constraints, whereas cloud offloading strategies are hindered by high communication costs. In this work, we introduce ORCA, a novel resource-efficient cloud-assisted environmental sound recognition system on batteryless devices operating over the Low-Power Wide-Area Networks (LPWANs), targeting wide-area audio sensing applications. We propose a cloud assistance strategy that remedies the low accuracy of on-device inference while minimizing the communication costs for cloud offloading. By leveraging a self-attention-based cloud sub-spectral feature selection method to facilitate efficient on-device inference, ORCA resolves three key challenges for resource-constrained cloud offloading over LPWANs: 1) high communication costs and low data rates, 2) dynamic wireless channel conditions, and 3) unreliable offloading. We implement ORCA on an energy-harvesting batteryless microcontroller and evaluate it in a real world urban sound testbed. Our results show that ORCA outperforms state-of-the-art methods by up to $80 \times$ in energy savings and $220 \times$ in latency reduction while maintaining comparable accuracy.

Submitted 21 March, 2025; v1 submitted 21 February, 2025; originally announced February 2025.

Comments: Accepted by The 23rd ACM Conference on Embedded Networked Sensor Systems (SenSys '25)

arXiv:2502.14094 [pdf, other]

CND-IDS: Continual Novelty Detection for Intrusion Detection Systems

Authors: Sean Fuhrman, Onat Gungor, Tajana Rosing

Abstract: Intrusion detection systems (IDS) play a crucial role in IoT and network security by monitoring system data and alerting to suspicious activities. Machine learning (ML) has emerged as a promising solution for IDS, offering highly accurate intrusion detection. However, ML-IDS solutions often overlook two critical aspects needed to build reliable systems: continually changing data streams and a lack of attack labels. Streaming network traffic and associated cyber attacks are continually changing, which can degrade the performance of deployed ML models. Labeling attack data, such as zero-day attacks, in real-world intrusion scenarios may not be feasible, making the use of ML solutions that do not rely on attack labels necessary. To address both these challenges, we propose CND-IDS, a continual novelty detection IDS framework which consists of (i) a learning-based feature extractor that continuously updates new feature representations of the system data, and (ii) a novelty detector that identifies new cyber attacks by leveraging principal component analysis (PCA) reconstruction. Our results on realistic intrusion datasets show that CND-IDS achieves up to 6.1x F-score improvement, and up to 6.5x improved forward transfer over the SOTA unsupervised continual learning algorithm. Our code will be released upon acceptance.

Submitted 19 February, 2025; originally announced February 2025.

Comments: Accepted by the 62nd Design Automation Conference (DAC 2025)

arXiv:2502.07119 [pdf, other]

SAFE: Self-Supervised Anomaly Detection Framework for Intrusion Detection

Authors: Elvin Li, Zhengli Shang, Onat Gungor, Tajana Rosing

Abstract: The proliferation of IoT devices has significantly increased network vulnerabilities, creating an urgent need for effective Intrusion Detection Systems (IDS). Machine Learning-based IDS (ML-IDS) offer advanced detection capabilities but rely on labeled attack data, which limits their ability to identify unknown threats. Self-Supervised Learning (SSL) presents a promising solution by using only normal data to detect patterns and anomalies. This paper introduces SAFE, a novel framework that transforms tabular network intrusion data into an image-like format, enabling Masked Autoencoders (MAEs) to learn robust representations of network behavior. The features extracted by the MAEs are then incorporated into a lightweight novelty detector, enhancing the effectiveness of anomaly detection. Experimental results demonstrate that SAFE outperforms the state-of-the-art anomaly detection method, Scale Learning-based Deep Anomaly Detection method (SLAD), by up to 26.2% and surpasses the state-of-the-art SSL-based network intrusion detection approach, Anomal-E, by up to 23.5% in F1-score.

Submitted 10 February, 2025; originally announced February 2025.

Comments: Accepted by the AAAI-25 Workshop on Artificial Intelligence for Cyber Security (AICS)

arXiv:2502.02883 [pdf, other]

SensorChat: Answering Qualitative and Quantitative Questions during Long-Term Multimodal Sensor Interactions

Authors: Xiaofan Yu, Lanxiang Hu, Benjamin Reichman, Dylan Chu, Rushil Chandrupatla, Xiyuan Zhang, Larry Heck, Tajana Rosing

Abstract: Natural language interaction with sensing systems is crucial for addressing users' personal concerns and providing health-related insights into their daily lives. When a user asks a question, the system automatically analyzes the full history of sensor data, extracts relevant information, and generates an appropriate response. However, existing systems are limited to short-duration (e.g., one minute) or low-frequency (e.g., daily step count) sensor data. In addition, they struggle with quantitative questions that require precise numerical answers. In this work, we introduce SensorChat, the first end-to-end QA system designed for daily life monitoring using long-duration, high-frequency time series data. Given raw sensor signals spanning multiple days and a user-defined natural language question, SensorChat generates semantically meaningful responses that directly address user concerns. SensorChat effectively handles both quantitative questions that require numerical precision and qualitative questions that require high-level reasoning to infer subjective insights. To achieve this, SensorChat uses an innovative three-stage pipeline including question decomposition, sensor data query, and answer assembly. The first and third stages leverage Large Language Models (LLMs) to interpret human queries and generate responses. The intermediate querying stage extracts relevant information from the complete sensor data history. A real-world implementation demonstrates SensorChat's capability for real-time interactions on a cloud server while also being able to run entirely on edge platforms after quantization. Comprehensive QA evaluations show that SensorChat achieves up to 93% higher answer accuracy than state-of-the-art systems on quantitative questions. Additionally, a user study with eight volunteers highlights SensorChat's effectiveness in answering qualitative and open-ended questions.

Submitted 15 May, 2025; v1 submitted 4 February, 2025; originally announced February 2025.

Comments: Under review

arXiv:2501.04974 [pdf, other]

SensorQA: A Question Answering Benchmark for Daily-Life Monitoring

Authors: Benjamin Reichman, Xiaofan Yu, Lanxiang Hu, Jack Truxal, Atishay Jain, Rushil Chandrupatla, Tajana Šimunić Rosing, Larry Heck

Abstract: With the rapid growth in sensor data, effectively interpreting and interfacing with these data in a human-understandable way has become crucial. While existing research primarily focuses on learning classification models, fewer studies have explored how end users can actively extract useful insights from sensor data, often hindered by the lack of a proper dataset. To address this gap, we introduce SensorQA, the first human-created question-answering (QA) dataset for long-term time-series sensor data for daily life monitoring. SensorQA is created by human workers and includes 5.6K diverse and practical queries that reflect genuine human interests, paired with accurate answers derived from sensor data. We further establish benchmarks for state-of-the-art AI models on this dataset and evaluate their performance on typical edge devices. Our results reveal a gap between current models and optimal QA performance and efficiency, highlighting the need for new contributions. The dataset and code are available at: https://github.com/benjamin-reichman/SensorQA.

Submitted 3 March, 2025; v1 submitted 9 January, 2025; originally announced January 2025.

arXiv:2412.20993 [pdf, other]

Efficiently Scaling LLM Reasoning with Certaindex

Authors: Yichao Fu, Junda Chen, Siqi Zhu, Zheyu Fu, Zhongdongming Dai, Yonghao Zhuang, Yian Ma, Aurick Qiao, Tajana Rosing, Ion Stoica, Hao Zhang

Abstract: Test-time reasoning algorithms such as chain-of-thought, self-consistency, and MCTS enhance LLM problem-solving but can wastefully generate many tokens without improving accuracy. At the same time, we observe that these algorithms exhibit answer stabilization: their intermediate solutions often cease to change after a certain point, and further investment of compute does not change their final answer. To quantify this phenomenon, we introduce Certaindex, an algorithm-agnostic metric measuring this evolving stability, signaling when further computation is unlikely to alter the final result. Certaindex is lightweight, can accelerate reasoning program inference via early exit, and further enables dynamic token allocation, gang scheduling, and many opportunities when integrated with real-world LLM serving systems. To quantify real-world benefits, we built Certaindex as a scheduler into Dynasor, our reasoning-aware LLM serving system, and demonstrate up to 50% compute savings and 3.3x higher throughput in real workloads with no accuracy drop. Our code is available at https://github.com/hao-ai-lab/Dynasor.git
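The answer-stabilization idea in this abstract can be illustrated with a toy early-exit rule. This is not the Certaindex metric, whose definition the abstract does not give; it is a simple stand-in that stops once the intermediate answer has stayed the same for a fixed number of consecutive checks. The function name `steps_until_stable` and the `patience` parameter are hypothetical.

```python
def steps_until_stable(answers, patience=3):
    """Return (steps_used, answer) for a sequence of intermediate answers.

    Assumption: the intermediate answer is probed once per reasoning step,
    and an unchanged answer over `patience` consecutive probes is treated
    as "stabilized". Falls back to the last answer if no streak is reached.
    """
    streak = 0
    previous = object()  # sentinel that equals no real answer
    for step, answer in enumerate(answers, start=1):
        streak = streak + 1 if answer == previous else 1
        previous = answer
        if streak >= patience:
            return step, answer  # early exit: further steps are skipped
    return len(answers), (answers[-1] if answers else None)


# Intermediate answers probed after each reasoning step.
trace = ["12", "15", "14", "14", "14", "14", "14"]
result = steps_until_stable(trace, patience=3)
print(result)  # (steps consumed before exiting, answer returned)
```

Here the loop exits after five of the seven steps, so the last two steps of generation would be saved. A larger `patience` trades compute savings for more confidence that the answer has settled.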

Submitted 27 May, 2025; v1 submitted 30 December, 2024; originally announced December 2024.

arXiv:2412.11242 [pdf, other]

TrimLLM: Progressive Layer Dropping for Domain-Specific LLMs

Authors: Lanxiang Hu, Tajana Rosing, Hao Zhang

Abstract: Specializing large language models (LLMs) for local deployment in domain-specific use cases is necessary for strong performance while meeting latency and privacy constraints. However, conventional task-specific adaptation approaches do not show simultaneous memory saving and inference speedup at deployment time. Practical compression techniques like quantization and pruning require dedicated hardware or kernel support to achieve measured inference speedup. We develop TrimLLM based on the layer-wise specialization phenomenon we empirically observed and verified on contemporary LLMs. TrimLLM reduces the depth of LLMs via progressive layer dropping. We show it retains LLMs' capacity in specific domains and achieves inference speedup irrespective of hardware and deep learning frameworks. We evaluated TrimLLM on LLMs of various sizes for inference; models adapted on medical, legal, and financial datasets all demonstrate $2.1-5.7\times$ inference speedup on consumer GPUs and up to $3.1\times$ speedup on A100 when compared to state-of-the-art model compression algorithms, with no loss in accuracy at 50$\sim$60\% model compression ratio.

Submitted 19 December, 2024; v1 submitted 15 December, 2024; originally announced December 2024.

arXiv:2411.09760 [pdf, other]

SpecPCM: A Low-power PCM-based In-Memory Computing Accelerator for Full-stack Mass Spectrometry Analysis

Authors: Keming Fan, Ashkan Moradifirouzabadi, Xiangjin Wu, Zheyu Li, Flavio Ponzina, Anton Persson, Eric Pop, Tajana Rosing, Mingu Kang

Abstract: Mass spectrometry (MS) is essential for proteomics and metabolomics but faces impending challenges in efficiently processing the vast volumes of data. This paper introduces SpecPCM, an in-memory computing (IMC) accelerator designed to achieve substantial improvements in energy and delay efficiency for both MS spectral clustering and database (DB) search. SpecPCM employs analog processing with low-voltage swing and utilizes recently introduced phase change memory (PCM) devices based on superlattice materials, optimized for low-voltage and low-power programming. Our approach integrates contributions across multiple levels: application, algorithm, circuit, device, and instruction sets. We leverage a robust hyperdimensional computing (HD) algorithm with a novel dimension-packing method and develop specialized hardware for the end-to-end MS pipeline to overcome the non-ideal behavior of PCM devices. We further optimize multi-level PCM devices for different tasks by using different materials. We also perform a comprehensive design exploration to improve energy and delay efficiency while maintaining accuracy, exploring various combinations of hardware and software parameters controlled by the instruction set architecture (ISA). SpecPCM, with up to three bits per cell, achieves speedups of up to 82x and 143x for MS clustering and DB search tasks, respectively, along with a four-orders-of-magnitude improvement in energy efficiency compared with state-of-the-art CPU/GPU tools.

Submitted 14 November, 2024; originally announced November 2024.

arXiv:2411.02814 [pdf, other]

The Hitchhiker's Guide to Programming and Optimizing CXL-Based Heterogeneous Systems

Authors: Zixuan Wang, Suyash Mahar, Luyi Li, Jangseon Park, Jinpyo Kim, Theodore Michailidis, Yue Pan, Tajana Rosing, Dean Tullsen, Steven Swanson, Kyung Chang Ryoo, Sungjoo Park, Jishen Zhao

Abstract: We present a thorough analysis of the use of CXL-based heterogeneous systems. We built a cluster of server systems that combines different vendors' CPUs and various types of CXL devices. We further developed a heterogeneous memory benchmark suite, Heimdall, to profile the performance of such heterogeneous systems. By leveraging Heimdall, we unveiled the detailed architecture design in these systems, drew observations on optimizing performance for workloads, and pointed out directions for future development of CXL-based heterogeneous systems.

Submitted 5 November, 2024; originally announced November 2024.

arXiv:2410.15179 [pdf, other]

HPVM-HDC: A Heterogeneous Programming System for Accelerating Hyperdimensional Computing

Authors: Russel Arbore, Xavier Routh, Abdul Rafae Noor, Akash Kothari, Haichao Yang, Weihong Xu, Sumukh Pinge, Vikram Adve, Tajana Rosing, Minxuan Zhou

Abstract: Hyperdimensional Computing (HDC), a technique inspired by cognitive models of computation, has been proposed as an efficient and robust alternative basis for machine learning. HDC programs are often manually written in low-level, target-specific languages for CPUs, GPUs, and FPGAs; such code cannot be easily retargeted onto HDC-specific accelerators. No previous programming system enables productive development of HDC programs and generates efficient code for several hardware targets. We propose a heterogeneous programming system for HDC: a novel programming language, HDC++, for writing applications using a unified programming model, including HDC-specific primitives to improve programmability, and a heterogeneous compiler, HPVM-HDC, that provides an intermediate representation for compiling HDC programs to many hardware targets. We implement two tuning optimizations, automatic binarization and reduction perforation, that exploit the error-resilient nature of HDC. Our evaluation shows that HPVM-HDC generates performance-competitive code for CPUs and GPUs, achieving a geomean speed-up of 1.17x over optimized baseline CUDA implementations with a geomean reduction in total lines of code of 1.6x across CPUs and GPUs. Additionally, HPVM-HDC targets an HDC Digital ASIC and an HDC ReRAM accelerator simulator, enabling the first execution of HDC applications on these devices.

Submitted 1 December, 2024; v1 submitted 19 October, 2024; originally announced October 2024.

arXiv:2409.13361 [pdf, other]

RapidOMS: FPGA-based Open Modification Spectral Library Searching with HD Computing

Authors: Sumukh Pinge, Weihong Xu, Wout Bittremieux, Niema Moshiri, Sang-Woo Jun, Tajana Rosing

Abstract: Mass spectrometry (MS) is essential for protein analysis but faces significant challenges with large datasets and complex post-translational modifications, resulting in difficulties in spectral identification. Open Modification Search (OMS) improves the analysis of these modifications. We present RapidOMS, a solution leveraging the Samsung SmartSSD, which integrates SSD and FPGA in a near-storage configuration to minimize data movement and enhance the efficiency of large-scale database searching. RapidOMS employs hyperdimensional computing (HDC), a brain-inspired, high-dimensional data processing approach, exploiting the parallel processing and low-latency capabilities of FPGAs, making it well-suited for MS. Utilizing the parallelism and efficiency of bitwise operations in HDC, RapidOMS delivers up to a 60x speedup over the state-of-the-art (SOTA) CPU tool ANN-Solo and is 2.72x faster than the GPU tool HyperOMS. Furthermore, RapidOMS achieves an 11x improvement in energy efficiency compared to conventional systems, providing scalable, energy-efficient solutions for large-scale proteomics applications and advancing the efficient processing of proteomic data.

Submitted 20 September, 2024; originally announced September 2024.

arXiv:2409.10918 [pdf, other]

FSL-HDnn: A 5.7 TOPS/W End-to-end Few-shot Learning Classifier Accelerator with Feature Extraction and Hyperdimensional Computing

Authors: Haichao Yang, Chang Eun Song, Weihong Xu, Behnam Khaleghi, Uday Mallappa, Monil Shah, Keming Fan, Mingu Kang, Tajana Rosing

Abstract: This paper introduces FSL-HDnn, an energy-efficient accelerator that implements the end-to-end pipeline of feature extraction, classification, and on-chip few-shot learning (FSL) through gradient-free learning techniques in a 40 nm CMOS process. At its core, FSL-HDnn integrates two low-power modules: a weight clustering feature extractor and Hyperdimensional Computing (HDC). The feature extractor utilizes advanced weight clustering and pattern reuse strategies for optimized CNN-based feature extraction. Meanwhile, HDC emerges as a novel approach for a lightweight FSL classifier, employing hyperdimensional vectors to improve training accuracy significantly compared to traditional distance-based approaches. This dual-module synergy not only simplifies the learning process by eliminating the need for complex gradients but also dramatically enhances energy efficiency and performance. Specifically, FSL-HDnn achieves an unprecedented energy efficiency of 5.7 TOPS/W for feature extraction and 0.78 TOPS/W for classification and learning phases, achieving improvements of 2.6X and 6.6X, respectively, over current state-of-the-art CNN and FSL processors.

Submitted 17 September, 2024; originally announced September 2024.

Comments: 4 pages, 12 figures, ESSERC 2024

arXiv:2409.08369 [pdf, other]

E-QUARTIC: Energy Efficient Edge Ensemble of Convolutional Neural Networks for Resource-Optimized Learning

Authors: Le Zhang, Onat Gungor, Flavio Ponzina, Tajana Rosing

Abstract: Ensemble learning is a meta-learning approach that combines the predictions of multiple learners, demonstrating improved accuracy and robustness. Nevertheless, ensembling models like Convolutional Neural Networks (CNNs) results in high memory and computing overhead, preventing their deployment in embedded systems. These devices are usually equipped with small batteries that provide power supply and might include energy-harvesting modules that extract energy from the environment. In this work, we propose E-QUARTIC, a novel Energy Efficient Edge Ensembling framework to build ensembles of CNNs targeting Artificial Intelligence (AI)-based embedded systems. Our design outperforms single-instance CNN baselines and state-of-the-art edge AI solutions, improving accuracy and adapting to varying energy conditions while maintaining similar memory requirements. Then, we leverage the multi-CNN structure of the designed ensemble to implement an energy-aware model selection policy in energy-harvesting AI systems. We show that our solution outperforms the state-of-the-art by reducing system failure rate by up to 40% while ensuring higher average output qualities. Ultimately, we show that the proposed design enables concurrent on-device training and high-quality inference execution at the edge, limiting the performance and energy overheads to less than 0.04%.

Submitted 12 September, 2024; originally announced September 2024.

Comments: Accepted by the 30th Asia and South Pacific Design Automation Conference (ASP-DAC 2025)

arXiv:2407.00604 [pdf, other]

Fast-OverlaPIM: A Fast Overlap-driven Mapping Framework for Processing In-Memory Neural Network Acceleration

Authors: Xuan Wang, Minxuan Zhou, Tajana Rosing

Abstract: Processing in-memory (PIM) is promising to accelerate neural networks (NNs) because it minimizes data movement and provides large computational parallelism. Similar to machine learning accelerators, application mapping, which determines the operation scheduling and data layout, plays a critical role in the NN acceleration on PIM. The mapping optimization of previous NN accelerators focused on optimizing the latency of sequential execution. However, PIM accelerators feature a distinct design space of application mapping from conventional NN accelerators, due to the spatial execution of NN layers across different memory locations. This enables opportunities for overlapping execution of consecutive NN layers to improve the latency, where the succeeding layer can start execution before the preceding layer fully completes the computation. In this paper, we propose the Fast-OverlaPIM framework, which incorporates the computational overlapping optimization into the DNN mapping exploration process on PIM architectures. Fast-OverlaPIM includes analytical algorithms for fast and accurate overlap analysis. Furthermore, it proposes a novel mapping search strategy and a transformation mechanism to enable efficient design space exploration on the overlap-based mapping for the whole network. Our framework demonstrates a significant improvement in runtime performance from 3.4x to 323.1x compared to the previous state-of-the-art overlap-based framework. Our experiments show that Fast-OverlaPIM can efficiently produce mappings that are 4.6x to 18.1x faster than the state-of-the-art mapping optimization framework under the same architecture constraints.

Submitted 30 June, 2024; originally announced July 2024.

Comments: This work is accepted by IEEE TCAD

arXiv:2405.02756 [pdf, other]

Efficient Open Modification Spectral Library Searching in High-Dimensional Space with Multi-Level-Cell Memory

Authors: Keming Fan, Wei-Chen Chen, Sumukh Pinge, H. -S. Philip Wong, Tajana Rosing

Abstract: Open Modification Search (OMS) is a promising algorithm for mass spectrometry analysis that enables the discovery of modified peptides. However, OMS encounters challenges as it exponentially extends the search scope. Existing OMS accelerators either have limited parallelism or struggle to scale effectively with growing data volumes. In this work, we introduce an OMS accelerator utilizing multi-level-cell (MLC) RRAM memory to enhance storage capacity by 3x. Through in-memory computing, we achieve up to 77x faster data processing with two to three orders of magnitude better energy efficiency. Testing was done on a fabricated MLC RRAM chip. We leverage hyperdimensional computing to tolerate up to 10% memory errors while delivering massive parallelism in hardware.

Submitted 4 May, 2024; originally announced May 2024.

Comments: Accepted by DAC'24

arXiv:2404.00039 [pdf, other]

MicroHD: An Accuracy-Driven Optimization of Hyperdimensional Computing Algorithms for TinyML systems

Authors: Flavio Ponzina, Tajana Rosing

Abstract: Hyperdimensional computing (HDC) is emerging as a promising AI approach that can effectively target TinyML applications thanks to its lightweight computing and memory requirements. Previous works on HDC showed that limiting the standard 10k dimensions of the hyperdimensional space to much lower values is possible, reducing even more HDC resource requirements. Similarly, other studies demonstrated that binary values can be used as elements of the generated hypervectors, leading to significant efficiency gains at the cost of some degree of accuracy degradation. Nevertheless, current optimization attempts do not concurrently co-optimize HDC hyper-parameters, and accuracy degradation is not directly controlled, resulting in sub-optimal HDC models providing several applications with unacceptable output qualities. In this work, we propose MicroHD, a novel accuracy-driven HDC optimization approach that iteratively tunes HDC hyper-parameters, reducing memory and computing requirements while ensuring user-defined accuracy levels. The proposed method can be applied to HDC implementations using different encoding functions, demonstrates good scalability for larger HDC workloads, and achieves compression and efficiency gains up to 200x when compared to baseline implementations for accuracy degradations lower than 1%.

Submitted 23 March, 2024; originally announced April 2024.

Comments: Accepted as a full paper by the tinyML Research Symposium 2024

arXiv:2403.04759 [pdf, other]

Lifelong Intelligence Beyond the Edge using Hyperdimensional Computing

Authors: Xiaofan Yu, Anthony Thomas, Ivannia Gomez Moreno, Louis Gutierrez, Tajana Rosing

Abstract: On-device learning has emerged as a prevailing trend that avoids the slow response time and costly communication of cloud-based learning. The ability to learn continuously and indefinitely in a changing environment, and with resource constraints, is critical for real sensor deployments. However, existing designs are inadequate for practical scenarios with (i) streaming data input, (ii) lack of supervision and (iii) limited on-board resources. In this paper, we design and deploy the first on-device lifelong learning system called LifeHD for general IoT applications with limited supervision. LifeHD is designed based on a novel neurally-inspired and lightweight learning paradigm called Hyperdimensional Computing (HDC). We utilize a two-tier associative memory organization to intelligently store and manage high-dimensional, low-precision vectors, which represent the historical patterns as cluster centroids. We additionally propose two variants of LifeHD to cope with scarce labeled inputs and power constraints. We implement LifeHD on off-the-shelf edge platforms and perform extensive evaluations across three scenarios. Our measurements show that LifeHD improves the unsupervised clustering accuracy by up to 74.8% compared to the state-of-the-art NN-based unsupervised lifelong learning baselines with as much as 34.3x better energy efficiency. Our code is available at https://github.com/Orienfish/LifeHD.

Submitted 7 March, 2024; originally announced March 2024.

Comments: Accepted by IPSN'24

arXiv:2312.15966 [pdf, other]

Federated Hyperdimensional Computing

Authors: Kazim Ergun, Rishikanth Chandrasekaran, Tajana Rosing

Abstract: Federated learning (FL) enables a loose set of participating clients to collaboratively learn a global model via coordination by a central server and with no need for data sharing. Existing FL approaches that rely on complex algorithms with massive models, such as deep neural networks (DNNs), suffer from computation and communication bottlenecks. In this paper, we first propose FedHDC, a federated learning framework based on hyperdimensional computing (HDC). FedHDC allows for fast and light-weight local training on clients, provides robust learning, and has smaller model communication overhead compared to learning with DNNs. However, current HDC algorithms get poor accuracy when classifying larger and more complex images, such as CIFAR10. To address this issue, we design FHDnn, which complements FedHDC with a self-supervised contrastive learning feature extractor. We avoid the transmission of the DNN and instead train only the HDC learner in a federated manner, which accelerates learning, reduces transmission cost, and utilizes the robustness of HDC to tackle network errors. We present a formal analysis of the algorithm, derive its convergence rate theoretically, and show experimentally that FHDnn converges 3$\times$ faster vs. DNNs. The strategies we propose to improve the communication efficiency enable our design to reduce communication costs by 66$\times$ vs. DNNs, local client compute and energy consumption by ~1.5 - 6$\times$, while being highly robust to network errors. Finally, our proposed strategies for improving the communication efficiency have up to 32$\times$ lower communication costs with good accuracy.

Submitted 26 December, 2023; originally announced December 2023.

Comments: Submitted for publication, 20 pages

arXiv:2312.04257 [pdf, other]

Proxima: Near-storage Acceleration for Graph-based Approximate Nearest Neighbor Search in 3D NAND

Authors: Weihong Xu, Junwei Chen, Po-Kai Hsu, Jaeyoung Kang, Minxuan Zhou, Sumukh Pinge, Shimeng Yu, Tajana Rosing

Abstract: Approximate nearest neighbor search (ANNS) plays an indispensable role in a wide variety of applications, including recommendation systems, information retrieval, and semantic search. Among the cutting-edge ANNS algorithms, graph-based approaches provide superior accuracy and scalability on massive datasets. However, the best-performing graph-based ANN search solutions incur tens of hundreds of memory footprints as well as costly distance computation, thus hindering their efficient deployment at scale. The 3D NAND flash is emerging as a promising device for data-intensive applications due to its high density and nonvolatility. In this work, we present the near-storage processing (NSP)-based ANNS solution Proxima, to accelerate graph-based ANNS with algorithm-hardware co-design in 3D NAND flash. Proxima significantly reduces the complexity of graph search by leveraging the distance approximation and early termination. On top of the algorithmic enhancement, we implement Proxima search algorithm in 3D NAND flash using the heterogeneous integration technique. To maximize 3D NAND's bandwidth utilization, we present customized dataflow and optimized data allocation scheme. Our evaluation results show that: compared to graph ANNS on CPU and GPU, Proxima achieves a magnitude improvement in throughput or energy efficiency. Proxima yields 7x to 13x speedup over existing ASIC designs. Furthermore, Proxima achieves a good balance between accuracy, efficiency and storage density compared to previous NSP-based accelerators.

Submitted 7 December, 2023; originally announced December 2023.

arXiv:2311.16293 [pdf, other]

FHEmem: A Processing In-Memory Accelerator for Fully Homomorphic Encryption

Authors: Minxuan Zhou, Yujin Nam, Pranav Gangwar, Weihong Xu, Arpan Dutta, Kartikeyan Subramanyam, Chris Wilkerson, Rosario Cammarota, Saransh Gupta, Tajana Rosing

Abstract: Fully Homomorphic Encryption (FHE) is a technique that allows arbitrary computations to be performed on encrypted data without the need for decryption, making it ideal for securing many emerging applications. However, FHE computation is significantly slower than computation on plain data due to the increase in data size after encryption. Processing In-Memory (PIM) is a promising technology that can accelerate data-intensive workloads with extensive parallelism. However, FHE is challenging for PIM acceleration due to the long-bitwidth multiplications and complex data movements involved. We propose a PIM-based FHE accelerator, FHEmem, which exploits a novel processing in-memory architecture to achieve high-throughput and efficient acceleration for FHE. We propose an optimized end-to-end processing flow, from low-level hardware processing to high-level application mapping, that fully exploits the high throughput of FHEmem hardware. Our evaluation shows FHEmem achieves significant speedup and efficiency improvement over state-of-the-art FHE accelerators.

Submitted 27 November, 2023; originally announced November 2023.

arXiv:2311.12874 [pdf, other]

SpecHD: Hyperdimensional Computing Framework for FPGA-based Mass Spectrometry Clustering

Authors: Sumukh Pinge, Weihong Xu, Jaeyoung Kang, Tianqi Zhang, Neima Moshiri, Wout Bittremieux, Tajana Rosing

Abstract: Mass spectrometry-based proteomics is a key enabler for personalized healthcare, providing a deep dive into the complex protein compositions of biological systems. This technology has vast applications in biotechnology and biomedicine but faces significant computational bottlenecks. Current methodologies often require multiple hours or even days to process extensive datasets, particularly in the domain of spectral clustering. To tackle these inefficiencies, we introduce SpecHD, a hyperdimensional computing (HDC) framework supplemented by an FPGA-accelerated architecture with integrated near-storage preprocessing. Utilizing streamlined binary operations in an HDC environment, SpecHD capitalizes on the low-latency and parallel capabilities of FPGAs. This approach markedly improves clustering speed and efficiency, serving as a catalyst for real-time, high-throughput data analysis in future healthcare applications. Our evaluations demonstrate that SpecHD not only maintains but often surpasses existing clustering quality metrics while drastically cutting computational time. Specifically, it can cluster a large-scale human proteome dataset, comprising 25 million MS/MS spectra and 131 GB of MS data, in just 5 minutes. With energy efficiency exceeding 31x and a speedup factor that spans a range of 6x to 54x over existing state-of-the-art solutions, SpecHD emerges as a promising solution for the rapid analysis of mass spectrometry data with great implications for personalized healthcare.

Submitted 20 November, 2023; originally announced November 2023.

arXiv:2305.07205 [pdf, other]

Mem-Rec: Memory Efficient Recommendation System using Alternative Representation

Authors: Gopi Krishna Jha, Anthony Thomas, Nilesh Jain, Sameh Gobriel, Tajana Rosing, Ravi Iyer

Abstract: Deep learning-based recommendation systems (e.g., DLRMs) are widely used AI models to provide high-quality personalized recommendations. Training data used for modern recommendation systems commonly includes categorical features taking on tens-of-millions of possible distinct values. These categorical tokens are typically assigned learned vector representations, that are stored in large embedding tables, on the order of 100s of GB. Storing and accessing these tables represent a substantial burden in commercial deployments. Our work proposes MEM-REC, a novel alternative representation approach for embedding tables. MEM-REC leverages bloom filters and hashing methods to encode categorical features using two cache-friendly embedding tables. The first table (token embedding) contains raw embeddings (i.e. learned vector representation), and the second table (weight embedding), which is much smaller, contains weights to scale these raw embeddings to provide better discriminative capability to each data point. We provide a detailed architecture, design and analysis of MEM-REC addressing trade-offs in accuracy and computation requirements, in comparison with state-of-the-art techniques. We show that MEM-REC can not only maintain the recommendation quality and significantly reduce the memory footprint for commercial scale recommendation models but can also improve the embedding latency. In particular, based on our results, MEM-REC compresses the MLPerf CriteoTB benchmark DLRM model size by 2900x and performs up to 3.4x faster embeddings while achieving the same AUC as that of the full uncompressed model.

Submitted 14 May, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

arXiv:2303.15604 [pdf, other]

HD-Bind: Encoding of Molecular Structure with Low Precision, Hyperdimensional Binary Representations

Authors: Derek Jones, Jonathan E. Allen, Xiaohua Zhang, Behnam Khaleghi, Jaeyoung Kang, Weihong Xu, Niema Moshiri, Tajana S. Rosing

Abstract: Publicly available collections of drug-like molecules have grown to comprise 10s of billions of possibilities in recent history due to advances in chemical synthesis. Traditional methods for identifying "hit" molecules from a large collection of potential drug-like candidates have relied on biophysical theory to compute approximations to the Gibbs free energy of the binding interaction between the drug and its protein target. A major drawback of these approaches is that they require exceptional computing capabilities even for relatively small collections of molecules. Hyperdimensional Computing (HDC) is a recently proposed learning paradigm that is able to leverage low-precision binary vector arithmetic to build efficient representations of the data that can be obtained without the need for gradient-based optimization approaches that are required in many conventional machine learning and deep learning approaches. This algorithmic simplicity allows for acceleration in hardware that has been previously demonstrated for a range of application areas. We consider existing HDC approaches for molecular property classification and introduce two novel encoding algorithms that leverage the extended connectivity fingerprint (ECFP) algorithm. We show that HDC-based inference methods are as much as 90 times more efficient than more complex representative machine learning methods and achieve an acceleration of nearly 9 orders of magnitude as compared to inference with molecular docking. We demonstrate multiple approaches for the encoding of molecular data for HDC and examine their relative performance on a range of challenging molecular property prediction and drug-protein binding classification tasks. Our work thus motivates further investigation into molecular representation learning to develop ultra-efficient pre-screening tools.

Submitted 27 March, 2023; originally announced March 2023.

arXiv:2301.09740 [pdf, other]

DODEM: DOuble DEfense Mechanism Against Adversarial Attacks Towards Secure Industrial Internet of Things Analytics

Authors: Onat Gungor, Tajana Rosing, Baris Aksanli

Abstract: Industrial Internet of Things (I-IoT) is a collaboration of devices, sensors, and networking equipment to monitor and collect data from industrial operations. Machine learning (ML) methods use this data to make high-level decisions with minimal human intervention. Data-driven predictive maintenance (PDM) is a crucial ML-based I-IoT application to find an optimal maintenance schedule for industrial assets. The performance of these ML methods can seriously be threatened by adversarial attacks where an adversary crafts perturbed data and sends it to the ML model to deteriorate its prediction performance. The models should be able to stay robust against these attacks where robustness is measured by how much perturbation in input data affects model performance. Hence, there is a need for effective defense mechanisms that can protect these models against adversarial attacks. In this work, we propose a double defense mechanism to detect and mitigate adversarial attacks in I-IoT environments. We first detect if there is an adversarial attack on a given sample using novelty detection algorithms. Then, based on the outcome of our algorithm, marking an instance as attack or normal, we select adversarial retraining or standard training to provide a secondary defense layer. If there is an attack, adversarial retraining provides a more robust model, while we apply standard training for regular samples. Since we may not know if an attack will take place, our adaptive mechanism allows us to consider irregular changes in data. The results show that our double defense strategy is highly efficient where we can improve model robustness by up to 64.6% and 52% compared to standard and adversarial retraining, respectively.

Submitted 23 January, 2023; originally announced January 2023.

arXiv:2301.06646 [pdf, other]

doi 10.1145/3576842.3582377

Async-HFL: Efficient and Robust Asynchronous Federated Learning in Hierarchical IoT Networks

Authors: Xiaofan Yu, Ludmila Cherkasova, Harsh Vardhan, Quanling Zhao, Emily Ekaireb, Xiyuan Zhang, Arya Mazumdar, Tajana Rosing

Abstract: Federated Learning (FL) has gained increasing interest in recent years as a distributed on-device learning paradigm. However, multiple challenges remain to be addressed for deploying FL in real-world Internet-of-Things (IoT) networks with hierarchies. Although existing works have proposed various approaches to account for data heterogeneity, system heterogeneity, unexpected stragglers and scalability, none of them provides a systematic solution to address all of the challenges in a hierarchical and unreliable IoT network. In this paper, we propose an asynchronous and hierarchical framework (Async-HFL) for performing FL in a common three-tier IoT network architecture. In response to the largely varied delays, Async-HFL employs asynchronous aggregations at both the gateway and the cloud levels, thus avoiding long waiting times. To fully unleash the potential of Async-HFL in converging speed under system heterogeneities and stragglers, we design device selection at the gateway level and device-gateway association at the cloud level. Device selection chooses edge devices to trigger local training in real-time while device-gateway association determines the network topology periodically after several cloud epochs, both satisfying bandwidth limitation. We evaluate Async-HFL's convergence speedup using large-scale simulations based on ns-3 and a network topology from NYCMesh. Our results show that Async-HFL converges 1.08-1.31x faster in wall-clock time and saves up to 21.6% total communication cost compared to state-of-the-art asynchronous FL algorithms (with client selection). We further validate Async-HFL on a physical deployment and observe robust convergence under unexpected stragglers.

Submitted 10 April, 2023; v1 submitted 16 January, 2023; originally announced January 2023.

Comments: Accepted by IoTDI'23

arXiv:2211.16422 [pdf, other]

Massively Parallel Open Modification Spectral Library Searching with Hyperdimensional Computing

Authors: Jaeyoung Kang, Weihong Xu, Wout Bittremieux, Tajana Rosing

Abstract: Mass spectrometry, commonly used for protein identification, generates a massive number of spectra that need to be matched against a large database. In reality, most of them remain unidentified or mismatched due to unexpected post-translational modifications. Open modification search (OMS) has been proposed as a strategy to improve the identification rate by considering every possible change in spectra, but it expands the search space exponentially. In this work, we propose HyperOMS, which redesigns OMS based on hyperdimensional computing to cope with such challenges. Unlike existing algorithms that represent spectral data with floating point numbers, HyperOMS encodes them with high dimensional binary vectors and performs the efficient OMS in high-dimensional space. With the massive parallelism and simple boolean operations, HyperOMS can be efficiently handled on parallel computing platforms. Experimental results show that HyperOMS on GPU is up to $17\times$ faster and $6.4\times$ more energy efficient than the state-of-the-art GPU-based OMS tool while providing comparable search quality to competing search tools.

Submitted 31 December, 2022; v1 submitted 15 November, 2022; originally announced November 2022.

Comments: 6 pages, 7 figures, extension of PACT 2022 paper

arXiv:2211.05733 [pdf, other]

doi 10.1109/TCAD.2023.3239537

RAPIDx: High-performance ReRAM Processing in-Memory Accelerator for Sequence Alignment

Authors: Weihong Xu, Saransh Gupta, Niema Moshiri, Tajana Rosing

Abstract: Genome sequence alignment is the core of many biological applications. The advancement of sequencing technologies produces a tremendous amount of data, making sequence alignment a critical bottleneck in bioinformatics analysis. The existing hardware accelerators for alignment suffer from limited on-chip memory, costly data movement, and poorly optimized alignment algorithms. They cannot afford to concurrently process the massive amount of data generated by sequencing machines. In this paper, we propose a ReRAM-based accelerator, RAPIDx, using processing in-memory (PIM) for sequence alignment. RAPIDx achieves superior efficiency and performance via software-hardware co-design. First, we propose an adaptive banded parallelism alignment algorithm suitable for PIM architecture. Compared to the original dynamic programming-based alignment, the proposed algorithm significantly reduces the required complexity, data bit width, and memory footprint at the cost of negligible accuracy degradation. Then we propose the efficient PIM architecture that implements the proposed algorithm. The data flow in RAPIDx achieves four-level parallelism and we design an in-situ alignment computation flow in ReRAM, delivering $5.5$-$9.7\times$ efficiency and throughput improvements compared to our previous PIM design, RAPID. The proposed RAPIDx is reconfigurable to serve as a co-processor integrated into existing genome analysis pipeline to boost sequence alignment or edit distance calculation. On short-read alignment, RAPIDx delivers $131.1\times$ and $46.8\times$ throughput improvements over state-of-the-art CPU and GPU libraries, respectively. As compared to ASIC accelerators for long-read alignment, the performance of RAPIDx is $1.8$-$2.9\times$ higher.

Submitted 24 January, 2023; v1 submitted 10 November, 2022; originally announced November 2022.

arXiv:2209.09868 [pdf, other]

Streaming Encoding Algorithms for Scalable Hyperdimensional Computing

Authors: Anthony Thomas, Behnam Khaleghi, Gopi Krishna Jha, Sanjoy Dasgupta, Nageen Himayat, Ravi Iyer, Nilesh Jain, Tajana Rosing

Abstract: Hyperdimensional computing (HDC) is a paradigm for data representation and learning originating in computational neuroscience. HDC represents data as high-dimensional, low-precision vectors which can be used for a variety of information processing tasks like learning or recall. The mapping to high-dimensional space is a fundamental problem in HDC, and existing methods encounter scalability issues when the input data itself is high-dimensional. In this work, we explore a family of streaming encoding techniques based on hashing. We show formally that these methods enjoy comparable guarantees on performance for learning applications while being substantially more efficient than existing alternatives. We validate these results experimentally on a popular high-dimensional classification problem and show that our approach easily scales to very large data sets.

Submitted 8 February, 2023; v1 submitted 20 September, 2022; originally announced September 2022.

arXiv:2208.11266 [pdf, other]

SCALE: Online Self-Supervised Lifelong Learning without Prior Knowledge

Authors: Xiaofan Yu, Yunhui Guo, Sicun Gao, Tajana Rosing

Abstract: Unsupervised lifelong learning refers to the ability to learn over time while memorizing previous patterns without supervision. Although great progress has been made in this direction, existing work often assumes strong prior knowledge about the incoming data (e.g., knowing the class boundaries), which can be impossible to obtain in complex and unpredictable environments. In this paper, motivated by real-world scenarios, we propose a more practical problem setting called online self-supervised lifelong learning without prior knowledge. The proposed setting is challenging due to the non-iid and single-pass data, the absence of external supervision, and no prior knowledge. To address the challenges, we propose Self-Supervised ContrAstive Lifelong LEarning without Prior Knowledge (SCALE) which can extract and memorize representations on the fly purely from the data continuum. SCALE is designed around three major components: a pseudo-supervised contrastive loss, a self-supervised forgetting loss, and an online memory update for uniform subset selection. All three components are designed to work collaboratively to maximize learning performance. We perform comprehensive experiments of SCALE under iid and four non-iid data streams. The results show that SCALE outperforms the state-of-the-art algorithm in all settings with improvements up to 3.83%, 2.77% and 5.86% in terms of kNN accuracy on CIFAR-10, CIFAR-100, and TinyImageNet datasets.

Submitted 10 April, 2023; v1 submitted 23 August, 2022; originally announced August 2022.

Comments: Accepted by CLVision'23

arXiv:2204.12557 [pdf]

MemFHE: End-to-End Computing with Fully Homomorphic Encryption in Memory

Authors: Saransh Gupta, Rosario Cammarota, Tajana Rosing

Abstract: The increasing amount of data and the growing complexity of problems has resulted in an ever-growing reliance on cloud computing. However, many applications, most notably in healthcare, finance or defense, demand security and privacy which today's solutions cannot fully address. Fully homomorphic encryption (FHE) elevates the bar of today's solutions by adding confidentiality of data during processing. It allows computation on fully encrypted data without the need for decryption, thus fully preserving privacy. To enable processing encrypted data at usable levels of classic security, e.g., 128-bit, the encryption procedure introduces noticeable data size expansion - the ciphertext is much bigger than the native aggregate of native data types. In this paper, we present MemFHE which is the first accelerator of both client and server for the latest Ring-GSW (Gentry, Sahai, and Waters) based homomorphic encryption schemes using Processing In Memory (PIM). PIM alleviates the data movement issues with large FHE encrypted data, while providing in-situ execution and extensive parallelism needed for FHE's polynomial operations. While the client-PIM can homomorphically encrypt and decrypt data, the server-PIM can process homomorphically encrypted data without decryption. MemFHE's server-PIM is pipelined and is designed to provide flexible bootstrapping, allowing two encryption techniques and various FHE security-levels based on the application requirements. We evaluate MemFHE for various security-levels and compare it with state-of-the-art CPU implementations for Ring-GSW based FHE. MemFHE is up to 20kx (265x) faster than CPU (GPU) for FHE arithmetic operations and provides on average 2007x higher throughput than the state-of-the-art while implementing neural networks with FHE.

Submitted 26 April, 2022; originally announced April 2022.

arXiv:2203.08148 [pdf, other]

RES-HD: Resilient Intelligent Fault Diagnosis Against Adversarial Attacks Using Hyper-Dimensional Computing

Authors: Onat Gungor, Tajana Rosing, Baris Aksanli

Abstract: Industrial Internet of Things (I-IoT) enables fully automated production systems by continuously monitoring devices and analyzing collected data. Machine learning methods are commonly utilized for data analytics in such systems. Cyber-attacks are a grave threat to I-IoT as they can manipulate legitimate inputs, corrupting ML predictions and causing disruptions in the production systems. Hyper-dimensional computing (HDC) is a brain-inspired machine learning method that has been shown to be sufficiently accurate while being extremely robust, fast, and energy-efficient. In this work, we use HDC for intelligent fault diagnosis against different adversarial attacks. Our black-box adversarial attacks first train a substitute model and create perturbed test instances using this trained model. These examples are then transferred to the target models. The change in the classification accuracy is measured as the difference before and after the attacks. This change measures the resiliency of a learning method. Our experiments show that HDC leads to a more resilient and lightweight learning solution than the state-of-the-art deep learning methods. HDC has up to 67.5% higher resiliency compared to the state-of-the-art methods while being up to 25.1% faster to train.

Submitted 14 March, 2022; originally announced March 2022.

arXiv:2010.07426 [pdf, ps, other]

doi 10.1613/jair.1.12664

A Theoretical Perspective on Hyperdimensional Computing

Authors: Anthony Thomas, Sanjoy Dasgupta, Tajana Rosing

Abstract: Hyperdimensional (HD) computing is a set of neurally inspired methods for obtaining high-dimensional, low-precision, distributed representations of data. These representations can be combined with simple, neurally plausible algorithms to effect a variety of information processing tasks. HD computing has recently garnered significant interest from the computer hardware community as an energy-efficient, low-latency, and noise-robust tool for solving learning problems. In this review, we present a unified treatment of the theoretical foundations of HD computing with a focus on the suitability of representations for learning.

Submitted 17 February, 2022; v1 submitted 14 October, 2020; originally announced October 2020.

Comments: Updates with published version

Journal ref: Journal of Artificial Intelligence Research 72 (2021): 215-249

arXiv:2008.04449 [pdf, ps, other]

Trustworthy AI Inference Systems: An Industry Research View

Authors: Rosario Cammarota, Matthias Schunter, Anand Rajan, Fabian Boemer, Ágnes Kiss, Amos Treiber, Christian Weinert, Thomas Schneider, Emmanuel Stapf, Ahmad-Reza Sadeghi, Daniel Demmler, Joshua Stock, Huili Chen, Siam Umar Hussain, Sadegh Riazi, Farinaz Koushanfar, Saransh Gupta, Tajan Simunic Rosing, Kamalika Chaudhuri, Hamid Nejatollahi, Nikil Dutt, Mohsen Imani, Kim Laine, Anuj Dubey, Aydin Aysu, et al. (4 additional authors not shown)

Abstract: In this work, we provide an industry research view for approaching the design, deployment, and operation of trustworthy Artificial Intelligence (AI) inference systems. Such systems provide customers with timely, informed, and customized inferences to aid their decision, while at the same time utilizing appropriate security protection mechanisms for AI models. Additionally, such systems should also use Privacy-Enhancing Technologies (PETs) to protect customers' data at any time. To approach the subject, we start by introducing current trends in AI inference systems. We continue by elaborating on the relationship between Intellectual Property (IP) and private data protection in such systems. Regarding the protection mechanisms, we survey the security and privacy building blocks instrumental in designing, building, deploying, and operating private AI inference systems. For example, we highlight opportunities and challenges in AI systems using trusted execution environments combined with more recent advances in cryptographic techniques to protect data in use. Finally, we outline areas of further development that require the global collective attention of industry, academia, and government researchers to sustain the operation of trustworthy AI inference systems.

Submitted 10 February, 2023; v1 submitted 10 August, 2020; originally announced August 2020.

arXiv:2007.10330 [pdf, other]

SHEARer: Highly-Efficient Hyperdimensional Computing by Software-Hardware Enabled Multifold Approximation

Authors: Behnam Khaleghi, Sahand Salamat, Anthony Thomas, Fatemeh Asgarinejad, Yeseong Kim, Tajana Rosing

Abstract: Hyperdimensional computing (HD) is an emerging paradigm for machine learning based on the evidence that the brain computes on high-dimensional, distributed representations of data. The main operation of HD is encoding, which transfers the input data to hyperspace by mapping each input feature to a hypervector, accompanied by a so-called bundling procedure that simply adds up the hypervectors to realize the encoding hypervector. Although the operations of HD are highly parallelizable, the massive number of operations hampers the efficiency of HD in the embedded domain. In this paper, we propose SHEARer, an algorithm-hardware co-optimization to improve the performance and energy consumption of HD computing. We gain insight from a prudent scheme of approximating the hypervectors that, thanks to the inherent error resiliency of HD, has minimal impact on accuracy while providing high prospect for hardware optimization. In contrast to previous works that generate the encoding hypervectors in full precision and then quantize them ex post, we compute the encoding hypervectors in an approximate manner that saves a significant amount of resources yet affords high accuracy. We also propose a novel FPGA implementation that achieves striking performance through massive parallelism with low power consumption. Moreover, we develop a software framework that enables training HD models by emulating the proposed approximate encodings. The FPGA implementation of SHEARer achieves an average throughput boost of 104,904x (15.7x) and energy savings of up to 56,044x (301x) compared to state-of-the-art encoding methods implemented on Raspberry Pi 3 (GeForce GTX 1080 Ti) using practical machine learning datasets.

Submitted 20 July, 2020; originally announced July 2020.

Comments: A shorter version is accepted in ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED 2020)
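
The encoding-and-bundling step that the SHEARer abstract describes can be illustrated with a minimal sketch. It is not SHEARer's approximate encoding: the bipolar hypervectors, the per-feature ID vectors, the quantization into levels, and every function name below are assumptions chosen only to show the general "map each feature to a hypervector, then add them up" idea in full precision.

```python
import random

def random_hypervector(dim, rng):
    # Random bipolar (+1/-1) hypervector; bipolar values are an assumption.
    return [rng.choice((-1, 1)) for _ in range(dim)]

def make_item_memory(num_features, num_levels, dim, seed=0):
    # Hypothetical helper: one ID hypervector per feature position and
    # one level hypervector per quantized feature value.
    rng = random.Random(seed)
    ids = [random_hypervector(dim, rng) for _ in range(num_features)]
    levels = [random_hypervector(dim, rng) for _ in range(num_levels)]
    return ids, levels

def encode(features, ids, levels):
    # features: values assumed to be normalized to [0, 1].
    dim = len(ids[0])
    num_levels = len(levels)
    encoded = [0] * dim
    for i, value in enumerate(features):
        # Quantize the feature value to a level index.
        level = min(int(value * num_levels), num_levels - 1)
        # Bind position and value (elementwise product), then bundle
        # by adding the result into the running sum.
        for d in range(dim):
            encoded[d] += ids[i][d] * levels[level][d]
    return encoded

ids, levels = make_item_memory(num_features=4, num_levels=8, dim=1000)
sample = encode([0.1, 0.5, 0.9, 0.3], ids, levels)
```

Each component of `sample` is a sum of four +1/-1 terms, which shows why the bundled vector needs more than one bit per dimension before any quantization is applied.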

arXiv:2005.06716 [pdf, other]

Prive-HD: Privacy-Preserved Hyperdimensional Computing

Authors: Behnam Khaleghi, Mohsen Imani, Tajana Rosing

Abstract: The privacy of data is a major challenge in machine learning as a trained model may expose sensitive information of the enclosed dataset. Besides, the limited computation capability and capacity of edge devices have made cloud-hosted inference inevitable. Sending private information to remote servers makes the privacy of inference also vulnerable because of susceptible communication channels or even untrustworthy hosts. In this paper, we target privacy-preserving training and inference of brain-inspired Hyperdimensional (HD) computing, a new learning algorithm that is gaining traction due to its light-weight computation and robustness particularly appealing for edge devices with tight constraints. Indeed, despite its promising attributes, HD computing has virtually no privacy due to its reversible computation. We present an accuracy-privacy trade-off method through meticulous quantization and pruning of hypervectors, the building blocks of HD, to realize a differentially private model as well as to obfuscate the information sent for cloud-hosted inference. Finally, we show how the proposed techniques can be also leveraged for efficient hardware implementation.

Submitted 14 May, 2020; originally announced May 2020.

Comments: Accepted in Design Automation Conference (DAC) 2020

arXiv:2002.02394 [pdf, other]

FPGA Acceleration of Sequence Alignment: A Survey

Authors: Sahand Salamat, Tajana Rosing

Abstract: Genomics is changing our understanding of humans, evolution, diseases, and medicines, to name but a few. As sequencing technology develops, collecting DNA sequences takes less time, thereby generating more genetic data every day. Today the rate of generating genetic data is outpacing the rate of computation power growth. Current sequencing machines can sequence 50 human genomes per day; however, aligning the read sequences against a reference genome and assembling the genome will take 1300 CPU hours. The main step in constructing the genome is aligning the reads against a reference genome. Numerous accelerators have been proposed to accelerate the DNA alignment process. Providing massive parallelism, FPGA-based accelerators have shown great performance in accelerating DNA alignment algorithms. Additionally, FPGA-based accelerators provide better energy efficiency than general-purpose processors. In this survey, we introduce three main DNA alignment algorithms and FPGA-based implementations of these algorithms to accelerate DNA alignment. We also compare these three alignment categories and show how accelerators have developed over time.

Submitted 27 July, 2020; v1 submitted 4 February, 2020; originally announced February 2020.

arXiv:1912.07200 [pdf, other]

A Broader Study of Cross-Domain Few-Shot Learning

Authors: Yunhui Guo, Noel C. Codella, Leonid Karlinsky, James V. Codella, John R. Smith, Kate Saenko, Tajana Rosing, Rogerio Feris

Abstract: Recent progress on few-shot learning largely relies on annotated data for meta-learning: base classes sampled from the same domain as the novel classes. However, in many applications, collecting data for meta-learning is infeasible or impossible. This leads to the cross-domain few-shot learning problem, where there is a large shift between base and novel class domains. While investigations of the cross-domain few-shot scenario exist, these works are limited to natural images that still contain a high degree of visual similarity. No work yet exists that examines few-shot learning across different imaging methods seen in real world scenarios, such as aerial and medical imaging. In this paper, we propose the Broader Study of Cross-Domain Few-Shot Learning (BSCD-FSL) benchmark, consisting of image data from a diverse assortment of image acquisition methods. This includes natural images, such as crop disease images, but additionally those that present with an increasing dissimilarity to natural images, such as satellite images, dermatology images, and radiology images. Extensive experiments on the proposed benchmark are performed to evaluate state-of-art meta-learning approaches, transfer learning approaches, and newer methods for cross-domain few-shot learning. The results demonstrate that state-of-art meta-learning methods are surprisingly outperformed by earlier meta-learning approaches, and all meta-learning methods underperform in relation to simple fine-tuning by 12.8% average accuracy. Performance gains previously observed with methods specialized for cross-domain few-shot learning vanish in this more challenging benchmark. Finally, accuracy of all methods tends to correlate with dataset similarity to natural images, verifying the value of the benchmark to better represent the diversity of data seen in practice and guiding future research.

Submitted 17 July, 2020; v1 submitted 16 December, 2019; originally announced December 2019.

Comments: ECCV 2020. Website: https://www.learning-with-limited-labels.com/

arXiv:1911.12446 [pdf, other]

QubitHD: A Stochastic Acceleration Method for HD Computing-Based Machine Learning

Authors: Samuel Bosch, Alexander Sanchez de la Cerda, Mohsen Imani, Tajana Simunic Rosing, Giovanni De Micheli

Abstract: Machine Learning algorithms based on Brain-inspired Hyperdimensional (HD) computing imitate cognition by exploiting statistical properties of high-dimensional vector spaces. It is a promising solution for achieving high energy efficiency in different machine learning tasks, such as classification, semi-supervised learning, and clustering. A weakness of existing HD computing-based ML algorithms is the fact that they have to be binarized to achieve very high energy efficiency. At the same time, binarized models reach lower classification accuracies. To solve the problem of the trade-off between energy efficiency and classification accuracy, we propose the QubitHD algorithm. It stochastically binarizes HD-based algorithms, while maintaining comparable classification accuracies to their non-binarized counterparts. The FPGA implementation of QubitHD provides a 65% improvement in terms of energy efficiency, and a 95% improvement in terms of training time, as compared to state-of-the-art HD-based ML algorithms. It also outperforms state-of-the-art low-cost classifiers (such as Binarized Neural Networks) in terms of speed and energy efficiency by an order of magnitude during training and inference.

Submitted 10 October, 2022; v1 submitted 27 November, 2019; originally announced November 2019.

Comments: 8 pages, 5 figures, 3 tables
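
The idea of stochastic binarization mentioned in the QubitHD abstract can be illustrated with a generic sketch. The abstract does not give the exact scheme, so the rule below is an assumption: each component is normalized by the largest magnitude in the vector and mapped to +1 with a probability proportional to its value, so that the expected binarized value tracks the original one. The function name and the normalization are hypothetical.

```python
import random

def stochastic_binarize(vector, rng):
    # Normalize by the largest magnitude (assumed scaling); fall back
    # to 1 for an all-zero or empty vector to avoid division by zero.
    scale = max((abs(v) for v in vector), default=0) or 1
    binarized = []
    for v in vector:
        # Probability of emitting +1 grows linearly from 0 (v = -scale)
        # to 1 (v = +scale), so E[output] = v / scale.
        p_plus = (v / scale + 1) / 2
        binarized.append(1 if rng.random() < p_plus else -1)
    return binarized

rng = random.Random(0)
bits = stochastic_binarize([4, -4, 0, 2], rng)
```

With this rule, the extreme components are binarized deterministically, while intermediate ones such as `0` and `2` become +1 with probability 0.5 and 0.75 respectively; a deterministic sign function would instead discard that magnitude information.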

arXiv:1911.09659 [pdf, other]

AdaFilter: Adaptive Filter Fine-tuning for Deep Transfer Learning

Authors: Yunhui Guo, Yandong Li, Liqiang Wang, Tajana Rosing

Abstract: There is an increasing number of pre-trained deep neural network models. However, it is still unclear how to effectively use these models for a new task. Transfer learning, which aims to transfer knowledge from source tasks to a target task, is an effective solution to this problem. Fine-tuning is a popular transfer learning technique for deep neural networks where a few rounds of training are applied to the parameters of a pre-trained model to adapt them to a new task. Despite its popularity, in this paper, we show that fine-tuning suffers from several drawbacks. We propose an adaptive fine-tuning approach, called AdaFilter, which selects only a part of the convolutional filters in the pre-trained model to optimize on a per-example basis. We use a recurrent gated network to selectively fine-tune convolutional filters based on the activations of the previous layer. We experiment with 7 public image classification datasets and the results show that AdaFilter can reduce the average classification error of the standard fine-tuning by 2.54%.

Submitted 8 December, 2019; v1 submitted 21 November, 2019; originally announced November 2019.

arXiv:1911.07187 [pdf, other]

FPGA Energy Efficiency by Leveraging Thermal Margin

Authors: Behnam Khaleghi, Sahand Salamat, Mohsen Imani, Tajana Rosing

Abstract: Cutting-edge FPGAs are not as energy efficient as conventionally presumed, and therefore aggressive power-saving techniques have become imperative. The clock rate of an FPGA-mapped design is set based on worst-case conditions to ensure reliable operation under all circumstances. This usually leaves a considerable timing margin that can be exploited to reduce power consumption by scaling voltage without lowering clock frequency. There are hurdles for such opportunistic voltage scaling in FPGAs because (a) critical paths change with designs, making timing evaluation difficult as voltage changes, (b) each FPGA resource has a particular power-delay trade-off with voltage, and (c) data corruption of configuration cells and memory blocks further hampers voltage scaling. In this paper, we propose a systematic approach to leverage the available thermal headroom of FPGA-mapped designs for power and energy improvement. By comprehensively analyzing the timing and power consumption of FPGA building blocks under varying temperatures and voltages, we propose a thermal-aware voltage scaling flow that effectively utilizes the thermal margin to reduce power consumption without degrading performance. We show the proposed flow can be employed for energy optimization as well, whereby power consumption and delay are traded off to accomplish the tasks with minimum energy. Lastly, we propose a simulation framework to examine the efficiency of the proposed method for other applications that are inherently tolerant to a certain amount of error, granting further power saving opportunity. Experimental results over a set of industrial benchmarks indicate up to 36% power reduction with the same performance, and 66% total energy saving when energy is the optimization target.

Submitted 17 November, 2019; originally announced November 2019.

Comments: Accepted in IEEE International Conference on Computer Design (ICCD) 2019

arXiv:1909.11763 [pdf, other]

Improved Schemes for Episodic Memory-based Lifelong Learning

Authors: Yunhui Guo, Mingrui Liu, Tianbao Yang, Tajana Rosing

Abstract: Current deep neural networks can achieve remarkable performance on a single task. However, when the deep neural network is continually trained on a sequence of tasks, it seems to gradually forget the previously learned knowledge. This phenomenon is referred to as catastrophic forgetting and motivates the field called lifelong learning. Recently, episodic memory based approaches such as GEM (Lopez-Paz and Ranzato, 2017) and A-GEM (Chaudhry et al., 2018) have shown remarkable performance. In this paper, we provide the first unified view of episodic memory based approaches from an optimization perspective. This view leads to two improved schemes for episodic memory based lifelong learning, called MEGA-I and MEGA-II. MEGA-I and MEGA-II modulate the balance between old tasks and the new task by integrating the current gradient with the gradient computed on the episodic memory. Notably, we show that GEM and A-GEM are degenerate cases of MEGA-I and MEGA-II which consistently put the same emphasis on the current task, regardless of how the loss changes over time. Our proposed schemes address this issue by using novel loss-balancing updating rules, which drastically improve the performance over GEM and A-GEM. Extensive experimental results show that the proposed schemes significantly advance the state-of-the-art on four commonly used lifelong learning benchmarks, reducing the error by up to 18%.

Submitted 14 December, 2020; v1 submitted 25 September, 2019; originally announced September 2019.

Comments: NeurIPS 2020, Spotlight. 17 pages. Code: https://github.com/yunhuiguo/MEGA

arXiv:1908.06519 [pdf, other]

Workload-Aware Opportunistic Energy Efficiency in Multi-FPGA Platforms

Authors: Sahand Salamat, Behnam Khaleghi, Mohsen Imani, Tajana Rosing

Abstract: The continuous growth of big data applications with high computational and scalability demands has resulted in increasing popularity of cloud computing. Optimizing the performance and power consumption of cloud resources is therefore crucial to relieve the costs of data centers. In recent years, multi-FPGA platforms have gained traction in data centers as low-cost yet high-performance solutions, particularly as acceleration engines, thanks to the high degree of parallelism they provide. Nonetheless, the size of data center workloads varies during service time, leading to significant underutilization of computing resources while consuming a large amount of power, which turns out to be a key factor of data center inefficiency, regardless of the underlying hardware structure. In this paper, we propose an efficient framework to throttle the power consumption of multi-FPGA platforms by dynamically scaling the voltage, and thereby the frequency, during runtime according to prediction of, and adjustment to, the workload level, while maintaining the desired Quality of Service (QoS). This is in contrast to, and more efficient than, conventional approaches that merely scale (i.e., power-gate) the computing nodes or frequency. The proposed framework carefully exploits a pre-characterized library of delay-voltage and power-voltage information of FPGA resources, which we show is indispensable to obtain the efficient operating point due to the different sensitivity of resources w.r.t. voltage scaling, particularly considering multiple power rails residing in these devices. Our evaluations by implementing state-of-the-art deep neural network accelerators revealed that, providing an average power reduction of 4.0X, the proposed framework surpasses the previous works by 33.6% (up to 83%).

Submitted 28 October, 2019; v1 submitted 18 August, 2019; originally announced August 2019.

Comments: The paper will be published in ICCAD 2019

arXiv:1902.00927 [pdf, other]

Depthwise Convolution is All You Need for Learning Multiple Visual Domains

Authors: Yunhui Guo, Yandong Li, Rogerio Feris, Liqiang Wang, Tajana Rosing

Abstract: There is a growing interest in designing models that can deal with images from different visual domains. If there exists a universal structure in different visual domains that can be captured via a common parameterization, then we can use a single model for all domains rather than one model per domain. A model aware of the relationships between different domains can also be trained to work on new domains with fewer resources. However, identifying the reusable structure in a model is not easy. In this paper, we propose a multi-domain learning architecture based on depthwise separable convolution. The proposed approach is based on the assumption that images from different domains share cross-channel correlations but have domain-specific spatial correlations. The proposed model is compact and has minimal overhead when being applied to new domains. Additionally, we introduce a gating mechanism to promote soft sharing between different domains. We evaluate our approach on the Visual Decathlon Challenge, a benchmark for testing the ability of multi-domain models. The experiments show that our approach can achieve the highest score while only requiring 50% of the parameters compared with the state-of-the-art approaches.

Submitted 19 February, 2019; v1 submitted 3 February, 2019; originally announced February 2019.

Showing 1–50 of 52 results for author: Rosing, T