-
ReFrame: Layer Caching for Accelerated Inference in Real-Time Rendering
Authors:
Lufei Liu,
Tor M. Aamodt
Abstract:
Graphics rendering applications increasingly leverage neural networks in tasks such as denoising, supersampling, and frame extrapolation to improve image quality while maintaining frame rates. The temporal coherence inherent in these tasks presents an opportunity to reuse intermediate results from previous frames and avoid redundant computations. Recent work has shown that caching intermediate fea…
▽ More
Graphics rendering applications increasingly leverage neural networks in tasks such as denoising, supersampling, and frame extrapolation to improve image quality while maintaining frame rates. The temporal coherence inherent in these tasks presents an opportunity to reuse intermediate results from previous frames and avoid redundant computations. Recent work has shown that caching intermediate features to be reused in subsequent inferences is an effective method to reduce latency in diffusion models. We extend this idea to real-time rendering and present ReFrame, which explores different caching policies to optimize trade-offs between quality and performance in rendering workloads. ReFrame can be applied to a variety of encoder-decoder style networks commonly found in rendering pipelines. Experimental results show that we achieve 1.4x speedup on average with negligible quality loss in three real-time rendering tasks. Code available: https://ubc-aamodt-group.github.io/reframe-layer-caching/
△ Less
Submitted 14 June, 2025;
originally announced June 2025.
-
Learning Label Encodings for Deep Regression
Authors:
Deval Shah,
Tor M. Aamodt
Abstract:
Deep regression networks are widely used to tackle the problem of predicting a continuous value for a given input. Task-specialized approaches for training regression networks have shown significant improvement over generic approaches, such as direct regression. More recently, a generic approach based on regression by binary classification using binary-encoded labels has shown significant improvem…
▽ More
Deep regression networks are widely used to tackle the problem of predicting a continuous value for a given input. Task-specialized approaches for training regression networks have shown significant improvement over generic approaches, such as direct regression. More recently, a generic approach based on regression by binary classification using binary-encoded labels has shown significant improvement over direct regression. The space of label encodings for regression is large. Lacking heretofore have been automated approaches to find a good label encoding for a given application. This paper introduces Regularized Label Encoding Learning (RLEL) for end-to-end training of an entire network and its label encoding. RLEL provides a generic approach for tackling regression. Underlying RLEL is our observation that the search space of label encodings can be constrained and efficiently explored by using a continuous search space of real-valued label encodings combined with a regularization function designed to encourage encodings with certain properties. These properties balance the probability of classification error in individual bits against error correction capability. Label encodings found by RLEL result in lower or comparable errors to manually designed label encodings. Applying RLEL results in 10.9% and 12.4% improvement in Mean Absolute Error (MAE) over direct regression and multiclass classification, respectively. Our evaluation demonstrates that RLEL can be combined with off-the-shelf feature extractors and is suitable across different architectures, datasets, and tasks. Code is available at https://github.com/ubc-aamodt-group/RLEL_regression.
△ Less
Submitted 3 March, 2023;
originally announced March 2023.
-
Label Encoding for Regression Networks
Authors:
Deval Shah,
Zi Yu Xue,
Tor M. Aamodt
Abstract:
Deep neural networks are used for a wide range of regression problems. However, there exists a significant gap in accuracy between specialized approaches and generic direct regression in which a network is trained by minimizing the squared or absolute error of output labels. Prior work has shown that solving a regression problem with a set of binary classifiers can improve accuracy by utilizing we…
▽ More
Deep neural networks are used for a wide range of regression problems. However, there exists a significant gap in accuracy between specialized approaches and generic direct regression in which a network is trained by minimizing the squared or absolute error of output labels. Prior work has shown that solving a regression problem with a set of binary classifiers can improve accuracy by utilizing well-studied binary classification algorithms. We introduce binary-encoded labels (BEL), which generalizes the application of binary classification to regression by providing a framework for considering arbitrary multi-bit values when encoding target values. We identify desirable properties of suitable encoding and decoding functions used for the conversion between real-valued and binary-encoded labels based on theoretical and empirical study. These properties highlight a tradeoff between classification error probability and error-correction capabilities of label encodings. BEL can be combined with off-the-shelf task-specific feature extractors and trained end-to-end. We propose a series of sample encoding, decoding, and training loss functions for BEL and demonstrate they result in lower error than direct regression and specialized approaches while being suitable for a diverse set of regression problems, network architectures, and evaluation metrics. BEL achieves state-of-the-art accuracies for several regression benchmarks. Code is available at https://github.com/ubc-aamodt-group/BEL_regression.
△ Less
Submitted 4 December, 2022;
originally announced December 2022.
-
Characterizing and Improving the Resilience of Accelerators in Autonomous Robots
Authors:
Deval Shah,
Zi Yu Xue,
Karthik Pattabiraman,
Tor M. Aamodt
Abstract:
Motion planning is a computationally intensive and well-studied problem in autonomous robots. However, motion planning hardware accelerators (MPA) must be soft-error resilient for deployment in safety-critical applications, and blanket application of traditional mitigation techniques is ill-suited due to cost, power, and performance overheads. We propose Collision Exposure Factor (CEF), a novel me…
▽ More
Motion planning is a computationally intensive and well-studied problem in autonomous robots. However, motion planning hardware accelerators (MPA) must be soft-error resilient for deployment in safety-critical applications, and blanket application of traditional mitigation techniques is ill-suited due to cost, power, and performance overheads. We propose Collision Exposure Factor (CEF), a novel metric to assess the failure vulnerability of circuits processing spatial relationships, including motion planning. CEF is based on the insight that the safety violation probability increases with the surface area of the physical space exposed by a bit-flip. We evaluate CEF on four MPAs. We demonstrate empirically that CEF is correlated with safety violation probability, and that CEF-aware selective error mitigation provides 12.3x, 9.6x, and 4.2x lower Failures-In-Time (FIT) rate on average for the same amount of protected memory compared to uniform, bit-position, and access-frequency-aware selection of critical data. Furthermore, we show how to employ CEF to enable fault characterization using 23,000x fewer fault injection (FI) experiments than exhaustive FI, and evaluate our FI approach on different robots and MPAs. We demonstrate that CEF-aware FI can provide insights on vulnerable bits in an MPA while taking the same amount of time as uniform statistical FI. Finally, we use the CEF to formulate guidelines for designing soft-error resilient MPAs.
△ Less
Submitted 17 October, 2021;
originally announced October 2021.
-
Sparse Weight Activation Training
Authors:
Md Aamir Raihan,
Tor M. Aamodt
Abstract:
Neural network training is computationally and memory intensive. Sparse training can reduce the burden on emerging hardware platforms designed to accelerate sparse computations, but it can affect network convergence. In this work, we propose a novel CNN training algorithm Sparse Weight Activation Training (SWAT). SWAT is more computation and memory-efficient than conventional training. SWAT modifi…
▽ More
Neural network training is computationally and memory intensive. Sparse training can reduce the burden on emerging hardware platforms designed to accelerate sparse computations, but it can affect network convergence. In this work, we propose a novel CNN training algorithm Sparse Weight Activation Training (SWAT). SWAT is more computation and memory-efficient than conventional training. SWAT modifies back-propagation based on the empirical insight that convergence during training tends to be robust to the elimination of (i) small magnitude weights during the forward pass and (ii) both small magnitude weights and activations during the backward pass. We evaluate SWAT on recent CNN architectures such as ResNet, VGG, DenseNet and WideResNet using CIFAR-10, CIFAR-100 and ImageNet datasets. For ResNet-50 on ImageNet SWAT reduces total floating-point operations (FLOPS) during training by 80% resulting in a 3.3$\times$ training speedup when run on a simulated sparse learning accelerator representative of emerging platforms while incurring only 1.63% reduction in validation accuracy. Moreover, SWAT reduces memory footprint during the backward pass by 23% to 50% for activations and 50% to 90% for weights.
△ Less
Submitted 31 October, 2020; v1 submitted 7 January, 2020;
originally announced January 2020.
-
Surface Compression Using Dynamic Color Palettes
Authors:
Ayub A. Gubran,
Felix Huang,
Tor M. Aamodt
Abstract:
Off-chip memory traffic is a major source of power and energy consumption on mobile platforms. A large amount of this off-chip traffic is used to manipulate graphics framebuffer surfaces. To cut down the cost of accessing off-chip memory, framebuffer surfaces are compressed to reduce the bandwidth consumed on surface manipulation when rendering or displaying.
In this work, we study the compressi…
▽ More
Off-chip memory traffic is a major source of power and energy consumption on mobile platforms. A large amount of this off-chip traffic is used to manipulate graphics framebuffer surfaces. To cut down the cost of accessing off-chip memory, framebuffer surfaces are compressed to reduce the bandwidth consumed on surface manipulation when rendering or displaying.
In this work, we study the compression properties of framebuffer surfaces and highlight the fact that surfaces from different applications have different compression characteristics. We use the results of our analysis to propose a scheme, Dynamic Color Palettes (DCP), which achieves higher compression rates with UI and 2D surfaces.
DCP is a hardware mechanism for exploiting inter-frame coherence in lossless surface compression; it implements a scheme that dynamically constructs color palettes, which are then used to efficiently compress framebuffer surfaces. To evaluate DCP, we created an extensive set of OpenGL workload traces from 124 Android applications. We found that DCP improves compression rates by 91% for UI and 20% for 2D applications compared to previous proposals. We also evaluate a hybrid scheme that combines DCP with a generic compression scheme. We found that compression rates improve over previous proposals by 161%, 124% and 83% for UI, 2D and 3D applications, respectively.
△ Less
Submitted 18 January, 2019;
originally announced March 2019.
-
HoLiSwap: Reducing Wire Energy in L1 Caches
Authors:
Yatish Turakhia,
Subhasis Das,
Tor M. Aamodt,
William J. Dally
Abstract:
This paper describes HoLiSwap a method to reduce L1 cache wire energy, a significant fraction of total cache energy, by swapping hot lines to the cache way nearest to the processor. We observe that (i) a small fraction (<3%) of cache lines (hot lines) serve over 60% of the L1 cache accesses and (ii) the difference in wire energy between the nearest and farthest cache subarray can be over 6…
▽ More
This paper describes HoLiSwap a method to reduce L1 cache wire energy, a significant fraction of total cache energy, by swapping hot lines to the cache way nearest to the processor. We observe that (i) a small fraction (<3%) of cache lines (hot lines) serve over 60% of the L1 cache accesses and (ii) the difference in wire energy between the nearest and farthest cache subarray can be over 6$\times$. Our method exploits this difference in wire energy to dynamically identify hot lines and swap them to the nearest physical way in a set-associative L1 cache. This provides up to 44% improvement in the wire energy (1.82% saving in overall system energy) with no impact on the cache miss rate and 0.13% performance drop. We also show that HoLiSwap can simplify way-prediction.
△ Less
Submitted 13 January, 2017;
originally announced January 2017.
-
CG-OoO: Energy-Efficient Coarse-Grain Out-of-Order Execution
Authors:
Milad Mohammadi,
Tor M. Aamodt,
William J. Dally
Abstract:
We introduce the Coarse-Grain Out-of-Order (CG- OoO) general purpose processor designed to achieve close to In-Order processor energy while maintaining Out-of-Order (OoO) performance. CG-OoO is an energy-performance proportional general purpose architecture that scales according to the program load. Block-level code processing is at the heart of the this architecture; CG-OoO speculates, fetches, s…
▽ More
We introduce the Coarse-Grain Out-of-Order (CG- OoO) general purpose processor designed to achieve close to In-Order processor energy while maintaining Out-of-Order (OoO) performance. CG-OoO is an energy-performance proportional general purpose architecture that scales according to the program load. Block-level code processing is at the heart of the this architecture; CG-OoO speculates, fetches, schedules, and commits code at block-level granularity. It eliminates unnecessary accesses to energy consuming tables, and turns large tables into smaller and distributed tables that are cheaper to access. CG-OoO leverages compiler-level code optimizations to deliver efficient static code, and exploits dynamic instruction-level parallelism and block-level parallelism. CG-OoO introduces Skipahead issue, a complexity effective, limited out-of-order instruction scheduling model. Through the energy efficiency techniques applied to the compiler and processor pipeline stages, CG-OoO closes 64% of the average energy gap between the In-Order and Out-of-Order baseline processors at the performance of the OoO baseline. This makes CG-OoO 1.9x more efficient than the OoO on the energy-delay product inverse metric.
△ Less
Submitted 5 June, 2016;
originally announced June 2016.