Showing 1–31 of 31 results for author: Zhang, S Q

Searching in archive cs.
  1. arXiv:2505.01572

    cs.AI cs.DC

    PipeSpec: Breaking Stage Dependencies in Hierarchical LLM Decoding

    Authors: Bradley McDanel, Sai Qian Zhang, Yunhai Hu, Zining Liu

    Abstract: Speculative decoding accelerates large language model inference by using smaller draft models to generate candidate tokens for parallel verification. However, current approaches are limited by sequential stage dependencies that prevent full hardware utilization. We present PipeSpec, a framework that generalizes speculative decoding to $k$ models arranged in a hierarchical pipeline, enabling asynch…

    Submitted 2 May, 2025; originally announced May 2025.

    Comments: 10 pages, 5 figures, 2 tables

  2. arXiv:2503.21854

    cs.CV cs.AI

    Foveated Instance Segmentation

    Authors: Hongyi Zeng, Wenxuan Liu, Tianhua Xia, Jinhui Chen, Ziyun Li, Sai Qian Zhang

    Abstract: Instance segmentation is essential for augmented reality and virtual reality (AR/VR) as it enables precise object recognition and interaction, enhancing the integration of virtual and real-world elements for an immersive experience. However, the high computational overhead of segmentation limits its application on resource-constrained AR/VR devices, causing large processing latency and degrading u…

    Submitted 27 March, 2025; originally announced March 2025.

  3. arXiv:2502.19732

    cs.CL

    Speculative Decoding and Beyond: An In-Depth Survey of Techniques

    Authors: Yunhai Hu, Zining Liu, Zhenyuan Dong, Tianfan Peng, Bradley McDanel, Sai Qian Zhang

    Abstract: Sequential dependencies present a fundamental bottleneck in deploying large-scale autoregressive models, particularly for real-time applications. While traditional optimization approaches like pruning and quantization often compromise model quality, recent advances in generation-refinement frameworks demonstrate that this trade-off can be significantly mitigated. This survey presents a comprehen…

    Submitted 3 March, 2025; v1 submitted 26 February, 2025; originally announced February 2025.

  4. arXiv:2502.11832

    cs.AR

    HAAN: A Holistic Approach for Accelerating Normalization Operations in Large Language Models

    Authors: Tianfan Peng, Jiajun Qin, Tianhua Xia, Sai Qian Zhang

    Abstract: Large language models (LLMs) have revolutionized natural language processing (NLP) tasks by achieving state-of-the-art performance across a range of benchmarks. Central to the success of these models is the integration of sophisticated architectural components aimed at improving training stability, convergence speed, and generalization capabilities. Among these components, normalization operation,…

    Submitted 17 February, 2025; originally announced February 2025.

  5. arXiv:2502.00922

    cs.LG cs.AR

    Huff-LLM: End-to-End Lossless Compression for Efficient LLM Inference

    Authors: Patrick Yubeaton, Tareq Mahmoud, Shehab Naga, Pooria Taheri, Tianhua Xia, Arun George, Yasmein Khalil, Sai Qian Zhang, Siddharth Joshi, Chinmay Hegde, Siddharth Garg

    Abstract: As they become more capable, large language models (LLMs) have continued to rapidly increase in size. This has exacerbated the difficulty in running state-of-the-art LLMs on small edge devices. Standard techniques advocate solving this problem through lossy compression techniques such as quantization or pruning. However, such compression techniques are lossy, and have been shown to change model b…

    Submitted 2 February, 2025; originally announced February 2025.

  6. arXiv:2412.10456

    cs.CV cs.AI

    FovealNet: Advancing AI-Driven Gaze Tracking Solutions for Optimized Foveated Rendering System Performance in Virtual Reality

    Authors: Wenxuan Liu, Monde Duinkharjav, Qi Sun, Sai Qian Zhang

    Abstract: Leveraging real-time eye-tracking, foveated rendering optimizes hardware efficiency and enhances visual quality in virtual reality (VR). This approach leverages eye-tracking techniques to determine where the user is looking, allowing the system to render high-resolution graphics only in the foveal region (the small area of the retina where visual acuity is highest), while the peripheral view is rendere…

    Submitted 30 December, 2024; v1 submitted 12 December, 2024; originally announced December 2024.

  7. arXiv:2412.10448

    cs.CV cs.AI

    Unlocking Visual Secrets: Inverting Features with Diffusion Priors for Image Reconstruction

    Authors: Sai Qian Zhang, Ziyun Li, Chuan Guo, Saeed Mahloujifar, Deeksha Dangwal, Edward Suh, Barbara De Salvo, Chiao Liu

    Abstract: Inverting visual representations within deep neural networks (DNNs) presents a challenging and important problem in the field of security and privacy for deep learning. The main goal is to invert the features of an unidentified target image generated by a pre-trained DNN, aiming to reconstruct the original image. Feature inversion holds particular significance in understanding the privacy leakage…

    Submitted 11 December, 2024; originally announced December 2024.

  8. arXiv:2412.00648

    cs.LG stat.ML

    DFRot: Achieving Outlier-Free and Massive Activation-Free for Rotated LLMs with Refined Rotation

    Authors: Jingyang Xiang, Sai Qian Zhang

    Abstract: Rotating the activation and weight matrices to reduce the influence of outliers in large language models (LLMs) has recently attracted significant attention, particularly in the context of model quantization. Prior studies have shown that in low-precision quantization scenarios, such as 4-bit weights and 4-bit activations (W4A4), randomized Hadamard transforms can achieve significantly higher accu…

    Submitted 2 December, 2024; v1 submitted 30 November, 2024; originally announced December 2024.

    Comments: 24 pages, 38 figures. Source code: https://github.com/JingyangXiang/DFRot

  9. arXiv:2411.04335

    cs.CV

    GazeGen: Gaze-Driven User Interaction for Visual Content Generation

    Authors: He-Yen Hsieh, Ziyun Li, Sai Qian Zhang, Wei-Te Mark Ting, Kao-Den Chang, Barbara De Salvo, Chiao Liu, H. T. Kung

    Abstract: We present GazeGen, a user interaction system that generates visual content (images and videos) for locations indicated by the user's eye gaze. GazeGen allows intuitive manipulation of visual content by targeting regions of interest with gaze. Using advanced techniques in object detection and generative AI, GazeGen performs gaze-controlled image adding/deleting, repositioning, and surface style ch…

    Submitted 17 November, 2024; v1 submitted 6 November, 2024; originally announced November 2024.

    Comments: 12 pages, 10 figures

  10. arXiv:2410.17661

    cs.AI cs.LG

    PETAH: Parameter Efficient Task Adaptation for Hybrid Transformers in a resource-limited Context

    Authors: Maximilian Augustin, Syed Shakib Sarwar, Mostafa Elhoushi, Sai Qian Zhang, Yuecheng Li, Barbara De Salvo

    Abstract: Following their success in natural language processing (NLP), there has been a shift towards transformer models in computer vision. While transformers perform well and offer promising multi-tasking performance, due to their high compute requirements, many resource-constrained applications still rely on convolutional or hybrid models that combine the benefits of convolution and attention layers and…

    Submitted 23 October, 2024; originally announced October 2024.

  11. arXiv:2410.08326

    cs.CV cs.AR cs.LG cs.PF

    Neural Architecture Search of Hybrid Models for NPU-CIM Heterogeneous AR/VR Devices

    Authors: Yiwei Zhao, Ziyun Li, Win-San Khwa, Xiaoyu Sun, Sai Qian Zhang, Syed Shakib Sarwar, Kleber Hugo Stangherlin, Yi-Lun Lu, Jorge Tomas Gomez, Jae-Sun Seo, Phillip B. Gibbons, Barbara De Salvo, Chiao Liu

    Abstract: Low-latency and low-power edge AI is essential for Virtual Reality and Augmented Reality applications. Recent advances show that hybrid models, combining convolution layers (CNN) and transformers (ViT), often achieve a superior accuracy/performance tradeoff on various computer vision and machine learning (ML) tasks. However, hybrid ML models can pose system challenges for latency and energy-efficien…

    Submitted 10 October, 2024; originally announced October 2024.

  12. arXiv:2409.07756

    cs.CV

    DiTAS: Quantizing Diffusion Transformers via Enhanced Activation Smoothing

    Authors: Zhenyuan Dong, Sai Qian Zhang

    Abstract: Diffusion Transformers (DiTs) have recently attracted significant interest from both industry and academia due to their enhanced capabilities in visual generation, surpassing the performance of traditional diffusion models that employ U-Net. However, the improved performance of DiTs comes at the expense of higher parameter counts and implementation costs, which significantly limits their deploymen…

    Submitted 24 November, 2024; v1 submitted 12 September, 2024; originally announced September 2024.

    Comments: Accepted at WACV 2025. Code is available at https://github.com/DZY122/DiTAS

  13. arXiv:2408.12885

    cs.CV

    T3M: Text Guided 3D Human Motion Synthesis from Speech

    Authors: Wenshuo Peng, Kaipeng Zhang, Sai Qian Zhang

    Abstract: Speech-driven 3D motion synthesis seeks to create lifelike animations based on human speech, with potential uses in virtual reality, gaming, and film production. Existing approaches rely solely on speech audio for motion generation, leading to inaccurate and inflexible synthesis results. To mitigate this problem, we introduce a novel text-guided 3D human motion synthesis method, termed \texti…

    Submitted 23 August, 2024; originally announced August 2024.

    Comments: 10 pages, 4 figures

  14. arXiv:2407.18276

    cs.AR cs.AI

    Rome was Not Built in a Single Step: Hierarchical Prompting for LLM-based Chip Design

    Authors: Andre Nakkab, Sai Qian Zhang, Ramesh Karri, Siddharth Garg

    Abstract: Large Language Models (LLMs) are effective in computer hardware synthesis via hardware description language (HDL) generation. However, LLM-assisted approaches for HDL generation struggle when handling complex tasks. We introduce a suite of hierarchical prompting techniques which facilitate efficient stepwise design methods, and develop a generalizable automation pipeline for the process. To evalua…

    Submitted 9 September, 2024; v1 submitted 23 July, 2024; originally announced July 2024.

    Comments: Accepted at MLCAD '24. 10 pages, 7 figures, 5 tables

  15. arXiv:2405.19751

    cs.CV cs.AI

    HQ-DiT: Efficient Diffusion Transformer with FP4 Hybrid Quantization

    Authors: Wenxuan Liu, Sai Qian Zhang

    Abstract: Diffusion Transformers (DiTs) have recently gained substantial attention in both industrial and academic fields for their superior visual generation capabilities, outperforming traditional diffusion models that use U-Net. However, the enhanced performance of DiTs also comes with high parameter counts and implementation costs, seriously restricting their use on resource-limited devices such as mobil…

    Submitted 31 May, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

  16. arXiv:2404.05182

    cs.LG cs.AI cs.CL cs.DC

    DLoRA: Distributed Parameter-Efficient Fine-Tuning Solution for Large Language Model

    Authors: Chao Gao, Sai Qian Zhang

    Abstract: To enhance the performance of large language models (LLMs) on downstream tasks, one solution is to fine-tune certain LLM parameters so that the model aligns better with the characteristics of the training dataset. This process is commonly known as parameter-efficient fine-tuning (PEFT). Due to the scale of LLMs, PEFT operations are usually executed in the public environment (e.g., cloud server). This neces…

    Submitted 8 April, 2024; originally announced April 2024.

  17. arXiv:2403.14608

    cs.LG

    Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

    Authors: Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, Sai Qian Zhang

    Abstract: Large models represent a groundbreaking advancement in multiple application fields, enabling remarkable achievements across various tasks. However, their unprecedented scale comes with significant computational costs. These models, often consisting of billions of parameters, require vast amounts of computational resources for execution. Especially, the expansive scale and computational demands pos…

    Submitted 15 September, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

    Comments: 25 pages, 12 figures. Due to word limit, the abstract here is truncated. The full abstract is available in the PDF

  18. arXiv:2311.17218

    cs.CV cs.LG

    BIM: Block-Wise Self-Supervised Learning with Masked Image Modeling

    Authors: Yixuan Luo, Mengye Ren, Sai Qian Zhang

    Abstract: Like masked language modeling (MLM) in natural language processing, masked image modeling (MIM) aims to extract valuable insights from image patches to enhance the feature extraction capabilities of the underlying deep neural network (DNN). Contrasted with other training paradigms like supervised learning and unsupervised contrastive learning, masked image modeling (MIM) pretraining typically dema…

    Submitted 28 November, 2023; originally announced November 2023.

  19. arXiv:2311.13290

    cs.AR

    Hyft: A Reconfigurable Softmax Accelerator with Hybrid Numeric Format for both Training and Inference

    Authors: Tianhua Xia, Sai Qian Zhang

    Abstract: The attention mechanism is a pivotal element within the transformer architecture, making a substantial contribution to its exceptional performance. Within this attention mechanism, Softmax is an imperative component that enables the model to assess the degree of correlation between various segments of the input. Yet, prior research has shown that Softmax operations can significantly increase proce…

    Submitted 4 September, 2024; v1 submitted 22 November, 2023; originally announced November 2023.

  20. arXiv:2305.03148

    cs.AR cs.LG cs.NE

    CAMEL: Co-Designing AI Models and Embedded DRAMs for Efficient On-Device Learning

    Authors: Sai Qian Zhang, Thierry Tambe, Nestor Cuevas, Gu-Yeon Wei, David Brooks

    Abstract: On-device learning allows AI models to adapt to user data, thereby enhancing service quality on edge platforms. However, training AI on resource-limited devices poses significant challenges due to the demanding computing workload and the substantial memory consumption and data access required by deep neural networks (DNNs). To address these issues, we propose utilizing embedded dynamic random-acce…

    Submitted 22 December, 2023; v1 submitted 4 May, 2023; originally announced May 2023.

  21. arXiv:2207.09413

    cs.LG cs.AI cs.CV cs.DC

    SphereFed: Hyperspherical Federated Learning

    Authors: Xin Dong, Sai Qian Zhang, Ang Li, H. T. Kung

    Abstract: Federated Learning aims at training a global model from multiple decentralized devices (i.e. clients) without exchanging their private local data. A key challenge is the handling of non-i.i.d. (independent identically distributed) data across multiple clients that may induce disparities of their local features. We introduce the Hyperspherical Federated Learning (SphereFed) framework to address the…

    Submitted 19 July, 2022; originally announced July 2022.

    Comments: European Conference on Computer Vision 2022

  22. arXiv:2201.02932

    cs.LG cs.AI

    A Multi-agent Reinforcement Learning Approach for Efficient Client Selection in Federated Learning

    Authors: Sai Qian Zhang, Jieyu Lin, Qi Zhang

    Abstract: Federated learning (FL) is a training technique that enables client devices to jointly learn a shared model by aggregating locally-computed models without exposing their raw data. While most of the existing work focuses on improving the FL model accuracy, in this paper, we focus on improving the training efficiency, which is often a hurdle for adopting FL in real-world applications. Specifical…

    Submitted 9 January, 2022; originally announced January 2022.

    Comments: To appear at AAAI 2022

  23. arXiv:2110.15456

    cs.LG cs.AR

    FAST: DNN Training Under Variable Precision Block Floating Point with Stochastic Rounding

    Authors: Sai Qian Zhang, Bradley McDanel, H. T. Kung

    Abstract: Block Floating Point (BFP) can efficiently support quantization for Deep Neural Network (DNN) training by providing a wide dynamic range via a shared exponent across a group of values. In this paper, we propose a Fast First, Accurate Second Training (FAST) system for DNNs, where the weights, activations, and gradients are represented in BFP. FAST supports matrix multiplication with variable precis…

    Submitted 28 October, 2021; originally announced October 2021.

  24. arXiv:2010.14391

    cs.AI cs.LG cs.MA

    Succinct and Robust Multi-Agent Communication With Temporal Message Control

    Authors: Sai Qian Zhang, Jieyu Lin, Qi Zhang

    Abstract: Recent studies have shown that introducing communication between agents can significantly improve overall performance in cooperative multi-agent reinforcement learning (MARL). However, existing communication schemes often require agents to exchange an excessive number of messages at run-time under a reliable communication channel, which hinders their practicality in many real-world situations. In th…

    Submitted 24 December, 2020; v1 submitted 27 October, 2020; originally announced October 2020.

  25. arXiv:2007.06389

    cs.CV cs.LG

    Term Revealing: Furthering Quantization at Run Time on Quantized DNNs

    Authors: H. T. Kung, Bradley McDanel, Sai Qian Zhang

    Abstract: We present a novel technique, called Term Revealing (TR), for furthering quantization at run time for improved performance of Deep Neural Networks (DNNs) already quantized with conventional quantization methods. TR operates on power-of-two terms in binary expressions of values. In computing a dot product, TR dynamically selects a fixed number of largest terms to use from the values of…

    Submitted 26 July, 2020; v1 submitted 13 July, 2020; originally announced July 2020.

    Comments: 13 pages, 19 figures, 4 tables. To appear in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2020. Update: Revised writing/figures and added more references for Section IV. Update: Revised Section IV writing/figures and added additional references on signed digit representations.

  26. arXiv:2003.03722

    cs.LG cs.CR stat.ML

    On the Robustness of Cooperative Multi-Agent Reinforcement Learning

    Authors: Jieyu Lin, Kristina Dzeparoska, Sai Qian Zhang, Alberto Leon-Garcia, Nicolas Papernot

    Abstract: In cooperative multi-agent reinforcement learning (c-MARL), agents learn to cooperatively take actions as a team to maximize a total team reward. We analyze the robustness of c-MARL to adversaries capable of attacking one of the agents on a team. Through the ability to manipulate this agent's observations, the adversary seeks to decrease the total team reward. Attacking c-MARL is challenging for…

    Submitted 8 March, 2020; originally announced March 2020.

  27. arXiv:1912.02057

    cs.LG eess.SP

    RTN: Reparameterized Ternary Network

    Authors: Yuhang Li, Xin Dong, Sai Qian Zhang, Haoli Bai, Yuanpeng Chen, Wei Wang

    Abstract: To deploy deep neural networks on resource-limited devices, quantization has been widely explored. In this work, we study extremely low-bit networks, which offer tremendous speed-up and memory savings with quantized activations and weights. We first bring up three omitted issues in extremely low-bit networks: the squashing range of quantized values; the gradient vanishing during backpropagation and t…

    Submitted 12 December, 2019; v1 submitted 4 December, 2019; originally announced December 2019.

    Comments: To appear at AAAI-20

  28. arXiv:1909.02682

    cs.LG stat.ML

    Efficient Communication in Multi-Agent Reinforcement Learning via Variance Based Control

    Authors: Sai Qian Zhang, Qi Zhang, Jieyu Lin

    Abstract: Multi-agent reinforcement learning (MARL) has recently received considerable attention due to its applicability to a wide range of real-world applications. However, achieving efficient communication among agents has always been an overarching problem in MARL. In this work, we propose Variance Based Control (VBC), a simple yet efficient technique to improve communication efficiency in MARL. By limi…

    Submitted 1 November, 2019; v1 submitted 5 September, 2019; originally announced September 2019.

  29. arXiv:1905.00462

    cs.LG

    Full-stack Optimization for Accelerating CNNs with FPGA Validation

    Authors: Bradley McDanel, Sai Qian Zhang, H. T. Kung, Xin Dong

    Abstract: We present a full-stack optimization framework for accelerating inference of CNNs (Convolutional Neural Networks) and validate the approach with field-programmable gate arrays (FPGA) implementations. By jointly optimizing CNN models, computing architectures, and hardware implementations, our full-stack approach achieves unprecedented performance in the trade-off space characterized by inference la…

    Submitted 1 May, 2019; originally announced May 2019.

  30. arXiv:1811.04770

    cs.LG cs.AR stat.ML

    Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization

    Authors: H. T. Kung, Bradley McDanel, Sai Qian Zhang

    Abstract: This paper describes a novel approach of packing sparse convolutional neural networks for their efficient systolic array implementations. By combining subsets of columns in the original filter matrix associated with a convolutional layer, we increase the utilization efficiency of the systolic array substantially (e.g., ~4x) due to the increased density of nonzeros in the resulting packed filter ma…

    Submitted 7 November, 2018; originally announced November 2018.

    Comments: To appear in ASPLOS 2019

  31. arXiv:1802.03373

    cs.NI

    InferBeam: A Fast Beam Alignment Protocol for Millimeter-wave Networking

    Authors: Sai Qian Zhang, H. T. Kung, Youngjune Gwon

    Abstract: We introduce fast millimeter-wave base station (BS) and its antenna sector selection for user equipment based on its location. Using a conditional random field inference model with specially designed parameters, which are robust to change of environment, InferBeam allows the use of measurement samples on best beam selection at a small number of locations to infer the rest dynamically. Compared to…

    Submitted 5 March, 2018; v1 submitted 9 February, 2018; originally announced February 2018.