-
MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation
Authors:
Weilun Feng,
Chuanguang Yang,
Haotong Qin,
Yuqi Li,
Xiangqi Li,
Zhulin An,
Libo Huang,
Boyu Diao,
Fuzhen Zhuang,
Michele Magno,
Yongjun Xu,
Yingli Tian,
Tingwen Huang
Abstract:
Diffusion models have demonstrated remarkable performance on vision generation tasks. However, their high computational complexity hinders their wide application on edge devices. Quantization has emerged as a promising technique for inference acceleration and memory reduction. However, existing quantization methods do not generalize well under extremely low-bit (2-4 bit) quantization, and directly applying them causes severe performance degradation. We identify that the existing quantization framework suffers from an outlier-unfriendly quantizer design, suboptimal initialization, and a suboptimal optimization strategy. We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models. From the quantization perspective, the imbalanced distribution caused by salient outliers is unfriendly to a uniform quantizer. We propose Flexible Z-Order Residual Mixed Quantization, which uses an efficient binary residual branch to provide flexible quantization steps for handling salient errors. From the optimization perspective, we theoretically analyze the convergence and optimality of the LoRA module and propose Object-Oriented Low-Rank Initialization, which uses the prior quantization error for informative initialization. We then propose Memory-based Temporal Relation Distillation, which constructs an online time-aware pixel queue for distilling long-term denoising temporal information and ensures overall temporal consistency between the quantized and full-precision models. Comprehensive experiments on various generation tasks show that MPQ-DMv2 surpasses current SOTA methods by a large margin on different architectures, especially under extremely low bit-widths.
Submitted 6 July, 2025;
originally announced July 2025.
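The MPQ-DMv2 abstract above notes that salient outliers make a weight distribution unfriendly to a uniform quantizer and that a binary residual branch can absorb part of the error. The sketch below illustrates only that general idea: a min-max uniform quantizer followed by a one-bit residual correction with a single shared scale. It is not the paper's Flexible Z-Order scheme; the function names, the min-max range, and the choice of scale (the mean absolute residual, which is the least-squares optimal scale for a sign pattern) are assumptions made for illustration.

```python
from typing import List


def uniform_quantize(weights: List[float], bits: int) -> List[float]:
    """Round each weight to the nearest of 2**bits evenly spaced levels."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1
    # A single outlier widens [lo, hi] and therefore the step size.
    step = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((w - lo) / step) * step for w in weights]


def add_binary_residual(weights: List[float], quantized: List[float]) -> List[float]:
    """Add alpha * sign(residual), where alpha is the mean absolute residual."""
    residual = [w - q for w, q in zip(weights, quantized)]
    alpha = sum(abs(r) for r in residual) / len(residual)
    return [q + (alpha if r >= 0 else -alpha) for q, r in zip(quantized, residual)]


def squared_error(a: List[float], b: List[float]) -> float:
    """Sum of squared differences between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical weights: four small values and one outlier (4.0).
weights = [-0.9, -0.2, 0.1, 0.3, 4.0]
base = uniform_quantize(weights, bits=2)
corrected = add_binary_residual(weights, base)
```

With this choice of scale the squared error after the correction is never larger than before it, because the correction removes n times alpha squared from the residual energy.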
-
Q-VDiT: Towards Accurate Quantization and Distillation of Video-Generation Diffusion Transformers
Authors:
Weilun Feng,
Chuanguang Yang,
Haotong Qin,
Xiangqi Li,
Yu Wang,
Zhulin An,
Libo Huang,
Boyu Diao,
Zixiang Zhao,
Yongjun Xu,
Michele Magno
Abstract:
Diffusion transformers (DiT) have demonstrated exceptional performance in video generation. However, their large number of parameters and high computational complexity limit their deployment on edge devices. Quantization can reduce storage requirements and accelerate inference by lowering the bit-width of model parameters. Yet, existing quantization methods for image generation models do not generalize well to video generation tasks. We identify two primary challenges: the loss of information during quantization and the misalignment between optimization objectives and the unique requirements of video generation. To address these challenges, we present Q-VDiT, a quantization framework specifically designed for video DiT models. From the quantization perspective, we propose the Token-aware Quantization Estimator (TQE), which compensates for quantization errors in both the token and feature dimensions. From the optimization perspective, we introduce Temporal Maintenance Distillation (TMD), which preserves the spatiotemporal correlations between frames and enables the optimization of each frame with respect to the overall video context. Our W3A6 Q-VDiT achieves a scene consistency of 23.40, setting a new benchmark and outperforming current state-of-the-art quantization methods by 1.9×. Code will be available at https://github.com/cantbebetter2/Q-VDiT.
Submitted 28 May, 2025;
originally announced May 2025.
-
BLAST: Balanced Sampling Time Series Corpus for Universal Forecasting Models
Authors:
Zezhi Shao,
Yujie Li,
Fei Wang,
Chengqing Yu,
Yisong Fu,
Tangwen Qian,
Bin Xu,
Boyu Diao,
Yongjun Xu,
Xueqi Cheng
Abstract:
The advent of universal time series forecasting models has revolutionized zero-shot forecasting across diverse domains, yet the critical role of data diversity in training these models remains underexplored. Existing large-scale time series datasets often suffer from inherent biases and imbalanced distributions, leading to suboptimal model performance and generalization. To address this gap, we introduce BLAST, a novel pre-training corpus designed to enhance data diversity through a balanced sampling strategy. First, BLAST incorporates 321 billion observations from publicly available datasets and employs a comprehensive suite of statistical metrics to characterize time series patterns. Then, to facilitate pattern-oriented sampling, the data is implicitly clustered using grid-based partitioning. Furthermore, by integrating grid sampling and grid mixup techniques, BLAST ensures a balanced and representative coverage of diverse patterns. Experimental results demonstrate that models pre-trained on BLAST achieve state-of-the-art performance with a fraction of the computational resources and training tokens required by existing methods. Our findings highlight the pivotal role of data diversity in improving both training efficiency and model performance for the universal forecasting task.
Submitted 26 May, 2025; v1 submitted 23 May, 2025;
originally announced May 2025.
-
PrePrompt: Predictive prompting for class incremental learning
Authors:
Libo Huang,
Zhulin An,
Chuanguang Yang,
Boyu Diao,
Fei Wang,
Yan Zeng,
Zhifeng Hao,
Yongjun Xu
Abstract:
Class Incremental Learning (CIL) based on pre-trained models offers a promising direction for open-world continual learning. Existing methods typically rely on correlation-based strategies, where an image's classification feature is used as a query to retrieve the most related key prompts and select the corresponding value prompts for training. However, these approaches face an inherent limitation: fitting the entire feature space of all tasks with only a few trainable prompts is fundamentally challenging. We propose Predictive Prompting (PrePrompt), a novel CIL framework that circumvents correlation-based limitations by leveraging pre-trained models' natural classification ability to predict task-specific prompts. Specifically, PrePrompt decomposes CIL into a two-stage prediction framework: task-specific prompt prediction followed by label prediction. While theoretically appealing, this framework risks bias toward recent classes because historical data for calibrating the older classifiers is missing. PrePrompt mitigates this by incorporating feature translation, dynamically balancing stability and plasticity. Experiments across multiple benchmarks demonstrate PrePrompt's superiority over state-of-the-art prompt-based CIL methods. Code is available at github.com/libo-huang/preprompt.
Submitted 17 May, 2025; v1 submitted 13 May, 2025;
originally announced May 2025.
-
A Sensitivity-Driven Expert Allocation Method in LoRA-MoE for Efficient Fine-Tuning
Authors:
Junzhou Xu,
Boyu Diao
Abstract:
As deep learning models expand, the pre-training-fine-tuning paradigm has become the standard approach for handling various downstream tasks. However, shared parameters can lead to diminished performance when dealing with complex datasets involving multiple tasks. While introducing Mixture-of-Experts (MoE) methods has alleviated this issue to some extent, it also significantly increases the number of parameters required for fine-tuning and the training time, introducing greater parameter redundancy. To address these challenges, we propose LoRA-SMoE (A Sensitivity-Driven Expert Allocation Method in LoRA-MoE for Efficient Fine-Tuning), a method that allocates the number of experts based on parameter sensitivity. This method rapidly assesses the sensitivity of different tasks to parameters by sampling a small amount of data and using gradient information. It then adaptively allocates expert numbers within a given budget. The process maintains memory consumption comparable to LoRA (Low-Rank Adaptation) while ensuring an efficient and resource-friendly fine-tuning procedure. Experimental results demonstrate that, compared to SOTA fine-tuning methods, our LoRA-SMoE approach can enhance model performance while reducing the number of trainable parameters. This significantly improves model performance in resource-constrained environments. Additionally, due to its efficient parameter sensitivity evaluation mechanism, LoRA-SMoE requires minimal computational overhead to optimize expert allocation, making it particularly suitable for scenarios with limited computational resources. All the code in this study will be made publicly available following the acceptance of the paper for publication. Source code is at https://github.com/EMLS-ICTCAS/LoRA-SMoE
Submitted 6 May, 2025;
originally announced May 2025.
-
A Nonlinear Hash-based Optimization Method for SpMV on GPUs
Authors:
Chen Yan,
Boyu Diao,
Hangda Liu,
Zhulin An,
Yongjun Xu
Abstract:
Sparse matrix-vector multiplication (SpMV) is a fundamental operation with a wide range of applications in scientific computing and artificial intelligence. However, the large scale and sparsity of the matrices often make it a performance bottleneck. In this paper, we highlight the effectiveness of hash-based techniques in optimizing sparse matrix reordering, introducing the Hash-based Partition (HBP) format, a lightweight SpMV approach. HBP retains the performance benefits of the 2D-partitioning method while leveraging the hash transformation's ability to group similar elements, thereby accelerating the pre-processing phase of sparse matrix reordering. Additionally, we achieve parallel load balancing across matrix blocks through a competitive method. Our experiments, conducted on both Nvidia Jetson AGX Orin and Nvidia RTX 4090, show that in the pre-processing step, our method offers an average speedup of 3.53 times compared to the sorting approach and 3.67 times compared to the dynamic programming method employed in Regu2D. Furthermore, in SpMV, our method achieves a maximum speedup of 3.32 times on Orin and 3.01 times on RTX 4090 against the CSR format on sparse matrices from the University of Florida Sparse Matrix Collection.
Submitted 11 April, 2025;
originally announced April 2025.
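The SpMV abstract above reports speedups against the CSR format. For readers unfamiliar with that baseline, here is a minimal sequential sketch of CSR storage and the row-by-row product y = A x. It shows only the reference format, not the HBP format or any GPU kernel; the function names are illustrative.

```python
from typing import List, Sequence, Tuple


def dense_to_csr(matrix: Sequence[Sequence[float]]) -> Tuple[List[float], List[int], List[int]]:
    """Convert a dense matrix to CSR arrays (values, col_idx, row_ptr)."""
    values: List[float] = []
    col_idx: List[int] = []
    row_ptr: List[int] = [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        # row_ptr[i + 1] marks where row i ends in values/col_idx.
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_spmv(values: Sequence[float], col_idx: Sequence[int],
             row_ptr: Sequence[int], x: Sequence[float]) -> List[float]:
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y: List[float] = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        # Rows hold different numbers of nonzeros, so the inner loop length varies.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


A = [[1, 0, 2],
     [0, 0, 3],
     [4, 5, 0]]
vals, cols, ptrs = dense_to_csr(A)
y = csr_spmv(vals, cols, ptrs, [1, 2, 3])
```

The varying inner-loop length per row is the source of the load imbalance that reordering and partitioning schemes aim to reduce on parallel hardware.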
-
Efficient Continual Learning through Frequency Decomposition and Integration
Authors:
Ruiqi Liu,
Boyu Diao,
Libo Huang,
Hangda Liu,
Chuanguang Yang,
Zhulin An,
Yongjun Xu
Abstract:
Continual learning (CL) aims to learn new tasks while retaining past knowledge, addressing the challenge of forgetting during task adaptation. Rehearsal-based methods, which replay previous samples, effectively mitigate forgetting. However, research on enhancing the efficiency of these methods, especially in resource-constrained environments, remains limited, hindering their application in real-world systems with dynamic data streams. The human perceptual system processes visual scenes through complementary frequency channels: low-frequency signals capture holistic cues, while high-frequency components convey structural details vital for fine-grained discrimination. Inspired by this, we propose the Frequency Decomposition and Integration Network (FDINet), a novel framework that decomposes and integrates information across frequencies. FDINet designs two lightweight networks to independently process low- and high-frequency components of images. When integrated with rehearsal-based methods, this frequency-aware design effectively enhances cross-task generalization through low-frequency information, preserves class-specific details using high-frequency information, and facilitates efficient training due to its lightweight architecture. Experiments demonstrate that FDINet reduces backbone parameters by 78%, improves accuracy by up to 7.49% over state-of-the-art (SOTA) methods, and decreases peak memory usage by up to 80%. Additionally, on edge devices, FDINet accelerates training by up to 5×.
Submitted 28 March, 2025;
originally announced March 2025.
-
Gensor: A Graph-based Construction Tensor Compilation Method for Deep Learning
Authors:
Hangda Liu,
Boyu Diao,
Yu Yang,
Wenxin Chen,
Xiaohui Peng,
Yongjun Xu
Abstract:
High-performance deep learning depends on efficient tensor programs. In recent years, automatic tensor program optimization, also known as tensor compilation, has emerged as the primary approach to generating efficient tensor programs. However, how to generate kernels with higher performance in a shorter time is still the key challenge. In this paper, we present Gensor, a graph-based construction tensor compilation method for deep learning, to further improve the performance of construction tensor compilation. Unlike existing tree-based methods, Gensor abstracts the construction space into a graph structure. Gensor then explores the construction space with Markov analysis. Gensor takes tensor programs as states and models scheduling primitives as transition actions between these states. Therefore, the process of tensor program construction optimization is abstracted as a graph traversal process. This approach expands the optimization space, improving operator performance while ensuring rapid optimization. Extensive experiments with typical operators demonstrate that Gensor significantly outperforms the state-of-the-art methods on GPUs for both cloud servers and edge devices. As a result, Gensor can generate operator kernels in seconds, with performance increasing by 18% on average, reaching a maximum of 30%. It also achieves high speedup for end-to-end models like ResNet-50 and GPT-2, with an average acceleration of 20%.
Submitted 16 February, 2025;
originally announced February 2025.
-
MPQ-DM: Mixed Precision Quantization for Extremely Low Bit Diffusion Models
Authors:
Weilun Feng,
Haotong Qin,
Chuanguang Yang,
Zhulin An,
Libo Huang,
Boyu Diao,
Fei Wang,
Renshuai Tao,
Yongjun Xu,
Michele Magno
Abstract:
Diffusion models have received wide attention in generation tasks. However, the expensive computation cost prevents the application of diffusion models in resource-constrained scenarios. Quantization emerges as a practical solution that significantly saves storage and computation by reducing the bit-width of parameters. However, the existing quantization methods for diffusion models still cause severe degradation in performance, especially under extremely low bit-widths (2-4 bit). The primary decrease in performance comes from the significant discretization of activation values at low bit quantization. Too few activation candidates are unfriendly for quantizing weight channels with significant outliers, and the discretized features prevent stable learning over different time steps of the diffusion model. This paper presents MPQ-DM, a Mixed-Precision Quantization method for Diffusion Models. The proposed MPQ-DM mainly relies on two techniques: (1) To mitigate the quantization error caused by weight channels with severe outliers, we propose an Outlier-Driven Mixed Quantization (OMQ) technique that uses kurtosis to quantify outlier-salient channels and applies optimized intra-layer mixed-precision bit-width allocation to recover accuracy within the target efficiency. (2) To robustly learn representations across time steps, we construct a Time-Smoothed Relation Distillation (TRD) scheme between the quantized diffusion model and its full-precision counterpart, transferring discrete and continuous latents to a unified relation space to reduce the representation inconsistency. Comprehensive experiments demonstrate that MPQ-DM achieves significant accuracy gains under extremely low bit-widths compared with SOTA quantization methods. MPQ-DM achieves a 58% FID decrease under the W2A4 setting compared with the baseline, while all other methods even collapse.
Submitted 16 December, 2024;
originally announced December 2024.
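The MPQ-DM abstract above uses kurtosis to quantify which weight channels are dominated by outliers. The sketch below computes the Pearson (non-excess) kurtosis of each channel and ranks channels from most to least heavy-tailed. It illustrates the statistic only; the paper's bit-width allocation is not reproduced, and the function names and sample channels are hypothetical.

```python
from typing import List, Sequence


def kurtosis(xs: Sequence[float]) -> float:
    """Pearson kurtosis: fourth central moment over squared variance."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    # A constant channel has zero variance; report 0.0 instead of dividing by zero.
    return m4 / (m2 * m2) if m2 > 0 else 0.0


def rank_channels_by_kurtosis(channels: Sequence[Sequence[float]]) -> List[int]:
    """Return channel indices ordered from highest to lowest kurtosis."""
    scores = [kurtosis(c) for c in channels]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical channels: one evenly spread, one with a single large outlier.
flat_channel = [-1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0]
outlier_channel = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.0]
order = rank_channels_by_kurtosis([flat_channel, outlier_channel])
```

A channel whose mass sits in a single outlier scores far higher than an evenly spread one, which is what makes kurtosis a convenient signal for deciding where extra bits might help.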
-
Relational Diffusion Distillation for Efficient Image Generation
Authors:
Weilun Feng,
Chuanguang Yang,
Zhulin An,
Libo Huang,
Boyu Diao,
Fei Wang,
Yongjun Xu
Abstract:
Although the diffusion model has achieved remarkable performance in the field of image generation, its high inference delay hinders its wide application on edge devices with scarce computing resources. Therefore, many training-free sampling methods have been proposed to reduce the number of sampling steps required for diffusion models. However, they perform poorly under a very small number of sampling steps. Thanks to the emergence of knowledge distillation technology, existing training-based methods have achieved excellent results at very low step numbers. However, the current methods mainly focus on designing novel diffusion model sampling methods with knowledge distillation. How to transfer better diffusion knowledge from teacher models is a more valuable problem but rarely studied. Therefore, we propose Relational Diffusion Distillation (RDD), a novel distillation method tailored specifically for distilling diffusion models. Unlike existing methods that simply align teacher and student models at the pixel level or on feature distributions, our method introduces cross-sample relationship interaction during the distillation process and alleviates the memory constraints induced by multiple sample interactions. Our RDD significantly enhances the effectiveness of the progressive distillation framework within the diffusion model. Extensive experiments on several datasets (e.g., CIFAR-10 and ImageNet) demonstrate that our proposed RDD leads to a 1.47 FID decrease under 1 sampling step compared to state-of-the-art diffusion distillation methods and achieves a 256x speed-up compared to the DDIM strategy. Code is available at https://github.com/cantbebetter2/RDD.
Submitted 15 December, 2024; v1 submitted 10 October, 2024;
originally announced October 2024.
-
Continual Learning in the Frequency Domain
Authors:
Ruiqi Liu,
Boyu Diao,
Libo Huang,
Zijia An,
Zhulin An,
Yongjun Xu
Abstract:
Continual learning (CL) is designed to learn new tasks while preserving existing knowledge. Replaying samples from earlier tasks has proven to be an effective method to mitigate the forgetting of previously acquired knowledge. However, the current research on the training efficiency of rehearsal-based methods is insufficient, which limits the practical application of CL systems in resource-limited scenarios. The human visual system (HVS) exhibits varying sensitivities to different frequency components, enabling the efficient elimination of visually redundant information. Inspired by HVS, we propose a novel framework called Continual Learning in the Frequency Domain (CLFD). To our knowledge, this is the first study to utilize frequency domain features to enhance the performance and efficiency of CL training on edge devices. For the input features of the feature extractor, CLFD employs wavelet transform to map the original input image into the frequency domain, thereby effectively reducing the size of input feature maps. Regarding the output features of the feature extractor, CLFD selectively utilizes output features for distinct classes for classification, thereby balancing the reusability and interference of output features based on the frequency domain similarity of the classes across various tasks. Optimizing only the input and output features of the feature extractor allows for seamless integration of CLFD with various rehearsal-based methods. Extensive experiments conducted in both cloud and edge environments demonstrate that CLFD consistently improves the performance of state-of-the-art (SOTA) methods in both precision and training efficiency. Specifically, CLFD can increase the accuracy of the SOTA CL method by up to 6.83% and reduce the training time by 2.6×.
Submitted 13 November, 2024; v1 submitted 9 October, 2024;
originally announced October 2024.
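The CLFD abstract above says a wavelet transform maps the input image into the frequency domain and shrinks the input feature maps. The sketch below shows why the size shrinks, using a single-level 2D Haar transform: each 2x2 pixel block becomes one coefficient in each of four sub-bands, so every sub-band has half the height and half the width. The abstract does not name the wavelet CLFD uses, so the Haar choice, the orthonormal scaling, and the sub-band names here are assumptions.

```python
from typing import List, Sequence, Tuple

Band = List[List[float]]


def haar_dwt2(image: Sequence[Sequence[float]]) -> Tuple[Band, Band, Band, Band]:
    """Single-level 2D Haar transform of an image with even height and width.

    Returns (approx, col_diff, row_diff, diag), each half the input size.
    """
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must both be even")
    approx: Band = []
    col_diff: Band = []
    row_diff: Band = []
    diag: Band = []
    for i in range(0, height, 2):
        a_row, c_row, r_row, d_row = [], [], [], []
        for j in range(0, width, 2):
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            a_row.append((a + b + c + d) / 2)  # low-frequency average
            c_row.append((a - b + c - d) / 2)  # difference between columns
            r_row.append((a + b - c - d) / 2)  # difference between rows
            d_row.append((a - b - c + d) / 2)  # diagonal difference
        approx.append(a_row)
        col_diff.append(c_row)
        row_diff.append(r_row)
        diag.append(d_row)
    return approx, col_diff, row_diff, diag


bands = haar_dwt2([[1, 2], [3, 4]])
flat = haar_dwt2([[3] * 4 for _ in range(4)])
```

For a smooth image most of the energy lands in the low-frequency sub-band, which is why a network fed selected sub-bands can work with smaller inputs.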
-
IOR: Inversed Objects Replay for Incremental Object Detection
Authors:
Zijia An,
Boyu Diao,
Libo Huang,
Ruiqi Liu,
Zhulin An,
Yongjun Xu
Abstract:
Existing Incremental Object Detection (IOD) methods partially alleviate catastrophic forgetting when incrementally detecting new objects in real-world scenarios. However, many of these methods rely on the assumption that unlabeled old-class objects may co-occur with labeled new-class objects in the incremental data. When unlabeled old-class objects are absent, the performance of existing methods tends to degrade. The absence can be mitigated by generating old-class samples, but it incurs high costs. This paper argues that previous generation-based IOD suffers from redundancy, both in the use of generative models, which require additional training and storage, and in the overproduction of generated samples, many of which do not contribute significantly to performance improvements. To eliminate the redundancy, we propose Inversed Objects Replay (IOR). Specifically, we generate old-class samples by inversing the original detectors, thus eliminating the necessity of training and storing additional generative models. We propose augmented replay to reuse the objects in generated samples, reducing redundant generations. Moreover, we propose high-value knowledge distillation focusing on the positions of old-class objects overwhelmed by the background, which transfers the knowledge to the incremental detector. Extensive experiments conducted on MS COCO 2017 demonstrate that our method can efficiently improve detection performance in IOD scenarios with the absence of old-class objects.
Submitted 16 January, 2025; v1 submitted 7 June, 2024;
originally announced June 2024.
-
E2Net: Resource-Efficient Continual Learning with Elastic Expansion Network
Authors:
RuiQi Liu,
Boyu Diao,
Libo Huang,
Zhulin An,
Yongjun Xu
Abstract:
Continual Learning methods are designed to learn new tasks without erasing previous knowledge. However, Continual Learning often requires massive computational power and storage capacity for satisfactory performance. In this paper, we propose a resource-efficient continual learning method called the Elastic Expansion Network (E2Net). Leveraging core subnet distillation and precise replay sample selection, E2Net achieves superior average accuracy and diminished forgetting within the same computational and storage constraints, all while minimizing processing time. In E2Net, we propose Representative Network Distillation to identify the representative core subnet by assessing parameter quantity and output similarity with the working network, distilling analogous subnets within the working network to mitigate reliance on rehearsal buffers and facilitating knowledge transfer across previous tasks. To enhance storage resource utilization, we then propose Subnet Constraint Experience Replay to optimize rehearsal efficiency through a sample storage strategy based on the structures of representative networks. Extensive experiments conducted predominantly on cloud environments with diverse datasets and also spanning the edge environment demonstrate that E2Net consistently outperforms state-of-the-art methods. In addition, our method outperforms competitors in terms of both storage and computational requirements.
Submitted 27 September, 2023;
originally announced September 2023.
-
CLIP-KD: An Empirical Study of CLIP Model Distillation
Authors:
Chuanguang Yang,
Zhulin An,
Libo Huang,
Junyu Bi,
Xinqiang Yu,
Han Yang,
Boyu Diao,
Yongjun Xu
Abstract:
Contrastive Language-Image Pre-training (CLIP) has become a promising language-supervised visual pre-training framework. This paper aims to distill small CLIP models supervised by a large teacher CLIP model. We propose several distillation strategies, including relation, feature, gradient and contrastive paradigms, to examine the effectiveness of CLIP-Knowledge Distillation (KD). We show that a simple feature mimicry with Mean Squared Error loss works surprisingly well. Moreover, interactive contrastive learning across teacher and student encoders is also effective in performance improvement. We explain that the success of CLIP-KD can be attributed to maximizing the feature similarity between teacher and student. The unified method is applied to distill several student models trained on CC3M+12M. CLIP-KD improves student CLIP models consistently over zero-shot ImageNet classification and cross-modal retrieval benchmarks. When using ViT-L/14 pretrained on Laion-400M as the teacher, CLIP-KD achieves 57.5% and 55.4% zero-shot top-1 ImageNet accuracy over ViT-B/16 and ResNet-50, surpassing the original CLIP without KD by 20.5% and 20.1% margins, respectively. Our code is released on https://github.com/winycg/CLIP-KD.
Submitted 7 May, 2024; v1 submitted 24 July, 2023;
originally announced July 2023.
-
eTag: Class-Incremental Learning with Embedding Distillation and Task-Oriented Generation
Authors:
Libo Huang,
Yan Zeng,
Chuanguang Yang,
Zhulin An,
Boyu Diao,
Yongjun Xu
Abstract:
Class-Incremental Learning (CIL) aims to solve the neural networks' catastrophic forgetting problem, which refers to the fact that once the network updates on a new task, its performance on previously-learned tasks drops dramatically. Most successful CIL methods incrementally train a feature extractor with the aid of stored exemplars, or estimate the feature distribution with the stored prototypes. However, the stored exemplars raise data privacy concerns, while the stored prototypes might not be consistent with a proper feature distribution, hindering the exploration of real-world CIL applications. In this paper, we propose a method of embedding distillation and Task-oriented generation (eTag) for CIL, which requires neither the exemplar nor the prototype. Instead, eTag trains the neural networks incrementally in a data-free manner. To prevent the feature extractor from forgetting, eTag distills the embeddings of the network's intermediate blocks. Additionally, eTag enables a generative network to produce suitable features, fitting the needs of the top incremental classifier. Experimental results confirm that our proposed eTag considerably outperforms the state-of-the-art methods on CIFAR-100 and ImageNet-sub. Our code is available in the Supplementary Materials.
Submitted 20 April, 2023;
originally announced April 2023.
-
A Distributed SGD Algorithm with Global Sketching for Deep Learning Training Acceleration
Authors:
LingFei Dai,
Boyu Diao,
Chao Li,
Yongjun Xu
Abstract:
Distributed training is an effective way to accelerate the training process of large-scale deep learning models. However, the parameter exchange and synchronization of distributed stochastic gradient descent introduce a large amount of communication overhead. Gradient compression is an effective method to reduce this overhead. Among synchronous SGD compression methods, many Top-$k$ sparsification based gradient compression methods have been proposed to reduce communication. However, the centralized method based on parameter servers suffers from a single point of failure and limited scalability, while the decentralized method with global parameter exchange may reduce the convergence rate of training. In contrast with Top-$k$ based methods, we propose a gradient compression method with global gradient vector sketching, named global-sketching SGD (gs-SGD), which uses the Count-Sketch structure to store the gradients and thereby reduce the loss of accuracy in the training process. gs-SGD has better convergence efficiency on deep learning models and a communication complexity of $O(\log d \cdot \log P)$, where $d$ is the number of model parameters and $P$ is the number of workers. We conducted experiments on GPU clusters to verify that our method has better convergence efficiency than global Top-$k$ and Sketching-based methods. In addition, gs-SGD achieves 1.3-3.1x higher throughput compared with gTop-$k$, and 1.1-1.2x higher throughput compared with the original Sketched-SGD.
Submitted 12 August, 2021;
originally announced August 2021.
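As a rough illustration of the Count-Sketch structure that the gs-SGD abstract above relies on, the following Python sketch shows why sketches suit gradient aggregation: they are linear, so the sketches of two workers can be summed to obtain the sketch of the summed gradient. This is not the authors' implementation. The class name `CountSketch`, its parameters, and the use of explicit per-coordinate bucket and sign tables (instead of hash functions) are assumptions made for readability.

```python
import random


class CountSketch:
    """Toy Count-Sketch of a dense vector (illustrative, not gs-SGD itself)."""

    def __init__(self, rows, cols, dim, seed=0):
        rng = random.Random(seed)
        self.rows, self.cols, self.dim = rows, cols, dim
        # For every row, each coordinate gets a bucket index and a +/-1 sign.
        # Assumption: explicit tables stand in for real hash functions.
        self.bucket = [[rng.randrange(cols) for _ in range(dim)] for _ in range(rows)]
        self.sign = [[rng.choice((-1, 1)) for _ in range(dim)] for _ in range(rows)]
        self.table = [[0.0] * cols for _ in range(rows)]

    def add(self, vec):
        """Accumulate a gradient vector into the sketch."""
        for r in range(self.rows):
            for i, value in enumerate(vec):
                self.table[r][self.bucket[r][i]] += self.sign[r][i] * value

    def merge(self, other):
        """Sum another sketch built with the same seed and shape."""
        if (self.rows, self.cols, self.dim) != (other.rows, other.cols, other.dim):
            raise ValueError("sketch shapes differ")
        for r in range(self.rows):
            for c in range(self.cols):
                self.table[r][c] += other.table[r][c]

    def estimate(self, i):
        """Estimate coordinate i as the median of its per-row readings."""
        readings = sorted(
            self.sign[r][i] * self.table[r][self.bucket[r][i]]
            for r in range(self.rows)
        )
        return readings[len(readings) // 2]
```

Workers that share the same seed produce compatible sketches, so calling `merge` on them yields the same table as sketching the summed gradient directly. A practical system would replace the stored tables with hash functions so that memory does not grow with the number of parameters.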
-
A Channel-Aware Routing Protocol With Nearest Neighbor Regression For Underwater Sensor Networks
Authors:
Boyu Diao,
Chao Li,
Qi Wang,
Zhulin An,
Yongjun Xu
Abstract:
The underwater acoustic channel is one of the most challenging communication channels. Due to periodic tidal and daily climatic variation, underwater noise fluctuates periodically, which results in periodic changes in acoustic channel quality over the long term. This time-variant channel quality also leads to routing failure. Routing protocols with acoustic channel estimation, namely underwater channel-aware routing protocols, have recently been proposed to maintain routing performance. However, the channel estimation algorithms for these routing protocols are mostly linear and rarely consider the periodicity of acoustic channels. In this paper, we introduce acoustic channel estimation based on nearest neighbor regression for underwater acoustic networks. We extend nearest neighbor regression to SNR (Signal-to-Noise Ratio) time series prediction, providing outstanding prediction accuracy for intricately periodic and fluctuating received SNR time series. Moreover, we propose a quick search algorithm and use statistical storage compression to optimize the time and space complexity of the algorithm. In contrast with linear methods, this algorithm significantly improves channel prediction accuracy (by more than three times in the best case) on both simulation and sea trial data sets. With this channel estimation method, we then propose a Depth-Based Channel-Aware Routing protocol (DBCAR). Taking advantage of depth-greedy forwarding and channel-aware reliable communication, DBCAR achieves outstanding network performance in packet delivery ratio, average energy consumption, and average transmission delay, which is validated through extensive simulations.
Submitted 14 August, 2021; v1 submitted 11 August, 2021;
originally announced August 2021.
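The idea of nearest neighbor regression on an SNR time series, as described in the abstract above, can be sketched as follows: find the past windows most similar to the most recent one and average the values that followed them. This is a plain textbook version under stated assumptions, not the paper's algorithm; it omits the quick search and statistical storage compression the authors propose, and the function name `knn_forecast` and its parameters are hypothetical.

```python
import math


def knn_forecast(series, window, k):
    """Predict the next value of a series from its k most similar past windows."""
    if len(series) <= window:
        raise ValueError("series must be longer than the window")
    query = series[-window:]  # the most recent observations
    candidates = []
    # Every earlier window whose successor value is already known.
    for start in range(len(series) - window):
        past = series[start:start + window]
        distance = math.dist(past, query)  # Euclidean distance (Python 3.8+)
        candidates.append((distance, series[start + window]))
    candidates.sort(key=lambda pair: pair[0])
    nearest = candidates[:k]
    # Average the values that followed the k closest windows.
    return sum(value for _, value in nearest) / len(nearest)
```

On a strictly periodic series the closest past windows are exact repeats of the current one, so the forecast reproduces the next value of the cycle. On noisy measurements, the choice of `window` and `k` trades responsiveness against smoothing.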
-
PFGDF: Pruning Filter via Gaussian Distribution Feature for Deep Neural Networks Acceleration
Authors:
Jianrong Xu,
Boyu Diao,
Bifeng Cui,
Kang Yang,
Chao Li,
Yongjun Xu
Abstract:
Deep learning has achieved impressive results in many areas, but deployment on intelligent edge devices is still very slow. To solve this problem, we propose a novel compression and acceleration method for deep neural networks based on data distribution characteristics, namely Pruning Filter via Gaussian Distribution Feature (PFGDF). Compared with previous advanced pruning methods, PFGDF compresses the model by removing filters that are insignificant in distribution, regardless of the contribution and sensitivity information of the convolution filter. PFGDF differs significantly from weight sparsification pruning because it does not require a special accelerated library to process the sparse weight matrix and introduces no extra parameters. The pruning process of PFGDF is automated. Furthermore, the model compressed by PFGDF can restore the same performance as the uncompressed model. We evaluate PFGDF through extensive experiments. On CIFAR-10, PFGDF compresses the convolution filters of VGG-16 by 66.62% with more than 90% of parameters removed, while inference time is accelerated by 83.73% on a Huawei MATE 10.
Submitted 26 May, 2022; v1 submitted 23 June, 2020;
originally announced June 2020.
-
Multi-Objective Pruning for CNNs Using Genetic Algorithm
Authors:
Chuanguang Yang,
Zhulin An,
Chao Li,
Boyu Diao,
Yongjun Xu
Abstract:
In this work, we propose a heuristic genetic algorithm (GA) for pruning convolutional neural networks (CNNs) according to the multi-objective trade-off among error, computation and sparsity. In our experiments, we apply our approach to prune a pre-trained LeNet on the MNIST dataset, which reduces parameter size by 95.42% and achieves a 16$\times$ speedup of convolutional layer computation with tiny accuracy loss by laying emphasis on sparsity and computation, respectively. Our empirical study suggests that GA is an alternative pruning approach for obtaining competitive compression performance. Additionally, compared with state-of-the-art approaches, GA is capable of automatically pruning CNNs based on multi-objective importance through a pre-defined fitness function.
Submitted 4 July, 2019; v1 submitted 2 June, 2019;
originally announced June 2019.
-
Mean-Field Games for Marriage
Authors:
Dario Bauso,
Ben Mansour Dia,
Boualem Djehiche,
Hamidou Tembine,
Raul Tempone
Abstract:
This article examines mean-field games for marriage. The results support the argument that optimizing long-term well-being through effort and the social feeling state distribution (mean-field) helps to stabilize marriage. However, if the cost of effort is very high, the couple fluctuates in a bad feeling state or the marriage breaks down. We then examine the influence of society on a couple using mean-field sentimental games. We show that, in mean-field equilibrium, the optimal effort is always higher than the one-shot optimal effort. We illustrate numerically the influence of the couple's network on their feeling states and their well-being.
Submitted 13 April, 2014;
originally announced April 2014.