-
Evaluating Loss Functions for Graph Neural Networks: Towards Pretraining and Generalization
Authors:
Khushnood Abbas,
Ruizhe Hou,
Zhou Wengang,
Dong Shi,
Niu Ling,
Satyaki Nan,
Alireza Abbasi
Abstract:
Graph Neural Networks (GNNs) became useful for learning on non-Euclidean data. However, their best performance depends on choosing the right model architecture and the training objective, also called the loss function. Researchers have studied these parts separately, but a large-scale evaluation has not looked at how GNN models and many loss functions work together across different tasks. To fix t…
▽ More
Graph Neural Networks (GNNs) became useful for learning on non-Euclidean data. However, their best performance depends on choosing the right model architecture and the training objective, also called the loss function. Researchers have studied these parts separately, but a large-scale evaluation has not looked at how GNN models and many loss functions work together across different tasks. To fix this, we ran a thorough study - it included seven well-known GNN architectures. We also used a large group of 30 single plus mixed loss functions. The study looked at both inductive and transductive settings. Our evaluation spanned three distinct real-world datasets, assessing performance in both inductive and transductive settings using 21 comprehensive evaluation metrics. From these extensive results (detailed in supplementary information 1 \& 2), we meticulously analyzed the top ten model-loss combinations for each metric based on their average rank. Our findings reveal that, especially for the inductive case: 1) Hybrid loss functions generally yield superior and more robust performance compared to single loss functions, indicating the benefit of multi-objective optimization. 2) The GIN architecture always showed the highest-level average performance, especially with Cross-Entropy loss. 3) Although some combinations had overall lower average ranks, models such as GAT, particularly with certain hybrid losses, demonstrated incredible specialized strengths, maximizing the most top-1 results among the individual metrics, emphasizing subtle strengths for particular task demands. 4) On the other hand, the MPNN architecture typically lagged behind the scenarios it was tested against.
△ Less
Submitted 16 June, 2025;
originally announced June 2025.
-
TVC: Tokenized Video Compression with Ultra-Low Bitrate
Authors:
Lebin Zhou,
Cihan Ruan,
Nam Ling,
Wei Wang,
Wei Jiang
Abstract:
Tokenized visual representations have shown great promise in image compression, yet their extension to video remains underexplored due to the challenges posed by complex temporal dynamics and stringent bitrate constraints. In this paper, we propose Tokenized Video Compression (TVC), the first token-based dual-stream video compression framework designed to operate effectively at ultra-low bitrates.…
▽ More
Tokenized visual representations have shown great promise in image compression, yet their extension to video remains underexplored due to the challenges posed by complex temporal dynamics and stringent bitrate constraints. In this paper, we propose Tokenized Video Compression (TVC), the first token-based dual-stream video compression framework designed to operate effectively at ultra-low bitrates. TVC leverages the powerful Cosmos video tokenizer to extract both discrete and continuous token streams. The discrete tokens (i.e., code maps generated by FSQ) are partially masked using a strategic masking scheme, then compressed losslessly with a discrete checkerboard context model to reduce transmission overhead. The masked tokens are reconstructed by a decoder-only transformer with spatiotemporal token prediction. Meanwhile, the continuous tokens, produced via an autoencoder (AE), are quantized and compressed using a continuous checkerboard context model, providing complementary continuous information at ultra-low bitrate. At the Decoder side, both streams are fused using ControlNet, with multi-scale hierarchical integration to ensure high perceptual quality alongside strong fidelity in reconstruction. This work mitigates the long-standing skepticism about the practicality of tokenized video compression and opens up new avenues for semantics-aware, token-native video compression.
△ Less
Submitted 12 May, 2025; v1 submitted 22 April, 2025;
originally announced April 2025.
-
Stereo Image Coding for Machines with Joint Visual Feature Compression
Authors:
Dengchao Jin,
Jianjun Lei,
Bo Peng,
Zhaoqing Pan,
Nam Ling,
Qingming Huang
Abstract:
2D image coding for machines (ICM) has achieved great success in coding efficiency, while less effort has been devoted to stereo image fields. To promote the efficiency of stereo image compression (SIC) and intelligent analysis, the stereo image coding for machines (SICM) is formulated and explored in this paper. More specifically, a machine vision-oriented stereo feature compression network (MVSF…
▽ More
2D image coding for machines (ICM) has achieved great success in coding efficiency, while less effort has been devoted to stereo image fields. To promote the efficiency of stereo image compression (SIC) and intelligent analysis, the stereo image coding for machines (SICM) is formulated and explored in this paper. More specifically, a machine vision-oriented stereo feature compression network (MVSFC-Net) is proposed for SICM, where the stereo visual features are effectively extracted, compressed, and transmitted for 3D visual task. To efficiently compress stereo visual features in MVSFC-Net, a stereo multi-scale feature compression (SMFC) module is designed to gradually transform sparse stereo multi-scale features into compact joint visual representations by removing spatial, inter-view, and cross-scale redundancies simultaneously. Experimental results show that the proposed MVSFC-Net obtains superior compression efficiency as well as 3D visual task performance, when compared with the existing ICM anchors recommended by MPEG and the state-of-the-art SIC method.
△ Less
Submitted 19 February, 2025;
originally announced February 2025.
-
TimelyLLM: Segmented LLM Serving System for Time-sensitive Robotic Applications
Authors:
Neiwen Ling,
Guojun Chen,
Lin Zhong
Abstract:
Large Language Models (LLMs) such as GPT-4 and Llama3 can already comprehend complex commands and process diverse tasks. This advancement facilitates their application in controlling drones and robots for various tasks. However, existing LLM serving systems typically employ a first-come, first-served (FCFS) batching mechanism, which fails to address the time-sensitive requirements of robotic appli…
▽ More
Large Language Models (LLMs) such as GPT-4 and Llama3 can already comprehend complex commands and process diverse tasks. This advancement facilitates their application in controlling drones and robots for various tasks. However, existing LLM serving systems typically employ a first-come, first-served (FCFS) batching mechanism, which fails to address the time-sensitive requirements of robotic applications. To address it, this paper proposes a new system named TimelyLLM serving multiple robotic agents with time-sensitive requests. TimelyLLM introduces novel mechanisms of segmented generation and scheduling that optimally leverage redundancy between robot plan generation and execution phases. We report an implementation of TimelyLLM on a widely-used LLM serving framework and evaluate it on a range of robotic applications. Our evaluation shows that TimelyLLM improves the time utility up to 1.97x, and reduces the overall waiting time by 84%.
△ Less
Submitted 24 December, 2024;
originally announced December 2024.
-
WTDUN: Wavelet Tree-Structured Sampling and Deep Unfolding Network for Image Compressed Sensing
Authors:
Kai Han,
Jin Wang,
Yunhui Shi,
Hanqin Cai,
Nam Ling,
Baocai Yin
Abstract:
Deep unfolding networks have gained increasing attention in the field of compressed sensing (CS) owing to their theoretical interpretability and superior reconstruction performance. However, most existing deep unfolding methods often face the following issues: 1) they learn directly from single-channel images, leading to a simple feature representation that does not fully capture complex features;…
▽ More
Deep unfolding networks have gained increasing attention in the field of compressed sensing (CS) owing to their theoretical interpretability and superior reconstruction performance. However, most existing deep unfolding methods often face the following issues: 1) they learn directly from single-channel images, leading to a simple feature representation that does not fully capture complex features; and 2) they treat various image components uniformly, ignoring the characteristics of different components. To address these issues, we propose a novel wavelet-domain deep unfolding framework named WTDUN, which operates directly on the multi-scale wavelet subbands. Our method utilizes the intrinsic sparsity and multi-scale structure of wavelet coefficients to achieve a tree-structured sampling and reconstruction, effectively capturing and highlighting the most important features within images. Specifically, the design of tree-structured reconstruction aims to capture the inter-dependencies among the multi-scale subbands, enabling the identification of both fine and coarse features, which can lead to a marked improvement in reconstruction quality. Furthermore, a wavelet domain adaptive sampling method is proposed to greatly improve the sampling capability, which is realized by assigning measurements to each wavelet subband based on its importance. Unlike pure deep learning methods that treat all components uniformly, our method introduces a targeted focus on important subbands, considering their energy and sparsity. This targeted strategy lets us capture key information more efficiently while discarding less important information, resulting in a more effective and detailed reconstruction. Extensive experimental results on various datasets validate the superior performance of our proposed method.
△ Less
Submitted 25 November, 2024;
originally announced November 2024.
-
GameIR: A Large-Scale Synthesized Ground-Truth Dataset for Image Restoration over Gaming Content
Authors:
Lebin Zhou,
Kun Han,
Nam Ling,
Wei Wang,
Wei Jiang
Abstract:
Image restoration methods like super-resolution and image synthesis have been successfully used in commercial cloud gaming products like NVIDIA's DLSS. However, restoration over gaming content is not well studied by the general public. The discrepancy is mainly caused by the lack of ground-truth gaming training data that match the test cases. Due to the unique characteristics of gaming content, th…
▽ More
Image restoration methods like super-resolution and image synthesis have been successfully used in commercial cloud gaming products like NVIDIA's DLSS. However, restoration over gaming content is not well studied by the general public. The discrepancy is mainly caused by the lack of ground-truth gaming training data that match the test cases. Due to the unique characteristics of gaming content, the common approach of generating pseudo training data by degrading the original HR images results in inferior restoration performance. In this work, we develop GameIR, a large-scale high-quality computer-synthesized ground-truth dataset to fill in the blanks, targeting at two different applications. The first is super-resolution with deferred rendering, to support the gaming solution of rendering and transferring LR images only and restoring HR images on the client side. We provide 19200 LR-HR paired ground-truth frames coming from 640 videos rendered at 720p and 1440p for this task. The second is novel view synthesis (NVS), to support the multiview gaming solution of rendering and transferring part of the multiview frames and generating the remaining frames on the client side. This task has 57,600 HR frames from 960 videos of 160 scenes with 6 camera views. In addition to the RGB frames, the GBuffers during the deferred rendering stage are also provided, which can be used to help restoration. Furthermore, we evaluate several SOTA super-resolution algorithms and NeRF-based NVS algorithms over our dataset, which demonstrates the effectiveness of our ground-truth GameIR data in improving restoration performance for gaming content. Also, we test the method of incorporating the GBuffers as additional input information for helping super-resolution and NVS. We release our dataset and models to the general public to facilitate research on restoration methods over gaming content.
△ Less
Submitted 29 August, 2024;
originally announced August 2024.
-
HAAP: Vision-context Hierarchical Attention Autoregressive with Adaptive Permutation for Scene Text Recognition
Authors:
Honghui Chen,
Yuhang Qiu,
Jiabao Wang,
Pingping Chen,
Nam Ling
Abstract:
Internal Language Model (LM)-based methods use permutation language modeling (PLM) to solve the error correction caused by conditional independence in external LM-based methods. However, random permutations of human interference cause fit oscillations in the model training, and Iterative Refinement (IR) operation to improve multimodal information decoupling also introduces additional overhead. To…
▽ More
Internal Language Model (LM)-based methods use permutation language modeling (PLM) to solve the error correction caused by conditional independence in external LM-based methods. However, random permutations of human interference cause fit oscillations in the model training, and Iterative Refinement (IR) operation to improve multimodal information decoupling also introduces additional overhead. To address these issues, this paper proposes the Hierarchical Attention autoregressive Model with Adaptive Permutation (HAAP) to enhance the location-context-image interaction capability, improving autoregressive generalization with internal LM. First, we propose Implicit Permutation Neurons (IPN) to generate adaptive attention masks to dynamically exploit token dependencies. The adaptive masks increase the diversity of training data and prevent model dependency on a specific order. It reduces the training overhead of PLM while avoiding training fit oscillations. Second, we develop Cross-modal Hierarchical Attention mechanism (CHA) to couple context and image features. This processing establishes rich positional semantic dependencies between context and image while avoiding IR. Extensive experimental results show the proposed HAAP achieves state-of-the-art (SOTA) performance in terms of accuracy, complexity, and latency on several datasets.
△ Less
Submitted 15 May, 2024;
originally announced May 2024.
-
Soar: Design and Deployment of A Smart Roadside Infrastructure System for Autonomous Driving
Authors:
Shuyao Shi,
Neiwen Ling,
Zhehao Jiang,
Xuan Huang,
Yuze He,
Xiaoguang Zhao,
Bufang Yang,
Chen Bian,
Jingfei Xia,
Zhenyu Yan,
Raymond Yeung,
Guoliang Xing
Abstract:
Recently,smart roadside infrastructure (SRI) has demonstrated the potential of achieving fully autonomous driving systems. To explore the potential of infrastructure-assisted autonomous driving, this paper presents the design and deployment of Soar, the first end-to-end SRI system specifically designed to support autonomous driving systems. Soar consists of both software and hardware components ca…
▽ More
Recently,smart roadside infrastructure (SRI) has demonstrated the potential of achieving fully autonomous driving systems. To explore the potential of infrastructure-assisted autonomous driving, this paper presents the design and deployment of Soar, the first end-to-end SRI system specifically designed to support autonomous driving systems. Soar consists of both software and hardware components carefully designed to overcome various system and physical challenges. Soar can leverage the existing operational infrastructure like street lampposts for a lower barrier of adoption. Soar adopts a new communication architecture that comprises a bi-directional multi-hop I2I network and a downlink I2V broadcast service, which are designed based on off-the-shelf 802.11ac interfaces in an integrated manner. Soar also features a hierarchical DL task management framework to achieve desirable load balancing among nodes and enable them to collaborate efficiently to run multiple data-intensive autonomous driving applications. We deployed a total of 18 Soar nodes on existing lampposts on campus, which have been operational for over two years. Our real-world evaluation shows that Soar can support a diverse set of autonomous driving applications and achieve desirable real-time performance and high communication reliability. Our findings and experiences in this work offer key insights into the development and deployment of next-generation smart roadside infrastructure and autonomous driving systems.
△ Less
Submitted 21 April, 2024;
originally announced April 2024.
-
TypeFly: Flying Drones with Large Language Model
Authors:
Guojun Chen,
Xiaojing Yu,
Neiwen Ling,
Lin Zhong
Abstract:
Recent advancements in robot control using large language models (LLMs) have demonstrated significant potential, primarily due to LLMs' capabilities to understand natural language commands and generate executable plans in various languages. However, in real-time and interactive applications involving mobile robots, particularly drones, the sequential token generation process inherent to LLMs intro…
▽ More
Recent advancements in robot control using large language models (LLMs) have demonstrated significant potential, primarily due to LLMs' capabilities to understand natural language commands and generate executable plans in various languages. However, in real-time and interactive applications involving mobile robots, particularly drones, the sequential token generation process inherent to LLMs introduces substantial latency, i.e. response time, in control plan generation.
In this paper, we present a system called ChatFly that tackles this problem using a combination of a novel programming language called MiniSpec and its runtime to reduce the plan generation time and drone response time. That is, instead of asking an LLM to write a program (robotic plan) in the popular but verbose Python, ChatFly gets it to do it in MiniSpec specially designed for token efficiency and stream interpretation. Using a set of challenging drone tasks, we show that design choices made by ChatFly can reduce up to 62% response time and provide a more consistent user experience, enabling responsive and intelligent LLM-based drone control with efficient completion.
△ Less
Submitted 26 September, 2024; v1 submitted 8 December, 2023;
originally announced December 2023.
-
EdgeFM: Leveraging Foundation Model for Open-set Learning on the Edge
Authors:
Bufang Yang,
Lixing He,
Neiwen Ling,
Zhenyu Yan,
Guoliang Xing,
Xian Shuai,
Xiaozhe Ren,
Xin Jiang
Abstract:
Deep Learning (DL) models have been widely deployed on IoT devices with the help of advancements in DL algorithms and chips. However, the limited resources of edge devices make these on-device DL models hard to be generalizable to diverse environments and tasks. Although the recently emerged foundation models (FMs) show impressive generalization power, how to effectively leverage the rich knowledg…
▽ More
Deep Learning (DL) models have been widely deployed on IoT devices with the help of advancements in DL algorithms and chips. However, the limited resources of edge devices make these on-device DL models hard to be generalizable to diverse environments and tasks. Although the recently emerged foundation models (FMs) show impressive generalization power, how to effectively leverage the rich knowledge of FMs on resource-limited edge devices is still not explored. In this paper, we propose EdgeFM, a novel edge-cloud cooperative system with open-set recognition capability. EdgeFM selectively uploads unlabeled data to query the FM on the cloud and customizes the specific knowledge and architectures for edge models. Meanwhile, EdgeFM conducts dynamic model switching at run-time taking into account both data uncertainty and dynamic network variations, which ensures the accuracy always close to the original FM. We implement EdgeFM using two FMs on two edge platforms. We evaluate EdgeFM on three public datasets and two self-collected datasets. Results show that EdgeFM can reduce the end-to-end latency up to 3.2x and achieve 34.3% accuracy increase compared with the baseline.
△ Less
Submitted 22 November, 2023; v1 submitted 18 November, 2023;
originally announced November 2023.
-
Timely Fusion of Surround Radar/Lidar for Object Detection in Autonomous Driving Systems
Authors:
Wenjing Xie,
Tao Hu,
Neiwen Ling,
Guoliang Xing,
Chun Jason Xue,
Nan Guan
Abstract:
Fusing Radar and Lidar sensor data can fully utilize their complementary advantages and provide more accurate reconstruction of the surrounding for autonomous driving systems. Surround Radar/Lidar can provide 360-degree view sampling with the minimal cost, which are promising sensing hardware solutions for autonomous driving systems. However, due to the intrinsic physical constraints, the rotating…
▽ More
Fusing Radar and Lidar sensor data can fully utilize their complementary advantages and provide more accurate reconstruction of the surrounding for autonomous driving systems. Surround Radar/Lidar can provide 360-degree view sampling with the minimal cost, which are promising sensing hardware solutions for autonomous driving systems. However, due to the intrinsic physical constraints, the rotating speed of surround Radar, and thus the frequency to generate Radar data frames, is much lower than surround Lidar. Existing Radar/Lidar fusion methods have to work at the low frequency of surround Radar, which cannot meet the high responsiveness requirement of autonomous driving systems.This paper develops techniques to fuse surround Radar/Lidar with working frequency only limited by the faster surround Lidar instead of the slower surround Radar, based on the state-of-the-art object detection model MVDNet. The basic idea of our approach is simple: we let MVDNet work with temporally unaligned data from Radar/Lidar, so that fusion can take place at any time when a new Lidar data frame arrives, instead of waiting for the slow Radar data frame. However, directly applying MVDNet to temporally unaligned Radar/Lidar data greatly degrades its object detection accuracy. The key information revealed in this paper is that we can achieve high output frequency with little accuracy loss by enhancing the training procedure to explore the temporal redundancy in MVDNet so that it can tolerate the temporal unalignment of input data. We explore several different ways of training enhancement and compare them quantitatively with experiments.
△ Less
Submitted 27 May, 2024; v1 submitted 9 September, 2023;
originally announced September 2023.
-
Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU
Authors:
Zhihe Zhao,
Neiwen Ling,
Nan Guan,
Guoliang Xing
Abstract:
Many applications such as autonomous driving and augmented reality, require the concurrent running of multiple deep neural networks (DNN) that poses different levels of real-time performance requirements. However, coordinating multiple DNN tasks with varying levels of criticality on edge GPUs remains an area of limited study. Unlike server-level GPUs, edge GPUs are resource-limited and lack hardwa…
▽ More
Many applications such as autonomous driving and augmented reality, require the concurrent running of multiple deep neural networks (DNN) that poses different levels of real-time performance requirements. However, coordinating multiple DNN tasks with varying levels of criticality on edge GPUs remains an area of limited study. Unlike server-level GPUs, edge GPUs are resource-limited and lack hardware-level resource management mechanisms for avoiding resource contention. Therefore, we propose Miriam, a contention-aware task coordination framework for multi-DNN inference on edge GPU. Miriam consolidates two main components, an elastic-kernel generator, and a runtime dynamic kernel coordinator, to support mixed critical DNN inference. To evaluate Miriam, we build a new DNN inference benchmark based on CUDA with diverse representative DNN workloads. Experiments on two edge GPU platforms show that Miriam can increase system throughput by 92% while only incurring less than 10\% latency overhead for critical tasks, compared to state of art baselines.
△ Less
Submitted 10 July, 2023;
originally announced July 2023.
-
Moses: Efficient Exploitation of Cross-device Transferable Features for Tensor Program Optimization
Authors:
Zhihe Zhao,
Xian Shuai,
Yang Bai,
Neiwen Ling,
Nan Guan,
Zhenyu Yan,
Guoliang Xing
Abstract:
Achieving efficient execution of machine learning models has attracted significant attention recently. To generate tensor programs efficiently, a key component of DNN compilers is the cost model that can predict the performance of each configuration on specific devices. However, due to the rapid emergence of hardware platforms, it is increasingly labor-intensive to train domain-specific predictors…
▽ More
Achieving efficient execution of machine learning models has attracted significant attention recently. To generate tensor programs efficiently, a key component of DNN compilers is the cost model that can predict the performance of each configuration on specific devices. However, due to the rapid emergence of hardware platforms, it is increasingly labor-intensive to train domain-specific predictors for every new platform. Besides, current design of cost models cannot provide transferable features between different hardware accelerators efficiently and effectively. In this paper, we propose Moses, a simple and efficient design based on the lottery ticket hypothesis, which fully takes advantage of the features transferable to the target device via domain adaptation. Compared with state-of-the-art approaches, Moses achieves up to 1.53X efficiency gain in the search stage and 1.41X inference speedup on challenging DNN benchmarks.
△ Less
Submitted 14 January, 2022;
originally announced January 2022.
-
Novel View Synthesis from a Single Image via Unsupervised learning
Authors:
Bingzheng Liu,
Jianjun Lei,
Bo Peng,
Chuanbo Yu,
Wanqing Li,
Nam Ling
Abstract:
View synthesis aims to generate novel views from one or more given source views. Although existing methods have achieved promising performance, they usually require paired views of different poses to learn a pixel transformation. This paper proposes an unsupervised network to learn such a pixel transformation from a single source viewpoint. In particular, the network consists of a token transforma…
▽ More
View synthesis aims to generate novel views from one or more given source views. Although existing methods have achieved promising performance, they usually require paired views of different poses to learn a pixel transformation. This paper proposes an unsupervised network to learn such a pixel transformation from a single source viewpoint. In particular, the network consists of a token transformation module (TTM) that facilities the transformation of the features extracted from a source viewpoint image into an intrinsic representation with respect to a pre-defined reference pose and a view generation module (VGM) that synthesizes an arbitrary view from the representation. The learned transformation allows us to synthesize a novel view from any single source viewpoint image of unknown pose. Experiments on the widely used view synthesis datasets have demonstrated that the proposed network is able to produce comparable results to the state-of-the-art methods despite the fact that learning is unsupervised and only a single source viewpoint image is required for generating a novel view. The code will be available soon.
△ Less
Submitted 29 October, 2021;
originally announced October 2021.
-
TempNodeEmb:Temporal Node Embedding considering temporal edge influence matrix
Authors:
Khushnood Abbas,
Alireza Abbasi,
Dong Shi,
Niu Ling,
Mingsheng Shang,
Chen Liong,
Bolun Chen
Abstract:
Understanding the evolutionary patterns of real-world evolving complex systems such as human interactions, transport networks, biological interactions, and computer networks has important implications in our daily lives. Predicting future links among the nodes in such networks reveals an important aspect of the evolution of temporal networks. To analyse networks, they are mapped to adjacency matri…
▽ More
Understanding the evolutionary patterns of real-world evolving complex systems such as human interactions, transport networks, biological interactions, and computer networks has important implications in our daily lives. Predicting future links among the nodes in such networks reveals an important aspect of the evolution of temporal networks. To analyse networks, they are mapped to adjacency matrices, however, a single adjacency matrix cannot represent complex relationships (e.g. temporal pattern), and therefore, some approaches consider a simplified representation of temporal networks but in high-dimensional and generally sparse matrices. As a result, adjacency matrices cannot be directly used by machine learning models for making network or node level predictions. To overcome this problem, automated frameworks are proposed for learning low-dimensional vectors for nodes or edges, as state-of-the-art techniques in predicting temporal patterns in networks such as link prediction. However, these models fail to consider temporal dimensions of the networks. This gap motivated us to propose in this research a new node embedding technique which exploits the evolving nature of the networks considering a simple three-layer graph neural network at each time step, and extracting node orientation by Given's angle method. To prove our proposed algorithm's efficiency, we evaluated the efficiency of our proposed algorithm against six state-of-the-art benchmark network embedding models, on four real temporal networks data, and the results show our model outperforms other methods in predicting future links in temporal networks.
△ Less
Submitted 16 August, 2020;
originally announced August 2020.
-
Improved Image Coding Autoencoder With Deep Learning
Authors:
Licheng Xiao,
Hairong Wang,
Nam Ling
Abstract:
In this paper, we build autoencoder based pipelines for extreme end-to-end image compression based on Ballé's approach, which is the state-of-the-art open source implementation in image compression using deep learning. We deepened the network by adding one more hidden layer before each strided convolutional layer with exactly the same number of down-samplings and up-samplings. Our approach outperf…
▽ More
In this paper, we build autoencoder based pipelines for extreme end-to-end image compression based on Ballé's approach, which is the state-of-the-art open source implementation in image compression using deep learning. We deepened the network by adding one more hidden layer before each strided convolutional layer with exactly the same number of down-samplings and up-samplings. Our approach outperformed Ballé's approach, and achieved around 4.0% reduction in bits per pixel (bpp), 0.03% increase in multi-scale structural similarity (MS-SSIM), and only 0.47% decrease in peak signal-to-noise ratio (PSNR), It also outperforms all traditional image compression methods including JPEG2000 and HEIC by at least 20% in terms of compression efficiency at similar reconstruction image quality. Regarding encoding and decoding time, our approach takes similar amount of time compared with traditional methods with the support of GPU, which means it's almost ready for industrial applications.
△ Less
Submitted 27 February, 2020;
originally announced February 2020.
-
HSCS: Hierarchical Sparsity Based Co-saliency Detection for RGBD Images
Authors:
Runmin Cong,
Jianjun Lei,
Huazhu Fu,
Qingming Huang,
Xiaochun Cao,
Nam Ling
Abstract:
Co-saliency detection aims to discover common and salient objects in an image group containing more than two relevant images. Moreover, depth information has been demonstrated to be effective for many computer vision tasks. In this paper, we propose a novel co-saliency detection method for RGBD images based on hierarchical sparsity reconstruction and energy function refinement. With the assistance…
▽ More
Co-saliency detection aims to discover common and salient objects in an image group containing more than two relevant images. Moreover, depth information has been demonstrated to be effective for many computer vision tasks. In this paper, we propose a novel co-saliency detection method for RGBD images based on hierarchical sparsity reconstruction and energy function refinement. With the assistance of the intra saliency map, the inter-image correspondence is formulated as a hierarchical sparsity reconstruction framework. The global sparsity reconstruction model with a ranking scheme focuses on capturing the global characteristics among the whole image group through a common foreground dictionary. The pairwise sparsity reconstruction model aims to explore the corresponding relationship between pairwise images through a set of pairwise dictionaries. In order to improve the intra-image smoothness and inter-image consistency, an energy function refinement model is proposed, which includes the unary data term, spatial smooth term, and holistic consistency term. Experiments on two RGBD co-saliency detection benchmarks demonstrate that the proposed method outperforms the state-of-the-art algorithms both qualitatively and quantitatively.
△ Less
Submitted 16 November, 2018;
originally announced November 2018.