Showing 1–50 of 140 results for author: Pu, S

Searching in archive cs.
  1. arXiv:2505.10322  [pdf, other]

    cs.LG math.OC

    Asynchronous Decentralized SGD under Non-Convexity: A Block-Coordinate Descent Framework

    Authors: Yijie Zhou, Shi Pu

    Abstract: Decentralized optimization has become vital for leveraging distributed data without central control, enhancing scalability and privacy. However, practical deployments face fundamental challenges due to heterogeneous computation speeds and unpredictable communication delays. This paper introduces a refined model of Asynchronous Decentralized Stochastic Gradient Descent (ADSGD) under practical assum…

    Submitted 15 May, 2025; originally announced May 2025.

  2. arXiv:2503.17489  [pdf, other]

    cs.CL cs.CV

    Judge Anything: MLLM as a Judge Across Any Modality

    Authors: Shu Pu, Yaochen Wang, Dongping Chen, Yuhang Chen, Guohao Wang, Qi Qin, Zhongyi Zhang, Zhiyuan Zhang, Zetong Zhou, Shuang Gong, Yi Gui, Yao Wan, Philip S. Yu

    Abstract: Evaluating generative foundation models on open-ended multimodal understanding (MMU) and generation (MMG) tasks across diverse modalities (e.g., images, audio, video) poses significant challenges due to the complexity of cross-modal interactions. To this end, the idea of utilizing Multimodal LLMs (MLLMs) as automated judges has emerged, with encouraging results in assessing vision-language underst…

    Submitted 21 March, 2025; originally announced March 2025.

  3. arXiv:2503.16123  [pdf, other]

    math.OC cs.LG

    Distributed Learning over Arbitrary Topology: Linear Speed-Up with Polynomial Transient Time

    Authors: Runze You, Shi Pu

    Abstract: We study a distributed learning problem in which $n$ agents, each with potentially heterogeneous local data, collaboratively minimize the sum of their local cost functions via peer-to-peer communication. We propose a novel algorithm, Spanning Tree Push-Pull (STPP), which employs two spanning trees extracted from a general communication graph to distribute both model parameters and stochastic gradi…

    Submitted 20 March, 2025; originally announced March 2025.

  4. arXiv:2411.17188  [pdf, other]

    cs.CV cs.CL

    Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment

    Authors: Dongping Chen, Ruoxi Chen, Shu Pu, Zhaoyi Liu, Yanru Wu, Caixi Chen, Benlin Liu, Yue Huang, Yao Wan, Pan Zhou, Ranjay Krishna

    Abstract: Many real-world user queries (e.g. "How to make egg fried rice?") could benefit from systems capable of generating responses with both textual steps and accompanying images, similar to a cookbook. Models designed to generate interleaved text and images face challenges in ensuring consistency within and across these modalities. To address these challenges, we present ISG, a comprehensive evalua…

    Submitted 24 March, 2025; v1 submitted 26 November, 2024; originally announced November 2024.

    Comments: Accepted by ICLR 2025 as Spotlight. Project homepage: https://interleave-eval.github.io/

  5. arXiv:2411.12591  [pdf, other]

    cs.CV cs.AI

    Thinking Before Looking: Improving Multimodal LLM Reasoning via Mitigating Visual Hallucination

    Authors: Haojie Zheng, Tianyang Xu, Hanchi Sun, Shu Pu, Ruoxi Chen, Lichao Sun

    Abstract: Multimodal large language models (MLLMs) have advanced the integration of visual and linguistic modalities, establishing themselves as the dominant paradigm for visual-language tasks. Current approaches like chain of thought (CoT) reasoning have augmented the cognitive capabilities of large language models (LLMs), yet their adaptation to MLLMs is hindered by heightened risks of hallucination in cr…

    Submitted 15 November, 2024; originally announced November 2024.

  6. arXiv:2409.18971  [pdf, other]

    cs.MM cs.AI cs.SD eess.AS

    Early Joint Learning of Emotion Information Makes MultiModal Model Understand You Better

    Authors: Mengying Ge, Mingyang Li, Dongkai Tang, Pengbo Li, Kuo Liu, Shuhao Deng, Songbai Pu, Long Liu, Yang Song, Tao Zhang

    Abstract: In this paper, we present our solutions for emotion recognition in the sub-challenges of Multimodal Emotion Recognition Challenge (MER2024). To mitigate the modal competition issue between audio and text, we adopt an early fusion strategy based on a large language model, where joint training of audio and text is conducted initially. The joint Audio-Text modal feature will then be late-fused with ot…

    Submitted 12 September, 2024; originally announced September 2024.

  7. arXiv:2406.11410  [pdf, other]

    cs.CL cs.AI

    HARE: HumAn pRiors, a key to small language model Efficiency

    Authors: Lingyun Zhang, Bin jin, Gaojian Ge, Lunhui Liu, Xuewen Shen, Mingyong Wu, Houqian Zhang, Yongneng Jiang, Shiqi Chen, Shi Pu

    Abstract: Human priors play a crucial role in efficiently utilizing data in deep learning. However, with the development of large language models (LLMs), there is an increasing emphasis on scaling both model size and data volume, which often diminishes the importance of human priors in data construction. Influenced by these trends, existing Small Language Models (SLMs) mainly rely on web-scraped large-scale…

    Submitted 18 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

  8. arXiv:2403.05172  [pdf, other]

    cs.CV

    Learning Expressive And Generalizable Motion Features For Face Forgery Detection

    Authors: Jingyi Zhang, Peng Zhang, Jingjing Wang, Di Xie, Shiliang Pu

    Abstract: Previous face forgery detection methods mainly focus on appearance features, which may be easily attacked by sophisticated manipulation. Considering the majority of current face manipulation methods generate fake faces based on a single frame, which do not take frame consistency and coordination into consideration, artifacts on frame sequences are more effective for face forgery detection. However…

    Submitted 8 March, 2024; originally announced March 2024.

    Comments: Accepted to ICASSP 2023

  9. arXiv:2403.05117  [pdf, other]

    cs.CV

    Arbitrary-Scale Point Cloud Upsampling by Voxel-Based Network with Latent Geometric-Consistent Learning

    Authors: Hang Du, Xuejun Yan, Jingjing Wang, Di Xie, Shiliang Pu

    Abstract: Recently, the arbitrary-scale point cloud upsampling mechanism has become increasingly popular due to its efficiency and convenience for practical applications. To achieve this, most previous approaches formulate it as a problem of surface approximation and employ point-based networks to learn surface representations. However, learning surfaces from sparse point clouds is more challenging, and thus they o…

    Submitted 8 March, 2024; originally announced March 2024.

    Comments: Accepted to AAAI 2024. The source code is available at https://github.com/hikvision-research/3DVision

  10. arXiv:2403.00258  [pdf, ps, other]

    stat.ML cs.LG

    "Lossless" Compression of Deep Neural Networks: A High-dimensional Neural Tangent Kernel Approach

    Authors: Lingyu Gu, Yongqi Du, Yuan Zhang, Di Xie, Shiliang Pu, Robert C. Qiu, Zhenyu Liao

    Abstract: Modern deep neural networks (DNNs) are extremely powerful; however, this comes at the price of increased depth and having more parameters per layer, making their training and inference more computationally challenging. In an attempt to address this key limitation, efforts have been devoted to the compression (e.g., sparsification and/or quantization) of these large-scale machine learning models, s…

    Submitted 29 February, 2024; originally announced March 2024.

    Comments: 32 pages, 4 figures, and 2 tables. Fixing typos in Theorems 1 and 2 from NeurIPS 2022 proceeding (https://proceedings.neurips.cc/paper_files/paper/2022/hash/185087ea328b4f03ea8fd0c8aa96f747-Abstract-Conference.html)

  11. arXiv:2402.09714  [pdf, other]

    math.OC cs.DC cs.MA

    An Accelerated Distributed Stochastic Gradient Method with Momentum

    Authors: Kun Huang, Shi Pu, Angelia Nedić

    Abstract: In this paper, we introduce an accelerated distributed stochastic gradient method with momentum for solving the distributed optimization problem, where a group of $n$ agents collaboratively minimize the average of the local objective functions over a connected network. The method, termed "Distributed Stochastic Momentum Tracking (DSMT)", is a single-loop algorithm that utilizes the momentum trac…

    Submitted 26 March, 2025; v1 submitted 15 February, 2024; originally announced February 2024.

    Comments: 45 pages, 5 figures

  12. arXiv:2312.09979  [pdf, other]

    cs.CL

    LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin

    Authors: Shihan Dou, Enyu Zhou, Yan Liu, Songyang Gao, Jun Zhao, Wei Shen, Yuhao Zhou, Zhiheng Xi, Xiao Wang, Xiaoran Fan, Shiliang Pu, Jiang Zhu, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang

    Abstract: Supervised fine-tuning (SFT) is a crucial step for large language models (LLMs), enabling them to align with human instructions and enhance their capabilities in downstream tasks. Increasing instruction data substantially is a direct solution to align the model with a broader range of downstream tasks or notably improve its performance on a specific task. However, we find that large-scale increase…

    Submitted 8 March, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

    Comments: 14 pages, 7 figures

  13. arXiv:2310.08298  [pdf, other]

    cs.CL

    MProto: Multi-Prototype Network with Denoised Optimal Transport for Distantly Supervised Named Entity Recognition

    Authors: Shuhui Wu, Yongliang Shen, Zeqi Tan, Wenqi Ren, Jietian Guo, Shiliang Pu, Weiming Lu

    Abstract: Distantly supervised named entity recognition (DS-NER) aims to locate entity mentions and classify their types with only knowledge bases or gazetteers and unlabeled corpus. However, distant annotations are noisy and degrade the performance of NER models. In this paper, we propose a noise-robust prototype network named MProto for the DS-NER task. Different from previous prototype-based NER methods,…

    Submitted 12 October, 2023; originally announced October 2023.

    Comments: Accepted to EMNLP-2023, camera ready version

  14. arXiv:2306.12037  [pdf, other]

    math.OC cs.LG cs.MA

    Distributed Random Reshuffling Methods with Improved Convergence

    Authors: Kun Huang, Linli Zhou, Shi Pu

    Abstract: This paper proposes two distributed random reshuffling methods, namely Gradient Tracking with Random Reshuffling (GT-RR) and Exact Diffusion with Random Reshuffling (ED-RR), to solve the distributed optimization problem over a connected network, where a set of agents aim to minimize the average of their local cost functions. Both algorithms invoke random reshuffling (RR) update for each agent, inh…

    Submitted 16 March, 2025; v1 submitted 21 June, 2023; originally announced June 2023.

    Comments: 16 pages, 8 figures, a short version would appear in IEEE TAC

  15. Accelerating Dynamic Network Embedding with Billions of Parameter Updates to Milliseconds

    Authors: Haoran Deng, Yang Yang, Jiahe Li, Haoyang Cai, Shiliang Pu, Weihao Jiang

    Abstract: Network embedding, a graph representation learning method illustrating network topology by mapping nodes into lower-dimension vectors, is challenging to accommodate the ever-changing dynamic graphs in practice. Existing research is mainly based on node-by-node embedding modifications, which falls into the dilemma of efficient calculation and accuracy. Observing that the embedding dimensions are us…

    Submitted 15 June, 2023; originally announced June 2023.

  16. Single Domain Dynamic Generalization for Iris Presentation Attack Detection

    Authors: Yachun Li, Jingjing Wang, Yuhui Chen, Di Xie, Shiliang Pu

    Abstract: Iris presentation attack detection (PAD) has achieved great success under intra-domain settings but easily degrades on unseen domains. Conventional domain generalization methods mitigate the gap by learning domain-invariant features. However, they ignore the discriminative information in the domain-specific features. Moreover, we usually face a more realistic scenario with only one single domain a…

    Submitted 22 May, 2023; originally announced May 2023.

    Comments: ICASSP 2023 Camera Ready

  17. arXiv:2305.11004  [pdf, other]

    cs.CL

    Insert or Attach: Taxonomy Completion via Box Embedding

    Authors: Wei Xue, Yongliang Shen, Wenqi Ren, Jietian Guo, Shiliang Pu, Weiming Lu

    Abstract: Taxonomy completion, enriching existing taxonomies by inserting new concepts as parents or attaching them as children, has gained significant interest. Previous approaches embed concepts as vectors in Euclidean space, which makes it difficult to model asymmetric relations in taxonomy. In addition, they introduce pseudo-leaves to convert attachment cases into insertion cases, leading to an incorrec…

    Submitted 18 June, 2024; v1 submitted 18 May, 2023; originally announced May 2023.

  18. arXiv:2305.04175  [pdf, other]

    cs.CR cs.CV cs.MM

    Text-to-Image Diffusion Models can be Easily Backdoored through Multimodal Data Poisoning

    Authors: Shengfang Zhai, Yinpeng Dong, Qingni Shen, Shi Pu, Yuejian Fang, Hang Su

    Abstract: With the help of conditioning mechanisms, the state-of-the-art diffusion models have achieved tremendous success in guided image generation, particularly in text-to-image synthesis. To gain a better understanding of the training process and potential risks of text-to-image synthesis, we perform a systematic investigation of backdoor attack on text-to-image diffusion models and propose BadT2I, a ge…

    Submitted 22 October, 2023; v1 submitted 6 May, 2023; originally announced May 2023.

    Comments: Camera-ready version. To appear in ACM MM 2023. Code will be released at: https://github.com/sf-zhai/BadT2I

  19. arXiv:2304.02950  [pdf, other]

    cs.CV

    Multi-view Adversarial Discriminator: Mine the Non-causal Factors for Object Detection in Unseen Domains

    Authors: Mingjun Xu, Lingyun Qin, Weijie Chen, Shiliang Pu, Lei Zhang

    Abstract: Domain shift degrades the performance of object detection models in practical applications. To alleviate the influence of domain shift, plenty of previous works try to decouple and learn the domain-invariant (common) features from source domains via domain adversarial learning (DAL). However, inspired by causal mechanisms, we find that previous methods ignore the implicit insignificant non-causal f…

    Submitted 6 April, 2023; originally announced April 2023.

    Comments: CVPR 2023 (Highlight, top 2.5%). Pytorch vs. MindSpore Code at "https://github.com/K2OKOH/MAD"

  20. arXiv:2303.17167  [pdf, other]

    cs.CV

    Rethinking the Approximation Error in 3D Surface Fitting for Point Cloud Normal Estimation

    Authors: Hang Du, Xuejun Yan, Jingjing Wang, Di Xie, Shiliang Pu

    Abstract: Most existing approaches for point cloud normal estimation aim to locally fit a geometric surface and calculate the normal from the fitted surface. Recently, learning-based methods have adopted a routine of predicting point-wise weights to solve the weighted least-squares surface fitting problem. Despite achieving remarkable progress, these methods overlook the approximation error of the fitting p…

    Submitted 30 March, 2023; originally announced March 2023.

    Comments: The first two authors contributed equally to this work. The source code is available at https://github.com/hikvision-research/3DVision. Accepted to CVPR 2023

  21. arXiv:2303.06555  [pdf, other]

    cs.LG cs.CV

    One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale

    Authors: Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, Jun Zhu

    Abstract: This paper proposes a unified diffusion framework (dubbed UniDiffuser) to fit all distributions relevant to a set of multi-modal data in one model. Our key insight is -- learning diffusion models for marginal, conditional, and joint distributions can be unified as predicting the noise in the perturbed data, where the perturbation levels (i.e. timesteps) can be different for different modalities. I…

    Submitted 30 May, 2023; v1 submitted 11 March, 2023; originally announced March 2023.

    Comments: Accepted to ICML2023

  22. arXiv:2302.00912  [pdf]

    cs.CV

    Advances and Challenges in Multimodal Remote Sensing Image Registration

    Authors: Bai Zhu, Liang Zhou, Simiao Pu, Jianwei Fan, Yuanxin Ye

    Abstract: Over the past few decades, with the rapid development of global aerospace and aerial remote sensing technology, the types of sensors have evolved from the traditional monomodal sensors (e.g., optical sensors) to the new generation of multimodal sensors [e.g., multispectral, hyperspectral, light detection and ranging (LiDAR) and synthetic aperture radar (SAR) sensors]. These advanced devices can dy…

    Submitted 7 February, 2023; v1 submitted 2 February, 2023; originally announced February 2023.

    Comments: 10 pages, 4 figures

  23. arXiv:2301.12677  [pdf, other]

    math.OC cs.LG stat.ML

    Distributed Stochastic Optimization under a General Variance Condition

    Authors: Kun Huang, Xiao Li, Shi Pu

    Abstract: Distributed stochastic optimization has drawn great attention recently due to its effectiveness in solving large-scale machine learning problems. Though numerous algorithms have been proposed and successfully applied to general practical problems, their theoretical guarantees mainly rely on certain boundedness conditions on the stochastic gradients, varying from uniform boundedness to the relaxed…

    Submitted 13 December, 2023; v1 submitted 30 January, 2023; originally announced January 2023.

    Comments: 16 pages, 2 figures

  24. arXiv:2301.05872  [pdf, other]

    math.OC cs.DC cs.LG cs.MA

    CEDAS: A Compressed Decentralized Stochastic Gradient Method with Improved Convergence

    Authors: Kun Huang, Shi Pu

    Abstract: In this paper, we consider solving the distributed optimization problem over a multi-agent network under the communication restricted setting. We study a compressed decentralized stochastic gradient method, termed "compressed exact diffusion with adaptive stepsizes (CEDAS)", and show the method asymptotically achieves a convergence rate comparable to that of centralized stochastic gradient descent (SGD)…

    Submitted 28 September, 2024; v1 submitted 14 January, 2023; originally announced January 2023.

    Comments: 16 pages, 8 figures

  25. arXiv:2301.04796  [pdf, other]

    cs.CV

    1st Place Solution for ECCV 2022 OOD-CV Challenge Object Detection Track

    Authors: Wei Zhao, Binbin Chen, Weijie Chen, Shicai Yang, Di Xie, Shiliang Pu, Yueting Zhuang

    Abstract: OOD-CV challenge is an out-of-distribution generalization task. To solve this problem in object detection track, we propose a simple yet effective Generalize-then-Adapt (G&A) framework, which is composed of a two-stage domain generalization part and a one-stage domain adaptation part. The domain generalization part is implemented by a Supervised Model Pretraining stage using source data for model…

    Submitted 11 January, 2023; originally announced January 2023.

    Comments: Tech Report

  26. arXiv:2301.04795  [pdf, other]

    cs.CV

    1st Place Solution for ECCV 2022 OOD-CV Challenge Image Classification Track

    Authors: Yilu Guo, Xingyue Shi, Weijie Chen, Shicai Yang, Di Xie, Shiliang Pu, Yueting Zhuang

    Abstract: OOD-CV challenge is an out-of-distribution generalization task. In this challenge, our core solution can be summarized as that Noisy Label Learning Is A Strong Test-Time Domain Adaptation Optimizer. Briefly speaking, our main pipeline can be divided into two stages, a pre-training stage for domain generalization and a test-time training stage for domain adaptation. We only exploit labeled source d…

    Submitted 11 January, 2023; originally announced January 2023.

    Comments: Tech Report

  27. arXiv:2212.14710  [pdf, other]

    cs.CV

    NeRF-Gaze: A Head-Eye Redirection Parametric Model for Gaze Estimation

    Authors: Pengwei Yin, Jiawu Dai, Jingjing Wang, Di Xie, Shiliang Pu

    Abstract: Gaze estimation is the fundamental basis for many visual tasks. Yet, the high cost of acquiring gaze datasets with 3D annotations hinders the optimization and application of gaze estimation models. In this work, we propose a novel Head-Eye redirection parametric model based on Neural Radiance Field, which allows dense gaze data generation with view consistency and accurate gaze direction. Moreover…

    Submitted 30 December, 2022; originally announced December 2022.

    Comments: 10 pages, 8 figures, submitted to CVPR 2023

  28. arXiv:2211.08998  [pdf, other]

    cs.LG math.OC

    Data-pooling Reinforcement Learning for Personalized Healthcare Intervention

    Authors: Xinyun Chen, Pengyi Shi, Shanwen Pu

    Abstract: Motivated by the emerging needs of personalized preventative intervention in many healthcare applications, we consider a multi-stage, dynamic decision-making problem in the online setting with unknown model parameters. To deal with the pervasive issue of small sample size in personalized planning, we develop a novel data-pooling reinforcement learning (RL) algorithm based on a general perturbed va…

    Submitted 16 November, 2022; originally announced November 2022.

  29. arXiv:2210.04206  [pdf, other]

    cs.CV

    Attention Diversification for Domain Generalization

    Authors: Rang Meng, Xianfeng Li, Weijie Chen, Shicai Yang, Jie Song, Xinchao Wang, Lei Zhang, Mingli Song, Di Xie, Shiliang Pu

    Abstract: Convolutional neural networks (CNNs) have demonstrated gratifying results at learning discriminative features. However, when applied to unseen domains, state-of-the-art models are usually prone to errors due to domain shift. After investigating this issue from the perspective of shortcut learning, we find the devils lie in the fact that models trained on different domains merely bias to different…

    Submitted 9 October, 2022; originally announced October 2022.

    Comments: ECCV 2022. Code available at https://github.com/hikvision-research/DomainGeneralization

    Journal ref: European Conference on Computer Vision (ECCV 2022)

  30. arXiv:2210.03974  [pdf, other]

    cs.CV

    FBNet: Feedback Network for Point Cloud Completion

    Authors: Xuejun Yan, Hongyu Yan, Jingjing Wang, Hang Du, Zhihong Wu, Di Xie, Shiliang Pu, Li Lu

    Abstract: The rapid development of point cloud learning has driven point cloud completion into a new era. However, the information flows of most existing completion methods are solely feedforward, and high-level information is rarely reused to improve low-level feature learning. To this end, we propose a novel Feedback Network (FBNet) for point cloud completion, in which present features are efficiently ref…

    Submitted 8 October, 2022; originally announced October 2022.

    Comments: The first two authors contributed equally to this work. The source code and model are available at https://github.com/hikvision-research/3DVision/. Accepted to ECCV 2022 as oral presentation

  31. arXiv:2210.03942  [pdf, other]

    cs.CV

    Point Cloud Upsampling via Cascaded Refinement Network

    Authors: Hang Du, Xuejun Yan, Jingjing Wang, Di Xie, Shiliang Pu

    Abstract: Point cloud upsampling focuses on generating a dense, uniform and proximity-to-surface point set. Most previous approaches accomplish these objectives by carefully designing a single-stage network, which makes it still challenging to generate a high-fidelity point distribution. Instead, upsampling point cloud in a coarse-to-fine manner is a decent solution. However, existing coarse-to-fine upsampl…

    Submitted 8 October, 2022; originally announced October 2022.

    Comments: The first two authors contributed equally to this work. The code is publicly available at https://github.com/hikvision-research/3DVision. Accepted to ACCV 2022 as oral presentation

  32. arXiv:2210.03899  [pdf, other]

    cs.CV

    Multi-Scale Wavelet Transformer for Face Forgery Detection

    Authors: Jie Liu, Jingjing Wang, Peng Zhang, Chunmao Wang, Di Xie, Shiliang Pu

    Abstract: Currently, many face forgery detection methods aggregate spatial and frequency features to enhance the generalization ability and gain promising performance under the cross-dataset scenario. However, these methods only leverage one level frequency information which limits their expressive ability. To overcome these limitations, we propose a multi-scale wavelet transformer framework for face forger…

    Submitted 7 October, 2022; originally announced October 2022.

    Comments: The first two authors contributed equally to this work. Accepted to ACCV 2022 as oral presentation

  33. arXiv:2209.03629  [pdf, other]

    cs.LG cs.SI physics.soc-ph

    Hierarchical Graph Pooling is an Effective Citywide Traffic Condition Prediction Model

    Authors: Shilin Pu, Liang Chu, Zhuoran Hou, Jincheng Hu, Yanjun Huang, Yuanjian Zhang

    Abstract: Accurate traffic conditions prediction provides a solid foundation for vehicle-environment coordination and traffic control tasks. Because of the complexity of road network data in spatial distribution and the diversity of deep learning methods, it becomes challenging to effectively define traffic data and adequately capture the complex spatial nonlinear features in the data. This paper applies tw…

    Submitted 8 September, 2022; originally announced September 2022.

    Comments: 16 pages, 15 figures

  34. arXiv:2208.05818  [pdf, other]

    cs.MM

    HERO: HiErarchical spatio-tempoRal reasOning with Contrastive Action Correspondence for End-to-End Video Object Grounding

    Authors: Mengze Li, Tianbao Wang, Haoyu Zhang, Shengyu Zhang, Zhou Zhao, Wenqiao Zhang, Jiaxu Miao, Shiliang Pu, Fei Wu

    Abstract: Video Object Grounding (VOG) is the problem of associating spatial object regions in the video to a descriptive natural language query. This is a challenging vision-language task that necessitates constructing the correct cross-modal correspondence and modeling the appropriate spatio-temporal context of the query video and caption, thereby localizing the specific objects accurately. In this paper,…

    Submitted 11 August, 2022; originally announced August 2022.

  35. Unified Normalization for Accelerating and Stabilizing Transformers

    Authors: Qiming Yang, Kai Zhang, Chaoxiang Lan, Zhi Yang, Zheyang Li, Wenming Tan, Jun Xiao, Shiliang Pu

    Abstract: Solid results from Transformers have made them prevailing architectures in various natural language and vision tasks. As a default component in Transformers, Layer Normalization (LN) normalizes activations within each token to boost the robustness. However, LN requires on-the-fly statistics calculation in inference as well as division and square root operations, leading to inefficiency on hardware…

    Submitted 2 August, 2022; originally announced August 2022.

    Comments: ACM MM'22

  36. arXiv:2207.11034  [pdf]

    cs.LG

    Spatial-Temporal Feature Extraction and Evaluation Network for Citywide Traffic Condition Prediction

    Authors: Shilin Pu, Liang Chu, Zhuoran Hou, Jincheng Hu, Yanjun Huang, Yuanjian Zhang

    Abstract: Traffic prediction plays an important role in the realization of traffic control and scheduling tasks in intelligent transportation systems. With the diversification of data sources, reasonably using rich traffic data to model the complex spatial-temporal dependence and nonlinear characteristics in traffic flow is the key challenge for intelligent transportation systems. In addition, clearly evalu…

    Submitted 22 July, 2022; originally announced July 2022.

    Comments: 39 pages, 14 figures, 5 tables

  37. arXiv:2207.06754  [pdf, other]

    cs.CV

    E2-AEN: End-to-End Incremental Learning with Adaptively Expandable Network

    Authors: Guimei Cao, Zhanzhan Cheng, Yunlu Xu, Duo Li, Shiliang Pu, Yi Niu, Fei Wu

    Abstract: Expandable networks have demonstrated their advantages in dealing with the catastrophic forgetting problem in incremental learning. Considering that different tasks may need different structures, recent methods design dynamic structures adapted to different tasks via sophisticated skills. Their routine is to search expandable structures first and then train on the new tasks, which, however, breaks tas…

    Submitted 14 July, 2022; originally announced July 2022.

  38. arXiv:2207.06744  [pdf, other]

    cs.CV

    TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents

    Authors: Zhanzhan Cheng, Peng Zhang, Can Li, Qiao Liang, Yunlu Xu, Pengfei Li, Shiliang Pu, Yi Niu, Fei Wu

    Abstract: Recently, automatically extracting information from visually rich documents (e.g., tickets and resumes) has become a hot and vital research topic due to its widespread commercial value. Most existing methods divide this task into two subparts: the text reading part for obtaining the plain text from the original document images and the information extraction part for extracting key contents. These…

    Submitted 14 July, 2022; originally announced July 2022.

  39. arXiv:2207.06694  [pdf, other]

    cs.CV

    Dynamic Low-Resolution Distillation for Cost-Efficient End-to-End Text Spotting

    Authors: Ying Chen, Liang Qiao, Zhanzhan Cheng, Shiliang Pu, Yi Niu, Xi Li

    Abstract: End-to-end text spotting has attracted great attention recently due to its benefits on global optimization and high maintainability for real applications. However, the input scale has always been a tough trade-off since recognizing a small text instance usually requires enlarging the whole image, which brings high computational costs. In this paper, to address this problem, we propose a novel cost-…

    Submitted 14 July, 2022; v1 submitted 14 July, 2022; originally announced July 2022.

    Comments: Accepted by ECCV 2022

  40. arXiv:2207.06085  [pdf, other]

    cs.CV

    Semi-supervised Ranking for Object Image Blur Assessment

    Authors: Qiang Li, Zhaoliang Yao, Jingjing Wang, Ye Tian, Pengju Yang, Di Xie, Shiliang Pu

    Abstract: Assessing the blurriness of an object image is fundamentally important to improve the performance for object recognition and retrieval. The main challenge lies in the lack of abundant images with reliable labels and effective learning strategies. Current datasets are labeled with limited and confused quality levels. To overcome this limitation, we propose to label the rank relationships between pa…

    Submitted 13 July, 2022; originally announced July 2022.

    Comments: The first two authors contributed equally to this work. Dataset is available at https://github.com/yzliangHIK2022/SSRanking-for-Object-BA. Accepted to ICIP 2022

  41. arXiv:2207.01756  [pdf, other]

    cs.CV

    Universal Domain Adaptive Object Detector

    Authors: Wenxu Shi, Lei Zhang, Weijie Chen, Shiliang Pu

    Abstract: Universal domain adaptive object detection (UniDAOD) is more challenging than domain adaptive object detection (DAOD) since the label space of the source domain may not be the same as that of the target and the scale of objects in the universal scenarios can vary dramatically (i.e., category shift and scale shift). To this end, we propose US-DAF, namely Universal Scale-Aware Domain Adaptive Faster R…

    Submitted 4 July, 2022; originally announced July 2022.

    Comments: Accepted to ACM MM2022

  42. arXiv:2206.06620  [pdf, other

    cs.CV

    Slimmable Domain Adaptation

    Authors: Rang Meng, Weijie Chen, Shicai Yang, Jie Song, Luojun Lin, Di Xie, Shiliang Pu, Xinchao Wang, Mingli Song, Yueting Zhuang

    Abstract: Vanilla unsupervised domain adaptation methods tend to optimize the model with fixed neural architecture, which is not very practical in real-world scenarios since the target data is usually processed by different resource-limited devices. It is therefore of great necessity to facilitate architecture adaptation across various devices. In this paper, we introduce a simple framework, Slimmable Domai…

    Submitted 14 June, 2022; originally announced June 2022.

    Comments: To appear in CVPR 2022. Code is coming soon: https://github.com/hikvision-research/SlimDA

    Journal ref: IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2022

  43. arXiv:2206.06608  [pdf, other

    cs.CV

    Label Matching Semi-Supervised Object Detection

    Authors: Binbin Chen, Weijie Chen, Shicai Yang, Yunyi Xuan, Jie Song, Di Xie, Shiliang Pu, Mingli Song, Yueting Zhuang

    Abstract: Semi-supervised object detection has made significant progress with the development of mean teacher driven self-training. Despite the promising results, the label mismatch problem is not yet fully explored in the previous works, leading to severe confirmation bias during self-training. In this paper, we delve into this problem and propose a simple yet effective LabelMatch framework from two differ…

    Submitted 14 June, 2022; originally announced June 2022.

    Comments: To appear in CVPR 2022. Code is coming soon: https://github.com/hikvision-research/SSOD

    Journal ref: IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2022

  44. arXiv:2206.06293  [pdf, other

    cs.CV cs.AI

    Learning Domain Adaptive Object Detection with Probabilistic Teacher

    Authors: Meilin Chen, Weijie Chen, Shicai Yang, Jie Song, Xinchao Wang, Lei Zhang, Yunfeng Yan, Donglian Qi, Yueting Zhuang, Di Xie, Shiliang Pu

    Abstract: Self-training for unsupervised domain adaptive object detection is a challenging task, of which the performance depends heavily on the quality of pseudo boxes. Despite the promising results, prior works have largely overlooked the uncertainty of pseudo boxes during self-training. In this paper, we present a simple yet effective framework, termed as Probabilistic Teacher (PT), which aims to capture…

    Submitted 13 June, 2022; originally announced June 2022.

    Comments: To appear in ICML 2022. Code is coming soon: https://github.com/hikvision-research/ProbabilisticTeacher

    Journal ref: International Conference on Machine Learning (ICML), 2022

  45. arXiv:2206.06177  [pdf, other

    cs.CV cs.AI

    Transductive CLIP with Class-Conditional Contrastive Learning

    Authors: Junchu Huang, Weijie Chen, Shicai Yang, Di Xie, Shiliang Pu, Yueting Zhuang

    Abstract: Inspired by the remarkable zero-shot generalization capacity of vision-language pre-trained models, we seek to leverage the supervision from the CLIP model to alleviate the burden of data labeling. However, such supervision inevitably contains label noise, which significantly degrades the discriminative power of the classification model. In this work, we propose Transductive CLIP, a novel framework…

    Submitted 13 June, 2022; originally announced June 2022.

    Comments: Published in IEEE ICASSP 2022

    Journal ref: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022

  46. arXiv:2206.06168  [pdf, other

    cs.CV

    2nd Place Solution for ICCV 2021 VIPriors Image Classification Challenge: An Attract-and-Repulse Learning Approach

    Authors: Yilu Guo, Shicai Yang, Weijie Chen, Liang Ma, Di Xie, Shiliang Pu

    Abstract: Convolutional neural networks (CNNs) have achieved significant success in image classification by utilizing large-scale datasets. However, it is still a great challenge to learn from scratch on small-scale datasets efficiently and effectively. With limited training datasets, the concepts of categories will be ambiguous since the over-parameterized CNNs tend to simply memorize the dataset, leading…

    Submitted 13 June, 2022; originally announced June 2022.

    Comments: 2nd Place Solution for ICCV 2021 VIPriors Image Classification Challenge

  47. arXiv:2205.11126  [pdf, other

    cs.LG cs.CV

    KRNet: Towards Efficient Knowledge Replay

    Authors: Yingying Zhang, Qiaoyong Zhong, Di Xie, Shiliang Pu

    Abstract: The knowledge replay technique has been widely used in many tasks such as continual learning and continuous domain adaptation. The key lies in how to effectively encode the knowledge extracted from previous data and replay it during the current training procedure. A simple yet effective model to achieve knowledge replay is an autoencoder. However, the number of stored latent codes in autoencoder increa…

    Submitted 23 May, 2022; originally announced May 2022.

    Comments: Accepted by ICPR 2022

  48. arXiv:2205.11071  [pdf, other

    cs.CV

    Self-distilled Knowledge Delegator for Exemplar-free Class Incremental Learning

    Authors: Fanfan Ye, Liang Ma, Qiaoyong Zhong, Di Xie, Shiliang Pu

    Abstract: Exemplar-free incremental learning is extremely challenging due to the inaccessibility of data from old tasks. In this paper, we attempt to exploit the knowledge encoded in a previously trained classification model to handle the catastrophic forgetting problem in continual learning. Specifically, we introduce a so-called knowledge delegator, which is capable of transferring knowledge from the trained…

    Submitted 23 May, 2022; originally announced May 2022.

    Comments: Accepted by IJCNN 2022

  49. arXiv:2204.11448  [pdf, other

    eess.IV cs.CV

    High-Efficiency Lossy Image Coding Through Adaptive Neighborhood Information Aggregation

    Authors: Ming Lu, Fangdong Chen, Shiliang Pu, Zhan Ma

    Abstract: Questing for learned lossy image coding (LIC) with superior compression performance and computation throughput is challenging. The vital factor behind it is how to intelligently explore Adaptive Neighborhood Information Aggregation (ANIA) in transform and entropy coding modules. To this end, Integrated Convolution and Self-Attention (ICSA) unit is first proposed to form a content-adaptive transfor…

    Submitted 12 October, 2022; v1 submitted 25 April, 2022; originally announced April 2022.

  50. arXiv:2204.00379  [pdf, other

    cs.CV

    Weakly Supervised Regional and Temporal Learning for Facial Action Unit Recognition

    Authors: Jingwei Yan, Jingjing Wang, Qiang Li, Chunmao Wang, Shiliang Pu

    Abstract: Automatic facial action unit (AU) recognition is a challenging task due to the scarcity of manual annotations. To alleviate this problem, a large amount of effort has been dedicated to exploiting various weakly supervised methods which leverage numerous unlabeled data. However, many aspects with regard to some unique properties of AUs, such as the regional and relational characteristics, are not…

    Submitted 1 April, 2022; originally announced April 2022.

    Comments: The first two authors contributed equally to this work. Extension of arXiv:2107.14399. Accepted to IEEE Transactions on Multimedia 2022