
Showing 1–50 of 304 results for author: Liang, C

Searching in archive cs.
  1. arXiv:2505.09644  [pdf, ps, other]

    cs.IT eess.IV

    Joint Source-Channel Noise Adding with Adaptive Denoising for Diffusion-Based Semantic Communications

    Authors: Chengyang Liang, Dong Li

    Abstract: Semantic communication (SemCom) aims to convey the intended meaning of messages rather than merely transmitting bits, thereby offering greater efficiency and robustness, particularly in resource-constrained or noisy environments. In this paper, we propose a novel framework which is referred to as joint source-channel noise adding with adaptive denoising (JSCNA-AD) for SemCom based on a diffusion m…

    Submitted 10 May, 2025; originally announced May 2025.

  2. arXiv:2505.08366  [pdf]

    eess.SP cs.AI

    Non-contact Vital Signs Detection in Dynamic Environments

    Authors: Shuai Sun, Chong-Xi Liang, Chengwei Ye, Huanzhen Zhang, Kangsheng Wang

    Abstract: Accurate phase demodulation is critical for vital sign detection using millimeter-wave radar. However, in complex environments, time-varying DC offsets and phase imbalances can severely degrade demodulation performance. To address this, we propose a novel DC offset calibration method alongside a Hilbert and Differential Cross-Multiply (HADCM) demodulation algorithm. The approach estimates time-var…

    Submitted 13 May, 2025; originally announced May 2025.

  3. arXiv:2505.07425  [pdf, ps, other]

    cs.SE

    A Systematic Literature Review on Neural Code Translation

    Authors: Xiang Chen, Jiacheng Xue, Xiaofei Xie, Caokai Liang, Xiaolin Ju

    Abstract: Code translation aims to convert code from one programming language to another automatically. It is motivated by the need for multi-language software development and legacy system migration. In recent years, neural code translation has gained significant attention, driven by rapid advancements in deep learning and large language models. Researchers have proposed various techniques to improve neura…

    Submitted 12 May, 2025; originally announced May 2025.

  4. arXiv:2505.03230  [pdf, other]

    cs.LG

    Joint Resource Management for Energy-efficient UAV-assisted SWIPT-MEC: A Deep Reinforcement Learning Approach

    Authors: Yue Chen, Hui Kang, Jiahui Li, Geng Su, Boxiong Wang, Jiacheng Wang, Cong Liang, Shuang Liang, Dusit Niyato

    Abstract: The integration of simultaneous wireless information and power transfer (SWIPT) technology in 6G Internet of Things (IoT) networks faces significant challenges in remote areas and disaster scenarios where ground infrastructure is unavailable. This paper proposes a novel unmanned aerial vehicle (UAV)-assisted mobile edge computing (MEC) system enhanced by directional antennas to provide both comput…

    Submitted 6 May, 2025; originally announced May 2025.

  5. arXiv:2504.20930  [pdf, other]

    cs.AI cs.CL cs.CV

    ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification

    Authors: Ziqing Fan, Cheng Liang, Chaoyi Wu, Ya Zhang, Yanfeng Wang, Weidi Xie

    Abstract: Recent advances in reasoning-enhanced large language models (LLMs) and multimodal LLMs (MLLMs) have significantly improved performance in complex tasks, yet medical AI models often overlook the structured reasoning processes inherent in clinical practice. In this work, we present ChestX-Reasoner, a radiology diagnosis MLLM designed to leverage process supervision mined directly from clinical repor…

    Submitted 29 April, 2025; originally announced April 2025.

  6. arXiv:2504.18790  [pdf, other]

    cs.RO

    Coherence-based Approximate Derivatives via Web of Affine Spaces Optimization

    Authors: Daniel Rakita, Chen Liang, Qian Wang

    Abstract: Computing derivatives is a crucial subroutine in computer science and related fields as it provides a local characterization of a function's steepest directions of ascent or descent. In this work, we recognize that derivatives are often not computed in isolation; conversely, it is quite common to compute a \textit{sequence} of derivatives, each one somewhat related to the last. Thus, we propose ac…

    Submitted 26 April, 2025; originally announced April 2025.

    Comments: To appear in the proceedings of Robotics Science and Systems (RSS) 2025

  7. arXiv:2504.16136  [pdf, other]

    cs.LG

    Active Learning Methods for Efficient Data Utilization and Model Performance Enhancement

    Authors: Chiung-Yi Tseng, Junhao Song, Ziqian Bi, Tianyang Wang, Chia Xin Liang, Ming Liu

    Abstract: In the era of data-driven intelligence, the paradox of data abundance and annotation scarcity has emerged as a critical bottleneck in the advancement of machine learning. This paper gives a detailed overview of Active Learning (AL), which is a strategy in machine learning that helps models achieve better performance using fewer labeled examples. It introduces the basic concepts of AL and discusses…

    Submitted 21 April, 2025; originally announced April 2025.

  8. arXiv:2504.15976  [pdf, other]

    cs.RO

    ad-trait: A Fast and Flexible Automatic Differentiation Library in Rust

    Authors: Chen Liang, Qian Wang, Andy Xu, Daniel Rakita

    Abstract: The Rust programming language is an attractive choice for robotics and related fields, offering highly efficient and memory-safe code. However, a key limitation preventing its broader adoption in these domains is the lack of high-quality, well-supported Automatic Differentiation (AD)-a fundamental technique that enables convenient derivative computation by systematically accumulating data during f…

    Submitted 22 April, 2025; originally announced April 2025.

  9. arXiv:2504.13825  [pdf, other]

    cs.CL cs.LG

    Feature Alignment and Representation Transfer in Knowledge Distillation for Large Language Models

    Authors: Junjie Yang, Junhao Song, Xudong Han, Ziqian Bi, Tianyang Wang, Chia Xin Liang, Xinyuan Song, Yichao Zhang, Qian Niu, Benji Peng, Keyu Chen, Ming Liu

    Abstract: Knowledge distillation (KD) is a technique for transferring knowledge from complex teacher models to simpler student models, significantly enhancing model efficiency and accuracy. It has demonstrated substantial advancements in various applications including image classification, object detection, language modeling, text classification, and sentiment analysis. Recent innovations in KD methods, suc…

    Submitted 18 April, 2025; originally announced April 2025.

  10. arXiv:2504.08764  [pdf]

    cs.IR cs.CL

    Evaluation of the phi-3-mini SLM for identification of texts related to medicine, health, and sports injuries

    Authors: Chris Brogly, Saif Rjaibi, Charlotte Liang, Erica Lam, Edward Wang, Adam Levitan, Sarah Paleczny, Michael Cusimano

    Abstract: Small Language Models (SLMs) have potential to be used for automatically labelling and identifying aspects of text data for medicine/health-related purposes from documents and the web. As their resource requirements are significantly lower than Large Language Models (LLMs), these can be deployed potentially on more types of devices. SLMs often are benchmarked on health/medicine-related tasks, such…

    Submitted 31 March, 2025; originally announced April 2025.

  11. arXiv:2504.06878  [pdf, other]

    cond-mat.mtrl-sci cs.LG

    CRYSIM: Prediction of Symmetric Structures of Large Crystals with GPU-based Ising Machines

    Authors: Chen Liang, Diptesh Das, Jiang Guo, Ryo Tamura, Zetian Mao, Koji Tsuda

    Abstract: Solving black-box optimization problems with Ising machines is increasingly common in materials science. However, their application to crystal structure prediction (CSP) is still ineffective due to symmetry agnostic encoding of atomic coordinates. We introduce CRYSIM, an algorithm that encodes the space group, the Wyckoff positions combination, and coordinates of independent atomic sites as separa…

    Submitted 9 April, 2025; originally announced April 2025.

    Comments: 18 pages, 4 figures, 1 table

  12. arXiv:2504.02628  [pdf, other]

    eess.IV cs.CV

    Towards Computation- and Communication-efficient Computational Pathology

    Authors: Chu Han, Bingchao Zhao, Jiatai Lin, Shanshan Lyu, Longfei Wang, Tianpeng Deng, Cheng Lu, Changhong Liang, Hannah Y. Wen, Xiaojing Guo, Zhenwei Shi, Zaiyi Liu

    Abstract: Despite the impressive performance across a wide range of applications, current computational pathology models face significant diagnostic efficiency challenges due to their reliance on high-magnification whole-slide image analysis. This limitation severely compromises their clinical utility, especially in time-sensitive diagnostic scenarios and situations requiring efficient data transfer. To add…

    Submitted 3 April, 2025; originally announced April 2025.

  13. arXiv:2503.13522  [pdf, ps, other]

    q-bio.BM cs.AI cs.LG

    Advanced Deep Learning Methods for Protein Structure Prediction and Design

    Authors: Yichao Zhang, Ningyuan Deng, Xinyuan Song, Ziqian Bi, Tianyang Wang, Zheyu Yao, Keyu Chen, Ming Li, Qian Niu, Junyu Liu, Benji Peng, Sen Zhang, Ming Liu, Li Zhang, Xuanhe Pan, Jinlang Wang, Pohsun Feng, Yizhu Wen, Lawrence KQ Yan, Hongming Tseng, Yan Zhong, Yunze Wang, Ziyuan Qin, Bowen Jing, Junjie Yang, et al. (3 additional authors not shown)

    Abstract: After AlphaFold won the Nobel Prize, protein prediction with deep learning once again became a hot topic. We comprehensively explore advanced deep learning methods applied to protein structure prediction and design. It begins by examining recent innovations in prediction architectures, with detailed discussions on improvements such as diffusion based frameworks and novel pairwise attention modules…

    Submitted 29 March, 2025; v1 submitted 14 March, 2025; originally announced March 2025.

  14. arXiv:2503.12478  [pdf, other]

    cs.LG cs.AI cs.DB

    KDSelector: A Knowledge-Enhanced and Data-Efficient Model Selector Learning Framework for Time Series Anomaly Detection

    Authors: Zhiyu Liang, Dongrui Cai, Chenyuan Zhang, Zheng Liang, Chen Liang, Bo Zheng, Shi Qiu, Jin Wang, Hongzhi Wang

    Abstract: Model selection has been raised as an essential problem in the area of time series anomaly detection (TSAD), because there is no single best TSAD model for the highly heterogeneous time series in real-world applications. However, despite the success of existing model selection solutions that train a classification model (especially neural network, NN) using historical data as a selector to predict…

    Submitted 19 March, 2025; v1 submitted 16 March, 2025; originally announced March 2025.

    Comments: This paper has been accepted by SIGMOD 2025

  15. arXiv:2503.12332  [pdf, other]

    cs.CV

    VideoMAP: Toward Scalable Mamba-based Video Autoregressive Pretraining

    Authors: Yunze Liu, Peiran Wu, Cheng Liang, Junxiao Shen, Limin Wang, Li Yi

    Abstract: Recent Mamba-based architectures for video understanding demonstrate promising computational efficiency and competitive performance, yet struggle with overfitting issues that hinder their scalability. To overcome this challenge, we introduce VideoMAP, a Hybrid Mamba-Transformer framework featuring a novel pre-training approach. VideoMAP uses a 4:1 Mamba-to-Transformer ratio, effectively balancing…

    Submitted 15 March, 2025; originally announced March 2025.

  16. arXiv:2503.10029  [pdf, other]

    cs.HC

    HandProxy: Expanding the Affordances of Speech Interfaces in Immersive Environments with a Virtual Proxy Hand

    Authors: Chen Liang, Yuxuan Liu, Martez Mott, Anhong Guo

    Abstract: Hand interactions are increasingly used as the primary input modality in immersive environments, but they are not always feasible due to situational impairments, motor limitations, and environmental constraints. Speech interfaces have been explored as an alternative to hand input in research and commercial solutions, but are limited to initiating basic hand gestures and system controls. We introdu…

    Submitted 13 March, 2025; originally announced March 2025.

  17. arXiv:2503.06744  [pdf, other]

    cs.CV

    CoDa-4DGS: Dynamic Gaussian Splatting with Context and Deformation Awareness for Autonomous Driving

    Authors: Rui Song, Chenwei Liang, Yan Xia, Walter Zimmer, Hu Cao, Holger Caesar, Andreas Festag, Alois Knoll

    Abstract: Dynamic scene rendering opens new avenues in autonomous driving by enabling closed-loop simulations with photorealistic data, which is crucial for validating end-to-end algorithms. However, the complex and highly dynamic nature of traffic environments presents significant challenges in accurately rendering these scenes. In this paper, we introduce a novel 4D Gaussian Splatting (4DGS) approach, whi…

    Submitted 9 March, 2025; originally announced March 2025.

  18. arXiv:2503.04104  [pdf, other]

    cs.CL

    LLMs Can Generate a Better Answer by Aggregating Their Own Responses

    Authors: Zichong Li, Xinyu Feng, Yuheng Cai, Zixuan Zhang, Tianyi Liu, Chen Liang, Weizhu Chen, Haoyu Wang, Tuo Zhao

    Abstract: Large Language Models (LLMs) have shown remarkable capabilities across tasks, yet they often require additional prompting techniques when facing complex problems. While approaches like self-correction and response selection have emerged as popular solutions, recent studies have shown these methods perform poorly when relying on the LLM itself to provide feedback or selection criteria. We argue thi…

    Submitted 12 April, 2025; v1 submitted 6 March, 2025; originally announced March 2025.

  19. arXiv:2503.01743  [pdf, other]

    cs.CL cs.AI cs.LG

    Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    Authors: Microsoft: Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, Dong Chen, Dongdong Chen, Junkun Chen, Weizhu Chen, Yen-Chun Chen, Yi-ling Chen, Qi Dai, Xiyang Dai, Ruchao Fan, Mei Gao, Min Gao, Amit Garg, Abhishek Goswami, et al. (51 additional authors not shown)

    Abstract: We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data, significantly outperforming recent open-source models of similar size and matching the performance of models twice its size on math and coding tasks requiring complex reasoning. This achievement…

    Submitted 7 March, 2025; v1 submitted 3 March, 2025; originally announced March 2025.

    Comments: 39 pages

  20. arXiv:2503.00723  [pdf, other]

    cs.LG

    Re-Imagining Multimodal Instruction Tuning: A Representation View

    Authors: Yiyang Liu, James Chenhao Liang, Ruixiang Tang, Yugyung Lee, Majid Rabbani, Sohail Dianat, Raghuveer Rao, Lifu Huang, Dongfang Liu, Qifan Wang, Cheng Han

    Abstract: Multimodal instruction tuning has proven to be an effective strategy for achieving zero-shot generalization by fine-tuning pre-trained Large Multimodal Models (LMMs) with instruction-following data. However, as the scale of LMMs continues to grow, fully fine-tuning these models has become highly parameter-intensive. Although Parameter-Efficient Fine-Tuning (PEFT) methods have been introduced to re…

    Submitted 20 March, 2025; v1 submitted 1 March, 2025; originally announced March 2025.

  21. arXiv:2503.00634  [pdf, other]

    cs.LG cs.AI

    Efficiently Editing Mixture-of-Experts Models with Compressed Experts

    Authors: Yifei He, Yang Liu, Chen Liang, Hany Hassan Awadalla

    Abstract: Mixture-of-Experts (MoE) models have become a key approach for scaling large language models efficiently by activating only a subset of experts during training and inference. Typically, the number of activated experts presents a trade-off: fewer experts reduce computational costs, while more experts improve performance. Recent studies reveal that not all activated experts contribute equally to mod…

    Submitted 1 March, 2025; originally announced March 2025.

  22. arXiv:2502.17410  [pdf, other]

    cs.LG

    COSMOS: A Hybrid Adaptive Optimizer for Memory-Efficient Training of LLMs

    Authors: Liming Liu, Zhenghao Xu, Zixuan Zhang, Hao Kang, Zichong Li, Chen Liang, Weizhu Chen, Tuo Zhao

    Abstract: Large Language Models (LLMs) have demonstrated remarkable success across various domains, yet their optimization remains a significant challenge due to the complex and high-dimensional loss landscapes they inhabit. While adaptive optimizers such as AdamW are widely used, they suffer from critical limitations, including an inability to capture interdependencies between coordinates and high memory c…

    Submitted 25 February, 2025; v1 submitted 24 February, 2025; originally announced February 2025.

    Comments: 23 pages, 9 figures, 6 tables

  23. InteRecon: Towards Reconstructing Interactivity of Personal Memorable Items in Mixed Reality

    Authors: Zisu Li, Jiawei Li, Zeyu Xiong, Shumeng Zhang, Faraz Faruqi, Stefanie Mueller, Chen Liang, Xiaojuan Ma, Mingming Fan

    Abstract: Digital capturing of memorable personal items is a key way to archive personal memories. Although current digitization methods (e.g., photos, videos, 3D scanning) can replicate the physical appearance of an item, they often cannot preserve its real-world interactivity. We present Interactive Digital Item (IDI), a concept of reconstructing both the physical appearance and, more importantly, the int…

    Submitted 14 February, 2025; originally announced February 2025.

    Comments: 19 pages, 8 figures

  24. arXiv:2502.06061  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    Online Reward-Weighted Fine-Tuning of Flow Matching with Wasserstein Regularization

    Authors: Jiajun Fan, Shuaike Shen, Chaoran Cheng, Yuxin Chen, Chumeng Liang, Ge Liu

    Abstract: Recent advancements in reinforcement learning (RL) have achieved great success in fine-tuning diffusion-based generative models. However, fine-tuning continuous flow-based generative models to align with arbitrary user-defined reward functions remains challenging, particularly due to issues such as policy collapse from overoptimization and the prohibitively high computational cost of likelihoods i…

    Submitted 9 February, 2025; originally announced February 2025.

    Comments: 61 pages

  25. arXiv:2502.04116  [pdf, other]

    cs.LG cs.CV

    Generative Adversarial Networks Bridging Art and Machine Intelligence

    Authors: Junhao Song, Yichao Zhang, Ziqian Bi, Tianyang Wang, Keyu Chen, Ming Li, Qian Niu, Junyu Liu, Benji Peng, Sen Zhang, Ming Liu, Jiawei Xu, Xuanhe Pan, Jinlang Wang, Pohsun Feng, Yizhu Wen, Lawrence K. Q. Yan, Hong-Ming Tseng, Xinyuan Song, Jintao Ren, Silin Chen, Yunze Wang, Weiche Hsieh, Bowen Jing, Junjie Yang, et al. (3 additional authors not shown)

    Abstract: Generative Adversarial Networks (GAN) have greatly influenced the development of computer vision and artificial intelligence in the past decade and also connected art and machine intelligence together. This book begins with a detailed introduction to the fundamental principles and historical development of GANs, contrasting them with traditional generative models and elucidating the core adversari…

    Submitted 9 February, 2025; v1 submitted 6 February, 2025; originally announced February 2025.

  26. arXiv:2502.01061  [pdf, other]

    cs.CV

    OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models

    Authors: Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, Chao Liang

    Abstract: End-to-end human animation, such as audio-driven talking human generation, has undergone notable advancements in the recent few years. However, existing methods still struggle to scale up as large general video generation models, limiting their potential in real applications. In this paper, we propose OmniHuman, a Diffusion Transformer-based framework that scales up data by mixing motion-related c…

    Submitted 13 February, 2025; v1 submitted 3 February, 2025; originally announced February 2025.

    Comments: https://omnihuman-lab.github.io/

  27. arXiv:2501.17605  [pdf, other]

    cs.AR

    Towards Reliable Systems: A Scalable Approach to AXI4 Transaction Monitoring

    Authors: Chaoqun Liang, Thomas Benz, Alessandro Ottaviano, Angelo Garofalo, Luca Benini, Davide Rossi

    Abstract: In safety-critical SoC applications such as automotive and aerospace, reliable transaction monitoring is crucial for maintaining system integrity. This paper introduces a drop-in Transaction Monitoring Unit (TMU) for AXI4 subordinate endpoints that detects transaction failures including protocol violations or timeouts and triggers recovery by resetting the affected subordinates. Two TMU variants a…

    Submitted 29 January, 2025; originally announced January 2025.

    Comments: 7 pages, 11 figures, accepted as a regular paper at DATE25

  28. arXiv:2501.10161  [pdf, other]

    cs.AR

    AXI-REALM: Safe, Modular and Lightweight Traffic Monitoring and Regulation for Heterogeneous Mixed-Criticality Systems

    Authors: Thomas Benz, Alessandro Ottaviano, Chaoqun Liang, Robert Balas, Angelo Garofalo, Francesco Restuccia, Alessandro Biondi, Davide Rossi, Luca Benini

    Abstract: The automotive industry is transitioning from federated, homogeneous, interconnected devices to integrated, heterogeneous, mixed-criticality systems (MCS). This leads to challenges in achieving timing predictability techniques due to access contention on shared resources, which can be mitigated using hardware-based spatial and temporal isolation techniques. Focusing on the interconnect as the poin…

    Submitted 17 January, 2025; originally announced January 2025.

    Comments: 14 pages, 13 figures, 3 tables, submitted to the IEEE for possible publication

  29. arXiv:2501.06825  [pdf, other]

    cs.CL

    Event Argument Extraction with Enriched Prompts

    Authors: Chen Liang

    Abstract: This work aims to delve deeper into prompt-based event argument extraction (EAE) models. We explore the impact of incorporating various types of information into the prompt on model performance, including trigger, other role arguments for the same event, and role arguments across multiple events within the same document. Further, we provide the best possible performance that the prompt-based EAE m…

    Submitted 12 January, 2025; originally announced January 2025.

  30. arXiv:2501.02548  [pdf, other]

    cs.LG cs.AI

    AMM: Adaptive Modularized Reinforcement Model for Multi-city Traffic Signal Control

    Authors: Zherui Huang, Yicheng Liu, Chumeng Liang, Guanjie Zheng

    Abstract: Traffic signal control (TSC) is an important and widely studied direction. Recently, reinforcement learning (RL) methods have been used to solve TSC problems and achieve superior performance over conventional TSC methods. However, applying RL methods to the real world is challenging due to the huge cost of experiments in real-world traffic environments. One possible solution is TSC domain adaptati…

    Submitted 5 January, 2025; originally announced January 2025.

  31. arXiv:2501.02476  [pdf, other]

    cs.CV cs.LG

    Noise-Tolerant Hybrid Prototypical Learning with Noisy Web Data

    Authors: Chao Liang, Linchao Zhu, Zongxin Yang, Wei Chen, Yi Yang

    Abstract: We focus on the challenging problem of learning an unbiased classifier from a large number of potentially relevant but noisily labeled web images given only a few clean labeled images. This problem is particularly practical because it reduces the expensive annotation costs by utilizing freely accessible web images with noisy labels. Typically, prototypes are representative images or features used…

    Submitted 5 January, 2025; originally announced January 2025.

    Comments: Accepted by TOMM 2024

  32. arXiv:2501.01631  [pdf, other]

    cs.DB

    Revisiting Data Analysis with Pre-trained Foundation Models

    Authors: Chen Liang, Donghua Yang, Zheng Liang, Zhiyu Liang, Tianle Zhang, Boyu Xiao, Yuqing Yang, Wenqi Wang, Hongzhi Wang

    Abstract: Data analysis focuses on harnessing advanced statistics, programming, and machine learning techniques to extract valuable insights from vast datasets. An increasing volume and variety of research emerged, addressing datasets of diverse modalities, formats, scales, and resolutions across various industries. However, experienced data analysts often find themselves overwhelmed by intricate details in…

    Submitted 2 January, 2025; originally announced January 2025.

    Comments: 22 pages, 7 figures

  33. arXiv:2501.00584  [pdf, other]

    cs.CV cs.LG

    Online Video Understanding: OVBench and VideoChat-Online

    Authors: Zhenpeng Huang, Xinhao Li, Jiaqi Li, Jing Wang, Xiangyu Zeng, Cheng Liang, Tao Wu, Xi Chen, Liang Li, Limin Wang

    Abstract: Multimodal Large Language Models (MLLMs) have significantly progressed in offline video understanding. However, applying these models to real-world scenarios, such as autonomous driving and human-computer interaction, presents unique challenges due to the need for real-time processing of continuous online video streams. To this end, this paper presents systematic efforts from three perspectives: e…

    Submitted 17 April, 2025; v1 submitted 31 December, 2024; originally announced January 2025.

    Comments: CVPR 2025 Camera Ready Version. Project Page: https://videochat-online.github.io

  34. arXiv:2412.17022  [pdf, other]

    cs.CV

    FriendsQA: A New Large-Scale Deep Video Understanding Dataset with Fine-grained Topic Categorization for Story Videos

    Authors: Zhengqian Wu, Ruizhe Li, Zijun Xu, Zhongyuan Wang, Chunxia Xiao, Chao Liang

    Abstract: Video question answering (VideoQA) aims to answer natural language questions according to the given videos. Although existing models perform well in the factoid VideoQA task, they still face challenges in deep video understanding (DVU) task, which focuses on story videos. Compared to factoid videos, the most significant feature of story videos is storylines, which are composed of complex interacti…

    Submitted 22 December, 2024; originally announced December 2024.

    Comments: Accepted by AAAI 2025

  35. arXiv:2412.16915  [pdf, other]

    cs.CV cs.AI cs.GR cs.SD eess.AS

    FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation

    Authors: Tianyun Zhong, Chao Liang, Jianwen Jiang, Gaojie Lin, Jiaqi Yang, Zhou Zhao

    Abstract: Diffusion-based audio-driven talking avatar methods have recently gained attention for their high-fidelity, vivid, and expressive results. However, their slow inference speed limits practical applications. Despite the development of various distillation techniques for diffusion models, we found that naive diffusion distillation methods do not yield satisfactory results. Distilled models exhibit re…

    Submitted 4 April, 2025; v1 submitted 22 December, 2024; originally announced December 2024.

    Comments: CVPR 2025, Homepage https://fadavatar.github.io/

  36. arXiv:2412.15622  [pdf, other]

    eess.AS cs.CL eess.SP

    TouchASP: Elastic Automatic Speech Perception that Everyone Can Touch

    Authors: Xingchen Song, Chengdong Liang, Binbin Zhang, Pengshen Zhang, ZiYu Wang, Youcheng Ma, Menglong Xu, Lin Wang, Di Wu, Fuping Pan, Dinghao Zhou, Zhendong Peng

    Abstract: Large Automatic Speech Recognition (ASR) models demand a vast number of parameters, copious amounts of data, and significant computational resources during the training process. However, such models can merely be deployed on high-compute cloud platforms and are only capable of performing speech recognition tasks. This leads to high costs and restricted capabilities. In this report, we initially pr…

    Submitted 20 December, 2024; originally announced December 2024.

    Comments: Technical Report

  37. arXiv:2412.13716  [pdf, other]

    q-bio.GN cs.LG

    Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA

    Authors: Lifeng Qiao, Peng Ye, Yuchen Ren, Weiqiang Bai, Chaoqi Liang, Xinzhu Ma, Nanqing Dong, Wanli Ouyang

    Abstract: Foundation models have made significant strides in understanding the genomic language of DNA sequences. However, previous models typically adopt the tokenization methods designed for natural language, which are unsuitable for DNA sequences due to their unique characteristics. In addition, the optimal approach to tokenize DNA remains largely under-explored, and may not be intuitively understood by…

    Submitted 18 December, 2024; originally announced December 2024.

    Comments: Accepted by NeurIPS 2024

  38. arXiv:2412.12571  [pdf, other]

    cs.CV

    ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers

    Authors: Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Chen Liang, Tong Shen, Han Zhang, Huanzhang Dou, Yu Liu, Jingren Zhou

    Abstract: Recent research arXiv:2410.15027 arXiv:2410.23775 has highlighted the inherent in-context generation capabilities of pretrained diffusion transformers (DiTs), enabling them to seamlessly adapt to diverse visual tasks with minimal or no architectural modifications. These capabilities are unlocked by concatenating self-attention tokens across multiple input and target images, combined with grouped a…

    Submitted 17 December, 2024; originally announced December 2024.

    Comments: Tech report. Project page: https://ali-vilab.github.io/ChatDiT-Page/

  39. arXiv:2412.11767  [pdf, other]

    cs.CV

    IDEA-Bench: How Far are Generative Models from Professional Designing?

    Authors: Chen Liang, Lianghua Huang, Jingwu Fang, Huanzhang Dou, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Junge Zhang, Xin Zhao, Yu Liu

    Abstract: Real-world design tasks - such as picture book creation, film storyboard development using character sets, photo retouching, visual effects, and font transfer - are highly diverse and complex, requiring deep interpretation and extraction of various elements from instructions, descriptions, and reference images. The resulting images often implicitly capture key features from references or user inpu…

    Submitted 16 December, 2024; originally announced December 2024.

  40. arXiv:2412.09656  [pdf, ps, other]

    cs.CV cs.AI

    From Noise to Nuance: Advances in Deep Generative Image Models

    Authors: Benji Peng, Chia Xin Liang, Ziqian Bi, Ming Liu, Yichao Zhang, Tianyang Wang, Keyu Chen, Xinyuan Song, Pohsun Feng

    Abstract: Deep learning-based image generation has undergone a paradigm shift since 2021, marked by fundamental architectural breakthroughs and computational innovations. Through reviewing architectural innovations and empirical results, this paper analyzes the transition from traditional generative methods to advanced architectures, with focus on compute-efficient diffusion models and vision transformer ar…

    Submitted 11 December, 2024; originally announced December 2024.

  41. arXiv:2412.08969  [pdf, other]

    cs.CR cs.LG cs.SE

    Deep Learning Model Security: Threats and Defenses

    Authors: Tianyang Wang, Ziqian Bi, Yichao Zhang, Ming Liu, Weiche Hsieh, Pohsun Feng, Lawrence K. Q. Yan, Yizhu Wen, Benji Peng, Junyu Liu, Keyu Chen, Sen Zhang, Ming Li, Chuanqi Jiang, Xinyuan Song, Junjie Yang, Bowen Jing, Jintao Ren, Junhao Song, Hong-Ming Tseng, Silin Chen, Yunze Wang, Chia Xin Liang, Jiawei Xu, Xuanhe Pan, et al. (2 additional authors not shown)

    Abstract: Deep learning has transformed AI applications but faces critical security challenges, including adversarial attacks, data poisoning, model theft, and privacy leakage. This survey examines these vulnerabilities, detailing their mechanisms and impact on model integrity and confidentiality. Practical implementations, including adversarial examples, label flipping, and backdoor attacks, are explored a…

    Submitted 15 December, 2024; v1 submitted 12 December, 2024; originally announced December 2024.

  42. arXiv:2412.02734  [pdf, other]

    cs.CV cs.RO

    MVCTrack: Boosting 3D Point Cloud Tracking via Multimodal-Guided Virtual Cues

    Authors: Zhaofeng Hu, Sifan Zhou, Shibo Zhao, Zhihang Yuan, Ci-Jyun Liang

    Abstract: 3D single object tracking is essential in autonomous driving and robotics. Existing methods often struggle with sparse and incomplete point cloud scenarios. To address these limitations, we propose a Multimodal-guided Virtual Cues Projection (MVCP) scheme that generates virtual cues to enrich sparse point clouds. Additionally, we introduce an enhanced tracker MVCTrack based on the generated virtua…

    Submitted 24 March, 2025; v1 submitted 3 December, 2024; originally announced December 2024.

    Comments: Accepted by ICRA 2025

  43. arXiv:2412.02187  [pdf, other]

    cs.LG

    Deep Learning, Machine Learning, Advancing Big Data Analytics and Management

    Authors: Weiche Hsieh, Ziqian Bi, Keyu Chen, Benji Peng, Sen Zhang, Jiawei Xu, Jinlang Wang, Caitlyn Heqi Yin, Yichao Zhang, Pohsun Feng, Yizhu Wen, Tianyang Wang, Ming Li, Chia Xin Liang, Jintao Ren, Qian Niu, Silin Chen, Lawrence K. Q. Yan, Han Xu, Hong-Ming Tseng, Xinyuan Song, Bowen Jing, Junjie Yang, Junhao Song, Junyu Liu, et al. (1 additional author not shown)

    Abstract: Advancements in artificial intelligence, machine learning, and deep learning have catalyzed the transformation of big data analytics and management into pivotal domains for research and application. This work explores the theoretical foundations, methodological advancements, and practical implementations of these technologies, emphasizing their role in uncovering actionable insights from massive,…

    Submitted 3 December, 2024; originally announced December 2024.

    Comments: 174 pages

  44. arXiv:2412.00800  [pdf, other]

    cs.LG cs.AI

    A Comprehensive Guide to Explainable AI: From Classical Models to LLMs

    Authors: Weiche Hsieh, Ziqian Bi, Chuanqi Jiang, Junyu Liu, Benji Peng, Sen Zhang, Xuanhe Pan, Jiawei Xu, Jinlang Wang, Keyu Chen, Pohsun Feng, Yizhu Wen, Xinyuan Song, Tianyang Wang, Ming Liu, Junjie Yang, Ming Li, Bowen Jing, Jintao Ren, Junhao Song, Hong-Ming Tseng, Yichao Zhang, Lawrence K. Q. Yan, Qian Niu, Silin Chen, et al. (2 additional authors not shown)

    Abstract: Explainable Artificial Intelligence (XAI) addresses the growing need for transparency and interpretability in AI systems, enabling trust and accountability in decision-making processes. This book offers a comprehensive guide to XAI, bridging foundational concepts with advanced methodologies. It explores interpretability in traditional models such as Decision Trees, Linear Regression, and Support V…

    Submitted 8 December, 2024; v1 submitted 1 December, 2024; originally announced December 2024.

  45. arXiv:2411.15758  [pdf, other]

    cs.AI cs.CY cs.SI

    Decoding Urban Industrial Complexity: Enhancing Knowledge-Driven Insights via IndustryScopeGPT

    Authors: Siqi Wang, Chao Liang, Yunfan Gao, Yang Liu, Jing Li, Haofen Wang

    Abstract: Industrial parks are critical to urban economic growth. Yet, their development often encounters challenges stemming from imbalances between industrial requirements and urban services, underscoring the need for strategic planning and operations. This paper introduces IndustryScopeKG, a pioneering large-scale multi-modal, multi-level industrial park knowledge graph, which integrates diverse urban da…

    Submitted 24 November, 2024; originally announced November 2024.

    Comments: 9 pages, 6 figures, the 32nd ACM International Conference on Multimedia

    ACM Class: I.2.0; I.2.7; H.3.3; H.4.0

    Journal ref: In Proceedings of the 32nd ACM International Conference on Multimedia, pp. 4757-4765 (2024, October)

  46. arXiv:2411.15194  [pdf, other]

    cs.LG cs.AI cs.CL cs.LO

    Guiding Word Equation Solving using Graph Neural Networks (Extended Technical Report)

    Authors: Parosh Aziz Abdulla, Mohamed Faouzi Atig, Julie Cailler, Chencheng Liang, Philipp Rümmer

    Abstract: This paper proposes a Graph Neural Network-guided algorithm for solving word equations, based on the well-known Nielsen transformation for splitting equations. The algorithm iteratively rewrites the first terms of each side of an equation, giving rise to a tree-like search space. The choice of path at each split point of the tree significantly impacts solving time, motivating the use of Graph Neur…

    Submitted 19 November, 2024; originally announced November 2024.

  47. arXiv:2411.13587  [pdf, other]

    cs.RO cs.AI

    Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics

    Authors: Taowen Wang, Cheng Han, James Chenhao Liang, Wenhao Yang, Dongfang Liu, Luna Xinyu Zhang, Qifan Wang, Jiebo Luo, Ruixiang Tang

    Abstract: Recently in robotics, Vision-Language-Action (VLA) models have emerged as a transformative approach, enabling robots to execute complex tasks by integrating visual and linguistic inputs within an end-to-end learning framework. While VLA models offer significant capabilities, they also introduce new attack surfaces, making them vulnerable to adversarial attacks. With these vulnerabilities largely u…

    Submitted 9 March, 2025; v1 submitted 17 November, 2024; originally announced November 2024.

    Comments: Github: https://github.com/William-wAng618/roboticAttack Homepage: https://vlaattacker.github.io/

  48. arXiv:2411.06284  [pdf, other]

    cs.AI

    A Comprehensive Survey and Guide to Multimodal Large Language Models in Vision-Language Tasks

    Authors: Chia Xin Liang, Pu Tian, Caitlyn Heqi Yin, Yao Yua, Wei An-Hou, Li Ming, Tianyang Wang, Ziqian Bi, Ming Liu

    Abstract: This survey and application guide to multimodal large language models (MLLMs) explores the rapidly developing field of MLLMs, examining their architectures, applications, and impact on AI and Generative Models. Starting with foundational concepts, we delve into how MLLMs integrate various data types, including text, images, video and audio, to enable complex AI systems for cross-modal understanding…

    Submitted 8 December, 2024; v1 submitted 9 November, 2024; originally announced November 2024.

  49. arXiv:2411.05026  [pdf, ps, other]

    cs.CL cs.HC

    Deep Learning and Machine Learning -- Natural Language Processing: From Theory to Application

    Authors: Keyu Chen, Cheng Fei, Ziqian Bi, Junyu Liu, Benji Peng, Sen Zhang, Xuanhe Pan, Jiawei Xu, Jinlang Wang, Caitlyn Heqi Yin, Yichao Zhang, Pohsun Feng, Yizhu Wen, Tianyang Wang, Ming Li, Jintao Ren, Qian Niu, Silin Chen, Weiche Hsieh, Lawrence K. Q. Yan, Chia Xin Liang, Han Xu, Hong-Ming Tseng, Xinyuan Song, Ming Liu

    Abstract: With a focus on natural language processing (NLP) and the role of large language models (LLMs), we explore the intersection of machine learning, deep learning, and artificial intelligence. As artificial intelligence continues to revolutionize fields from healthcare to finance, NLP techniques such as tokenization, text classification, and entity recognition are essential for processing and understa…

    Submitted 17 December, 2024; v1 submitted 30 October, 2024; originally announced November 2024.

    Comments: 252 pages

  50. arXiv:2411.02999  [pdf, other]

    cs.CV

    Precise Drive with VLM: First Prize Solution for PRCV 2024 Drive LM challenge

    Authors: Bin Huang, Siyu Wang, Yuanpeng Chen, Yidan Wu, Hui Song, Zifan Ding, Jing Leng, Chengpeng Liang, Peng Xue, Junliang Zhang, Tiankun Zhao

    Abstract: This technical report outlines the methodologies we applied for the PRCV Challenge, focusing on cognition and decision-making in driving scenarios. We employed InternVL-2.0, a pioneering open-source multi-modal model, and enhanced it by refining both the model input and training methodologies. For the input data, we strategically concatenated and formatted the multi-view images. It is worth mentio…

    Submitted 5 November, 2024; originally announced November 2024.