
Showing 1–50 of 448 results for author: Wu, G

Searching in archive cs.
  1. arXiv:2505.07062  [pdf, ps, other]

    cs.CV cs.AI

    Seed1.5-VL Technical Report

    Authors: Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, Jingji Chen, Jingjia Huang, Kang Lei, Liping Yuan, Lishu Luo, Pengfei Liu, Qinghao Ye, Rui Qian, Shen Yan, Shixiong Zhao, Shuai Peng, Shuangye Li, Sihang Yuan, Sijin Wu, Tianheng Cheng, et al. (172 additional authors not shown)

    Abstract: We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluati… ▽ More

    Submitted 11 May, 2025; originally announced May 2025.

  2. arXiv:2505.05271  [pdf, other]

    cs.CL cs.AI

    T-T: Table Transformer for Tagging-based Aspect Sentiment Triplet Extraction

    Authors: Kun Peng, Chaodong Tong, Cong Cao, Hao Peng, Qian Li, Guanlin Wu, Lei Jiang, Yanbing Liu, Philip S. Yu

    Abstract: Aspect sentiment triplet extraction (ASTE) aims to extract triplets composed of aspect terms, opinion terms, and sentiment polarities from given sentences. The table tagging method is a popular approach to addressing this task, which encodes a sentence into a 2-dimensional table, allowing for the tagging of relations between any two words. Previous efforts have focused on designing various downstr… ▽ More

    Submitted 8 May, 2025; originally announced May 2025.

    Comments: Accepted by IJCAI2025

  3. arXiv:2505.01476  [pdf, other]

    eess.IV cs.AI cs.CV

    CostFilter-AD: Enhancing Anomaly Detection through Matching Cost Filtering

    Authors: Zhe Zhang, Mingxiu Cai, Hanxiao Wang, Gaochang Wu, Tianyou Chai, Xiatian Zhu

    Abstract: Unsupervised anomaly detection (UAD) seeks to localize the anomaly mask of an input image with respect to normal samples. Either by reconstructing normal counterparts (reconstruction-based) or by learning an image feature embedding space (embedding-based), existing approaches fundamentally rely on image-level or feature-level matching to derive anomaly scores. Often, such a matching process is ina… ▽ More

    Submitted 2 May, 2025; originally announced May 2025.

    Comments: 20 pages, 11 figures, 10 tables, accepted by Forty-Second International Conference on Machine Learning (ICML 2025)

  4. arXiv:2504.20406  [pdf, other]

    cs.AI cs.SE

    Skill Discovery for Software Scripting Automation via Offline Simulations with LLMs

    Authors: Paiheng Xu, Gang Wu, Xiang Chen, Tong Yu, Chang Xiao, Franck Dernoncourt, Tianyi Zhou, Wei Ai, Viswanathan Swaminathan

    Abstract: Scripting interfaces enable users to automate tasks and customize software workflows, but creating scripts traditionally requires programming expertise and familiarity with specific APIs, posing barriers for many users. While Large Language Models (LLMs) can generate code from natural language queries, runtime code generation is severely limited due to unverified code, security risks, longer respo… ▽ More

    Submitted 29 April, 2025; originally announced April 2025.

  5. arXiv:2504.19867  [pdf, other]

    cs.CL cs.DC cs.LG

    semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage

    Authors: Ke Hong, Lufang Chen, Zhong Wang, Xiuhong Li, Qiuli Mao, Jianping Ma, Chao Xiong, Guanyu Wu, Buhe Han, Guohao Dai, Yun Liang, Yu Wang

    Abstract: Existing large language model (LLM) serving systems fall into two categories: 1) a unified system where prefill phase and decode phase are co-located on the same GPU, sharing the unified computational resource and storage, and 2) a disaggregated system where the two phases are disaggregated to different GPUs. The design of the disaggregated system addresses the latency interference and sophisticat… ▽ More

    Submitted 28 April, 2025; originally announced April 2025.

    Comments: 18 pages, 16 figures

  6. arXiv:2504.19237  [pdf, other]

    cs.SE

    Deep Reinforcement Learning for Automated Web GUI Testing

    Authors: Zhiyu Gu, Chenxu Liu, Guoquan Wu, Yifei Zhang, ChenXi Yang, Zheheng Liang, Wei Chen, Jun Wei

    Abstract: Automated GUI testing of web applications has always been considered a challenging task considering their large state space and complex interaction logic. Deep Reinforcement Learning (DRL) is a recent extension of Reinforcement Learning (RL), which takes advantage of the powerful learning capabilities of neural networks, making it suitable for complex exploration space. In this paper, leveraging t… ▽ More

    Submitted 27 April, 2025; originally announced April 2025.

    Comments: 12 pages, 7 figures

  7. arXiv:2504.18569  [pdf, other]

    cs.CR cs.AI cs.LG

    Large Language Model Empowered Privacy-Protected Framework for PHI Annotation in Clinical Notes

    Authors: Guanchen Wu, Linzhi Zheng, Han Xie, Zhen Xiang, Jiaying Lu, Darren Liu, Delgersuren Bold, Bo Li, Xiao Hu, Carl Yang

    Abstract: The de-identification of private information in medical data is a crucial process to mitigate the risk of confidentiality breaches, particularly when patient personal details are not adequately removed before the release of medical records. Although rule-based and learning-based methods have been proposed, they often struggle with limited generalizability and require substantial amounts of annotat… ▽ More

    Submitted 21 April, 2025; originally announced April 2025.

    Comments: Shorter version published in MedInfo 2025

  8. arXiv:2504.18096  [pdf, other]

    cs.AI cs.LG

    Combating the Bucket Effect: Multi-Knowledge Alignment for Medication Recommendation

    Authors: Xiang Li, Haixu Ma, Guanyong Wu, Shi Mu, Chen Li, Shunpan Liang

    Abstract: Medication recommendation is crucial in healthcare, offering effective treatments based on patient's electronic health records (EHR). Previous studies show that integrating more medication-related knowledge improves medication representation accuracy. However, not all medications encompass multiple types of knowledge data simultaneously. For instance, some medications provide only textual descript… ▽ More

    Submitted 25 April, 2025; originally announced April 2025.

    Comments: 18 pages, 5 figures

  9. RGB-D Tracking via Hierarchical Modality Aggregation and Distribution Network

    Authors: Boyue Xu, Yi Xu, Ruichao Hou, Jia Bei, Tongwei Ren, Gangshan Wu

    Abstract: The integration of dual-modal features has been pivotal in advancing RGB-Depth (RGB-D) tracking. However, current trackers are less efficient and focus solely on single-level features, resulting in weaker robustness in fusion and slower speeds that fail to meet the demands of real-world applications. In this paper, we introduce a novel network, denoted as HMAD (Hierarchical Modality Aggregation an… ▽ More

    Submitted 24 April, 2025; originally announced April 2025.

  10. RGB-D Video Object Segmentation via Enhanced Multi-store Feature Memory

    Authors: Boyue Xu, Ruichao Hou, Tongwei Ren, Gangshan Wu

    Abstract: The RGB-Depth (RGB-D) Video Object Segmentation (VOS) aims to integrate the fine-grained texture information of RGB with the spatial geometric clues of depth modality, boosting the performance of segmentation. However, off-the-shelf RGB-D segmentation methods fail to fully explore cross-modal information and suffer from object drift during long-term prediction. In this paper, we propose a novel RG… ▽ More

    Submitted 23 April, 2025; originally announced April 2025.

  11. arXiv:2504.16322  [pdf, other]

    cs.NI cs.MM

    BAROC: Concealing Packet Losses in LSNs with Bimodal Behavior Awareness for Livecast Ingestion

    Authors: Haoyuan Zhao, Jianxin Shi, Guanzhen Wu, Hao Fang, Yi Ching Chou, Long Chen, Feng Wang, Jiangchuan Liu

    Abstract: The advent of Low-Earth Orbit satellite networks (LSNs), exemplified by initiatives like Starlink, OneWeb and Kuiper, has ushered in a new era of "Internet from Space" global connectivity. Recent studies have shown that LSNs are capable of providing unprecedented download capacity and low latency to support Livecast viewing. However, Livecast ingestion still faces significant… ▽ More

    Submitted 22 April, 2025; originally announced April 2025.

    Comments: This is the preprint version of the paper accepted to IEEE INFOCOM 2025

  12. arXiv:2504.14582  [pdf, other]

    cs.CV

    NTIRE 2025 Challenge on Image Super-Resolution ($\times$4): Methods and Results

    Authors: Zheng Chen, Kai Liu, Jue Gong, Jingkai Wang, Lei Sun, Zongwei Wu, Radu Timofte, Yulun Zhang, Xiangyu Kong, Xiaoxuan Yu, Hyunhee Park, Suejin Han, Hakjae Jeon, Dafeng Zhang, Hyung-Ju Chun, Donghun Ryou, Inju Ha, Bohyung Han, Lu Zhao, Yuyi Zhang, Pengyu Yan, Jiawei Hu, Pengwei Liu, Fengjun Guo, Hongyuan Yu, et al. (86 additional authors not shown)

    Abstract: This paper presents the NTIRE 2025 image super-resolution ($\times$4) challenge, one of the associated competitions of the 10th NTIRE Workshop at CVPR 2025. The challenge aims to recover high-resolution (HR) images from low-resolution (LR) counterparts generated through bicubic downsampling with a $\times$4 scaling factor. The objective is to develop effective network designs or solutions that ach… ▽ More

    Submitted 28 April, 2025; v1 submitted 20 April, 2025; originally announced April 2025.

    Comments: NTIRE 2025 webpage: https://www.cvlai.net/ntire/2025. Code: https://github.com/zhengchen1999/NTIRE2025_ImageSR_x4

  13. arXiv:2504.13420  [pdf, other]

    cs.RO cs.SE

    Testing the Fault-Tolerance of Multi-Sensor Fusion Perception in Autonomous Driving Systems

    Authors: Haoxiang Tian, Wenqiang Ding, Xingshuo Han, Guoquan Wu, An Guo, Junqi Zhang, Wei Chen, Jun Wei, Tianwei Zhang

    Abstract: High-level Autonomous Driving Systems (ADSs), such as Google Waymo and Baidu Apollo, typically rely on multi-sensor fusion (MSF) based approaches to perceive their surroundings. This strategy increases perception robustness by combining the respective strengths of the camera and LiDAR and directly affects the safety-critical driving decisions of autonomous vehicles (AVs). However, in real-world au… ▽ More

    Submitted 17 April, 2025; originally announced April 2025.

  14. arXiv:2504.11914  [pdf, other]

    cs.CV

    AnomalyR1: A GRPO-based End-to-end MLLM for Industrial Anomaly Detection

    Authors: Yuhao Chao, Jie Liu, Jie Tang, Gangshan Wu

    Abstract: Industrial Anomaly Detection (IAD) poses a formidable challenge due to the scarcity of defective samples, making it imperative to deploy models capable of robust generalization to detect unseen anomalies effectively. Traditional approaches, often constrained by hand-crafted features or domain-specific expert models, struggle to address this limitation, underscoring the need for a paradigm shift. W… ▽ More

    Submitted 16 April, 2025; originally announced April 2025.

  15. arXiv:2504.11571  [pdf, other]

    cs.AI cs.CL

    GraphicBench: A Planning Benchmark for Graphic Design with Language Agents

    Authors: Dayeon Ki, Tianyi Zhou, Marine Carpuat, Gang Wu, Puneet Mathur, Viswanathan Swaminathan

    Abstract: Large Language Model (LLM)-powered agents have unlocked new possibilities for automating human tasks. While prior work has focused on well-defined tasks with specified goals, the capabilities of agents in creative design tasks with open-ended goals remain underexplored. We introduce GraphicBench, a new planning benchmark for graphic design that covers 1,079 user queries and input images across fou… ▽ More

    Submitted 15 April, 2025; originally announced April 2025.

    Comments: 41 pages, 11 figures

  16. arXiv:2504.11346  [pdf, other]

    cs.CV

    Seedream 3.0 Technical Report

    Authors: Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xuanda Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao, Zhonghua Zhai, et al. (6 additional authors not shown)

    Abstract: We present Seedream 3.0, a high-performance Chinese-English bilingual image generation foundation model. We develop several technical improvements to address existing challenges in Seedream 2.0, including alignment with complicated prompts, fine-grained typography generation, suboptimal visual aesthetics and fidelity, and limited image resolutions. Specifically, the advancements of Seedream 3.0 st… ▽ More

    Submitted 16 April, 2025; v1 submitted 15 April, 2025; originally announced April 2025.

    Comments: Seedream 3.0 Technical Report

  17. arXiv:2504.09973  [pdf, other]

    cs.CV

    Beyond Degradation Redundancy: Contrastive Prompt Learning for All-in-One Image Restoration

    Authors: Gang Wu, Junjun Jiang, Kui Jiang, Xianming Liu, Liqiang Nie

    Abstract: All-in-one image restoration, addressing diverse degradation types with a unified model, presents significant challenges in designing task-specific prompts that effectively guide restoration across multiple degradation scenarios. While adaptive prompt learning enables end-to-end optimization, it often yields overlapping or redundant task representations. Conversely, explicit prompts derived from p… ▽ More

    Submitted 14 April, 2025; originally announced April 2025.

    Comments: Project page: https://github.com/Aitical/CPLIR

  18. TickIt: Leveraging Large Language Models for Automated Ticket Escalation

    Authors: Fengrui Liu, Xiao He, Tieying Zhang, Jianjun Chen, Yi Li, Lihua Yi, Haipeng Zhang, Gang Wu, Rui Shi

    Abstract: In large-scale cloud service systems, support tickets serve as a critical mechanism for resolving customer issues and maintaining service quality. However, traditional manual ticket escalation processes encounter significant challenges, including inefficiency, inaccuracy, and difficulty in handling the high volume and complexity of tickets. While previous research has proposed various machine lear… ▽ More

    Submitted 11 April, 2025; originally announced April 2025.

    Comments: 33rd ACM International Conference on the Foundations of Software Engineering

  19. arXiv:2504.08215  [pdf, other]

    stat.ML cs.LG math.ST

    Deep Distributional Learning with Non-crossing Quantile Network

    Authors: Guohao Shen, Runpeng Dai, Guojun Wu, Shikai Luo, Chengchun Shi, Hongtu Zhu

    Abstract: In this paper, we introduce a non-crossing quantile (NQ) network for conditional distribution learning. By leveraging non-negative activation functions, the NQ network ensures that the learned distributions remain monotonic, effectively addressing the issue of quantile crossing. Furthermore, the NQ network-based deep distributional learning framework is highly adaptable, applicable to a wide range… ▽ More

    Submitted 10 April, 2025; originally announced April 2025.

  20. arXiv:2504.05878  [pdf, other]

    cs.MM cs.CV

    KAN-SAM: Kolmogorov-Arnold Network Guided Segment Anything Model for RGB-T Salient Object Detection

    Authors: Xingyuan Li, Ruichao Hou, Tongwei Ren, Gangshan Wu

    Abstract: Existing RGB-thermal salient object detection (RGB-T SOD) methods aim to identify visually significant objects by leveraging both RGB and thermal modalities to enable robust performance in complex scenarios, but they often suffer from limited generalization due to the constrained diversity of available datasets and the inefficiencies in constructing multi-modal representations. In this paper, we p… ▽ More

    Submitted 8 April, 2025; originally announced April 2025.

    Comments: This paper is accepted by ICME2025

  21. arXiv:2504.04869  [pdf, other]

    cs.CV

    Content-Aware Transformer for All-in-one Image Restoration

    Authors: Gang Wu, Junjun Jiang, Kui Jiang, Xianming Liu

    Abstract: Image restoration has witnessed significant advancements with the development of deep learning models. Although Transformer architectures have progressed considerably in recent years, challenges remain, particularly the limited receptive field in window-based self-attention. In this work, we propose DSwinIR, a Deformable Sliding window Transformer for Image Restoration. DSwinIR introduces a novel… ▽ More

    Submitted 7 April, 2025; originally announced April 2025.

  22. arXiv:2504.04422  [pdf, other]

    cs.CR cs.SE

    LeakGuard: Detecting Memory Leaks Accurately and Scalably

    Authors: Hongliang Liang, Luming Yin, Guohao Wu, Yuxiang Li, Qiuping Yi, Lei Wang

    Abstract: Memory leaks are prevalent in various real-world software projects, thereby leading to serious attacks like denial-of-service. Though prior methods for detecting memory leaks made significant advance, they often suffer from low accuracy and weak scalability for testing large and complex programs. In this paper we present LeakGuard, a memory leak detection tool which provides satisfactory balance o… ▽ More

    Submitted 6 April, 2025; originally announced April 2025.

    Comments: 21 pages, 5 figures, conference paper on memory leak detection

  23. Design and Implementation of the Transparent, Interpretable, and Multimodal (TIM) AR Personal Assistant

    Authors: Erin McGowan, Joao Rulff, Sonia Castelo, Guande Wu, Shaoyu Chen, Roque Lopez, Bea Steers, Iran R. Roman, Fabio F. Dias, Jing Qian, Parikshit Solunke, Michael Middleton, Ryan McKendrick, Claudio T. Silva

    Abstract: The concept of an AI assistant for task guidance is rapidly shifting from a science fiction staple to an impending reality. Such a system is inherently complex, requiring models for perceptual grounding, attention, and reasoning, an intuitive interface that adapts to the performer's needs, and the orchestration of data streams from many sensors. Moreover, all data acquired by the system must be re… ▽ More

    Submitted 2 April, 2025; originally announced April 2025.

    Comments: Copyright 2025 IEEE. All rights reserved, including rights for text and data mining and training of artificial intelligence and similar technologies. Personal use is permitted, but republication/redistribution requires IEEE permission. Article accepted for publication in IEEE Computer Graphics and Applications. This is the author's version, content may change prior to final publication

  24. ScreenLLM: Stateful Screen Schema for Efficient Action Understanding and Prediction

    Authors: Yiqiao Jin, Stefano Petrangeli, Yu Shen, Gang Wu

    Abstract: Graphical User Interface (GUI) agents are autonomous systems that interpret and generate actions, enabling intelligent user assistance and automation. Effective training of these agents presents unique challenges, such as sparsity in supervision signals, scalability for large datasets, and the need for nuanced user understanding. We propose stateful screen schema, an efficient representation of GUI… ▽ More

    Submitted 26 March, 2025; originally announced March 2025.

    Comments: Accepted to MM4SG Workshop at The Web Conference 2025

  25. Progressive Human Motion Generation Based on Text and Few Motion Frames

    Authors: Ling-An Zeng, Gaojie Wu, Ancong Wu, Jian-Fang Hu, Wei-Shi Zheng

    Abstract: Although existing text-to-motion (T2M) methods can produce realistic human motion from text description, it is still difficult to align the generated motion with the desired postures since using text alone is insufficient for precisely describing diverse postures. To achieve more controllable generation, an intuitive way is to allow the user to input a few motion frames describing precise desired… ▽ More

    Submitted 30 March, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

    Comments: Accepted to IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2025

  26. arXiv:2503.11571  [pdf, other]

    cs.CV cs.AI

    RASA: Replace Anyone, Say Anything -- A Training-Free Framework for Audio-Driven and Universal Portrait Video Editing

    Authors: Tianrui Pan, Lin Liu, Jie Liu, Xiaopeng Zhang, Jie Tang, Gangshan Wu, Qi Tian

    Abstract: Portrait video editing focuses on modifying specific attributes of portrait videos, guided by audio or video streams. Previous methods typically either concentrate on lip-region reenactment or require training specialized models to extract keypoints for motion transfer to a new identity. In this paper, we introduce a training-free universal portrait video editing framework that provides a versatil… ▽ More

    Submitted 14 March, 2025; originally announced March 2025.

    Comments: Demo is available at https://alice01010101.github.io/RASA/

  27. arXiv:2503.08010  [pdf, other]

    cs.CV cs.AI

    SKALD: Learning-Based Shot Assembly for Coherent Multi-Shot Video Creation

    Authors: Chen Yi Lu, Md Mehrab Tanjim, Ishita Dasgupta, Somdeb Sarkhel, Gang Wu, Saayan Mitra, Somali Chaterji

    Abstract: We present SKALD, a multi-shot video assembly method that constructs coherent video sequences from candidate shots with minimal reliance on text. Central to our approach is the Learned Clip Assembly (LCA) score, a learning-based metric that measures temporal and semantic relationships between shots to quantify narrative coherence. We tackle the exponential complexity of combining multiple shots wi… ▽ More

    Submitted 10 March, 2025; originally announced March 2025.

  28. arXiv:2503.07703  [pdf, other]

    cs.CV

    Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model

    Authors: Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao, Linjie Yang, Zhonghua Zhai, Xinyu Zhang, Qi Zhang, Yuwei Zhang, et al. (3 additional authors not shown)

    Abstract: Rapid advancement of diffusion models has catalyzed remarkable progress in the field of image generation. However, prevalent models such as Flux, SD3.5 and Midjourney, still grapple with issues like model bias, limited text rendering capabilities, and insufficient understanding of Chinese cultural nuances. To address these limitations, we present Seedream 2.0, a native Chinese-English bilingual im… ▽ More

    Submitted 10 March, 2025; originally announced March 2025.

    Comments: Official Page: https://team.doubao.com/tech/seedream

  29. arXiv:2503.07597  [pdf, other]

    cs.CV

    HumanMM: Global Human Motion Recovery from Multi-shot Videos

    Authors: Yuhong Zhang, Guanlin Wu, Ling-Hao Chen, Zhuokai Zhao, Jing Lin, Xiaoke Jiang, Jiamin Wu, Zhuoheng Li, Hao Frank Yang, Haoqian Wang, Lei Zhang

    Abstract: In this paper, we present a novel framework designed to reconstruct long-sequence 3D human motion in the world coordinates from in-the-wild videos with multiple shot transitions. Such long-sequence in-the-wild motions are highly valuable to applications such as motion generation and motion understanding, but are of great challenge to be recovered due to abrupt shot transitions, partial occlusions,… ▽ More

    Submitted 10 March, 2025; originally announced March 2025.

    Comments: CVPR 2025; Project page: https://zhangyuhong01.github.io/HumanMM/

  30. arXiv:2503.06896  [pdf, other]

    cs.CV

    CATANet: Efficient Content-Aware Token Aggregation for Lightweight Image Super-Resolution

    Authors: Xin Liu, Jie Liu, Jie Tang, Gangshan Wu

    Abstract: Transformer-based methods have demonstrated impressive performance in low-level visual tasks such as Image Super-Resolution (SR). However, its computational complexity grows quadratically with the spatial resolution. A series of works attempt to alleviate this problem by dividing Low-Resolution images into local windows, axial stripes, or dilated windows. SR typically leverages the redundancy of i… ▽ More

    Submitted 10 March, 2025; originally announced March 2025.

    Comments: Accepted by CVPR2025

  31. arXiv:2503.06477  [pdf, other]

    cs.CV cs.AI

    PDB: Not All Drivers Are the Same -- A Personalized Dataset for Understanding Driving Behavior

    Authors: Chuheng Wei, Ziye Qin, Siyan Li, Ziyan Zhang, Xuanpeng Zhao, Amr Abdelraouf, Rohit Gupta, Kyungtae Han, Matthew J. Barth, Guoyuan Wu

    Abstract: Driving behavior is inherently personal, influenced by individual habits, decision-making styles, and physiological states. However, most existing datasets treat all drivers as homogeneous, overlooking driver-specific variability. To address this gap, we introduce the Personalized Driving Behavior (PDB) dataset, a multi-modal dataset designed to capture personalization in driving behavior under na… ▽ More

    Submitted 9 March, 2025; originally announced March 2025.

  32. arXiv:2503.04344  [pdf, other]

    cs.CV

    LEDiT: Your Length-Extrapolatable Diffusion Transformer without Positional Encoding

    Authors: Shen Zhang, Yaning Tan, Siyuan Liang, Zhaowei Chen, Linze Li, Ge Wu, Yuhao Chen, Shuheng Li, Zhenyu Zhao, Caihua Chen, Jiajun Liang, Yao Tang

    Abstract: Diffusion transformers (DiTs) struggle to generate images at resolutions higher than their training resolutions. The primary obstacle is that the explicit positional encodings (PE), such as RoPE, need extrapolation which degrades performance when the inference resolution differs from training. In this paper, we propose a Length-Extrapolatable Diffusion Transformer (LEDiT), a simple yet powerful archi… ▽ More

    Submitted 7 March, 2025; v1 submitted 6 March, 2025; originally announced March 2025.

  33. OCL: Ordinal Contrastive Learning for Imputating Features with Progressive Labels

    Authors: Seunghun Baek, Jaeyoon Sim, Guorong Wu, Won Hwa Kim

    Abstract: Accurately discriminating progressive stages of Alzheimer's Disease (AD) is crucial for early diagnosis and prevention. It often involves multiple imaging modalities to understand the complex pathology of AD, however, acquiring a complete set of images is challenging due to high cost and burden for subjects. In the end, missing data become inevitable which lead to limited sample-size and decrease… ▽ More

    Submitted 3 March, 2025; originally announced March 2025.

    Comments: MICCAI 2024 (Provisional Accept)

  34. Modality-Agnostic Style Transfer for Holistic Feature Imputation

    Authors: Seunghun Baek, Jaeyoon Sim, Mustafa Dere, Minjeong Kim, Guorong Wu, Won Hwa Kim

    Abstract: Characterizing a preclinical stage of Alzheimer's Disease (AD) via single imaging is difficult as its early symptoms are quite subtle. Therefore, many neuroimaging studies are curated with various imaging modalities, e.g., MRI and PET, however, it is often challenging to acquire all of them from all subjects and missing data become inevitable. In this regard, in this paper, we propose a framework… ▽ More

    Submitted 3 March, 2025; originally announced March 2025.

    Comments: ISBI 2024 (oral)

  35. arXiv:2503.01754  [pdf, other]

    cs.CV

    SDRT: Enhance Vision-Language Models by Self-Distillation with Diverse Reasoning Traces

    Authors: Guande Wu, Huan Song, Yawei Wang, Qiaojing Yan, Yijun Tian, Lin Lee Cheong, Panpan Xu

    Abstract: Reasoning is increasingly crucial for various tasks. While chain-of-thought prompting enables large language models to leverage reasoning effectively, harnessing the reasoning capabilities of Vision-Language Models (VLMs) remains challenging. To solve this problem, we propose a novel self-distillation framework that enhances the reasoning capabilities of the model. The proposed framework introduce… ▽ More

    Submitted 19 March, 2025; v1 submitted 3 March, 2025; originally announced March 2025.

  36. arXiv:2503.01565  [pdf, other]

    cs.CV eess.IV

    AutoLUT: LUT-Based Image Super-Resolution with Automatic Sampling and Adaptive Residual Learning

    Authors: Yuheng Xu, Shijie Yang, Xin Liu, Jie Liu, Jie Tang, Gangshan Wu

    Abstract: In recent years, the increasing popularity of Hi-DPI screens has driven a rising demand for high-resolution images. However, the limited computational power of edge devices poses a challenge in deploying complex super-resolution neural networks, highlighting the need for efficient methods. While prior works have made significant progress, they have not fully exploited pixel-level information. More… ▽ More

    Submitted 7 March, 2025; v1 submitted 3 March, 2025; originally announced March 2025.

    Comments: Accepted by CVPR2025

  37. Learning Covariance-Based Multi-Scale Representation of Neuroimaging Measures for Alzheimer Classification

    Authors: Seunghun Baek, Injun Choi, Mustafa Dere, Minjeong Kim, Guorong Wu, Won Hwa Kim

    Abstract: Stacking excessive layers in DNN results in highly underdetermined system when training samples are limited, which is very common in medical applications. In this regard, we present a framework capable of deriving an efficient high-dimensional space with reasonable increase in model size. This is done by utilizing a transform (i.e., convolution) that leverages scale-space theory with covariance st… ▽ More

    Submitted 3 March, 2025; originally announced March 2025.

    Comments: ISBI 2023

  38. arXiv:2503.01210  [pdf, other]

    cs.CV

    Every SAM Drop Counts: Embracing Semantic Priors for Multi-Modality Image Fusion and Beyond

    Authors: Guanyao Wu, Haoyu Liu, Hongming Fu, Yichuan Peng, Jinyuan Liu, Xin Fan, Risheng Liu

    Abstract: Multi-modality image fusion, particularly infrared and visible, plays a crucial role in integrating diverse modalities to enhance scene understanding. Although early research prioritized visual quality, preserving fine details and adapting to downstream tasks remains challenging. Recent approaches attempt task-specific design but rarely achieve "The Best of Both Worlds" due to inconsistent optimiz… ▽ More

    Submitted 25 March, 2025; v1 submitted 3 March, 2025; originally announced March 2025.

  39. arXiv:2503.01143  [pdf, other]

    cs.LG

    DPR: Diffusion Preference-based Reward for Offline Reinforcement Learning

    Authors: Teng Pang, Bingzheng Wang, Guoqiang Wu, Yilong Yin

    Abstract: Offline preference-based reinforcement learning (PbRL) mitigates the need for reward definition, aligning with human preferences via preference-driven reward feedback without interacting with the environment. However, the effectiveness of preference-driven reward functions depends on the modeling ability of the learning model, which current MLP-based and Transformer-based methods may fail to adequ… ▽ More

    Submitted 13 May, 2025; v1 submitted 2 March, 2025; originally announced March 2025.

  40. arXiv:2503.00510  [pdf, other]

    eess.IV cs.CV

    NeuroSymAD: A Neuro-Symbolic Framework for Interpretable Alzheimer's Disease Diagnosis

    Authors: Yexiao He, Ziyao Wang, Yuning Zhang, Tingting Dan, Tianlong Chen, Guorong Wu, Ang Li

    Abstract: Alzheimer's disease (AD) diagnosis is complex, requiring the integration of imaging and clinical data for accurate assessment. While deep learning has shown promise in brain MRI analysis, it often functions as a black box, limiting interpretability and lacking mechanisms to effectively integrate critical clinical data such as biomarkers, medical history, and demographic information. To bridge this… ▽ More

    Submitted 1 March, 2025; originally announced March 2025.

  41. arXiv:2502.18167  [pdf, ps, other]

    cs.LG stat.ML

    Sharper Concentration Inequalities for Multi-Graph Dependent Variables

    Authors: Xiao Shao, Guoqiang Wu

    Abstract: In multi-task learning (MTL) with each task involving graph-dependent data, generalization results of existing theoretical analyses yield a sub-optimal risk bound of $O(\frac{1}{\sqrt{n}})$, where $n$ is the number of training samples. This is attributed to the lack of a foundational sharper concentration inequality for multi-graph dependent random variables. To fill this gap, this paper proposes a… ▽ More

    Submitted 25 February, 2025; originally announced February 2025.

    Comments: 34 pages

  42. arXiv:2502.14583  [pdf, other]

    cs.LG cs.AI

    A Theory for Conditional Generative Modeling on Multiple Data Sources

    Authors: Rongzhen Wang, Yan Zhang, Chenyu Zheng, Chongxuan Li, Guoqiang Wu

    Abstract: The success of large generative models has driven a paradigm shift, leveraging massive multi-source data to enhance model capabilities. However, the interaction among these sources remains theoretically underexplored. This paper takes the first step toward a rigorous analysis of multi-source training in conditional generative modeling, where each condition represents a distinct data source. Specif…

    Submitted 20 February, 2025; originally announced February 2025.

    Comments: 35 pages

  43. arXiv:2502.12202  [pdf, other]

    cs.CL cs.AI cs.LG

    BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor Attack

    Authors: Zihao Zhu, Hongbao Zhang, Mingda Zhang, Ruotong Wang, Guanzong Wu, Ke Xu, Baoyuan Wu

    Abstract: Longer thought, better performance: large language models with deep reasoning capabilities, particularly o1-like models, have demonstrated remarkable performance by generating extensive thought processes during inference. This trade-off reveals a potential vulnerability: adversaries could compromise model performance by forcing immediate responses without thought processes. To this end, in this pa…

    Submitted 16 February, 2025; originally announced February 2025.

  44. arXiv:2502.10606  [pdf, other]

    cs.CV cs.RO

    HIPPo: Harnessing Image-to-3D Priors for Model-free Zero-shot 6D Pose Estimation

    Authors: Yibo Liu, Zhaodong Jiang, Binbin Xu, Guile Wu, Yuan Ren, Tongtong Cao, Bingbing Liu, Rui Heng Yang, Amir Rasouli, Jinjun Shan

    Abstract: This work focuses on model-free zero-shot 6D object pose estimation for robotics applications. While existing methods can estimate the precise 6D pose of objects, they heavily rely on curated CAD models or reference images, the preparation of which is a time-consuming and labor-intensive process. Moreover, in real-world scenarios, 3D models or reference images may not be available in advance and i…

    Submitted 14 February, 2025; originally announced February 2025.

  45. arXiv:2502.06171  [pdf]

    eess.IV cs.CV

    A Data-Efficient Pan-Tumor Foundation Model for Oncology CT Interpretation

    Authors: Wenhui Lei, Hanyu Chen, Zitian Zhang, Luyang Luo, Qiong Xiao, Yannian Gu, Peng Gao, Yankai Jiang, Ci Wang, Guangtao Wu, Tongjia Xu, Yingjie Zhang, Xiaofan Zhang, Pranav Rajpurkar, Shaoting Zhang, Zhenning Wang

    Abstract: Artificial intelligence-assisted imaging analysis has made substantial strides in tumor diagnosis and management. Here we present PASTA, a pan-tumor CT foundation model that achieves state-of-the-art performance on 45 of 46 representative oncology tasks -- including lesion segmentation, tumor detection in plain CT, tumor staging, survival prediction, structured report generation, and cross-modalit…

    Submitted 10 February, 2025; originally announced February 2025.

    Comments: 57 pages, 7 figures

  46. arXiv:2501.16391  [pdf, other]

    cs.LG cs.AI q-bio.BM

    Inductive-Associative Meta-learning Pipeline with Human Cognitive Patterns for Unseen Drug-Target Interaction Prediction

    Authors: Xiaoqing Lian, Jie Zhu, Tianxu Lv, Shiyun Nie, Hang Fan, Guosheng Wu, Yunjun Ge, Lihua Li, Xiangxiang Zeng, Xiang Pan

    Abstract: Significant differences in protein structures hinder the generalization of existing drug-target interaction (DTI) models, which often rely heavily on pre-learned binding principles or detailed annotations. In contrast, BioBridge designs an Inductive-Associative pipeline inspired by the workflow of scientists who base their accumulated expertise on drawing insights into novel drug-target pairs from…

    Submitted 27 March, 2025; v1 submitted 26 January, 2025; originally announced January 2025.

  47. arXiv:2501.13896  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG

    GUI-Bee: Align GUI Action Grounding to Novel Environments via Autonomous Exploration

    Authors: Yue Fan, Handong Zhao, Ruiyi Zhang, Yu Shen, Xin Eric Wang, Gang Wu

    Abstract: Graphical User Interface (GUI) action grounding is a critical step in GUI automation that maps language instructions to actionable elements on GUI screens. Most recent works of GUI action grounding leverage large GUI datasets to fine-tune MLLMs. However, the fine-tuning data always covers limited GUI environments, and we find the performance of the resulting model deteriorates in novel environment…

    Submitted 27 January, 2025; v1 submitted 23 January, 2025; originally announced January 2025.

  48. arXiv:2501.13183  [pdf, other]

    cs.CV

    MONA: Moving Object Detection from Videos Shot by Dynamic Camera

    Authors: Boxun Hu, Mingze Xia, Ding Zhao, Guanlin Wu

    Abstract: Dynamic urban environments, characterized by moving cameras and objects, pose significant challenges for camera trajectory estimation by complicating the distinction between camera-induced and object motion. We introduce MONA, a novel framework designed for robust moving object detection and segmentation from videos shot by dynamic cameras. MONA comprises two key modules: Dynamic Points Extraction…

    Submitted 22 January, 2025; originally announced January 2025.

  49. arXiv:2501.12420  [pdf, other]

    cs.SE cs.AI cs.LG

    Consolidating TinyML Lifecycle with Large Language Models: Reality, Illusion, or Opportunity?

    Authors: Guanghan Wu, Sasu Tarkoma, Roberto Morabito

    Abstract: The evolving requirements of Internet of Things (IoT) applications are driving an increasing shift toward bringing intelligence to the edge, enabling real-time insights and decision-making within resource-constrained environments. Tiny Machine Learning (TinyML) has emerged as a key enabler of this evolution, facilitating the deployment of ML models on devices such as microcontrollers and embedded…

    Submitted 5 April, 2025; v1 submitted 20 January, 2025; originally announced January 2025.

    Comments: This paper has been accepted for publication in the IEEE Internet of Things Magazine (Special Issue on Applications of Large Language Models in IoT). The copyright will be transferred to IEEE upon publication. A preliminary version of this work was presented at the Edge AI Foundation event "Beyond LLMs and Chatbots: The Journey to Generative AI at the Edge" (https://youtu.be/aFWfisdjQIs).

  50. arXiv:2501.10761  [pdf, other]

    cs.CV

    Infrared and Visible Image Fusion: From Data Compatibility to Task Adaption

    Authors: Jinyuan Liu, Guanyao Wu, Zhu Liu, Di Wang, Zhiying Jiang, Long Ma, Wei Zhong, Xin Fan, Risheng Liu

    Abstract: Infrared-visible image fusion (IVIF) is a critical task in computer vision, aimed at integrating the unique features of both infrared and visible spectra into a unified representation. Since 2018, the field has entered the deep learning era, with an increasing variety of approaches introducing a range of networks and loss functions to enhance visual performance. However, challenges such as data co…

    Submitted 18 January, 2025; originally announced January 2025.