Showing 1–50 of 708 results for author: Zhu, W

  1. arXiv:2505.09655  [pdf, other]

    cs.CL

    DRA-GRPO: Exploring Diversity-Aware Reward Adjustment for R1-Zero-Like Training of Large Language Models

    Authors: Xiwen Chen, Wenhui Zhu, Peijie Qiu, Xuanzhao Dong, Hao Wang, Haiyu Wu, Huayu Li, Aristeidis Sotiras, Yalin Wang, Abolfazl Razi

    Abstract: Recent advances in reinforcement learning for language model post-training, such as Group Relative Policy Optimization (GRPO), have shown promise in low-resource settings. However, GRPO typically relies on solution-level and scalar reward signals that fail to capture the semantic diversity among sampled completions. This leads to what we identify as a diversity-quality inconsistency, where distinc…

    Submitted 13 May, 2025; originally announced May 2025.

  2. arXiv:2505.08220  [pdf]

    cs.LG

    Deep Probabilistic Modeling of User Behavior for Anomaly Detection via Mixture Density Networks

    Authors: Lu Dai, Wenxuan Zhu, Xuehui Quan, Renzi Meng, Sheng Cai, Yichen Wang

    Abstract: To improve the identification of potential anomaly patterns in complex user behavior, this paper proposes an anomaly detection method based on a deep mixture density network. The method constructs a Gaussian mixture model parameterized by a neural network, enabling conditional probability modeling of user behavior. It effectively captures the multimodal distribution characteristics commonly presen…

    Submitted 13 May, 2025; originally announced May 2025.

  3. arXiv:2505.07734  [pdf, ps, other]

    cs.CV

    LAMM-ViT: AI Face Detection via Layer-Aware Modulation of Region-Guided Attention

    Authors: Jiangling Zhang, Weijie Zhu, Jirui Huang, Yaxiong Chen

    Abstract: Detecting AI-synthetic faces presents a critical challenge: it is hard to capture consistent structural relationships between facial regions across diverse generation techniques. Current methods, which focus on specific artifacts rather than fundamental inconsistencies, often fail when confronted with novel generative models. To address this limitation, we introduce Layer-aware Mask Modulation Vis…

    Submitted 12 May, 2025; originally announced May 2025.

  4. arXiv:2505.07674  [pdf]

    cs.LG

    Joint Graph Convolution and Sequential Modeling for Scalable Network Traffic Estimation

    Authors: Nan Jiang, Wenxuan Zhu, Xu Han, Weiqiang Huang, Yumeng Sun

    Abstract: This study focuses on the challenge of predicting network traffic within complex topological environments. It introduces a spatiotemporal modeling approach that integrates Graph Convolutional Networks (GCN) with Gated Recurrent Units (GRU). The GCN component captures spatial dependencies among network nodes, while the GRU component models the temporal evolution of traffic data. This combination al…

    Submitted 12 May, 2025; originally announced May 2025.

  5. arXiv:2505.07062  [pdf, ps, other]

    cs.CV cs.AI

    Seed1.5-VL Technical Report

    Authors: Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, Jingji Chen, Jingjia Huang, Kang Lei, Liping Yuan, Lishu Luo, Pengfei Liu, Qinghao Ye, Rui Qian, Shen Yan, Shixiong Zhao, Shuai Peng, Shuangye Li, Sihang Yuan, Sijin Wu, Tianheng Cheng , et al. (172 additional authors not shown)

    Abstract: We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed of a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluati…

    Submitted 11 May, 2025; originally announced May 2025.

  6. arXiv:2505.06114  [pdf, other]

    cs.LG

    FIC-TSC: Learning Time Series Classification with Fisher Information Constraint

    Authors: Xiwen Chen, Wenhui Zhu, Peijie Qiu, Hao Wang, Huayu Li, Zihan Li, Yalin Wang, Aristeidis Sotiras, Abolfazl Razi

    Abstract: Analyzing time series data is crucial to a wide spectrum of applications, including economics, online marketplaces, and human healthcare. In particular, time series classification plays an indispensable role in segmenting different phases in stock markets, predicting customer behavior, and classifying worker actions and engagement levels. These aspects contribute significantly to the advancement o…

    Submitted 9 May, 2025; originally announced May 2025.

    Comments: Accepted by ICML2025. Pre camera-ready version

  7. arXiv:2504.21252  [pdf, other]

    cs.CL

    Talk Before You Retrieve: Agent-Led Discussions for Better RAG in Medical QA

    Authors: Xuanzhao Dong, Wenhui Zhu, Hao Wang, Xiwen Chen, Peijie Qiu, Rui Yin, Yi Su, Yalin Wang

    Abstract: Medical question answering (QA) is a reasoning-intensive task that remains challenging for large language models (LLMs) due to hallucinations and outdated domain knowledge. Retrieval-Augmented Generation (RAG) provides a promising post-training solution by leveraging external knowledge. However, existing medical RAG systems suffer from two key limitations: (1) a lack of modeling for human-like rea…

    Submitted 29 April, 2025; originally announced April 2025.

  8. arXiv:2504.20020  [pdf, other]

    cs.LG cs.AI

    Modular Machine Learning: An Indispensable Path towards New-Generation Large Language Models

    Authors: Xin Wang, Haoyang Li, Zeyang Zhang, Haibo Chen, Wenwu Zhu

    Abstract: Large language models (LLMs) have dramatically advanced machine learning research including natural language processing, computer vision, data mining, etc., yet they still exhibit critical limitations in reasoning, factual consistency, and interpretability. In this paper, we introduce a novel learning paradigm -- Modular Machine Learning (MML) -- as an essential approach toward new-generation LLMs…

    Submitted 28 April, 2025; originally announced April 2025.

    Comments: 11 pages, 3 figures

  9. arXiv:2504.18049  [pdf, ps, other]

    cs.CV cs.AI

    A BERT-Style Self-Supervised Learning CNN for Disease Identification from Retinal Images

    Authors: Xin Li, Wenhui Zhu, Peijie Qiu, Oana M. Dumitrascu, Amal Youssef, Yalin Wang

    Abstract: In the field of medical imaging, the advent of deep learning, especially the application of convolutional neural networks (CNNs), has revolutionized the analysis and interpretation of medical images. Nevertheless, deep learning methods usually rely on large amounts of labeled data. In medical imaging research, the acquisition of high-quality labels is both expensive and difficult. The introduction…

    Submitted 24 April, 2025; originally announced April 2025.

  10. arXiv:2504.17828  [pdf, other]

    cs.CV cs.AI

    VEU-Bench: Towards Comprehensive Understanding of Video Editing

    Authors: Bozheng Li, Yongliang Wu, Yi Lu, Jiashuo Yu, Licheng Tang, Jiawang Cao, Wenqing Zhu, Yuyang Sun, Jay Wu, Wenbo Zhu

    Abstract: Widely shared videos on the internet are often edited. Recently, although Video Large Language Models (Vid-LLMs) have made great progress in general video understanding tasks, their capabilities in video editing understanding (VEU) tasks remain unexplored. To address this gap, in this paper, we introduce VEU-Bench (Video Editing Understanding Benchmark), a comprehensive benchmark that categorizes…

    Submitted 24 April, 2025; originally announced April 2025.

    Comments: Accepted to CVPR2025

  11. arXiv:2504.15817  [pdf, other]

    cs.CR cs.AR

    EFFACT: A Highly Efficient Full-Stack FHE Acceleration Platform

    Authors: Yi Huang, Xinsheng Gong, Xiangyu Kong, Dibei Chen, Jianfeng Zhu, Wenping Zhu, Liangwei Li, Mingyu Gao, Shaojun Wei, Aoyang Zhang, Leibo Liu

    Abstract: Fully Homomorphic Encryption (FHE) is a set of powerful cryptographic schemes that allows computation to be performed directly on encrypted data with an unlimited depth. Despite FHE's promise in privacy-preserving computing, in most FHE schemes, ciphertext generally blows up thousands of times compared to the original message, and the massive amount of data load from off-chip memory for boot…

    Submitted 22 April, 2025; originally announced April 2025.

    Comments: Accepted by HPCA 2025

  12. arXiv:2504.14868  [pdf, ps, other]

    cs.CV

    Twin Co-Adaptive Dialogue for Progressive Image Generation

    Authors: Jianhui Wang, Yangfan He, Yan Zhong, Xinyuan Song, Jiayi Su, Yuheng Feng, Hongyang He, Wenyu Zhu, Xinhang Yuan, Kuan Lu, Menghao Huo, Miao Zhang, Keqin Li, Jiaqi Chen, Tianyu Shi, Xueqian Wang

    Abstract: Modern text-to-image generation systems have enabled the creation of remarkably realistic and high-quality visuals, yet they often falter when handling the inherent ambiguities in user prompts. In this work, we present Twin-Co, a framework that leverages synchronized, co-adaptive dialogue to progressively refine image generation. Instead of a static generation process, Twin-Co employs a dynamic, i…

    Submitted 21 April, 2025; originally announced April 2025.

  13. arXiv:2504.14783  [pdf, other]

    cs.CV cs.AI eess.IV stat.ML

    How Effective Can Dropout Be in Multiple Instance Learning ?

    Authors: Wenhui Zhu, Peijie Qiu, Xiwen Chen, Zhangsihao Yang, Aristeidis Sotiras, Abolfazl Razi, Yalin Wang

    Abstract: Multiple Instance Learning (MIL) is a popular weakly-supervised method for various applications, with a particular interest in histological whole slide image (WSI) classification. Due to the gigapixel resolution of WSI, applications of MIL in WSI typically necessitate a two-stage training scheme: first, extract features from the pre-trained backbone and then perform MIL aggregation. However, it is…

    Submitted 20 April, 2025; originally announced April 2025.

  14. arXiv:2504.14221  [pdf, other]

    cs.CV

    Real-IAD D3: A Real-World 2D/Pseudo-3D/3D Dataset for Industrial Anomaly Detection

    Authors: Wenbing Zhu, Lidong Wang, Ziqing Zhou, Chengjie Wang, Yurui Pan, Ruoyi Zhang, Zhuhao Chen, Linjie Cheng, Bin-Bin Gao, Jiangning Zhang, Zhenye Gan, Yuxie Wang, Yulong Chen, Shuguang Qian, Mingmin Chi, Bo Peng, Lizhuang Ma

    Abstract: The increasing complexity of industrial anomaly detection (IAD) has positioned multimodal detection methods as a focal area of machine vision research. However, dedicated multimodal datasets specifically tailored for IAD remain limited. Pioneering datasets like MVTec 3D have laid essential groundwork in multimodal IAD by incorporating RGB+3D data, but still face challenges in bridging the gap with…

    Submitted 19 April, 2025; originally announced April 2025.

    Comments: 13 pages. Dataset and code: https://realiad4ad.github.io/Real-IAD D3

  15. arXiv:2504.13914  [pdf, other]

    cs.CL

    Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning

    Authors: ByteDance Seed: Jiaze Chen, Tiantian Fan, Xin Liu, Lingjun Liu, Zhiqi Lin, Mingxuan Wang, Chengyi Wang, Xiangpeng Wei, Wenyuan Xu, Yufeng Yuan, Yu Yue, Lin Yan, Qiying Yu, Xiaochen Zuo, Chi Zhang, Ruofei Zhu, Zhecheng An, Zhihao Bai, Yu Bao, Xingyan Bin, Jiangjie Chen, Feng Chen, Hongmin Chen, et al. (249 additional authors not shown)

    Abstract: We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For in…

    Submitted 29 April, 2025; v1 submitted 10 April, 2025; originally announced April 2025.

  16. arXiv:2504.13629  [pdf, other]

    cs.CL cs.AI econ.GN

    Divergent LLM Adoption and Heterogeneous Convergence Paths in Research Writing

    Authors: Cong William Lin, Wu Zhu

    Abstract: Large Language Models (LLMs), such as ChatGPT, are reshaping content creation and academic writing. This study investigates the impact of AI-assisted generative revisions on research manuscripts, focusing on heterogeneous adoption patterns and their influence on writing convergence. Leveraging a dataset of over 627,000 academic papers from arXiv, we develop a novel classification framework by fine…

    Submitted 18 April, 2025; originally announced April 2025.

  17. arXiv:2504.13479  [pdf, other]

    cs.NI cs.DC cs.LG

    SFL-LEO: Asynchronous Split-Federated Learning Design for LEO Satellite-Ground Network Framework

    Authors: Jiasheng Wu, Jingjing Zhang, Zheng Lin, Zhe Chen, Xiong Wang, Wenjun Zhu, Yue Gao

    Abstract: Recently, the rapid development of LEO satellite networks has spurred another widespread concern: data processing at satellites. However, achieving efficient computation at LEO satellites in highly dynamic satellite networks is challenging and remains an open problem when considering the constrained computation capability of LEO satellites. For the first time, we propose a novel distributed learning fram…

    Submitted 18 April, 2025; originally announced April 2025.

    Comments: 13 pages, 14 figures

  18. arXiv:2504.12680  [pdf, other]

    cs.AI cs.CV

    Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning

    Authors: Baining Zhao, Ziyou Wang, Jianjie Fang, Chen Gao, Fanhang Man, Jinqiang Cui, Xin Wang, Xinlei Chen, Yong Li, Wenwu Zhu

    Abstract: Humans can perceive and reason about spatial relationships from sequential visual observations, such as egocentric video streams. However, how pretrained models acquire such abilities, especially high-level reasoning, remains unclear. This paper introduces Embodied-R, a collaborative framework combining large-scale Vision-Language Models (VLMs) for perception and small-scale Language Models (LMs)…

    Submitted 17 April, 2025; originally announced April 2025.

    Comments: 12 pages, 5 figures

  19. arXiv:2504.12048  [pdf, other]

    cs.CV

    Modular-Cam: Modular Dynamic Camera-view Video Generation with LLM

    Authors: Zirui Pan, Xin Wang, Yipeng Zhang, Hong Chen, Kwan Man Cheng, Yaofei Wu, Wenwu Zhu

    Abstract: Text-to-Video generation, which utilizes the provided text prompt to generate high-quality videos, has drawn increasing attention and achieved great success due to the recent development of diffusion models. Existing methods mainly rely on a pre-trained text encoder to capture the semantic information and perform cross attention with the encoded text prompt to guide the generation of video. Howe…

    Submitted 16 April, 2025; originally announced April 2025.

    Comments: AAAI 2025 Poster

  20. arXiv:2504.11833  [pdf, other]

    cs.CL

    Could Thinking Multilingually Empower LLM Reasoning?

    Authors: Changjiang Gao, Xu Huang, Wenhao Zhu, Shujian Huang, Lei Li, Fei Yuan

    Abstract: Previous work indicates that large language models exhibit a significant "English bias", i.e. they often perform better when tasks are presented in English. Interestingly, we have observed that using certain other languages in reasoning tasks can yield better performance than English. However, this phenomenon remains under-explored. In this paper, we explore the upper bound of harnessing multiling…

    Submitted 16 April, 2025; originally announced April 2025.

  21. arXiv:2504.11473  [pdf, other]

    cs.CV cs.AI

    Visual moral inference and communication

    Authors: Warren Zhu, Aida Ramezani, Yang Xu

    Abstract: Humans can make moral inferences from multiple sources of input. In contrast, automated moral inference in artificial intelligence typically relies on language models with textual input. However, morality is conveyed through modalities beyond language. We present a computational framework that supports moral inference from natural images, demonstrated in two related tasks: 1) inferring human moral…

    Submitted 11 April, 2025; originally announced April 2025.

  22. arXiv:2504.11373  [pdf, other]

    cs.CL cs.CY

    Cancer-Myth: Evaluating AI Chatbot on Patient Questions with False Presuppositions

    Authors: Wang Bill Zhu, Tianqi Chen, Ching Ying Lin, Jade Law, Mazen Jizzini, Jorge J. Nieva, Ruishan Liu, Robin Jia

    Abstract: Cancer patients are increasingly turning to large language models (LLMs) as a new form of internet search for medical information, making it critical to assess how well these models handle complex, personalized questions. However, current medical benchmarks focus on medical exams or consumer-searched questions and do not evaluate LLMs on real patient questions with detailed clinical contexts. In t…

    Submitted 15 April, 2025; originally announced April 2025.

  23. arXiv:2504.11162  [pdf, ps, other]

    eess.SP cs.IT

    Scalable Transceiver Design for Multi-User Communication in FDD Massive MIMO Systems via Deep Learning

    Authors: Lin Zhu, Weifeng Zhu, Shuowen Zhang, Shuguang Cui, Liang Liu

    Abstract: This paper addresses the joint transceiver design, including pilot transmission, channel feature extraction and feedback, as well as precoding, for low-overhead downlink massive multiple-input multiple-output (MIMO) communication in frequency-division duplex (FDD) systems. Although deep learning (DL) has shown great potential in tackling this problem, existing methods often suffer from poor scalab…

    Submitted 15 April, 2025; originally announced April 2025.

  24. arXiv:2504.07102  [pdf, other]

    cs.IR cs.LG

    Behavior Importance-Aware Graph Neural Architecture Search for Cross-Domain Recommendation

    Authors: Chendi Ge, Xin Wang, Ziwei Zhang, Yijian Qin, Hong Chen, Haiyang Wu, Yang Zhang, Yuekui Yang, Wenwu Zhu

    Abstract: Cross-domain recommendation (CDR) mitigates data sparsity and cold-start issues in recommendation systems. While recent CDR approaches using graph neural networks (GNNs) capture complex user-item interactions, they rely on manually designed architectures that are often suboptimal and labor-intensive. Additionally, extracting valuable behavioral information from source domains to improve target dom…

    Submitted 11 March, 2025; originally announced April 2025.

    Comments: AAAI 2025 Oral

  25. arXiv:2504.06270  [pdf, other]

    cs.IR cs.AI

    Addressing Cold-start Problem in Click-Through Rate Prediction via Supervised Diffusion Modeling

    Authors: Wenqiao Zhu, Lulu Wang, Jun Wu

    Abstract: Predicting Click-Through Rates is a crucial function within recommendation and advertising platforms, as the output of CTR prediction determines the order of items shown to users. The Embedding & MLP paradigm has become a standard approach for industrial recommendation systems and has been widely deployed. However, this paradigm suffers from cold-start problems, where there is either no or only l…

    Submitted 1 March, 2025; originally announced April 2025.

  26. arXiv:2504.04974  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Towards Visual Text Grounding of Multimodal Large Language Model

    Authors: Ming Li, Ruiyi Zhang, Jian Chen, Jiuxiang Gu, Yufan Zhou, Franck Dernoncourt, Wanrong Zhu, Tianyi Zhou, Tong Sun

    Abstract: Despite the existing evolution of Multimodal Large Language Models (MLLMs), a non-negligible limitation remains in their struggle with visual text grounding, especially in text-rich images of documents. Document images, such as scanned forms and infographics, highlight critical challenges due to their complex layouts and textual content. However, current benchmarks do not fully address these chal…

    Submitted 7 April, 2025; originally announced April 2025.

  27. arXiv:2504.01004  [pdf, other]

    cs.CV

    Schrödinger Diffusion Driven Signal Recovery in 3T BOLD fMRI Using Unmatched 7T Observations

    Authors: Yujian Xiong, Xuanzhao Dong, Sebastian Waz, Wenhui Zhu, Negar Mallak, Zhong-lin Lu, Yalin Wang

    Abstract: Ultra-high-field (7 Tesla) BOLD fMRI offers exceptional detail in both spatial and temporal domains, along with robust signal-to-noise characteristics, making it a powerful modality for studying visual information processing in the brain. However, due to the limited accessibility of 7T scanners, the majority of neuroimaging studies are still conducted using 3T systems, which inherently suffer from…

    Submitted 13 May, 2025; v1 submitted 1 April, 2025; originally announced April 2025.

  28. arXiv:2503.23436  [pdf, other]

    cs.IR

    Filtering with Time-frequency Analysis: An Adaptive and Lightweight Model for Sequential Recommender Systems Based on Discrete Wavelet Transform

    Authors: Sheng Lu, Mingxi Ge, Jiuyi Zhang, Wanli Zhu, Guanjin Li, Fangming Gu

    Abstract: Sequential Recommender Systems (SRS) aim to model sequential behaviors of users to capture their interests which usually evolve over time. Transformer-based SRS have achieved distinguished successes recently. However, studies reveal that the self-attention mechanism in Transformer-based models is essentially a low-pass filter and ignores high frequency information potentially including meaningful user inte…

    Submitted 4 May, 2025; v1 submitted 30 March, 2025; originally announced March 2025.

    Comments: 17 pages, accepted by ICIC 2025 oral

  29. arXiv:2503.23118  [pdf, other]

    cs.CY

    Optimizing Library Usage and Browser Experience: Application to the New York Public Library

    Authors: Zhi Liu, Wenchang Zhu, Sarah Rankin, Nikhil Garg

    Abstract: We tackle the challenge brought to urban library systems by the holds system -- which allows users to request books available at other branches to be transferred for local pickup. The holds system increases usage of the entire collection, at the expense of an in-person browser's experience at the source branch. We study the optimization of usage and browser experience, where the library has two…

    Submitted 29 March, 2025; originally announced March 2025.

  30. arXiv:2503.22759  [pdf, other]

    cs.CR cs.AI

    Data Poisoning in Deep Learning: A Survey

    Authors: Pinlong Zhao, Weiyao Zhu, Pengfei Jiao, Di Gao, Ou Wu

    Abstract: Deep learning has become a cornerstone of modern artificial intelligence, enabling transformative applications across a wide range of domains. As the core element of deep learning, the quality and security of training data critically influence model performance and reliability. However, during the training process, deep learning models face the significant threat of data poisoning, where attackers…

    Submitted 27 March, 2025; originally announced March 2025.

  31. arXiv:2503.21082  [pdf, other]

    cs.CV

    Can Video Diffusion Model Reconstruct 4D Geometry?

    Authors: Jinjie Mai, Wenxuan Zhu, Haozhe Liu, Bing Li, Cheng Zheng, Jürgen Schmidhuber, Bernard Ghanem

    Abstract: Reconstructing dynamic 3D scenes (i.e., 4D geometry) from monocular video is an important yet challenging problem. Conventional multiview geometry-based approaches often struggle with dynamic motion, whereas recent learning-based methods either require specialized 4D representation or sophisticated optimization. In this paper, we present Sora3R, a novel framework that taps into the rich spatiotemp…

    Submitted 26 March, 2025; originally announced March 2025.

  32. arXiv:2503.19999  [pdf, ps, other]

    cs.DS

    Online Disjoint Spanning Trees and Polymatroid Bases

    Authors: Karthekeyan Chandrasekaran, Chandra Chekuri, Weihao Zhu

    Abstract: Finding the maximum number of disjoint spanning trees in a given graph is a well-studied problem with several applications and connections. The Tutte-Nash-Williams theorem provides a min-max relation for this problem which also extends to disjoint bases in a matroid and leads to efficient algorithms. Several other packing problems such as element disjoint Steiner trees, disjoint set covers, and di…

    Submitted 25 March, 2025; originally announced March 2025.

  33. arXiv:2503.18407  [pdf, other]

    cs.CV

    VTD-CLIP: Video-to-Text Discretization via Prompting CLIP

    Authors: Wencheng Zhu, Yuexin Wang, Hongxuan Li, Pengfei Zhu, Qinghua Hu

    Abstract: Vision-language models bridge visual and linguistic understanding and have proven to be powerful for video recognition tasks. Existing approaches primarily rely on parameter-efficient fine-tuning of image-text pre-trained models, yet they often suffer from limited interpretability and poor generalization due to inadequate temporal modeling. To address these, we propose a simple yet effective video…

    Submitted 24 March, 2025; v1 submitted 24 March, 2025; originally announced March 2025.

  34. arXiv:2503.17827  [pdf, other]

    cs.CV

    4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding

    Authors: Wenxuan Zhu, Bing Li, Cheng Zheng, Jinjie Mai, Jun Chen, Letian Jiang, Abdullah Hamdi, Sara Rojas Martinez, Chia-Wen Lin, Mohamed Elhoseiny, Bernard Ghanem

    Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive 2D image/video understanding capabilities. However, there are no publicly standardized benchmarks to assess the abilities of MLLMs in understanding 4D objects (3D objects that evolve over time). In this paper, we introduce 4D-Bench, the first benchmark to evaluate the capabilities of MLLMs in 4D object understand…

    Submitted 22 March, 2025; originally announced March 2025.

  35. arXiv:2503.16991  [pdf, other]

    cs.LG

    TRACE: Time SeRies PArameter EffiCient FinE-tuning

    Authors: Yuze Li, Wei Zhu

    Abstract: We propose an efficient fine-tuning method for time series foundation models, termed TRACE: Time Series Parameter Efficient Fine-tuning. While pretrained time series foundation models are gaining popularity, they face the following challenges: (1) Unlike natural language tasks, time series data vary in frequency, channel numbers, historical/prediction lengths. For long-term forecasting tasks in pa…

    Submitted 21 March, 2025; originally announced March 2025.

  36. arXiv:2503.16188  [pdf, other]

    cs.CV

    Think or Not Think: A Study of Explicit Thinking in Rule-Based Visual Reinforcement Fine-Tuning

    Authors: Ming Li, Jike Zhong, Shitian Zhao, Yuxiang Lai, Haoquan Zhang, Wang Bill Zhu, Kaipeng Zhang

    Abstract: This paper investigates the role of the explicit thinking process in rule-based reinforcement fine-tuning (RFT) for MLLMs. We first propose CLS-RL for MLLM image classification, using verifiable rewards for fine-tuning. Experiments show CLS-RL significantly outperforms SFT and yields a cross-dataset generalization effect. We then rethink and question whether explicit thinking in RFT is always necessar…

    Submitted 12 May, 2025; v1 submitted 20 March, 2025; originally announced March 2025.

    Comments: Preprint, work in progress. Add results on adaptive-thinking and response inconsistency

  37. arXiv:2503.14985  [pdf, other]

    cs.CL

    ML-Triton, A Multi-Level Compilation and Language Extension to Triton GPU Programming

    Authors: Dewei Wang, Wei Zhu, Liyang Ling, Ettore Tiotto, Quintin Wang, Whitney Tsang, Julian Opperman, Jacky Deng

    Abstract: In the era of LLMs, dense operations such as GEMM and MHA are critical components. These operations are well-suited for parallel execution using a tile-based approach. While traditional GPU programming often relies on low-level interfaces like CUDA or SYCL, Triton has emerged as a DSL that offers a more user-friendly and portable alternative by programming at a higher level. The current Triton star…

    Submitted 26 March, 2025; v1 submitted 19 March, 2025; originally announced March 2025.

  38. arXiv:2503.12538  [pdf, other]

    cs.RO cs.LG

    EmoBipedNav: Emotion-aware Social Navigation for Bipedal Robots with Deep Reinforcement Learning

    Authors: Wei Zhu, Abirath Raju, Abdulaziz Shamsah, Anqi Wu, Seth Hutchinson, Ye Zhao

    Abstract: This study presents an emotion-aware navigation framework -- EmoBipedNav -- using deep reinforcement learning (DRL) for bipedal robots walking in socially interactive environments. The inherent locomotion constraints of bipedal robots challenge their safe maneuvering capabilities in dynamic environments. When combined with the intricacies of social environments, including pedestrian interactions a…

    Submitted 16 March, 2025; originally announced March 2025.

    Comments: 13 pages

  39. arXiv:2503.11240  [pdf, other]

    cs.CV cs.LG

    Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards

    Authors: Zijing Hu, Fengda Zhang, Long Chen, Kun Kuang, Jiahui Li, Kaifeng Gao, Jun Xiao, Xin Wang, Wenwu Zhu

    Abstract: Diffusion models have achieved remarkable success in text-to-image generation. However, their practical applications are hindered by the misalignment between generated images and corresponding text prompts. To tackle this issue, reinforcement learning (RL) has been considered for diffusion model fine-tuning. Yet, RL's effectiveness is limited by the challenge of sparse reward, where feedback is on…

    Submitted 26 March, 2025; v1 submitted 14 March, 2025; originally announced March 2025.

    Comments: Accepted to CVPR 2025, add references

  40. arXiv:2503.08906  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Prompt-OT: An Optimal Transport Regularization Paradigm for Knowledge Preservation in Vision-Language Model Adaptation

    Authors: Xiwen Chen, Wenhui Zhu, Peijie Qiu, Hao Wang, Huayu Li, Haiyu Wu, Aristeidis Sotiras, Yalin Wang, Abolfazl Razi

    Abstract: Vision-language models (VLMs) such as CLIP demonstrate strong performance but struggle when adapted to downstream tasks. Prompt learning has emerged as an efficient and effective strategy to adapt VLMs while preserving their pre-trained knowledge. However, existing methods still lead to overfitting and degrade zero-shot generalization. To address this challenge, we propose an optimal transport (OT…

    Submitted 11 March, 2025; originally announced March 2025.

  41. arXiv:2503.08099  [pdf, other]

    cs.LG

    Whoever Started the Interference Should End It: Guiding Data-Free Model Merging via Task Vectors

    Authors: Runxi Cheng, Feng Xiong, Yongxian Wei, Wanyun Zhu, Chun Yuan

    Abstract: Model merging seeks to integrate task-specific expert models into a unified architecture while preserving multi-task generalization capabilities, yet parameter interference between constituent models frequently induces performance degradation. Although prior work has explored many merging strategies, resolving interference without additional data for retraining or test-time computation remains cha…

    Submitted 11 March, 2025; originally announced March 2025.

    Comments: 23 pages, 15 figures, 9 tables

  42. arXiv:2503.07114  [pdf, other]

    cs.LG stat.ML

    Sequential Function-Space Variational Inference via Gaussian Mixture Approximation

    Authors: Menghao Waiyan William Zhu, Pengcheng Hao, Ercan Engin Kuruoğlu

    Abstract: Continual learning is learning from a sequence of tasks with the aim of learning new tasks without forgetting old tasks. Sequential function-space variational inference (SFSVI) is a continual learning method based on variational inference which uses a Gaussian variational distribution to approximate the distribution of the outputs of a finite number of selected inducing points. Since the posterior…

    Submitted 10 March, 2025; originally announced March 2025.

  43. arXiv:2503.04446  [pdf, other]

    cs.SI cs.MM

    SMTPD: A New Benchmark for Temporal Prediction of Social Media Popularity

    Authors: Yijie Xu, Bolun Zheng, Wei Zhu, Hangjia Pan, Yuchen Yao, Ning Xu, Anan Liu, Quan Zhang, Chenggang Yan

    Abstract: Social media popularity prediction task aims to predict the popularity of posts on social media platforms, which has a positive driving effect on application scenarios such as content optimization, digital marketing and online advertising. Though many studies have made significant progress, few of them pay much attention to the integration between popularity prediction with temporal alignment. In…

    Submitted 6 March, 2025; originally announced March 2025.

    Comments: accepted by CVPR 2025

  44. arXiv:2503.04346  [pdf, other]

    cs.CL

    Adding Alignment Control to Language Models

    Authors: Wenhong Zhu, Weinan Zhang, Rui Wang

    Abstract: Post-training alignment has increasingly become a crucial factor in enhancing the usability of language models (LMs). However, the strength of alignment varies depending on individual preferences. This paper proposes a method to incorporate alignment control into a single model, referred to as CLM. This approach adds one identity layer preceding the initial layers and performs preference learning…

    Submitted 7 March, 2025; v1 submitted 6 March, 2025; originally announced March 2025.

  45. arXiv:2503.03987  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    RetinalGPT: A Retinal Clinical Preference Conversational Assistant Powered by Large Vision-Language Models

    Authors: Wenhui Zhu, Xin Li, Xiwen Chen, Peijie Qiu, Vamsi Krishna Vasa, Xuanzhao Dong, Yanxi Chen, Natasha Lepore, Oana Dumitrascu, Yi Su, Yalin Wang

    Abstract: Recently, Multimodal Large Language Models (MLLMs) have gained significant attention for their remarkable ability to process and analyze non-textual data, such as images, videos, and audio. Notably, several adaptations of general-domain MLLMs to the medical field have been explored, including LLaVA-Med. However, these medical adaptations remain insufficiently advanced in understanding and interpre…

    Submitted 5 March, 2025; originally announced March 2025.

  46. arXiv:2503.01292  [pdf, other]

    cs.CV

    PA-CLIP: Enhancing Zero-Shot Anomaly Detection through Pseudo-Anomaly Awareness

    Authors: Yurui Pan, Lidong Wang, Yuchao Chen, Wenbing Zhu, Bo Peng, Mingmin Chi

    Abstract: In industrial anomaly detection (IAD), accurately identifying defects amidst diverse anomalies and under varying imaging conditions remains a significant challenge. Traditional approaches often struggle with high false-positive rates, frequently misclassifying normal shadows and surface deformations as defects, an issue that becomes particularly pronounced in products with complex and intricate su…

    Submitted 3 March, 2025; originally announced March 2025.

    Comments: 9 pages

  47. arXiv:2503.01253  [pdf, other]

    cs.DC

    NM-SpMM: Accelerating Matrix Multiplication Using N:M Sparsity with GPGPU

    Authors: Cong Ma, Du Wu, Zhelang Deng, Jiang Chen, Xiaowen Huang, Jintao Meng, Wenxi Zhu, Bingqiang Wang, Amelie Chi Zhou, Peng Chen, Minwen Deng, Yanjie Wei, Shengzhong Feng, Yi Pan

    Abstract: Deep learning demonstrates effectiveness across a wide range of tasks. However, the dense and over-parameterized nature of these models results in significant resource consumption during deployment. In response to this issue, weight pruning, particularly through N:M sparsity matrix multiplication, offers an efficient solution by transforming dense operations into semi-sparse ones. N:M sparsity pro…

    Submitted 4 March, 2025; v1 submitted 3 March, 2025; originally announced March 2025.

    Comments: 12 pages, 10 figures, accepted at IPDPS 2025. Code: https://github.com/M-H482/NM-SpMM

    ACM Class: C.1.4; D.1.3; G.1.0

  48. arXiv:2503.00495  [pdf, other]

    cs.CV cs.AI

    Towards High-fidelity 3D Talking Avatar with Personalized Dynamic Texture

    Authors: Xuanchen Li, Jianyu Wang, Yuhao Cheng, Yikun Zeng, Xingyu Ren, Wenhan Zhu, Weiming Zhao, Yichao Yan

    Abstract: Significant progress has been made for speech-driven 3D face animation, but most works focus on learning the motion of mesh/geometry, ignoring the impact of dynamic texture. In this work, we reveal that dynamic texture plays a key role in rendering high-fidelity talking avatars, and introduce a high-resolution 4D dataset TexTalk4D, consisting of 100 minutes of audio-synced scan-level mesh…

    Submitted 1 March, 2025; originally announced March 2025.

  49. arXiv:2502.20545  [pdf, other]

    cs.LG

    SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers

    Authors: Kechen Li, Wenqi Zhu, Coralia Cartis, Tianbo Ji, Shiwei Liu

    Abstract: Large Language Models (LLMs) have achieved human-level proficiency across diverse tasks, but their ability to perform rigorous mathematical problem solving remains an open challenge. In this work, we investigate a fundamental yet computationally intractable problem: determining whether a given multivariate polynomial is nonnegative. This problem, closely related to Hilbert's Seventeenth Problem, p…

    Submitted 27 February, 2025; originally announced February 2025.

  50. arXiv:2502.18763  [pdf, other]

    cs.IT

    CommGPT: A Graph and Retrieval-Augmented Multimodal Communication Foundation Model

    Authors: Feibo Jiang, Wanyun Zhu, Li Dong, Kezhi Wang, Kun Yang, Cunhua Pan, Octavia A. Dobre

    Abstract: Large Language Models (LLMs) possess human-level cognitive and decision-making capabilities, making them a key technology for 6G. However, applying LLMs to the communication domain faces three major challenges: 1) Inadequate communication data; 2) Restricted input modalities; and 3) Difficulty in knowledge retrieval. To overcome these issues, we propose CommGPT, a multimodal foundation model desig…

    Submitted 25 February, 2025; originally announced February 2025.