Showing 1–50 of 394 results for author: Yulin

Searching in archive cs.
  1. arXiv:2506.16504  [pdf, ps, other]

    cs.CV cs.AI

    Hunyuan3D 2.5: Towards High-Fidelity 3D Assets Generation with Ultimate Details

    Authors: Zeqiang Lai, Yunfei Zhao, Haolin Liu, Zibo Zhao, Qingxiang Lin, Huiwen Shi, Xianghui Yang, Mingxin Yang, Shuhui Yang, Yifei Feng, Sheng Zhang, Xin Huang, Di Luo, Fan Yang, Fang Yang, Lifu Wang, Sicong Liu, Yixuan Tang, Yulin Cai, Zebin He, Tian Liu, Yuhong Liu, Jie Jiang, Linus, Jingwei Huang, et al. (1 additional author not shown)

    Abstract: In this report, we present Hunyuan3D 2.5, a robust suite of 3D diffusion models aimed at generating high-fidelity and detailed textured 3D assets. Hunyuan3D 2.5 follows the two-stage pipeline of its previous version, Hunyuan3D 2.0, while demonstrating substantial advancements in both shape and texture generation. In terms of shape generation, we introduce a new shape foundation model -- LATTICE, which…

    Submitted 19 June, 2025; originally announced June 2025.

    Comments: Technical report

  2. arXiv:2506.15647  [pdf, ps, other]

    cs.AI

    Exploring and Exploiting the Inherent Efficiency within Large Reasoning Models for Self-Guided Efficiency Enhancement

    Authors: Weixiang Zhao, Jiahe Guo, Yang Deng, Xingyu Sui, Yulin Hu, Yanyan Zhao, Wanxiang Che, Bing Qin, Tat-Seng Chua, Ting Liu

    Abstract: Recent advancements in large reasoning models (LRMs) have significantly enhanced language models' capabilities in complex problem-solving by emulating human-like deliberative thinking. However, these models often exhibit overthinking (i.e., the generation of unnecessarily verbose and redundant content), which hinders efficiency and inflates inference cost. In this work, we explore the representati…

    Submitted 18 June, 2025; originally announced June 2025.

  3. arXiv:2506.15442  [pdf, ps, other]

    cs.CV cs.AI

    Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material

    Authors: Team Hunyuan3D, Shuhui Yang, Mingxin Yang, Yifei Feng, Xin Huang, Sheng Zhang, Zebin He, Di Luo, Haolin Liu, Yunfei Zhao, Qingxiang Lin, Zeqiang Lai, Xianghui Yang, Huiwen Shi, Zibo Zhao, Bowen Zhang, Hongyu Yan, Lifu Wang, Sicong Liu, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Yulin Cai, Jiaao Yu, et al. (28 additional authors not shown)

    Abstract: 3D AI-generated content (AIGC) is a passionate field that has significantly accelerated the creation of 3D models in gaming, film, and design. Despite the development of several groundbreaking models that have revolutionized 3D generation, the field remains largely accessible only to researchers, developers, and designers due to the complexities involved in collecting, processing, and training 3D…

    Submitted 18 June, 2025; originally announced June 2025.

    Comments: Github link: https://github.com/Tencent-Hunyuan/Hunyuan3D-2.1

  4. arXiv:2506.13585  [pdf, ps, other]

    cs.CL cs.LG

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    Authors: MiniMax, Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, Chengjun Xiao, Chengyu Du, Chi Zhang, Chu Qiao, Chunhao Zhang, Chunhui Du, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dong Li, Enwei Jiao, Haigang Zhou, et al. (103 additional authors not shown)

    Abstract: We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: A technical report from MiniMax. The authors are listed in alphabetical order. We open-source our MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1

  5. arXiv:2506.13068  [pdf, ps, other]

    cs.MA

    Towards the Autonomous Optimization of Urban Logistics: Training Generative AI with Scientific Tools via Agentic Digital Twins and Model Context Protocol

    Authors: Haowen Xu, Yulin Sun, Jose Tupayachi, Olufemi Omitaomu, Sisi Zlatanova, Xueping Li

    Abstract: Optimizing urban freight logistics is critical for developing sustainable, low-carbon cities. Traditional methods often rely on manual coordination of simulation tools, optimization solvers, and expert-driven workflows, limiting their efficiency and scalability. This paper presents an agentic system architecture that leverages the model context protocol (MCP) to orchestrate multi-agent collaborati…

    Submitted 17 June, 2025; v1 submitted 15 June, 2025; originally announced June 2025.

  6. arXiv:2506.10949  [pdf, ps, other]

    cs.CR cs.AI

    Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors

    Authors: Chen Yueh-Han, Nitish Joshi, Yulin Chen, Maksym Andriushchenko, Rico Angell, He He

    Abstract: Current LLM safety defenses fail under decomposition attacks, where a malicious goal is decomposed into benign subtasks that circumvent refusals. The challenge lies in the existing shallow safety alignment techniques: they only detect harm in the immediate prompt and do not reason about long-range intent, leaving them blind to malicious intent that emerges over a sequence of seemingly benign instr…

    Submitted 14 June, 2025; v1 submitted 12 June, 2025; originally announced June 2025.

  7. arXiv:2506.02827  [pdf, other]

    cs.CL

    TO-GATE: Clarifying Questions and Summarizing Responses with Trajectory Optimization for Eliciting Human Preference

    Authors: Yulin Dou, Jiangming Liu

    Abstract: Large language models (LLMs) can effectively elicit human preferences through multi-turn dialogue. Complex tasks can be accomplished through iterative clarifying questions and final responses generated by an LLM acting as a questioner (STaR-GATE; Andukuri et al., 2024). However, existing approaches based on self-taught reasoning struggle to identify optimal dialogue trajectories and avoid irrelev…

    Submitted 3 June, 2025; originally announced June 2025.

  8. arXiv:2505.23229  [pdf, ps, other]

    cs.CL cs.AI cs.CY

    MCTSr-Zero: Self-Reflective Psychological Counseling Dialogues Generation via Principles and Adaptive Exploration

    Authors: Hao Lu, Yanchi Gu, Haoyuan Huang, Yulin Zhou, Ningxin Zhu, Chen Li

    Abstract: The integration of Monte Carlo Tree Search (MCTS) with Large Language Models (LLMs) has demonstrated significant success in structured, problem-oriented tasks. However, applying these methods to open-ended dialogues, such as those in psychological counseling, presents unique challenges. Unlike tasks with objective correctness, success in therapeutic conversations depends on subjective factors like…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: 50 pages, 3 figures

  9. arXiv:2505.21951  [pdf, ps, other]

    cs.IT eess.SP

    When Feedback Empowers the Uplink: Integrating Adaptive Coding with Wireless Power Transfer

    Authors: Zijian Yang, Yulin Shao, Shaodan Ma

    Abstract: Energy consumption and device lifetime are critical concerns for battery-constrained IoT devices. This paper introduces the Feedback-Aided Coding and Energy Transfer (FACET) framework, which synergistically combines adaptive feedback channel coding with wireless power transfer. FACET leverages the saturation effect of feedback coding, where increasing downlink power yields diminishing returns, to…

    Submitted 28 May, 2025; originally announced May 2025.

  10. arXiv:2505.21375  [pdf, ps, other]

    cs.CV

    GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution

    Authors: Fengxiang Wang, Mingshuo Chen, Yueying Li, Di Wang, Haotian Wang, Zonghao Guo, Zefan Wang, Boqi Shan, Long Lan, Yulin Wang, Hongzhen Wang, Wenjing Yang, Bo Du, Jing Zhang

    Abstract: Ultra-high-resolution (UHR) remote sensing (RS) imagery offers valuable data for Earth observation but poses challenges for existing multimodal foundation models due to two key bottlenecks: (1) limited availability of UHR training data, and (2) token explosion caused by the large image size. To address data scarcity, we introduce SuperRS-VQA (avg. 8,376$\times$8,376) and HighRS-VQA (avg. 2,000…

    Submitted 27 May, 2025; originally announced May 2025.

  11. arXiv:2505.19756  [pdf, other]

    cs.CL

    Efficient Reasoning via Chain of Unconscious Thought

    Authors: Ruihan Gong, Yue Liu, Wenjie Qu, Mingzhe Du, Yufei He, Yingwei Ma, Yulin Chen, Xiang Liu, Yi Wen, Xinfeng Li, Ruidong Wang, Xinzhong Zhu, Bryan Hooi, Jiaheng Zhang

    Abstract: Large Reasoning Models (LRMs) achieve promising performance but compromise token efficiency due to verbose reasoning processes. Unconscious Thought Theory (UTT) posits that complex problems can be solved more efficiently through internalized cognitive processes. Inspired by UTT, we propose a new reasoning paradigm, termed Chain of Unconscious Thought (CoUT), to improve the token efficiency of LRMs…

    Submitted 26 May, 2025; originally announced May 2025.

  12. arXiv:2505.19709  [pdf, ps, other]

    cs.IT eess.SP

    Capacity-Optimized Pre-Equalizer Design for Visible Light Communication Systems

    Authors: Runxin Zhang, Yulin Shao, Jian Xiong, Lu Lu, Murat Uysal

    Abstract: Since commercial LEDs are primarily designed for illumination rather than data transmission, their modulation bandwidth is inherently limited to a few MHz. This becomes a major bottleneck in the implementation of visible light communication (VLC) systems, necessitating the design of pre-equalizers. While state-of-the-art equalizer designs primarily focus on the data rate increasing through bandwidth…

    Submitted 26 May, 2025; originally announced May 2025.

  13. arXiv:2505.18714  [pdf, ps, other]

    cs.RO

    YOPO-Rally: A Sim-to-Real Single-Stage Planner for Off-Road Terrain

    Authors: Hongyu Cao, Junjie Lu, Xuewei Zhang, Yulin Hui, Zhiyu Li, Bailing Tian

    Abstract: Off-road navigation remains challenging for autonomous robots due to the harsh terrain and clustered obstacles. In this letter, we extend the YOPO (You Only Plan Once) end-to-end navigation framework to off-road environments, explicitly focusing on forest terrains, consisting of a high-performance, multi-sensor supported off-road simulator YOPO-Sim, a zero-shot transfer sim-to-real planner YOPO-Ra…

    Submitted 24 May, 2025; originally announced May 2025.

    Comments: 8 pages, 8 figures

  14. arXiv:2505.16869  [pdf, other]

    cs.CL

    MPO: Multilingual Safety Alignment via Reward Gap Optimization

    Authors: Weixiang Zhao, Yulin Hu, Yang Deng, Tongtong Wu, Wenxuan Zhang, Jiahe Guo, An Zhang, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu

    Abstract: Large language models (LLMs) have become increasingly central to AI applications worldwide, necessitating robust multilingual safety alignment to ensure secure deployment across diverse linguistic contexts. Existing preference learning methods for safety alignment, such as RLHF and DPO, are primarily monolingual and struggle with noisy multilingual data. To address these limitations, we introduce…

    Submitted 22 May, 2025; originally announced May 2025.

    Comments: To Appear at ACL 2025 (Main)

  15. arXiv:2505.15456  [pdf, other]

    cs.CL

    Teaching Language Models to Evolve with Users: Dynamic Profile Modeling for Personalized Alignment

    Authors: Weixiang Zhao, Xingyu Sui, Yulin Hu, Jiahe Guo, Haixiao Liu, Biye Li, Yanyan Zhao, Bing Qin, Ting Liu

    Abstract: Personalized alignment is essential for enabling large language models (LLMs) to engage effectively in user-centric dialogue. While recent prompt-based and offline optimization methods offer preliminary solutions, they fall short in cold-start scenarios and long-term personalization due to their inherently static and shallow designs. In this work, we introduce the Reinforcement Learning for Person…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: 30 pages, 18 figures, 10 tables

  16. arXiv:2505.15257  [pdf, other]

    cs.CL

    When Less Language is More: Language-Reasoning Disentanglement Makes LLMs Better Multilingual Reasoners

    Authors: Weixiang Zhao, Jiahe Guo, Yang Deng, Tongtong Wu, Wenxuan Zhang, Yulin Hu, Xingyu Sui, Yanyan Zhao, Wanxiang Che, Bing Qin, Tat-Seng Chua, Ting Liu

    Abstract: Multilingual reasoning remains a significant challenge for large language models (LLMs), with performance disproportionately favoring high-resource languages. Drawing inspiration from cognitive neuroscience, which suggests that human reasoning functions largely independently of language processing, we hypothesize that LLMs similarly encode reasoning and language as separable components that can be…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: 26 pages, 13 figures

  17. arXiv:2505.14671  [pdf, other]

    cs.CV

    UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens

    Authors: Ruichuan An, Sihan Yang, Renrui Zhang, Zijun Shen, Ming Lu, Gaole Dai, Hao Liang, Ziyu Guo, Shilin Yan, Yulin Luo, Bocheng Zou, Chaoqun Yang, Wentao Zhang

    Abstract: Personalized models have demonstrated remarkable success in understanding and generating concepts provided by users. However, existing methods use separate concept tokens for understanding and generation, treating these tasks in isolation. This may result in limitations for generating images with complex prompts. For example, given the concept $\langle bo\rangle$, generating "$\langle bo\rangle$ w…

    Submitted 22 May, 2025; v1 submitted 20 May, 2025; originally announced May 2025.

  18. arXiv:2505.14535  [pdf, ps, other]

    cs.LG cs.HC

    Spiking Neural Networks with Temporal Attention-Guided Adaptive Fusion for imbalanced Multi-modal Learning

    Authors: Jiangrong Shen, Yulin Xie, Qi Xu, Gang Pan, Huajin Tang, Badong Chen

    Abstract: Multimodal spiking neural networks (SNNs) hold significant potential for energy-efficient sensory processing but face critical challenges in modality imbalance and temporal misalignment. Current approaches suffer from uncoordinated convergence speeds across modalities and static fusion mechanisms that ignore time-varying cross-modal interactions. We propose the temporal attention-guided adaptive f…

    Submitted 20 May, 2025; originally announced May 2025.

  19. arXiv:2505.14135  [pdf, other]

    cs.CV

    Hunyuan-Game: Industrial-grade Intelligent Game Creation Model

    Authors: Ruihuang Li, Caijin Zhou, Shoujian Zheng, Jianxiang Lu, Jiabin Huang, Comi Chen, Junshu Tang, Guangzheng Xu, Jiale Tao, Hongmei Wang, Donghao Li, Wenqing Yu, Senbo Wang, Zhimin Li, Yetshuan Shi, Haoyu Yang, Yukun Wang, Wenxun Dai, Jiaqi Li, Linqing Wang, Qixun Wang, Zhiyong Xu, Yingfang Zhang, Jiangfeng Xiong, Weijie Kong, et al. (33 additional authors not shown)

    Abstract: Intelligent game creation represents a transformative advancement in game development, utilizing generative artificial intelligence to dynamically generate and enhance game content. Despite notable progress in generative models, the comprehensive synthesis of high-quality game assets, including both images and videos, remains a challenging frontier. To create high-fidelity game content that simult…

    Submitted 28 May, 2025; v1 submitted 20 May, 2025; originally announced May 2025.

  20. arXiv:2505.12410  [pdf, ps, other]

    cs.RO

    MTIL: Encoding Full History with Mamba for Temporal Imitation Learning

    Authors: Yulin Zhou, Yuankai Lin, Fanzhe Peng, Jiahui Chen, Zhuang Zhou, Kaiji Huang, Hua Yang, Zhouping Yin

    Abstract: Standard imitation learning (IL) methods have achieved considerable success in robotics, yet often rely on the Markov assumption, limiting their applicability to tasks where historical context is crucial for disambiguating current observations. This limitation hinders performance in long-horizon sequential manipulation tasks where the correct action depends on past events not fully captured by the…

    Submitted 18 May, 2025; originally announced May 2025.

    Comments: 16 pages, 6 figures, submitted to IEEE RAL

  21. arXiv:2505.11049  [pdf, other]

    cs.AI cs.CR

    GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning

    Authors: Yue Liu, Shengfang Zhai, Mingzhe Du, Yulin Chen, Tri Cao, Hongcheng Gao, Cheng Wang, Xinfeng Li, Kun Wang, Junfeng Fang, Jiaheng Zhang, Bryan Hooi

    Abstract: To enhance the safety of VLMs, this paper introduces a novel reasoning-based VLM guard model dubbed GuardReasoner-VL. The core idea is to incentivize the guard model to deliberatively reason before making moderation decisions via online RL. First, we construct GuardReasoner-VLTrain, a reasoning corpus with 123K samples and 631K reasoning steps, spanning text, image, and text-image inputs. Then, ba…

    Submitted 16 May, 2025; originally announced May 2025.

  22. arXiv:2505.06923  [pdf, ps, other]

    cs.RO

    YOPOv2-Tracker: An End-to-End Agile Tracking and Navigation Framework from Perception to Action

    Authors: Junjie Lu, Yulin Hui, Xuewei Zhang, Wencan Feng, Hongming Shen, Zhiyu Li, Bailing Tian

    Abstract: Traditional target tracking pipelines including detection, mapping, navigation, and control are comprehensive but introduce high latency, limiting the agility of quadrotors. On the contrary, we follow the design principle of "less is more", striving to simplify the process while maintaining effectiveness. In this work, we propose an end-to-end agile tracking and navigation framework for quadrotor…

    Submitted 11 May, 2025; originally announced May 2025.

  23. arXiv:2505.04410  [pdf, other]

    cs.CV

    DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception

    Authors: Junjie Wang, Bin Chen, Yulin Li, Bin Kang, Yichi Chen, Zhuotao Tian

    Abstract: Dense visual prediction tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have shown promise in open-vocabulary tasks, their direct application to dense prediction often leads to suboptimal performance due to limitations in local feature repr…

    Submitted 7 May, 2025; originally announced May 2025.

  24. arXiv:2504.20472  [pdf, other]

    cs.CR

    Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction

    Authors: Yulin Chen, Haoran Li, Yuan Sui, Yue Liu, Yufei He, Yangqiu Song, Bryan Hooi

    Abstract: Large language models (LLMs) have demonstrated impressive performance and have come to dominate the field of natural language processing (NLP) across various tasks. However, due to their strong instruction-following capabilities and inability to distinguish between instructions and data content, LLMs are vulnerable to prompt injection attacks. These attacks manipulate LLMs into deviating from the…

    Submitted 29 April, 2025; originally announced April 2025.

  25. arXiv:2504.13820  [pdf, other]

    cs.CV

    CheXWorld: Exploring Image World Modeling for Radiograph Representation Learning

    Authors: Yang Yue, Yulin Wang, Chenxin Tao, Pan Liu, Shiji Song, Gao Huang

    Abstract: Humans can develop internal world models that encode common sense knowledge, telling them how the world works and predicting the consequences of their actions. This concept has emerged as a promising direction for establishing general-purpose machine-learning models in recent preliminary works, e.g., for visual representation learning. In this paper, we present CheXWorld, the first effort towards…

    Submitted 18 April, 2025; originally announced April 2025.

    Comments: Accepted by CVPR 2025

  26. arXiv:2504.13065  [pdf, other]

    cs.CV

    EchoWorld: Learning Motion-Aware World Models for Echocardiography Probe Guidance

    Authors: Yang Yue, Yulin Wang, Haojun Jiang, Pan Liu, Shiji Song, Gao Huang

    Abstract: Echocardiography is crucial for cardiovascular disease detection but relies heavily on experienced sonographers. Echocardiography probe guidance systems, which provide real-time movement instructions for acquiring standard plane images, offer a promising solution for AI-assisted or fully autonomous scanning. However, developing effective machine learning models for this task remains challenging, a…

    Submitted 17 April, 2025; originally announced April 2025.

    Comments: Accepted by CVPR 2025

  27. arXiv:2504.09466  [pdf, other]

    cs.CR cs.CL

    AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender

    Authors: Weixiang Zhao, Jiahe Guo, Yulin Hu, Yang Deng, An Zhang, Xingyu Sui, Xinyang Han, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu

    Abstract: Despite extensive efforts in safety alignment, large language models (LLMs) remain vulnerable to jailbreak attacks. Activation steering offers a training-free defense method but relies on fixed steering coefficients, resulting in suboptimal protection and increased false rejections of benign inputs. To address this, we propose AdaSteer, an adaptive activation steering method that dynamically adjus…

    Submitted 13 April, 2025; originally announced April 2025.

    Comments: 17 pages, 6 figures, 9 tables

  28. Variability-Driven User-Story Generation using LLM and Triadic Concept Analysis

    Authors: Alexandre Bazin, Alain Gutierrez, Marianne Huchard, Pierre Martin, Yulin, Zhang

    Abstract: A widely used Agile practice for requirements is to produce a set of user stories (also called "agile product backlog"), which roughly includes a list of pairs (role, feature), where the role handles the feature for a certain purpose. In the context of Software Product Lines, the requirements for a family of similar systems are thus a family of user-story sets, one per system, leading to a 3-dime…

    Submitted 11 April, 2025; originally announced April 2025.

    Comments: 20th International Conference on Evaluation of Novel Approaches to Software Engineering April 4-6, 2025, in Porto, Portugal

    Journal ref: Proceedings of ENASE 2025; SciTePress, pages 618-625 (2025)

  29. arXiv:2504.05419  [pdf, other]

    cs.AI cs.CL

    Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification

    Authors: Anqi Zhang, Yulin Chen, Jane Pan, Chen Zhao, Aurojit Panda, Jinyang Li, He He

    Abstract: Reasoning models have achieved remarkable performance on tasks like math and logical reasoning thanks to their ability to search during reasoning. However, they still suffer from overthinking, often performing unnecessary reasoning steps even after reaching the correct answer. This raises the question: can models evaluate the correctness of their intermediate answers during reasoning? In this work…

    Submitted 7 April, 2025; originally announced April 2025.

  30. arXiv:2504.03108  [pdf, other]

    cs.CV cs.AI

    Multi-Granularity Vision Fastformer with Fusion Mechanism for Skin Lesion Segmentation

    Authors: Xuanyu Liu, Huiyun Yao, Jinggui Gao, Zhongyi Guo, Xue Zhang, Yulin Dong

    Abstract: Background: Convolutional Neural Networks (CNN) and Vision Transformers (ViT) are the main techniques used in medical image segmentation. However, CNN is limited to local contextual information, and ViT's quadratic complexity results in significant computational costs. At the same time, equipping the model to distinguish lesion boundaries with varying degrees of severity is also a challenge encounter…

    Submitted 3 April, 2025; originally announced April 2025.

  31. arXiv:2504.02901  [pdf, other]

    cs.LG cs.AI

    Hide and Seek in Noise Labels: Noise-Robust Collaborative Active Learning with LLM-Powered Assistance

    Authors: Bo Yuan, Yulin Chen, Yin Zhang, Wei Jiang

    Abstract: Learning from noisy labels (LNL) is a challenge that arises in many real-world scenarios where collected training data can contain incorrect or corrupted labels. Most existing solutions identify noisy labels and adopt active learning to query human experts on them for denoising. In the era of large language models (LLMs), although we can reduce the human effort to improve these methods, their perf…

    Submitted 3 April, 2025; originally announced April 2025.

  32. arXiv:2503.23990  [pdf, other]

    cs.CL

    BeMERC: Behavior-Aware MLLM-based Framework for Multimodal Emotion Recognition in Conversation

    Authors: Yumeng Fu, Junjie Wu, Zhongjie Wang, Meishan Zhang, Yulin Wu, Bingquan Liu

    Abstract: Multimodal emotion recognition in conversation (MERC), the task of identifying the emotion label for each utterance in a conversation, is vital for developing empathetic machines. Current MLLM-based MERC studies focus mainly on capturing the speaker's textual or vocal characteristics, but ignore the significance of video-derived behavior information. Different from text and audio inputs, learning…

    Submitted 31 March, 2025; originally announced March 2025.

  33. arXiv:2503.23771  [pdf, other]

    cs.CV

    XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?

    Authors: Fengxiang Wang, Hongzhen Wang, Mingshuo Chen, Di Wang, Yulin Wang, Zonghao Guo, Qiang Ma, Long Lan, Wenjing Yang, Jing Zhang, Zhiyuan Liu, Maosong Sun

    Abstract: The astonishing breakthrough of multimodal large language models (MLLMs) has necessitated new benchmarks to quantitatively assess their capabilities, reveal their limitations, and indicate future research directions. However, this is challenging in the context of remote sensing (RS), since the imagery features ultra-high resolution that incorporates extremely complex semantic relationships. Existi…

    Submitted 31 March, 2025; originally announced March 2025.

    Comments: Accepted by CVPR 2025

  34. arXiv:2503.22998  [pdf, other]

    cs.LG cs.AI cs.CR

    AuditVotes: A Framework Towards More Deployable Certified Robustness for Graph Neural Networks

    Authors: Yuni Lai, Yulin Zhu, Yixuan Sun, Yulun Wu, Bin Xiao, Gaolei Li, Jianhua Li, Kai Zhou

    Abstract: Despite advancements in Graph Neural Networks (GNNs), adaptive attacks continue to challenge their robustness. Certified robustness based on randomized smoothing has emerged as a promising solution, offering provable guarantees that a model's predictions remain stable under adversarial perturbations within a specified range. However, existing methods face a critical trade-off between accuracy and…

    Submitted 29 March, 2025; originally announced March 2025.

    Comments: 20 pages

  35. arXiv:2503.21833  [pdf, other]

    cs.CL

    Refining Time Series Anomaly Detectors using Large Language Models

    Authors: Alan Yang, Yulin Chen, Sean Lee, Venus Montes

    Abstract: Time series anomaly detection (TSAD) is of widespread interest across many industries, including finance, healthcare, and manufacturing. Despite the development of numerous automatic methods for detecting anomalies, human oversight remains necessary to review and act upon detected anomalies, as well as verify their accuracy. We study the use of multimodal large language models (LLMs) to partially…

    Submitted 26 March, 2025; originally announced March 2025.

    Comments: Main content: 4 pages, 1 figure, 1 table

  36. arXiv:2503.20314  [pdf, other]

    cs.CV

    Wan: Open and Advanced Large-Scale Video Generative Models

    Authors: Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, et al. (37 additional authors not shown)

    Abstract: This report presents Wan, a comprehensive and open suite of video foundation models designed to push the boundaries of video generation. Built upon the mainstream diffusion transformer paradigm, Wan achieves significant advancements in generative capabilities through a series of innovations, including our novel VAE, scalable pre-training strategies, large-scale data curation, and automated evaluat…

    Submitted 18 April, 2025; v1 submitted 26 March, 2025; originally announced March 2025.

    Comments: 60 pages, 33 figures

  37. arXiv:2503.20208  [pdf, other]

    cs.RO cs.AI cs.LG

    Learning Adaptive Dexterous Grasping from Single Demonstrations

    Authors: Liangzhi Shi, Yulin Liu, Lingqi Zeng, Bo Ai, Zhengdong Hong, Hao Su

    Abstract: How can robots learn dexterous grasping skills efficiently and apply them adaptively based on user instructions? This work tackles two key challenges: efficient skill acquisition from limited human demonstrations and context-driven skill selection. We introduce AdaDexGrasp, a framework that learns a library of grasping skills from a single human demonstration per skill and selects the most suitabl…

    Submitted 26 March, 2025; originally announced March 2025.

  38. arXiv:2503.19591  [pdf, other]

    cs.SD cs.CR cs.LG eess.AS

    Boosting the Transferability of Audio Adversarial Examples with Acoustic Representation Optimization

    Authors: Weifei Jin, Junjie Su, Hejia Wang, Yulin Ye, Jie Hao

    Abstract: With the widespread application of automatic speech recognition (ASR) systems, their vulnerability to adversarial attacks has been extensively studied. However, most existing adversarial examples are generated on specific individual models, resulting in a lack of transferability. In real-world scenarios, attackers often cannot access detailed information about the target model, making query-based…

    Submitted 25 March, 2025; originally announced March 2025.

    Comments: Accepted to ICME 2025

  39. arXiv:2503.18854   

    cs.CV cs.AI

    MC-LLaVA: Multi-Concept Personalized Vision-Language Model

    Authors: Ruichuan An, Sihan Yang, Ming Lu, Renrui Zhang, Kai Zeng, Yulin Luo, Jiajun Cao, Hao Liang, Ying Chen, Qi She, Shanghang Zhang, Wentao Zhang

    Abstract: Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering. To enhance user experience, recent studies investigate VLM personalization to understand user-provided concepts. However, they mainly focus on single-concept personalization, neglecting the existence and interplay of multiple concepts, which limits real-world applicability. Thi…

    Submitted 25 March, 2025; v1 submitted 24 March, 2025; originally announced March 2025.

    Comments: I sincerely apologize for any inconvenience caused. We actually uploaded this paper to arXiv in November 2024, as arXiv:2411.11706. During this update, we did not consider the replacement operation of arXiv, which led to duplicate submissions. We have made modifications at the original address arXiv:2411.11706

  40. arXiv:2503.17979  [pdf, other]

    cs.AI cs.CL

    Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilities

    Authors: Weixiang Zhao, Xingyu Sui, Jiahe Guo, Yulin Hu, Yang Deng, Yanyan Zhao, Bing Qin, Wanxiang Che, Tat-Seng Chua, Ting Liu

    Abstract: Recent advancements in Large Reasoning Models (LRMs), such as OpenAI's o1/o3 and DeepSeek-R1, have demonstrated remarkable performance in specialized reasoning tasks through human-like deliberative thinking and long chain-of-thought reasoning. However, our systematic evaluation across various model families (DeepSeek, Qwen, and LLaMA) and scales (7B to 671B) reveals that acquiring these deliberati…

    Submitted 23 March, 2025; originally announced March 2025.

    Comments: 23 pages. Work in progress

  41. arXiv:2503.16385  [pdf, ps, other]

    cs.AI

    Deconstructing Long Chain-of-Thought: A Structured Reasoning Optimization Framework for Long CoT Distillation

    Authors: Yijia Luo, Yulin Song, Xingyao Zhang, Jiaheng Liu, Weixun Wang, GengRu Chen, Wenbo Su, Bo Zheng

    Abstract: Recent advancements in large language models (LLMs) have demonstrated remarkable reasoning capabilities through long chain-of-thought (CoT) reasoning. The R1 distillation scheme has emerged as a promising approach for training cost-effective models with enhanced reasoning abilities. However, the underlying mechanisms driving its effectiveness remain unclear. This study examines the universality of…

    Submitted 20 March, 2025; originally announced March 2025.

  42. arXiv:2503.14482  [pdf, other]

    cs.CV

    ICE-Bench: A Unified and Comprehensive Benchmark for Image Creating and Editing

    Authors: Yulin Pan, Xiangteng He, Chaojie Mao, Zhen Han, Zeyinzi Jiang, Jingfeng Zhang, Yu Liu

    Abstract: Image generation has witnessed significant advancements in the past few years. However, evaluating the performance of image generation models remains a formidable challenge. In this paper, we propose ICE-Bench, a unified and comprehensive benchmark designed to rigorously assess image generation models. Its comprehensiveness could be summarized in the following key features: (1) Coarse-to-Fine Task…

    Submitted 18 March, 2025; originally announced March 2025.

    Comments: 17 pages

  43. arXiv:2503.14355  [pdf, other]

    cs.CV

    MAST-Pro: Dynamic Mixture-of-Experts for Adaptive Segmentation of Pan-Tumors with Knowledge-Driven Prompts

    Authors: Runqi Meng, Sifan Song, Pengfei Jin, Yujin Oh, Lin Teng, Yulin Wang, Yiqun Sun, Ling Chen, Xiang Li, Quanzheng Li, Ning Guo, Dinggang Shen

    Abstract: Accurate tumor segmentation is crucial for cancer diagnosis and treatment. While foundation models have advanced general-purpose segmentation, existing methods still struggle with: (1) limited incorporation of medical priors, (2) imbalance between generic and tumor-specific features, and (3) high computational costs for clinical adaptation. To address these challenges, we propose MAST-Pro (Mixture…

    Submitted 18 March, 2025; originally announced March 2025.

    Comments: 10 pages, 2 figures

  44. arXiv:2503.12999  [pdf, other]

    cs.CV cs.AI

    Concept-as-Tree: Synthetic Data is All You Need for VLM Personalization

    Authors: Ruichuan An, Kai Zeng, Ming Lu, Sihan Yang, Renrui Zhang, Huitong Ji, Qizhe Zhang, Yulin Luo, Hao Liang, Wentao Zhang

    Abstract: Vision-Language Models (VLMs) have demonstrated exceptional performance in various multi-modal tasks. Recently, there has been an increasing interest in improving the personalization capabilities of VLMs. To better integrate user-provided concepts into VLMs, many methods use positive and negative samples to fine-tune these models. However, the scarcity of user-provided positive samples and the low…

    Submitted 23 March, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

    Comments: The code is released at https://github.com/zengkaiya/CaT

  45. arXiv:2503.12450  [pdf, other]

    cs.CV

    LazyMAR: Accelerating Masked Autoregressive Models via Feature Caching

    Authors: Feihong Yan, Qingyan Wei, Jiayi Tang, Jiajun Li, Yulin Wang, Xuming Hu, Huiqi Li, Linfeng Zhang

    Abstract: Masked Autoregressive (MAR) models have emerged as a promising approach in image generation, expected to surpass traditional autoregressive models in computational efficiency by leveraging the capability of parallel decoding. However, their dependence on bidirectional self-attention inherently conflicts with conventional KV caching mechanisms, creating unexpected computational bottlenecks that und…

    Submitted 16 March, 2025; originally announced March 2025.

    Comments: 10 pages, 6 figures

  46. arXiv:2503.10392  [pdf, other]

    cs.CV cs.AI

    RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing

    Authors: Fengxiang Wang, Hongzhen Wang, Yulin Wang, Di Wang, Mingshuo Chen, Haiyan Zhao, Yangang Sun, Shuo Wang, Long Lan, Wenjing Yang, Jing Zhang

    Abstract: Recent advances in self-supervised learning for Vision Transformers (ViTs) have fueled breakthroughs in remote sensing (RS) foundation models. However, the quadratic complexity of self-attention poses a significant barrier to scalability, particularly for large models and high-resolution images. While the linear-complexity Mamba architecture offers a promising alternative, existing RS applications…

    Submitted 13 March, 2025; originally announced March 2025.

  47. arXiv:2503.09281  [pdf, other]

    cs.SI

    Crowdsourced Homophily Ties Based Graph Annotation Via Large Language Model

    Authors: Yu Bu, Yulin Zhu, Kai Zhou

    Abstract: Accurate graph annotation typically requires substantial labeled data, which is often challenging and resource-intensive to obtain. In this paper, we present Crowdsourced Homophily Ties Based Graph Annotation via Large Language Model (CSA-LLM), a novel approach that combines the strengths of crowdsourced annotations with the capabilities of large language models (LLMs) to enhance the graph annotat…

    Submitted 12 March, 2025; originally announced March 2025.

  48. arXiv:2503.07598  [pdf, other]

    cs.CV

    VACE: All-in-One Video Creation and Editing

    Authors: Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, Yu Liu

    Abstract: Diffusion Transformer has demonstrated powerful capability and scalability in generating high-quality images and videos. Further pursuing the unification of generation and editing tasks has yielded significant progress in the domain of image content creation. However, due to the intrinsic demands for consistency across both temporal and spatial dynamics, achieving a unified approach for video synt…

    Submitted 11 March, 2025; v1 submitted 10 March, 2025; originally announced March 2025.

    Comments: Project page: https://ali-vilab.github.io/VACE-Page/

  49. arXiv:2503.05362  [pdf, other]

    cs.CL

    Chain of Strategy Optimization Makes Large Language Models Better Emotional Supporter

    Authors: Weixiang Zhao, Xingyu Sui, Xinyang Han, Yang Deng, Yulin Hu, Jiahe Guo, Libo Qin, Qianyun Du, Shijin Wang, Yanyan Zhao, Bing Qin, Ting Liu

    Abstract: The growing emotional stress in modern society has increased the demand for Emotional Support Conversations (ESC). While Large Language Models (LLMs) show promise for ESC, they face two key challenges: (1) low strategy selection accuracy, and (2) preference bias, limiting their adaptability to emotional needs of users. Existing supervised fine-tuning (SFT) struggles to address these issues, as it…

    Submitted 7 March, 2025; originally announced March 2025.

    Comments: 19 pages, 9 figures, 15 tables

  50. arXiv:2503.05127  [pdf, other]

    cs.CV cs.AI

    HexPlane Representation for 3D Semantic Scene Understanding

    Authors: Zeren Chen, Yuenan Hou, Yulin Chen, Li Liu, Xiao Sun, Lu Sheng

    Abstract: In this paper, we introduce the HexPlane representation for 3D semantic scene understanding. Specifically, we first design the View Projection Module (VPM) to project the 3D point cloud into six planes to maximally retain the original spatial information. Features of six planes are extracted by the 2D encoder and sent to the HexPlane Association Module (HAM) to adaptively fuse the most informative…

    Submitted 6 March, 2025; originally announced March 2025.

    Comments: 7 pages, 2 figures