Showing 1–50 of 418 results for author: Zeng, W

Searching in archive cs.

  1. arXiv:2507.04447  [pdf, ps, other]

    cs.CV cs.RO

    DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

    Authors: Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, XinQiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, Xin Jin

    Abstract: Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and sema…

    Submitted 6 July, 2025; originally announced July 2025.

  2. arXiv:2507.00880  [pdf, ps, other]

    cs.LG cs.AI

    NN-Former: Rethinking Graph Structure in Neural Architecture Representation

    Authors: Ruihan Xu, Haokui Zhang, Yaowei Wang, Wei Zeng, Shiliang Zhang

    Abstract: The growing use of deep learning necessitates efficient network design and deployment, making neural predictors vital for estimating attributes such as accuracy and latency. Recently, Graph Neural Networks (GNNs) and transformers have shown promising performance in representing neural architectures. However, each method has its disadvantages. GNNs lack the capabilities to represent compli…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: Accepted to CVPR 2025. Code is available at https://github.com/XuRuihan/NNFormer

  3. arXiv:2506.23563  [pdf, ps, other]

    cs.AI cs.CL cs.CV

    MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI

    Authors: Huanjin Yao, Jiaxing Huang, Yawen Qiu, Michael K. Chen, Wenzheng Liu, Wei Zhang, Wenjie Zeng, Xikun Zhang, Jingyi Zhang, Yuxin Song, Wenhao Wu, Dacheng Tao

    Abstract: Reasoning plays a crucial role in advancing Multimodal Large Language Models (MLLMs) toward Artificial General Intelligence. However, existing MLLM benchmarks often fall short in precisely and comprehensively evaluating long-chain reasoning abilities from three key aspects: (1) lack of difficulty and diversity, (2) susceptibility to guessability and memorization, (3) inadequate assessment of inter…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: Technical report

  4. arXiv:2506.22807  [pdf, ps, other]

    cs.CV

    FreqDGT: Frequency-Adaptive Dynamic Graph Networks with Transformer for Cross-subject EEG Emotion Recognition

    Authors: Yueyang Li, Shengyu Gong, Weiming Zeng, Nizhuan Wang, Wai Ting Siok

    Abstract: Electroencephalography (EEG) serves as a reliable and objective signal for emotion recognition in affective brain-computer interfaces, offering unique advantages through its high temporal resolution and ability to capture authentic emotional states that cannot be consciously controlled. However, cross-subject generalization remains a fundamental challenge due to individual variability, cognitive t…

    Submitted 30 June, 2025; v1 submitted 28 June, 2025; originally announced June 2025.

  5. arXiv:2506.20487  [pdf, ps, other]

    cs.RO

    Behavior Foundation Model: Towards Next-Generation Whole-Body Control System of Humanoid Robots

    Authors: Mingqi Yuan, Tao Yu, Wenqi Ge, Xiuyong Yao, Dapeng Li, Huijiang Wang, Jiayu Chen, Xin Jin, Bo Li, Hua Chen, Wei Zhang, Wenjun Zeng

    Abstract: Humanoid robots are drawing significant attention as versatile platforms for complex motor control, human-robot interaction, and general-purpose physical intelligence. However, achieving efficient whole-body control (WBC) in humanoids remains a fundamental challenge due to sophisticated dynamics, underactuation, and diverse task requirements. While learning-based controllers have shown promise for…

    Submitted 25 June, 2025; originally announced June 2025.

    Comments: 19 pages, 8 figures

  6. arXiv:2506.12779  [pdf, ps, other]

    cs.RO cs.LG

    From Experts to a Generalist: Toward General Whole-Body Control for Humanoid Robots

    Authors: Yuxuan Wang, Ming Yang, Weishuai Zeng, Yu Zhang, Xinrun Xu, Haobin Jiang, Ziluo Ding, Zongqing Lu

    Abstract: Achieving general agile whole-body control on humanoid robots remains a major challenge due to diverse motion demands and data conflicts. While existing frameworks excel in training single motion-specific policies, they struggle to generalize across highly varied behaviors due to conflicting control requirements and mismatched data distributions. In this work, we propose BumbleBee (BB), an expert-…

    Submitted 19 June, 2025; v1 submitted 15 June, 2025; originally announced June 2025.

  7. arXiv:2506.12769  [pdf, ps, other]

    cs.RO cs.LG

    RL from Physical Feedback: Aligning Large Motion Models with Humanoid Control

    Authors: Junpeng Yue, Zepeng Wang, Yuxuan Wang, Weishuai Zeng, Jiangxing Wang, Xinrun Xu, Yu Zhang, Sipeng Zheng, Ziluo Ding, Zongqing Lu

    Abstract: This paper focuses on a critical challenge in robotics: translating text-driven human motions into executable actions for humanoid robots, enabling efficient and cost-effective learning of new behaviors. While existing text-to-motion generation methods achieve semantic alignment between language and motion, they often produce kinematically or physically infeasible motions unsuitable for real-world…

    Submitted 15 June, 2025; originally announced June 2025.

  8. arXiv:2506.11144  [pdf, ps, other]

    cs.CV

    AlignHuman: Improving Motion and Fidelity via Timestep-Segment Preference Optimization for Audio-Driven Human Animation

    Authors: Chao Liang, Jianwen Jiang, Wang Liao, Jiaqi Yang, Zerong Zheng, Weihong Zeng, Han Liang

    Abstract: Recent advancements in human video generation and animation tasks, driven by diffusion models, have achieved significant progress. However, expressive and realistic human animation remains challenging due to the trade-off between motion naturalness and visual fidelity. To address this, we propose AlignHuman, a framework that combines Preference Optimization as a post-training technique wi…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: Homepage: https://alignhuman.github.io/

  9. arXiv:2506.09656  [pdf, ps, other]

    cs.AI

    Application-Driven Value Alignment in Agentic AI Systems: Survey and Perspectives

    Authors: Wei Zeng, Hengshu Zhu, Chuan Qin, Han Wu, Yihang Cheng, Sirui Zhang, Xiaowei Jin, Yinuo Shen, Zhenxing Wang, Feimin Zhong, Hui Xiong

    Abstract: The ongoing evolution of AI paradigms has propelled AI research into the Agentic AI stage. Consequently, the focus of research has shifted from single agents and simple applications towards multi-agent autonomous decision-making and task collaboration in complex environments. As Large Language Models (LLMs) advance, their applications become more diverse and complex, leading to increasingly situat…

    Submitted 11 June, 2025; originally announced June 2025.

  10. arXiv:2506.01636  [pdf, ps, other]

    cs.CV

    Visual Explanation via Similar Feature Activation for Metric Learning

    Authors: Yi Liao, Ugochukwu Ejike Akpudo, Jue Zhang, Yongsheng Gao, Jun Zhou, Wenyi Zeng, Weichuan Zhang

    Abstract: Visual explanation maps enhance the trustworthiness of decisions made by deep learning models and offer valuable guidance for developing new algorithms in image recognition tasks. Class activation maps (CAM) and their variants (e.g., Grad-CAM and Relevance-CAM) have been extensively employed to explore the interpretability of softmax-based convolutional neural networks, which require a fully conne…

    Submitted 2 June, 2025; originally announced June 2025.

  11. arXiv:2505.22203  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    Pitfalls of Rule- and Model-based Verifiers -- A Case Study on Mathematical Reasoning

    Authors: Yuzhen Huang, Weihao Zeng, Xingshan Zeng, Qi Zhu, Junxian He

    Abstract: Trustworthy verifiers are essential for the success of reinforcement learning with verifiable reward (RLVR), which is the core methodology behind various large reasoning models such as DeepSeek-R1. In complex domains like mathematical reasoning, rule-based verifiers have been widely adopted in previous works to train strong reasoning models. However, the reliability of these verifiers and their im…

    Submitted 28 May, 2025; originally announced May 2025.

  12. arXiv:2505.18934  [pdf, ps, other]

    cs.LG cs.AI cs.IR cs.SI

    Chi-Square Wavelet Graph Neural Networks for Heterogeneous Graph Anomaly Detection

    Authors: Xiping Li, Xiangyu Dong, Xingyi Zhang, Kun Xie, Yuanhao Feng, Bo Wang, Guilin Li, Wuxiong Zeng, Xiujun Shu, Sibo Wang

    Abstract: Graph Anomaly Detection (GAD) in heterogeneous networks presents unique challenges due to node and edge heterogeneity. Existing Graph Neural Network (GNN) methods primarily focus on homogeneous GAD and thus fail to address three key issues: (C1) Capturing abnormal signal and rich semantics across diverse meta-paths; (C2) Retaining high-frequency content in HIN dimension alignment; and (C3) Learnin…

    Submitted 24 May, 2025; originally announced May 2025.

  13. arXiv:2505.15649  [pdf, other]

    cs.CV

    The Devil is in Fine-tuning and Long-tailed Problems: A New Benchmark for Scene Text Detection

    Authors: Tianjiao Cao, Jiahao Lyu, Weichao Zeng, Weimin Mu, Yu Zhou

    Abstract: Scene text detection has seen the emergence of high-performing methods that excel on academic benchmarks. However, these detectors often fail to replicate such success in real-world scenarios. We uncover two key factors contributing to this discrepancy through extensive experiments. First, a Fine-tuning Gap, where models leverage the Dataset-Specific Optimization (DSO) paradigm for o…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: Accepted by IJCAI 2025

  14. arXiv:2505.15616  [pdf, ps, other]

    cs.CV

    LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models

    Authors: Ruilin Yao, Bo Zhang, Jirui Huang, Xinwei Long, Yifang Zhang, Tianyu Zou, Yufei Wu, Shichao Su, Yifan Xu, Wenxi Zeng, Zhaoyu Yang, Guoyou Li, Shilan Zhang, Zichan Li, Yaxiong Chen, Shengwu Xiong, Peng Xu, Jiajun Zhang, Bowen Zhou, David Clifton, Luc Van Gool

    Abstract: Multimodal Large Language Models (MLLMs) have achieved significant advances in integrating visual and linguistic information, yet their ability to reason about complex and real-world scenarios remains limited. Existing benchmarks are usually constructed in a task-oriented manner without guaranteeing that different task samples come from the same data distribution, thus they often fall short in…

    Submitted 21 May, 2025; originally announced May 2025.

  15. arXiv:2505.14664  [pdf, ps, other]

    cs.CV cs.AI cs.HC cs.LG

    AKRMap: Adaptive Kernel Regression for Trustworthy Visualization of Cross-Modal Embeddings

    Authors: Yilin Ye, Junchao Huang, Xingchen Zeng, Jiazhi Xia, Wei Zeng

    Abstract: Cross-modal embeddings form the foundation for multi-modal models. However, visualization methods for interpreting cross-modal embeddings have been primarily confined to traditional dimensionality reduction (DR) techniques like PCA and t-SNE. These DR methods primarily focus on feature distributions within a single modality, whilst failing to incorporate metrics (e.g., CLIPScore) across multiple m…

    Submitted 28 May, 2025; v1 submitted 20 May, 2025; originally announced May 2025.

  16. arXiv:2505.12069  [pdf, ps, other]

    cs.CV cs.AI

    MT-CYP-Net: Multi-Task Network for Pixel-Level Crop Yield Prediction Under Very Few Samples

    Authors: Shenzhou Liu, Di Wang, Haonan Guo, Chengxi Han, Wenzhi Zeng

    Abstract: Accurate and fine-grained crop yield prediction plays a crucial role in advancing global agriculture. However, the accuracy of pixel-level yield estimation based on satellite remote sensing data has been constrained by the scarcity of ground truth data. To address this challenge, we propose a novel approach called the Multi-Task Crop Yield Prediction Network (MT-CYP-Net). This framework introduces…

    Submitted 17 May, 2025; originally announced May 2025.

  17. arXiv:2505.08608  [pdf]

    q-bio.QM cs.LG

    Automated Model-Free Sorting of Single-Molecule Fluorescence Events Using a Deep Learning Based Hidden-State Model

    Authors: Wenqi Zeng, Shuqi Zhou, Yuan Yao, Chunlai Chen

    Abstract: Single-molecule fluorescence assays enable high-resolution analysis of biomolecular dynamics, but traditional analysis pipelines are labor-intensive and rely on users' experience, limiting scalability and reproducibility. Recent deep learning models have automated aspects of data processing, yet many still require manual thresholds, complex architectures, or extensive labeled data. Therefore, we p…

    Submitted 13 May, 2025; originally announced May 2025.

  18. arXiv:2505.06584  [pdf, ps, other]

    cs.RO cs.AI

    JAEGER: Dual-Level Humanoid Whole-Body Controller

    Authors: Ziluo Ding, Haobin Jiang, Yuxuan Wang, Zhenguo Sun, Yu Zhang, Xiaojie Niu, Ming Yang, Weishuai Zeng, Xinrun Xu, Zongqing Lu

    Abstract: This paper presents JAEGER, a dual-level whole-body controller for humanoid robots that addresses the challenges of training a more robust and versatile policy. Unlike traditional single-controller approaches, JAEGER separates the control of the upper and lower bodies into two independent controllers, so that they can better focus on their distinct tasks. This separation alleviates the dimensional…

    Submitted 16 June, 2025; v1 submitted 10 May, 2025; originally announced May 2025.

    Comments: 15 pages, 2 figures

  19. arXiv:2505.06152  [pdf, ps, other]

    cs.CV cs.AI

    MM-Skin: Enhancing Dermatology Vision-Language Model with an Image-Text Dataset Derived from Textbooks

    Authors: Wenqi Zeng, Yuqi Sun, Chenxi Ma, Weimin Tan, Bo Yan

    Abstract: Medical vision-language models (VLMs) have shown promise as clinical assistants across various medical fields. However, a specialized dermatology VLM capable of delivering professional and detailed diagnostic analysis remains underdeveloped, primarily due to less specialized text descriptions in current dermatology multimodal datasets. To address this issue, we propose MM-Skin, the first large-scale…

    Submitted 9 May, 2025; originally announced May 2025.

  20. arXiv:2505.03807  [pdf, other]

    cs.HC cs.AI cs.CV cs.MA

    Facilitating Video Story Interaction with Multi-Agent Collaborative System

    Authors: Yiwen Zhang, Jianing Hao, Zhan Wang, Hongling Sheng, Wei Zeng

    Abstract: Video story interaction enables viewers to engage with and explore narrative content for personalized experiences. However, existing methods are limited to user selection, specially designed narratives, and lack customization. To address this, we propose an interactive system based on user intent. Our system uses a Vision Language Model (VLM) to enable machines to understand video stories, combini…

    Submitted 2 May, 2025; originally announced May 2025.

    Comments: Prepared and submitted in 2024

  21. arXiv:2504.21682  [pdf, ps, other]

    cs.CV

    Visual Text Processing: A Comprehensive Review and Unified Evaluation

    Authors: Yan Shu, Weichao Zeng, Fangmin Zhao, Zeyu Chen, Zhenhang Li, Xiaomeng Yang, Yu Zhou, Paolo Rota, Xiang Bai, Lianwen Jin, Xu-Cheng Yin, Nicu Sebe

    Abstract: Visual text is a crucial component in both document and scene images, conveying rich semantic information and attracting significant attention in the computer vision community. Beyond traditional tasks such as text detection and recognition, visual text processing has witnessed rapid advancements driven by the emergence of foundation models, including text image reconstruction and text image manip…

    Submitted 5 June, 2025; v1 submitted 30 April, 2025; originally announced April 2025.

  22. arXiv:2504.18367  [pdf]

    physics.comp-ph cs.LG physics.chem-ph q-bio.BM

    Enhanced Sampling, Public Dataset and Generative Model for Drug-Protein Dissociation Dynamics

    Authors: Maodong Li, Jiying Zhang, Bin Feng, Wenqi Zeng, Dechin Chen, Zhijun Pan, Yu Li, Zijing Liu, Yi Isaac Yang

    Abstract: Drug-protein binding and dissociation dynamics are fundamental to understanding molecular interactions in biological systems. While many tools for drug-protein interaction studies have emerged, especially artificial intelligence (AI)-based generative models, predictive tools on binding/dissociation kinetics and dynamics are still limited. We propose a novel research paradigm that combines molecula…

    Submitted 25 April, 2025; originally announced April 2025.

    Comments: The code will be accessible from our repository: https://huggingface.co/SZBL-IDEA

  23. arXiv:2504.17814  [pdf, other]

    cs.IR

    FIM: Frequency-Aware Multi-View Interest Modeling for Local-Life Service Recommendation

    Authors: Guoquan Wang, Qiang Luo, Weisong Hu, Pengfei Yao, Wencong Zeng, Guorui Zhou, Kun Gai

    Abstract: People's daily lives involve numerous periodic behaviors, such as eating and traveling. Local-life platforms cater to these recurring needs by providing essential services tied to daily routines. Therefore, users' periodic intentions are reflected in their interactions with the platforms. There are two main challenges in modeling users' periodic behaviors in the local-life service recommendation s…

    Submitted 30 April, 2025; v1 submitted 22 April, 2025; originally announced April 2025.

    Comments: 10 pages, 5 figures, Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '25), July 13–18, 2025, Padua, Italy

    ACM Class: H.3.3

  24. arXiv:2504.17490  [pdf, ps, other]

    cs.LG cs.AI

    Plasticine: Accelerating Research in Plasticity-Motivated Deep Reinforcement Learning

    Authors: Mingqi Yuan, Qi Wang, Guozheng Ma, Bo Li, Xin Jin, Yunbo Wang, Xiaokang Yang, Wenjun Zeng, Dacheng Tao

    Abstract: Developing lifelong learning agents is crucial for artificial general intelligence. However, deep reinforcement learning (RL) systems often suffer from plasticity loss, where neural networks gradually lose their ability to adapt during training. Despite its significance, this field lacks unified benchmarks and evaluation protocols. We introduce Plasticine, the first open-source framework for bench…

    Submitted 24 April, 2025; originally announced April 2025.

    Comments: 23 pages

  25. arXiv:2504.14606  [pdf, other]

    cs.CV

    MP-Mat: A 3D-and-Instance-Aware Human Matting and Editing Framework with Multiplane Representation

    Authors: Siyi Jiao, Wenzheng Zeng, Yerong Li, Huayu Zhang, Changxin Gao, Nong Sang, Mike Zheng Shou

    Abstract: Human instance matting aims to estimate an alpha matte for each human instance in an image, which is challenging as it easily fails in complex cases requiring disentangling mingled pixels belonging to multiple instances along hairy and thin boundary structures. In this work, we address this by introducing MP-Mat, a novel 3D-and-instance-aware matting framework with multiplane representation, where…

    Submitted 20 April, 2025; originally announced April 2025.

    Comments: Accepted by ICLR 2025

  26. arXiv:2504.14507  [pdf, other]

    cs.HC

    VizTA: Enhancing Comprehension of Distributional Visualization with Visual-Lexical Fused Conversational Interface

    Authors: Liangwei Wang, Zhan Wang, Shishi Xiao, Le Liu, Fugee Tsung, Wei Zeng

    Abstract: Comprehending visualizations requires readers to interpret visual encoding and the underlying meanings actively. This poses challenges for visualization novices, particularly when interpreting distributional visualizations that depict statistical uncertainty. Advancements in LLM-based conversational interfaces show promise in promoting visualization comprehension. However, they fail to provide con…

    Submitted 20 April, 2025; originally announced April 2025.

    Comments: 12 pages, 7 figures, published at EuroVis 2025

  27. arXiv:2504.09156  [pdf, other]

    cs.CV

    LEREL: Lipschitz Continuity-Constrained Emotion Recognition Ensemble Learning For Electroencephalography

    Authors: Shengyu Gong, Yueyang Li, Zijian Kang, Weiming Zeng, Hongjie Yan, Wai Ting Siok, Nizhuan Wang

    Abstract: Accurate and efficient perception of emotional states in oneself and others is crucial, as emotion-related disorders are associated with severe psychosocial impairments. While electroencephalography (EEG) offers a powerful tool for emotion detection, current EEG-based emotion recognition (EER) methods face key limitations: insufficient model stability, limited accuracy in processing high-dimension…

    Submitted 12 April, 2025; originally announced April 2025.

  28. Learning from Elders: Making an LLM-powered Chatbot for Retirement Communities more Accessible through User-centered Design

    Authors: Luna Xingyu Li, Ray-yuan Chung, Feng Chen, Wenyu Zeng, Yein Jeon, Oleg Zaslavsky

    Abstract: Low technology and eHealth literacy among older adults in retirement communities hinder engagement with digital tools. To address this, we designed an LLM-powered chatbot prototype using a human-centered approach for a local retirement community. Through interviews and persona development, we prioritized accessibility and dual functionality: simplifying internal information retrieval and improving…

    Submitted 28 April, 2025; v1 submitted 11 April, 2025; originally announced April 2025.

    Comments: Accepted as a research talk at the Considering Cultural and Linguistic Diversity in AI Applications workshop (CALD-AI@ASIS&T 2025)

  29. Multi-modal Reference Learning for Fine-grained Text-to-Image Retrieval

    Authors: Zehong Ma, Hao Chen, Wei Zeng, Limin Su, Shiliang Zhang

    Abstract: Fine-grained text-to-image retrieval aims to retrieve a fine-grained target image with a given text query. Existing methods typically assume that each training image is accurately depicted by its textual descriptions. However, textual descriptions can be ambiguous and fail to depict discriminative visual details in images, leading to inaccurate representation learning. To alleviate the effects of…

    Submitted 10 April, 2025; originally announced April 2025.

    Comments: TMM25

  30. arXiv:2504.07479  [pdf, other]

    cs.AR

    UniCAIM: A Unified CAM/CIM Architecture with Static-Dynamic KV Cache Pruning for Efficient Long-Context LLM Inference

    Authors: Weikai Xu, Wenxuan Zeng, Qianqian Huang, Meng Li, Ru Huang

    Abstract: Transformer-based large language models (LLMs) have achieved impressive performance in various natural language processing (NLP) applications. However, the high memory and computation cost induced by the KV cache limits the inference efficiency, especially for long input sequences. Compute-in-memory (CIM)-based accelerators have been proposed for LLM acceleration with KV cache pruning. However, as…

    Submitted 10 April, 2025; originally announced April 2025.

  31. arXiv:2504.04190  [pdf, other]

    cs.CV

    Interpretable Single-View 3D Gaussian Splatting using Unsupervised Hierarchical Disentangled Representation Learning

    Authors: Yuyang Zhang, Baao Xie, Hu Zhu, Qi Wang, Huanting Guo, Xin Jin, Wenjun Zeng

    Abstract: Gaussian Splatting (GS) has recently marked a significant advancement in 3D reconstruction, delivering both rapid rendering and high-quality results. However, existing 3DGS methods pose challenges in understanding underlying 3D semantics, which hinders model controllability and interpretability. To address this, we propose an interpretable single-view 3DGS framework, termed 3DisGS, to discover both…

    Submitted 5 April, 2025; originally announced April 2025.

  32. arXiv:2504.01081  [pdf, other]

    cs.CV cs.CL eess.IV

    ShieldGemma 2: Robust and Tractable Image Content Moderation

    Authors: Wenjun Zeng, Dana Kurniawan, Ryan Mullins, Yuchi Liu, Tamoghna Saha, Dirichi Ike-Njoku, Jindong Gu, Yiwen Song, Cai Xu, Jingjing Zhou, Aparna Joshi, Shravan Dheep, Mani Malek, Hamid Palangi, Joon Baek, Rick Pereira, Karthik Narasimhan

    Abstract: We introduce ShieldGemma 2, a 4B parameter image content moderation model built on Gemma 3. This model provides robust safety risk predictions across the following key harm categories: Sexually Explicit, Violence & Gore, and Dangerous Content for synthetic images (e.g. output of any image generation model) and natural images (e.g. any image input to a Vision-Language Model). We evaluated on both…

    Submitted 8 April, 2025; v1 submitted 1 April, 2025; originally announced April 2025.

  33. arXiv:2503.23407  [pdf, other]

    cs.CV cs.AI

    GMapLatent: Geometric Mapping in Latent Space

    Authors: Wei Zeng, Xuebin Chang, Jianghao Su, Xiang Gu, Jian Sun, Zongben Xu

    Abstract: Cross-domain generative models based on encoder-decoder AI architectures have attracted much attention in generating realistic images, where domain alignment is crucial for generation accuracy. Domain alignment methods usually deal directly with the initial distribution; however, mismatched or mixed clusters can lead to mode collapse and mixture problems in the decoder, compromising model generali…

    Submitted 30 March, 2025; originally announced March 2025.

  34. arXiv:2503.21817  [pdf, ps, other]

    cs.CV

    Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping

    Authors: Weili Zeng, Ziyuan Huang, Kaixiang Ji, Yichao Yan

    Abstract: Transformer-based models have driven significant advancements in Multimodal Large Language Models (MLLMs), yet their computational costs surge drastically when scaling resolution, training data, and model parameters. A key bottleneck stems from the proliferation of visual tokens required for fine-grained image understanding. We propose Skip-Vision, a unified framework addressing both training and…

    Submitted 3 July, 2025; v1 submitted 26 March, 2025; originally announced March 2025.

    Comments: Accepted by ICCV 2025

  35. arXiv:2503.18892  [pdf, other]

    cs.LG cs.AI cs.CL

    SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

    Authors: Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, Junxian He

    Abstract: DeepSeek-R1 has shown that long chain-of-thought (CoT) reasoning can naturally emerge through a simple reinforcement learning (RL) framework with rule-based rewards, where the training may directly start from the base models, a paradigm referred to as zero RL training. Most recent efforts to reproduce zero RL training have primarily focused on the Qwen2.5 model series, which may not be representati…

    Submitted 7 May, 2025; v1 submitted 24 March, 2025; originally announced March 2025.

  36. arXiv:2503.18435  [pdf, other]

    cs.CV cs.CL

    On the Perception Bottleneck of VLMs for Chart Understanding

    Authors: Junteng Liu, Weihao Zeng, Xiwen Zhang, Yijun Wang, Zifei Shan, Junxian He

    Abstract: Chart understanding requires models to effectively analyze and reason about numerical data, textual elements, and complex visual components. Our observations reveal that the perception capabilities of existing large vision-language models (LVLMs) constitute a critical bottleneck in this process. In this study, we delve into this perception bottleneck by decomposing it into two components: the visi…

    Submitted 24 March, 2025; originally announced March 2025.

  37. arXiv:2503.12880  [pdf, ps, other]

    cs.CL cs.AI

    nvBench 2.0: Resolving Ambiguity in Text-to-Visualization through Stepwise Reasoning

    Authors: Tianqi Luo, Chuhan Huang, Leixian Shen, Boyan Li, Shuyu Shen, Wei Zeng, Nan Tang, Yuyu Luo

    Abstract: Text-to-Visualization (Text2VIS) enables users to create visualizations from natural language queries, making data insights more accessible. However, Text2VIS faces challenges in interpreting ambiguous queries, as users often express their visualization needs in imprecise language. To address this challenge, we introduce nvBench 2.0, a new benchmark designed to evaluate Text2VIS systems in scenar…

    Submitted 7 June, 2025; v1 submitted 17 March, 2025; originally announced March 2025.

  38. arXiv:2503.08751  [pdf, other]

    cs.CV cs.LG

    Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning

    Authors: Qi Wang, Zhipeng Zhang, Baao Xie, Xin Jin, Yunbo Wang, Shiyu Wang, Liaomo Zheng, Xiaokang Yang, Wenjun Zeng

    Abstract: Training visual reinforcement learning (RL) in practical scenarios presents a significant challenge, i.e., RL agents suffer from low sample efficiency in environments with variations. While various approaches have attempted to alleviate this issue by disentanglement representation learning, these methods usually start learning from scratch without prior knowledge of the world. This pape…

    Submitted 11 March, 2025; originally announced March 2025.

  39. arXiv:2503.08144  [pdf, other]

    cs.CV

    Bring Remote Sensing Object Detect Into Nature Language Model: Using SFT Method

    Authors: Fei Wang, Chengcheng Chen, Hongyu Chen, Yugang Chang, Weiming Zeng

    Abstract: Recently, large language models (LLMs) and vision-language models (VLMs) have achieved significant success, demonstrating remarkable capabilities in understanding various images and videos, particularly in classification and detection tasks. However, due to the substantial differences between remote sensing images and conventional optical images, these models face considerable challenges in compre…

    Submitted 20 March, 2025; v1 submitted 11 March, 2025; originally announced March 2025.

  40. arXiv:2503.08046  [pdf, other]

    cs.IR

    MultiConIR: Towards multi-condition Information Retrieval

    Authors: Xuan Lu, Sifan Liu, Bochao Yin, Yongqi Li, Xinghao Chen, Hui Su, Yaohui Jin, Wenjun Zeng, Xiaoyu Shen

    Abstract: In this paper, we introduce MultiConIR, the first benchmark designed to evaluate retrieval models in multi-condition scenarios. Unlike existing datasets that primarily focus on single-condition queries from search engines, MultiConIR captures real-world complexity by incorporating five diverse domains: books, movies, people, medical cases, and legal documents. We propose three tasks to systematica…

    Submitted 11 March, 2025; v1 submitted 11 March, 2025; originally announced March 2025.

  41. arXiv:2503.06101  [pdf, other]

    cs.LG cs.AI

    ULTHO: Ultra-Lightweight yet Efficient Hyperparameter Optimization in Deep Reinforcement Learning

    Authors: Mingqi Yuan, Bo Li, Xin Jin, Wenjun Zeng

    Abstract: Hyperparameter optimization (HPO) is a billion-dollar problem in machine learning, which significantly impacts the training efficiency and model performance. However, achieving efficient and robust HPO in deep reinforcement learning (RL) is consistently challenging due to its high non-stationarity and computational cost. To tackle this problem, existing approaches attempt to adapt common HPO techn…

    Submitted 8 March, 2025; originally announced March 2025.

    Comments: 23 pages, 22 figures

  42. GenColor: Generative Color-Concept Association in Visual Design

    Authors: Yihan Hou, Xingchen Zeng, Yusong Wang, Manling Yang, Xiaojiao Chen, Wei Zeng

    Abstract: Existing approaches for color-concept association typically rely on query-based image referencing, and color extraction from image references. However, these approaches are effective only for common concepts, and are vulnerable to unstable image referencing and varying image conditions. Our formative study with designers underscores the need for primary-accent color compositions and context-depend…

    Submitted 5 March, 2025; originally announced March 2025.

    Comments: 19 pages, 16 figures. Accepted at CHI Conference on Human Factors in Computing Systems (CHI'25), April 26-May 1, 2025, Yokohama, Japan

  43. arXiv:2502.21154  [pdf, other]

    cs.HC

    Hypergraph Multi-Modal Learning for EEG-based Emotion Recognition in Conversation

    Authors: Zijian Kang, Yueyang Li, Shengyu Gong, Weiming Zeng, Hongjie Yan, Lingbin Bian, Wai Ting Siok, Nizhuan Wang

    Abstract: Emotional Recognition in Conversation (ERC) is an important method for diagnosing health conditions such as autism or depression, as well as understanding emotions in individuals who struggle to express their feelings. Current ERC methods primarily rely on complete semantic textual information, including audio and visual data, but face challenges in integrating physiological signals such as electr…

    Submitted 28 February, 2025; originally announced February 2025.

  44. arXiv:2502.20769  [pdf, other]

    cs.CV

    Information Bottleneck-Guided Heterogeneous Graph Learning for Interpretable Neurodevelopmental Disorder Diagnosis

    Authors: Yueyang Li, Lei Chen, Wenhao Dong, Shengyu Gong, Zijian Kang, Boyang Wei, Weiming Zeng, Hongjie Yan, Lingbin Bian, Wai Ting Siok, Nizhuan Wang

    Abstract: Developing interpretable models for diagnosing neurodevelopmental disorders (NDDs) is highly valuable yet challenging, primarily due to the complexity of encoding, decoding and integrating imaging and non-imaging data. Many existing machine learning models struggle to provide comprehensive interpretability, often failing to extract meaningful biomarkers from imaging data, such as functional magnet…

    Submitted 28 February, 2025; originally announced February 2025.

  45. arXiv:2502.17307  [pdf, ps, other]

    cs.LG cs.GT cs.MA

    Survey on Strategic Mining in Blockchain: A Reinforcement Learning Approach

    Authors: Jichen Li, Lijia Xie, Hanting Huang, Bo Zhou, Binfeng Song, Wanying Zeng, Xiaotie Deng, Xiao Zhang

    Abstract: Strategic mining attacks, such as selfish mining, exploit blockchain consensus protocols by deviating from honest behavior to maximize rewards. Markov Decision Process (MDP) analysis faces scalability challenges in modern digital economics, including blockchain. To address these limitations, reinforcement learning (RL) provides a scalable alternative, enabling adaptive strategy optimization in com…

    Submitted 24 February, 2025; v1 submitted 24 February, 2025; originally announced February 2025.

    Comments: 10 pages

  46. arXiv:2502.12196  [pdf]

    cs.NE math.OC

    Integrated Scheduling Model for Arrivals and Departures in Metroplex Terminal Area

    Authors: Tonghe Li, Jixin Liu, Hao Jiang, Weili Zeng, Lei Yang

    Abstract: In light of the rapid expansion of civil aviation, addressing the delays and congestion phenomena in the vicinity of metroplex caused by the imbalance between air traffic flow and capacity is crucial. This paper first proposes a bi-level optimization model for the collaborative flight sequencing of arrival and departure flights in the metroplex with multiple airports, considering both the runway s…

    Submitted 20 February, 2025; v1 submitted 15 February, 2025; originally announced February 2025.

    Comments: 37 pages, 28 figures

  47. arXiv:2502.11089  [pdf, other]

    cs.CL cs.AI cs.LG

    Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

    Authors: Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng

    Abstract: Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trainable Sparse Attention mechanism that integrates algorithmic innovations with har…

    Submitted 27 February, 2025; v1 submitted 16 February, 2025; originally announced February 2025.

  48. arXiv:2502.08820  [pdf, other]

    cs.AI cs.CL

    Can a Single Model Master Both Multi-turn Conversations and Tool Use? CoALM: A Unified Conversational Agentic Language Model

    Authors: Emre Can Acikgoz, Jeremiah Greer, Akul Datta, Ze Yang, William Zeng, Oussama Elachqar, Emmanouil Koukoumidis, Dilek Hakkani-Tür, Gokhan Tur

    Abstract: Large Language Models (LLMs) with API-calling capabilities enabled building effective Language Agents (LA), while also revolutionizing the conventional task-oriented dialogue (TOD) paradigm. However, current approaches face a critical dilemma: TOD systems are often trained on a limited set of target APIs, requiring new data to maintain their quality when interfacing with new services, while LAs ar…

    Submitted 18 February, 2025; v1 submitted 12 February, 2025; originally announced February 2025.

  49. arXiv:2502.07556  [pdf, other]

    cs.HC cs.CV

    SketchFlex: Facilitating Spatial-Semantic Coherence in Text-to-Image Generation with Region-Based Sketches

    Authors: Haichuan Lin, Yilin Ye, Jiazhi Xia, Wei Zeng

    Abstract: Text-to-image models can generate visually appealing images from text descriptions. Efforts have been devoted to improving model controls with prompt tuning and spatial conditioning. However, our formative study highlights the challenges for non-expert users in crafting appropriate prompts and specifying fine-grained spatial conditions (e.g., depth or canny references) to generate semantically coh…

    Submitted 11 February, 2025; originally announced February 2025.

    Comments: conference: CHI2025

  50. arXiv:2501.12948  [pdf, other]

    cs.CL cs.AI cs.LG

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Authors: DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, et al. (175 additional authors not shown)

    Abstract: We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters…

    Submitted 22 January, 2025; originally announced January 2025.