Showing 1–50 of 700 results for author: Zhao, Q

Searching in archive cs.
  1. arXiv:2507.00917  [pdf, ps, other]

    cs.RO

    A Survey: Learning Embodied Intelligence from Physical Simulators and World Models

    Authors: Xiaoxiao Long, Qingrui Zhao, Kaiwen Zhang, Zihao Zhang, Dingrui Wang, Yumeng Liu, Zhengjie Shu, Yi Lu, Shouzheng Wang, Xinzhe Wei, Wei Li, Wei Yin, Yao Yao, Jia Pan, Qiu Shen, Ruigang Yang, Xun Cao, Qionghai Dai

    Abstract: The pursuit of artificial general intelligence (AGI) has placed embodied intelligence at the forefront of robotics research. Embodied intelligence focuses on agents capable of perceiving, reasoning, and acting within the physical world. Achieving robust embodied intelligence requires not only advanced perception and control, but also the ability to ground abstract cognition in real-world interacti…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: https://github.com/NJU3DV-LoongGroup/Embodied-World-Models-Survey

  2. arXiv:2506.20045  [pdf, ps, other]

    cs.RO cs.CV

    Consensus-Driven Uncertainty for Robotic Grasping based on RGB Perception

    Authors: Eric C. Joyce, Qianwen Zhao, Nathaniel Burgdorfer, Long Wang, Philippos Mordohai

    Abstract: Deep object pose estimators are notoriously overconfident. A grasping agent that both estimates the 6-DoF pose of a target object and predicts the uncertainty of its own estimate could avoid task failure by choosing not to act under high uncertainty. Even though object pose estimation improves and uncertainty quantification research continues to make strides, few studies have connected them to the…

    Submitted 26 June, 2025; v1 submitted 24 June, 2025; originally announced June 2025.

    Comments: Accepted to IROS 2025

  3. arXiv:2506.19937  [pdf, ps, other]

    cs.LG

    The Most Important Features in Generalized Additive Models Might Be Groups of Features

    Authors: Tomas M. Bosschieter, Luis Franca, Jessica Wolk, Yiyuan Wu, Bella Mehta, Joseph Dehoney, Orsolya Kiss, Fiona C. Baker, Qingyu Zhao, Rich Caruana, Kilian M. Pohl

    Abstract: While analyzing the importance of features has become ubiquitous in interpretable machine learning, the joint signal from a group of related features is sometimes overlooked or inadvertently excluded. Neglecting the joint signal could bypass a critical insight: in many instances, the most significant predictors are not isolated features, but rather the combined effect of groups of features. This c…

    Submitted 24 June, 2025; originally announced June 2025.

  4. arXiv:2506.19270  [pdf, ps, other]

    quant-ph cs.LG

    Continuous-variable Quantum Diffusion Model for State Generation and Restoration

    Authors: Haitao Huang, Chuangtao Chen, Qinglin Zhao

    Abstract: The generation and preservation of complex quantum states against environmental noise are paramount challenges in advancing continuous-variable (CV) quantum information processing. This paper introduces a novel framework based on continuous-variable quantum diffusion principles, synergizing them with CV quantum neural networks (CVQNNs) to address these dual challenges. For the task of state genera…

    Submitted 23 June, 2025; originally announced June 2025.

    Comments: 15+3 pages, 14 figures, 7 tables

    MSC Class: 81P68

  5. arXiv:2506.18898  [pdf, ps, other]

    cs.CV cs.AI cs.CL cs.MM

    Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations

    Authors: Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, Lu Jiang

    Abstract: This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary. By integrating vision and text into a unified space with an expan…

    Submitted 23 June, 2025; originally announced June 2025.

    Comments: Project page: https://tar.csuhan.com

  6. arXiv:2506.17108  [pdf, ps, other]

    eess.SP cs.IT stat.ML

    Searching for a Hidden Markov Anomaly over Multiple Processes

    Authors: Levli Citron, Kobi Cohen, Qing Zhao

    Abstract: We address the problem of detecting an anomalous process among a large number of processes. At each time t, normal processes are in state zero (normal state), while the abnormal process may be in either state zero (normal state) or state one (abnormal state), with the states being hidden. The transition between states for the abnormal process is governed by a Markov chain over time. At each time s…

    Submitted 20 June, 2025; originally announced June 2025.

    Comments: 13 pages, 9 figures

  7. arXiv:2506.15715  [pdf, ps, other]

    cs.LG cs.AI

    NeuronSeek: On Stability and Expressivity of Task-driven Neurons

    Authors: Hanyu Pei, Jing-Xiao Liao, Qibin Zhao, Ting Gao, Shijun Zhang, Xiaoge Zhang, Feng-Lei Fan

    Abstract: Drawing inspiration from our human brain that designs different neurons for different tasks, recent advances in deep learning have explored modifying a network's neurons to develop so-called task-driven neurons. Prototyping task-driven neurons (referred to as NeuronSeek) employs symbolic regression (SR) to discover the optimal neuron formulation and construct a network from these optimized neurons…

    Submitted 31 May, 2025; originally announced June 2025.

    Comments: 14 pages, 10 figures

  8. arXiv:2506.12321  [pdf, ps, other]

    cs.LG cs.AI

    Extending Memorization Dynamics in Pythia Models from Instance-Level Insights

    Authors: Jie Zhang, Qinghua Zhao, Lei Li, Chi-ho Lin

    Abstract: Large language models have demonstrated a remarkable ability for verbatim memorization. While numerous works have explored factors influencing model memorization, the dynamic evolution of memorization patterns remains underexplored. This paper presents a detailed analysis of memorization in the Pythia model family across varying scales and training steps under prefix perturbations. Using granular met…

    Submitted 13 June, 2025; originally announced June 2025.

    Comments: 5 figures

  9. arXiv:2506.10941  [pdf, ps, other]

    cs.CV cs.AI cs.CL cs.LG cs.MM

    VINCIE: Unlocking In-context Image Editing from Video

    Authors: Leigang Qu, Feng Cheng, Ziyan Yang, Qi Zhao, Shanchuan Lin, Yichun Shi, Yicong Li, Wenjie Wang, Tat-Seng Chua, Lu Jiang

    Abstract: In-context image editing aims to modify images based on a contextual sequence comprising text and previously generated images. Existing methods typically depend on task-specific pipelines and expert models (e.g., segmentation and inpainting) to curate training data. In this work, we explore whether an in-context image editing model can be learned directly from videos. We introduce a scalable appro…

    Submitted 12 June, 2025; originally announced June 2025.

    Comments: Project page: https://vincie2025.github.io/

  10. arXiv:2506.10406  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier

    Authors: Yuhua Jiang, Yuwen Xiong, Yufeng Yuan, Chao Xin, Wenyuan Xu, Yu Yue, Qianchuan Zhao, Lin Yan

    Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks, yet they still struggle to reliably verify the correctness of their own outputs. Existing solutions to this verification challenge often depend on separate verifier models or require multi-stage self-correction training pipelines, which limit scalability. In this paper, we propose Policy as Generativ…

    Submitted 12 June, 2025; originally announced June 2025.

  11. arXiv:2506.07809  [pdf, ps, other]

    cs.CV

    Incorporating Uncertainty-Guided and Top-k Codebook Matching for Real-World Blind Image Super-Resolution

    Authors: Weilei Wen, Tianyi Zhang, Qianqian Zhao, Zhaohui Zheng, Chunle Guo, Xiuli Shao, Chongyi Li

    Abstract: Recent advancements in codebook-based real image super-resolution (SR) have shown promising results in real-world applications. The core idea involves matching high-quality image features from a codebook based on low-resolution (LR) image features. However, existing methods face two major challenges: inaccurate feature matching with the codebook and poor texture detail reconstruction. To address t…

    Submitted 9 June, 2025; originally announced June 2025.

  12. arXiv:2506.06787  [pdf, ps, other]

    cs.LG cs.AR

    FuncGNN: Learning Functional Semantics of Logic Circuits with Graph Neural Networks

    Authors: Qiyun Zhao

    Abstract: As integrated circuit scale grows and design complexity rises, effective circuit representation helps support logic synthesis, formal verification, and other automated processes in electronic design automation. And-Inverter Graphs (AIGs), as a compact and canonical structure, are widely adopted for representing Boolean logic in these workflows. However, the increasing complexity and integration de…

    Submitted 7 June, 2025; originally announced June 2025.

  13. arXiv:2506.06710  [pdf, ps, other]

    cs.CV eess.IV

    A Systematic Investigation on Deep Learning-Based Omnidirectional Image and Video Super-Resolution

    Authors: Qianqian Zhao, Chunle Guo, Tianyi Zhang, Junpei Zhang, Peiyang Jia, Tan Su, Wenjie Jiang, Chongyi Li

    Abstract: Omnidirectional image and video super-resolution is a crucial research topic in low-level vision, playing an essential role in virtual reality and augmented reality applications. Its goal is to reconstruct high-resolution images or video frames from low-resolution inputs, thereby enhancing detail preservation and enabling more accurate scene analysis and interpretation. In recent years, numerous i…

    Submitted 7 June, 2025; originally announced June 2025.

  14. arXiv:2506.04185  [pdf, ps, other]

    cs.CL

    R-Search: Empowering LLM Reasoning with Search via Multi-Reward Reinforcement Learning

    Authors: Qingfei Zhao, Ruobing Wang, Dingling Xu, Daren Zha, Limin Liu

    Abstract: Large language models (LLMs) have notably progressed in multi-step and long-chain reasoning. However, extending their reasoning capabilities to encompass deep interactions with search remains a non-trivial challenge, as models often fail to identify optimal reasoning-search interaction trajectories, resulting in suboptimal responses. We propose R-Search, a novel reinforcement learning framework fo…

    Submitted 4 June, 2025; originally announced June 2025.

    Comments: 16 pages, 3 figures

  15. arXiv:2506.01713  [pdf, ps, other]

    cs.CL

    SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning

    Authors: Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, Chaofan Tao, Yangfan He, Mi Zhang, Shen Yan

    Abstract: Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, yet still struggle with complex problems requiring explicit self-reflection and self-correction, especially compared to their unimodal text-based counterparts. Existing reflection methods are simplistic and struggle to generate meaningful and instructive feedback, as the reasoning ability and knowledge l…

    Submitted 20 June, 2025; v1 submitted 2 June, 2025; originally announced June 2025.

    Comments: Technical report

  16. arXiv:2505.24378  [pdf, other]

    cs.LG cs.AI

    Mastering Massive Multi-Task Reinforcement Learning via Mixture-of-Expert Decision Transformer

    Authors: Yilun Kong, Guozheng Ma, Qi Zhao, Haoyu Wang, Li Shen, Xueqian Wang, Dacheng Tao

    Abstract: Although recent advancements in offline multi-task reinforcement learning (MTRL) have harnessed the powerful capabilities of the Transformer architecture, most approaches focus on a limited number of tasks, with scaling to extremely massive tasks remaining a formidable challenge. In this paper, we first revisit the key impact of task numbers on current MTRL methods, and further reveal that naively e…

    Submitted 30 May, 2025; originally announced May 2025.

    Comments: ICML 2025

  17. arXiv:2505.23861  [pdf, ps, other]

    cs.LG cs.AI

    BiBLDR: Bidirectional Behavior Learning for Drug Repositioning

    Authors: Renye Zhang, Mengyun Yang, Qichang Zhao, Jianxin Wang

    Abstract: Drug repositioning aims to identify potential new indications for existing drugs to reduce the time and financial costs associated with developing new drugs. Most existing deep learning-based drug repositioning methods predominantly utilize graph-based representations. However, graph-based drug repositioning methods struggle to perform effective inference in cold-start scenarios involving novel dr…

    Submitted 29 May, 2025; originally announced May 2025.

  18. arXiv:2505.23537  [pdf, ps, other]

    cs.LG cs.CL

    Domain-Aware Tensor Network Structure Search

    Authors: Giorgos Iacovides, Wuyang Zhou, Chao Li, Qibin Zhao, Danilo Mandic

    Abstract: Tensor networks (TNs) provide efficient representations of high-dimensional data, yet identification of the optimal TN structures, the so-called tensor network structure search (TN-SS) problem, remains a challenge. Current state-of-the-art (SOTA) algorithms are computationally expensive as they require extensive function evaluations, which is prohibitive for real-world applications. In addition, e…

    Submitted 29 May, 2025; originally announced May 2025.

  19. arXiv:2505.22543  [pdf, ps, other]

    cs.CV cs.AI

    Scaling-up Perceptual Video Quality Assessment

    Authors: Ziheng Jia, Zicheng Zhang, Zeyu Zhang, Yingji Liang, Xiaorong Zhu, Chunyi Li, Jinliang Han, Haoning Wu, Bin Wang, Haoran Zhang, Guanyu Zhu, Qiyong Zhao, Xiaohong Liu, Guangtao Zhai, Xiongkuo Min

    Abstract: The data scaling law has been shown to significantly enhance the performance of large multi-modal models (LMMs) across various downstream tasks. However, in the domain of perceptual video quality assessment (VQA), the potential of scaling law remains unprecedented due to the scarcity of labeled resources and the insufficient scale of datasets. To address this, we propose OmniVQA, an effic…

    Submitted 28 May, 2025; originally announced May 2025.

  20. arXiv:2505.20297  [pdf, other]

    cs.CV cs.CL

    DiSA: Diffusion Step Annealing in Autoregressive Image Generation

    Authors: Qinyu Zhao, Jaskirat Singh, Ming Xu, Akshay Asthana, Stephen Gould, Liang Zheng

    Abstract: An increasing number of autoregressive models, such as MAR, FlowAR, xAR, and Harmon, adopt diffusion sampling to improve the quality of image generation. However, this strategy leads to low inference efficiency, because it usually takes 50 to 100 steps for diffusion to sample a token. This paper explores how to effectively address this issue. Our key motivation is that as more tokens are generated…

    Submitted 26 May, 2025; originally announced May 2025.

    Comments: Our code is available at https://github.com/Qinyu-Allen-Zhao/DiSA

  21. arXiv:2505.17524  [pdf, other]

    cs.CE

    Latent Imputation before Prediction: A New Computational Paradigm for De Novo Peptide Sequencing

    Authors: Ye Du, Chen Yang, Nanxi Yu, Wanyu Lin, Qian Zhao, Shujun Wang

    Abstract: De novo peptide sequencing is a fundamental computational technique for ascertaining amino acid sequences of peptides directly from tandem mass spectrometry data, eliminating the need for reference databases. Cutting-edge models usually encode the observed mass spectra into latent representations from which peptides are predicted autoregressively. However, the issue of missing fragmentation, attri…

    Submitted 23 May, 2025; originally announced May 2025.

    Comments: Accepted by ICML 2025

  22. MGPBD: A Multigrid Accelerated Global XPBD Solver

    Authors: Chunlei Li, Peng Yu, Tiantian Liu, Siyuan Yu, Yuting Xiao, Shuai Li, Aimin Hao, Yang Gao, Qinping Zhao

    Abstract: We introduce a novel Unsmoothed Aggregation (UA) Algebraic Multigrid (AMG) method combined with Preconditioned Conjugate Gradient (PCG) to overcome the limitations of Extended Position-Based Dynamics (XPBD) in high-resolution and high-stiffness simulations. While XPBD excels in simulating deformable objects due to its speed and simplicity, its nonlinear Gauss-Seidel (GS) solver often struggles wit…

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: SIGGRAPH 2025

    ACM Class: I.3.6

  23. arXiv:2505.12748  [pdf, ps, other]

    cs.RO cs.AI cs.CV

    TeleOpBench: A Simulator-Centric Benchmark for Dual-Arm Dexterous Teleoperation

    Authors: Hangyu Li, Qin Zhao, Haoran Xu, Xinyu Jiang, Qingwei Ben, Feiyu Jia, Haoyu Zhao, Liang Xu, Jia Zeng, Hanqing Wang, Bo Dai, Junting Dong, Jiangmiao Pang

    Abstract: Teleoperation is a cornerstone of embodied-robot learning, and bimanual dexterous teleoperation in particular provides rich demonstrations that are difficult to obtain with fully autonomous systems. While recent studies have proposed diverse hardware pipelines, ranging from inertial motion-capture gloves to exoskeletons and vision-based interfaces, there is still no unified benchmark that enables fa…

    Submitted 19 May, 2025; originally announced May 2025.

    Comments: 13 pages

  24. arXiv:2505.12045  [pdf, ps, other]

    cs.CV

    FIGhost: Fluorescent Ink-based Stealthy and Flexible Backdoor Attacks on Physical Traffic Sign Recognition

    Authors: Shuai Yuan, Guowen Xu, Hongwei Li, Rui Zhang, Xinyuan Qian, Wenbo Jiang, Hangcheng Cao, Qingchuan Zhao

    Abstract: Traffic sign recognition (TSR) systems are crucial for autonomous driving but are vulnerable to backdoor attacks. Existing physical backdoor attacks either lack stealth, provide inflexible attack control, or ignore emerging Vision-Large-Language-Models (VLMs). In this paper, we introduce FIGhost, the first physical-world backdoor attack leveraging fluorescent ink as triggers. Fluorescent triggers…

    Submitted 17 May, 2025; originally announced May 2025.

  25. arXiv:2505.11154  [pdf, other]

    cs.CR cs.CL

    MPMA: Preference Manipulation Attack Against Model Context Protocol

    Authors: Zihan Wang, Hongwei Li, Rui Zhang, Yu Liu, Wenbo Jiang, Wenshu Fan, Qingchuan Zhao, Guowen Xu

    Abstract: Model Context Protocol (MCP) standardizes interface mapping for large language models (LLMs) to access external data and tools, which revolutionizes the paradigm of tool selection and facilitates the rapid expansion of the LLM agent tool ecosystem. However, as the MCP is increasingly adopted, third-party customized versions of the MCP server expose potential security vulnerabilities. In this paper…

    Submitted 16 May, 2025; originally announced May 2025.

  26. arXiv:2505.07347  [pdf, other]

    cs.CV

    AI-Enabled Accurate Non-Invasive Assessment of Pulmonary Hypertension Progression via Multi-Modal Echocardiography

    Authors: Jiewen Yang, Taoran Huang, Shangwei Ding, Xiaowei Xu, Qinhua Zhao, Yong Jiang, Jiarong Guo, Bin Pu, Jiexuan Zheng, Caojin Zhang, Hongwen Fei, Xiaomeng Li

    Abstract: Echocardiographers can detect pulmonary hypertension using Doppler echocardiography; however, accurately assessing its progression often proves challenging. Right heart catheterization (RHC), the gold standard for precise evaluation, is invasive and unsuitable for routine use, limiting its practicality for timely diagnosis and monitoring of pulmonary hypertension progression. Here, we propose MePH…

    Submitted 12 May, 2025; originally announced May 2025.

  27. arXiv:2505.05473  [pdf, ps, other]

    cs.CV

    DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion

    Authors: Qitao Zhao, Amy Lin, Jeff Tan, Jason Y. Zhang, Deva Ramanan, Shubham Tulsiani

    Abstract: Current Structure-from-Motion (SfM) methods typically follow a two-stage pipeline, combining learned or geometric pairwise reasoning with a subsequent global optimization step. In contrast, we propose a data-driven multi-view reasoning approach that directly infers 3D scene geometry and camera poses from multi-view images. Our framework, DiffusionSfM, parameterizes scene geometry and cameras as pi…

    Submitted 8 May, 2025; originally announced May 2025.

    Comments: CVPR 2025. Project website: https://qitaozhao.github.io/DiffusionSfM

  28. arXiv:2505.05151  [pdf, ps, other]

    quant-ph cs.LG

    Overcoming Dimensional Factorization Limits in Discrete Diffusion Models through Quantum Joint Distribution Learning

    Authors: Chuangtao Chen, Qinglin Zhao, MengChu Zhou, Dusit Niyato, Zhimin He, Haozhen Situ

    Abstract: Discrete diffusion models represent a significant advance in generative modeling, demonstrating remarkable success in synthesizing complex, high-quality discrete data. However, to avoid exponential computational costs, they typically rely on calculating per-dimension transition probabilities when learning high-dimensional distributions. In this study, we rigorously prove that this approach leads t…

    Submitted 29 June, 2025; v1 submitted 8 May, 2025; originally announced May 2025.

    Comments: Comments are welcome

  29. arXiv:2505.05041  [pdf, other]

    eess.IV cs.CV

    ADNP-15: An Open-Source Histopathological Dataset for Neuritic Plaque Segmentation in Human Brain Whole Slide Images with Frequency Domain Image Enhancement for Stain Normalization

    Authors: Chenxi Zhao, Jianqiang Li, Qing Zhao, Jing Bai, Susana Boluda, Benoit Delatour, Lev Stimmer, Daniel Racoceanu, Gabriel Jimenez, Guanghui Fu

    Abstract: Alzheimer's Disease (AD) is a neurodegenerative disorder characterized by amyloid-beta plaques and tau neurofibrillary tangles, which serve as key histopathological features. The identification and segmentation of these lesions are crucial for understanding AD progression but remain challenging due to the lack of large-scale annotated datasets and the impact of staining variations on automated ima…

    Submitted 8 May, 2025; originally announced May 2025.

  30. arXiv:2505.04869  [pdf, other]

    cs.HC

    From First Draft to Final Insight: A Multi-Agent Approach for Feedback Generation

    Authors: Jie Cao, Chloe Qianhui Zhao, Xian Chen, Shuman Wang, Christian Schunn, Kenneth R. Koedinger, Jionghao Lin

    Abstract: Producing large volumes of high-quality, timely feedback poses significant challenges to instructors. To address this issue, automation technologies, particularly Large Language Models (LLMs), show great potential. However, current LLM-based research still shows room for improvement in terms of feedback quality. Our study proposed a multi-agent approach performing "generation, evaluation, and regene…

    Submitted 7 May, 2025; originally announced May 2025.

    Comments: 14 pages, to be published at the 26th International Conference on Artificial Intelligence in Education (AIED '25)

  31. arXiv:2505.04584  [pdf, other]

    cs.HC

    SlideItRight: Using AI to Find Relevant Slides and Provide Feedback for Open-Ended Questions

    Authors: Chloe Qianhui Zhao, Jie Cao, Eason Chen, Kenneth R. Koedinger, Jionghao Lin

    Abstract: Feedback is important in supporting student learning. While various automated feedback systems have been implemented to make the feedback scalable, many existing solutions only focus on generating text-based feedback. As indicated by the multimedia learning principle, learning with more modalities could help utilize more separate channels, reduce the cognitive load and facilitate students' lear…

    Submitted 7 May, 2025; originally announced May 2025.

    Comments: 14 pages, to be published at the 26th International Conference on Artificial Intelligence in Education (AIED '25)

  32. arXiv:2505.03501  [pdf, other]

    cs.CR cs.CL

    BadLingual: A Novel Lingual-Backdoor Attack against Large Language Models

    Authors: Zihan Wang, Hongwei Li, Rui Zhang, Wenbo Jiang, Kangjie Chen, Tianwei Zhang, Qingchuan Zhao, Guowen Xu

    Abstract: In this paper, we present a new form of backdoor attack against Large Language Models (LLMs): lingual-backdoor attacks. The key novelty of lingual-backdoor attacks is that the language itself serves as the trigger to hijack the infected LLMs to generate inflammatory speech. They enable the precise targeting of a specific language-speaking group, exacerbating racial discrimination by malicious enti…

    Submitted 6 May, 2025; originally announced May 2025.

  33. arXiv:2505.02094  [pdf, other]

    cs.LG cs.CV

    SkillMimic-V2: Learning Robust and Generalizable Interaction Skills from Sparse and Noisy Demonstrations

    Authors: Runyi Yu, Yinhuai Wang, Qihan Zhao, Hok Wai Tsui, Jingbo Wang, Ping Tan, Qifeng Chen

    Abstract: We address a fundamental challenge in Reinforcement Learning from Interaction Demonstration (RLID): demonstration noise and coverage limitations. While existing data collection approaches provide valuable interaction demonstrations, they often yield sparse, disconnected, and noisy trajectories that fail to capture the full spectrum of possible skill variations and transitions. Our key insight is t…

    Submitted 4 May, 2025; originally announced May 2025.

  34. arXiv:2504.21771  [pdf, other]

    cs.CV

    Anatomical Similarity as a New Metric to Evaluate Brain Generative Models

    Authors: Bahram Jafrasteh, Wei Peng, Cheng Wan, Yimin Luo, Ehsan Adeli, Qingyu Zhao

    Abstract: Generative models enhance neuroimaging through data augmentation, quality improvement, and rare condition studies. Despite advances in realistic synthetic MRIs, evaluations focus on texture and perception, lacking sensitivity to crucial anatomical fidelity. This study proposes a new metric, called WASABI (Wasserstein-Based Anatomical Brain Index), to assess the anatomical realism of synthetic brai…

    Submitted 30 April, 2025; originally announced April 2025.

  35. arXiv:2504.17332  [pdf, other]

    cs.CL

    Bridging Cognition and Emotion: Empathy-Driven Multimodal Misinformation Detection

    Authors: Zihan Wang, Lu Yuan, Zhengxuan Zhang, Qing Zhao

    Abstract: In the digital era, social media has become a major conduit for information dissemination, yet it also facilitates the rapid spread of misinformation. Traditional misinformation detection methods primarily focus on surface-level features, overlooking the crucial role of human empathy in the propagation process. To address this gap, we propose the Dual-Aspect Empathy Framework (DAE), which integra…

    Submitted 24 April, 2025; originally announced April 2025.

  36. arXiv:2504.15667  [pdf, other]

    eess.IV cs.CV

    Performance Estimation for Supervised Medical Image Segmentation Models on Unlabeled Data Using UniverSeg

    Authors: Jingchen Zou, Jianqiang Li, Gabriel Jimenez, Qing Zhao, Daniel Racoceanu, Matias Cosarinsky, Enzo Ferrante, Guanghui Fu

    Abstract: The performance of medical image segmentation models is usually evaluated using metrics like the Dice score and Hausdorff distance, which compare predicted masks to ground truth annotations. However, when applying the model to unseen data, such as in clinical settings, it is often impractical to annotate all the data, making the model's performance uncertain. To address this challenge, we propose…

    Submitted 22 April, 2025; originally announced April 2025.

  37. arXiv:2504.13882  [pdf, other]

    cs.HC cs.CL

    Toward Automated Qualitative Analysis: Leveraging Large Language Models for Tutoring Dialogue Evaluation

    Authors: Megan Gu, Chloe Qianhui Zhao, Claire Liu, Nikhil Patel, Jahnvi Shah, Jionghao Lin, Kenneth R. Koedinger

    Abstract: Our study introduces an automated system leveraging large language models (LLMs) to assess the effectiveness of five key tutoring strategies: 1. giving effective praise, 2. reacting to errors, 3. determining what students know, 4. helping students manage inequity, and 5. responding to negative self-talk. Using a public dataset from the Teacher-Student Chatroom Corpus, our system classifies each tu…

    Submitted 3 April, 2025; originally announced April 2025.

    Comments: Manuscript accepted to the Workshop on "From Data to Discovery: LLMs for Qualitative Analysis in Education" at LAK25

  38. arXiv:2504.08685  [pdf, other]

    cs.CV cs.AI

    Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model

    Authors: Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, Feng Cheng, Feilong Zuo, Xuejiao Zeng, Ziyan Yang, Fangyuan Kong, Meng Wei, Zhiwu Qing, Fei Xiao, Tuyen Hoang, Siyu Zhang, Peihao Zhu, Qi Zhao, Jiangqiao Yan, Liangke Gui, Sheng Bi, et al. (30 additional authors not shown)

    Abstract: This technical report presents a cost-efficient strategy for training a video generation foundation model. We present a mid-sized research model with approximately 7 billion parameters (7B) called Seaweed-7B trained from scratch using 665,000 H100 GPU hours. Despite being trained with moderate computational resources, Seaweed-7B demonstrates highly competitive performance compared to contemporary…

    Submitted 4 May, 2025; v1 submitted 11 April, 2025; originally announced April 2025.

    Comments: Technical report (some typos fixed)

  39. arXiv:2504.06982  [pdf, other]

    cs.CV

    SIGMAN: Scaling 3D Human Gaussian Generation with Millions of Assets

    Authors: Yuhang Yang, Fengqi Liu, Yixing Lu, Qin Zhao, Pingyu Wu, Wei Zhai, Ran Yi, Yang Cao, Lizhuang Ma, Zheng-Jun Zha, Junting Dong

    Abstract: 3D human digitization has long been a highly pursued yet challenging task. Existing methods aim to generate high-quality 3D digital humans from single or multiple views, but remain primarily constrained by current paradigms and the scarcity of 3D human assets. Specifically, recent approaches fall into several paradigms: optimization-based and feed-forward (both single-view regression and multi-vie…

    Submitted 9 April, 2025; originally announced April 2025.

    Comments: Project page: https://yyvhang.github.io/SIGMAN_3D/

  40. arXiv:2504.05720  [pdf, other]

    cs.CV

    QEMesh: Employing A Quadric Error Metrics-Based Representation for Mesh Generation

    Authors: Jiaqi Li, Ruowei Wang, Yu Liu, Qijun Zhao

    Abstract: Mesh generation plays a crucial role in 3D content creation, as mesh is widely used in various industrial applications. Recent works have achieved impressive results but still face several issues, such as unrealistic patterns or pits on surfaces, thin parts missing, and incomplete structures. Most of these problems stem from the choice of shape representation or the capabilities of the generative…

    Submitted 8 April, 2025; originally announced April 2025.

    Comments: Accepted by International Conference on Multimedia and Expo

  41. arXiv:2504.02730  [pdf, other]

    cs.CV cs.LG

    HQViT: Hybrid Quantum Vision Transformer for Image Classification

    Authors: Hui Zhang, Qinglin Zhao, Mengchu Zhou, Li Feng

    Abstract: Transformer-based architectures have revolutionized the landscape of deep learning. In the computer vision domain, Vision Transformer demonstrates remarkable performance on par with or even surpassing that of convolutional neural networks. However, the quadratic computational complexity of its self-attention mechanism poses challenges for classical computing, making model training with high-dimensiona…

    Submitted 3 April, 2025; originally announced April 2025.

    Comments: 13 pages, 8 figures

  42. arXiv:2504.01038  [pdf, other]

    eess.IV cs.CV cs.HC

    An Integrated AI-Enabled System Using One Class Twin Cross Learning (OCT-X) for Early Gastric Cancer Detection

    Authors: Xian-Xian Liu, Yuanyuan Wei, Mingkun Xu, Yongze Guo, Hongwei Zhang, Huicong Dong, Qun Song, Qi Zhao, Wei Luo, Feng Tien, Juntao Gao, Simon Fong

    Abstract: Early detection of gastric cancer, a leading cause of cancer-related mortality worldwide, remains hampered by the limitations of current diagnostic technologies, leading to high rates of misdiagnosis and missed diagnoses. To address these challenges, we propose an integrated system that synergizes advanced hardware and software technologies to balance speed-accuracy. Our study introduces the One C…

    Submitted 31 March, 2025; originally announced April 2025.

    Comments: 26 pages, 4 figures, 6 tables

  43. arXiv:2504.00375  [pdf, other]

    cs.CV

    CamoSAM2: Motion-Appearance Induced Auto-Refining Prompts for Video Camouflaged Object Detection

    Authors: Xin Zhang, Keren Fu, Qijun Zhao

    Abstract: The Segment Anything Model 2 (SAM2), a prompt-guided video foundation model, has remarkably performed in video object segmentation, drawing significant attention in the community. Due to the high similarity between camouflaged objects and their surroundings, which makes them difficult to distinguish even by the human eye, the application of SAM2 for automated segmentation in real-world scenarios f…

    Submitted 31 March, 2025; originally announced April 2025.

    Comments: 10 pages, 5 figures

  44. arXiv:2503.23748  [pdf, other]

    cs.CR cs.LG cs.SE

    THEMIS: Towards Practical Intellectual Property Protection for Post-Deployment On-Device Deep Learning Models

    Authors: Yujin Huang, Zhi Zhang, Qingchuan Zhao, Xingliang Yuan, Chunyang Chen

    Abstract: On-device deep learning (DL) has rapidly gained adoption in mobile apps, offering the benefits of offline model inference and user privacy preservation over cloud-based approaches. However, it inevitably stores models on user devices, introducing new vulnerabilities, particularly model-stealing attacks and intellectual property infringement. While system-level protections like Trusted Execution En…

    Submitted 31 March, 2025; originally announced March 2025.

    Comments: To Appear in the 34th USENIX Security Symposium, August 13-15, 2025

  45. arXiv:2503.23327  [pdf]

    cs.HC

    AI Delivers Creative Output but Struggles with Thinking Processes

    Authors: Man Zhang, Ying Li, Yang Peng, Yijia Sun, Wenxin Guo, Huiqing Hu, Shi Chen, Qingbai Zhao

    Abstract: A key objective in artificial intelligence (AI) development is to create systems that match or surpass human creativity. Although current AI models perform well across diverse creative tasks, it remains unclear whether these achievements reflect genuine creative thinking. This study examined whether AI models (GPT-3.5-turbo, GPT-4, and GPT-4o) engage in creative thinking by comparing their perform…

    Submitted 30 March, 2025; originally announced March 2025.

  46. arXiv:2503.22020  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

    Authors: Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Ming-Yu Liu, Donglai Xiang, Gordon Wetzstein, Tsung-Yi Lin

    Abstract: Vision-language-action models (VLAs) have shown potential in leveraging pretrained vision-language models and diverse robot demonstrations for learning generalizable sensorimotor control. While this paradigm effectively utilizes large-scale data from both robotic and non-robotic sources, current VLAs primarily focus on direct input-output mappings, lacking the intermediate reasoning steps crucial…

    Submitted 27 March, 2025; originally announced March 2025.

    Comments: Project website: https://cot-vla.github.io/

    Journal ref: CVPR 2025

  47. arXiv:2503.20822  [pdf, other]

    eess.IV cs.AI cs.GR

    Synthetic Video Enhances Physical Fidelity in Video Synthesis

    Authors: Qi Zhao, Xingyu Ni, Ziyu Wang, Feng Cheng, Ziyan Yang, Lu Jiang, Bohan Wang

    Abstract: We investigate how to enhance the physical fidelity of video generation models by leveraging synthetic videos derived from computer graphics pipelines. These rendered videos respect real-world physics, such as maintaining 3D consistency, and serve as a valuable resource that can potentially improve video generation models. To harness this potential, we propose a solution that curates and integrate…

    Submitted 25 March, 2025; originally announced March 2025.

  48. arXiv:2503.19427  [pdf, other]

    eess.IV cs.CV

    ASP-VMUNet: Atrous Shifted Parallel Vision Mamba U-Net for Skin Lesion Segmentation

    Authors: Muyi Bao, Shuchang Lyu, Zhaoyang Xu, Qi Zhao, Changyu Zeng, Wenpei Bai, Guangliang Cheng

    Abstract: Skin lesion segmentation is a critical challenge in computer vision, and it is essential to separate pathological features from healthy skin for diagnostics accurately. Traditional Convolutional Neural Networks (CNNs) are limited by narrow receptive fields, and Transformers face significant computational burdens. This paper presents a novel skin lesion segmentation framework, the Atrous Shifted Pa…

    Submitted 25 March, 2025; originally announced March 2025.

  49. arXiv:2503.19002  [pdf, other]

    quant-ph cs.LG

    Quantum Complex-Valued Self-Attention Model

    Authors: Fu Chen, Qinglin Zhao, Li Feng, Longfei Tang, Yangbin Lin, Haitao Huang

    Abstract: Self-attention has revolutionized classical machine learning, yet existing quantum self-attention models underutilize quantum states' potential due to oversimplified or incomplete mechanisms. To address this limitation, we introduce the Quantum Complex-Valued Self-Attention Model (QCSAM), the first framework to leverage complex-valued similarities, which captures amplitude and phase relationships…

    Submitted 7 April, 2025; v1 submitted 24 March, 2025; originally announced March 2025.

  50. arXiv:2503.14521  [pdf, other]

    cs.CY cs.AI cs.CL

    Policy Frameworks for Transparent Chain-of-Thought Reasoning in Large Language Models

    Authors: Yihang Chen, Haikang Deng, Kaiqiao Han, Qingyue Zhao

    Abstract: Chain-of-Thought (CoT) reasoning enhances large language models (LLMs) by decomposing complex problems into step-by-step solutions, improving performance on reasoning tasks. However, current CoT disclosure policies vary widely across different models in frontend visibility, API access, and pricing strategies, lacking a unified policy framework. This paper analyzes the dual-edged implications of fu…

    Submitted 14 March, 2025; originally announced March 2025.