
Showing 1–50 of 535 results for author: Yuqing

Searching in archive cs.
  1. arXiv:2506.15610  [pdf, ps, other]

    cs.CV

    BoxFusion: Reconstruction-Free Open-Vocabulary 3D Object Detection via Real-Time Multi-View Box Fusion

    Authors: Yuqing Lan, Chenyang Zhu, Zhirui Gao, Jiazhao Zhang, Yihan Cao, Renjiao Yi, Yijie Wang, Kai Xu

    Abstract: Open-vocabulary 3D object detection has gained significant interest due to its critical applications in autonomous driving and embodied AI. Existing detection methods, whether offline or online, typically rely on dense point cloud reconstruction, which imposes substantial computational overhead and memory constraints, hindering real-time deployment in downstream tasks. To address this, we propose…

    Submitted 18 June, 2025; originally announced June 2025.

    Comments: 11 pages, 6 figures

  2. arXiv:2506.14697  [pdf, ps, other]

    cs.CR cs.RO

    AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions

    Authors: Aishan Liu, Zonghao Ying, Le Wang, Junjie Mu, Jinyang Guo, Jiakai Wang, Yuqing Ma, Siyuan Liang, Mingchuan Zhang, Xianglong Liu, Dacheng Tao

    Abstract: The rapid advancement of vision-language models (VLMs) and their integration into embodied agents have unlocked powerful capabilities for decision-making. However, as these systems are increasingly deployed in real-world environments, they face mounting safety concerns, particularly when responding to hazardous instructions. In this work, we propose AGENTSAFE, the first comprehensive benchmark for…

    Submitted 17 June, 2025; originally announced June 2025.

    Comments: 11 pages

  3. arXiv:2506.13679  [pdf, ps, other]

    cs.RO cs.AI cs.CV

    ROSA: Harnessing Robot States for Vision-Language and Action Alignment

    Authors: Yuqing Wen, Kefan Gu, Haoxuan Liu, Yucheng Zhao, Tiancai Wang, Haoqiang Fan, Xiaoyan Sun

    Abstract: Vision-Language-Action (VLA) models have recently made significant advances in multi-task, end-to-end robotic control, due to the strong generalization capabilities of Vision-Language Models (VLMs). A fundamental challenge in developing such models is effectively aligning the vision-language space with the robotic action space. Existing approaches typically rely on directly fine-tuning VLMs using e…

    Submitted 16 June, 2025; originally announced June 2025.

  4. arXiv:2506.13523  [pdf, ps, other]

    cs.LG cs.AI

    The Price of Freedom: Exploring Expressivity and Runtime Tradeoffs in Equivariant Tensor Products

    Authors: YuQing Xie, Ameya Daigavane, Mit Kotak, Tess Smidt

    Abstract: $E(3)$-equivariant neural networks have demonstrated success across a wide range of 3D modelling tasks. A fundamental operation in these networks is the tensor product, which interacts two geometric features in an equivariant manner to create new features. Due to the high computational complexity of the tensor product, significant effort has been invested to optimize the runtime of this operation.…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: 27 pages, 10 figures, ICML 2025

  5. arXiv:2506.13270  [pdf, ps, other]

    cs.HC

    Screen Reader Users in the Vibe Coding Era: Adaptation, Empowerment, and New Accessibility Landscape

    Authors: Nan Chen, Luna K. Qiu, Arran Zeyu Wang, Zilong Wang, Yuqing Yang

    Abstract: The rise of generative AI agents has reshaped human-computer interaction and computer-supported cooperative work by shifting users' roles from direct task execution to supervising machine-driven actions, especially in programming (e.g., "vibe coding"). However, there is limited understanding of how screen reader users engage with these systems in practice. To address this gap, we conducted a longi…

    Submitted 16 June, 2025; originally announced June 2025.

  6. arXiv:2506.12707  [pdf, ps, other]

    cs.CR cs.CL

    SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression

    Authors: Yucheng Li, Surin Ahn, Huiqiang Jiang, Amir H. Abdi, Yuqing Yang, Lili Qiu

    Abstract: Large language models (LLMs) have achieved widespread adoption across numerous applications. However, many LLMs are vulnerable to malicious attacks even after safety alignment. These attacks typically bypass LLMs' safety guardrails by wrapping the original malicious instructions inside adversarial jailbreak prompts. Previous research has proposed methods such as adversarial training and prompt re…

    Submitted 14 June, 2025; originally announced June 2025.

  7. arXiv:2506.09989  [pdf, ps, other]

    cs.CV

    Hearing Hands: Generating Sounds from Physical Interactions in 3D Scenes

    Authors: Yiming Dou, Wonseok Oh, Yuqing Luo, Antonio Loquercio, Andrew Owens

    Abstract: We study the problem of making 3D scene reconstructions interactive by asking the following question: can we predict the sounds of human hands physically interacting with a scene? First, we record a video of a human manipulating objects within a 3D scene using their hands. We then use these action-sound pairs to train a rectified flow model to map 3D hand trajectories to their corresponding audio.…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: CVPR 2025, Project page: https://www.yimingdou.com/hearing_hands/ , Code: https://github.com/Dou-Yiming/hearing_hands/

  8. arXiv:2506.09968  [pdf, ps, other]

    cs.HC

    SRLAgent: Enhancing Self-Regulated Learning Skills through Gamification and LLM Assistance

    Authors: Wentao Ge, Yuqing Sun, Ziyan Wang, Haoyue Zheng, Weiyang He, Piaohong Wang, Qianyu Zhu, Benyou Wang

    Abstract: Self-regulated learning (SRL) is crucial for college students navigating increased academic demands and independence. Insufficient SRL skills can lead to disorganized study habits, low motivation, and poor time management, undermining learners' ability to thrive in challenging environments. Through a formative study involving 59 college students, we identified key challenges students face in develo…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: 14 pages

    ACM Class: I.2.1; I.2.6

  9. arXiv:2506.08889  [pdf, ps, other]

    cs.LG cs.AI

    SeerAttention-R: Sparse Attention Adaptation for Long Reasoning

    Authors: Yizhao Gao, Shuming Guo, Shijie Cao, Yuqing Xia, Yu Cheng, Lei Wang, Lingxiao Ma, Yutao Sun, Tianzhu Ye, Li Dong, Hayden Kwok-Hay So, Yu Hua, Ting Cao, Fan Yang, Mao Yang

    Abstract: We introduce SeerAttention-R, a sparse attention framework specifically tailored for the long decoding of reasoning models. Extended from SeerAttention, SeerAttention-R retains the design of learning attention sparsity through a self-distilled gating mechanism, while removing query pooling to accommodate auto-regressive decoding. With a lightweight plug-in gating, SeerAttention-R is flexible and c…

    Submitted 10 June, 2025; originally announced June 2025.

  10. arXiv:2506.06913  [pdf, ps, other]

    cs.IR

    OneSug: The Unified End-to-End Generative Framework for E-commerce Query Suggestion

    Authors: Xian Guo, Ben Chen, Siyuan Wang, Ying Yang, Chenyi Lei, Yuqing Ding, Han Li

    Abstract: Query suggestion plays a crucial role in enhancing user experience in e-commerce search systems by providing relevant query recommendations that align with users' initial input. This module helps users navigate towards personalized preference needs and reduces typing effort, thereby improving search experience. Traditional query suggestion modules usually adopt multi-stage cascading architectures,…

    Submitted 7 June, 2025; originally announced June 2025.

    Comments: 11 pages, 8 figures, and 6 tables

  11. arXiv:2506.04108  [pdf, ps, other]

    cs.CL

    Rectified Sparse Attention

    Authors: Yutao Sun, Tianzhu Ye, Li Dong, Yuqing Xia, Jian Chen, Yizhao Gao, Shijie Cao, Jianyong Wang, Furu Wei

    Abstract: Efficient long-sequence generation is a critical challenge for Large Language Models. While recent sparse decoding methods improve efficiency, they suffer from KV cache misalignment, where approximation errors accumulate and degrade generation quality. In this work, we propose Rectified Sparse Attention (ReSA), a simple yet effective method that combines block-sparse attention with periodic dense…

    Submitted 5 June, 2025; v1 submitted 4 June, 2025; originally announced June 2025.

  12. arXiv:2506.02269  [pdf, ps, other]

    cs.LG cs.AI

    A Tale of Two Symmetries: Exploring the Loss Landscape of Equivariant Models

    Authors: YuQing Xie, Tess Smidt

    Abstract: Equivariant neural networks have proven to be effective for tasks with known underlying symmetries. However, optimizing equivariant networks can be tricky and best training practices are less established than for standard networks. In particular, recent works have found small training benefits from relaxing equivariance constraints. This raises the question: do equivariance constraints introduce f…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: 23 pages, 13 figures

  13. arXiv:2506.01285  [pdf, ps, other]

    cs.GT

    A Reliable Vertical Federated Learning Framework for Traffic State Estimation with Data Selection and Incentive Mechanisms

    Authors: Zijun Zhan, Yaxian Dong, Daniel Mawunyo Doe, Yuqing Hu, Shuai Li, Shaohua Cao, Zhu Han

    Abstract: Vertical Federated Learning (VFL)-based Traffic State Estimation (TSE) offers a promising approach for integrating vertically distributed traffic data from municipal authorities (MA) and mobility providers (MP) while safeguarding privacy. However, given the variations in MPs' data collection capabilities and the potential for MPs to underperform in data provision, we propose a reliable VFL-based T…

    Submitted 1 June, 2025; originally announced June 2025.

    Comments: Submitted to the IEEE Transactions on Intelligent Transportation Systems

  14. arXiv:2506.00955  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Leveraging Large Language Models for Sarcastic Speech Annotation in Sarcasm Detection

    Authors: Zhu Li, Yuqing Zhang, Xiyuan Gao, Shekhar Nayak, Matt Coler

    Abstract: Sarcasm fundamentally alters meaning through tone and context, yet detecting it in speech remains a challenge due to data scarcity. In addition, existing detection systems often rely on multimodal data, limiting their applicability in contexts where only speech is available. To address this, we propose an annotation pipeline that leverages large language models (LLMs) to generate a sarcasm dataset…

    Submitted 1 June, 2025; originally announced June 2025.

    Comments: Accepted to Interspeech 2025

  15. arXiv:2505.24875  [pdf, ps, other]

    cs.CV cs.CL

    ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL

    Authors: Yu Zhang, Yunqi Li, Yifan Yang, Rui Wang, Yuqing Yang, Dai Qi, Jianmin Bao, Dongdong Chen, Chong Luo, Lili Qiu

    Abstract: Although chain-of-thought reasoning and reinforcement learning (RL) have driven breakthroughs in NLP, their integration into generative vision models remains underexplored. We introduce ReasonGen-R1, a two-stage framework that first imbues an autoregressive image generator with explicit text-based "thinking" skills via supervised fine-tuning on a newly generated reasoning dataset of written ration…

    Submitted 5 June, 2025; v1 submitted 30 May, 2025; originally announced May 2025.

  16. arXiv:2505.24737  [pdf, ps, other]

    cs.LG stat.ML

    Adapting to Linear Separable Subsets with Large-Margin in Differentially Private Learning

    Authors: Erchi Wang, Yuqing Zhu, Yu-Xiang Wang

    Abstract: This paper studies the problem of differentially private empirical risk minimization (DP-ERM) for binary linear classification. We obtain an efficient $(\varepsilon,\delta)$-DP algorithm with an empirical zero-one risk bound of $\tilde{O}\left(\frac{1}{\gamma^2\varepsilon n} + \frac{|S_{\mathrm{out}}|}{\gamma n}\right)$ where $n$ is the number of data points, $S_{\mathrm{out}}$ is an arbitrary subset of data one…

    Submitted 30 May, 2025; originally announced May 2025.

  17. arXiv:2505.20773  [pdf, ps, other]

    cs.IR

    Cold-Start Recommendation with Knowledge-Guided Retrieval-Augmented Generation

    Authors: Wooseong Yang, Weizhi Zhang, Yuqing Liu, Yuwei Han, Yu Wang, Junhyun Lee, Philip S. Yu

    Abstract: Cold-start items remain a persistent challenge in recommender systems due to their lack of historical user interactions, which collaborative models rely on. While recent zero-shot methods leverage large language models (LLMs) to address this, they often struggle with sparse metadata and hallucinated or incomplete knowledge. We propose ColdRAG, a retrieval-augmented generation approach that builds…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: 10 pages

    MSC Class: 68T05

  18. arXiv:2505.17665  [pdf]

    cs.CV cs.AI

    EMRA-proxy: Enhancing Multi-Class Region Semantic Segmentation in Remote Sensing Images with Attention Proxy

    Authors: Yichun Yu, Yuqing Lan, Zhihuan Xing, Xiaoyi Yang, Tingyue Tang, Dan Yu

    Abstract: High-resolution remote sensing (HRRS) image segmentation is challenging due to complex spatial layouts and diverse object appearances. While CNNs excel at capturing local features, they struggle with long-range dependencies, whereas Transformers can model global context but often neglect local details and are computationally expensive. We propose a novel approach, Region-Aware Proxy Network (RAPNet…

    Submitted 23 May, 2025; originally announced May 2025.

    Comments: Proceedings of the 20th International Conference on Intelligent Computing (ICIC 2024): Poster Volume I. Tianjin, China, 2024: 538-562

  19. arXiv:2505.17022  [pdf, ps, other]

    cs.CV cs.AI cs.CL cs.LG cs.MM

    GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

    Authors: Chengqi Duan, Rongyao Fang, Yuqing Wang, Kun Wang, Linjiang Huang, Xingyu Zeng, Hongsheng Li, Xihui Liu

    Abstract: Visual generation models have made remarkable progress in creating realistic images from text prompts, yet struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. Effective handling of such prompts requires explicit reasoning about the semantic content and spatial layout. We present GoT-R1, a framework that applies reinforcement learning to en…

    Submitted 22 May, 2025; originally announced May 2025.

    Comments: GitHub page: https://github.com/gogoduan/GoT-R1

  20. arXiv:2505.16170  [pdf, ps, other]

    cs.CL

    When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction

    Authors: Yuqing Yang, Robin Jia

    Abstract: Can large language models (LLMs) admit their mistakes when they should know better? In this work, we define the behavior of acknowledging errors in previously generated answers as "retraction" and aim to understand when and why LLMs choose to retract. We first construct model-specific datasets to evaluate whether a model will retract an incorrect answer that contradicts its own parametric knowledg…

    Submitted 27 May, 2025; v1 submitted 21 May, 2025; originally announced May 2025.

    Comments: Fixed typos

  21. arXiv:2505.14351  [pdf, ps, other]

    cs.SD cs.AI cs.CL eess.AS

    FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation

    Authors: Yutong Liu, Ziyue Zhang, Ban Ma-bao, Yuqing Cai, Yongbin Yu, Renzeng Duojie, Xiangxiang Wang, Fan Gao, Cheng Huang, Nyima Tashi

    Abstract: Tibetan is a low-resource language with minimal parallel speech corpora spanning its three major dialects (Ü-Tsang, Amdo, and Kham), limiting progress in speech modeling. To address this issue, we propose FMSD-TTS, a few-shot, multi-speaker, multi-dialect text-to-speech framework that synthesizes parallel dialectal speech from limited reference audio and explicit dialect labels. Our method features a…

    Submitted 20 May, 2025; originally announced May 2025.

    Comments: 13 pages

  22. arXiv:2505.13950  [pdf, other]

    cs.IR

    Benchmarking the Myopic Trap: Positional Bias in Information Retrieval

    Authors: Ziyang Zeng, Dun Zhang, Jiacheng Li, Panxiang Zou, Yuqing Yang

    Abstract: This study investigates a specific form of positional bias, termed the Myopic Trap, where retrieval models disproportionately attend to the early parts of documents while overlooking relevant information that appears later. To systematically quantify this phenomenon, we propose a semantics-preserving evaluation framework that repurposes the existing NLP datasets into position-aware retrieval bench…

    Submitted 20 May, 2025; originally announced May 2025.

    Comments: 10 pages, 3 figures, 4 tables. Under review

  23. arXiv:2505.13827  [pdf, ps, other]

    cs.GT

    A Sequence-Form Characterization and Differentiable Path-Following Computation of Normal-Form Perfect Equilibria in Extensive-Form Games

    Authors: Yuqing Hou, Yiyin Cao, Chuangyin Dang

    Abstract: The sequence form, owing to its compact and holistic strategy representation, has demonstrated significant efficiency in computing normal-form perfect equilibria for two-player extensive-form games with perfect recall. Nevertheless, the examination of $n$-player games remains underexplored. To tackle this challenge, we present a sequence-form characterization of normal-form perfect equilibria for…

    Submitted 19 May, 2025; originally announced May 2025.

  24. arXiv:2505.11820  [pdf, other]

    cs.CL

    Chain-of-Model Learning for Language Model

    Authors: Kaitao Song, Xiaohua Wang, Xu Tan, Huiqiang Jiang, Chengruidong Zhang, Yongliang Shen, Cen LU, Zihao Li, Zifan Song, Caihua Shan, Yansen Wang, Kan Ren, Xiaoqing Zheng, Tao Qin, Yuqing Yang, Dongsheng Li, Lili Qiu

    Abstract: In this paper, we propose a novel learning paradigm, termed Chain-of-Model (CoM), which incorporates the causal relationship into the hidden states of each layer as a chain style, thereby introducing great scaling efficiency in model training and inference flexibility in deployment. We introduce the concept of Chain-of-Representation (CoR), which formulates the hidden states at each layer as a com…

    Submitted 23 May, 2025; v1 submitted 17 May, 2025; originally announced May 2025.

  25. arXiv:2505.10151  [pdf, other]

    cs.RO

    Training People to Reward Robots

    Authors: Endong Sun, Yuqing Zhu, Matthew Howard

    Abstract: Learning from demonstration (LfD) is a technique that allows expert teachers to teach task-oriented skills to robotic systems. However, the most effective way of guiding novice teachers to approach expert-level demonstrations quantitatively for specific teaching tasks remains an open question. To this end, this paper investigates the use of machine teaching (MT) to guide novice teachers to improve…

    Submitted 15 May, 2025; originally announced May 2025.

    Comments: 6 pages

  26. arXiv:2505.09343  [pdf, ps, other]

    cs.DC cs.AI cs.AR

    Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures

    Authors: Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Huazuo Gao, Jiashi Li, Liyue Zhang, Panpan Huang, Shangyan Zhou, Shirong Ma, Wenfeng Liang, Ying He, Yuqing Wang, Yuxuan Liu, Y. X. Wei

    Abstract: The rapid scaling of large language models (LLMs) has unveiled critical limitations in current hardware architectures, including constraints in memory capacity, computational efficiency, and interconnection bandwidth. DeepSeek-V3, trained on 2,048 NVIDIA H800 GPUs, demonstrates how hardware-aware model co-design can effectively address these challenges, enabling cost-efficient training and inferen…

    Submitted 14 May, 2025; originally announced May 2025.

    Comments: This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive version will appear as part of the Industry Track in Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA '25)

  27. arXiv:2505.07090  [pdf, other]

    cs.LG

    Physics-informed Multiple-Input Operators for efficient dynamic response prediction of structures

    Authors: Bilal Ahmed, Yuqing Qiu, Diab W. Abueidda, Waleed El-Sekelly, Tarek Abdoun, Mostafa E. Mobasher

    Abstract: Finite element (FE) modeling is essential for structural analysis but remains computationally intensive, especially under dynamic loading. While operator learning models have shown promise in replicating static structural responses at FEM level accuracy, modeling dynamic behavior remains more challenging. This work presents a Multiple Input Operator Network (MIONet) that incorporates a second trun…

    Submitted 11 May, 2025; originally announced May 2025.

  28. arXiv:2505.06678  [pdf, other]

    cs.NI eess.SP

    Distributionally Robust Contract Theory for Edge AIGC Services in Teleoperation

    Authors: Zijun Zhan, Yaxian Dong, Daniel Mawunyo Doe, Yuqing Hu, Shuai Li, Shaohua Cao, Lei Fan, Zhu Han

    Abstract: Advanced AI-Generated Content (AIGC) technologies have injected new impetus into teleoperation, further enhancing its security and efficiency. Edge AIGC networks have been introduced to meet the stringent low-latency requirements of teleoperation. However, the inherent uncertainty of AIGC service quality and the need to incentivize AIGC service providers (ASPs) make the design of a robust incentiv…

    Submitted 10 May, 2025; originally announced May 2025.

  29. arXiv:2505.04141  [pdf, other]

    cs.RO

    NAMO-LLM: Efficient Navigation Among Movable Obstacles with Large Language Model Guidance

    Authors: Yuqing Zhang, Yiannis Kantaros

    Abstract: Several planners have been proposed to compute robot paths that reach desired goal regions while avoiding obstacles. However, these methods fail when all pathways to the goal are blocked. In such cases, the robot must reason about how to reconfigure the environment to access task-relevant regions - a problem known as Navigation Among Movable Objects (NAMO). While various solutions to this problem…

    Submitted 7 May, 2025; originally announced May 2025.

    Comments: 10 pages, 6 figures

  30. arXiv:2505.02922  [pdf, ps, other]

    cs.LG

    RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference

    Authors: Yaoqi Chen, Jinkai Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, Jingjia Luo, Di Liu, Huiqiang Jiang, Qi Chen, Jing Liu, Bailu Ding, Xiao Yan, Jiawei Jiang, Chen Chen, Mingxing Zhang, Yuqing Yang, Fan Yang, Mao Yang

    Abstract: The growing context lengths of large language models (LLMs) pose significant challenges for efficient inference, primarily due to GPU memory and bandwidth constraints. We present RetroInfer, a novel system that reconceptualizes the key-value (KV) cache as a vector storage system which exploits the inherent attention sparsity to accelerate long-context LLM inference. At its core is the wave index,…

    Submitted 5 May, 2025; originally announced May 2025.

    Comments: 16 pages

  31. arXiv:2505.01950  [pdf, other]

    cs.CV cs.AI

    Segment Any RGB-Thermal Model with Language-aided Distillation

    Authors: Dong Xing, Xianxun Zhu, Wei Zhou, Qika Lin, Hang Yang, Yuqing Wang

    Abstract: The recent Segment Anything Model (SAM) demonstrates strong instance segmentation performance across various downstream tasks. However, SAM is trained solely on RGB data, limiting its direct applicability to RGB-thermal (RGB-T) semantic segmentation. Given that RGB-T provides a robust solution for scene understanding in adverse weather and lighting conditions, such as low light and overexposure, w…

    Submitted 3 May, 2025; originally announced May 2025.

    Comments: arXiv admin note: text overlap with arXiv:2412.04220 by other authors

  32. arXiv:2505.01831  [pdf, other]

    eess.IV cs.CV

    Multi-Scale Target-Aware Representation Learning for Fundus Image Enhancement

    Authors: Haofan Wu, Yin Huang, Yuqing Wu, Qiuyu Yang, Bingfang Wang, Li Zhang, Muhammad Fahadullah Khan, Ali Zia, M. Saleh Memon, Syed Sohail Bukhari, Abdul Fattah Memon, Daizong Ji, Ya Zhang, Ghulam Mustafa, Yin Fang

    Abstract: High-quality fundus images provide essential anatomical information for clinical screening and ophthalmic disease diagnosis. Yet, due to hardware limitations, operational variability, and patient compliance, fundus images often suffer from low resolution and signal-to-noise ratio. Recent years have witnessed promising progress in fundus image enhancement. However, existing works usually focus on r…

    Submitted 3 May, 2025; originally announced May 2025.

    Comments: Under review at Neural Networks

  33. arXiv:2505.00742  [pdf, other]

    cs.CV cs.AI eess.IV

    Zoomer: Adaptive Image Focus Optimization for Black-box MLLM

    Authors: Jiaxu Qian, Chendong Wang, Yifan Yang, Chaoyun Zhang, Huiqiang Jiang, Xufang Luo, Yu Kang, Qingwei Lin, Anlan Zhang, Shiqi Jiang, Ting Cao, Tianjun Mao, Suman Banerjee, Guyue Liu, Saravan Rajmohan, Dongmei Zhang, Yuqing Yang, Qi Zhang, Lili Qiu

    Abstract: Recent advancements in multimodal large language models (MLLMs) have broadened the scope of vision-language tasks, excelling in applications like image captioning and interactive question-answering. However, these models struggle with accurately processing visual data, particularly in tasks requiring precise object recognition and fine visual details. Stringent token limits often result in the omi…

    Submitted 29 April, 2025; originally announced May 2025.

  34. arXiv:2505.00254  [pdf, other]

    cs.CV cs.AI

    Empowering Agentic Video Analytics Systems with Video Language Models

    Authors: Yuxuan Yan, Shiqi Jiang, Ting Cao, Yifan Yang, Qianqian Yang, Yuanchao Shu, Yuqing Yang, Lili Qiu

    Abstract: AI-driven video analytics has become increasingly pivotal across diverse domains. However, existing systems are often constrained to specific, predefined tasks, limiting their adaptability in open-ended analytical scenarios. The recent emergence of Video-Language Models (VLMs) as transformative technologies offers significant potential for enabling open-ended video understanding, reasoning, and an…

    Submitted 16 May, 2025; v1 submitted 30 April, 2025; originally announced May 2025.

    Comments: 15 pages, AVAS, add latency breakdown

  35. arXiv:2504.17577  [pdf, other]

    cs.LG

    TileLang: A Composable Tiled Programming Model for AI Systems

    Authors: Lei Wang, Yu Cheng, Yining Shi, Zhengju Tang, Zhiwen Mo, Wenhao Xie, Lingxiao Ma, Yuqing Xia, Jilong Xue, Fan Yang, Zhi Yang

    Abstract: Modern AI workloads rely heavily on optimized computing kernels for both training and inference. These AI kernels follow well-defined data-flow patterns, such as moving tiles between DRAM and SRAM and performing a sequence of computations on those tiles. However, writing high-performance kernels remains complex despite the clarity of these patterns. Achieving peak performance requires careful, har…

    Submitted 27 April, 2025; v1 submitted 24 April, 2025; originally announced April 2025.

  36. arXiv:2504.16083  [pdf, other]

    cs.CV cs.LG

    MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention

    Authors: Yucheng Li, Huiqiang Jiang, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu

    Abstract: The integration of long-context capabilities with visual understanding unlocks unprecedented potential for Vision Language Models (VLMs). However, the quadratic attention complexity during the pre-filling phase remains a significant obstacle to real-world deployment. To overcome this limitation, we introduce MMInference (Multimodality Million tokens Inference), a dynamic sparse attention method th…

    Submitted 23 May, 2025; v1 submitted 22 April, 2025; originally announced April 2025.

    Comments: Accepted at ICML 2025

  37. arXiv:2504.14582  [pdf, other]

    cs.CV

    NTIRE 2025 Challenge on Image Super-Resolution ($\times$4): Methods and Results

    Authors: Zheng Chen, Kai Liu, Jue Gong, Jingkai Wang, Lei Sun, Zongwei Wu, Radu Timofte, Yulun Zhang, Xiangyu Kong, Xiaoxuan Yu, Hyunhee Park, Suejin Han, Hakjae Jeon, Dafeng Zhang, Hyung-Ju Chun, Donghun Ryou, Inju Ha, Bohyung Han, Lu Zhao, Yuyi Zhang, Pengyu Yan, Jiawei Hu, Pengwei Liu, Fengjun Guo, Hongyuan Yu, et al. (86 additional authors not shown)

    Abstract: This paper presents the NTIRE 2025 image super-resolution ($\times$4) challenge, one of the associated competitions of the 10th NTIRE Workshop at CVPR 2025. The challenge aims to recover high-resolution (HR) images from low-resolution (LR) counterparts generated through bicubic downsampling with a $\times$4 scaling factor. The objective is to develop effective network designs or solutions that ach…

    Submitted 28 April, 2025; v1 submitted 20 April, 2025; originally announced April 2025.

    Comments: NTIRE 2025 webpage: https://www.cvlai.net/ntire/2025. Code: https://github.com/zhengchen1999/NTIRE2025_ImageSR_x4

  38. arXiv:2504.13978  [pdf]

    q-bio.QM cs.LG

    Association between nutritional factors, inflammatory biomarkers and cancer types: an analysis of NHANES data using machine learning

    Authors: Yuqing Liu, Meng Zhao, Guanlan Hu, Yuchen Zhang

    Abstract: Background. Diet and inflammation are critical factors influencing cancer risk. However, the combined impact of nutritional status and inflammatory biomarkers on cancer status and type, using machine learning (ML), remains underexplored. Objectives. This study investigates the association between nutritional factors, inflammatory biomarkers, and cancer status, and whether these relationships dif…

    Submitted 17 April, 2025; originally announced April 2025.

  39. arXiv:2504.12067  [pdf, other]

    cs.SE cs.DC cs.NI

    LO2: Microservice API Anomaly Dataset of Logs and Metrics

    Authors: Alexander Bakhtin, Jesse Nyyssölä, Yuqing Wang, Noman Ahmad, Ke Ping, Matteo Esposito, Mika Mäntylä, Davide Taibi

    Abstract: Context. Microservice-based systems have gained significant attention over the past years. A critical factor for understanding and analyzing the behavior of these systems is the collection of monitoring data such as logs, metrics, and traces. These data modalities can be used for anomaly detection and root cause analysis of failures. In particular, multi-modal methods utilizing several types of th…

    Submitted 16 April, 2025; originally announced April 2025.

  40. arXiv:2504.10686  [pdf, other]

    cs.CV eess.IV

    The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report

    Authors: Bin Ren, Hang Guo, Lei Sun, Zongwei Wu, Radu Timofte, Yawei Li, Yao Zhang, Xinning Chai, Zhengxue Cheng, Yingsheng Qin, Yucai Yang, Li Song, Hongyuan Yu, Pufan Xu, Cheng Wan, Zhijuan Huang, Peng Guo, Shuyuan Cui, Chenjun Li, Xuehai Hu, Pan Pan, Xin Zhang, Heng Zhang, Qing Luo, Linyan Jiang, et al. (122 additional authors not shown)

    Abstract: This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the $\operatorname{DIV2K\_LSDIR\_valid}$ dataset and 26.99 dB on the…

    Submitted 14 April, 2025; originally announced April 2025.

    Comments: Accepted by CVPR2025 NTIRE Workshop, Efficient Super-Resolution Challenge Report. 50 pages

  41. arXiv:2504.10685  [pdf, other

    cs.CV cs.AI

    NTIRE 2025 Challenge on Cross-Domain Few-Shot Object Detection: Methods and Results

    Authors: Yuqian Fu, Xingyu Qiu, Bin Ren, Yanwei Fu, Radu Timofte, Nicu Sebe, Ming-Hsuan Yang, Luc Van Gool, Kaijin Zhang, Qingpeng Nong, Xiugang Dong, Hong Gao, Xiangsheng Zhou, Jiancheng Pan, Yanxing Liu, Xiao He, Jiahao Li, Yuze Sun, Xiaomeng Huang, Zhenyu Zhang, Ran Ma, Yuhan Liu, Zijian Zhuang, Shuai Yi, Yixiong Zou, et al. (37 additional authors not shown)

    Abstract: Cross-Domain Few-Shot Object Detection (CD-FSOD) poses significant challenges to existing object detection and few-shot detection models when applied across domains. In conjunction with NTIRE 2025, we organized the 1st CD-FSOD Challenge, aiming to advance the performance of current object detectors on entirely novel target domains with only limited labeled data. The challenge attracted 152 registe…

    Submitted 14 April, 2025; originally announced April 2025.

    Comments: accepted by CVPRW 25 @ NTIRE

  42. arXiv:2504.08378  [pdf, other

    cs.LG

    Scaling Up On-Device LLMs via Active-Weight Swapping Between DRAM and Flash

    Authors: Fucheng Jia, Zewen Wu, Shiqi Jiang, Huiqiang Jiang, Qianxi Zhang, Yuqing Yang, Yunxin Liu, Ju Ren, Deyu Zhang, Ting Cao

    Abstract: Large language models (LLMs) are increasingly being deployed on mobile devices, but the limited DRAM capacity constrains the deployable model size. This paper introduces ActiveFlow, the first LLM inference framework that can achieve adaptive DRAM usage for modern LLMs (not ReLU-based), enabling the scaling up of deployable model sizes. The framework is based on the novel concept of active weight D…

    Submitted 11 April, 2025; originally announced April 2025.

  43. arXiv:2504.04586  [pdf, other

    cs.NI

    Joint Optimization of Handoff and Video Rate in LEO Satellite Networks

    Authors: Kyoungjun Park, Zhiyuan He, Cheng Luo, Yi Xu, Lili Qiu, Changhan Ge, Muhammad Muaz, Yuqing Yang

    Abstract: Low Earth Orbit (LEO) satellite communication presents a promising solution for delivering Internet access to users in remote regions. Given that video content is expected to dominate network traffic in LEO satellite systems, this study presents a new video-aware mobility management framework specifically designed for such networks. By combining simulation models with real-world datasets, we highl…

    Submitted 6 April, 2025; originally announced April 2025.

  44. arXiv:2504.04497  [pdf, other

    cs.RO

    SELC: Self-Supervised Efficient Local Correspondence Learning for Low Quality Images

    Authors: Yuqing Wang, Yan Wang, Hailiang Tang, Xiaoji Niu

    Abstract: Accurate and stable feature matching is critical for computer vision tasks, particularly in applications such as Simultaneous Localization and Mapping (SLAM). While recent learning-based feature matching methods have demonstrated promising performance in challenging spatiotemporal scenarios, they still face inherent trade-offs between accuracy and computational efficiency in specific settings. In…

    Submitted 6 April, 2025; originally announced April 2025.

    Comments: 8 pages, 4 figures

  45. arXiv:2504.03682  [pdf

    cs.DC cs.AI cs.LG

    Intelligent Resource Allocation Optimization for Cloud Computing via Machine Learning

    Authors: Yuqing Wang, Xiao Yang

    Abstract: With the rapid expansion of cloud computing applications, optimizing resource allocation has become crucial for improving system performance and cost efficiency. This paper proposes an intelligent resource allocation algorithm that leverages deep learning (LSTM) for demand prediction and reinforcement learning (DQN) for dynamic scheduling. By accurately forecasting computing resource demands and e…

    Submitted 21 March, 2025; originally announced April 2025.

  46. arXiv:2503.23631  [pdf, other

    cs.AI

    Intrinsically-Motivated Humans and Agents in Open-World Exploration

    Authors: Aly Lidayan, Yuqing Du, Eliza Kosoy, Maria Rufova, Pieter Abbeel, Alison Gopnik

    Abstract: What drives exploration? Understanding intrinsic motivation is a long-standing challenge in both cognitive science and artificial intelligence; numerous objectives have been proposed and used to train agents, yet there remains a gap between human and agent exploration. We directly compare adults, children, and AI agents in a complex open-ended environment, Crafter, and study how common intrinsic o…

    Submitted 27 May, 2025; v1 submitted 30 March, 2025; originally announced March 2025.

  47. arXiv:2503.19482  [pdf, other

    cs.CL

    KSHSeek: Data-Driven Approaches to Mitigating and Detecting Knowledge-Shortcut Hallucinations in Generative Models

    Authors: Zhiwei Wang, Zhongxin Liu, Ying Li, Hongyu Sun, Meng Xu, Yuqing Zhang

    Abstract: The emergence of large language models (LLMs) has significantly advanced the development of natural language processing (NLP), especially in text generation tasks like question answering. However, model hallucinations remain a major challenge in natural language generation (NLG) tasks due to their complex causes. We systematically expand on the causes of factual hallucinations from the perspective…

    Submitted 25 March, 2025; originally announced March 2025.

    Comments: 16 pages, 34 figures

    ACM Class: I.2.7; I.2.6

  48. arXiv:2503.16430  [pdf, other

    cs.CV

    Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation

    Authors: Yuqing Wang, Zhijie Lin, Yao Teng, Yuanzhi Zhu, Shuhuai Ren, Jiashi Feng, Xihui Liu

    Abstract: Autoregressive visual generation models typically rely on tokenizers to compress images into tokens that can be predicted sequentially. A fundamental dilemma exists in token representation: discrete tokens enable straightforward modeling with standard cross-entropy loss, but suffer from information loss and tokenizer training instability; continuous tokens better preserve visual details, but requi…

    Submitted 21 March, 2025; v1 submitted 20 March, 2025; originally announced March 2025.

    Comments: Project page: https://yuqingwang1029.github.io/TokenBridge

  49. arXiv:2503.16129  [pdf, ps, other

    cs.GR

    Controllable Segmentation-Based Text-Guided Style Editing

    Authors: Jingwen Li, Aravind Chandrasekar, Mariana Rocha, Chao Li, Yuqing Chen

    Abstract: We present a novel approach for controllable, region-specific style editing driven by textual prompts. Building upon the state-space style alignment framework introduced by \emph{StyleMamba}, our method integrates a semantic segmentation model into the style transfer pipeline. This allows users to selectively apply text-driven style changes to specific segments (e.g., ``turn the building into a cy…

    Submitted 20 March, 2025; originally announced March 2025.

  50. arXiv:2503.15937  [pdf, other

    cs.AI

    Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment

    Authors: Gaole Dai, Shiqi Jiang, Ting Cao, Yuanchun Li, Yuqing Yang, Rui Tan, Mo Li, Lili Qiu

    Abstract: We propose V-Droid, a mobile GUI task automation agent. Unlike previous mobile agents that utilize Large Language Models (LLMs) as generators to directly generate actions at each step, V-Droid employs LLMs as verifiers to evaluate candidate actions before making final decisions. To realize this novel paradigm, we introduce a comprehensive framework for constructing verifier-driven mobile agents: t…

    Submitted 20 March, 2025; v1 submitted 20 March, 2025; originally announced March 2025.

    Comments: 14 pages, 4 iterations, refine figs