
Showing 1–50 of 4,366 results for author: WU, Y

Searching in archive cs.
  1. arXiv:2507.04947  [pdf, ps, other]

    cs.CV cs.AI

    DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer

    Authors: Yecheng Wu, Junyu Chen, Zhuoyang Zhang, Enze Xie, Jincheng Yu, Junsong Chen, Jinyi Hu, Yao Lu, Song Han, Han Cai

    Abstract: We introduce DC-AR, a novel masked autoregressive (AR) text-to-image generation framework that delivers superior image generation quality with exceptional computational efficiency. Due to the tokenizers' limitations, prior masked AR models have lagged behind diffusion models in terms of quality or efficiency. We overcome this limitation by introducing DC-HT - a deep compression hybrid tokenizer fo…

    Submitted 7 July, 2025; originally announced July 2025.

    Comments: ICCV 2025

  2. arXiv:2507.04756  [pdf, ps, other]

    cs.CL cs.AI

    CoSteer: Collaborative Decoding-Time Personalization via Local Delta Steering

    Authors: Hang Lv, Sheng Liang, Hao Wang, Hongchao Gu, Yaxiong Wu, Wei Guo, Defu Lian, Yong Liu, Enhong Chen

    Abstract: Personalized text generation has become crucial for adapting language models to diverse and evolving users' personal context across cultural, temporal, and contextual dimensions. While existing methods often rely on centralized fine-tuning or static preference alignment, they struggle to achieve real-time adaptation under resource constraints inherent to personal devices. This limitation creates a…

    Submitted 7 July, 2025; originally announced July 2025.

  3. arXiv:2507.04511  [pdf, ps, other]

    cs.CV

    FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection

    Authors: Xinhua Lu, Runhe Lai, Yanqi Wu, Kanghao Chen, Wei-Shi Zheng, Ruixuan Wang

    Abstract: Pre-trained vision-language models (VLMs) have advanced out-of-distribution (OOD) detection recently. However, existing CLIP-based methods often focus on learning OOD-related knowledge to improve OOD detection, showing limited generalization or reliance on external large-scale auxiliary datasets. In this study, instead of delving into the intricate OOD-related knowledge, we propose an innovative C…

    Submitted 6 July, 2025; originally announced July 2025.

  4. arXiv:2507.04452  [pdf, ps, other]

    cs.RO

    SimLauncher: Launching Sample-Efficient Real-world Robotic Reinforcement Learning via Simulation Pre-training

    Authors: Mingdong Wu, Lehong Wu, Yizhuo Wu, Weiyao Huang, Hongwei Fan, Zheyuan Hu, Haoran Geng, Jinzhou Li, Jiahe Ying, Long Yang, Yuanpei Chen, Hao Dong

    Abstract: Autonomous learning of dexterous, long-horizon robotic skills has been a longstanding pursuit of embodied AI. Recent advances in robotic reinforcement learning (RL) have demonstrated remarkable performance and robustness in real-world visuomotor control tasks. However, applying RL in the real world faces challenges such as low sample efficiency, slow exploration, and significant reliance on human…

    Submitted 6 July, 2025; originally announced July 2025.

  5. arXiv:2507.04404  [pdf, ps, other]

    cs.AI

    LayerCake: Token-Aware Contrastive Decoding within Large Language Model Layers

    Authors: Jingze Zhu, Yongliang Wu, Wenbo Zhu, Jiawang Cao, Yanqiang Zheng, Jiawei Chen, Xu Yang, Bernt Schiele, Jonas Fischer, Xinting Hu

    Abstract: Large language models (LLMs) excel at natural language understanding and generation but remain vulnerable to factual errors, limiting their reliability in knowledge-intensive tasks. While decoding-time strategies provide a promising efficient solution without training, existing methods typically treat token-level and layer-level signals in isolation, overlooking the joint dynamics between them. In…

    Submitted 6 July, 2025; originally announced July 2025.

  6. arXiv:2507.04072  [pdf, ps, other]

    cs.IR

    CTR-Guided Generative Query Suggestion in Conversational Search

    Authors: Erxue Min, Hsiu-Yuan Huang, Xihong Yang, Min Yang, Xin Jia, Yunfang Wu, Hengyi Cai, Junfeng Wang, Shuaiqiang Wang, Dawei Yin

    Abstract: Generating effective query suggestions in conversational search requires aligning model outputs with user preferences, which is challenging due to sparse and noisy click signals. We propose GQS, a generative framework that integrates click modeling and preference optimization to enhance real-world user engagement. GQS consists of three key components: (1) a Multi-Source CTR Modeling module that ca…

    Submitted 5 July, 2025; originally announced July 2025.

  7. arXiv:2507.03585  [pdf, ps, other]

    cs.CV cs.AI cs.CL

    Causal-SAM-LLM: Large Language Models as Causal Reasoners for Robust Medical Segmentation

    Authors: Tao Tang, Shijie Xu, Yiting Wu, Zhixiang Lu

    Abstract: The clinical utility of deep learning models for medical image segmentation is severely constrained by their inability to generalize to unseen domains. This failure is often rooted in the models learning spurious correlations between anatomical content and domain-specific imaging styles. To overcome this fundamental challenge, we introduce Causal-SAM-LLM, a novel framework that elevates Large Lang…

    Submitted 4 July, 2025; originally announced July 2025.

  8. arXiv:2507.03543  [pdf, ps, other]

    cs.CL cs.AI

    H2HTalk: Evaluating Large Language Models as Emotional Companion

    Authors: Boyang Wang, Yalun Wu, Hongcheng Guo, Zhoujun Li

    Abstract: As digital emotional support needs grow, Large Language Model companions offer promising authentic, always-available empathy, though rigorous evaluation lags behind model advancement. We present Heart-to-Heart Talk (H2HTalk), a benchmark assessing companions across personality development and empathetic interaction, balancing emotional intelligence with linguistic fluency. H2HTalk features 4,650 c…

    Submitted 4 July, 2025; originally announced July 2025.

  9. arXiv:2507.03487  [pdf, ps, other]

    cs.LG

    ObjectRL: An Object-Oriented Reinforcement Learning Codebase

    Authors: Gulcin Baykal, Abdullah Akgül, Manuel Haussmann, Bahareh Tasdighi, Nicklas Werge, Yi-Shan Wu, Melih Kandemir

    Abstract: ObjectRL is an open-source Python codebase for deep reinforcement learning (RL), designed for research-oriented prototyping with minimal programming effort. Unlike existing codebases, ObjectRL is built on Object-Oriented Programming (OOP) principles, providing a clear structure that simplifies the implementation, modification, and evaluation of new algorithms. ObjectRL lowers the entry barrier for…

    Submitted 4 July, 2025; originally announced July 2025.

  10. arXiv:2507.03268  [pdf, ps, other]

    cs.CV

    Dual-frequency Selected Knowledge Distillation with Statistical-based Sample Rectification for PolSAR Image Classification

    Authors: Xinyue Xin, Ming Li, Yan Wu, Xiang Li, Peng Zhang, Dazhi Xu

    Abstract: The collaborative classification of dual-frequency PolSAR images is a meaningful but also challenging research. The effect of regional consistency on classification information learning and the rational use of dual-frequency data are two main difficulties for dual-frequency collaborative classification. To tackle these problems, a selected knowledge distillation network with statistical-based samp…

    Submitted 3 July, 2025; originally announced July 2025.

  11. arXiv:2507.03175  [pdf, ps, other]

    cs.LG cs.AI

    Understanding Knowledge Transferability for Transfer Learning: A Survey

    Authors: Haohua Wang, Jingge Wang, Zijie Zhao, Yang Tan, Yanru Wu, Hanbing Liu, Jingyun Yang, Enming Zhang, Xiangyu Chen, Zhengze Rong, Shanxin Guo, Yang Li

    Abstract: Transfer learning has become an essential paradigm in artificial intelligence, enabling the transfer of knowledge from a source task to improve performance on a target task. This approach, particularly through techniques such as pretraining and fine-tuning, has seen significant success in fields like computer vision and natural language processing. However, despite its widespread use, how to relia…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: 35 pages, 15 figures, submitted to ACM Computing Surveys

    MSC Class: 68U01

  12. arXiv:2507.02863  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory

    Authors: Yuqi Wu, Wenzhao Zheng, Jie Zhou, Jiwen Lu

    Abstract: Dense 3D scene reconstruction from an ordered sequence or unordered image collections is a critical step when bringing research in computer vision into practical scenarios. Following the paradigm introduced by DUSt3R, which unifies an image pair densely into a shared coordinate system, subsequent methods maintain an implicit memory to achieve dense 3D reconstruction from more images. However, such…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: Code is available at: https://github.com/YkiWu/Point3R

  13. arXiv:2507.02768  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment

    Authors: Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang, Chih-Kai Yang, Chee-En Yu, Chun-Wei Chen, Wei-Chih Chen, Chien-yu Huang, Yi-Cheng Lin, Yu-Xiang Lin, Chi-An Fu, Chun-Yi Kuan, Wenze Ren, Xuanjun Chen, Wei-Ping Huang, En-Pei Hu, Tzu-Quan Lin, Yuan-Kuei Wu, Kuan-Po Huang, Hsiao-Ying Huang, Huang-Cheng Chou, Kai-Wei Chang, Cheng-Han Chiang , et al. (3 additional authors not shown)

    Abstract: We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following, without requiring task-specific audio instruction-tuning. Recent LALMs typically augment Large Language Models (LLMs) with auditory capabilities by training on large-scale, manually curated or LLM-synthesized audio-instruction datasets. However, these…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: Model and code available at: https://github.com/kehanlu/DeSTA2.5-Audio

  14. arXiv:2507.02664  [pdf, ps, other]

    cs.CV

    AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models

    Authors: Ziyin Zhou, Yunpeng Luo, Yuanchen Wu, Ke Sun, Jiayi Ji, Ke Yan, Shouhong Ding, Xiaoshuai Sun, Yunsheng Wu, Rongrong Ji

    Abstract: The rapid development of AI-generated content (AIGC) technology has led to the misuse of highly realistic AI-generated images (AIGI) in spreading misinformation, posing a threat to public information security. Although existing AIGI detection techniques are generally effective, they face two issues: 1) a lack of human-verifiable explanations, and 2) a lack of generalization in the latest generatio…

    Submitted 7 July, 2025; v1 submitted 3 July, 2025; originally announced July 2025.

    Comments: Accepted to ICCV 2025

  15. arXiv:2507.02644  [pdf, ps, other]

    cond-mat.str-el cs.AI quant-ph

    Solving the Hubbard model with Neural Quantum States

    Authors: Yuntian Gu, Wenrui Li, Heng Lin, Bo Zhan, Ruichen Li, Yifei Huang, Di He, Yantao Wu, Tao Xiang, Mingpu Qin, Liwei Wang, Dingshun Lv

    Abstract: The rapid development of neural quantum states (NQS) has established it as a promising framework for studying quantum many-body systems. In this work, by leveraging the cutting-edge transformer-based architectures and developing highly efficient optimization algorithms, we achieve the state-of-the-art results for the doped two-dimensional (2D) Hubbard model, arguably the minimum model for high-Tc…

    Submitted 3 July, 2025; originally announced July 2025.

  16. arXiv:2507.02271  [pdf, ps, other]

    cs.CV cs.AI cs.MM

    Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation

    Authors: Feizhen Huang, Yu Wu, Yutian Lin, Bo Du

    Abstract: Video-to-Audio (V2A) Generation achieves significant progress and plays a crucial role in film and video post-production. However, current methods overlook the cinematic language, a critical component of artistic expression in filmmaking. As a result, their performance deteriorates in scenarios where Foley targets are only partially visible. To address this challenge, we propose a simple self-dist…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Accepted by IJCAI 2025

  17. arXiv:2507.01437  [pdf]

    cs.CL

    Clinical NLP with Attention-Based Deep Learning for Multi-Disease Prediction

    Authors: Ting Xu, Xiaoxiao Deng, Xiandong Meng, Haifeng Yang, Yan Wu

    Abstract: This paper addresses the challenges posed by the unstructured nature and high-dimensional semantic complexity of electronic health record texts. A deep learning method based on attention mechanisms is proposed to achieve unified modeling for information extraction and multi-label disease prediction. The study is conducted on the MIMIC-IV dataset. A Transformer-based architecture is used to perform…

    Submitted 2 July, 2025; originally announced July 2025.

  18. arXiv:2507.01401  [pdf, ps, other]

    cs.CV cs.AI

    Medical-Knowledge Driven Multiple Instance Learning for Classifying Severe Abdominal Anomalies on Prenatal Ultrasound

    Authors: Huanwen Liang, Jingxian Xu, Yuanji Zhang, Yuhao Huang, Yuhan Zhang, Xin Yang, Ran Li, Xuedong Deng, Yanjun Liu, Guowei Tao, Yun Wu, Sheng Zhao, Xinru Gao, Dong Ni

    Abstract: Fetal abdominal malformations are serious congenital anomalies that require accurate diagnosis to guide pregnancy management and reduce mortality. Although AI has demonstrated significant potential in medical diagnosis, its application to prenatal abdominal anomalies remains limited. Most existing studies focus on image-level classification and rely on standard plane localization, placing less emp…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Accepted by MICCAI 2025

  19. Classification based deep learning models for lung cancer and disease using medical images

    Authors: Ahmad Chaddad, Jihao Peng, Yihang Wu

    Abstract: The use of deep learning (DL) in medical image analysis has significantly improved the ability to predict lung cancer. In this study, we introduce a novel deep convolutional neural network (CNN) model, named ResNet+, which is based on the established ResNet framework. This model is specifically designed to improve the prediction of lung cancer and diseases using the images. To address the challeng…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: Accepted in IEEE Transactions on Radiation and Plasma Medical Sciences

  20. arXiv:2507.01066  [pdf]

    cs.IR cs.CV cs.LG

    Embedding-based Retrieval in Multimodal Content Moderation

    Authors: Hanzhong Liang, Jinghao Shi, Xiang Shen, Zixuan Wang, Vera Wen, Ardalan Mehrani, Zhiqian Chen, Yifan Wu, Zhixin Zhang

    Abstract: Video understanding plays a fundamental role for content moderation on short video platforms, enabling the detection of inappropriate content. While classification remains the dominant approach for content moderation, it often struggles in scenarios requiring rapid and cost-efficient responses, such as trend adaptation and urgent escalations. To address this issue, we introduce an Embedding-Based…

    Submitted 30 June, 2025; originally announced July 2025.

    Comments: Camera ready for SIGIR 2025

  21. arXiv:2507.00950  [pdf, ps, other]

    cs.CV cs.LG cs.MM

    MVP: Winning Solution to SMP Challenge 2025 Video Track

    Authors: Liliang Ye, Yunyao Zhang, Yafeng Wu, Yi-Ping Phoebe Chen, Junqing Yu, Wei Yang, Zikai Song

    Abstract: Social media platforms serve as central hubs for content dissemination, opinion expression, and public engagement across diverse modalities. Accurately predicting the popularity of social media videos enables valuable applications in content recommendation, trend detection, and audience engagement. In this paper, we present Multimodal Video Predictor (MVP), our winning solution to the Video Track…

    Submitted 1 July, 2025; originally announced July 2025.

  22. arXiv:2507.00926  [pdf, ps, other]

    cs.MM cs.LG

    HyperFusion: Hierarchical Multimodal Ensemble Learning for Social Media Popularity Prediction

    Authors: Liliang Ye, Yunyao Zhang, Yafeng Wu, Yi-Ping Phoebe Chen, Junqing Yu, Wei Yang, Zikai Song

    Abstract: Social media popularity prediction plays a crucial role in content optimization, marketing strategies, and user engagement enhancement across digital platforms. However, predicting post popularity remains challenging due to the complex interplay between visual, textual, temporal, and user behavioral factors. This paper presents HyperFusion, a hierarchical multimodal ensemble learning framework for…

    Submitted 1 July, 2025; originally announced July 2025.

  23. arXiv:2507.00752  [pdf, ps, other]

    cs.CV cs.RO

    Multi-Modal Graph Convolutional Network with Sinusoidal Encoding for Robust Human Action Segmentation

    Authors: Hao Xing, Kai Zhe Boey, Yuankai Wu, Darius Burschka, Gordon Cheng

    Abstract: Accurate temporal segmentation of human actions is critical for intelligent robots in collaborative settings, where a precise understanding of sub-activity labels and their temporal structure is essential. However, the inherent noise in both human pose estimation and object detection often leads to over-segmentation errors, disrupting the coherence of action sequences. To address this, we propose…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: 7 pages, 4 figures, accepted in IROS25, Hangzhou, China

  24. arXiv:2507.00577  [pdf, ps, other]

    cs.CR cs.AI cs.CV

    BadViM: Backdoor Attack against Vision Mamba

    Authors: Yinghao Wu, Liyan Zhang

    Abstract: Vision State Space Models (SSMs), particularly architectures like Vision Mamba (ViM), have emerged as promising alternatives to Vision Transformers (ViTs). However, the security implications of this novel architecture, especially their vulnerability to backdoor attacks, remain critically underexplored. Backdoor attacks aim to embed hidden triggers into victim models, causing the model to misclassi…

    Submitted 1 July, 2025; originally announced July 2025.

  25. arXiv:2507.00505  [pdf, ps, other]

    cs.CV

    LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs

    Authors: Haoran Lou, Chunxiao Fan, Ziyan Liu, Yuexin Wu, Xinliang Wang

    Abstract: The architecture of multimodal large language models (MLLMs) commonly connects a vision encoder, often based on CLIP-ViT, to a large language model. While CLIP-ViT works well for capturing global image features, it struggles to model local relationships between adjacent patches, leading to weaker visual representation, which in turn affects the detailed understanding ability of MLLMs. To solve thi…

    Submitted 4 July, 2025; v1 submitted 1 July, 2025; originally announced July 2025.

    Comments: Accepted to ICCV 2025

  26. arXiv:2507.00091  [pdf, ps, other]

    cs.IT

    On the Optimality of Coded Distributed Computing for Ring Networks

    Authors: Zhenhao Huang, Minquan Cheng, Kai Wan, Qifu Tyler Sun, Youlong Wu

    Abstract: We consider a coded distributed computing problem in a ring-based communication network, where $N$ computing nodes are arranged in a ring topology and each node can only communicate with its neighbors within a constant distance $d$. To mitigate the communication bottleneck in exchanging intermediate values, we propose new coded distributed computing schemes for the ring-based network that exploit…

    Submitted 30 June, 2025; originally announced July 2025.

    Comments: Part of the work has been presented at ISIT 2025

  27. arXiv:2506.23924  [pdf, ps, other]

    cs.AI

    Performance of LLMs on Stochastic Modeling Operations Research Problems: From Theory to Practice

    Authors: Akshit Kumar, Tianyi Peng, Yuhang Wu, Assaf Zeevi

    Abstract: Large language models (LLMs) have exhibited expert-level capabilities across various domains. However, their abilities to solve problems in Operations Research (OR) -- the analysis and optimization of mathematical models derived from real-world problems or their verbal descriptions -- remain underexplored. In this work, we take a first step toward evaluating LLMs' abilities to solve stochastic mod…

    Submitted 30 June, 2025; originally announced June 2025.

  28. arXiv:2506.23827  [pdf, ps, other]

    cs.CV

    Spatially Gene Expression Prediction using Dual-Scale Contrastive Learning

    Authors: Mingcheng Qu, Yuncong Wu, Donglin Di, Yue Gao, Tonghua Su, Yang Song, Lei Fan

    Abstract: Spatial transcriptomics (ST) provides crucial insights into tissue micro-environments, but is limited to its high cost and complexity. As an alternative, predicting gene expression from pathology whole slide images (WSI) is gaining increasing attention. However, existing methods typically rely on single patches or a single pathology modality, neglecting the complex spatial and molecular interactio…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: Our paper has been accepted by MICCAI 2025

  29. arXiv:2506.23785  [pdf, ps, other]

    cs.CV

    Visual Textualization for Image Prompted Object Detection

    Authors: Yongjian Wu, Yang Zhou, Jiya Saiyin, Bingzheng Wei, Yan Xu

    Abstract: We propose VisTex-OVLM, a novel image prompted object detection method that introduces visual textualization -- a process that projects a few visual exemplars into the text feature space to enhance Object-level Vision-Language Models' (OVLMs) capability in detecting rare categories that are difficult to describe textually and nearly absent from their pre-training data, while preserving their pre-t…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: Accepted by ICCV 2025

  30. arXiv:2506.23680  [pdf, ps, other]

    cs.IT

    Asymptotically Optimal Secure Aggregation for Wireless Federated Learning with Multiple Servers

    Authors: Zhenhao Huang, Kai Liang, Yuanming Shi, Songze Li, Youlong Wu

    Abstract: In this paper, we investigate the transmission latency of the secure aggregation problem in a \emph{wireless} federated learning system with multiple curious servers. We propose a privacy-preserving coded aggregation scheme where the servers can not infer any information about the distributed users' local gradients, nor the aggregation value. In our scheme, each user encodes its local gradient int…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: This work was in part presented at the IEEE International Symposium on Information Theory (ISIT), 2023

  31. arXiv:2506.23601  [pdf, ps, other]

    cs.CL cs.AI

    Semantic-guided Diverse Decoding for Large Language Model

    Authors: Weijie Shi, Yue Cui, Yaguang Wu, Jingzhi Fang, Shibo Zhang, Mengze Li, Sirui Han, Jia Zhu, Jiajie Xu, Xiaofang Zhou

    Abstract: Diverse decoding of large language models is crucial for applications requiring multiple semantically distinct responses, yet existing methods primarily achieve lexical rather than semantic diversity. This limitation significantly constrains Best-of-N strategies, group-based reinforcement learning, and data synthesis. While temperature sampling and diverse beam search modify token distributions or…

    Submitted 30 June, 2025; originally announced June 2025.

  32. arXiv:2506.23485  [pdf, ps, other]

    cs.CL cs.AI cs.IR

    Thought-Augmented Planning for LLM-Powered Interactive Recommender Agent

    Authors: Haocheng Yu, Yaxiong Wu, Hao Wang, Wei Guo, Yong Liu, Yawen Li, Yuyang Ye, Junping Du, Enhong Chen

    Abstract: Interactive recommendation is a typical information-seeking task that allows users to interactively express their needs through natural language and obtain personalized recommendations. Large language model-powered (LLM-powered) agents have become a new paradigm in interactive recommendations, effectively capturing users' real-time needs and enhancing personalized experiences. However, due to limi…

    Submitted 29 June, 2025; originally announced June 2025.

  33. arXiv:2506.23482  [pdf, ps, other]

    cs.CV

    MTADiffusion: Mask Text Alignment Diffusion Model for Object Inpainting

    Authors: Jun Huang, Ting Liu, Yihang Wu, Xiaochao Qu, Luoqi Liu, Xiaolin Hu

    Abstract: Advancements in generative models have enabled image inpainting models to generate content within specific regions of an image based on provided prompts and masks. However, existing inpainting methods often suffer from problems such as semantic misalignment, structural distortion, and style inconsistency. In this work, we present MTADiffusion, a Mask-Text Alignment diffusion model designed for obj…

    Submitted 29 June, 2025; originally announced June 2025.

    Comments: CVPR 2025

  34. arXiv:2506.23460  [pdf, ps, other]

    cs.CV

    Contrastive Learning with Diffusion Features for Weakly Supervised Medical Image Segmentation

    Authors: Dewen Zeng, Xinrong Hu, Yu-Jen Chen, Yawen Wu, Xiaowei Xu, Yiyu Shi

    Abstract: Weakly supervised semantic segmentation (WSSS) methods using class labels often rely on class activation maps (CAMs) to localize objects. However, traditional CAM-based methods struggle with partial activations and imprecise object boundaries due to optimization discrepancies between classification and segmentation. Recently, the conditional diffusion model (CDM) has been used as an alternative fo…

    Submitted 29 June, 2025; originally announced June 2025.

  35. arXiv:2506.23329  [pdf, ps, other]

    cs.CV

    IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering

    Authors: Parker Liu, Chenxin Li, Zhengxin Li, Yipeng Wu, Wuyang Li, Zhiqin Yang, Zhenyuan Zhang, Yunlong Lin, Sirui Han, Brandon Y. Feng

    Abstract: Vision-language models (VLMs) excel at descriptive tasks, but whether they truly understand scenes from visual observations remains uncertain. We introduce IR3D-Bench, a benchmark challenging VLMs to demonstrate understanding through active creation rather than passive recognition. Grounded in the analysis-by-synthesis paradigm, IR3D-Bench tasks Vision-Language Agents (VLAs) with actively using pr…

    Submitted 29 June, 2025; originally announced June 2025.

    Comments: Project Page: https://ir3d-bench.github.io/

  36. arXiv:2506.22852  [pdf, ps, other]

    cs.CL

    Knowledge Augmented Finetuning Matters in both RAG and Agent Based Dialog Systems

    Authors: Yucheng Cai, Yuxuan Wu, Yi Huang, Junlan Feng, Zhijian Ou

    Abstract: Large language models (LLMs) have recently been applied to dialog systems. Despite making progress, LLMs are prone to errors in knowledge-intensive scenarios. Recently, approaches based on retrieval augmented generation (RAG) and agent have emerged to improve the factual accuracy by enhancing the LLMs with knowledge retrieved from external knowledge bases (KBs). This is mostly implemented by promp…

    Submitted 28 June, 2025; originally announced June 2025.

  37. arXiv:2506.22788  [pdf, ps, other]

    cs.RO

    SPI-BoTER: Error Compensation for Industrial Robots via Sparse Attention Masking and Hybrid Loss with Spatial-Physical Information

    Authors: Xuao Hou, Yongquan Jia, Shijin Zhang, Yuqiang Wu

    Abstract: The widespread application of industrial robots in fields such as cutting and welding has imposed increasingly stringent requirements on the trajectory accuracy of end-effectors. However, current error compensation methods face several critical challenges, including overly simplified mechanism modeling, a lack of physical consistency in data-driven approaches, and substantial data requirements. Th…

    Submitted 28 June, 2025; originally announced June 2025.

  38. arXiv:2506.22773  [pdf, ps, other]

    cs.DC cs.AR cs.CY cs.LG

    Not All Water Consumption Is Equal: A Water Stress Weighted Metric for Sustainable Computing

    Authors: Yanran Wu, Inez Hua, Yi Ding

    Abstract: Water consumption is an increasingly critical dimension of computing sustainability, especially as AI workloads rapidly scale. However, current water impact assessment often overlooks where and when water stress is more severe. To fill in this gap, we present SCARF, the first general framework that evaluates water impact of computing by factoring in both spatial and temporal variations in water st…

    Submitted 1 July, 2025; v1 submitted 28 June, 2025; originally announced June 2025.

    Comments: 7 pages, 9 figures, The 4th Workshop on Sustainable Computer Systems (HotCarbon'25), Cambridge, MA, July 10-11th, 2025

    Journal ref: ACM SIGEnergy Energy Informatics Review (EIR), Volume 5 Issue 2, July 2025

  39. arXiv:2506.21896  [pdf, ps, other]

    cs.HC

    Focus on the Experts: Co-designing an Augmented Reality Eye-Gaze Tracking System with Surgical Trainees to Improve Endoscopic Instruction

    Authors: Jumanh Atoum, Jinkyung Park, Mamtaj Akter, Nicholas Kavoussi, Pamela Wisniewski, Jie Ying Wu

    Abstract: The current apprenticeship model for surgical training requires a high level of supervision, which does not scale well to meet the growing need for more surgeons. Many endoscopic procedures are directly taught in the operating room (OR) while the attending surgeon and trainee operate on patients. The need to prioritize patient care limits the trainees' opportunities to experiment and receive feedb…

    Submitted 27 June, 2025; originally announced June 2025.

  40. arXiv:2506.21734  [pdf, ps, other]

    cs.AI cs.LG

    Hierarchical Reasoning Model

    Authors: Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, Yasin Abbasi Yadkori

    Abstract: Reasoning, the process of devising and executing complex goal-oriented action sequences, remains a critical challenge in AI. Current large language models (LLMs) primarily employ Chain-of-Thought (CoT) techniques, which suffer from brittle task decomposition, extensive data requirements, and high latency. Inspired by the hierarchical and multi-timescale processing in the human brain, we propose th…

    Submitted 26 June, 2025; originally announced June 2025.

  41. arXiv:2506.21580  [pdf]

    cs.CL cs.AI cs.CY

    From General Reasoning to Domain Expertise: Uncovering the Limits of Generalization in Large Language Models

    Authors: Dana Alsagheer, Yang Lu, Abdulrahman Kamal, Omar Kamal, Mohammad Kamal, Nada Mansour, Cosmo Yang Wu, Rambiba Karanjai, Sen Li, Weidong Shi

    Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated remarkable capabilities in various domains. However, effective decision-making relies heavily on strong reasoning abilities. Reasoning is the foundation for decision-making, providing the analytical and logical framework to make sound choices. Reasoning involves analyzing information, drawing inferences, and reaching conclusions…

    Submitted 16 June, 2025; originally announced June 2025.

  42. arXiv:2506.21263  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    DiLoCoX: A Low-Communication Large-Scale Training Framework for Decentralized Cluster

    Authors: Ji Qi, WenPeng Zhu, Li Li, Ming Wu, YingJun Wu, Wu He, Xun Gao, Jason Zeng, Michael Heinrich

    Abstract: The distributed training of foundation models, particularly large language models (LLMs), demands a high level of communication. Consequently, it is highly dependent on a centralized cluster with fast and reliable interconnects. Can we conduct training on slow networks and thereby unleash the power of decentralized clusters when dealing with models exceeding 100 billion parameters? In this paper,…

    Submitted 26 June, 2025; originally announced June 2025.

  43. arXiv:2506.21101  [pdf, ps, other]

    cs.CV

    OracleFusion: Assisting the Decipherment of Oracle Bone Script with Structurally Constrained Semantic Typography

    Authors: Caoshuo Li, Zengmao Ding, Xiaobin Hu, Bang Li, Donghao Luo, AndyPian Wu, Chaoyang Wang, Chengjie Wang, Taisong Jin, SevenShu, Yunsheng Wu, Yongge Liu, Rongrong Ji

    Abstract: As one of the earliest ancient languages, Oracle Bone Script (OBS) encapsulates the cultural records and intellectual expressions of ancient civilizations. Despite the discovery of approximately 4,500 OBS characters, only about 1,600 have been deciphered. The remaining undeciphered ones, with their complex structure and abstract imagery, pose significant challenges for interpretation. To address t…

    Submitted 26 June, 2025; originally announced June 2025.

    Comments: Accepted to ICCV 2025

  44. arXiv:2506.20991  [pdf, ps, other]

    cs.CV

    TSDASeg: A Two-Stage Model with Direct Alignment for Interactive Point Cloud Segmentation

    Authors: Chade Li, Pengju Zhang, Yihong Wu

    Abstract: The rapid advancement of 3D vision-language models (VLMs) has spurred significant interest in interactive point cloud processing tasks, particularly for real-world applications. However, existing methods often underperform in point-level tasks, such as segmentation, due to missing direct 3D-text alignment, limiting their ability to link local 3D features with textual context. To solve this problem…

    Submitted 26 June, 2025; originally announced June 2025.

  45. arXiv:2506.20406  [pdf, ps, other]

    stat.ML cs.IT cs.LG stat.ME

    POLAR: A Pessimistic Model-based Policy Learning Algorithm for Dynamic Treatment Regimes

    Authors: Ruijia Zhang, Zhengling Qi, Yue Wu, Xiangyu Zhang, Yanxun Xu

    Abstract: Dynamic treatment regimes (DTRs) provide a principled framework for optimizing sequential decision-making in domains where decisions must adapt over time in response to individual trajectories, such as healthcare, education, and digital interventions. However, existing statistical methods often rely on strong positivity assumptions and lack robustness under partial data coverage, while offline rei…

    Submitted 25 June, 2025; originally announced June 2025.

  46. arXiv:2506.19937  [pdf, ps, other]

    cs.LG

    The Most Important Features in Generalized Additive Models Might Be Groups of Features

    Authors: Tomas M. Bosschieter, Luis Franca, Jessica Wolk, Yiyuan Wu, Bella Mehta, Joseph Dehoney, Orsolya Kiss, Fiona C. Baker, Qingyu Zhao, Rich Caruana, Kilian M. Pohl

    Abstract: While analyzing the importance of features has become ubiquitous in interpretable machine learning, the joint signal from a group of related features is sometimes overlooked or inadvertently excluded. Neglecting the joint signal could bypass a critical insight: in many instances, the most significant predictors are not isolated features, but rather the combined effect of groups of features. This c…

    Submitted 24 June, 2025; originally announced June 2025.

  47. arXiv:2506.19651  [pdf, ps, other]

    cs.CV cs.LG cs.PF

    PEVLM: Parallel Encoding for Vision-Language Models

    Authors: Letian Kang, Shixian Luo, Yiqiang Li, Xiaoyang Yu, Shenxuan Zhou, Yong Wu

    Abstract: Vision-Language Models (VLMs) have demonstrated strong capabilities in multimodal understanding and generation tasks. However, their application to long video understanding remains hindered by the quadratic complexity of standard attention mechanisms. In this work, we introduce PEVLM, a fine-tuning-free parallel encoding method designed to enhance the prefilling efficiency of VLMs in long…

    Submitted 7 July, 2025; v1 submitted 24 June, 2025; originally announced June 2025.

  48. arXiv:2506.19424  [pdf, ps, other]

    cs.RO

    Ground-Effect-Aware Modeling and Control for Multicopters

    Authors: Tiankai Yang, Kaixin Chai, Jialin Ji, Yuze Wu, Chao Xu, Fei Gao

    Abstract: The ground effect on multicopters introduces several challenges, such as control errors caused by additional lift, oscillations that may occur during near-ground flight due to external torques, and the influence of ground airflow on models such as the rotor drag and the mixing matrix. This article collects and analyzes the dynamics data of near-ground multicopter flight through various methods, in…

    Submitted 24 June, 2025; originally announced June 2025.

  49. arXiv:2506.19287  [pdf, ps, other]

    cs.SE

    Generating and Understanding Tests via Path-Aware Symbolic Execution with LLMs

    Authors: Yaoxuan Wu, Xiaojie Zhou, Ahmad Humayun, Muhammad Ali Gulzar, Miryung Kim

    Abstract: Symbolic execution is a widely used technique for test generation, offering systematic exploration of program paths through constraint solving. However, it is fundamentally constrained by the capability to model the target code including library functions in terms of symbolic constraint and the capability of underlying constraint solvers. As a result, many paths involving complex features remain u…

    Submitted 23 June, 2025; originally announced June 2025.

  50. arXiv:2506.19283  [pdf, ps, other]

    cs.CV cs.AI cs.RO

    AirV2X: Unified Air-Ground Vehicle-to-Everything Collaboration

    Authors: Xiangbo Gao, Yuheng Wu, Fengze Yang, Xuewen Luo, Keshu Wu, Xinghao Chen, Yuping Wang, Chenxi Liu, Yang Zhou, Zhengzhong Tu

    Abstract: While multi-vehicular collaborative driving demonstrates clear advantages over single-vehicle autonomy, traditional infrastructure-based V2X systems remain constrained by substantial deployment costs and the creation of "uncovered danger zones" in rural and suburban areas. We present AirV2X-Perception, a large-scale dataset that leverages Unmanned Aerial Vehicles (UAVs) as a flexible alternative o…

    Submitted 2 July, 2025; v1 submitted 23 June, 2025; originally announced June 2025.