Showing 1–50 of 38,288 results for author: Yang

Searching in archive cs.
  1. arXiv:2507.02813  [pdf, ps, other]

    cs.CV

    LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion

    Authors: Fangfu Liu, Hao Li, Jiawei Chi, Hanyang Wang, Minghui Yang, Fudong Wang, Yueqi Duan

    Abstract: Recovering 3D structures with open-vocabulary scene understanding from 2D images is a fundamental but daunting task. Recent developments have achieved this by performing per-scene optimization with embedded language information. However, they heavily rely on the calibrated dense-view reconstruction paradigm, thereby suffering from severe rendering artifacts and implausible semantic synthesis when…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: Project page: https://liuff19.github.io/LangScene-X

  2. arXiv:2507.02804  [pdf, ps, other]

    cs.CL

    Multimodal Mathematical Reasoning with Diverse Solving Perspective

    Authors: Wenhao Shi, Zhiqiang Hu, Yi Bin, Yang Yang, See-Kiong Ng, Heng Tao Shen

    Abstract: Recent progress in large-scale reinforcement learning (RL) has notably enhanced the reasoning capabilities of large language models (LLMs), especially in mathematical domains. However, current multimodal LLMs (MLLMs) for mathematical reasoning often rely on one-to-one image-text pairs and single-solution supervision, overlooking the diversity of valid reasoning perspectives and internal reflection…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: 8 pages

  3. arXiv:2507.02798  [pdf, ps, other]

    cs.CV

    No time to train! Training-Free Reference-Based Instance Segmentation

    Authors: Miguel Espinosa, Chenhongyi Yang, Linus Ericsson, Steven McDonagh, Elliot J. Crowley

    Abstract: The performance of image segmentation models has historically been constrained by the high cost of collecting large-scale annotated data. The Segment Anything Model (SAM) alleviates this original problem through a promptable, semantics-agnostic, segmentation paradigm and yet still requires manual visual-prompts or complex domain-dependent prompt-generation rules to process a new image. Towards red…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: Preprint

  4. arXiv:2507.02790  [pdf, ps, other]

    cs.CV cs.CL

    From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding

    Authors: Xiangfeng Wang, Xiao Li, Yadong Wei, Xueyu Song, Yang Song, Xiaoqiang Xia, Fangrui Zeng, Zaiyi Chen, Liu Liu, Gu Xu, Tong Xu

    Abstract: The rapid growth of online video content, especially on short video platforms, has created a growing demand for efficient video editing techniques that can condense long-form videos into concise and engaging clips. Existing automatic editing methods predominantly rely on textual cues from ASR transcripts and end-to-end segment selection, often neglecting the rich visual context and leading to inco…

    Submitted 3 July, 2025; originally announced July 2025.

  5. arXiv:2507.02773  [pdf, ps, other]

    cs.AI cs.LG cs.MA

    KERAP: A Knowledge-Enhanced Reasoning Approach for Accurate Zero-shot Diagnosis Prediction Using Multi-agent LLMs

    Authors: Yuzhang Xie, Hejie Cui, Ziyang Zhang, Jiaying Lu, Kai Shu, Fadi Nahab, Xiao Hu, Carl Yang

    Abstract: Medical diagnosis prediction plays a critical role in disease detection and personalized healthcare. While machine learning (ML) models have been widely adopted for this task, their reliance on supervised training limits their ability to generalize to unseen cases, particularly given the high cost of acquiring large, labeled datasets. Large language models (LLMs) have shown promise in leveraging l…

    Submitted 3 July, 2025; originally announced July 2025.

    Journal ref: American Medical Informatics Association (AMIA) 2025 Annual Symposium, Oral

  6. arXiv:2507.02768  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment

    Authors: Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang, Chih-Kai Yang, Chee-En Yu, Chun-Wei Chen, Wei-Chih Chen, Chien-yu Huang, Yi-Cheng Lin, Yu-Xiang Lin, Chi-An Fu, Chun-Yi Kuan, Wenze Ren, Xuanjun Chen, Wei-Ping Huang, En-Pei Hu, Tzu-Quan Lin, Yuan-Kuei Wu, Kuan-Po Huang, Hsiao-Ying Huang, Huang-Cheng Chou, Kai-Wei Chang, Cheng-Han Chiang, et al. (3 additional authors not shown)

    Abstract: We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following, without requiring task-specific audio instruction-tuning. Recent LALMs typically augment Large Language Models (LLMs) with auditory capabilities by training on large-scale, manually curated or LLM-synthesized audio-instruction datasets. However, these…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: Model and code available at: https://github.com/kehanlu/DeSTA2.5-Audio

  7. arXiv:2507.02751  [pdf, ps, other]

    cs.CV

    Partial Weakly-Supervised Oriented Object Detection

    Authors: Mingxin Liu, Peiyuan Zhang, Yuan Liu, Wei Zhang, Yue Zhou, Ning Liao, Ziyang Gong, Junwei Luo, Zhirui Wang, Yi Yu, Xue Yang

    Abstract: The growing demand for oriented object detection (OOD) across various domains has driven significant research in this area. However, the high cost of dataset annotation remains a major concern. Current mainstream OOD algorithms can be mainly categorized into three types: (1) fully supervised methods using complete oriented bounding box (OBB) annotations, (2) semi-supervised methods using partial O…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: 10 pages, 5 figures, 4 tables, source code: https://github.com/VisionXLab/PWOOD

  8. arXiv:2507.02731  [pdf, ps, other]

    cs.IT eess.SP

    RIS-Aided Cooperative ISAC Networks for Structural Health Monitoring

    Authors: Jie Yang, Chao-Kai Wen, Xiao Li, Shi Jin

    Abstract: Integrated sensing and communication (ISAC) is a key feature of future cellular systems, enabling applications such as intruder detection, monitoring, and tracking using the same infrastructure. However, its potential for structural health monitoring (SHM), which requires the detection of slow and subtle structural changes, remains largely unexplored due to challenges such as multipath interferenc…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: This work has been submitted to the IEEE for possible publication

  9. arXiv:2507.02713  [pdf, ps, other]

    cs.CV

    UniMC: Taming Diffusion Transformer for Unified Keypoint-Guided Multi-Class Image Generation

    Authors: Qin Guo, Ailing Zeng, Dongxu Yue, Ceyuan Yang, Yang Cao, Hanzhong Guo, Fei Shen, Wei Liu, Xihui Liu, Dan Xu

    Abstract: Although significant advancements have been achieved in the progress of keypoint-guided Text-to-Image diffusion models, existing mainstream keypoint-guided models encounter challenges in controlling the generation of more general non-rigid objects beyond humans (e.g., animals). Moreover, it is difficult to generate multiple overlapping humans and animals based on keypoint controls solely. These ch…

    Submitted 3 July, 2025; originally announced July 2025.

  10. arXiv:2507.02675  [pdf, ps, other]

    cs.GT

    TUC-PPO: Team Utility-Constrained Proximal Policy Optimization for Spatial Public Goods Games

    Authors: Zhaoqilin Yang, Xin Wang, Ruichen Zhang, Chanchan Li, Youliang Tian

    Abstract: We introduce Team Utility-Constrained Proximal Policy Optimization (TUC-PPO), a new deep reinforcement learning framework. It extends Proximal Policy Optimization (PPO) by integrating team welfare objectives specifically for spatial public goods games. Unlike conventional approaches where cooperation emerges indirectly from individual rewards, TUC-PPO instead optimizes a bi-level objective integra…

    Submitted 3 July, 2025; originally announced July 2025.

  11. arXiv:2507.02652  [pdf, ps, other]

    cs.AI cs.CL cs.IR

    Decoupled Planning and Execution: A Hierarchical Reasoning Framework for Deep Search

    Authors: Jiajie Jin, Xiaoxi Li, Guanting Dong, Yuyao Zhang, Yutao Zhu, Yang Zhao, Hongjin Qian, Zhicheng Dou

    Abstract: Complex information needs in real-world search scenarios demand deep reasoning and knowledge synthesis across diverse sources, which traditional retrieval-augmented generation (RAG) pipelines struggle to address effectively. Current reasoning-based approaches suffer from a fundamental limitation: they use a single model to handle both high-level planning and detailed execution, leading to ineffici…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: 9 pages

  12. arXiv:2507.02600  [pdf, ps, other]

    cs.RO

    ArtGS: 3D Gaussian Splatting for Interactive Visual-Physical Modeling and Manipulation of Articulated Objects

    Authors: Qiaojun Yu, Xibin Yuan, Yu jiang, Junting Chen, Dongzhe Zheng, Ce Hao, Yang You, Yixing Chen, Yao Mu, Liu Liu, Cewu Lu

    Abstract: Articulated object manipulation remains a critical challenge in robotics due to the complex kinematic constraints and the limited physical reasoning of existing methods. In this work, we introduce ArtGS, a novel framework that extends 3D Gaussian Splatting (3DGS) by integrating visual-physical modeling for articulated object understanding and interaction. ArtGS begins with multi-view RGB-D reconst…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: Accepted by IROS 2025

  13. arXiv:2507.02576  [pdf, ps, other]

    cs.CV

    Parametric shape models for vessels learned from segmentations via differentiable voxelization

    Authors: Alina F. Dima, Suprosanna Shit, Huaqi Qiu, Robbie Holland, Tamara T. Mueller, Fabio Antonio Musio, Kaiyuan Yang, Bjoern Menze, Rickmer Braren, Marcus Makowski, Daniel Rueckert

    Abstract: Vessels are complex structures in the body that have been studied extensively in multiple representations. While voxelization is the most common of them, meshes and parametric models are critical in various applications due to their desirable properties. However, these representations are typically extracted through segmentations and used disjointly from each other. We propose a framework that joi…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: 15 pages, 6 figures

  14. arXiv:2507.02546  [pdf, ps, other]

    cs.CV

    MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

    Authors: Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, Jiaolong Yang

    Abstract: We propose MoGe-2, an advanced open-domain geometry estimation model that recovers a metric scale 3D point map of a scene from a single image. Our method builds upon the recent monocular geometry estimation approach, MoGe, which predicts affine-invariant point maps with unknown scales. We explore effective strategies to extend MoGe for metric geometry prediction without compromising the relative g…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: Project page: https://wangrc.site/MoGe2Page/

  15. arXiv:2507.02541  [pdf, ps, other]

    cs.AI

    Clarifying Before Reasoning: A Coq Prover with Structural Context

    Authors: Yanzhen Lu, Hanbin Yang, Xiaodie Wang, Ge Zhang, Biao Li, Chenxu Fu, Chao Li, Yang Yuan, Andrew Chi-Chih Yao

    Abstract: In this work, we investigate whether improving task clarity can enhance reasoning ability of large language models, focusing on theorem proving in Coq. We introduce a concept-level metric to evaluate task clarity and show that adding structured semantic context to the standard input used by modern LLMs leads to a 1.85× improvement in clarity score (44.5% → 82.3%). Using the g…

    Submitted 3 July, 2025; originally announced July 2025.

  16. arXiv:2507.02437  [pdf, ps, other]

    cs.CV eess.IV

    F^2TTA: Free-Form Test-Time Adaptation on Cross-Domain Medical Image Classification via Image-Level Disentangled Prompt Tuning

    Authors: Wei Li, Jingyang Zhang, Lihao Liu, Guoan Wang, Junjun He, Yang Chen, Lixu Gu

    Abstract: Test-Time Adaptation (TTA) has emerged as a promising solution for adapting a source model to unseen medical sites using unlabeled test data, due to the high cost of data annotation. Existing TTA methods consider scenarios where data from one or multiple domains arrives in complete domain units. However, in clinical practice, data usually arrives in domain fragments of arbitrary lengths and in ran…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: This paper has been submitted to relevant journals

  17. arXiv:2507.02433  [pdf, ps, other]

    cs.DS

    Numerical Linear Algebra in Linear Space

    Authors: Yiping Liu, Hoai-An Nguyen, Junzhao Yang

    Abstract: We present a randomized linear-space solver for general linear systems $\mathbf{A} \mathbf{x} = \mathbf{b}$ with $\mathbf{A} \in \mathbb{Z}^{n \times n}$ and $\mathbf{b} \in \mathbb{Z}^n$, without any assumption on the condition number of $\mathbf{A}$. For matrices whose entries are bounded by $\mathrm{poly}(n)$, the solver returns a $(1+ε)$-multiplicative entry-wise approximation to vector…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: 52 pages, 0 figures

  18. arXiv:2507.02379  [pdf]

    cs.AI q-bio.BM

    An AI-native experimental laboratory for autonomous biomolecular engineering

    Authors: Mingyu Wu, Zhaoguo Wang, Jiabin Wang, Zhiyuan Dong, Jingkai Yang, Qingting Li, Tianyu Huang, Lei Zhao, Mingqiang Li, Fei Wang, Chunhai Fan, Haibo Chen

    Abstract: Autonomous scientific research, capable of independently conducting complex experiments and serving non-specialists, represents a long-held aspiration. Achieving it requires a fundamental paradigm shift driven by artificial intelligence (AI). While autonomous experimental systems are emerging, they remain confined to areas featuring singular objectives and well-defined, simple experimental workflo…

    Submitted 3 July, 2025; originally announced July 2025.

  19. arXiv:2507.02373  [pdf, ps, other]

    cs.CV

    UVLM: Benchmarking Video Language Model for Underwater World Understanding

    Authors: Xizhe Xue, Yang Zhou, Dawei Yan, Ying Li, Haokui Zhang, Rong Xiao

    Abstract: Recently, the remarkable success of large language models (LLMs) has had a profound impact on the field of artificial intelligence. Numerous advanced works based on LLMs have been proposed and applied in various scenarios. Among them, video language models (VidLMs) are particularly widely used. However, existing works primarily focus on terrestrial scenarios, overlooking the highly demanding…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: 13 pages, 4 figures, 3 tables

  20. arXiv:2507.02363  [pdf, ps, other]

    cs.CV

    LocalDyGS: Multi-view Global Dynamic Scene Modeling via Adaptive Local Implicit Feature Decoupling

    Authors: Jiahao Wu, Rui Peng, Jianbo Jiao, Jiayu Yang, Luyang Tang, Kaiqiang Xiong, Jie Liang, Jinbo Yan, Runling Liu, Ronggang Wang

    Abstract: Due to the complex and highly dynamic motions in the real world, synthesizing dynamic videos from multi-view inputs for arbitrary viewpoints is challenging. Previous works based on neural radiance field or 3D Gaussian splatting are limited to modeling fine-scale motion, greatly restricting their application. In this paper, we introduce LocalDyGS, which consists of two parts to adapt our method to…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: Accepted by ICCV 2025

  21. arXiv:2507.02342  [pdf, ps, other]

    cs.LG cs.AI

    DeltaSHAP: Explaining Prediction Evolutions in Online Patient Monitoring with Shapley Values

    Authors: Changhun Kim, Yechan Mun, Sangchul Hahn, Eunho Yang

    Abstract: This study proposes DeltaSHAP, a novel explainable artificial intelligence (XAI) algorithm specifically designed for online patient monitoring systems. In clinical environments, discovering the causes driving patient risk evolution is critical for timely intervention, yet existing XAI methods fail to address the unique requirements of clinical time series explanation tasks. To this end, DeltaSHAP…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: Accepted to ICML 2025 Workshop on Actionable Interpretability. Code is available at https://github.com/AITRICS/DeltaSHAP

  22. arXiv:2507.02318  [pdf, ps, other]

    cs.SE

    Precisely Detecting Python Type Errors via LLM-based Unit Test Generation

    Authors: Chen Yang, Ziqi Wang, Yanjie Jiang, Lin Yang, Yuteng Zheng, Jianyi Zhou, Junjie Chen

    Abstract: Type errors in Python often lead to runtime failures, posing significant challenges to software reliability and developer productivity. Existing static analysis tools aim to detect such errors without execution but frequently suffer from high false positive rates. Recently, unit test generation techniques offer great promise in achieving high test coverage, but they often struggle to produce bug-r…

    Submitted 3 July, 2025; originally announced July 2025.

  23. arXiv:2507.02299  [pdf, ps, other]

    cs.CV

    DreamComposer++: Empowering Diffusion Models with Multi-View Conditions for 3D Content Generation

    Authors: Yunhan Yang, Shuo Chen, Yukun Huang, Xiaoyang Wu, Yuan-Chen Guo, Edmund Y. Lam, Hengshuang Zhao, Tong He, Xihui Liu

    Abstract: Recent advancements in leveraging pre-trained 2D diffusion models achieve the generation of high-quality novel views from a single in-the-wild image. However, existing works face challenges in producing controllable novel views due to the lack of information from multiple views. In this paper, we present DreamComposer++, a flexible and scalable framework designed to improve current view-aware diff…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: Accepted by TPAMI, extension of CVPR 2024 paper DreamComposer

  24. arXiv:2507.02289  [pdf, ps, other]

    eess.IV cs.CV

    CineMyoPS: Segmenting Myocardial Pathologies from Cine Cardiac MR

    Authors: Wangbin Ding, Lei Li, Junyi Qiu, Bogen Lin, Mingjing Yang, Liqin Huang, Lianming Wu, Sihan Wang, Xiahai Zhuang

    Abstract: Myocardial infarction (MI) is a leading cause of death worldwide. Late gadolinium enhancement (LGE) and T2-weighted cardiac magnetic resonance (CMR) imaging can respectively identify scarring and edema areas, both of which are essential for MI risk stratification and prognosis assessment. Although combining complementary information from multi-sequence CMR is useful, acquiring these sequences can…

    Submitted 2 July, 2025; originally announced July 2025.

  25. arXiv:2507.02273  [pdf, ps, other]

    cs.SD eess.AS

    Fx-Encoder++: Extracting Instrument-Wise Audio Effects Representations from Mixtures

    Authors: Yen-Tung Yeh, Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Yi-Hsuan Yang, Yuki Mitsufuji

    Abstract: General-purpose audio representations have proven effective across diverse music information retrieval applications, yet their utility in intelligent music production remains limited by insufficient understanding of audio effects (Fx). Although previous approaches have emphasized audio effects analysis at the mixture level, this focus falls short for tasks demanding instrument-wise audio effects u…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: ISMIR 2025

  26. arXiv:2507.02256  [pdf, ps, other]

    cs.LG cs.RO

    Uncertainty-aware Reward Design Process

    Authors: Yang Yang, Xiaolu Zhou, Bosong Ding, Miao Xin

    Abstract: Designing effective reward functions is a cornerstone of reinforcement learning (RL), yet it remains a challenging process due to the inefficiencies and inconsistencies inherent in conventional reward engineering methodologies. Recent advances have explored leveraging large language models (LLMs) to automate reward function design. However, their suboptimal performance in numerical optimization of…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: 34 pages, 9 figures

  27. arXiv:2507.02255  [pdf, ps, other]

    cs.IR cs.LG

    Listwise Preference Alignment Optimization for Tail Item Recommendation

    Authors: Zihao Li, Chao Yang, Tong Zhang, Yakun Chen, Xianzhi Wang, Guandong Xu, Daoyi Dong

    Abstract: Preference alignment has achieved great success on Large Language Models (LLMs) and drawn broad interest in recommendation research. Existing preference alignment methods for recommendation either require explicit reward modeling or only support pairwise preference comparison. The former substantially increases computational costs, while the latter hinders training efficiency on negative…

    Submitted 2 July, 2025; originally announced July 2025.

  28. arXiv:2507.02245  [pdf, ps, other]

    cs.RO

    CoInfra: A Large-Scale Cooperative Infrastructure Perception System and Dataset in Adverse Weather

    Authors: Minghao Ning, Yufeng Yang, Keqi Shu, Shucheng Huang, Jiaming Zhong, Maryam Salehi, Mahdi Rahmani, Yukun Lu, Chen Sun, Aladdin Saleh, Ehsan Hashemi, Amir Khajepour

    Abstract: We present CoInfra, a large-scale cooperative infrastructure perception system and dataset designed to advance robust multi-agent perception under real-world and adverse weather conditions. The CoInfra system includes 14 fully synchronized sensor nodes, each equipped with dual RGB cameras and a LiDAR, deployed across a shared region and operating continuously to capture all traffic participants in…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: This paper has been submitted to the IEEE Transactions on Robotics for review

  29. arXiv:2507.02199  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Latent Chain-of-Thought? Decoding the Depth-Recurrent Transformer

    Authors: Wenquan Lu, Yuechuan Yang, Kyle Lee, Yanshu Li, Enqi Liu

    Abstract: Chain-of-thought (CoT) reasoning has enabled transformer-based language models to excel at complex mathematics and multi-step planning. However, in standard decoder-only architectures, these reasoning steps are externalized in natural language, improving interpretability at the cost of efficiency. To capture reasoning that is not easily represented in words, many works have explored recurrent arch…

    Submitted 2 July, 2025; originally announced July 2025.

  30. ARMOUR US: Android Runtime Zero-permission Sensor Usage Monitoring from User Space

    Authors: Yan Long, Jiancong Cui, Yuqing Yang, Tobias Alam, Zhiqiang Lin, Kevin Fu

    Abstract: This work investigates how to monitor access to Android zero-permission sensors which could cause privacy leakage to users. Moreover, monitoring such sensitive access allows security researchers to characterize potential sensor abuse patterns. Zero-permission sensors such as accelerometers have become an indispensable part of Android devices. The critical information they provide has attracted ext…

    Submitted 2 July, 2025; originally announced July 2025.

    ACM Class: K.6.5; D.4.6

    Journal ref: WiSec 2025: 18th ACM Conference on Security and Privacy in Wireless and Mobile Networks

  31. arXiv:2507.02145  [pdf, ps, other]

    cs.CL cs.AI

    Reasoning or Not? A Comprehensive Evaluation of Reasoning LLMs for Dialogue Summarization

    Authors: Keyan Jin, Yapeng Wang, Leonel Santos, Tao Fang, Xu Yang, Sio Kei Im, Hugo Gonçalo Oliveira

    Abstract: Dialogue summarization is a challenging task with significant practical value in customer service, meeting analysis, and conversational AI. Although large language models (LLMs) have achieved substantial progress in summarization tasks, the performance of step-by-step reasoning architectures, specifically Long Chain-of-Thought (CoT) implementations such as OpenAI-o1 and DeepSeek-R1, remains unexplor…

    Submitted 2 July, 2025; originally announced July 2025.

  32. arXiv:2507.02089  [pdf, ps, other]

    cs.LG stat.ML

    Sample Complexity Bounds for Linear Constrained MDPs with a Generative Model

    Authors: Xingtu Liu, Lin F. Yang, Sharan Vaswani

    Abstract: We consider infinite-horizon $γ$-discounted (linear) constrained Markov decision processes (CMDPs) where the objective is to find a policy that maximizes the expected cumulative reward subject to expected cumulative constraints. Given access to a generative model, we propose to solve CMDPs with a primal-dual framework that can leverage any black-box unconstrained MDP solver. For linear CMDPs with…

    Submitted 2 July, 2025; originally announced July 2025.

  33. arXiv:2507.02029  [pdf, ps, other]

    cs.RO

    RoboBrain 2.0 Technical Report

    Authors: BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, Yi Han, Yingbo Tang, Xiangqi Xu, Wei Guo, Yaoxu Lyu, Yijie Xu, Jiayu Shi, Cheng Chi, Mengdi Zhao, Xiaoshuai Hao, Shanyu Rong, Zhengliang Cai, Bolun Zhang, Shuyi Zhang, Huaihai Lyu, Mengfei Du, et al. (21 additional authors not shown)

    Abstract: We introduce RoboBrain 2.0, our latest generation of embodied vision-language foundation models, designed to unify perception, reasoning, and planning for complex embodied tasks in physical environments. It comes in two variants: a lightweight 7B model and a full-scale 32B model, featuring a heterogeneous architecture with a vision encoder and a language model. Despite its compact size, RoboBrain…

    Submitted 2 July, 2025; originally announced July 2025.

  34. arXiv:2507.02013  [pdf, ps, other]

    cs.NI eess.SP

    AI-Empowered Channel Generation for IoV Semantic Communications in Dynamic Conditions

    Authors: Hao Liu, Bo Yang, Zhiwen Yu, Xuelin Cao, George C. Alexandropoulos, Yan Zhang, Chau Yuen

    Abstract: The Internet of Vehicles (IoV) transforms the transportation ecosystem, promising pervasive connectivity and data-driven approaches. Deep learning and generative Artificial Intelligence (AI) have the potential to significantly enhance the operation of applications within IoV by facilitating efficient decision-making and predictive capabilities, including intelligent navigation, vehicle safety monit…

    Submitted 2 July, 2025; originally announced July 2025.

  35. arXiv:2507.02008  [pdf, ps, other]

    cs.LO

    SMT-Sweep: Word-Level Representation Unification for Hardware Verification

    Authors: Ziyi Yang, Guangyu Hu, Mingkai Miao, Changyuan Yu, Hongce Zhang

    Abstract: SAT sweeping has long been a cornerstone technique in logic simplification and equivalence checking at the bit level, leveraging structural hashing, simulation and SAT solving to prune redundant logic. However, with the growing adoption of word-level constructs in hardware verification, such as bit-vector operations, arithmetics and arrays, there lacks a counterpart of SAT sweeping at the word lev…

    Submitted 1 July, 2025; originally announced July 2025.

  36. arXiv:2507.01957  [pdf, ps, other]

    cs.CV cs.AI

    Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation

    Authors: Zhuoyang Zhang, Luke J. Huang, Chengyue Wu, Shang Yang, Kelly Peng, Yao Lu, Song Han

    Abstract: We present Locality-aware Parallel Decoding (LPD) to accelerate autoregressive image generation. Traditional autoregressive image generation relies on next-patch prediction, a memory-bound process that leads to high latency. Existing works have tried to parallelize next-patch prediction by shifting to multi-patch prediction to accelerate the process, but only achieved limited parallelization. To a…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: The first two authors contributed equally to this work

  37. arXiv:2507.01949  [pdf, ps, other]

    cs.CV

    Kwai Keye-VL Technical Report

    Authors: Kwai Keye Team, Biao Yang, Bin Wen, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Hao Peng, Haojie Ding, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Jin Ouyang, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Shengnan Zhang, Siyang Mao, et al. (35 additional authors not shown)

    Abstract: While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities on static images, they often fall short in comprehending dynamic, information-dense short-form videos, a dominant medium in today's digital landscape. To bridge this gap, we introduce Kwai Keye-VL, an 8-billion-parameter multimodal foundation model engineered for leading-edge performance in short-video unde…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Technical Report: https://github.com/Kwai-Keye/Keye

  38. arXiv:2507.01939  [pdf, ps, other]

    astro-ph.IM astro-ph.SR cs.AI cs.LG

    SpecCLIP: Aligning and Translating Spectroscopic Measurements for Stars

    Authors: Xiaosheng Zhao, Yang Huang, Guirong Xue, Xiao Kong, Jifeng Liu, Xiaoyu Tang, Timothy C. Beers, Yuan-Sen Ting, A-Li Luo

    Abstract: In recent years, large language models (LLMs) have transformed natural language understanding through vast datasets and large-scale parameterization. Inspired by this success, we present SpecCLIP, a foundation model framework that extends LLM-inspired methodologies to stellar spectral analysis. Stellar spectra, akin to structured language, encode rich physical and chemical information about stars.…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: 26 pages, 6 figures, 5 tables. To be submitted to AAS Journals. Comments welcome

  39. arXiv:2507.01925  [pdf, ps, other]

    cs.RO

    A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

    Authors: Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, Zhiquan Qi, Yitao Liang, Yuanpei Chen, Yaodong Yang

    Abstract: The remarkable advancements of vision and language foundation models in multimodal understanding, reasoning, and generation have sparked growing efforts to extend such intelligence to the physical world, fueling the flourishing of vision-language-action (VLA) models. Despite seemingly diverse approaches, we observe that current VLA models can be unified under a single framework: vision and language…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: 70 pages, 5 figures

  40. arXiv:2507.01921  [pdf, ps, other]

    cs.CL

    NaturalThoughts: Selecting and Distilling Reasoning Traces for General Reasoning Tasks

    Authors: Yang Li, Youssef Emad, Karthik Padthe, Jack Lanchantin, Weizhe Yuan, Thao Nguyen, Jason Weston, Shang-Wen Li, Dong Wang, Ilia Kulikov, Xian Li

    Abstract: Recent work has shown that distilling reasoning traces from a larger teacher model via supervised finetuning outperforms reinforcement learning with the smaller student model alone (Guo et al. 2025). However, there has not been a systematic study of what kind of reasoning demonstrations from the teacher are most effective in improving the student model's reasoning capabilities. In this work we cur…

    Submitted 2 July, 2025; originally announced July 2025.

  41. arXiv:2507.01903  [pdf, ps, other]

    cs.CL cs.AI

    AI4Research: A Survey of Artificial Intelligence for Scientific Research

    Authors: Qiguang Chen, Mingda Yang, Libo Qin, Jinhao Liu, Zheng Yan, Jiannan Guan, Dengyun Peng, Yiyan Ji, Hanjing Li, Mengkang Hu, Yimeng Zhang, Yihao Liang, Yuhang Zhou, Jiaqi Wang, Zhi Chen, Wanxiang Che

    Abstract: Recent advancements in artificial intelligence (AI), particularly in large language models (LLMs) such as OpenAI-o1 and DeepSeek-R1, have demonstrated remarkable capabilities in complex domains such as logical reasoning and experimental coding. Motivated by these advancements, numerous studies have explored the application of AI in the innovation process, particularly in the context of scientific…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Preprint

  42. arXiv:2507.01776  [pdf, ps, other]

    cs.HC cs.MM

    Human-Machine Collaboration-Guided Space Design: Combination of Machine Learning Models and Humanistic Design Concepts

    Authors: Yuxuan Yang

    Abstract: The integration of machine learning (ML) into spatial design holds immense potential for optimizing space utilization, enhancing functionality, and streamlining design processes. ML can automate tasks, predict performance outcomes, and tailor spaces to user preferences. However, the emotional, cultural, and aesthetic dimensions of design remain crucial for creating spaces that truly resonate with…

    Submitted 2 July, 2025; originally announced July 2025.

  43. arXiv:2507.01773  [pdf, ps, other]

    cs.NI

    Frontiers of Generative AI for Network Optimization: Theories, Limits, and Visions

    Authors: Bo Yang, Ruihuai Liang, Weixin Li, Han Wang, Xuelin Cao, Zhiwen Yu, Samson Lasaulce, Mérouane Debbah, Mohamed-Slim Alouini, H. Vincent Poor, Chau Yuen

    Abstract: While interest in the application of generative AI (GenAI) in network optimization has surged in recent years, its rapid progress has often overshadowed critical limitations intrinsic to generative models that remain insufficiently examined in existing literature. This survey provides a comprehensive review and critical analysis of GenAI in network optimization. We focus on the two dominant paradi…

    Submitted 2 July, 2025; originally announced July 2025.

  44. arXiv:2507.01738  [pdf, ps, other]

    cs.CV

    DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy

    Authors: Ming Dai, Wenxuan Cheng, Jiang-jiang Liu, Sen Yang, Wenxiao Cai, Yanpeng Sun, Wankou Yang

    Abstract: Referring Image Segmentation (RIS) is a challenging task that aims to segment objects in an image based on natural language expressions. While prior studies have predominantly concentrated on improving vision-language interactions and achieving fine-grained localization, a systematic analysis of the fundamental bottlenecks in existing RIS frameworks remains underexplored. To bridge this gap, we pr…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: ICCV 2025

  45. arXiv:2507.01697  [pdf, ps, other]

    cs.RO math.OC

    An RRT* algorithm based on Riemannian metric model for optimal path planning

    Authors: Yu Zhang, Qi Zhou, Xiao-Song Yang

    Abstract: This paper presents a Riemannian metric-based model to solve the optimal path planning problem on two-dimensional smooth submanifolds in high-dimensional space. Our model is based on constructing a new Riemannian metric on a two-dimensional projection plane, which is induced by the high-dimensional Euclidean metric on the two-dimensional smooth submanifold and reflects the environmental information of…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: 27 pages

    MSC Class: 00A69; 93C85; 14H55 ACM Class: I.2.9

  46. arXiv:2507.01685  [pdf, ps, other]

    cs.IT

    Half Spatially Coupled Turbo-Like Codes

    Authors: Xiaowei Wu, Lei Yang, Min Qiu, Chong Han, Jinhong Yuan

    Abstract: This paper presents a new class of spatially coupled turbo-like codes (SC-TCs), namely half spatially coupled braided convolutional codes (HSC-BCCs) and half spatially coupled parallel concatenated codes (HSC-PCCs). Different from the conventional SC-TCs, the proposed codes have simpler and deterministic coupling structures. Most notably, the coupling of HSC-BCCs is performed by re-encoding the wh…

    Submitted 2 July, 2025; v1 submitted 2 July, 2025; originally announced July 2025.

    Comments: This is an extended version of conference paper "Half Spatially Coupled Turbo-Like Codes" accepted by 2025 IEEE Information Theory Workshop

  47. arXiv:2507.01663  [pdf, ps, other]

    cs.LG cs.AI

    AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training

    Authors: Zhenyu Han, Ansheng You, Haibo Wang, Kui Luo, Guang Yang, Wenqi Shi, Menglong Chen, Sicheng Zhang, Zeshun Lan, Chunshi Deng, Huazhong Ji, Wenjie Liu, Yu Huang, Yixiang Zhang, Chenyi Pan, Jing Wang, Xin Huang, Chunsheng Li, Jianping Wu

    Abstract: Reinforcement learning (RL) has become a pivotal technology in the post-training phase of large language models (LLMs). Traditional task-colocated RL frameworks suffer from significant scalability bottlenecks, while task-separated RL frameworks face challenges in complex dataflows and the corresponding resource idling and workload imbalance. Moreover, most existing frameworks are tightly coupled w…

    Submitted 2 July, 2025; originally announced July 2025.

  48. arXiv:2507.01643  [pdf, ps, other]

    cs.CV

    SAILViT: Towards Robust and Generalizable Visual Backbones for MLLMs via Gradual Feature Refinement

    Authors: Weijie Yin, Dingkang Yang, Hongyuan Dong, Zijian Kang, Jiacong Wang, Xiao Liang, Chao Feng, Jiao Ran

    Abstract: Vision Transformers (ViTs) are essential as foundation backbones in establishing the visual comprehension capabilities of Multimodal Large Language Models (MLLMs). Although most ViTs achieve impressive performance through image-text pair-based contrastive learning or self-supervised mechanisms, they struggle to engage in connector-based co-training directly with LLMs due to potential parameter ini…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: We release SAILViT, a series of versatile vision foundation models

  49. arXiv:2507.01582  [pdf, ps, other]

    cs.SD cs.AI cs.MM eess.AS

    Exploring Classical Piano Performance Generation with Expressive Music Variational AutoEncoder

    Authors: Jing Luo, Xinyu Yang, Jie Wei

    Abstract: The creativity of classical music arises not only from composers who craft the musical sheets but also from performers who interpret the static notations with expressive nuances. This paper addresses the challenge of generating classical piano performances from scratch, aiming to emulate the dual roles of composer and pianist in the creative process. We introduce the Expressive Compound Word (ECP)…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Accepted by IEEE SMC 2025

  50. arXiv:2507.01551  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    Self-Guided Process Reward Optimization with Redefined Step-wise Advantage for Process Reinforcement Learning

    Authors: Wu Fei, Hao Kong, Shuxian Liang, Yang Lin, Yibo Yang, Jing Tang, Lei Chen, Xiansheng Hua

    Abstract: Process Reinforcement Learning (PRL) has demonstrated considerable potential in enhancing the reasoning capabilities of Large Language Models (LLMs). However, introducing additional process reward models incurs substantial computational overhead, and there is no unified theoretical framework for process-level advantage estimation. To bridge this gap, we propose Self-Guided Proces…

    Submitted 3 July, 2025; v1 submitted 2 July, 2025; originally announced July 2025.