Showing 1–50 of 8,216 results for author: Wang, X

Searching in archive cs.
  1. arXiv:2507.01951

    cs.LG cs.CL

    Test-Time Scaling with Reflective Generative Model

    Authors: Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie

    Abstract: We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3's performance via the self-supervised process reward model (SPRM). Through sharing the backbone network and using task-specific heads for next token prediction and process scoring respectively, SPRM successfully integrates the policy model and process reward model (PRM) into a unified interface without extra pr…

    Submitted 2 July, 2025; originally announced July 2025.

  2. arXiv:2507.01938

    cs.CV

    CI-VID: A Coherent Interleaved Text-Video Dataset

    Authors: Yiming Ju, Jijin Hu, Zhengxiong Luo, Haoge Deng, Hanyu Zhao, Li Du, Chengwei Wu, Donglin Hao, Xinlong Wang, Tengfei Pan

    Abstract: Text-to-video (T2V) generation has recently attracted considerable attention, resulting in the development of numerous high-quality datasets that have propelled progress in this area. However, existing public datasets are primarily composed of isolated text-video (T-V) pairs and thus fail to support the modeling of coherent multi-clip video sequences. To address this limitation, we introduce CI-VI…

    Submitted 2 July, 2025; originally announced July 2025.

  3. arXiv:2507.01932

    math.OC cs.LG math.NA stat.ML

    A first-order method for nonconvex-nonconcave minimax problems under a local Kurdyka-Łojasiewicz condition

    Authors: Zhaosong Lu, Xiangyuan Wang

    Abstract: We study a class of nonconvex-nonconcave minimax problems in which the inner maximization problem satisfies a local Kurdyka-Łojasiewicz (KL) condition that may vary with the outer minimization variable. In contrast to the global KL or Polyak-Łojasiewicz (PL) conditions commonly assumed in the literature -- which are significantly stronger and often too restrictive in practice -- this local KL cond…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: 26 pages

    MSC Class: 90C26; 90C30; 90C47; 90C99; 65K05

  4. arXiv:2507.01335

    cs.CL cs.AI

    LEDOM: An Open and Fundamental Reverse Language Model

    Authors: Xunjian Yin, Sitao Cheng, Yuxi Xie, Xinyu Hu, Li Lin, Xinyi Wang, Liangming Pan, William Yang Wang, Xiaojun Wan

    Abstract: We introduce LEDOM, the first purely reverse language model, trained autoregressively on 435B tokens with 2B and 7B parameter variants, which processes sequences in reverse temporal order through previous token prediction. For the first time, we present the reverse language model as a potential foundational model across general tasks, accompanied by a set of intriguing examples and insights. Based…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: Work in progress

  5. arXiv:2507.01061

    cs.CY cs.AI cs.HC

    Epitome: Pioneering an Experimental Platform for AI-Social Science Integration

    Authors: Jingjing Qu, Kejia Hu, Jun Zhu, Wenhao Li, Teng Wang, Zhiyun Chen, Yulei Ye, Chaochao Lu, Aimin Zhou, Xiangfeng Wang, James Evan

    Abstract: The integration of Large Language Models (LLMs) into social science experiments represents a transformative approach to understanding human-AI interactions and their societal impacts. We introduce Epitome, the world's first open experimental platform dedicated to the deep integration of artificial intelligence and social science. Rooted in theoretical foundations from management, communication stu…

    Submitted 30 June, 2025; originally announced July 2025.

    Comments: 18 pages, 5 figures

  6. arXiv:2507.00886

    cs.CV cs.RO

    GaussianVLM: Scene-centric 3D Vision-Language Models using Language-aligned Gaussian Splats for Embodied Reasoning and Beyond

    Authors: Anna-Maria Halacheva, Jan-Nico Zaech, Xi Wang, Danda Pani Paudel, Luc Van Gool

    Abstract: As multimodal language models advance, their application to 3D scene understanding is a fast-growing frontier, driving the development of 3D Vision-Language Models (VLMs). Current methods show strong dependence on object detectors, introducing processing bottlenecks and limitations in taxonomic flexibility. To address these limitations, we propose a scene-centric 3D VLM for 3D Gaussian splat scene…

    Submitted 1 July, 2025; originally announced July 2025.

  7. arXiv:2507.00690

    cs.CV cs.CR

    Cage-Based Deformation for Transferable and Undefendable Point Cloud Attack

    Authors: Keke Tang, Ziyong Du, Weilong Peng, Xiaofei Wang, Peican Zhu, Ligang Liu, Zhihong Tian

    Abstract: Adversarial attacks on point clouds often impose strict geometric constraints to preserve plausibility; however, such constraints inherently limit transferability and undefendability. While deformation offers an alternative, existing unstructured approaches may introduce unnatural distortions, making adversarial point clouds conspicuous and undermining their plausibility. In this paper, we propose…

    Submitted 1 July, 2025; originally announced July 2025.

  8. arXiv:2507.00672

    cs.NI cs.DC

    Toward Edge General Intelligence with Multiple-Large Language Model (Multi-LLM): Architecture, Trust, and Orchestration

    Authors: Haoxiang Luo, Yinqiu Liu, Ruichen Zhang, Jiacheng Wang, Gang Sun, Dusit Niyato, Hongfang Yu, Zehui Xiong, Xianbin Wang, Xuemin Shen

    Abstract: Edge computing enables real-time data processing closer to its source, thus improving the latency and performance of edge-enabled AI applications. However, traditional AI models often fall short when dealing with complex, dynamic tasks that require advanced reasoning and multimodal data processing. This survey explores the integration of multi-LLMs (Large Language Models) to address this in edge c…

    Submitted 1 July, 2025; originally announced July 2025.

  9. arXiv:2507.00665

    cs.CL cs.AI

    SAFER: Probing Safety in Reward Models with Sparse Autoencoder

    Authors: Sihang Li, Wei Shi, Ziyuan Xie, Tao Liang, Guojun Ma, Xiang Wang

    Abstract: Reinforcement learning from human feedback (RLHF) is a key paradigm for aligning large language models (LLMs) with human values, yet the reward models at its core remain largely opaque. In this work, we present sparse Autoencoder For Enhanced Reward model (SAFER), a novel framework for interpreting and improving reward models through mechanistic analysis. Leveraging Sparse Autoencoders (S…

    Submitted 1 July, 2025; originally announced July 2025.

  10. arXiv:2507.00642

    cs.AR

    ChatHLS: Towards Systematic Design Automation and Optimization for High-Level Synthesis

    Authors: Runkai Li, Jia Xiong, Xiuyuan He, Jieru Zhao, Qiang Xu, Xi Wang

    Abstract: The increasing complexity of computational demands has accelerated the adoption of domain-specific accelerators, yet traditional hardware design methodologies remain constrained by prolonged development and verification cycles. High-Level Synthesis (HLS) bridges the gap between software and hardware by enabling hardware design from high-level programming languages. However, its widespread adoption…

    Submitted 1 July, 2025; originally announced July 2025.

  11. arXiv:2507.00611

    cs.LG cs.AI cs.RO

    Residual Reward Models for Preference-based Reinforcement Learning

    Authors: Chenyang Cao, Miguel Rogel-García, Mohamed Nabail, Xueqian Wang, Nicholas Rhinehart

    Abstract: Preference-based Reinforcement Learning (PbRL) provides a way to learn high-performance policies in environments where the reward signal is hard to specify, avoiding heuristic and time-consuming reward design. However, PbRL can suffer from slow convergence speed since it requires training in a reward model. Prior work has proposed learning a reward model from demonstrations and fine-tuning it usin…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: 26 pages, 22 figures

  12. arXiv:2507.00505

    cs.CV

    LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs

    Authors: Haoran Lou, Chunxiao Fan, Ziyan Liu, Yuexin Wu, Xinxiang Wang

    Abstract: The architecture of multimodal large language models (MLLMs) commonly connects a vision encoder, often based on CLIP-ViT, to a large language model. While CLIP-ViT works well for capturing global image features, it struggles to model local relationships between adjacent patches, leading to weaker visual representation, which in turn affects the detailed understanding ability of MLLMs. To solve thi…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: ICCV

  13. arXiv:2507.00487

    cs.IR cs.CL

    MassTool: A Multi-Task Search-Based Tool Retrieval Framework for Large Language Models

    Authors: Jianghao Lin, Xinyuan Wang, Xinyi Dai, Menghui Zhu, Bo Chen, Ruiming Tang, Yong Yu, Weinan Zhang

    Abstract: Tool retrieval is a critical component in enabling large language models (LLMs) to interact effectively with external tools. It aims to precisely filter the massive tools into a small set of candidates for the downstream tool-augmented LLMs. However, most existing approaches primarily focus on optimizing tool representations, often neglecting the importance of precise query comprehension. To addre…

    Submitted 2 July, 2025; v1 submitted 1 July, 2025; originally announced July 2025.

  14. arXiv:2507.00458

    eess.AS cs.SD

    Mitigating Language Mismatch in SSL-Based Speaker Anonymization

    Authors: Zhe Zhang, Wen-Chin Huang, Xin Wang, Xiaoxiao Miao, Junichi Yamagishi

    Abstract: Speaker anonymization aims to protect speaker identity while preserving content information and the intelligibility of speech. However, most speaker anonymization systems (SASs) are developed and evaluated using only English, resulting in degraded utility for other languages. This paper investigates language mismatch in SASs for Japanese and Mandarin speech. First, we fine-tune a self-supervised l…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: Accepted to Interspeech 2025

  15. arXiv:2507.00363

    cs.CV

    GDGS: 3D Gaussian Splatting Via Geometry-Guided Initialization And Dynamic Density Control

    Authors: Xingjun Wang, Lianlei Shan

    Abstract: We propose a method to enhance 3D Gaussian Splatting (3DGS) [Kerbl2023], addressing challenges in initialization, optimization, and density control. Gaussian Splatting is an alternative for rendering realistic images while supporting real-time performance, and it has gained popularity due to its explicit 3D Gaussian representation. However, 3DGS heavily depends on accurate initialization and…

    Submitted 30 June, 2025; originally announced July 2025.

  16. arXiv:2507.00045

    cs.CV cs.AI cs.CL

    CaughtCheating: Is Your MLLM a Good Cheating Detective? Exploring the Boundary of Visual Perception and Reasoning

    Authors: Ming Li, Chenguang Wang, Yijun Liang, Xiyao Wang, Yuhang Zhou, Xiyang Wu, Yuqing Zhang, Ruiyi Zhang, Tianyi Zhou

    Abstract: Recent agentic Multi-Modal Large Language Models (MLLMs) such as GPT-o3 have achieved near-ceiling scores on various existing benchmarks, motivating a demand for more challenging test tasks. These MLLMs have been reported to excel in a few expert-level tasks for humans, e.g., GeoGuesser, reflecting their potential as a detective who can notice minuscule cues in an image and weave them into coheren…

    Submitted 23 June, 2025; originally announced July 2025.

  17. arXiv:2506.24009

    cs.IT cs.AI

    Bridging Physical and Digital Worlds: Embodied Large AI for Future Wireless Systems

    Authors: Xinquan Wang, Fenghao Zhu, Zhaohui Yang, Chongwen Huang, Xiaoming Chen, Zhaoyang Zhang, Sami Muhaidat, Mérouane Debbah

    Abstract: Large artificial intelligence (AI) models offer revolutionary potential for future wireless systems, promising unprecedented capabilities in network optimization and performance. However, current paradigms largely overlook crucial physical interactions. This oversight means they primarily rely on offline datasets, leading to difficulties in handling real-time wireless dynamics and non-stationary e…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: 7 pages, 4 figures

  18. arXiv:2506.23783

    cs.CV cs.AI cs.LG

    Mamba-FETrack V2: Revisiting State Space Model for Frame-Event based Visual Object Tracking

    Authors: Shiao Wang, Ju Huang, Qingchuan Ma, Jinfeng Gao, Chunyi Xu, Xiao Wang, Lan Chen, Bo Jiang

    Abstract: Combining traditional RGB cameras with bio-inspired event cameras for robust object tracking has garnered increasing attention in recent years. However, most existing multimodal tracking algorithms depend heavily on high-complexity Vision Transformer architectures for feature extraction and fusion across modalities. This not only leads to substantial computational overhead but also limits the effe…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: Journal extension of Mamba-FETrack, which was published at Pattern Recognition and Computer Vision (PRCV) 2024

  19. arXiv:2506.23640

    cs.NI cs.LG

    Geminet: Learning the Duality-based Iterative Process for Lightweight Traffic Engineering in Changing Topologies

    Authors: Ximeng Liu, Shizhen Zhao, Xinbing Wang

    Abstract: Recently, researchers have explored ML-based Traffic Engineering (TE), leveraging neural networks to solve TE problems traditionally addressed by optimization. However, existing ML-based TE schemes remain impractical: they either fail to handle topology changes or suffer from poor scalability due to excessive computational and memory overhead. To overcome these limitations, we propose Geminet, a l…

    Submitted 30 June, 2025; originally announced June 2025.

  20. arXiv:2506.23603

    cs.CR cs.AI

    SoK: Semantic Privacy in Large Language Models

    Authors: Baihe Ma, Yanna Jiang, Xu Wang, Guangshen Yu, Qin Wang, Caijun Sun, Chen Li, Xuelei Qi, Ying He, Wei Ni, Ren Ping Liu

    Abstract: As Large Language Models (LLMs) are increasingly deployed in sensitive domains, traditional data privacy measures prove inadequate for protecting information that is implicit, contextual, or inferable - what we define as semantic privacy. This Systematization of Knowledge (SoK) introduces a lifecycle-centric framework to analyze how semantic privacy risks emerge across input processing, pretrainin…

    Submitted 30 June, 2025; originally announced June 2025.

  21. arXiv:2506.23351

    cs.RO cs.AI cs.LG cs.MA

    Benchmarking Generalizable Bimanual Manipulation: RoboTwin Dual-Arm Collaboration Challenge at CVPR 2025 MEIS Workshop

    Authors: Tianxing Chen, Kaixuan Wang, Zhaohui Yang, Yuhao Zhang, Zanxin Chen, Baijun Chen, Wanxi Dong, Ziyuan Liu, Dong Chen, Tianshuo Yang, Haibao Yu, Xiaokang Yang, Yusen Qin, Zhiqiang Xie, Yao Mu, Ping Luo, Tian Nian, Weiliang Deng, Yiheng Ge, Yibin Liu, Zixuan Li, Dehui Wang, Zhixuan Liang, Haohui Xie, Rijie Zeng, et al. (74 additional authors not shown)

    Abstract: Embodied Artificial Intelligence (Embodied AI) is an emerging frontier in robotics, driven by the need for autonomous systems that can perceive, reason, and act in complex physical environments. While single-arm systems have shown strong task performance, collaborative dual-arm systems are essential for handling more intricate tasks involving rigid, deformable, and tactile-sensitive objects. To ad…

    Submitted 29 June, 2025; originally announced June 2025.

    Comments: Challenge Webpage: https://robotwin-benchmark.github.io/cvpr-2025-challenge/

  22. arXiv:2506.23184

    eess.IV cs.AI cs.CV

    Score-based Diffusion Model for Unpaired Virtual Histology Staining

    Authors: Anran Liu, Xiaofei Wang, Jing Cai, Chao Li

    Abstract: Hematoxylin and eosin (H&E) staining visualizes histology but lacks specificity for diagnostic markers. Immunohistochemistry (IHC) staining provides protein-targeted staining but is restricted by tissue availability and antibody specificity. Virtual staining, i.e., computationally translating the H&E image to its IHC counterpart while preserving the tissue structure, is promising for efficient IHC…

    Submitted 29 June, 2025; originally announced June 2025.

    Comments: 11 pages, 3 figures

  23. arXiv:2506.23078

    cs.RO

    Event-based Stereo Visual-Inertial Odometry with Voxel Map

    Authors: Zhaoxing Zhang, Xiaoxiang Wang, Chengliang Zhang, Yangyang Guo, Zikang Yuan, Xin Yang

    Abstract: The event camera, renowned for its high dynamic range and exceptional temporal resolution, is recognized as an important sensor for visual odometry. However, the inherent noise in event streams complicates the selection of high-quality map points, which critically determine the precision of state estimation. To address this challenge, we propose Voxel-ESVIO, an event-based stereo visual-inertial o…

    Submitted 29 June, 2025; originally announced June 2025.

  24. arXiv:2506.23077

    cs.CV

    Dynamic Contrastive Learning for Hierarchical Retrieval: A Case Study of Distance-Aware Cross-View Geo-Localization

    Authors: Suofei Zhang, Xinxin Wang, Xiaofu Wu, Quan Zhou, Haifeng Hu

    Abstract: Existing deep learning-based cross-view geo-localization methods primarily focus on improving the accuracy of cross-domain image matching, rather than enabling models to comprehensively capture contextual information around the target and minimize the cost of localization errors. To support systematic research into this Distance-Aware Cross-View Geo-Localization (DACVGL) problem, we construct Dist…

    Submitted 28 June, 2025; originally announced June 2025.

  25. arXiv:2506.22950

    cs.LG

    Infinite Sampling: Efficient and Stable Grouped RL Training for Large Language Models

    Authors: Liangyu Wang, Huanyi Xie, Xinhai Wang, Tianjin Huang, Mengdi Li, Di Wang

    Abstract: Group-based reinforcement learning algorithms such as Group Reward Policy Optimization (GRPO) have proven effective for fine-tuning large language models (LLMs) with human feedback. However, generating and storing multiple responses per prompt incurs substantial memory overhead, especially as the sample group size increases, limiting scalability under constrained hardware. We propose Infinite Sa…

    Submitted 28 June, 2025; originally announced June 2025.

  26. arXiv:2506.22749

    cs.CV

    Deep Learning based Joint Geometry and Attribute Up-sampling for Large-Scale Colored Point Clouds

    Authors: Yun Zhang, Feifan Chen, Na Li, Zhiwei Guo, Xu Wang, Fen Miao, Sam Kwong

    Abstract: Colored point cloud, which includes geometry and attribute components, is a mainstream representation enabling realistic and immersive 3D applications. To generate large-scale and denser colored point clouds, we propose a deep learning-based Joint Geometry and Attribute Up-sampling (JGAU) method that learns to model both geometry and attribute patterns while leveraging spatial attribute correlatio…

    Submitted 28 June, 2025; originally announced June 2025.

  27. arXiv:2506.22714

    cs.DC cs.LG cs.PF

    Libra: Synergizing CUDA and Tensor Cores for High-Performance Sparse Matrix Multiplication

    Authors: Jinliang Shi, Shigang Li, Youxuan Xu, Xueying Wang, Rongtian Fu, Zhi Ma, Tong Wu

    Abstract: Sparse matrix multiplication operators (i.e., SpMM and SDDMM) are widely used in deep learning and scientific computing. Modern accelerators are commonly equipped with Tensor cores and CUDA cores to accelerate sparse operators. The former brings superior computing power but only for structured matrix multiplication, while the latter has relatively lower performance but with higher programming flex…

    Submitted 27 June, 2025; originally announced June 2025.

    ACM Class: C.1.4; I.2.11

  28. arXiv:2506.22291

    cs.CV cs.AI

    RoomCraft: Controllable and Complete 3D Indoor Scene Generation

    Authors: Mengqi Zhou, Xipeng Wang, Yuxi Wang, Zhaoxiang Zhang

    Abstract: Generating realistic 3D indoor scenes from user inputs remains a challenging problem in computer vision and graphics, requiring careful balance of geometric consistency, spatial relationships, and visual realism. While neural generation methods often produce repetitive elements due to limited global spatial reasoning, procedural approaches can leverage constraints for controllable generation but s…

    Submitted 27 June, 2025; originally announced June 2025.

  29. arXiv:2506.22103

    cs.SI

    Quantifying Institutional Gender Inequality in Contemporary Visual Art

    Authors: Xindi Wang, Alexander J. Gates, Magnus Resch, Albert-Laszlo Barabasi

    Abstract: From disparities in the number of exhibiting artists to auction opportunities, there is evidence of women's under-representation in visual art. Here we explore the exhibition history and auction sales of 65,768 contemporary artists in 20,389 institutions, revealing gender differences in the artist population, exhibitions and auctions. We distinguish between two criteria for gender equity: gender-n…

    Submitted 27 June, 2025; originally announced June 2025.

    Comments: 35 pages, 6 figures

  30. arXiv:2506.21932

    math.NA cs.CE cs.PF

    StructMG: A Fast and Scalable Structured Algebraic Multigrid

    Authors: Yi Zong, Peinan Yu, Haopeng Huang, Zhengding Hu, Xinliang Wang, Qin Wang, Chensong Zhang, Xiaowen Xu, Jian Sun, Yongxiao Zhou, Wei Xue

    Abstract: Parallel multigrid is widely used as a preconditioner in solving large-scale sparse linear systems. However, the current multigrid library still needs more satisfactory performance for structured grid problems regarding speed and scalability. Based on the classical 'multigrid seesaw', we derive three necessary principles for an efficient structured multigrid, which guide our design and implemen…

    Submitted 27 June, 2025; originally announced June 2025.

  31. arXiv:2506.21912

    cs.CV cs.MM

    Generating Attribute-Aware Human Motions from Textual Prompt

    Authors: Xinghan Wang, Kun Xu, Fei Li, Cao Sheng, Jiazhong Yu, Yadong Mu

    Abstract: Text-driven human motion generation has recently attracted considerable attention, allowing models to generate human motions based on textual descriptions. However, current methods neglect the influence of human attributes (such as age, gender, weight, and height) which are key factors shaping human motion patterns. This work represents a pilot exploration for bridging this gap. We conceptualize e…

    Submitted 27 June, 2025; originally announced June 2025.

  32. arXiv:2506.21894

    stat.ML cs.LG

    Thompson Sampling in Function Spaces via Neural Operators

    Authors: Rafael Oliveira, Xuesong Wang, Kian Ming A. Chai, Edwin V. Bonilla

    Abstract: We propose an extension of Thompson sampling to optimization problems over function spaces where the objective is a known functional of an unknown operator's output. We assume that functional evaluations are inexpensive, while queries to the operator (such as running a high-fidelity simulator) are costly. Our algorithm employs a sample-then-optimize approach using neural operator surrogates. This…

    Submitted 27 June, 2025; originally announced June 2025.

    Comments: Under review

  33. arXiv:2506.21835

    cs.CV

    ProSAM: Enhancing the Robustness of SAM-based Visual Reference Segmentation with Probabilistic Prompts

    Authors: Xiaoqi Wang, Clint Sebastian, Wenbin He, Liu Ren

    Abstract: The recent advancements in large foundation models have driven the success of open-set image segmentation, a task focused on segmenting objects beyond predefined categories. Among various prompt types (such as points, boxes, texts, and visual references), visual reference segmentation stands out for its unique flexibility and strong zero-shot capabilities. Recently, several SAM-based methods have…

    Submitted 30 June, 2025; v1 submitted 26 June, 2025; originally announced June 2025.

  34. arXiv:2506.21763

    cs.AI

    THE-Tree: Can Tracing Historical Evolution Enhance Scientific Verification and Reasoning?

    Authors: Xin Wang, Jiyao Liu, Yulong Xiao, Junzhi Ning, Lihao Liu, Junjun He, Botian Shi, Kaicheng Yu

    Abstract: Large Language Models (LLMs) are accelerating scientific idea generation, but rigorously evaluating these numerous, often superficial, AI-generated propositions for novelty and factual accuracy is a critical bottleneck; manual verification is too slow. Existing validation methods are inadequate: LLMs as standalone verifiers may hallucinate and lack domain knowledge (our findings show ~60% unawaren…

    Submitted 26 June, 2025; originally announced June 2025.

  35. arXiv:2506.21416

    cs.CV

    XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation

    Authors: Bowen Chen, Mengyi Zhao, Haomiao Sun, Li Chen, Xu Wang, Kang Du, Xinglong Wu

    Abstract: Achieving fine-grained control over subject identity and semantic attributes (pose, style, lighting) in text-to-image generation, particularly for multiple subjects, often undermines the editability and coherence of Diffusion Transformers (DiTs). Many approaches introduce artifacts or suffer from attribute entanglement. To overcome these challenges, we propose a novel multi-subject controlled gene…

    Submitted 26 June, 2025; originally announced June 2025.

    Comments: Project page: https://bytedance.github.io/XVerse; GitHub link: https://github.com/bytedance/XVerse

  36. arXiv:2506.21411

    cs.LG

    Distributed Cross-Channel Hierarchical Aggregation for Foundation Models

    Authors: Aristeidis Tsaris, Isaac Lyngaas, John Lagregren, Mohamed Wahib, Larry York, Prasanna Balaprakash, Dan Lu, Feiyi Wang, Xiao Wang

    Abstract: Vision-based scientific foundation models hold significant promise for advancing scientific discovery and innovation. This potential stems from their ability to aggregate images from diverse sources such as varying physical groundings or data acquisition systems and to learn spatio-temporal correlations using transformer architectures. However, tokenizing and aggregating images can be compute-inte…

    Submitted 26 June, 2025; originally announced June 2025.

  37. arXiv:2506.21269

    cs.SD cs.AI eess.AS

    Integrating Vehicle Acoustic Data for Enhanced Urban Traffic Management: A Study on Speed Classification in Suzhou

    Authors: Pengfei Fan, Yuli Zhang, Xinheng Wang, Ruiyuan Jiang, Hankang Gu, Dongyao Jia, Shangbo Wang

    Abstract: This study presents and publicly releases the Suzhou Urban Road Acoustic Dataset (SZUR-Acoustic Dataset), which is accompanied by comprehensive data-acquisition protocols and annotation guidelines to ensure transparency and reproducibility of the experimental workflow. To model the coupling between vehicular noise and driving speed, we propose a bimodal-feature-fusion deep convolutional neural net…

    Submitted 26 June, 2025; originally announced June 2025.

  38. arXiv:2506.21205

    cs.RO

    Dynamic Risk-Aware MPPI for Mobile Robots in Crowds via Efficient Monte Carlo Approximations

    Authors: Elia Trevisan, Khaled A. Mustafa, Godert Notten, Xinwei Wang, Javier Alonso-Mora

    Abstract: Deploying mobile robots safely among humans requires the motion planner to account for the uncertainty in the other agents' predicted trajectories. This remains challenging in traditional approaches, especially with arbitrarily shaped predictions and real-time constraints. To address these challenges, we propose a Dynamic Risk-Aware Model Predictive Path Integral control (DRA-MPPI), a motion plann…

    Submitted 26 June, 2025; originally announced June 2025.

    Comments: Accepted for presentation at IROS 2025. Submitted Version

  39. arXiv:2506.21006

    cs.CV

    Detection of Breast Cancer Lumpectomy Margin with SAM-incorporated Forward-Forward Contrastive Learning

    Authors: Tyler Ward, Xiaoqin Wang, Braxton McFarland, Md Atik Ahamed, Sahar Nozad, Talal Arshad, Hafsa Nebbache, Jin Chen, Abdullah Imran

    Abstract: Complete removal of cancer tumors with a negative specimen margin during lumpectomy is essential in reducing breast cancer recurrence. However, 2D specimen radiography (SR), the current method used to assess intraoperative specimen margin status, has limited accuracy, resulting in nearly a quarter of patients requiring additional surgery. To address this, we propose a novel deep learning framework…

    Submitted 26 June, 2025; originally announced June 2025.

    Comments: 19 pages, 7 figures, 3 tables

  40. arXiv:2506.20988

    cs.CV cs.AI

    Segment Anything in Pathology Images with Natural Language

    Authors: Zhixuan Chen, Junlin Hou, Liqi Lin, Yihui Wang, Yequan Bie, Xi Wang, Yanning Zhou, Ronald Cheong Kin Chan, Hao Chen

    Abstract: Pathology image segmentation is crucial in computational pathology for analyzing histological features relevant to cancer diagnosis and prognosis. However, current methods face major challenges in clinical applications due to limited annotated data and restricted category definitions. To address these limitations, we propose PathSegmentor, the first text-prompted segmentation foundation model desi…

    Submitted 26 June, 2025; originally announced June 2025.

  41. arXiv:2506.20967

    cs.CV cs.AI

    DFVEdit: Conditional Delta Flow Vector for Zero-shot Video Editing

    Authors: Lingling Cai, Kang Zhao, Hangjie Yuan, Xiang Wang, Yingya Zhang, Kejie Huang

    Abstract: The advent of Video Diffusion Transformers (Video DiTs) marks a milestone in video generation. However, directly applying existing video editing methods to Video DiTs often incurs substantial computational overhead, due to resource-intensive attention modification or finetuning. To alleviate this problem, we present DFVEdit, an efficient zero-shot video editing method tailored for Video DiTs. DFVE…

    Submitted 27 June, 2025; v1 submitted 25 June, 2025; originally announced June 2025.

    Comments: Zero-shot video editing

  42. The Next Phase of Scientific Fact-Checking: Advanced Evidence Retrieval from Complex Structured Academic Papers

    Authors: Xingyu Deng, Xi Wang, Mark Stevenson

    Abstract: Scientific fact-checking aims to determine the veracity of scientific claims by retrieving and analysing evidence from research literature. The problem is inherently more complex than general fact-checking since it must accommodate the evolving nature of scientific knowledge, the structural complexity of academic literature and the challenges posed by long-form, multimodal scientific expression. H…

    Submitted 28 June, 2025; v1 submitted 25 June, 2025; originally announced June 2025.

    Comments: Accepted for ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR'25)

  43. arXiv:2506.20590  [pdf, ps, other]

    cs.CV

    WonderFree: Enhancing Novel View Quality and Cross-View Consistency for 3D Scene Exploration

    Authors: Chaojun Ni, Jie Li, Haoyun Li, Hengyu Liu, Xiaofeng Wang, Zheng Zhu, Guosheng Zhao, Boyuan Wang, Chenxin Li, Guan Huang, Wenjun Mei

    Abstract: Interactive 3D scene generation from a single image has gained significant attention due to its potential to create immersive virtual worlds. However, a key challenge in current 3D generation methods is the limited explorability, which cannot render high-quality images during larger maneuvers beyond the original viewpoint, particularly when attempting to move forward into unseen areas. To address…

    Submitted 25 June, 2025; originally announced June 2025.

  44. arXiv:2506.20493  [pdf]

    eess.SY cs.GT

    Analyzing the Impact of Strategic Bidding on the Reserve Capacity via a Bi-Level Model

    Authors: Yun Xu, Yunxiao Bai, Yunyong Zhang, Peng Wang, Xuelin Wang, Jiqun Guo, Kaijun Xie, Rusheng Zhao

    Abstract: The growing integration of renewable energy sources necessitates adequate reserve capacity to maintain power balance. However, in market clearing, power companies with flexible resources may submit strategic bids to maximize profits, potentially compromising system reserves. This paper examines the effects of such strategic behavior by modeling the market as a bi-level problem. The upper level rep…

    Submitted 25 June, 2025; originally announced June 2025.

  45. arXiv:2506.20168  [pdf, ps, other]

    cs.CV

    Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models

    Authors: Zhentao He, Can Zhang, Ziheng Wu, Zhenghao Chen, Yufei Zhan, Yifan Li, Zhao Zhang, Xian Wang, Minghui Qiu

    Abstract: Recent advancements in multimodal large language models have enhanced document understanding by integrating textual and visual information. However, existing models exhibit incompleteness within their paradigm in real-world scenarios, particularly under visual degradation. In such conditions, the current response paradigm often fails to adequately perceive visual degradation and ambiguity, leading…

    Submitted 25 June, 2025; originally announced June 2025.

  46. arXiv:2506.19991  [pdf, ps, other]

    cs.CG

    On the Stability of the Euler Characteristic Transform for a Perturbed Embedding

    Authors: Jasmine George, Oscar Lledo Osborn, Elizabeth Munch, Messiah Ridgley II, Elena Xinyi Wang

    Abstract: The Euler Characteristic Transform (ECT) is a robust method for shape classification. It takes an embedded shape and, for each direction, computes a piecewise constant function representing the Euler Characteristic of the shape's sublevel sets, which are defined by the height function in that direction. It has applications in TDA inverse problems, such as shape reconstruction, and is also employed…

    Submitted 24 June, 2025; originally announced June 2025.

    Comments: REU Project, Summer 2024

    MSC Class: 55N31

  47. arXiv:2506.19886  [pdf, ps, other]

    cs.CR cs.IT cs.LG

    Diffusion-based Task-oriented Semantic Communications with Model Inversion Attack

    Authors: Xuesong Wang, Mo Li, Xingyan Shi, Zhaoqian Liu, Shenghao Yang

    Abstract: Semantic communication has emerged as a promising neural network-based system design for 6G networks. Task-oriented semantic communication is a novel paradigm whose core goal is to efficiently complete specific tasks by transmitting semantic information, optimizing communication efficiency and task performance. The key challenge lies in preserving privacy while maintaining task accuracy, as this s…

    Submitted 24 June, 2025; originally announced June 2025.

  48. arXiv:2506.19850  [pdf, ps, other]

    cs.CV cs.RO

    Unified Vision-Language-Action Model

    Authors: Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, Zhaoxiang Zhang

    Abstract: Vision-language-action models (VLAs) have garnered significant attention for their potential in advancing robotic manipulation. However, previous approaches predominantly rely on the general comprehension capabilities of vision-language models (VLMs) to generate action signals, often overlooking the rich temporal and causal structure embedded in visual observations. In this paper, we present UniVL…

    Submitted 24 June, 2025; originally announced June 2025.

    Comments: technical report

  49. arXiv:2506.19838  [pdf, ps, other]

    cs.CV

    SimpleGVR: A Simple Baseline for Latent-Cascaded Video Super-Resolution

    Authors: Liangbin Xie, Yu Li, Shian Du, Menghan Xia, Xintao Wang, Fanghua Yu, Ziyan Chen, Pengfei Wan, Jiantao Zhou, Chao Dong

    Abstract: Latent diffusion models have emerged as a leading paradigm for efficient video generation. However, as user expectations shift toward higher-resolution outputs, relying solely on latent computation becomes inadequate. A promising approach involves decoupling the process into two stages: semantic content generation and detail synthesis. The former employs a computationally intensive base model at l…

    Submitted 24 June, 2025; originally announced June 2025.

    Comments: Project webpage available at https://simplegvr.github.io/

  50. arXiv:2506.19780  [pdf, ps, other]

    cs.LG

    Multi-Preference Lambda-weighted Listwise DPO for Dynamic Preference Alignment

    Authors: Yuhui Sun, Xiyao Wang, Zixi Li, Jinman Zhao

    Abstract: While large-scale unsupervised language models (LMs) capture broad world knowledge and reasoning capabilities, steering their behavior toward desired objectives remains challenging due to the lack of explicit supervision. Existing alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on training a reward model and performing reinforcement learning to align with huma…

    Submitted 26 June, 2025; v1 submitted 24 June, 2025; originally announced June 2025.

    Comments: 10 pages, 4 figures, appendix included. To appear in Proceedings of AAAI 2026. Code: https://github.com/yuhui15/Multi-Preference-Lambda-weighted-DPO

    ACM Class: I.2.6; I.2.7; I.5.1