Showing 1–50 of 4,709 results for author: Wang, L

Searching in archive cs.
  1. arXiv:2507.06224  [pdf, ps, other]

    cs.RO cs.AI

    EC-Flow: Enabling Versatile Robotic Manipulation from Action-Unlabeled Videos via Embodiment-Centric Flow

    Authors: Yixiang Chen, Peiyan Li, Yan Huang, Jiabing Yang, Kehan Chen, Liang Wang

    Abstract: Current language-guided robotic manipulation systems often require low-level action-labeled datasets for imitation learning. While object-centric flow prediction methods mitigate this issue, they remain limited to scenarios involving rigid objects with clear displacement and minimal occlusion. In this work, we present Embodiment-Centric Flow (EC-Flow), a framework that directly learns manipulation…

    Submitted 8 July, 2025; originally announced July 2025.

    Comments: Accepted at ICCV 2025

  2. arXiv:2507.06057  [pdf, ps, other]

    cs.AI cs.LG

    FEVO: Financial Knowledge Expansion and Reasoning Evolution for Large Language Models

    Authors: Bo Pang, Yalu Ouyang, Hangfei Xu, Ziqi Jia, Panpan Li, Shengzhao Wen, Lu Wang, Shiyong Li, Yanpeng Wang

    Abstract: Advancements in reasoning for large language models (LLMs) have led to significant performance improvements for LLMs in various fields such as mathematics and programming. However, research applying these advances to the financial domain, where considerable domain-specific knowledge is necessary to complete tasks, remains limited. To address this gap, we introduce FEVO (Financial Evolution), a mu…

    Submitted 8 July, 2025; originally announced July 2025.

  3. arXiv:2507.05900  [pdf, ps, other]

    cs.SD cs.LG eess.AS math.OC

    Stable Acoustic Relay Assignment with High Throughput via Lase Chaos-based Reinforcement Learning

    Authors: Zengjing Chen, Lu Wang, Chengzhi Xing

    Abstract: This study addresses the problem of stable acoustic relay assignment in an underwater acoustic network. Unlike the objectives of most existing literature, two distinct objectives, namely classical stable arrangement and ambiguous stable arrangement, are considered. To achieve these stable arrangements, a laser chaos-based multi-processing learning (LC-ML) method is introduced to efficiently obtain…

    Submitted 8 July, 2025; originally announced July 2025.

  4. arXiv:2507.05822  [pdf, ps, other]

    cs.CV

    Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models

    Authors: Léa Dubois, Klaus Schmidt, Chengyu Wang, Ji-Hoon Park, Lin Wang, Santiago Munoz

    Abstract: Current video understanding models excel at recognizing "what" is happening but fall short in high-level cognitive tasks like causal reasoning and future prediction, a limitation rooted in their lack of commonsense world knowledge. To bridge this cognitive gap, we propose a novel framework that synergistically fuses a powerful Vision Foundation Model (VFM) for deep visual perception with a Large L…

    Submitted 8 July, 2025; originally announced July 2025.

    Comments: 22 pages, 4 figures

    MSC Class: CS; ACM Class: I.2.10

  5. arXiv:2507.05542  [pdf]

    cs.DB

    GTRSS: Graph-based Top-$k$ Representative Similar Subtrajectory Query

    Authors: Mingchang Ge, Liping Wang, Xuemin Lin, Yuang Zhang, Kunming Wang

    Abstract: Trajectory mining has attracted significant attention. This paper addresses the Top-k Representative Similar Subtrajectory Query (TRSSQ) problem, which aims to find the k most representative subtrajectories similar to a query. Existing methods rely on costly filtering-validation frameworks, resulting in slow response times. Addressing this, we propose GTRSS, a novel Graph-based Top-k Representativ…

    Submitted 7 July, 2025; originally announced July 2025.

  6. arXiv:2507.05317  [pdf, ps, other]

    eess.IV cs.AI cs.CV

    PWD: Prior-Guided and Wavelet-Enhanced Diffusion Model for Limited-Angle CT

    Authors: Yi Liu, Yiyang Wen, Zekun Zhou, Junqi Ma, Linghang Wang, Yucheng Yao, Liu Shi, Qiegen Liu

    Abstract: Generative diffusion models have received increasing attention in medical imaging, particularly in limited-angle computed tomography (LACT). Standard diffusion models achieve high-quality image reconstruction but require a large number of sampling steps during inference, resulting in substantial computational overhead. Although skip-sampling strategies have been proposed to improve efficiency, the…

    Submitted 30 June, 2025; originally announced July 2025.

  7. arXiv:2507.04631  [pdf, ps, other]

    cs.CV cs.AI cs.RO

    Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts

    Authors: Yun Wang, Longguang Wang, Chenghao Zhang, Yongjian Zhang, Zhanjie Zhang, Ao Ma, Chenyou Fan, Tin Lun Lam, Junjie Hu

    Abstract: Recently, learning-based stereo matching networks have advanced significantly. However, they often lack robustness and struggle to achieve impressive cross-domain performance due to domain shifts and imbalanced disparity distributions among diverse datasets. Leveraging Vision Foundation Models (VFMs) can intuitively enhance the model's robustness, but integrating such a model into stereo matching…

    Submitted 6 July, 2025; originally announced July 2025.

    Journal ref: ICCV 2025

  8. arXiv:2507.04607  [pdf, ps, other]

    cs.CL cs.AI

    PRIME: Large Language Model Personalization with Cognitive Memory and Thought Processes

    Authors: Xinliang Frederick Zhang, Nick Beauchamp, Lu Wang

    Abstract: Large language model (LLM) personalization aims to align model outputs with individuals' unique preferences and opinions. While recent efforts have implemented various personalization methods, a unified theoretical framework that can systematically understand the drivers of effective personalization is still lacking. In this work, we integrate the well-established cognitive dual-memory model into…

    Submitted 6 July, 2025; originally announced July 2025.

  9. arXiv:2507.04351  [pdf, ps, other]

    cs.RO cs.AI

    MLLM-Fabric: Multimodal Large Language Model-Driven Robotic Framework for Fabric Sorting and Selection

    Authors: Liman Wang, Hanyang Zhong, Tianyuan Wang, Shan Luo, Jihong Zhu

    Abstract: Choosing the right fabric is crucial to meet functional and quality requirements in robotic applications for textile manufacturing, apparel production, and smart retail. We present MLLM-Fabric, a robotic framework powered by multimodal large language models (MLLMs) for fabric sorting and selection. The system includes a robotic arm, a camera, a visuotactile sensor, and a pressure sensor. It employ…

    Submitted 6 July, 2025; originally announced July 2025.

  10. arXiv:2507.04059  [pdf, ps, other]

    cs.LG cs.AI cs.CV stat.ML

    Attributing Data for Sharpness-Aware Minimization

    Authors: Chenyang Ren, Yifan Jia, Huanyi Xie, Zhaobin Xu, Tianxing Wei, Liangyu Wang, Lijie Hu, Di Wang

    Abstract: Sharpness-aware Minimization (SAM) improves generalization in large-scale model training by linking loss landscape geometry to generalization. However, challenges such as mislabeled noisy data and privacy concerns have emerged as significant issues. Data attribution, which identifies the contributions of specific training samples, offers a promising solution. However, directly rendering existing d…

    Submitted 5 July, 2025; originally announced July 2025.

    Comments: 25 pages

  11. arXiv:2507.03560  [pdf, ps, other]

    cs.LG

    Simplifying Graph Neural Kernels: from Stacking Layers to Collapsed Structure

    Authors: Lin Wang, Shijie Wang, Sirui Huang, Qing Li

    Abstract: The Graph Neural Tangent Kernel (GNTK) successfully bridges the gap between kernel methods and Graph Neural Networks (GNNs), addressing key challenges such as the difficulty of training deep networks and the limitations of traditional kernel methods. However, the existing layer-stacking strategy in GNTK introduces redundant computations, significantly increasing computational complexity and limiti…

    Submitted 4 July, 2025; originally announced July 2025.

  12. arXiv:2507.03211  [pdf, ps, other]

    cs.LG cs.PF

    DistZO2: High-Throughput and Memory-Efficient Zeroth-Order Fine-tuning LLMs with Distributed Parallel Computing

    Authors: Liangyu Wang, Huanyi Xie, Di Wang

    Abstract: Fine-tuning large language models (LLMs) remains resource-intensive due to their sheer scale. While zeroth-order (ZO) optimization provides a memory-efficient alternative by eliminating backward passes, its application to multi-hundred-billion-parameter models is constrained by GPU memory and compute throughput. The ZO2 framework addresses the memory bottleneck by offloading model parameters to CP…

    Submitted 3 July, 2025; originally announced July 2025.

  13. arXiv:2507.02666  [pdf, ps, other]

    cs.SD cs.AI cs.CL eess.AS

    ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning

    Authors: Junyu Wang, Tianrui Wang, Meng Ge, Longbiao Wang, Jianwu Dang

    Abstract: In recent advancements in audio self-supervised representation learning, the standard Transformer architecture has emerged as the predominant approach, yet its attention mechanism often allocates a portion of attention weights to irrelevant information, potentially impairing the model's discriminative ability. To address this, we introduce a differential attention mechanism, which effectively miti…

    Submitted 3 July, 2025; originally announced July 2025.

    Comments: Accepted at Interspeech 2025

  14. arXiv:2507.02644  [pdf, ps, other]

    cond-mat.str-el cs.AI quant-ph

    Solving the Hubbard model with Neural Quantum States

    Authors: Yuntian Gu, Wenrui Li, Heng Lin, Bo Zhan, Ruichen Li, Yifei Huang, Di He, Yantao Wu, Tao Xiang, Mingpu Qin, Liwei Wang, Dingshun Lv

    Abstract: The rapid development of neural quantum states (NQS) has established it as a promising framework for studying quantum many-body systems. In this work, by leveraging cutting-edge transformer-based architectures and developing highly efficient optimization algorithms, we achieve state-of-the-art results for the doped two-dimensional (2D) Hubbard model, arguably the minimum model for high-Tc…

    Submitted 3 July, 2025; originally announced July 2025.

  15. arXiv:2507.02376  [pdf, ps, other]

    cs.SE cs.AI cs.DC

    VeFIA: An Efficient Inference Auditing Framework for Vertical Federated Collaborative Software

    Authors: Chung-ju Huang, Ziqi Zhang, Yinggui Wang, Binghui Wang, Tao Wei, Leye Wang

    Abstract: Vertical Federated Learning (VFL) is a distributed AI software deployment mechanism for cross-silo collaboration without accessing participants' data. However, existing VFL work lacks a mechanism to audit the execution correctness of the inference software of the data party. To address this problem, we design a Vertical Federated Inference Auditing (VeFIA) framework. VeFIA helps the task party to…

    Submitted 3 July, 2025; originally announced July 2025.

  16. arXiv:2507.02303  [pdf, ps, other]

    cs.IT

    Measurements and Modeling of Air-Ground Integrated Channel in Forest Environment Based on OFDM Signals

    Authors: Zhe Xiao, Shu Sun, Na Liu, Lianming Xu, Li Wang

    Abstract: Forests are frequently impacted by climate conditions, vegetation density, and intricate terrain and geology, which contribute to natural disasters. Personnel engaged in or supporting rescue operations in such environments rely on robust communication systems to ensure their safety, highlighting the criticality of channel measurements in forest environments. However, according to current research,…

    Submitted 3 July, 2025; originally announced July 2025.

  17. arXiv:2507.02291  [pdf, ps, other]

    cs.LG cs.AI cs.IT

    Knowledge Graph-Based Explainable and Generalized Zero-Shot Semantic Communications

    Authors: Zhaoyu Zhang, Lingyi Wang, Wei Wu, Fuhui Zhou, Qihui Wu

    Abstract: Data-driven semantic communication is based on superficial statistical patterns, thereby lacking interpretability and generalization, especially for applications with the presence of unseen data. To address these challenges, we propose a novel knowledge graph-enhanced zero-shot semantic communication (KGZS-SC) network. Guided by the structured semantic information from a knowledge graph-based sema…

    Submitted 2 July, 2025; originally announced July 2025.

  18. arXiv:2507.01467  [pdf, ps, other]

    cs.CV

    Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think

    Authors: Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, Ming-Ming Cheng, Xiang Li

    Abstract: REPA and its variants effectively mitigate training challenges in diffusion models by incorporating external visual representations from pretrained models, through alignment between the noisy hidden projections of denoising networks and foundational clean image representations. We argue that the external alignment, which is absent during the entire denoising inference process, falls short of fully…

    Submitted 2 July, 2025; originally announced July 2025.

  19. arXiv:2507.01428  [pdf, ps, other]

    cs.CV eess.IV

    DiffMark: Diffusion-based Robust Watermark Against Deepfakes

    Authors: Chen Sun, Haiyang Sun, Zhiqing Guo, Yunfeng Diao, Liejun Wang, Dan Ma, Gaobo Yang, Keqin Li

    Abstract: Deepfakes pose significant security and privacy threats through malicious facial manipulations. While robust watermarking can aid in authenticity verification and source tracking, existing methods often lack sufficient robustness against Deepfake manipulations. Diffusion models have demonstrated remarkable performance in image generation, enabling the seamless fusion of watermark with image du…

    Submitted 2 July, 2025; originally announced July 2025.

  20. arXiv:2507.01384  [pdf, ps, other]

    cs.CV

    MUG: Pseudo Labeling Augmented Audio-Visual Mamba Network for Audio-Visual Video Parsing

    Authors: Langyu Wang, Bingke Zhu, Yingying Chen, Yiyuan Zhang, Ming Tang, Jinqiao Wang

    Abstract: Weakly-supervised audio-visual video parsing (AVVP) aims to predict all modality-specific events and locate their temporal boundaries. Despite significant progress, due to the limitations of weak supervision and the deficiencies of the model architecture, existing methods struggle to simultaneously improve both the segment-level prediction and the event-level prediction. In this work…

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Accepted by ICCV 2025

  21. arXiv:2507.01381  [pdf, ps, other]

    cs.LG cs.AI

    Distributional Soft Actor-Critic with Diffusion Policy

    Authors: Tong Liu, Yinuo Wang, Xujie Song, Wenjun Zou, Liangfa Chen, Likun Wang, Bin Shuai, Jingliang Duan, Shengbo Eben Li

    Abstract: Reinforcement learning has been proven to be highly effective in handling complex control tasks. Traditional methods typically use unimodal distributions, such as Gaussian distributions, to model the output of value distributions. However, a unimodal distribution often causes bias in value function estimation, leading to poor algorithm performance. This paper proposes a distributional rei…

    Submitted 3 July, 2025; v1 submitted 2 July, 2025; originally announced July 2025.

    Comments: Accepted at IEEE ITSC 2025

  22. arXiv:2507.01154  [pdf, ps, other]

    cs.LG cs.CR

    FlashDP: Private Training Large Language Models with Efficient DP-SGD

    Authors: Liangyu Wang, Junxiao Wang, Jie Ren, Zihang Xiang, David E. Keyes, Di Wang

    Abstract: As large language models (LLMs) increasingly underpin technological advancements, the privacy of their training data emerges as a critical concern. Differential Privacy (DP) serves as a rigorous mechanism to protect this data, yet its integration via Differentially Private Stochastic Gradient Descent (DP-SGD) introduces substantial challenges, primarily due to the complexities of per-sample gradie…

    Submitted 1 July, 2025; originally announced July 2025.

  23. arXiv:2507.00980  [pdf, ps, other]

    cs.CV

    RTMap: Real-Time Recursive Mapping with Change Detection and Localization

    Authors: Yuheng Du, Sheng Yang, Lingxuan Wang, Zhenghua Hou, Chengying Cai, Zhitao Tan, Mingxia Chen, Shi-Sheng Huang, Qiang Li

    Abstract: While recent online HD mapping methods relieve burdened offline pipelines and solve map freshness, they remain limited by perceptual inaccuracies, occlusion in dense traffic, and an inability to fuse multi-agent observations. We propose RTMap to enhance these single-traversal methods by persistently crowdsourcing a multi-traversal HD map as a self-evolutional memory. On onboard agents, RTMap simul…

    Submitted 1 July, 2025; originally announced July 2025.

  24. arXiv:2507.00659  [pdf, ps, other]

    cs.CV

    LoD-Loc v2: Aerial Visual Localization over Low Level-of-Detail City Models using Explicit Silhouette Alignment

    Authors: Juelin Zhu, Shuaibang Peng, Long Wang, Hanlin Tan, Yu Liu, Maojun Zhang, Shen Yan

    Abstract: We propose a novel method for aerial visual localization over low Level-of-Detail (LoD) city models. The previous wireframe-alignment-based method, LoD-Loc, has shown promising localization results leveraging LoD models. However, LoD-Loc mainly relies on high-LoD (LoD3 or LoD2) city models, but the majority of available models and those many countries plan to construct nationwide are low-LoD (LoD1). Con…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: Accepted by ICCV 2025

  25. arXiv:2507.00485  [pdf, ps, other]

    cs.LG cs.AI

    PNAct: Crafting Backdoor Attacks in Safe Reinforcement Learning

    Authors: Weiran Guo, Guanjun Liu, Ziyuan Zhou, Ling Wang

    Abstract: Reinforcement Learning (RL) is widely used in tasks where agents interact with an environment to maximize rewards. Building on this foundation, Safe Reinforcement Learning (Safe RL) incorporates a cost metric alongside the reward metric, ensuring that agents adhere to safety constraints during decision-making. In this paper, we identify that Safe RL is vulnerable to backdoor attacks, which can man…

    Submitted 1 July, 2025; originally announced July 2025.

  26. arXiv:2506.23863  [pdf, ps, other]

    cs.CV

    Puzzles: Unbounded Video-Depth Augmentation for Scalable End-to-End 3D Reconstruction

    Authors: Jiahao Ma, Lei Wang, Miaomiao Liu, David Ahmedt-Aristizabal, Chuong Nguyen

    Abstract: Multi-view 3D reconstruction remains a core challenge in computer vision. Recent methods, such as DUST3R and its successors, directly regress pointmaps from image pairs without relying on known scene geometry or camera parameters. However, the performance of these models is constrained by the diversity and scale of available training data. In this work, we introduce Puzzles, a data augmentation st…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: Feed-forward 3D reconstruction, Data Augmentation

  27. arXiv:2506.23420  [pdf]

    cs.CE

    Data-Driven Multiscale Topology Optimization of Spinodoid Architected Materials with Controllable Anisotropy

    Authors: Shiguang Deng, Doksoo Lee, Aaditya Chandrasekhar, Stefan Knapik, Liwei Wang, Horacio D. Espinosa, Wei Chen

    Abstract: Spinodoid architected materials have drawn significant attention due to their unique nature in stochasticity, aperiodicity, and bi-continuity. Compared to classic periodic truss-, beam- and plate-based lattice architectures, spinodoids are insensitive to manufacturing defects, scalable for high throughput production, functionally graded by tunable local properties, and material failure resistant d…

    Submitted 29 June, 2025; originally announced June 2025.

  28. arXiv:2506.23115  [pdf, ps, other]

    cs.CV cs.AI cs.CL

    MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings

    Authors: Haonan Chen, Hong Liu, Yuping Luo, Liang Wang, Nan Yang, Furu Wei, Zhicheng Dou

    Abstract: Multimodal embedding models, built upon causal Vision Language Models (VLMs), have shown promise in various tasks. However, current approaches face three key limitations: the use of causal attention in VLM backbones is suboptimal for embedding tasks; scalability issues due to reliance on high-quality labeled paired data for contrastive learning; and limited diversity in training objectives and dat…

    Submitted 29 June, 2025; originally announced June 2025.

    Comments: Homepage: https://haon-chen.github.io/MoCa/

  29. arXiv:2506.22950  [pdf, ps, other]

    cs.LG

    Infinite Sampling: Efficient and Stable Grouped RL Training for Large Language Models

    Authors: Liangyu Wang, Huanyi Xie, Xinhai Wang, Tianjin Huang, Mengdi Li, Di Wang

    Abstract: Group-based reinforcement learning algorithms such as Group Relative Policy Optimization (GRPO) have proven effective for fine-tuning large language models (LLMs) with human feedback. However, generating and storing multiple responses per prompt incurs substantial memory overhead, especially as the sample group size increases, limiting scalability under constrained hardware. We propose Infinite Sa…

    Submitted 28 June, 2025; originally announced June 2025.

  30. arXiv:2506.22186  [pdf, ps, other]

    cs.LG

    Thompson Sampling-Based Learning and Control for Unknown Dynamic Systems

    Authors: Kaikai Zheng, Dawei Shi, Yang Shi, Long Wang

    Abstract: Thompson sampling (TS) is an effective method to explore parametric uncertainties and can therefore be used for active learning-based controller design. However, TS relies on finite parametric representations, which limits its applicability to more general spaces, which are more commonly encountered in control system design. To address this issue, this work proposes a parameterization method for…

    Submitted 27 June, 2025; originally announced June 2025.

  31. arXiv:2506.22161  [pdf, ps, other]

    cs.CV

    Attention-disentangled Uniform Orthogonal Feature Space Optimization for Few-shot Object Detection

    Authors: Taijin Zhao, Heqian Qiu, Yu Dai, Lanxiao Wang, Fanman Meng, Qingbo Wu, Hongliang Li

    Abstract: Few-shot object detection (FSOD) aims to detect objects with limited samples for novel classes, while relying on abundant data for base classes. Existing FSOD approaches, predominantly built on the Faster R-CNN detector, entangle objectness recognition and foreground classification within shared feature spaces. This paradigm inherently establishes class-specific objectness criteria and suffers fro…

    Submitted 27 June, 2025; originally announced June 2025.

  32. arXiv:2506.20342  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    Feature Hallucination for Self-supervised Action Recognition

    Authors: Lei Wang, Piotr Koniusz

    Abstract: Understanding human actions in videos requires more than raw pixel analysis; it relies on high-level semantic reasoning and effective integration of multimodal features. We propose a deep translational action recognition framework that enhances recognition accuracy by jointly predicting action concepts and auxiliary features from RGB video frames. At test time, hallucination streams infer missing…

    Submitted 25 June, 2025; originally announced June 2025.

    Comments: Accepted for publication in International Journal of Computer Vision (IJCV)

  33. arXiv:2506.20274  [pdf]

    cs.AI

    Enterprise Large Language Model Evaluation Benchmark

    Authors: Liya Wang, David Yi, Damien Jose, John Passarelli, James Gao, Jordan Leventis, Kang Li

    Abstract: Large Language Models (LLMs) have demonstrated promise in boosting productivity across AI-powered tools, yet existing benchmarks like Massive Multitask Language Understanding (MMLU) inadequately assess enterprise-specific task complexities. We propose a 14-task framework grounded in Bloom's Taxonomy to holistically evaluate LLM capabilities in enterprise contexts. To address challenges of noisy…

    Submitted 25 June, 2025; originally announced June 2025.

    Comments: Submitted to MLNLP 2025 at https://csity2025.org/mlnlp/index

  34. arXiv:2506.20059  [pdf, ps, other]

    cs.AI

    DiaLLMs: EHR Enhanced Clinical Conversational System for Clinical Test Recommendation and Diagnosis Prediction

    Authors: Weijieying Ren, Tianxiang Zhao, Lei Wang, Tianchun Wang, Vasant Honavar

    Abstract: Recent advances in Large Language Models (LLMs) have led to remarkable progress in medical consultation. However, existing medical LLMs overlook the essential role of Electronic Health Records (EHR) and focus primarily on diagnosis recommendation, limiting their clinical applicability. We propose DiaLLM, the first medical LLM that integrates heterogeneous EHR data into clinically grounded dialog…

    Submitted 24 June, 2025; originally announced June 2025.

    Journal ref: published in ACL 2025

  35. arXiv:2506.20045  [pdf, ps, other]

    cs.RO cs.CV

    Consensus-Driven Uncertainty for Robotic Grasping based on RGB Perception

    Authors: Eric C. Joyce, Qianwen Zhao, Nathaniel Burgdorfer, Long Wang, Philippos Mordohai

    Abstract: Deep object pose estimators are notoriously overconfident. A grasping agent that both estimates the 6-DoF pose of a target object and predicts the uncertainty of its own estimate could avoid task failure by choosing not to act under high uncertainty. Even though object pose estimation improves and uncertainty quantification research continues to make strides, few studies have connected them to the…

    Submitted 26 June, 2025; v1 submitted 24 June, 2025; originally announced June 2025.

    Comments: Accepted to IROS 2025

  36. arXiv:2506.19774  [pdf, ps, other]

    eess.AS cs.AI cs.CL cs.SD

    Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation

    Authors: Jun Wang, Xijuan Zeng, Chunyu Qiang, Ruilong Chen, Shiyao Wang, Le Wang, Wangjing Zhou, Pengfei Cai, Jiahui Zhao, Nan Li, Zihan Li, Yuzhe Liang, Xiaopeng Wang, Haorui Zheng, Ming Wen, Kang Yin, Yiran Wang, Nan Li, Feng Deng, Liang Dong, Chen Zhang, Di Zhang, Kun Gai

    Abstract: We propose Kling-Foley, a large-scale multimodal Video-to-Audio generation model that synthesizes high-quality audio synchronized with video content. In Kling-Foley, we introduce multimodal diffusion transformers to model the interactions between video, audio, and text modalities, and combine it with a visual semantic representation module and an audio-visual synchronization module to enhance alig…

    Submitted 24 June, 2025; originally announced June 2025.

  37. arXiv:2506.19518  [pdf, ps, other]

    cs.IT

    Robust and Resilient Networks with Integrated Sensing, Communication and Computation

    Authors: Ming-Chun Lee, Christian Eckrich, Vahid Jamali, Yu-Chih Huang, Arash Asadi, Li-Chun Wang

    Abstract: Emerging applications such as networked robotics, intelligent transportation, smart factories, and virtual and augmented reality demand integrated perception and connectivity enabled by wireless communication. This has driven growing interest in integrated sensing, communication, and computation (ISCC) systems, with a primary focus on their efficient co-designs. However, as ISCC systems increasin…

    Submitted 24 June, 2025; originally announced June 2025.

    Comments: This work has been submitted to the IEEE Communications Magazine for possible publication

  38. arXiv:2506.18309  [pdf, ps, other]

    cs.IR cs.AI

    LettinGo: Explore User Profile Generation for Recommendation System

    Authors: Lu Wang, Di Zhang, Fangkai Yang, Pu Zhao, Jianfeng Liu, Yuefeng Zhan, Hao Sun, Qingwei Lin, Weiwei Deng, Dongmei Zhang, Feng Sun, Qi Zhang

    Abstract: User profiling is pivotal for recommendation systems, as it transforms raw user interaction data into concise and structured representations that drive personalized recommendations. While traditional embedding-based profiles lack interpretability and adaptability, recent advances with large language models (LLMs) enable text-based profiles that are semantically richer and more transparent. However…

    Submitted 23 June, 2025; originally announced June 2025.

    Comments: 11 pages, 3 figures

  39. arXiv:2506.18084  [pdf, ps, other]

    cs.CV

    TEM^3-Learning: Time-Efficient Multimodal Multi-Task Learning for Advanced Assistive Driving

    Authors: Wenzhuo Liu, Yicheng Qiao, Zhen Wang, Qiannan Guo, Zilong Chen, Meihua Zhou, Xinran Li, Letian Wang, Zhiwei Li, Huaping Liu, Wenshuo Wang

    Abstract: Multi-task learning (MTL) can advance assistive driving by exploring inter-task correlations through shared representations. However, existing methods face two critical limitations: single-modality constraints limiting comprehensive scene understanding and inefficient architectures impeding real-time deployment. This paper proposes TEM^3-Learning (Time-Efficient Multimodal Multi-task Learning), a…

    Submitted 22 June, 2025; originally announced June 2025.

  40. arXiv:2506.17692  [pdf, ps, other]

    cs.CL

    Resource-Friendly Dynamic Enhancement Chain for Multi-Hop Question Answering

    Authors: Binquan Ji, Haibo Luo, Yifei Lu, Lei Hei, Jiaqi Wang, Tingjing Liao, Lingyu Wang, Shichao Wang, Feiliang Ren

    Abstract: Knowledge-intensive multi-hop question answering (QA) tasks, which require integrating evidence from multiple sources to address complex queries, often necessitate multiple rounds of retrieval and iterative generation by large language models (LLMs). However, incorporating many documents and extended contexts poses challenges, such as hallucinations and semantic drift, for lightweight LLMs with few…

    Submitted 21 June, 2025; originally announced June 2025.

  41. arXiv:2506.17667  [pdf, ps, other]

    cs.AI

    PhysUniBench: An Undergraduate-Level Physics Reasoning Benchmark for Multimodal Models

    Authors: Lintao Wang, Encheng Su, Jiaqi Liu, Pengze Li, Peng Xia, Jiabei Xiao, Wenlong Zhang, Xinnan Dai, Xi Chen, Yuan Meng, Mingyu Ding, Lei Bai, Wanli Ouyang, Shixiang Tang, Aoran Wang, Xinzhu Ma

    Abstract: Physics problem-solving is a challenging domain for large AI models, requiring integration of conceptual understanding, mathematical reasoning, and interpretation of physical diagrams. Current evaluation methodologies show notable limitations in capturing the breadth and complexity of undergraduate-level physics, underscoring the need for more rigorous assessments. To this end, we present PhysUniB…

    Submitted 27 June, 2025; v1 submitted 21 June, 2025; originally announced June 2025.

  42. arXiv:2506.17625  [pdf, ps, other]

    cs.CR

    List-Decodable Byzantine Robust PIR: Lower Communication Complexity, Higher Byzantine Tolerance, Smaller List Size

    Authors: Pengzhen Ke, Liang Feng Zhang, Huaxiong Wang, Li-Ping Wang

    Abstract: Private Information Retrieval (PIR) is a privacy-preserving primitive in cryptography. Significant endeavors have been made to address the variant of PIR concerning malicious servers. Among those endeavors, list-decodable Byzantine robust PIR schemes may tolerate a majority of malicious responding servers that provide incorrect answers. In this paper, we propose two perfect list-decodable BRPI…

    Submitted 21 June, 2025; originally announced June 2025.

    Comments: Submitted to AsiaCrypt 2025

  43. arXiv:2506.17029  [pdf, ps, other]

    cs.LG

    Scalable and Reliable Multi-agent Reinforcement Learning for Traffic Assignment

    Authors: Leizhen Wang, Peibo Duan, Cheng Lyu, Zewen Wang, Zhiqiang He, Nan Zheng, Zhenliang Ma

    Abstract: The evolution of metropolitan cities and the increase in travel demands impose stringent requirements on traffic assignment methods. Multi-agent reinforcement learning (MARL) approaches outperform traditional methods in modeling adaptive routing behavior without requiring explicit system dynamics, which is beneficial for real-world deployment. However, MARL frameworks face challenges in scalabilit…

    Submitted 20 June, 2025; originally announced June 2025.

  44. arXiv:2506.16962  [pdf, ps, other]

    cs.CV cs.AI cs.CL

    Enhancing Step-by-Step and Verifiable Medical Reasoning in MLLMs

    Authors: Haoran Sun, Yankai Jiang, Wenjie Lou, Yujie Zhang, Wenjie Li, Lilong Wang, Mianxin Liu, Lei Liu, Xiaosong Wang

    Abstract: Multimodal large language models (MLLMs) have begun to demonstrate robust reasoning capabilities on general tasks, yet their application in the medical domain remains in its early stages. Constructing chain-of-thought (CoT) training data is essential for bolstering the reasoning abilities of medical MLLMs. However, existing approaches exhibit a deficiency in offering a comprehensive framework for…

    Submitted 20 June, 2025; originally announced June 2025.

  45. arXiv:2506.16961  [pdf, ps, other]

    cs.CV eess.IV

    Reversing Flow for Image Restoration

    Authors: Haina Qin, Wenyang Luo, Libin Wang, Dandan Zheng, Jingdong Chen, Ming Yang, Bing Li, Weiming Hu

    Abstract: Image restoration aims to recover high-quality (HQ) images from degraded low-quality (LQ) ones by reversing the effects of degradation. Existing generative models for image restoration, including diffusion and score-based models, often treat the degradation process as a stochastic transformation, which introduces inefficiency and complexity. In this work, we propose ResFlow, a novel image restorat…

    Submitted 20 June, 2025; originally announced June 2025.

    Comments: CVPR2025 Final Version; Corresponding Author: Bing Li

    MSC Class: 68U10; ACM Class: I.4.4

  46. arXiv:2506.16960  [pdf, ps, other]

    cs.CV

    Visual-Instructed Degradation Diffusion for All-in-One Image Restoration

    Authors: Wenyang Luo, Haina Qin, Zewen Chen, Libin Wang, Dandan Zheng, Yuming Li, Yufan Liu, Bing Li, Weiming Hu

    Abstract: Image restoration tasks like deblurring, denoising, and dehazing usually need distinct models for each degradation type, restricting their generalization in real-world scenarios with mixed or unknown degradations. In this work, we propose Defusion, a novel all-in-one image restoration framework that utilizes visual instruction-guided degradation diffusion. Unlike existing methods that rel…

    Submitted 20 June, 2025; originally announced June 2025.

    Comments: CVPR2025 Final Version; Corresponding Author: Bing Li

    MSC Class: 68U10; ACM Class: I.4.4

  47. arXiv:2506.16504  [pdf, ps, other]

    cs.CV cs.AI

    Hunyuan3D 2.5: Towards High-Fidelity 3D Assets Generation with Ultimate Details

    Authors: Zeqiang Lai, Yunfei Zhao, Haolin Liu, Zibo Zhao, Qingxiang Lin, Huiwen Shi, Xianghui Yang, Mingxin Yang, Shuhui Yang, Yifei Feng, Sheng Zhang, Xin Huang, Di Luo, Fan Yang, Fang Yang, Lifu Wang, Sicong Liu, Yixuan Tang, Yulin Cai, Zebin He, Tian Liu, Yuhong Liu, Jie Jiang, Linus, Jingwei Huang, et al. (1 additional author not shown)

    Abstract: In this report, we present Hunyuan3D 2.5, a robust suite of 3D diffusion models aimed at generating high-fidelity and detailed textured 3D assets. Hunyuan3D 2.5 follows the two-stage pipeline of its previous version, Hunyuan3D 2.0, while demonstrating substantial advancements in both shape and texture generation. In terms of shape generation, we introduce a new shape foundation model -- LATTICE, which…

    Submitted 19 June, 2025; originally announced June 2025.

    Comments: Technical report

  48. arXiv:2506.16201  [pdf, ps, other]

    cs.RO cs.CV

    FlowRAM: Grounding Flow Matching Policy with Region-Aware Mamba Framework for Robotic Manipulation

    Authors: Sen Wang, Le Wang, Sanping Zhou, Jingyi Tian, Jiayi Li, Haowen Sun, Wei Tang

    Abstract: Robotic manipulation in high-precision tasks is essential for numerous industrial and real-world applications where accuracy and speed are required. Yet current diffusion-based policy learning methods generally suffer from low computational efficiency due to the iterative denoising process during inference. Moreover, these methods do not fully explore the potential of generative models for enhanci…

    Submitted 19 June, 2025; originally announced June 2025.

  49. arXiv:2506.16190  [pdf, other]

    cs.CL

    Web(er) of Hate: A Survey on How Hate Speech Is Typed

    Authors: Luna Wang, Andrew Caines, Alice Hutchings

    Abstract: The curation of hate speech datasets involves complex design decisions that balance competing priorities. This paper critically examines these methodological choices in a diverse range of datasets, highlighting common themes and practices, and their implications for dataset reliability. Drawing on Max Weber's notion of ideal types, we argue for a reflexive approach in dataset creation, urging rese…

    Submitted 19 June, 2025; originally announced June 2025.

  50. arXiv:2506.16082  [pdf, ps, other]

    cs.CV

    PR-DETR: Injecting Position and Relation Prior for Dense Video Captioning

    Authors: Yizhe Li, Sanping Zhou, Zheng Qin, Le Wang

    Abstract: Dense video captioning is a challenging task that aims to localize and caption multiple events in an untrimmed video. Recent studies mainly follow the transformer-based architecture to jointly perform the two sub-tasks, i.e., event localization and caption generation, in an end-to-end manner. Based on the general philosophy of detection transformer, these methods implicitly learn the event locatio…

    Submitted 19 June, 2025; originally announced June 2025.