
Showing 1–50 of 884 results for author: Yang, K

Searching in archive cs.
  1. arXiv:2505.09877  [pdf, ps, other]

    cs.HC

    Post-Post-API Age: Studying Digital Platforms in Scant Data Access Times

    Authors: Kayo Mimizuka, Megan A Brown, Kai-Cheng Yang, Josephine Lukito

    Abstract: Over the past decade, data provided by digital platforms has informed substantial research in HCI to understand online human interaction and communication. Following the closure of major social media APIs that previously provided free access to large-scale data (the "post-API age"), emerging data access programs required by the European Union's Digital Services Act (DSA) have sparked optimism abou…

    Submitted 14 May, 2025; originally announced May 2025.

  2. arXiv:2505.09388  [pdf, other]

    cs.CL

    Qwen3 Technical Report

    Authors: An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou , et al. (35 additional authors not shown)

    Abstract: In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration…

    Submitted 14 May, 2025; originally announced May 2025.

  3. arXiv:2505.09106  [pdf, other]

    cs.LG

    Argus: Federated Non-convex Bilevel Learning over 6G Space-Air-Ground Integrated Network

    Authors: Ya Liu, Kai Yang, Yu Zhu, Keying Yang, Haibo Zhao

    Abstract: The space-air-ground integrated network (SAGIN) has recently emerged as a core element in the 6G networks. However, traditional centralized and synchronous optimization algorithms are unsuitable for SAGIN due to infrastructureless and time-varying environments. This paper aims to develop a novel Asynchronous algorithm a.k.a. Argus for tackling non-convex and non-smooth decentralized federated bile…

    Submitted 13 May, 2025; originally announced May 2025.

    Comments: 17 pages, 11 figures

    MSC Class: 68T07 ACM Class: I.2

  4. arXiv:2505.07916  [pdf, ps, other]

    eess.AS cs.SD

    MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder

    Authors: Bowen Zhang, Congchao Guo, Geng Yang, Hang Yu, Haozhe Zhang, Heidi Lei, Jialong Mai, Junjie Yan, Kaiyue Yang, Mingqi Yang, Peikai Huang, Ruiyang Jin, Sitan Jiang, Weihua Cheng, Yawei Li, Yichen Xiao, Yiying Zhou, Yongmao Zhang, Yuan Lu, Yucen He

    Abstract: We introduce MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech. A key innovation is our learnable speaker encoder, which extracts timbre features from a reference audio without requiring its transcription. This enables MiniMax-Speech to produce highly expressive speech with timbre consistent with the reference in a zero-shot manner, w…

    Submitted 12 May, 2025; originally announced May 2025.

  5. arXiv:2505.07219  [pdf, ps, other]

    cs.CV cs.RO eess.IV

    Language-Driven Dual Style Mixing for Single-Domain Generalized Object Detection

    Authors: Hongda Qin, Xiao Lu, Zhiyong Wei, Yihong Cao, Kailun Yang, Ningjiang Chen

    Abstract: Generalizing an object detector trained on a single domain to multiple unseen domains is a challenging task. Existing methods typically introduce image or feature augmentation to diversify the source domain to raise the robustness of the detector. Vision-Language Model (VLM)-based augmentation techniques have been proven to be effective, but they require that the detector's backbone has the same s…

    Submitted 12 May, 2025; originally announced May 2025.

    Comments: The source code and pre-trained models will be publicly available at https://github.com/qinhongda8/LDDS

  6. arXiv:2505.06283  [pdf, other]

    cs.LG q-bio.QM stat.ML

    Soft causal learning for generalized molecule property prediction: An environment perspective

    Authors: Limin Li, Kuo Yang, Wenjie Du, Pengkun Wang, Zhengyang Zhou, Yang Wang

    Abstract: Learning on molecule graphs has become an increasingly important topic in AI for science, which takes full advantage of AI to facilitate scientific discovery. Existing solutions on modeling molecules utilize Graph Neural Networks (GNNs) to achieve representations but they mostly fail to adapt models to out-of-distribution (OOD) samples. Although recent advances on OOD-oriented graph learning have…

    Submitted 7 May, 2025; originally announced May 2025.

    Comments: 23 pages, 7 figures, 3 tables

    ACM Class: I.2.4

  7. arXiv:2505.03539  [pdf, other]

    cs.CV cs.RO eess.IV

    Panoramic Out-of-Distribution Segmentation

    Authors: Mengfei Duan, Kailun Yang, Yuheng Zhang, Yihong Cao, Fei Teng, Kai Luo, Jiaming Zhang, Zhiyong Li, Shutao Li

    Abstract: Panoramic imaging enables capturing 360° images with an ultra-wide Field-of-View (FoV) for dense omnidirectional perception. However, current panoramic semantic segmentation methods fail to identify outliers, and pinhole Out-of-distribution Segmentation (OoS) models perform unsatisfactorily in the panoramic domain due to background clutter and pixel distortions. To address these issues, we introdu…

    Submitted 6 May, 2025; originally announced May 2025.

    Comments: Code and datasets will be available at https://github.com/MengfeiD/PanOoS

  8. arXiv:2505.01821  [pdf, ps, other]

    cs.DC cs.AI cs.LG

    Edge-Cloud Collaborative Computing on Distributed Intelligence and Model Optimization: A Survey

    Authors: Jing Liu, Yao Du, Kun Yang, Yan Wang, Xiping Hu, Zehua Wang, Yang Liu, Peng Sun, Azzedine Boukerche, Victor C. M. Leung

    Abstract: Edge-cloud collaborative computing (ECCC) has emerged as a pivotal paradigm for addressing the computational demands of modern intelligent applications, integrating cloud resources with edge devices to enable efficient, low-latency processing. Recent advancements in AI, particularly deep learning and large language models (LLMs), have dramatically enhanced the capabilities of these distributed sys…

    Submitted 3 May, 2025; originally announced May 2025.

    Comments: 30 pages, 10 figures, 6 tables

  9. arXiv:2505.01557  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    Contextures: Representations from Contexts

    Authors: Runtian Zhai, Kai Yang, Che-Ping Tsai, Burak Varici, Zico Kolter, Pradeep Ravikumar

    Abstract: Despite the empirical success of foundation models, we do not have a systematic characterization of the representations that these models learn. In this paper, we establish the contexture theory. It shows that a large class of representation learning methods can be characterized as learning from the association between the input and a context variable. Specifically, we show that many popular metho…

    Submitted 2 May, 2025; originally announced May 2025.

    Comments: ICML 2025, longer version. arXiv admin note: substantial text overlap with arXiv:2504.19792

  10. arXiv:2505.00991  [pdf, other]

    cs.RO eess.SY

    DexCtrl: Towards Sim-to-Real Dexterity with Adaptive Controller Learning

    Authors: Shuqi Zhao, Ke Yang, Yuxin Chen, Chenran Li, Yichen Xie, Xiang Zhang, Changhao Wang, Masayoshi Tomizuka

    Abstract: Dexterous manipulation has seen remarkable progress in recent years, with policies capable of executing many complex and contact-rich tasks in simulation. However, transferring these policies from simulation to real world remains a significant challenge. One important issue is the mismatch in low-level controller dynamics, where identical trajectories can lead to vastly different contact forces an…

    Submitted 2 May, 2025; originally announced May 2025.

  11. arXiv:2504.21330  [pdf, ps, other]

    cs.CL

    Does the Prompt-based Large Language Model Recognize Students' Demographics and Introduce Bias in Essay Scoring?

    Authors: Kaixun Yang, Mladen Raković, Dragan Gašević, Guanliang Chen

    Abstract: Large Language Models (LLMs) are widely used in Automated Essay Scoring (AES) due to their ability to capture semantic meaning. Traditional fine-tuning approaches required technical expertise, limiting accessibility for educators with limited technical backgrounds. However, prompt-based tools like ChatGPT have made AES more accessible, enabling educators to obtain machine-generated scores using na…

    Submitted 30 April, 2025; originally announced April 2025.

  12. arXiv:2504.20674  [pdf, other]

    cs.CE

    DiffLiB: High-fidelity differentiable modeling of lithium-ion batteries and efficient gradient-based parameter identification

    Authors: Weipeng Xu, Kaiqi Yang, Yuzhi Zhang, Shichao Sun, Sheng Mao, Tianju Xue

    Abstract: The physics-based Doyle-Fuller-Newman (DFN) model, widely adopted for its precise electrochemical modeling, stands out among various simulation models of lithium-ion batteries (LIBs). Although the DFN model is powerful in forward predictive analysis, the inverse identification of its model parameters has remained a long-standing challenge. The numerous unknown parameters associated with the nonlin…

    Submitted 29 April, 2025; originally announced April 2025.

  13. arXiv:2504.20369  [pdf, other]

    cs.HC cs.DB

    Perception-aware Sampling for Scatterplot Visualizations

    Authors: Zafeiria Moumoulidou, Hamza Elhamdadi, Ke Yang, Subrata Mitra, Cindy Xiong Bearfield, Alexandra Meliou

    Abstract: Visualizing data is often a crucial first step in data analytics workflows, but growing data sizes pose challenges due to computational and visual perception limitations. As a result, data analysts commonly down-sample their data and work with subsets. Deriving representative samples, however, remains a challenge. This paper focuses on scatterplots, a widely-used visualization type, and introduces…

    Submitted 12 May, 2025; v1 submitted 28 April, 2025; originally announced April 2025.

  14. arXiv:2504.20250  [pdf, other]

    cs.LG q-fin.GN q-fin.ST stat.AP stat.ML

    Financial Data Analysis with Robust Federated Logistic Regression

    Authors: Kun Yang, Nikhil Krishnan, Sanjeev R. Kulkarni

    Abstract: In this study, we focus on the analysis of financial data in a federated setting, wherein data is distributed across multiple clients or locations, and the raw data never leaves the local devices. Our primary focus is not only on the development of efficient learning frameworks (for protecting user data privacy) in the field of federated learning but also on the importance of designing models that…

    Submitted 28 April, 2025; originally announced April 2025.

  15. arXiv:2504.19724  [pdf, other]

    cs.CV

    RepText: Rendering Visual Text via Replicating

    Authors: Haofan Wang, Yujia Xu, Yimeng Li, Junchen Li, Chaowei Zhang, Jing Wang, Kejia Yang, Zhibo Chen

    Abstract: Although contemporary text-to-image generation models have achieved remarkable breakthroughs in producing visually appealing images, their capacity to generate precise and flexible typographic elements, especially non-Latin alphabets, remains constrained. To address these limitations, we start from a naive assumption that text understanding is only a sufficient condition for text rendering, but n…

    Submitted 28 April, 2025; originally announced April 2025.

    Comments: Technical Report. https://reptext.github.io/

  16. arXiv:2504.19444  [pdf, other]

    cs.SE cs.CL

    Large Language Models are Qualified Benchmark Builders: Rebuilding Pre-Training Datasets for Advancing Code Intelligence Tasks

    Authors: Kang Yang, Xinjun Mao, Shangwen Wang, Yanlin Wang, Tanghaoran Zhang, Bo Lin, Yihao Qin, Zhang Zhang, Yao Lu, Kamal Al-Sabahi

    Abstract: Pre-trained code models rely heavily on high-quality pre-training data, particularly human-written reference comments that bridge code and natural language. However, these comments often become outdated as software evolves, degrading model performance. Large language models (LLMs) excel at generating high-quality code comments. We investigate whether replacing human-written comments with LLM-gener…

    Submitted 27 April, 2025; originally announced April 2025.

    Comments: Awarded the ACM SIGSOFT Distinguished Paper Award in ICPC 2025

  17. arXiv:2504.18448  [pdf, other]

    cs.CV

    NoiseController: Towards Consistent Multi-view Video Generation via Noise Decomposition and Collaboration

    Authors: Haotian Dong, Xin Wang, Di Lin, Yipeng Wu, Qin Chen, Ruonan Liu, Kairui Yang, Ping Li, Qing Guo

    Abstract: High-quality video generation is crucial for many fields, including the film industry and autonomous driving. However, generating videos with spatiotemporal consistencies remains challenging. Current methods typically utilize attention mechanisms or modify noise to achieve consistent videos, neglecting global spatiotemporal information that could help ensure spatial and temporal consistency during…

    Submitted 25 April, 2025; originally announced April 2025.

  18. arXiv:2504.17432  [pdf, other]

    cs.CV

    Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs

    Authors: Tiancheng Gu, Kaicheng Yang, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, Jiankang Deng

    Abstract: The Contrastive Language-Image Pre-training (CLIP) framework has become a widely used approach for multimodal representation learning, particularly in image-text retrieval and clustering. However, its efficacy is constrained by three key limitations: (1) text token truncation, (2) isolated image-text encoding, and (3) deficient compositionality due to bag-of-words behavior. While recent Multimodal…

    Submitted 24 April, 2025; originally announced April 2025.

    Comments: 13 pages, 8 figures, Project page: https://garygutc.github.io/UniME

  19. arXiv:2504.16801  [pdf, other]

    cs.CV

    Decoupled Global-Local Alignment for Improving Compositional Understanding

    Authors: Xiaoxing Hu, Kaicheng Yang, Jun Wang, Haoran Xu, Ziyong Feng, Yupei Wang

    Abstract: Contrastive Language-Image Pre-training (CLIP) has achieved success on multiple downstream tasks by aligning image and text modalities. However, the nature of global contrastive learning limits CLIP's ability to comprehend compositional concepts, such as relations and attributes. Although recent studies employ global hard negative samples to improve compositional understanding, these methods signi…

    Submitted 23 April, 2025; originally announced April 2025.

  20. arXiv:2504.14653  [pdf, other]

    cs.IT eess.SP

    Wireless Large AI Model: Shaping the AI-Native Future of 6G and Beyond

    Authors: Fenghao Zhu, Xinquan Wang, Xinyi Li, Maojun Zhang, Yixuan Chen, Chongwen Huang, Zhaohui Yang, Xiaoming Chen, Zhaoyang Zhang, Richeng Jin, Yongming Huang, Wei Feng, Tingting Yang, Baoming Bai, Feifei Gao, Kun Yang, Yuanwei Liu, Sami Muhaidat, Chau Yuen, Kaibin Huang, Kai-Kit Wong, Dusit Niyato, Mérouane Debbah

    Abstract: The emergence of sixth-generation and beyond communication systems is expected to fundamentally transform digital experiences through introducing unparalleled levels of intelligence, efficiency, and connectivity. A promising technology poised to enable this revolutionary vision is the wireless large AI model (WLAM), characterized by its exceptional capabilities in data processing, inference, and d…

    Submitted 28 April, 2025; v1 submitted 20 April, 2025; originally announced April 2025.

  21. arXiv:2504.11966  [pdf, other]

    cs.CV cs.LG cs.RO eess.IV

    Exploring Video-Based Driver Activity Recognition under Noisy Labels

    Authors: Linjuan Fan, Di Wen, Kunyu Peng, Kailun Yang, Jiaming Zhang, Ruiping Liu, Yufan Chen, Junwei Zheng, Jiamin Wu, Xudong Han, Rainer Stiefelhagen

    Abstract: As an open research topic in the field of deep learning, learning with noisy labels has attracted much attention and grown rapidly over the past ten years. Learning with label noise is crucial for driver distraction behavior recognition, as real-world video data often contains mislabeled samples, impacting model reliability and performance. However, label noise learning is barely explored in the d…

    Submitted 16 April, 2025; originally announced April 2025.

    Comments: The source code is available at https://github.com/ilonafan/DAR-noisy-labels

  22. arXiv:2504.09757  [pdf, other]

    cs.CR

    Alleviating the Fear of Losing Alignment in LLM Fine-tuning

    Authors: Kang Yang, Guanhong Tao, Xun Chen, Jun Xu

    Abstract: Large language models (LLMs) have demonstrated revolutionary capabilities in understanding complex contexts and performing a wide range of tasks. However, LLMs can also answer questions that are unethical or harmful, raising concerns about their applications. To regulate LLMs' responses to such questions, a training strategy called alignment can help. Yet, alignment can be unexpectedly co…

    Submitted 13 April, 2025; originally announced April 2025.

  23. arXiv:2504.09197  [pdf, other]

    cs.AI

    Graph Learning-Driven Multi-Vessel Association: Fusing Multimodal Data for Maritime Intelligence

    Authors: Yuxu Lu, Kaisen Yang, Dong Yang, Haifeng Ding, Jinxian Weng, Ryan Wen Liu

    Abstract: Ensuring maritime safety and optimizing traffic management in increasingly crowded and complex waterways require effective waterway monitoring. However, current methods struggle with challenges arising from multimodal data, such as dimensional disparities, mismatched target counts, vessel scale variations, occlusions, and asynchronous data streams from systems like the automatic identification sys…

    Submitted 12 April, 2025; originally announced April 2025.

  24. arXiv:2504.07717  [pdf, other]

    cs.CR cs.AI

    PR-Attack: Coordinated Prompt-RAG Attacks on Retrieval-Augmented Generation in Large Language Models via Bilevel Optimization

    Authors: Yang Jiao, Xiaodong Wang, Kai Yang

    Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of applications, e.g., medical question-answering, mathematical sciences, and code generation. However, they also exhibit inherent limitations, such as outdated knowledge and susceptibility to hallucinations. Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm to address these issues, but…

    Submitted 16 April, 2025; v1 submitted 10 April, 2025; originally announced April 2025.

    Comments: Accepted at SIGIR 2025

  25. arXiv:2504.05276  [pdf, other]

    cs.CL

    Enhancing LLM-Based Short Answer Grading with Retrieval-Augmented Generation

    Authors: Yucheng Chu, Peng He, Hang Li, Haoyu Han, Kaiqi Yang, Yu Xue, Tingting Li, Joseph Krajcik, Jiliang Tang

    Abstract: Short answer assessment is a vital component of science education, allowing evaluation of students' complex three-dimensional understanding. Large language models (LLMs) that possess human-like ability in linguistic tasks are increasingly popular in assisting human graders to reduce their workload. However, LLMs' limitations in domain knowledge restrict their understanding in task-specific require…

    Submitted 7 April, 2025; originally announced April 2025.

  26. arXiv:2504.05239  [pdf, other]

    cs.CL

    LLM-based Automated Grading with Human-in-the-Loop

    Authors: Hang Li, Yucheng Chu, Kaiqi Yang, Yasemin Copur-Gencturk, Jiliang Tang

    Abstract: The rise of artificial intelligence (AI) technologies, particularly large language models (LLMs), has brought significant advancements to the field of education. Among various applications, automatic short answer grading (ASAG), which focuses on evaluating open-ended textual responses, has seen remarkable progress with the introduction of LLMs. These models not only enhance grading performance com…

    Submitted 28 April, 2025; v1 submitted 7 April, 2025; originally announced April 2025.

  27. arXiv:2504.03651  [pdf, other]

    cs.DC cs.AI cs.LG

    Echo: Efficient Co-Scheduling of Hybrid Online-Offline Tasks for Large Language Model Serving

    Authors: Zhibin Wang, Shipeng Li, Xue Li, Yuhang Zhou, Zhonghui Zhang, Zibo Wang, Rong Gu, Chen Tian, Kun Yang, Sheng Zhong

    Abstract: Large language models have been widely deployed in various applications, encompassing both interactive online tasks and batched offline tasks. Given the burstiness and latency sensitivity of online tasks, over-provisioning resources is common practice. This allows for the integration of latency-insensitive offline tasks during periods of low online load, enhancing resource utilization. However, st…

    Submitted 1 March, 2025; originally announced April 2025.

  28. arXiv:2504.03041  [pdf, other]

    cs.CV

    VIP: Video Inpainting Pipeline for Real World Human Removal

    Authors: Huiming Sun, Yikang Li, Kangning Yang, Ruineng Li, Daitao Xing, Yangbo Xie, Lan Fu, Kaiyu Zhang, Ming Chen, Jiaming Ding, Jiang Geng, Jie Cai, Zibo Meng, Chiuman Ho

    Abstract: Inpainting for real-world human and pedestrian removal in high-resolution video clips presents significant challenges, particularly in achieving high-quality outcomes, ensuring temporal consistency, and managing complex object interactions that involve humans, their belongings, and their shadows. In this paper, we introduce VIP (Video Inpainting Pipeline), a novel promptless video inpainting frame…

    Submitted 3 April, 2025; originally announced April 2025.

  29. arXiv:2504.01981  [pdf, other]

    cs.AR cs.AI

    NLS: Natural-Level Synthesis for Hardware Implementation Through GenAI

    Authors: Kaiyuan Yang, Huang Ouyang, Xinyi Wang, Bingjie Lu, Yanbo Wang, Charith Abhayaratne, Sizhao Li, Long Jin, Tiantai Deng

    Abstract: This paper introduces Natural-Level Synthesis, an innovative approach for generating hardware using generative artificial intelligence on both the system level and component-level. NLS bridges a gap in current hardware development processes, where algorithm and application engineers' involvement typically ends at the requirements stage. With NLS, engineers can participate more deeply in the develo…

    Submitted 28 March, 2025; originally announced April 2025.

    Comments: 9 pages, 4 figures, and 5 tables. Submitted for IEEE Transactions on CAD. The same content was accepted by Design Automation Conference 2025 as a WIP Poster (not count as publication, so it's ok to submit the content elsewhere). TCAD info: https://ieeexplore.ieee.org/document/10186100 Submitted for review on 26th of Feb. Reference - TCAD-2025-0203

  30. arXiv:2503.22079  [pdf, other]

    cs.CV

    A Semantic-Enhanced Heterogeneous Graph Learning Method for Flexible Objects Recognition

    Authors: Kunshan Yang, Wenwei Luo, Yuguo Hu, Jiafu Yan, Mengmeng Jing, Lin Zuo

    Abstract: Flexible objects recognition remains a significant challenge due to its inherently diverse shapes and sizes, translucent attributes, and subtle inter-class differences. Graph-based models, such as graph convolution networks and graph vision models, are promising in flexible objects recognition due to their ability of capturing variable relations within the flexible objects. These methods, however,…

    Submitted 27 March, 2025; originally announced March 2025.

    Comments: Accepted by ICME 2025

  31. arXiv:2503.19543  [pdf, other]

    cs.CV

    Scene-agnostic Pose Regression for Visual Localization

    Authors: Junwei Zheng, Ruiping Liu, Yufan Chen, Zhenfang Chen, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen

    Abstract: Absolute Pose Regression (APR) predicts 6D camera poses but lacks the adaptability to unknown environments without retraining, while Relative Pose Regression (RPR) generalizes better yet requires a large image retrieval database. Visual Odometry (VO) generalizes well in unseen environments but suffers from accumulated error in open trajectories. To address this dilemma, we introduce a new task, Sc…

    Submitted 25 March, 2025; originally announced March 2025.

    Comments: Accepted by CVPR 2025. Project page: https://junweizheng93.github.io/publications/SPR/SPR.html

  32. arXiv:2503.15952  [pdf, other]

    cs.CL

    Adaptive Group Policy Optimization: Towards Stable Training and Token-Efficient Reasoning

    Authors: Chen Li, Nazhou Liu, Kai Yang

    Abstract: Since DeepSeek-R1 popularized, Group Relative Policy Optimization (GRPO) has become the core part of Reasoning LLMs training. However, we find some deficiency that influences RL stability and inference efficiency. Thus, we propose Adaptive Group Policy Optimization (AGPO) which contains two simple but effective modifications: a revised advantage estimation method to mitigate zero-variance situatio…

    Submitted 20 March, 2025; originally announced March 2025.

    Comments: This is an unfinished version and will be updated. We aim to share some findings

  33. arXiv:2503.13946  [pdf, other]

    cs.CV

    Is Discretization Fusion All You Need for Collaborative Perception?

    Authors: Kang Yang, Tianci Bu, Lantao Li, Chunxu Li, Yongcai Wang, Deying Li

    Abstract: Collaborative perception in multi-agent system enhances overall perceptual capabilities by facilitating the exchange of complementary information among agents. Current mainstream collaborative perception methods rely on discretized feature maps to conduct fusion, which however, lacks flexibility in extracting and transmitting the informative features and can hardly focus on the informative feature…

    Submitted 18 March, 2025; originally announced March 2025.

  34. arXiv:2503.12419  [pdf, other]

    cs.CV cs.RO eess.IV physics.optics

    EgoEvGesture: Gesture Recognition Based on Egocentric Event Camera

    Authors: Luming Wang, Hao Shi, Xiaoting Yin, Kailun Yang, Kaiwei Wang, Jian Bai

    Abstract: Egocentric gesture recognition is a pivotal technology for enhancing natural human-computer interaction, yet traditional RGB-based solutions suffer from motion blur and illumination variations in dynamic scenarios. While event cameras show distinct advantages in handling high dynamic range with ultra-low power consumption, existing RGB-based architectures face inherent limitations in processing as…

    Submitted 13 April, 2025; v1 submitted 16 March, 2025; originally announced March 2025.

    Comments: The dataset and models are made available at https://github.com/3190105222/EgoEv_Gesture

  35. arXiv:2503.12233  [pdf, ps, other]

    cs.IT eess.SP

    Robust Full-Space Physical Layer Security for STAR-RIS-Aided Wireless Networks: Eavesdropper with Uncertain Location and Channel

    Authors: Han Xiao, Xiaoyan Hu, Ang Li, Wenjie Wang, Kun Yang

    Abstract: A robust full-space physical layer security (PLS) transmission scheme is proposed in this paper considering the full-space wiretapping challenge of wireless networks supported by simultaneous transmitting and reflecting reconfigurable intelligent surface (STAR-RIS). Different from the existing schemes, the proposed PLS scheme takes account of the uncertainty on the eavesdropper's position within t…

    Submitted 15 March, 2025; originally announced March 2025.

  36. arXiv:2503.11068  [pdf]

    cs.ET

    DeepSeek Powered Solid Dosage Formulation Design and Development

    Authors: Leqi Lin, Xingyu Zhou, Kaiyuan Yang, Xizhong Chen

    Abstract: Pharmaceutical process design and development for generic, innovative, or personalized drugs have always been a time-consuming, costly, rigorous process, that involves multi-stage evaluation for better quality control and assurance. Large language models (LLMs), a type of generative artificial intelligence system, can augment laboratory research in the pharmaceutical engineering process by helping…

    Submitted 21 March, 2025; v1 submitted 14 March, 2025; originally announced March 2025.

  37. arXiv:2503.10216  [pdf, other]

    cs.CV

    CoStoDet-DDPM: Collaborative Training of Stochastic and Deterministic Models Improves Surgical Workflow Anticipation and Recognition

    Authors: Kaixiang Yang, Xin Li, Qiang Li, Zhiwei Wang

    Abstract: Anticipating and recognizing surgical workflows are critical for intelligent surgical assistance systems. However, existing methods rely on deterministic decision-making, struggling to generalize across the large anatomical and procedural variations inherent in real-world surgeries. In this paper, we introduce an innovative framework that incorporates stochastic modeling through a denoising diffusi…

    Submitted 13 March, 2025; originally announced March 2025.

  38. arXiv:2503.08726  [pdf, other]

    cs.LG cs.AI eess.SP

    SIMAC: A Semantic-Driven Integrated Multimodal Sensing And Communication Framework

    Authors: Yubo Peng, Luping Xiang, Kun Yang, Feibo Jiang, Kezhi Wang, Dapeng Oliver Wu

    Abstract: Traditional single-modality sensing faces limitations in accuracy and capability, and its decoupled implementation with communication systems increases latency in bandwidth-constrained environments. Additionally, single-task-oriented sensing systems fail to address users' diverse demands. To overcome these challenges, we propose a semantic-driven integrated multimodal sensing and communication (SI…

    Submitted 10 March, 2025; originally announced March 2025.

  39. arXiv:2503.08531  [pdf, other]

    cs.CV q-bio.QM

    Visual Attention Graph

    Authors: Kai-Fu Yang, Yong-Jie Li

    Abstract: Visual attention plays a critical role when our visual system executes active visual tasks by interacting with the physical scene. However, how to encode the visual object relationship in the psychological world of our brain deserves to be explored. In the field of computer vision, predicting visual fixations or scanpaths is a usual way to explore the visual attention and behaviors of human observ…

    Submitted 11 March, 2025; originally announced March 2025.

    Comments: 20 pages, 14 figures

  40. arXiv:2503.07536  [pdf, other]

    cs.CL cs.AI

    LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

    Authors: Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, Xu Yang

    Abstract: Enhancing reasoning in Large Multimodal Models (LMMs) faces unique challenges from the complex interplay between visual perception and logical reasoning, particularly in compact 3B-parameter architectures where architectural constraints limit reasoning capacity and modality alignment. While rule-based reinforcement learning (RL) excels in text-only domains, its multimodal extension confronts two…

    Submitted 10 March, 2025; v1 submitted 10 March, 2025; originally announced March 2025.

  41. arXiv:2503.07252  [pdf, other]

    cs.CV eess.IV eess.SP

    Semantic Communications with Computer Vision Sensing for Edge Video Transmission

    Authors: Yubo Peng, Luping Xiang, Kun Yang, Kezhi Wang, Merouane Debbah

    Abstract: Despite the widespread adoption of vision sensors in edge applications, such as surveillance, the transmission of video data consumes substantial spectrum resources. Semantic communication (SC) offers a solution by extracting and compressing information at the semantic level, preserving the accuracy and relevance of transmitted data while significantly reducing the volume of transmitted informatio…

    Submitted 10 March, 2025; originally announced March 2025.

  42. arXiv:2503.06821  [pdf, other]

    cs.CV cs.RO eess.IV

    HierDAMap: Towards Universal Domain Adaptive BEV Mapping via Hierarchical Perspective Priors

    Authors: Siyu Li, Yihong Cao, Hao Shi, Yongsheng Zang, Xuan He, Kailun Yang, Zhiyong Li

    Abstract: The exploration of Bird's-Eye View (BEV) mapping technology has driven significant innovation in visual perception technology for autonomous driving. BEV mapping models need to be applied to the unlabeled real world, making the study of unsupervised domain adaptation models an essential path. However, research on unsupervised domain adaptation for BEV mapping remains limited and cannot perfectly a…

    Submitted 9 March, 2025; originally announced March 2025.

    Comments: The source code will be made publicly available at https://github.com/lynn-yu/HierDAMap

  43. arXiv:2503.06700  [pdf, other]

    cs.CV

    MemorySAM: Memorize Modalities and Semantics with Segment Anything Model 2 for Multi-modal Semantic Segmentation

    Authors: Chenfei Liao, Xu Zheng, Yuanhuiyi Lyu, Haiwei Xue, Yihong Cao, Jiawen Wang, Kailun Yang, Xuming Hu

    Abstract: Research has focused on Multi-Modal Semantic Segmentation (MMSS), where pixel-wise predictions are derived from multiple visual modalities captured by diverse sensors. Recently, the large vision model, Segment Anything Model 2 (SAM2), has shown strong zero-shot segmentation performance on both images and videos. When extending SAM2 to MMSS, two issues arise: 1. How can SAM2 be adapted to multi-mod…

    Submitted 20 March, 2025; v1 submitted 9 March, 2025; originally announced March 2025.

  44. arXiv:2503.06125  [pdf, other]

    eess.IV cs.CV

    RGB-Phase Speckle: Cross-Scene Stereo 3D Reconstruction via Wrapped Pre-Normalization

    Authors: Kai Yang, Zijian Bai, Yang Xiao, Xinyu Li, Xiaohan Shi

    Abstract: 3D reconstruction garners increasing attention alongside the advancement of high-level image applications, where dense stereo matching (DSM) serves as a pivotal technique. Previous studies often rely on publicly available datasets for training, focusing on modifying network architectures or incorporating specialized modules to extract domain-invariant features and thus improve model robustness. In…

    Submitted 17 April, 2025; v1 submitted 8 March, 2025; originally announced March 2025.

    Comments: Submitted to ICCV 2025

  45. arXiv:2503.05584  [pdf, other]

    cs.CV

    QArtSR: Quantization via Reverse-Module and Timestep-Retraining in One-Step Diffusion based Image Super-Resolution

    Authors: Libo Zhu, Haotong Qin, Kaicheng Yang, Wenbo Li, Yong Guo, Yulun Zhang, Susanto Rahardja, Xiaokang Yang

    Abstract: One-step diffusion-based image super-resolution (OSDSR) models are showing increasingly superior performance nowadays. However, although their denoising steps are reduced to one and they can be quantized to 8-bit to reduce the costs further, there is still significant potential for OSDSR to quantize to lower bits. To explore more possibilities of quantized OSDSR, we propose an efficient method, Qu…

    Submitted 7 March, 2025; originally announced March 2025.

  46. arXiv:2503.05569  [pdf]

    cs.RO physics.med-ph

    A-SEE2.0: Active-Sensing End-Effector for Robotic Ultrasound Systems with Dense Contact Surface Perception Enabled Probe Orientation Adjustment

    Authors: Yernar Zhetpissov, Xihan Ma, Kehan Yang, Haichong K. Zhang

    Abstract: Conventional freehand ultrasound (US) imaging is highly dependent on the skill of the operator, often leading to inconsistent results and increased physical demand on sonographers. Robotic Ultrasound Systems (RUSS) aim to address these limitations by providing standardized and automated imaging solutions, especially in environments with limited access to skilled operators. This paper presents the…

    Submitted 7 March, 2025; originally announced March 2025.

    Comments: 8 pages, submitted for review

  47. arXiv:2503.04565  [pdf, other]

    cs.CV cs.RO eess.IV

    Omnidirectional Multi-Object Tracking

    Authors: Kai Luo, Hao Shi, Sheng Wu, Fei Teng, Mengfei Duan, Chang Huang, Yuhang Wang, Kaiwei Wang, Kailun Yang

    Abstract: Panoramic imagery, with its 360° field of view, offers comprehensive information to support Multi-Object Tracking (MOT) in capturing spatial and temporal relationships of surrounding objects. However, most MOT algorithms are tailored for pinhole images with limited views, impairing their effectiveness in panoramic settings. Additionally, panoramic image distortions, such as resolution loss, geomet…

    Submitted 23 March, 2025; v1 submitted 6 March, 2025; originally announced March 2025.

    Comments: Accepted to CVPR 2025. The established dataset and source code are available at https://github.com/xifen523/OmniTrack

  48. arXiv:2503.03115  [pdf, other]

    cs.CV

    NTR-Gaussian: Nighttime Dynamic Thermal Reconstruction with 4D Gaussian Splatting Based on Thermodynamics

    Authors: Kun Yang, Yuxiang Liu, Zeyu Cui, Yu Liu, Maojun Zhang, Shen Yan, Qing Wang

    Abstract: Thermal infrared imaging offers the advantage of all-weather capability, enabling non-intrusive measurement of an object's surface temperature. Consequently, thermal infrared images are employed to reconstruct 3D models that accurately reflect the temperature distribution of a scene, aiding in applications such as building monitoring and energy management. However, existing approaches predominantl…

    Submitted 4 March, 2025; originally announced March 2025.

    Comments: IEEE Conference on Computer Vision and Pattern Recognition 2025

  49. arXiv:2503.02600  [pdf, other]

    cs.CV cs.RO eess.IV

    Resource-Efficient Affordance Grounding with Complementary Depth and Semantic Prompts

    Authors: Yizhou Huang, Fan Yang, Guoliang Zhu, Gen Li, Hao Shi, Yukun Zuo, Wenrui Chen, Zhiyong Li, Kailun Yang

    Abstract: Affordance refers to the functional properties that an agent perceives and utilizes from its environment, and is key perceptual information required for robots to perform actions. This information is rich and multimodal in nature. Existing multimodal affordance methods face limitations in extracting useful information, mainly due to simple structural designs, basic fusion methods, and large model…

    Submitted 4 March, 2025; originally announced March 2025.

    Comments: The source code will be made publicly available at https://github.com/DAWDSE/BiT-Align

  50. arXiv:2503.02581  [pdf, other]

    cs.CV cs.RO eess.IV

    Unveiling the Potential of Segment Anything Model 2 for RGB-Thermal Semantic Segmentation with Language Guidance

    Authors: Jiayi Zhao, Fei Teng, Kai Luo, Guoqiang Zhao, Zhiyong Li, Xu Zheng, Kailun Yang

    Abstract: The perception capability of robotic systems relies on the richness of the dataset. Although Segment Anything Model 2 (SAM2), trained on large datasets, demonstrates strong perception potential in perception tasks, its inherent training paradigm prevents it from being suitable for RGB-T tasks. To address these challenges, we propose SHIFNet, a novel SAM2-driven Hybrid Interaction Paradigm that unl…

    Submitted 4 March, 2025; originally announced March 2025.

    Comments: The source code will be made publicly available at https://github.com/iAsakiT3T/SHIFNet