Showing 1–50 of 863 results for author: Sun, S

Searching in archive cs.
  1. arXiv:2505.09315  [pdf, other]

    cs.RO cs.CV cs.LG

    TransDiffuser: End-to-end Trajectory Generation with Decorrelated Multi-modal Representation for Autonomous Driving

    Authors: Xuefeng Jiang, Yuan Ma, Pengxiang Li, Leimeng Xu, Xin Wen, Kun Zhan, Zhongpu Xia, Peng Jia, XianPeng Lang, Sheng Sun

    Abstract: In recent years, diffusion models have shown their potential across diverse domains from vision generation to language modeling. Transferring their capabilities to modern autonomous driving systems has also emerged as a promising direction. In this work, we propose TransDiffuser, an encoder-decoder based generative trajectory planning model for end-to-end autonomous driving. The encoded scene information…

    Submitted 14 May, 2025; originally announced May 2025.

    Comments: Under review

  2. arXiv:2505.08392  [pdf, other]

    cs.CL cs.AI

    Accelerating Chain-of-Thought Reasoning: When Goal-Gradient Importance Meets Dynamic Skipping

    Authors: Ren Zhuang, Ben Wang, Shuifa Sun

    Abstract: Large Language Models leverage Chain-of-Thought (CoT) prompting for complex tasks, but their reasoning traces are often excessively verbose and inefficient, leading to significant computational costs and latency. Current CoT compression techniques typically rely on generic importance metrics and static compression rates, which may inadvertently remove functionally critical tokens or fail to adapt…

    Submitted 13 May, 2025; originally announced May 2025.

  3. arXiv:2505.08366  [pdf]

    eess.SP cs.AI

    Non-contact Vital Signs Detection in Dynamic Environments

    Authors: Shuai Sun, Chong-Xi Liang, Chengwei Ye, Huanzhen Zhang, Kangsheng Wang

    Abstract: Accurate phase demodulation is critical for vital sign detection using millimeter-wave radar. However, in complex environments, time-varying DC offsets and phase imbalances can severely degrade demodulation performance. To address this, we propose a novel DC offset calibration method alongside a Hilbert and Differential Cross-Multiply (HADCM) demodulation algorithm. The approach estimates time-var…

    Submitted 13 May, 2025; originally announced May 2025.

  4. arXiv:2505.08157  [pdf, other]

    cs.IR cs.AI

    Hyperbolic Contrastive Learning with Model-augmentation for Knowledge-aware Recommendation

    Authors: Shengyin Sun, Chen Ma

    Abstract: Benefiting from the effectiveness of graph neural networks (GNNs) and contrastive learning, GNN-based contrastive learning has become mainstream for knowledge-aware recommendation. However, most existing contrastive learning-based methods have difficulties in effectively capturing the underlying hierarchical structure within user-item bipartite graphs and knowledge graphs. Moreover, they commonly…

    Submitted 12 May, 2025; originally announced May 2025.

    Comments: 18 pages

  5. arXiv:2505.06919  [pdf, ps, other]

    cs.RO

    The First WARA Robotics Mobile Manipulation Challenge -- Lessons Learned

    Authors: David Cáceres Domínguez, Marco Iannotta, Abhishek Kashyap, Shuo Sun, Yuxuan Yang, Christian Cella, Matteo Colombo, Martina Pelosi, Giuseppe F. Preziosa, Alessandra Tafuro, Isacco Zappa, Finn Busch, Yifei Dong, Alberta Longhini, Haofei Lu, Rafael I. Cabral Muchacho, Jonathan Styrud, Sebastiano Fregnan, Marko Guberina, Zheng Jia, Graziano Carriero, Sofia Lindqvist, Silvio Di Castro, Matteo Iovino

    Abstract: The first WARA Robotics Mobile Manipulation Challenge, held in December 2024 at ABB Corporate Research in Västerås, Sweden, addressed the automation of task-intensive and repetitive manual labor in laboratory environments - specifically the transport and cleaning of glassware. Designed in collaboration with AstraZeneca, the challenge invited academic teams to develop autonomous robotic systems cap…

    Submitted 11 May, 2025; originally announced May 2025.

  6. arXiv:2505.06684  [pdf, other]

    cs.CV cs.AI

    FNBench: Benchmarking Robust Federated Learning against Noisy Labels

    Authors: Xuefeng Jiang, Jia Li, Nannan Wu, Zhiyuan Wu, Xujing Li, Sheng Sun, Gang Xu, Yuwei Wang, Qi Li, Min Liu

    Abstract: Robustness to label noise within data is a significant challenge in federated learning (FL). From the data-centric perspective, the data quality of distributed datasets cannot be guaranteed since annotations of different clients contain complicated label noise of varying degrees, which causes performance degradation. There have been some early attempts to tackle noisy labels in FL. However, t…

    Submitted 10 May, 2025; originally announced May 2025.

    Comments: Submitted to IEEE TDSC, currently under major revision

  7. arXiv:2505.04339  [pdf, other]

    cs.LG cs.AI

    Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning

    Authors: Hao Peng, Xiang Huang, Shuo Sun, Ruitong Zhang, Philip S. Yu

    Abstract: DBSCAN, a well-known density-based clustering algorithm, has gained widespread popularity and usage due to its effectiveness in identifying clusters of arbitrary shapes and handling noisy data. However, it encounters challenges in producing satisfactory cluster results when confronted with datasets of varying density scales, a common scenario in real-world applications. In this paper, we propose a…

    Submitted 7 May, 2025; originally announced May 2025.

  8. arXiv:2505.03039  [pdf]

    cs.CV stat.AP

    An Explainable Anomaly Detection Framework for Monitoring Depression and Anxiety Using Consumer Wearable Devices

    Authors: Yuezhou Zhang, Amos A. Folarin, Callum Stewart, Heet Sankesara, Yatharth Ranjan, Pauline Conde, Akash Roy Choudhury, Shaoxiong Sun, Zulqarnain Rashid, Richard J. B. Dobson

    Abstract: Continuous monitoring of behavior and physiology via wearable devices offers a novel, objective method for the early detection of worsening depression and anxiety. In this study, we present an explainable anomaly detection framework that identifies clinically meaningful increases in symptom severity using consumer-grade wearable data. Leveraging data from 2,023 participants with defined healthy ba…

    Submitted 5 May, 2025; originally announced May 2025.

  9. arXiv:2505.00989  [pdf, other]

    cs.CL

    VTS-LLM: Domain-Adaptive LLM Agent for Enhancing Awareness in Vessel Traffic Services through Natural Language

    Authors: Sijin Sun, Liangbin Zhao, Ming Deng, Xiuju Fu

    Abstract: Vessel Traffic Services (VTS) are essential for maritime safety and regulatory compliance through real-time traffic management. However, with increasing traffic complexity and the prevalence of heterogeneous, multimodal data, existing VTS systems face limitations in spatiotemporal reasoning and intuitive human interaction. In this work, we propose VTS-LLM Agent, the first domain-adaptive large LLM…

    Submitted 2 May, 2025; originally announced May 2025.

    Comments: 8 pages, 5 figures, 7 tables, submitted to ITSC2025

  10. arXiv:2505.00473  [pdf, other]

    cs.LG math.NA

    Interpretable Spatial-Temporal Fusion Transformers: Multi-Output Prediction for Parametric Dynamical Systems with Time-Varying Inputs

    Authors: Shuwen Sun, Lihong Feng, Peter Benner

    Abstract: We explore the promising performance of a transformer model in predicting outputs of parametric dynamical systems with external time-varying input signals. The outputs of such systems vary not only with physical parameters but also with external time-varying input signals. Accurately capturing the dynamics of such systems is challenging. We have adapted and extended an existing transformer model fo…

    Submitted 1 May, 2025; originally announced May 2025.

  11. arXiv:2505.00340  [pdf, ps, other]

    cs.CR

    Vehicular Communication Security: Multi-Channel and Multi-Factor Authentication

    Authors: Marco De Vincenzi, Shuyang Sun, Chen Bo Calvin Zhang, Manuel Garcia, Shaozu Ding, Chiara Bodei, Ilaria Matteucci, Sanjay E. Sarma, Dajiang Suo

    Abstract: Secure and reliable communications are crucial for Intelligent Transportation Systems (ITSs), where Vehicle-to-Infrastructure (V2I) communication plays a key role in enabling mobility-enhancing and safety-critical services. Current V2I authentication relies on credential-based methods over wireless Non-Line-of-Sight (NLOS) channels, leaving them exposed to remote impersonation and proximity attack…

    Submitted 8 May, 2025; v1 submitted 1 May, 2025; originally announced May 2025.

  12. arXiv:2504.21583  [pdf, other]

    cs.NI

    Toward Realization of Low-Altitude Economy Networks: Core Architecture, Integrated Technologies, and Future Directions

    Authors: Yixian Wang, Geng Sun, Zemin Sun, Jiacheng Wang, Jiahui Li, Changyuan Zhao, Jing Wu, Shuang Liang, Minghao Yin, Pengfei Wang, Dusit Niyato, Sumei Sun, Dong In Kim

    Abstract: The rise of the low-altitude economy (LAE) is propelling urban development and emerging industries by integrating advanced technologies to enhance efficiency, safety, and sustainability in low-altitude operations. The widespread adoption of unmanned aerial vehicles (UAVs) and electric vertical takeoff and landing (eVTOL) aircraft plays a crucial role in enabling key applications within LAE, such a…

    Submitted 30 April, 2025; originally announced April 2025.

    Comments: 25 pages, 12 figures, published to TCCN

  13. arXiv:2504.21366  [pdf, other]

    cs.SD cs.AI

    DGFNet: End-to-End Audio-Visual Source Separation Based on Dynamic Gating Fusion

    Authors: Yinfeng Yu, Shiyu Sun

    Abstract: Current Audio-Visual Source Separation methods primarily adopt two design strategies. The first strategy involves fusing audio and visual features at the bottleneck layer of the encoder, followed by processing the fused features through the decoder. However, when there is a significant disparity between the two modalities, this approach may lead to the loss of critical information. The second stra…

    Submitted 30 April, 2025; originally announced April 2025.

    Comments: Main paper (9 pages). Accepted for publication by ICMR (International Conference on Multimedia Retrieval) 2025

  14. arXiv:2504.21311  [pdf, ps, other]

    cs.NI

    Covert Prompt Transmission for Secure Large Language Model Services

    Authors: Ruichen Zhang, Yinqiu Liu, Shunpu Tang, Jiacheng Wang, Dusit Niyato, Geng Sun, Yonghui Li, Sumei Sun

    Abstract: This paper investigates covert prompt transmission for secure and efficient large language model (LLM) services over wireless networks. We formulate a latency minimization problem under fidelity and detectability constraints to ensure confidential and covert communication by jointly optimizing the transmit power and prompt compression ratio. To solve this problem, we first propose a prompt compres…

    Submitted 30 April, 2025; originally announced April 2025.

    Comments: 13 pages, 9 figures

  15. arXiv:2504.20829  [pdf, other]

    cs.CV cs.AI

    GaussTrap: Stealthy Poisoning Attacks on 3D Gaussian Splatting for Targeted Scene Confusion

    Authors: Jiaxin Hong, Sixu Chen, Shuoyang Sun, Hongyao Yu, Hao Fang, Yuqi Tan, Bin Chen, Shuhan Qi, Jiawei Li

    Abstract: As 3D Gaussian Splatting (3DGS) emerges as a breakthrough in scene representation and novel view synthesis, its rapid adoption in safety-critical domains (e.g., autonomous systems, AR/VR) urgently demands scrutiny of potential security vulnerabilities. This paper presents the first systematic study of backdoor threats in 3DGS pipelines. We identify that adversaries may implant backdoor views to in…

    Submitted 29 April, 2025; originally announced April 2025.

  16. arXiv:2504.20795  [pdf, other]

    cs.DS

    Effective Index Construction Algorithm for Optimal $(k,η)$-cores Computation

    Authors: Shengli Sun, Peng Xu, Guanming Jiang, Philip S. Yu, Yi Li

    Abstract: Computing $(k,η)$-cores from uncertain graphs is a fundamental problem in uncertain graph analysis. UCF-Index is the state-of-the-art solution to support $(k,η)$-core queries, allowing the $(k,η)$-core for any combination of $k$ and $η$ to be computed in optimal time. However, the index constructed by the current algorithm is usually incorrect. During decomposition, the key is to obtain the $k$-…

    Submitted 29 April, 2025; originally announced April 2025.

  17. arXiv:2504.20674  [pdf, other]

    cs.CE

    DiffLiB: High-fidelity differentiable modeling of lithium-ion batteries and efficient gradient-based parameter identification

    Authors: Weipeng Xu, Kaiqi Yang, Yuzhi Zhang, Shichao Sun, Sheng Mao, Tianju Xue

    Abstract: The physics-based Doyle-Fuller-Newman (DFN) model, widely adopted for its precise electrochemical modeling, stands out among various simulation models of lithium-ion batteries (LIBs). Although the DFN model is powerful in forward predictive analysis, the inverse identification of its model parameters has remained a long-standing challenge. The numerous unknown parameters associated with the nonlin…

    Submitted 29 April, 2025; originally announced April 2025.

  18. arXiv:2504.20496  [pdf, other]

    cs.CV

    Large-scale visual SLAM for in-the-wild videos

    Authors: Shuo Sun, Torsten Sattler, Malcolm Mielle, Achim J. Lilienthal, Martin Magnusson

    Abstract: Accurate and robust 3D scene reconstruction from casual, in-the-wild videos can significantly simplify robot deployment to new environments. However, reliable camera pose estimation and scene reconstruction from such unconstrained videos remains an open challenge. Existing visual-only SLAM methods perform well on benchmark datasets but struggle with real-world footage which often exhibits uncontro…

    Submitted 29 April, 2025; originally announced April 2025.

    Comments: fix the overview figure

  19. arXiv:2504.20106  [pdf, other]

    cs.LG cs.AI

    Adaptive Helpfulness-Harmlessness Alignment with Preference Vectors

    Authors: Ren-Wei Liang, Chin-Ting Hsu, Chan-Hung Yu, Saransh Agrawal, Shih-Cheng Huang, Shang-Tse Chen, Kuan-Hao Huang, Shao-Hua Sun

    Abstract: Ensuring that large language models (LLMs) are both helpful and harmless is a critical challenge, as overly strict constraints can lead to excessive refusals, while permissive models risk generating harmful content. Existing approaches, such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO), attempt to balance these trade-offs but suffer from performance…

    Submitted 27 April, 2025; originally announced April 2025.

    Comments: 22 pages, 5 figures, 9 tables

  20. arXiv:2504.20097  [pdf, other]

    cs.CV quant-ph

    Long-Distance Field Demonstration of Imaging-Free Drone Identification in Intracity Environments

    Authors: Junran Guo, Tonglin Mu, Keyuan Li, Jianing Li, Ziyang Luo, Ye Chen, Xiaodong Fan, Jinquan Huang, Minjie Liu, Jinbei Zhang, Ruoyang Qi, Naiting Gu, Shihai Sun

    Abstract: Detecting small objects, such as drones, over long distances presents a significant challenge with broad implications for security, surveillance, environmental monitoring, and autonomous systems. Traditional imaging-based methods rely on high-resolution image acquisition, but are often constrained by range, power consumption, and cost. In contrast, data-driven single-photon-single-pixel light dete…

    Submitted 26 April, 2025; originally announced April 2025.

    Comments: 15 pages, 9 figures

  21. arXiv:2504.19507  [pdf, other]

    cs.IT

    From Freshness to Effectiveness: Goal-Oriented Sampling for Remote Decision Making

    Authors: Aimin Li, Shaohua Wu, Gary C. F. Lee, Sumei Sun

    Abstract: Data freshness, measured by Age of Information (AoI), is highly relevant in networked applications such as Vehicle to Everything (V2X), smart health systems, and Industrial Internet of Things (IIoT). Yet, freshness alone does not equate to informativeness. In decision-critical settings, some stale data may prove more valuable than fresh updates. To explore this nuance, we move beyond AoI-centric p…

    Submitted 5 May, 2025; v1 submitted 28 April, 2025; originally announced April 2025.

    Comments: 35 pages. Submitted to the IEEE Transactions on Information Theory

  22. arXiv:2504.19086  [pdf, other]

    cs.CV

    Boosting Single-domain Generalized Object Detection via Vision-Language Knowledge Interaction

    Authors: Xiaoran Xu, Jiangang Yang, Wenyue Chong, Wenhui Shi, Shichu Sun, Jing Xing, Jian Liu

    Abstract: Single-Domain Generalized Object Detection (S-DGOD) aims to train an object detector on a single source domain while generalizing well to diverse unseen target domains, making it suitable for multimedia applications that involve various domain shifts, such as intelligent video surveillance and VR/AR technologies. With the success of large-scale Vision-Language Models, recent S-DGOD approaches expl…

    Submitted 26 April, 2025; originally announced April 2025.

  23. arXiv:2504.18012  [pdf, other]

    cs.CL cs.AI

    Memory Reviving, Continuing Learning and Beyond: Evaluation of Pre-trained Encoders and Decoders for Multimodal Machine Translation

    Authors: Zhuang Yu, Shiliang Sun, Jing Zhao, Tengfei Song, Hao Yang

    Abstract: Multimodal Machine Translation (MMT) aims to improve translation quality by leveraging auxiliary modalities such as images alongside textual input. While recent advances in large-scale pre-trained language and vision models have significantly benefited unimodal natural language processing tasks, their effectiveness and role in MMT remain underexplored. In this work, we conduct a systematic study o…

    Submitted 24 April, 2025; originally announced April 2025.

  24. arXiv:2504.14858  [pdf, other]

    cs.AI cs.CL

    AlignRAG: An Adaptable Framework for Resolving Misalignments in Retrieval-Aware Reasoning of RAG

    Authors: Jiaqi Wei, Hao Zhou, Xiang Zhang, Di Zhang, Zijie Qiu, Wei Wei, Jinzhe Li, Wanli Ouyang, Siqi Sun

    Abstract: Retrieval-augmented generation (RAG) has emerged as a foundational paradigm for knowledge-grounded text generation. However, existing RAG pipelines often fail to ensure that the reasoning trajectories align with the evidential constraints imposed by retrieved content. In this paper, we reframe RAG as a problem of retrieval-aware reasoning and identify a core challenge: reasoning misalignment-the m…

    Submitted 21 April, 2025; originally announced April 2025.

  25. arXiv:2504.14489  [pdf, other]

    cs.OS

    Optimizing SLO-oriented LLM Serving with PD-Multiplexing

    Authors: Weihao Cui, Yukang Chen, Han Zhao, Ziyi Xu, Quan Chen, Xusheng Chen, Yangjie Zhou, Shixuan Sun, Minyi Guo

    Abstract: Modern LLM services demand high throughput and stringent SLO guarantees across two distinct inference phases-prefill and decode-and complex multi-turn workflows. However, current systems face a fundamental tradeoff: out-of-place compute partition enables per-phase SLO attainment, while in-place memory sharing maximizes throughput via KV cache reuse. Moreover, existing in-place compute partition al…

    Submitted 22 April, 2025; v1 submitted 20 April, 2025; originally announced April 2025.

  26. arXiv:2504.13945  [pdf, other]

    cs.LG cs.AI

    Evaluating Menu OCR and Translation: A Benchmark for Aligning Human and Automated Evaluations in Large Vision-Language Models

    Authors: Zhanglin Wu, Tengfei Song, Ning Xie, Mengli Zhu, Weidong Zhang, Shuang Wu, Pengfei Li, Chong Li, Junhao Zhu, Hao Yang, Shiliang Sun

    Abstract: The rapid advancement of large vision-language models (LVLMs) has significantly propelled applications in document understanding, particularly in optical character recognition (OCR) and multilingual translation. However, current evaluations of LVLMs, like the widely used OCRBench, mainly focus on verifying the correctness of their short-text responses and long-text responses with simple layout, wh…

    Submitted 23 April, 2025; v1 submitted 15 April, 2025; originally announced April 2025.

    Comments: 12 pages, 5 figures, 5 Tables

  27. arXiv:2504.13102  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    A Multi-task Learning Balanced Attention Convolutional Neural Network Model for Few-shot Underwater Acoustic Target Recognition

    Authors: Wei Huang, Shumeng Sun, Junpeng Lu, Zhenpeng Xu, Zhengyang Xiu, Hao Zhang

    Abstract: Underwater acoustic target recognition (UATR) is of great significance for the protection of marine diversity and national defense security. The development of deep learning provides new opportunities for UATR, but faces challenges brought by the scarcity of reference samples and complex environmental interference. To address these issues, we propose a multi-task balanced channel attention convol…

    Submitted 17 April, 2025; originally announced April 2025.

  28. arXiv:2504.11839  [pdf, other]

    cs.SE

    "Good" and "Bad" Failures in Industrial CI/CD -- Balancing Cost and Quality Assurance

    Authors: Simin Sun, David Friberg, Miroslaw Staron

    Abstract: Continuous Integration and Continuous Deployment (CI/CD) pipelines automate software development to speed up and enhance the efficiency of engineering software. These workflows consist of various jobs, such as code validation and testing, which developers must wait to complete before receiving feedback. The jobs can fail, which leads to unnecessary delays in build times, decreasing productivity fo…

    Submitted 16 April, 2025; originally announced April 2025.

    Comments: 5 pages, 2 figures

  29. arXiv:2504.11346  [pdf, other]

    cs.CV

    Seedream 3.0 Technical Report

    Authors: Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xuanda Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao, Zhonghua Zhai, et al. (6 additional authors not shown)

    Abstract: We present Seedream 3.0, a high-performance Chinese-English bilingual image generation foundation model. We develop several technical improvements to address existing challenges in Seedream 2.0, including alignment with complicated prompts, fine-grained typography generation, suboptimal visual aesthetics and fidelity, and limited image resolutions. Specifically, the advancements of Seedream 3.0 st…

    Submitted 16 April, 2025; v1 submitted 15 April, 2025; originally announced April 2025.

    Comments: Seedream 3.0 Technical Report

  30. arXiv:2504.09103  [pdf, other]

    cs.RO

    IMPACT: Behavioral Intention-aware Multimodal Trajectory Prediction with Adaptive Context Trimming

    Authors: Jiawei Sun, Xibin Yue, Jiahui Li, Tianle Shen, Chengran Yuan, Shuo Sun, Sheng Guo, Quanyun Zhou, Marcelo H Ang Jr

    Abstract: While most prior research has focused on improving the precision of multimodal trajectory predictions, the explicit modeling of multimodal behavioral intentions (e.g., yielding, overtaking) remains relatively underexplored. This paper proposes a unified framework that jointly predicts both behavioral intentions and trajectories to enhance prediction accuracy, interpretability, and efficiency. Spec…

    Submitted 12 April, 2025; originally announced April 2025.

    Comments: under review

  31. arXiv:2504.09069  [pdf, other]

    cs.CV

    UniFlowRestore: A General Video Restoration Framework via Flow Matching and Prompt Guidance

    Authors: Shuning Sun, Yu Zhang, Chen Wu, Dianjie Lu, Dianjie Lu, Guijuan Zhan, Yang Weng, Zhuoran Zheng

    Abstract: Video imaging is often affected by complex degradations such as blur, noise, and compression artifacts. Traditional restoration methods follow a "single-task single-model" paradigm, resulting in poor generalization and high computational cost, limiting their applicability in real-world scenarios with diverse degradation types. We propose UniFlowRestore, a general video restoration framework that m…

    Submitted 12 April, 2025; originally announced April 2025.

  32. arXiv:2504.08719  [pdf, other]

    cs.CL

    SWAN-GPT: An Efficient and Scalable Approach for Long-Context Language Modeling

    Authors: Krishna C. Puvvada, Faisal Ladhak, Santiago Akle Serrano, Cheng-Ping Hsieh, Shantanu Acharya, Somshubra Majumdar, Fei Jia, Samuel Kriman, Simeng Sun, Dima Rekesh, Boris Ginsburg

    Abstract: We present a decoder-only Transformer architecture that robustly generalizes to sequence lengths substantially longer than those seen during training. Our model, SWAN-GPT, interleaves layers without positional encodings (NoPE) and sliding-window attention layers equipped with rotary positional encodings (SWA-RoPE). Experiments demonstrate strong performance on sequence lengths significantly longer…

    Submitted 11 April, 2025; originally announced April 2025.

  33. arXiv:2504.06141  [pdf, other]

    cs.LG

    Adversarial Training of Reward Models

    Authors: Alexander Bukharin, Haifeng Qian, Shengyang Sun, Adithya Renduchintala, Soumye Singhal, Zhilin Wang, Oleksii Kuchaiev, Olivier Delalleau, Tuo Zhao

    Abstract: Reward modeling has emerged as a promising approach for the scalable alignment of language models. However, contemporary reward models (RMs) often lack robustness, awarding high rewards to low-quality, out-of-distribution (OOD) samples. This can lead to reward hacking, where policies exploit unintended shortcuts to maximize rewards, undermining alignment. To address this challenge, we introduce Ad…

    Submitted 11 April, 2025; v1 submitted 8 April, 2025; originally announced April 2025.

    Comments: 16 pages, 7 figures

  34. arXiv:2504.05629  [pdf]

    cs.RO

    PTRL: Prior Transfer Deep Reinforcement Learning for Legged Robots Locomotion

    Authors: Haodong Huang, Shilong Sun, Zida Zhao, Hailin Huang, Changqing Shen, Wenfu Xu

    Abstract: In the field of legged robot motion control, reinforcement learning (RL) holds great promise but faces two major challenges: high computational cost for training individual robots and poor generalization of trained models. To address these problems, this paper proposes a novel framework called Prior Transfer Reinforcement Learning (PTRL), which improves both training efficiency and model transfera…

    Submitted 7 April, 2025; originally announced April 2025.

  35. arXiv:2504.04164  [pdf, other]

    cs.LG

    MInCo: Mitigating Information Conflicts in Distracted Visual Model-based Reinforcement Learning

    Authors: Shiguang Sun, Hanbo Zhang, Zeyang Liu, Xinrui Yang, Lipeng Wan, Bing Yan, Xingyu Chen, Xuguang Lan

    Abstract: Existing visual model-based reinforcement learning (MBRL) algorithms with observation reconstruction often suffer from information conflicts, making it difficult to learn compact representations and hence resulting in less robust policies, especially in the presence of task-irrelevant visual distractions. In this paper, we first reveal that the information conflicts in current visual MBRL algorithms…

    Submitted 5 April, 2025; originally announced April 2025.

  36. arXiv:2504.02061  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    Aligned Better, Listen Better for Audio-Visual Large Language Models

    Authors: Yuxin Guo, Shuailei Ma, Shijie Ma, Xiaoyi Bao, Chen-Wei Xie, Kecheng Zheng, Tingyu Weng, Siyang Sun, Yun Zheng, Wei Zou

    Abstract: Audio is essential for multimodal video understanding. On the one hand, video inherently contains audio, which supplies complementary information to vision. Besides, video large language models (Video-LLMs) can encounter many audio-centric settings. However, existing Video-LLMs and Audio-Visual Large Language Models (AV-LLMs) exhibit deficiencies in exploiting audio information, leading to weak un…

    Submitted 2 April, 2025; originally announced April 2025.

    Comments: Accepted to ICLR 2025

  37. arXiv:2504.00264  [pdf, other]

    eess.IV cs.CV stat.ML

    DiffDenoise: Self-Supervised Medical Image Denoising with Conditional Diffusion Models

    Authors: Basar Demir, Yikang Liu, Xiao Chen, Eric Z. Chen, Lin Zhao, Boris Mailhe, Terrence Chen, Shanhui Sun

    Abstract: Many self-supervised denoising approaches have been proposed in recent years. However, these methods tend to overly smooth images, resulting in the loss of fine structures that are essential for medical applications. In this paper, we propose DiffDenoise, a powerful self-supervised denoising approach tailored for medical images, designed to preserve high-frequency details. Our approach comprises t…

    Submitted 31 March, 2025; originally announced April 2025.

  38. arXiv:2504.00191  [pdf, other]

    cs.CV

    Leveraging Diffusion Model and Image Foundation Model for Improved Correspondence Matching in Coronary Angiography

    Authors: Lin Zhao, Xin Yu, Yikang Liu, Xiao Chen, Eric Z. Chen, Terrence Chen, Shanhui Sun

    Abstract: Accurate correspondence matching in coronary angiography images is crucial for reconstructing 3D coronary artery structures, which is essential for precise diagnosis and treatment planning of coronary artery disease (CAD). Traditional matching methods for natural images often fail to generalize to X-ray images due to inherent differences such as lack of texture, lower contrast, and overlapping str…

    Submitted 31 March, 2025; originally announced April 2025.

  39. arXiv:2503.24368  [pdf, other]

    cs.CV

    Adapting Vision Foundation Models for Real-time Ultrasound Image Segmentation

    Authors: Xiaoran Zhang, Eric Z. Chen, Lin Zhao, Xiao Chen, Yikang Liu, Boris Maihe, James S. Duncan, Terrence Chen, Shanhui Sun

    Abstract: We propose a novel approach that adapts hierarchical vision foundation models for real-time ultrasound image segmentation. Existing ultrasound segmentation methods often struggle with adaptability to new tasks, relying on costly manual annotations, while real-time approaches generally fail to match state-of-the-art performance. To overcome these limitations, we introduce an adaptive framework that…

    Submitted 31 March, 2025; originally announced March 2025.

  40. arXiv:2503.23975  [pdf, other]

    cs.RO

    A Reactive Framework for Whole-Body Motion Planning of Mobile Manipulators Combining Reinforcement Learning and SDF-Constrained Quadratic Programmi

    Authors: Chenyu Zhang, Shiying Sun, Kuan Liu, Chuanbao Zhou, Xiaoguang Zhao, Min Tan, Yanlong Huang

    Abstract: As an important branch of embodied artificial intelligence, mobile manipulators are increasingly applied in intelligent services, but their redundant degrees of freedom also limit efficient motion planning in cluttered environments. To address this issue, this paper proposes a hybrid learning and optimization framework for reactive whole-body motion planning of mobile manipulators. We develop the…

    Submitted 31 March, 2025; originally announced March 2025.

  41. arXiv:2503.23452  [pdf, other]

    cs.CV

    VideoGen-Eval: Agent-based System for Video Generation Evaluation

    Authors: Yuhang Yang, Ke Fan, Shangkun Sun, Hongxiang Li, Ailing Zeng, FeiLin Han, Wei Zhai, Wei Liu, Yang Cao, Zheng-Jun Zha

    Abstract: The rapid advancement of video generation has rendered existing evaluation systems inadequate for assessing state-of-the-art models, primarily due to simple prompts that cannot showcase the model's capabilities, fixed evaluation operators struggling with Out-of-Distribution (OOD) cases, and misalignment between computed metrics and human preferences. To bridge the gap, we propose VideoGen-Eval, an…

    Submitted 26 April, 2025; v1 submitted 30 March, 2025; originally announced March 2025.

    Comments: project: https://github.com/AILab-CVC/VideoGen-Eval

  42. arXiv:2503.22832  [pdf, other]

    cs.PL cs.CL

    L0-Reasoning Bench: Evaluating Procedural Correctness in Language Models via Simple Program Execution

    Authors: Simeng Sun, Cheng-Ping Hsieh, Faisal Ladhak, Erik Arakelyan, Santiago Akle Serano, Boris Ginsburg

    Abstract: Complex reasoning tasks often rely on the ability to consistently and accurately apply simple rules across incremental steps, a foundational capability which we term "level-0" reasoning. To systematically evaluate this capability, we introduce L0-Bench, a language model benchmark for testing procedural correctness -- the ability to generate correct reasoning processes, complementing existing bench…

    Submitted 10 April, 2025; v1 submitted 28 March, 2025; originally announced March 2025.

  43. arXiv:2503.20314  [pdf, other]

    cs.CV

    Wan: Open and Advanced Large-Scale Video Generative Models

    Authors: Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, et al. (37 additional authors not shown)

    Abstract: This report presents Wan, a comprehensive and open suite of video foundation models designed to push the boundaries of video generation. Built upon the mainstream diffusion transformer paradigm, Wan achieves significant advancements in generative capabilities through a series of innovations, including our novel VAE, scalable pre-training strategies, large-scale data curation, and automated evaluat…

    Submitted 18 April, 2025; v1 submitted 26 March, 2025; originally announced March 2025.

    Comments: 60 pages, 33 figures

  44. arXiv:2503.19584  [pdf, other]

    cs.AI cs.CL cs.SE

    Multi-agent Application System in Office Collaboration Scenarios

    Authors: Songtao Sun, Jingyi Li, Yuanfei Dong, Haoguang Liu, Chenxin Xu, Fuyang Li, Qiang Liu

    Abstract: This paper introduces a multi-agent application system designed to enhance office collaboration efficiency and work quality. The system integrates artificial intelligence, machine learning, and natural language processing technologies, achieving functionalities such as task allocation, progress monitoring, and information sharing. The agents within the system are capable of providing personalized…

    Submitted 7 April, 2025; v1 submitted 25 March, 2025; originally announced March 2025.

    Comments: Technical report

  45. arXiv:2503.17862  [pdf, other]

    cs.CV cs.AI

    A Causal Adjustment Module for Debiasing Scene Graph Generation

    Authors: Li Liu, Shuzhou Sun, Shuaifeng Zhi, Fan Shi, Zhen Liu, Janne Heikkilä, Yongxiang Liu

    Abstract: While recent debiasing methods for Scene Graph Generation (SGG) have shown impressive performance, these efforts often attribute model bias solely to the long-tail distribution of relationships, overlooking the more profound causes stemming from skewed object and object pair distributions. In this paper, we employ causal inference techniques to model the causality among these observed skewed distr…

    Submitted 22 March, 2025; originally announced March 2025.

    Comments: 18 pages, 8 tables, 10 figures

  46. arXiv:2503.13081  [pdf]

    cs.CL cs.AI

    A Framework to Assess Multilingual Vulnerabilities of LLMs

    Authors: Likai Tang, Niruth Bogahawatta, Yasod Ginige, Jiarui Xu, Shixuan Sun, Surangika Ranathunga, Suranga Seneviratne

    Abstract: Large Language Models (LLMs) are acquiring a wider range of capabilities, including understanding and responding in multiple languages. While they undergo safety training to prevent them from answering illegal questions, imbalances in training data and human evaluation resources can make these models more susceptible to attacks in low-resource languages (LRL). This paper proposes a framework to au…

    Submitted 17 March, 2025; originally announced March 2025.

  47. arXiv:2503.12932  [pdf, other]

    cs.LG

    Efficient Action-Constrained Reinforcement Learning via Acceptance-Rejection Method and Augmented MDPs

    Authors: Wei Hung, Shao-Hua Sun, Ping-Chun Hsieh

    Abstract: Action-constrained reinforcement learning (ACRL) is a generic framework for learning control policies with zero action constraint violation, which is required by various safety-critical and resource-constrained applications. The existing ACRL methods can typically achieve favorable constraint satisfaction but at the cost of either high computational burden incurred by the quadratic programs (QP) o…

    Submitted 17 March, 2025; originally announced March 2025.

    Comments: 23 pages, 14 figures. Accepted at ICLR 2025

    ACM Class: I.2.6; I.5.1

  48. arXiv:2503.11030  [pdf, other]

    cs.CV cs.AI

    FMNet: Frequency-Assisted Mamba-Like Linear Attention Network for Camouflaged Object Detection

    Authors: Ming Deng, Sijin Sun, Zihao Li, Xiaochuan Hu, Xing Wu

    Abstract: Camouflaged Object Detection (COD) is challenging due to the strong similarity between camouflaged objects and their surroundings, which complicates identification. Existing methods mainly rely on spatial local features, failing to capture global information, while Transformers increase computational costs. To address this, the Frequency-Assisted Mamba-Like Linear Attention Network (FMNet) is propo…

    Submitted 13 March, 2025; originally announced March 2025.

  49. arXiv:2503.10382  [pdf, other]

    cs.LG

    Subgroup Performance Analysis in Hidden Stratifications

    Authors: Alceu Bissoto, Trung-Dung Hoang, Tim Flühmann, Susu Sun, Christian F. Baumgartner, Lisa M. Koch

    Abstract: Machine learning (ML) models may suffer from significant performance disparities between patient groups. Identifying such disparities by monitoring performance at a granular level is crucial for safely deploying ML to each patient. Traditional subgroup analysis based on metadata can expose performance disparities only if the available metadata (e.g., patient sex) sufficiently reflects the main rea…

    Submitted 13 March, 2025; originally announced March 2025.

    Comments: Under review

  50. arXiv:2503.08384  [pdf, other]

    cs.CV

    Prototype-Based Multiple Instance Learning for Gigapixel Whole Slide Image Classification

    Authors: Susu Sun, Dominique van Midden, Geert Litjens, Christian F. Baumgartner

    Abstract: Multiple Instance Learning (MIL) methods have succeeded remarkably in histopathology whole slide image (WSI) analysis. However, most MIL models only offer attention-based explanations that do not faithfully capture the model's decision mechanism and do not allow human-model interaction. To address these limitations, we introduce ProtoMIL, an inherently interpretable MIL model for WSI analysis that…

    Submitted 11 March, 2025; originally announced March 2025.