Showing 1–50 of 1,843 results for author: Sun, J

Searching in archive cs.
  1. arXiv:2507.05754  [pdf, ps, other]

    cs.RO cs.AI

    LeAD: The LLM Enhanced Planning System Converged with End-to-end Autonomous Driving

    Authors: Yuhang Zhang, Jiaqi Liu, Chengkai Xu, Peng Hang, Jian Sun

    Abstract: A principal barrier to large-scale deployment of urban autonomous driving systems lies in the prevalence of complex scenarios and edge cases. Existing systems fail to effectively interpret semantic information within traffic contexts and discern intentions of other participants, consequently generating decisions misaligned with skilled drivers' reasoning patterns. We present LeAD, a dual-rate auto…

    Submitted 8 July, 2025; originally announced July 2025.

  2. Breaking the Plane: Exploring Real-Time Visualization of 3D Surfaces in Augmented Reality with Handwritten Input

    Authors: Liam Franco Esparraguera, Kristoffer Selberg, Brian Lou, Jenny Sun, Beza Desta, Andrés Monroy-Hernández, Parastoo Abtahi

    Abstract: We introduce Breaking the Plane, an augmented reality (AR) application built for AR headsets that enables users to visualize 3D mathematical functions using handwritten input. Researchers have demonstrated that overlaying 3D visualizations of mathematical concepts through AR enhances learning motivation and comprehension, and equation parsing makes the authoring of teaching materials more time-efficien…

    Submitted 7 July, 2025; originally announced July 2025.

    Comments: In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, pp. 1-9. 2024

  3. arXiv:2507.05533  [pdf, ps, other]

    cs.LG

    Theoretical Learning Performance of Graph Neural Networks: The Impact of Jumping Connections and Layer-wise Sparsification

    Authors: Jiawei Sun, Hongkang Li, Meng Wang

    Abstract: Jumping connections enable Graph Convolutional Networks (GCNs) to overcome over-smoothing, while graph sparsification reduces computational demands by selecting a sub-matrix of the graph adjacency matrix during neighborhood aggregation. Learning GCNs with graph sparsification has shown empirical success across various applications, but a theoretical understanding of the generalization guarantees r…

    Submitted 7 July, 2025; originally announced July 2025.

    Comments: TMLR

  4. arXiv:2507.05255  [pdf, ps, other]

    cs.CV cs.CL

    Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning

    Authors: Yana Wei, Liang Zhao, Jianjian Sun, Kangheng Lin, Jisheng Yin, Jingcheng Hu, Yinmin Zhang, En Yu, Haoran Lv, Zejia Weng, Jia Wang, Chunrui Han, Yuang Peng, Qi Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Vishal M. Patel

    Abstract: The remarkable reasoning capability of large language models (LLMs) stems from cognitive behaviors that emerge through reinforcement with verifiable rewards. This work investigates how to transfer this principle to Multimodal LLMs (MLLMs) to unlock advanced visual reasoning. We introduce a two-stage paradigm built on Qwen2.5-VL-7B: a massive linguistic cold-start fine-tuning, followed by multimoda…

    Submitted 7 July, 2025; originally announced July 2025.

  5. arXiv:2507.04258  [pdf, ps, other]

    cs.CV

    MoReMouse: Monocular Reconstruction of Laboratory Mouse

    Authors: Yuan Zhong, Jingxiang Sun, Liang An, Yebin Liu

    Abstract: Laboratory mice play a crucial role in biomedical research, yet accurate 3D mouse surface motion reconstruction remains challenging due to their complex non-rigid geometric deformations and textureless appearance. Moreover, the absence of structured 3D datasets severely hinders the progress beyond sparse keypoint tracking. To narrow the gap, we present MoReMouse, the first monocular dense 3D recon…

    Submitted 6 July, 2025; originally announced July 2025.

  6. arXiv:2507.04060  [pdf, ps, other]

    cs.CV cs.AI

    Temporal Continual Learning with Prior Compensation for Human Motion Prediction

    Authors: Jianwei Tang, Jiangxin Sun, Xiaotong Lin, Lifang Zhang, Wei-Shi Zheng, Jian-Fang Hu

    Abstract: Human Motion Prediction (HMP) aims to predict future poses at different moments according to past motion sequences. Previous approaches have treated the prediction of various moments equally, resulting in two main limitations: the learning of short-term predictions is hindered by the focus on long-term predictions, and the incorporation of prior information from past predictions into subsequent pr…

    Submitted 5 July, 2025; originally announced July 2025.

    Comments: Advances in Neural Information Processing Systems 2023

    Journal ref: Advances in Neural Information Processing Systems, 2023, 36: 65837-65849

  7. arXiv:2507.03836  [pdf, ps, other]

    cs.GR cs.CV

    F-Hash: Feature-Based Hash Design for Time-Varying Volume Visualization via Multi-Resolution Tesseract Encoding

    Authors: Jianxin Sun, David Lenz, Hongfeng Yu, Tom Peterka

    Abstract: Interactive time-varying volume visualization is challenging due to its complex spatiotemporal features and sheer size of the dataset. Recent works transform the original discrete time-varying volumetric data into continuous Implicit Neural Representations (INR) to address the issues of compression, rendering, and super-resolution in both spatial and temporal domains. However, training the INR tak…

    Submitted 4 July, 2025; originally announced July 2025.

  8. arXiv:2507.03578  [pdf, ps, other]

    cs.CV cs.AI cs.LG

    SciVid: Cross-Domain Evaluation of Video Models in Scientific Applications

    Authors: Yana Hasson, Pauline Luc, Liliane Momeni, Maks Ovsjanikov, Guillaume Le Moing, Alina Kuznetsova, Ira Ktena, Jennifer J. Sun, Skanda Koppula, Dilara Gokay, Joseph Heyward, Etienne Pot, Andrew Zisserman

    Abstract: In recent years, there has been a proliferation of spatiotemporal foundation models in different scientific disciplines. While promising, these models are often domain-specific and are only assessed within the particular applications for which they are designed. Given that many tasks can be represented as video modeling problems, video foundation models (ViFMs) hold considerable promise as general…

    Submitted 4 July, 2025; originally announced July 2025.

    Comments: ICCV 2025, GitHub repo: https://github.com/google-deepmind/scivid

  9. arXiv:2507.03542  [pdf, ps, other]

    cs.CV

    Beyond Accuracy: Metrics that Uncover What Makes a 'Good' Visual Descriptor

    Authors: Ethan Lin, Linxi Zhao, Atharva Sehgal, Jennifer J. Sun

    Abstract: Text-based visual descriptors, ranging from simple class names to more descriptive phrases, are widely used in visual concept discovery and image classification with vision-language models (VLMs). Their effectiveness, however, depends on a complex interplay of factors, including semantic clarity, presence in the VLM's pre-training data, and how well the descriptors serve as a meaningful representati…

    Submitted 4 July, 2025; originally announced July 2025.

    Comments: VisCon @ CVPR 2025

  10. arXiv:2507.02951  [pdf, ps, other]

    cs.CR cs.AI

    Bittensor Protocol: The Bitcoin in Decentralized Artificial Intelligence? A Critical and Empirical Analysis

    Authors: Elizabeth Lui, Jiahao Sun

    Abstract: This paper investigates whether Bittensor can be considered the Bitcoin of decentralized Artificial Intelligence by directly comparing its tokenomics, decentralization properties, consensus mechanism, and incentive structure against those of Bitcoin. Leveraging on-chain data from all 64 active Bittensor subnets, we first document considerable concentration in both stake and rewards. We further sho…

    Submitted 29 June, 2025; originally announced July 2025.

    Comments: MARBLE 2025

  11. arXiv:2507.02822  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model

    Authors: Wencheng Zhang, Shiqin Qiao, Lingjie Luo, Yinfeng Li, Chuanyang Zheng, Qian Xu, Meng Li, Yong Gui, Yijun He, Jianing Qiu, Jindong Hong, Jiankai Sun

    Abstract: With the widespread adoption of large language models (LLMs) in practical applications, selecting an appropriate model requires balancing not only performance but also operational cost. The emergence of reasoning-capable models has further widened the cost gap between "thinking" (high reasoning) and "non-thinking" (fast, low-cost) modes. In this work, we reveal that approximately 58% of medical qu…

    Submitted 3 July, 2025; originally announced July 2025.

  12. Evaluating Pavement Deterioration Rates Due to Flooding Events Using Explainable AI

    Authors: Lidan Peng, Lu Gao, Feng Hong, Jingran Sun

    Abstract: Flooding can damage pavement infrastructure significantly, causing both immediate and long-term structural and functional issues. This research investigates how flooding events affect pavement deterioration, specifically focusing on measuring pavement roughness by the International Roughness Index (IRI). To quantify these effects, we utilized 20 years of pavement condition data from TxDOT's PMIS d…

    Submitted 28 June, 2025; originally announced July 2025.

    Journal ref: Buildings 15.9 (2025): 1452

  13. arXiv:2507.00066  [pdf, other]

    cs.HC cs.AI

    InSight-R: A Framework for Risk-informed Human Failure Event Identification and Interface-Induced Risk Assessment Driven by AutoGraph

    Authors: Xingyu Xiao, Jiejuan Tong, Peng Chen, Jun Sun, Zhe Sui, Jingang Liang, Hongru Zhao, Jun Zhao, Haitao Wang

    Abstract: Human reliability remains a critical concern in safety-critical domains such as nuclear power, where operational failures are often linked to human error. While conventional human reliability analysis (HRA) methods have been widely adopted, they rely heavily on expert judgment for identifying human failure events (HFEs) and assigning performance influencing factors (PIFs). This reliance introduces…

    Submitted 27 June, 2025; originally announced July 2025.

  14. arXiv:2506.23322  [pdf, ps, other]

    cs.DB cs.AI cs.CL cs.IR

    GaussMaster: An LLM-based Database Copilot System

    Authors: Wei Zhou, Ji Sun, Xuanhe Zhou, Guoliang Li, Luyang Liu, Hao Wu, Tianyuan Wang

    Abstract: In the financial industry, data is the lifeblood of operations, and DBAs shoulder significant responsibilities for SQL tuning, database deployment, diagnosis, and service repair. In recent years, both database vendors and customers have increasingly turned to autonomous database platforms in an effort to alleviate the heavy workload of DBAs. However, existing autonomous database platforms are limi…

    Submitted 29 June, 2025; originally announced June 2025.

    Comments: We welcome contributions from the community. For reference, please see the code at: https://gitcode.com/opengauss/openGauss-GaussMaster

  15. arXiv:2506.23126  [pdf, ps, other]

    cs.RO

    ParticleFormer: A 3D Point Cloud World Model for Multi-Object, Multi-Material Robotic Manipulation

    Authors: Suning Huang, Qianzhong Chen, Xiaohan Zhang, Jiankai Sun, Mac Schwager

    Abstract: 3D world models (i.e., learning-based 3D dynamics models) offer a promising approach to generalizable robotic manipulation by capturing the underlying physics of environment evolution conditioned on robot actions. However, existing 3D world models are primarily limited to single-material dynamics using a particle-based Graph Neural Network model, and often require time-consuming 3D scene reconstru…

    Submitted 4 July, 2025; v1 submitted 29 June, 2025; originally announced June 2025.

  16. arXiv:2506.22385  [pdf, ps, other]

    cs.CV cs.AI cs.CL

    Can Video Large Multimodal Models Think Like Doubters-or Double-Down: A Study on Defeasible Video Entailment

    Authors: Yue Zhang, Jilei Sun, Yunhui Guo, Vibhav Gogate

    Abstract: Video Large Multimodal Models (VLMMs) have made impressive strides in understanding video content, but they often struggle with abstract and adaptive reasoning, that is, the ability to revise their interpretations when new information emerges. In reality, conclusions are rarely set in stone; additional context can strengthen or weaken an initial inference. To address this, we introduce Defeasible Video Enta…

    Submitted 27 June, 2025; originally announced June 2025.

  17. arXiv:2506.21932  [pdf, ps, other]

    math.NA cs.CE cs.PF

    StructMG: A Fast and Scalable Structured Algebraic Multigrid

    Authors: Yi Zong, Peinan Yu, Haopeng Huang, Zhengding Hu, Xinliang Wang, Qin Wang, Chensong Zhang, Xiaowen Xu, Jian Sun, Yongxiao Zhou, Wei Xue

    Abstract: Parallel multigrid is widely used as a preconditioner in solving large-scale sparse linear systems. However, the current multigrid library still lacks satisfactory performance for structured grid problems in terms of speed and scalability. Based on the classical 'multigrid seesaw', we derive three necessary principles for an efficient structured multigrid, which guide our design and implemen…

    Submitted 27 June, 2025; originally announced June 2025.

  18. arXiv:2506.20123  [pdf, ps, other]

    cs.CE

    DiT-SGCR: Directed Temporal Structural Representation with Global-Cluster Awareness for Ethereum Malicious Account Detection

    Authors: Ye Tian, Liangliang Song, Peng Qian, Yanbin Wang, Jianguo Sun, Yifan Jia

    Abstract: The detection of malicious accounts on Ethereum - the preeminent DeFi platform - is critical for protecting digital assets and maintaining trust in decentralized finance. Recent advances highlight that temporal transaction evolution reveals more attack signatures than static graphs. However, current methods either fail to model continuous transaction dynamics or incur high computational costs that…

    Submitted 25 June, 2025; originally announced June 2025.

  19. arXiv:2506.19935  [pdf, ps, other]

    cs.LG cs.CV stat.ML

    Any-Order GPT as Masked Diffusion Model: Decoupling Formulation and Architecture

    Authors: Shuchen Xue, Tianyu Xie, Tianyang Hu, Zijin Feng, Jiacheng Sun, Kenji Kawaguchi, Zhenguo Li, Zhi-Ming Ma

    Abstract: Large language models (LLMs) predominantly use autoregressive (AR) approaches, but masked diffusion models (MDMs) are emerging as viable alternatives. A key challenge in comparing AR and MDM paradigms is their typical architectural difference: AR models are often decoder-only, while MDMs have largely been encoder-only. This practice of changing both the modeling paradigm and architecture simultane…

    Submitted 24 June, 2025; originally announced June 2025.

  20. Beyond Wellbeing Apps: Co-Designing Immersive, Embodied, and Collective Digital Wellbeing Interventions for Healthcare Professionals

    Authors: Zheyuan Zhang, Jingjing Sun, Dorian Peters, Rafael A. Calvo

    Abstract: Healthcare professionals (HCPs) face increasing levels of stress and burnout. Technological wellbeing interventions provide accessible and flexible support for HCPs. While most studies have focused on mobile- and web-based programs, alternative technologies like virtual reality (VR), augmented reality (AR), tangible interfaces, and embodied technologies are emerging as engaging and effective tools…

    Submitted 24 June, 2025; originally announced June 2025.

    Comments: 21 pages, DIS '25: Designing Interactive Systems Conference, Funchal, Portugal, July 2025

  21. arXiv:2506.19356  [pdf, ps, other]

    cs.CR cs.LG

    WebGuard++: Interpretable Malicious URL Detection via Bidirectional Fusion of HTML Subgraphs and Multi-Scale Convolutional BERT

    Authors: Ye Tian, Zhang Yumin, Yifan Jia, Jianguo Sun, Yanbin Wang

    Abstract: URL+HTML feature fusion shows promise for robust malicious URL detection, since attacker artifacts persist in DOM structures. However, prior work suffers from four critical shortcomings: (1) incomplete URL modeling, failing to jointly capture lexical patterns and semantic context; (2) HTML graph sparsity, where threat-indicative nodes (e.g., obfuscated scripts) are isolated amid benign content, ca…

    Submitted 24 June, 2025; originally announced June 2025.

  22. arXiv:2506.18727  [pdf, other]

    cs.HC cs.SE

    AutoGraph: A Knowledge-Graph Framework for Modeling Interface Interaction and Automating Procedure Execution in Digital Nuclear Control Rooms

    Authors: Xingyu Xiao, Jiejuan Tong, Jun Sun, Zhe Sui, Jingang Liang, Hongru Zhao, Jun Zhao, Haitao Wang

    Abstract: Digitalization in nuclear power plant (NPP) control rooms is reshaping how operators interact with procedures and interface elements. However, existing computer-based procedures (CBPs) often lack semantic integration with human-system interfaces (HSIs), limiting their capacity to support intelligent automation and increasing the risk of human error, particularly under dynamic or complex operating…

    Submitted 26 May, 2025; originally announced June 2025.

  23. arXiv:2506.16699  [pdf, ps, other]

    cs.CR

    Exploring Traffic Simulation and Cybersecurity Strategies Using Large Language Models

    Authors: Lu Gao, Yongxin Liu, Hongyun Chen, Dahai Liu, Yunpeng Zhang, Jingran Sun

    Abstract: Intelligent Transportation Systems (ITS) are increasingly vulnerable to sophisticated cyberattacks due to their complex, interconnected nature. Ensuring the cybersecurity of these systems is paramount to maintaining road safety and minimizing traffic disruptions. This study presents a novel multi-agent framework leveraging Large Language Models (LLMs) to enhance traffic simulation and cybersecurit…

    Submitted 19 June, 2025; originally announced June 2025.

  24. arXiv:2506.15894  [pdf, ps, other]

    cs.CL cs.AI

    Language Models can perform Single-Utterance Self-Correction of Perturbed Reasoning

    Authors: Sam Silver, Jimin Sun, Ivan Zhang, Sara Hooker, Eddie Kim

    Abstract: Large Language Models (LLMs) have demonstrated impressive mathematical reasoning capabilities, yet their performance remains brittle to minor variations in problem description and prompting strategy. Furthermore, reasoning is vulnerable to sampling-induced errors which autoregressive models must primarily address using self-correction via additionally-generated tokens. To better understand self-co…

    Submitted 18 June, 2025; originally announced June 2025.

  25. arXiv:2506.15734  [pdf, ps, other]

    cs.AI cs.CL cs.CR cs.CV cs.LG

    The Safety Reminder: A Soft Prompt to Reactivate Delayed Safety Awareness in Vision-Language Models

    Authors: Peiyuan Tang, Haojie Xin, Xiaodong Zhang, Jun Sun, Qin Xia, Zijiang Yang

    Abstract: As Vision-Language Models (VLMs) demonstrate increasing capabilities across real-world applications such as code generation and chatbot assistance, ensuring their safety has become paramount. Unlike traditional Large Language Models (LLMs), VLMs face unique vulnerabilities due to their multimodal nature, allowing adversaries to modify visual or textual inputs to bypass safety guardrails and trigge…

    Submitted 15 June, 2025; originally announced June 2025.

    Comments: 23 pages, 10 figures

  26. arXiv:2506.15675  [pdf, ps, other]

    cs.CV cs.AI

    Sekai: A Video Dataset towards World Exploration

    Authors: Zhen Li, Chuanhao Li, Xiaofeng Mao, Shaoheng Lin, Ming Li, Shitian Zhao, Zhaopan Xu, Xinyue Li, Yukang Feng, Jianwen Sun, Zizhen Li, Fanrui Zhang, Jiaxin Ai, Zhixiang Wang, Yuwei Wu, Tong He, Jiangmiao Pang, Yu Qiao, Yunde Jia, Kaipeng Zhang

    Abstract: Video generation techniques have made remarkable progress, promising to be the foundation of interactive world exploration. However, existing video generation datasets are not well-suited for world exploration training as they suffer from some limitations: limited locations, short duration, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai…

    Submitted 20 June, 2025; v1 submitted 18 June, 2025; originally announced June 2025.

    Comments: 12 pages, 6 figures

  27. arXiv:2506.15523  [pdf, ps, other]

    cs.PF

    Atys: An Efficient Profiling Framework for Identifying Hotspot Functions in Large-scale Cloud Microservices

    Authors: Jiaqi Sun, Dingyu Yang, Shiyou Qian, Jian Cao, Guangtao Xue

    Abstract: To handle the high volume of requests, large-scale services are comprised of thousands of instances deployed in clouds. These services utilize diverse programming languages and are distributed across various nodes as encapsulated containers. Given their vast scale, even minor performance enhancements can lead to significant cost reductions. In this paper, we introduce Atys, an efficient profiling…

    Submitted 18 June, 2025; originally announced June 2025.

  28. arXiv:2506.14201  [pdf, ps, other]

    cs.RO eess.SY

    Pose State Perception of Interventional Robot for Cardio-cerebrovascular Procedures

    Authors: Shunhan Ji, Yanxi Chen, Zhongyu Yang, Quan Zhang, Xiaohang Nie, Jingqian Sun, Yichao Tang

    Abstract: In response to the increasing demand for cardiocerebrovascular interventional surgeries, precise control of interventional robots has become increasingly important. Within these complex vascular scenarios, the accurate and reliable perception of the pose state for interventional robots is particularly crucial. This paper presents a novel vision-based approach without the need of additional sensors…

    Submitted 17 June, 2025; originally announced June 2025.

  29. arXiv:2506.14009  [pdf, ps, other]

    cs.RO

    GRaD-Nav++: Vision-Language Model Enabled Visual Drone Navigation with Gaussian Radiance Fields and Differentiable Dynamics

    Authors: Qianzhong Chen, Naixiang Gao, Suning Huang, JunEn Low, Timothy Chen, Jiankai Sun, Mac Schwager

    Abstract: Autonomous drones capable of interpreting and executing high-level language instructions in unstructured environments remain a long-standing goal. Yet existing approaches are constrained by their dependence on hand-crafted skills, extensive parameter tuning, or computationally intensive models unsuitable for onboard use. We introduce GRaD-Nav++, a lightweight Vision-Language-Action (VLA) framework…

    Submitted 16 June, 2025; originally announced June 2025.

  30. arXiv:2506.13585  [pdf, ps, other]

    cs.CL cs.LG

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    Authors: MiniMax, Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, Chengjun Xiao, Chengyu Du, Chi Zhang, Chu Qiao, Chunhao Zhang, Chunhui Du, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dong Li, Enwei Jiao, Haigang Zhou, et al. (103 additional authors not shown)

    Abstract: We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: A technical report from MiniMax. The authors are listed in alphabetical order. We open-source our MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1

  31. arXiv:2506.13492  [pdf, ps, other]

    cs.CV

    GeoSDF: Plane Geometry Diagram Synthesis via Signed Distance Field

    Authors: Chengrui Zhang, Maizhen Ning, Zihao Zhou, Jie Sun, Kaizhu Huang, Qiufeng Wang

    Abstract: Plane Geometry Diagram Synthesis has been a crucial task in computer graphics, with applications ranging from educational tools to AI-driven mathematical reasoning. Traditionally, we rely on computer tools (e.g., Matplotlib and GeoGebra) to manually generate precise diagrams, but this usually incurs a huge, complicated calculation cost. Recently, researchers have started to work on learning-based methods…

    Submitted 16 June, 2025; originally announced June 2025.

  32. arXiv:2506.13186  [pdf, ps, other]

    cs.SE

    Empirical Evaluation of Large Language Models in Automated Program Repair

    Authors: Jiajun Sun, Fengjie Li, Xinzhu Qi, Hongyu Zhang, Jiajun Jiang

    Abstract: The increasing prevalence of software bugs has made automated program repair (APR) a key research focus. Large language models (LLMs) offer new opportunities for APR, but existing studies mostly rely on smaller, earlier-generation models and Java benchmarks. The repair capabilities of modern, large-scale LLMs across diverse languages and scenarios remain underexplored. To address this, we conduct…

    Submitted 16 June, 2025; originally announced June 2025.

  33. arXiv:2506.12849  [pdf, ps, other]

    cs.CV

    CAPO: Reinforcing Consistent Reasoning in Medical Decision-Making

    Authors: Songtao Jiang, Yuan Wang, Ruizhe Chen, Yan Zhang, Ruilin Luo, Bohan Lei, Sibo Song, Yang Feng, Jimeng Sun, Jian Wu, Zuozhu Liu

    Abstract: In medical visual question answering (Med-VQA), achieving accurate responses relies on three critical steps: precise perception of medical imaging data, logical reasoning grounded in visual input and textual questions, and coherent answer derivation from the reasoning process. Recent advances in general vision-language models (VLMs) show that large-scale reinforcement learning (RL) could significa…

    Submitted 15 June, 2025; originally announced June 2025.

  34. arXiv:2506.12680  [pdf, ps, other]

    cs.CV

    3D Hand Mesh-Guided AI-Generated Malformed Hand Refinement with Hand Pose Transformation via Diffusion Model

    Authors: Chen-Bin Feng, Kangdao Liu, Jian Sun, Jiping Jin, Yiguo Jiang, Chi-Man Vong

    Abstract: The malformed hands in the AI-generated images seriously affect the authenticity of the images. To refine malformed hands, existing depth-based approaches use a hand depth estimator to guide the refinement of malformed hands. Due to the performance limitations of the hand depth estimator, many hand details cannot be represented, resulting in errors in the generated hands, such as confusing the pal…

    Submitted 16 June, 2025; v1 submitted 14 June, 2025; originally announced June 2025.

  35. arXiv:2506.12537  [pdf, ps, other]

    cs.CL cs.AI eess.AS

    Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction

    Authors: Xiaoran Fan, Zhichao Sun, Yangfan Gao, Jingfei Xiong, Hang Yan, Yifei Cao, Jiajun Sun, Shuo Li, Zhihao Zhang, Zhiheng Xi, Yuhao Zhou, Senjie Jin, Changhao Jiang, Junjie Ye, Ming Zhang, Rui Zheng, Zhenhua Han, Yunke Zhang, Demei Yan, Shaokang Dong, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang

    Abstract: Speech-language models (SLMs) offer a promising path toward unifying speech and text understanding and generation. However, challenges remain in achieving effective cross-modal alignment and high-quality speech generation. In this work, we systematically investigate the impact of key components (i.e., speech tokenizers, speech heads, and speaker modeling) on the performance of LLM-centric SLMs. We…

    Submitted 14 June, 2025; originally announced June 2025.

  36. arXiv:2506.11913  [pdf, ps, other]

    cs.CV

    O2Former: Direction-Aware and Multi-Scale Query Enhancement for SAR Ship Instance Segmentation

    Authors: F. Gao, Y Li, X He, J Sun, J Wang

    Abstract: Instance segmentation of ships in synthetic aperture radar (SAR) imagery is critical for applications such as maritime monitoring, environmental analysis, and national security. SAR ship images present challenges including scale variation, object density, and fuzzy target boundary, which are often overlooked in existing methods, leading to suboptimal performance. In this work, we propose O2Former,…

    Submitted 13 June, 2025; originally announced June 2025.

    Comments: 12 pages, 7 figures

  37. arXiv:2506.10264  [pdf, ps, other]

    cs.AI

    WGSR-Bench: Wargame-based Game-theoretic Strategic Reasoning Benchmark for Large Language Models

    Authors: Qiyue Yin, Pei Xu, Qiaozhe Li, Shengda Liu, Shengqi Shen, Tong Wang, Yihong Han, Xiaonan Zhao, Likun Yang, Shiyue Cao, Shiyu Qiu, Yuxuan Liu, Shizhao Yu, Lei Cui, Chengxin Yan, Jie Sun, Xiangquan Tang, Kaiqi Huang

    Abstract: Recent breakthroughs in Large Language Models (LLMs) have led to a qualitative leap in artificial intelligence's performance on reasoning tasks, particularly demonstrating remarkable capabilities in mathematical, symbolic, and commonsense reasoning. However, as a critical component of advanced human cognition, strategic reasoning, i.e., the ability to assess multi-agent behaviors in dynamic envir…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: 15 pages, 17 figures

  38. arXiv:2506.09565  [pdf, ps, other]

    cs.CV

    SemanticSplat: Feed-Forward 3D Scene Understanding with Language-Aware Gaussian Fields

    Authors: Qijing Li, Jingxiang Sun, Liang An, Zhaoqi Su, Hongwen Zhang, Yebin Liu

    Abstract: Holistic 3D scene understanding, which jointly models geometry, appearance, and semantics, is crucial for applications like augmented reality and robotic interaction. Existing feed-forward 3D scene understanding methods (e.g., LSM) are limited to extracting language-based semantics from scenes, failing to achieve holistic scene comprehension. Additionally, they suffer from low-quality geometry rec…

    Submitted 13 June, 2025; v1 submitted 11 June, 2025; originally announced June 2025.

  39. arXiv:2506.09427  [pdf, other]

    cs.CV cs.AI

    A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation

    Authors: Yukang Feng, Jianwen Sun, Chuanhao Li, Zizhen Li, Jiaxin Ai, Fanrui Zhang, Yifan Chang, Sizhuo Zhou, Shenglin Zhang, Yu Dai, Kaipeng Zhang

    Abstract: Recent advancements in Large Multimodal Models (LMMs) have significantly improved multimodal understanding and generation. However, these models still struggle to generate tightly interleaved image-text outputs, primarily due to the limited scale, quality and instructional richness of current training datasets. To address this, we introduce InterSyn, a large-scale multimodal dataset constructed us…

    Submitted 11 June, 2025; originally announced June 2025.

  40. arXiv:2506.09344  [pdf, ps, other]

    cs.AI cs.CL cs.CV cs.LG cs.SD eess.AS

    Ming-Omni: A Unified Multimodal Model for Perception and Generation

    Authors: Inclusion AI, Biao Gong, Cheng Zou, Chuanyang Zheng, Chunluan Zhou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, Furong Xu, GuangMing Yao, Jun Zhou, Jingdong Chen, Jianxin Sun, Jiajia Liu, Jianjiang Zhu, Jun Peng, Kaixiang Ji, Kaiyou Song, Kaimeng Ren, Libin Wang, Lixiang Ru, Lele Xie, Longhua Tan, et al. (33 additional authors not shown)

    Abstract: We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single…

    Submitted 10 June, 2025; originally announced June 2025.

    Comments: 18 pages, 8 figures

  41. arXiv:2506.08967  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model

    Authors: Ailin Huang, Bingxin Li, Bruce Wang, Boyong Wu, Chao Yan, Chengli Feng, Heng Wang, Hongyu Zhou, Hongyuan Wang, Jingbei Li, Jianjian Sun, Joanna Wang, Mingrui Chen, Peng Liu, Ruihang Miao, Shilei Jiang, Tian Fei, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Ge, Zheng Gong, Zhewei Huang, et al. (51 additional authors not shown)

    Abstract: Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a du…

    Submitted 13 June, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

    Comments: 12 pages, 3 figures

  42. arXiv:2506.08566  [pdf, ps, other]

    cs.CV

    Generating Vision-Language Navigation Instructions Incorporated Fine-Grained Alignment Annotations

    Authors: Yibo Cui, Liang Xie, Yu Zhao, Jiawei Sun, Erwei Yin

    Abstract: Vision-Language Navigation (VLN) enables intelligent agents to navigate environments by integrating visual perception and natural language instructions, yet faces significant challenges due to the scarcity of fine-grained cross-modal alignment annotations. Existing datasets primarily focus on global instruction-trajectory matching, neglecting sub-instruction-level and entity-level alignments criti…

    Submitted 10 June, 2025; originally announced June 2025.

  43. arXiv:2506.08184  [pdf, ps, other]

    cs.CL cs.AI q-bio.NC

    Unable to Forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length

    Authors: Chupei Wang, Jiaqiu Vince Sun

    Abstract: Information retrieval in Large Language Models (LLMs) is increasingly recognized as intertwined with generation capabilities rather than mere lookup. While longer contexts are often assumed to improve retrieval, the effects of intra-context interference remain understudied. To address this, we adapt the proactive interference (PI) paradigm from cognitive science, where earlier information disrupts…

    Submitted 11 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

  44. arXiv:2506.07584  [pdf, ps, other]

    cs.LG

    MIRA: Medical Time Series Foundation Model for Real-World Health Data

    Authors: Hao Li, Bowen Deng, Chang Xu, Zhiyuan Feng, Viktor Schlegel, Yu-Hao Huang, Yizheng Sun, Jingyuan Sun, Kailai Yang, Yiyao Yu, Jiang Bian

    Abstract: A unified foundation model for medical time series -- pretrained on open access and ethics board-approved medical corpora -- offers the potential to reduce annotation burdens, minimize model customization, and enable robust transfer across clinical institutions, modalities, and tasks, particularly in data-scarce or privacy-constrained environments. However, existing generalist time series foundati…

    Submitted 11 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

  45. arXiv:2506.07298  [pdf, ps, other]

    cs.LG cs.AI

    Pre-trained Large Language Models Learn Hidden Markov Models In-context

    Authors: Yijia Dai, Zhaolin Gao, Yahya Sattar, Sarah Dean, Jennifer J. Sun

    Abstract: Hidden Markov Models (HMMs) are foundational tools for modeling sequential data with latent Markovian structure, yet fitting them to real-world data remains computationally challenging. In this work, we show that pre-trained large language models (LLMs) can effectively model data generated by HMMs via in-context learning (ICL)–their ability to infer patterns from examples within a…

    Submitted 11 June, 2025; v1 submitted 8 June, 2025; originally announced June 2025.

  46. arXiv:2506.07047  [pdf, ps, other]

    cs.AI

    Mathesis: Towards Formal Theorem Proving from Natural Languages

    Authors: Yu Xuejun, Jianyuan Zhong, Zijin Feng, Pengyi Zhai, Roozbeh Yousefzadeh, Wei Chong Ng, Haoxiong Liu, Ziyi Shou, Jing Xiong, Yudong Zhou, Claudia Beth Ong, Austen Jeremy Sugiarto, Yaoxi Zhang, Wai Ming Tai, Huan Cao, Dongcai Lu, Jiacheng Sun, Qiang Xu, Shen Xin, Zhenguo Li

    Abstract: Recent advances in large language models show strong promise for formal reasoning. However, most LLM-based theorem provers have long been constrained by the need for expert-written formal statements as inputs, limiting their applicability to real-world problems expressed in natural language. We tackle this gap with Mathesis, the first end-to-end theorem proving pipeline processing informal problem…

    Submitted 8 June, 2025; originally announced June 2025.

  47. arXiv:2506.07037  [pdf, ps, other]

    cs.CL

    KG2QA: Knowledge Graph-enhanced Retrieval-Augmented Generation for Communication Standards Question Answering

    Authors: Zhongze Luo, Weixuan Wan, Qizhi Zheng, Yanhong Bai, Jingyun Sun, Jian Wang, Dan Wang

    Abstract: There are many types of standards in the field of communication. The traditional consulting model has a long cycle and relies on the knowledge and experience of experts, making it difficult to meet the rapidly developing technological demands. This paper combines the fine-tuning of large language models with the construction of knowledge graphs to implement an intelligent consultation and question…

    Submitted 8 June, 2025; originally announced June 2025.

    Comments: 23 pages

  48. arXiv:2506.06645  [pdf, ps, other]

    cs.CV

    Parametric Gaussian Human Model: Generalizable Prior for Efficient and Realistic Human Avatar Modeling

    Authors: Cheng Peng, Jingxiang Sun, Yushuo Chen, Zhaoqi Su, Zhuo Su, Yebin Liu

    Abstract: Photorealistic and animatable human avatars are a key enabler for virtual/augmented reality, telepresence, and digital entertainment. While recent advances in 3D Gaussian Splatting (3DGS) have greatly improved rendering quality and efficiency, existing methods still face fundamental challenges, including time-consuming per-subject optimization and poor generalization under sparse monocular inputs.…

    Submitted 6 June, 2025; originally announced June 2025.

    Comments: Project Page: https://pengc02.github.io/pghm/

  49. arXiv:2506.05348  [pdf, ps, other]

    cs.CV

    FreeTimeGS: Free Gaussian Primitives at Anytime and Anywhere for Dynamic Scene Reconstruction

    Authors: Yifan Wang, Peishan Yang, Zhen Xu, Jiaming Sun, Zhanhua Zhang, Yong Chen, Hujun Bao, Sida Peng, Xiaowei Zhou

    Abstract: This paper addresses the challenge of reconstructing dynamic 3D scenes with complex motions. Some recent works define 3D Gaussian primitives in the canonical space and use deformation fields to map canonical primitives to observation spaces, achieving real-time dynamic view synthesis. However, these methods often struggle to handle scenes with complex motions due to the difficulty of optimizing de…

    Submitted 6 June, 2025; v1 submitted 5 June, 2025; originally announced June 2025.

    Comments: CVPR 2025; Project page: https://zju3dv.github.io/freetimegs/

  50. arXiv:2506.03690  [pdf, other]

    cs.CL

    Robust Preference Optimization via Dynamic Target Margins

    Authors: Jie Sun, Junkang Wu, Jiancan Wu, Zhibo Zhu, Xingyu Lu, Jun Zhou, Lintao Ma, Xiang Wang

    Abstract: The alignment of Large Language Models (LLMs) is crucial for ensuring their safety and reliability in practical applications. Direct Preference Optimization (DPO) has emerged as an efficient method that directly optimizes models using preference pairs, significantly reducing resource demands. However, the effectiveness of DPO heavily depends on the data quality, which is frequently compromised by…

    Submitted 4 June, 2025; originally announced June 2025.

    Comments: 18 pages, 6 figures, accepted to The 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)