Skip to main content

Showing 1–50 of 705 results for author: Song, X

Searching in archive cs. Search in all archives.
.
  1. arXiv:2507.01381  [pdf, ps, other

    cs.LG cs.AI

    Distributional Soft Actor-Critic with Diffusion Policy

    Authors: Tong Liu, Yinuo Wang, Xujie Song, Wenjun Zou, Liangfa Chen, Likun Wang, Bin Shuai, Jingliang Duan, Shengbo Eben Li

    Abstract: Reinforcement learning has been proven to be highly effective in handling complex control tasks. Traditional methods typically use unimodal distributions, such as Gaussian distributions, to model the output of value distributions. However, unimodal distribution often and easily causes bias in value function estimation, leading to poor algorithm performance. This paper proposes a distributional rei… ▽ More

    Submitted 2 July, 2025; originally announced July 2025.

    Comments: Accepted IEEE ITSC 2025

  2. arXiv:2506.23369  [pdf, ps, other

    cs.RO

    GS-NBV: a Geometry-based, Semantics-aware Viewpoint Planning Algorithm for Avocado Harvesting under Occlusions

    Authors: Xiao'ao Song, Konstantinos Karydis

    Abstract: Efficient identification of picking points is critical for automated fruit harvesting. Avocados present unique challenges owing to their irregular shape, weight, and less-structured growing environments, which require specific viewpoints for successful harvesting. We propose a geometry-based, semantics-aware viewpoint-planning algorithm to address these challenges. The planning process involves th… ▽ More

    Submitted 29 June, 2025; originally announced June 2025.

    Comments: Accepted for publication in CASE 2025, 6 pages, 8 figures

  3. arXiv:2506.22049  [pdf, ps, other

    cs.LG cs.CL

    GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling

    Authors: Tianhao Chen, Xin Xu, Zijing Liu, Pengxiang Li, Xinyuan Song, Ajay Kumar Jaiswal, Fan Zhang, Jishan Hu, Yang Wang, Hao Chen, Shizhe Diao, Shiwei Liu, Yu Li, Yin Lu, Can Yang

    Abstract: Modern Large Language Models, such as the LLaMA, Qwen and DeepSeek series, predominantly adopt the Pre-LayerNorm (Pre-LN) Transformer architecture. While being stable during pretraining and scalable to large model sizes, Pre-LN suffers from an exponential growth in activation variance across layers, causing the residual path to dominate over sub-layer outputs and limiting the learning capacity of… ▽ More

    Submitted 27 June, 2025; originally announced June 2025.

  4. arXiv:2506.21718  [pdf, ps, other

    cs.LG cs.AI cs.PF cs.SE eess.SY

    Performance Prediction for Large Systems via Text-to-Text Regression

    Authors: Yash Akhauri, Bryan Lewandowski, Cheng-Hsi Lin, Adrian N. Reyes, Grant C. Forbes, Arissa Wongpanich, Bangding Yang, Mohamed S. Abdelfattah, Sagi Perel, Xingyou Song

    Abstract: In many industries, predicting metric outcomes of large systems is a fundamental problem, driven largely by traditional tabular regression. However, such methods struggle on complex systems data in the wild such as configuration files or system logs, where feature engineering is often infeasible. We propose text-to-text regression as a general, scalable alternative. For predicting resource efficie… ▽ More

    Submitted 26 June, 2025; originally announced June 2025.

    Comments: Code can be found at https://github.com/google-deepmind/regress-lm

  5. arXiv:2506.19290  [pdf, ps, other

    cs.AI cs.CL

    Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs

    Authors: Liang Zeng, Yongcong Li, Yuzhen Xiao, Changshi Li, Chris Yuhao Liu, Rui Yan, Tianwen Wei, Jujie He, Xuchen Song, Yang Liu, Yahui Zhou

    Abstract: Software engineering (SWE) has recently emerged as a crucial testbed for next-generation LLM agents, demanding inherent capabilities in two critical dimensions: sustained iterative problem-solving (e.g., >50 interaction rounds) and long-context dependency resolution (e.g., >32k tokens). However, the data curation process in SWE remains notoriously time-consuming, as it heavily relies on manual ann… ▽ More

    Submitted 23 June, 2025; originally announced June 2025.

  6. arXiv:2506.19055  [pdf, ps, other

    eess.IV cs.CV

    Xray2Xray: World Model from Chest X-rays with Volumetric Context

    Authors: Zefan Yang, Xinrui Song, Xuanang Xu, Yongyi Shi, Ge Wang, Mannudeep K. Kalra, Pingkun Yan

    Abstract: Chest X-rays (CXRs) are the most widely used medical imaging modality and play a pivotal role in diagnosing diseases. However, as 2D projection images, CXRs are limited by structural superposition, which constrains their effectiveness in precise disease diagnosis and risk prediction. To address the limitations of 2D CXRs, this study introduces Xray2Xray, a novel World Model that learns latent repr… ▽ More

    Submitted 17 June, 2025; originally announced June 2025.

  7. arXiv:2506.13585  [pdf, ps, other

    cs.CL cs.LG

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    Authors: MiniMax, :, Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, Chengjun Xiao, Chengyu Du, Chi Zhang, Chu Qiao, Chunhao Zhang, Chunhui Du, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dong Li, Enwei Jiao, Haigang Zhou , et al. (103 additional authors not shown)

    Abstract: We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model… ▽ More

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: A technical report from MiniMax. The authors are listed in alphabetical order. We open-source our MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1

  8. arXiv:2506.12154  [pdf, ps, other

    cs.SD eess.AS

    Adapting Whisper for Streaming Speech Recognition via Two-Pass Decoding

    Authors: Haoran Zhou, Xingchen Song, Brendan Fahy, Qiaochu Song, Binbin Zhang, Zhendong Peng, Anshul Wadhawan, Denglin Jiang, Apurv Verma, Vinay Ramesh, Srivas Prasad, Michele M. Franceschini

    Abstract: OpenAI Whisper is a family of robust Automatic Speech Recognition (ASR) models trained on 680,000 hours of audio. However, its encoder-decoder architecture, trained with a sequence-to-sequence objective, lacks native support for streaming ASR. In this paper, we fine-tune Whisper for streaming ASR using the WeNet toolkit by adopting a Unified Two-pass (U2) structure. We introduce an additional Conn… ▽ More

    Submitted 13 June, 2025; originally announced June 2025.

    Comments: Accepted to INTERSPEECH 2025

  9. arXiv:2506.11712  [pdf, ps, other

    cs.AI

    Mitigating Hallucination Through Theory-Consistent Symmetric Multimodal Preference Optimization

    Authors: Wenqi Liu, Xuemeng Song, Jiaxi Li, Yinwei Wei, Na Zheng, Jianhua Yin, Liqiang Nie

    Abstract: Direct Preference Optimization (DPO) has emerged as an effective approach for mitigating hallucination in Multimodal Large Language Models (MLLMs). Although existing methods have achieved significant progress by utilizing vision-oriented contrastive objectives for enhancing MLLMs' attention to visual inputs and hence reducing hallucination, they suffer from non-rigorous optimization objective func… ▽ More

    Submitted 13 June, 2025; originally announced June 2025.

  10. arXiv:2506.09049  [pdf, ps, other

    cs.AI cs.CV cs.RO

    VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning

    Authors: Li Kang, Xiufeng Song, Heng Zhou, Yiran Qin, Jie Yang, Xiaohong Liu, Philip Torr, Lei Bai, Zhenfei Yin

    Abstract: Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, a few have begun to explore vision-language models (VLMs) for visual reasoning. However, these VLM-based approaches… ▽ More

    Submitted 10 June, 2025; originally announced June 2025.

    Comments: Project page: https://faceong.github.io/VIKI-R/

  11. arXiv:2506.06221  [pdf, ps, other

    cs.RO cs.LG

    BiAssemble: Learning Collaborative Affordance for Bimanual Geometric Assembly

    Authors: Yan Shen, Ruihai Wu, Yubin Ke, Xinyuan Song, Zeyi Li, Xiaoqi Li, Hongwei Fan, Haoran Lu, Hao dong

    Abstract: Shape assembly, the process of combining parts into a complete whole, is a crucial robotic skill with broad real-world applications. Among various assembly tasks, geometric assembly--where broken parts are reassembled into their original form (e.g., reconstructing a shattered bowl)--is particularly challenging. This requires the robot to recognize geometric cues for grasping, assembly, and subsequ… ▽ More

    Submitted 10 June, 2025; v1 submitted 6 June, 2025; originally announced June 2025.

    Comments: ICML 2025

  12. arXiv:2506.04924  [pdf, ps, other

    cs.LG

    Predicting ICU In-Hospital Mortality Using Adaptive Transformer Layer Fusion

    Authors: Han Wang, Ruoyun He, Guoguang Lao, Ting Liu, Hejiao Luo, Changqi Qin, Hongying Luo, Junmin Huang, Zihan Wei, Lu Chen, Yongzhi Xu, Ziqian Bi, Junhao Song, Tianyang Wang, Chia Xin Liang, Xinyuan Song, Huafeng Liu, Junfeng Hao, Chunjie Tian

    Abstract: Early identification of high-risk ICU patients is crucial for directing limited medical resources. We introduce ALFIA (Adaptive Layer Fusion with Intelligent Attention), a modular, attention-based architecture that jointly trains LoRA (Low-Rank Adaptation) adapters and an adaptive layer-weighting mechanism to fuse multi-layer semantic features from a BERT backbone. Trained on our rigorous cw-24 (C… ▽ More

    Submitted 6 June, 2025; v1 submitted 5 June, 2025; originally announced June 2025.

    Comments: 21 pages, 6 figures

  13. arXiv:2506.03569  [pdf, ps, other

    cs.CL

    MiMo-VL Technical Report

    Authors: Xiaomi LLM-Core Team, :, Zihao Yue, Zhenru Lin, Yifan Song, Weikun Wang, Shuhuai Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, Kainan Bao, Hao Tian, Hailin Zhang, Gang Wang, Dawei Zhu, Cici, Chenhong He, Bowen Ye, Bowen Shen, Zihan Zhang, Zihan Jiang, Zhixian Zheng, Zhichao Song , et al. (50 additional authors not shown)

    Abstract: We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models delivering state-of-the-art performance in both general visual understanding and multimodal reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters. For GUI grounding applications, it sets a new standard with… ▽ More

    Submitted 4 June, 2025; originally announced June 2025.

    Comments: 32 pages

  14. arXiv:2506.03524  [pdf, ps, other

    cs.CL cs.SE

    Seed-Coder: Let the Code Model Curate Data for Itself

    Authors: ByteDance Seed, Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, Tao Sun, Jinhua Zhu, Shulin Xin, Dong Huang, Yetao Bai, Lixin Dong, Chao Li, Jianchong Chen, Hanzhi Zhou, Yifan Huang, Guanghan Ning, Xierui Song, Jiaze Chen, Siyao Liu, Kai Shen , et al. (2 additional authors not shown)

    Abstract: Code data in large language model (LLM) pretraining is recognized crucial not only for code-related tasks but also for enhancing general intelligence of LLMs. Current open-source LLMs often heavily rely on human effort to produce their code pretraining data, such as employing hand-crafted filtering rules tailored to individual programming languages, or using human-annotated data to train quality f… ▽ More

    Submitted 4 June, 2025; v1 submitted 3 June, 2025; originally announced June 2025.

  15. arXiv:2505.24120  [pdf, ps, other

    cs.CV cs.AI

    CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs

    Authors: Ai Jian, Weijie Qiu, Xiaokun Wang, Peiyu Wang, Yunzhuo Hao, Jiangbo Pei, Yichen Wei, Yi Peng, Xuchen Song

    Abstract: Vision-Language Models (VLMs) have demonstrated remarkable progress in multimodal understanding, yet their capabilities for scientific reasoning remain inadequately assessed. Current multimodal benchmarks predominantly evaluate generic image comprehension or text-driven reasoning, lacking authentic scientific contexts that require domain-specific knowledge integration with visual evidence analysis… ▽ More

    Submitted 16 June, 2025; v1 submitted 29 May, 2025; originally announced May 2025.

    Comments: 36 pages

  16. arXiv:2505.23744  [pdf, other

    cs.CV cs.AI

    Boosting Domain Incremental Learning: Selecting the Optimal Parameters is All You Need

    Authors: Qiang Wang, Xiang Song, Yuhang He, Jizhou Han, Chenhao Ding, Xinyuan Gao, Yihong Gong

    Abstract: Deep neural networks (DNNs) often underperform in real-world, dynamic settings where data distributions change over time. Domain Incremental Learning (DIL) offers a solution by enabling continual model adaptation, with Parameter-Isolation DIL (PIDIL) emerging as a promising paradigm to reduce knowledge conflicts. However, existing PIDIL methods struggle with parameter selection accuracy, especiall… ▽ More

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: Accepted at CVPR 2025

  17. arXiv:2505.23426  [pdf, ps, other

    cs.LG cs.AI

    Enhanced DACER Algorithm with High Diffusion Efficiency

    Authors: Yinuo Wang, Mining Tan, Wenjun Zou, Haotian Lin, Xujie Song, Wenxuan Wang, Tong Liu, Likun Wang, Guojian Zhan, Tianze Zhu, Shiqi Liu, Jingliang Duan, Shengbo Eben Li

    Abstract: Due to their expressive capacity, diffusion models have shown great promise in offline RL and imitation learning. Diffusion Actor-Critic with Entropy Regulator (DACER) extended this capability to online RL by using the reverse diffusion process as a policy approximator, trained end-to-end with policy gradient methods, achieving strong performance. However, this comes at the cost of requiring many… ▽ More

    Submitted 29 May, 2025; originally announced May 2025.

  18. arXiv:2505.20254  [pdf, ps, other

    cs.LG cs.AI cs.CL stat.ML

    Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs

    Authors: Xiangchen Song, Aashiq Muhamed, Yujia Zheng, Lingjing Kong, Zeyu Tang, Mona T. Diab, Virginia Smith, Kun Zhang

    Abstract: Sparse Autoencoders (SAEs) are a prominent tool in mechanistic interpretability (MI) for decomposing neural network activations into interpretable features. However, the aspiration to identify a canonical set of features is challenged by the observed inconsistency of learned SAE features across different training runs, undermining the reliability and efficiency of MI research. This position paper… ▽ More

    Submitted 26 May, 2025; originally announced May 2025.

  19. arXiv:2505.18847  [pdf, other

    cs.AI cs.CL

    Signal, Image, or Symbolic: Exploring the Best Input Representation for Electrocardiogram-Language Models Through a Unified Framework

    Authors: William Han, Chaojing Duan, Zhepeng Cen, Yihang Yao, Xiaoyu Song, Atharva Mhaskar, Dylan Leong, Michael A. Rosenberg, Emerson Liu, Ding Zhao

    Abstract: Recent advances have increasingly applied large language models (LLMs) to electrocardiogram (ECG) interpretation, giving rise to Electrocardiogram-Language Models (ELMs). Conditioned on an ECG and a textual query, an ELM autoregressively generates a free-form textual response. Unlike traditional classification-based systems, ELMs emulate expert cardiac electrophysiologists by issuing diagnoses, an… ▽ More

    Submitted 24 May, 2025; originally announced May 2025.

    Comments: 29 pages, 2 figures, 8 tables

  20. arXiv:2505.18690  [pdf, other

    cs.CL

    Benchmarking and Rethinking Knowledge Editing for Large Language Models

    Authors: Guoxiu He, Xin Song, Futing Wang, Aixin Sun

    Abstract: Knowledge editing aims to update the embedded knowledge within Large Language Models (LLMs). However, existing approaches, whether through parameter modification or external memory integration, often suffer from inconsistent evaluation objectives and experimental setups. To address this gap, we conduct a comprehensive benchmarking study. In addition to fact-level datasets, we introduce more comple… ▽ More

    Submitted 24 May, 2025; originally announced May 2025.

    Comments: arXiv admin note: text overlap with arXiv:2503.05212

  21. arXiv:2505.18610  [pdf, ps, other

    cs.CL

    PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs

    Authors: Tengxuan Liu, Shiyao Li, Jiayi Yang, Tianchen Zhao, Feng Zhou, Xiaohui Song, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang

    Abstract: Recently, significant progress has been made in developing reasoning-capable Large Language Models (LLMs) through long Chain-of-Thought (CoT) techniques. However, this long-CoT reasoning process imposes substantial memory overhead due to the large Key-Value (KV) Cache memory overhead. Post-training KV Cache quantization has emerged as a promising compression technique and has been extensively stud… ▽ More

    Submitted 24 May, 2025; originally announced May 2025.

  22. arXiv:2505.17875  [pdf, ps, other

    cs.LG

    Semi-Supervised Multi-Label Feature Selection with Consistent Sparse Graph Learning

    Authors: Yan Zhong, Xingyu Wu, Xinping Zhao, Li Zhang, Xinyuan Song, Lei Shi, Bingbing Jiang

    Abstract: In practical domains, high-dimensional data are usually associated with diverse semantic labels, whereas traditional feature selection methods are designed for single-label data. Moreover, existing multi-label methods encounter two main challenges in semi-supervised scenarios: (1). Most semi-supervised methods fail to evaluate the label correlations without enough labeled samples, which are the cr… ▽ More

    Submitted 23 May, 2025; originally announced May 2025.

  23. arXiv:2505.17387  [pdf, ps, other

    cs.CL

    WiNGPT-3.0 Technical Report

    Authors: Boqin Zhuang, Chenxiao Song, Huitong Lu, Jiacheng Qiao, Mingqian Liu, Mingxing Yu, Ping Hong, Rui Li, Xiaoxia Song, Xiangjun Xu, Xu Chen, Yaoyao Ma, Yujie Gao

    Abstract: Current Large Language Models (LLMs) exhibit significant limitations, notably in structured, interpretable, and verifiable medical reasoning, alongside practical deployment challenges related to computational resources and data privacy. This report focused on the development of WiNGPT-3.0, the 32-billion parameter LLMs, engineered with the objective of enhancing its capacity for medical reasoning… ▽ More

    Submitted 4 June, 2025; v1 submitted 22 May, 2025; originally announced May 2025.

  24. arXiv:2505.13519  [pdf, other

    stat.ML cs.AI cs.LG

    Continuous Domain Generalization

    Authors: Zekun Cai, Yiheng Yao, Guangji Bai, Renhe Jiang, Xuan Song, Ryosuke Shibasaki, Liang Zhao

    Abstract: Real-world data distributions often shift continuously across multiple latent factors such as time, geography, and socioeconomic context. However, existing domain generalization approaches typically treat domains as discrete or evolving along a single axis (e.g., time), which fails to capture the complex, multi-dimensional nature of real-world variation. This paper introduces the task of Continuou… ▽ More

    Submitted 17 May, 2025; originally announced May 2025.

    Comments: 22 pages, 9 figures

  25. arXiv:2505.13211  [pdf, ps, other

    cs.CV cs.AI

    MAGI-1: Autoregressive Video Generation at Scale

    Authors: Sand. ai, Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, W. Q. Zhang, Weifeng Luo, Xiaoyang Kang, Yuchen Sun, Yue Cao, Yunpeng Huang, Yutong Lin, Yuxin Fang, Zewei Tao, Zheng Zhang, Zhongshu Wang, Zixun Liu, Dai Shi, Guoli Su, Hanwen Sun, Hong Pan , et al. (14 additional authors not shown)

    Abstract: We present MAGI-1, a world model that generates videos by autoregressively predicting a sequence of video chunks, defined as fixed-length segments of consecutive frames. Trained to denoise per-chunk noise that increases monotonically over time, MAGI-1 enables causal temporal modeling and naturally supports streaming generation. It achieves strong performance on image-to-video (I2V) tasks condition… ▽ More

    Submitted 19 May, 2025; originally announced May 2025.

  26. arXiv:2505.11444  [pdf, ps, other

    cs.LG stat.AP stat.ME stat.ML

    A Generative Framework for Causal Estimation via Importance-Weighted Diffusion Distillation

    Authors: Xinran Song, Tianyu Chen, Mingyuan Zhou

    Abstract: Estimating individualized treatment effects from observational data is a central challenge in causal inference, largely due to covariate imbalance and confounding bias from non-randomized treatment assignment. While inverse probability weighting (IPW) is a well-established solution to this problem, its integration into modern deep learning frameworks remains limited. In this work, we propose Impor… ▽ More

    Submitted 16 May, 2025; originally announced May 2025.

  27. arXiv:2505.10936  [pdf, ps, other

    cs.CL

    Connecting the Dots: A Chain-of-Collaboration Prompting Framework for LLM Agents

    Authors: Jiaxing Zhao, Hongbin Xie, Yuzhen Lei, Xuan Song, Zhuoran Shi, Lianxin Li, Shuangxue Liu, Haoran Zhang

    Abstract: Large Language Models (LLMs) have demonstrated impressive performance in executing complex reasoning tasks. Chain-of-thought effectively enhances reasoning capabilities by unlocking the potential of large models, while multi-agent systems provide more comprehensive solutions by integrating collective intelligence of multiple agents. However, both approaches face significant limitations. Single-age… ▽ More

    Submitted 16 May, 2025; originally announced May 2025.

    Comments: 34 pages, 20 figures

  28. arXiv:2505.09659  [pdf, ps, other

    cs.LG cs.CL

    LAS: Loss-less ANN-SNN Conversion for Fully Spike-Driven Large Language Models

    Authors: Long Chen, Xiaotian Song, Yanan Sun

    Abstract: Spiking Large Language Models (LLMs) have emerged as an energy-efficient alternative to conventional LLMs through their event-driven computation. To effectively obtain spiking LLMs, researchers develop different ANN-to-SNN conversion methods by leveraging pre-trained ANN parameters while inheriting the energy efficiency of SNN. However, existing conversion methods struggle with extreme activation… ▽ More

    Submitted 14 May, 2025; originally announced May 2025.

  29. arXiv:2505.07608  [pdf, ps, other

    cs.CL cs.AI cs.LG

    MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining

    Authors: LLM-Core Xiaomi, :, Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai , et al. (40 additional authors not shown)

    Abstract: We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with additional Multi-Token Prediction objective… ▽ More

    Submitted 5 June, 2025; v1 submitted 12 May, 2025; originally announced May 2025.

  30. arXiv:2505.07538  [pdf, other

    cs.CV

    Selftok: Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning

    Authors: Bohan Wang, Zhongqi Yue, Fengda Zhang, Shuo Chen, Li'an Bi, Junzhe Zhang, Xue Song, Kennard Yanting Chan, Jiachun Pan, Weijia Wu, Mingze Zhou, Wang Lin, Kaihang Pan, Saining Zhang, Liyu Jia, Wentao Hu, Wei Zhao, Hanwang Zhang

    Abstract: We completely discard the conventional spatial prior in image representation and introduce a novel discrete visual tokenizer: Self-consistency Tokenizer (Selftok). At its design core, we compose an autoregressive (AR) prior -- mirroring the causal structure of language -- into visual tokens by using the reverse diffusion process of image generation. The AR property makes Selftok fundamentally dist… ▽ More

    Submitted 27 May, 2025; v1 submitted 12 May, 2025; originally announced May 2025.

  31. arXiv:2505.07263  [pdf, ps, other

    cs.CV

    Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning

    Authors: Xiaokun Wang, Peiyu Wang, Jiangbo Pei, Wei Shen, Yi Peng, Yunzhuo Hao, Weijie Qiu, Ai Jian, Tianyidan Xie, Xuchen Song, Yang Liu, Yahui Zhou

    Abstract: We propose Skywork-VL Reward, a multimodal reward model that provides reward signals for both multimodal understanding and reasoning tasks. Our technical approach comprises two key components: First, we construct a large-scale multimodal preference dataset that covers a wide range of tasks and scenarios, with responses collected from both standard vision-language models (VLMs) and advanced VLM rea… ▽ More

    Submitted 9 June, 2025; v1 submitted 12 May, 2025; originally announced May 2025.

  32. arXiv:2505.02146  [pdf, other

    cs.CL cs.LG cs.PL

    QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach

    Authors: Shouyang Dong, Yuanbo Wen, Jun Bi, Di Huang, Jiaming Guo, Jianxing Xu, Ruibai Xu, Xinkai Song, Yifan Hao, Xuehai Zhou, Tianshi Chen, Qi Guo, Yunji Chen

    Abstract: Heterogeneous deep learning systems (DLS) such as GPUs and ASICs have been widely deployed in industrial data centers, which requires to develop multiple low-level tensor programs for different platforms. An attractive solution to relieve the programming burden is to transcompile the legacy code of one platform to others. However, current transcompilation techniques struggle with either tremendous… ▽ More

    Submitted 4 May, 2025; originally announced May 2025.

    Comments: Accepted to OSDI 2025

  33. arXiv:2505.01932  [pdf, other

    cs.GR cs.CV

    OT-Talk: Animating 3D Talking Head with Optimal Transportation

    Authors: Xinmu Wang, Xiang Gao, Xiyun Song, Heather Yu, Zongfang Lin, Liang Peng, Xianfeng Gu

    Abstract: Animating 3D head meshes using audio inputs has significant applications in AR/VR, gaming, and entertainment through 3D avatars. However, bridging the modality gap between speech signals and facial dynamics remains a challenge, often resulting in incorrect lip syncing and unnatural facial movements. To address this, we propose OT-Talk, the first approach to leverage optimal transportation to optim… ▽ More

    Submitted 10 May, 2025; v1 submitted 3 May, 2025; originally announced May 2025.

  34. arXiv:2505.01279  [pdf, other

    cs.LG

    MultiGran-STGCNFog: Towards Accurate and High-Throughput Inference for Multi-Granular Spatiotemporal Traffic Forecasting

    Authors: Zhaoyan Wang, Xiangchi Song, In-Young Ko

    Abstract: Accurate traffic forecasting and swift inference provision are essential for intelligent transportation systems. However, the present Graph Convolutional Network (GCN)-based approaches cannot extract and fuse multi-granular spatiotemporal features across various spatial and temporal scales sufficiently, proven to yield less accurate forecasts. Besides, additional feature extraction branches introd… ▽ More

    Submitted 2 May, 2025; originally announced May 2025.

    Comments: nine pages and five figures included

  35. arXiv:2504.18204  [pdf, ps, other

    cs.CV

    Optimizing Multi-Round Enhanced Training in Diffusion Models for Improved Preference Understanding

    Authors: Kun Li, Jianhui Wang, Yangfan He, Xinyuan Song, Ruoyu Wang, Hongyang He, Wenxin Zhang, Jiaqi Chen, Keqin Li, Sida Li, Miao Zhang, Tianyu Shi, Xueqian Wang

    Abstract: Generative AI has significantly changed industries by enabling text-driven image generation, yet challenges remain in achieving high-resolution outputs that align with fine-grained user preferences. Consequently, multi-round interactions are necessary to ensure the generated images meet expectations. Previous methods enhanced prompts via reward feedback but did not optimize over a multi-round dial… ▽ More

    Submitted 25 April, 2025; originally announced April 2025.

    Comments: arXiv admin note: substantial text overlap with arXiv:2503.17660

  36. arXiv:2504.17062  [pdf, other

    cs.GR cs.CV

    ePBR: Extended PBR Materials in Image Synthesis

    Authors: Yu Guo, Zhiqiang Lao, Xiyun Song, Yubin Zhou, Zongfang Lin, Heather Yu

    Abstract: Realistic indoor or outdoor image synthesis is a core challenge in computer vision and graphics. The learning-based approach is easy to use but lacks physical consistency, while traditional Physically Based Rendering (PBR) offers high realism but is computationally expensive. Intrinsic image representation offers a well-balanced trade-off, decomposing images into fundamental components (intrinsic… ▽ More

    Submitted 23 April, 2025; originally announced April 2025.

    Comments: 8 pages without references, 7 figures, accepted in CVPRW 2025

  37. arXiv:2504.16656  [pdf, ps, other

    cs.CV

    Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning

    Authors: Peiyu Wang, Yichen Wei, Yi Peng, Xiaokun Wang, Weijie Qiu, Wei Shen, Tianyidan Xie, Jiangbo Pei, Jianhao Zhang, Yunzhuo Hao, Xuchen Song, Yang Liu, Yahui Zhou

    Abstract: We present Skywork R1V2, a next-generation multimodal reasoning model and a major leap forward from its predecessor, Skywork R1V. At its core, R1V2 introduces a hybrid reinforcement learning paradigm that jointly leverages the Mixed Preference Optimization (MPO) and the Group Relative Policy Optimization (GRPO), which harmonizes reward-model guidance with rule-based strategies, thereby addressing… ▽ More

    Submitted 6 June, 2025; v1 submitted 23 April, 2025; originally announced April 2025.

  38. arXiv:2504.16524  [pdf, other

    cs.IR

    Modality Reliability Guided Multimodal Recommendation

    Authors: Xue Dong, Xuemeng Song, Na Zheng, Sicheng Zhao, Guiguang Ding

    Abstract: Multimodal recommendation faces an issue of the performance degradation that the uni-modal recommendation sometimes achieves the better performance. A possible reason is that the unreliable item modality data hurts the fusion result. Several existing studies have introduced weights for different modalities to reduce the contribution of the unreliable modality data in predicting the final user rati… ▽ More

    Submitted 23 April, 2025; originally announced April 2025.

  39. arXiv:2504.16016  [pdf, ps, other

    cs.CV

    Efficient Temporal Consistency in Diffusion-Based Video Editing with Adaptor Modules: A Theoretical Framework

    Authors: Xinyuan Song, Yangfan He, Sida Li, Jianhui Wang, Hongyang He, Xinhang Yuan, Ruoyu Wang, Jiaqi Chen, Keqin Li, Kuan Lu, Menghao Huo, Binxu Li, Pei Liu

    Abstract: Adapter-based methods are commonly used to enhance model performance with minimal additional complexity, especially in video editing tasks that require frame-to-frame consistency. By inserting small, learnable modules into pretrained diffusion models, these adapters can maintain temporal coherence without extensive retraining. Approaches that incorporate prompt learning with both shared and frame-… ▽ More

    Submitted 22 April, 2025; originally announced April 2025.

    Comments: arXiv admin note: substantial text overlap with arXiv:2501.04606

  40. arXiv:2504.15930  [pdf, other

    cs.LG cs.DC

    StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation

    Authors: Yinmin Zhong, Zili Zhang, Xiaoniu Song, Hanpeng Hu, Chao Jin, Bingyang Wu, Nuo Chen, Yukun Chen, Yu Zhou, Changyi Wan, Hongyu Zhou, Yimin Jiang, Yibo Zhu, Daxin Jiang

    Abstract: Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs). RL for LLMs involves two stages: generation and training. The LLM first generates samples online, which are then used to derive rewards for training. The conventional view holds that the colocated architecture, where the two stages share resources via temporal multiplexing, outperforms the dis… ▽ More

    Submitted 22 April, 2025; originally announced April 2025.

  41. arXiv:2504.15616  [pdf, other

    cs.LG cs.CV

    SocialMOIF: Multi-Order Intention Fusion for Pedestrian Trajectory Prediction

    Authors: Kai Chen, Xiaodong Zhao, Yujie Huang, Guoyu Fang, Xiao Song, Ruiping Wang, Ziyuan Wang

    Abstract: The analysis and prediction of agent trajectories are crucial for decision-making processes in intelligent systems, with precise short-term trajectory forecasting being highly significant across a range of applications. Agents and their social interactions have been quantified and modeled by researchers from various perspectives; however, substantial limitations exist in the current work due to th… ▽ More

    Submitted 22 April, 2025; originally announced April 2025.

    Comments: 11 pages,6 figures

  42. arXiv:2504.14868  [pdf, ps, other

    cs.CV

    Twin Co-Adaptive Dialogue for Progressive Image Generation

    Authors: Jianhui Wang, Yangfan He, Yan Zhong, Xinyuan Song, Jiayi Su, Yuheng Feng, Hongyang He, Wenyu Zhu, Xinhang Yuan, Kuan Lu, Menghao Huo, Miao Zhang, Keqin Li, Jiaqi Chen, Tianyu Shi, Xueqian Wang

    Abstract: Modern text-to-image generation systems have enabled the creation of remarkably realistic and high-quality visuals, yet they often falter when handling the inherent ambiguities in user prompts. In this work, we present Twin-Co, a framework that leverages synchronized, co-adaptive dialogue to progressively refine image generation. Instead of a static generation process, Twin-Co employs a dynamic, i… ▽ More

    Submitted 21 April, 2025; originally announced April 2025.

  43. arXiv:2504.13891  [pdf, other

    cs.HC cs.AI

    Mozualization: Crafting Music and Visual Representation with Multimodal AI

    Authors: Wanfang Xu, Lixiang Zhao, Haiwen Song, Xinheng Song, Zhaolin Lu, Yu Liu, Min Chen, Eng Gee Lim, Lingyun Yu

    Abstract: In this work, we introduce Mozualization, a music generation and editing tool that creates multi-style embedded music by integrating diverse inputs, such as keywords, images, and sound clips (e.g., segments from various pieces of music or even a playful cat's meow). Our work is inspired by the ways people express their emotions -- writing mood-descriptive poems or articles, creating drawings with… ▽ More

    Submitted 5 April, 2025; originally announced April 2025.

    Comments: 7 pages, 5 figures, CHI2025

  44. arXiv:2504.13825  [pdf, other

    cs.CL cs.LG

    Feature Alignment and Representation Transfer in Knowledge Distillation for Large Language Models

    Authors: Junjie Yang, Junhao Song, Xudong Han, Ziqian Bi, Tianyang Wang, Chia Xin Liang, Xinyuan Song, Yichao Zhang, Qian Niu, Benji Peng, Keyu Chen, Ming Liu

    Abstract: Knowledge distillation (KD) is a technique for transferring knowledge from complex teacher models to simpler student models, significantly enhancing model efficiency and accuracy. It has demonstrated substantial advancements in various applications including image classification, object detection, language modeling, text classification, and sentiment analysis. Recent innovations in KD methods, suc… ▽ More

    Submitted 18 April, 2025; originally announced April 2025.

  45. arXiv:2504.13650  [pdf, other

    cs.CV

    EyecareGPT: Boosting Comprehensive Ophthalmology Understanding with Tailored Dataset, Benchmark and Model

    Authors: Sijing Li, Tianwei Lin, Lingshuai Lin, Wenqiao Zhang, Jiang Liu, Xiaoda Yang, Juncheng Li, Yucheng He, Xiaohui Song, Jun Xiao, Yueting Zhuang, Beng Chin Ooi

    Abstract: Medical Large Vision-Language Models (Med-LVLMs) demonstrate significant potential in healthcare, but their reliance on general medical data and coarse-grained global visual understanding limits them in intelligent ophthalmic diagnosis. Currently, intelligent ophthalmic diagnosis faces three major challenges: (i) Data. The lack of deeply annotated, high-quality, multi-modal ophthalmic visual instr… ▽ More

    Submitted 18 April, 2025; originally announced April 2025.

  46. arXiv:2504.11773  [pdf, other

    cs.CV

    TacoDepth: Towards Efficient Radar-Camera Depth Estimation with One-stage Fusion

    Authors: Yiran Wang, Jiaqi Li, Chaoyi Hong, Ruibo Li, Liusheng Sun, Xiao Song, Zhe Wang, Zhiguo Cao, Guosheng Lin

    Abstract: Radar-Camera depth estimation aims to predict dense and accurate metric depth by fusing input images and Radar data. Model efficiency is crucial for this task in pursuit of real-time processing on autonomous vehicles and robotic platforms. However, due to the sparsity of Radar returns, the prevailing methods adopt multi-stage frameworks with intermediate quasi-dense depth, which are time-consuming… ▽ More

    Submitted 16 April, 2025; originally announced April 2025.

    Comments: Accepted by CVPR 2025 (Oral Presentation)

  47. arXiv:2504.10982  [pdf, other

    cs.CL cs.AI

    Exploring the Role of Knowledge Graph-Based RAG in Japanese Medical Question Answering with Small-Scale LLMs

    Authors: Yingjian Chen, Feiyang Li, Xingyu Song, Tianxiao Li, Zixin Xu, Xiujie Chen, Issey Sukeda, Irene Li

    Abstract: Large language models (LLMs) perform well in medical QA, but their effectiveness in Japanese contexts is limited due to privacy constraints that prevent the use of commercial models like GPT-4 in clinical settings. As a result, recent efforts focus on instruction-tuning open-source LLMs, though the potential of combining them with retrieval-augmented generation (RAG) remains underexplored. To brid… ▽ More

    Submitted 26 April, 2025; v1 submitted 15 April, 2025; originally announced April 2025.

    Comments: 10 pages

  48. arXiv:2504.09608  [pdf, other

    cs.CV

    ERL-MPP: Evolutionary Reinforcement Learning with Multi-head Puzzle Perception for Solving Large-scale Jigsaw Puzzles of Eroded Gaps

    Authors: Xingke Song, Xiaoying Yang, Chenglin Yao, Jianfeng Ren, Ruibin Bai, Xin Chen, Xudong Jiang

    Abstract: Solving jigsaw puzzles has been extensively studied. While most existing models focus on solving either small-scale puzzles or puzzles with no gap between fragments, solving large-scale puzzles with gaps presents distinctive challenges in both image understanding and combinatorial optimization. To tackle these challenges, we propose a framework of Evolutionary Reinforcement Learning with Multi-hea… ▽ More

    Submitted 13 April, 2025; originally announced April 2025.

    Comments: 9 pages, 5 figures

  49. arXiv:2504.08100  [pdf, other

    cs.CV

    ContrastiveGaussian: High-Fidelity 3D Generation with Contrastive Learning and Gaussian Splatting

    Authors: Junbang Liu, Enpei Huang, Dongxing Mao, Hui Zhang, Xinyuan Song, Yongxin Ni

    Abstract: Creating 3D content from single-view images is a challenging problem that has attracted considerable attention in recent years. Current approaches typically utilize score distillation sampling (SDS) from pre-trained 2D diffusion models to generate multi-view 3D representations. Although some methods have made notable progress by balancing generation speed and model quality, their performance is of… ▽ More

    Submitted 10 April, 2025; originally announced April 2025.

    Comments: Code will be available at https://github.com/YaNLlan-ljb/ContrastiveGaussian

  50. arXiv:2504.05599  [pdf, ps, other

    cs.CV cs.CL

    Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought

    Authors: Yi Peng, Peiyu Wang, Xiaokun Wang, Yichen Wei, Jiangbo Pei, Weijie Qiu, Ai Jian, Yunzhuo Hao, Jiachun Pan, Tianyidan Xie, Li Ge, Rongxian Zhuang, Xuchen Song, Yang Liu, Yahui Zhou

    Abstract: We introduce Skywork R1V, a multimodal reasoning model extending the an R1-series Large language models (LLM) to visual modalities via an efficient multimodal transfer method. Leveraging a lightweight visual projector, Skywork R1V facilitates seamless multimodal adaptation without necessitating retraining of either the foundational language model or the vision encoder. To strengthen visual-text al… ▽ More

    Submitted 9 June, 2025; v1 submitted 7 April, 2025; originally announced April 2025.