Showing 1–50 of 2,942 results for author: Li, L

Searching in archive cs.
  1. arXiv:2505.10348  [pdf, ps, other]

    cs.HC cs.SD eess.AS

    ListenNet: A Lightweight Spatio-Temporal Enhancement Nested Network for Auditory Attention Detection

    Authors: Cunhang Fan, Xiaoke Yang, Hongyu Zhang, Ying Chen, Lu Li, Jian Zhou, Zhao Lv

    Abstract: Auditory attention detection (AAD) aims to identify the direction of the attended speaker in multi-speaker environments from brain signals, such as Electroencephalography (EEG) signals. However, existing EEG-based AAD methods overlook the spatio-temporal dependencies of EEG signals, limiting their decoding and generalization abilities. To address these issues, this paper proposes a Lightweight Spa…

    Submitted 15 May, 2025; originally announced May 2025.

  2. arXiv:2505.09586  [pdf, ps, other]

    cs.LG

    Rhomboid Tiling for Geometric Graph Deep Learning

    Authors: Yipeng Zhang, Longlong Li, Kelin Xia

    Abstract: Graph Neural Networks (GNNs) have proven effective for learning from graph-structured data through their neighborhood-based message passing framework. Many hierarchical graph clustering pooling methods modify this framework by introducing clustering-based strategies, enabling the construction of more expressive and powerful models. However, all of these message passing frameworks heavily rely on th…

    Submitted 14 May, 2025; originally announced May 2025.

  3. arXiv:2505.08617  [pdf, ps, other]

    cs.CV

    OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning

    Authors: Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, Yu Cheng

    Abstract: While humans can flexibly leverage interactive visual cognition for complex problem-solving, enabling Large Vision-Language Models (LVLMs) to learn similarly adaptive behaviors with visual tools remains challenging. A significant hurdle is the current lack of standardized infrastructure, which hinders integrating diverse tools, generating rich interaction data, and training robust agents effective…

    Submitted 13 May, 2025; originally announced May 2025.

    Comments: Work in progress

  4. arXiv:2505.08215  [pdf, ps, other]

    cs.AI cs.SD eess.AS

    Unveiling the Best Practices for Applying Speech Foundation Models to Speech Intelligibility Prediction for Hearing-Impaired People

    Authors: Haoshuai Zhou, Boxuan Cao, Changgeng Mo, Linkai Li, Shan Xiang Wang

    Abstract: Speech foundation models (SFMs) have demonstrated strong performance across a variety of downstream tasks, including speech intelligibility prediction for hearing-impaired people (SIP-HI). However, optimizing SFMs for SIP-HI has been insufficiently explored. In this paper, we conduct a comprehensive study to identify key design factors affecting SIP-HI performance with 5 SFMs, focusing on encoder…

    Submitted 13 May, 2025; originally announced May 2025.

  5. arXiv:2505.07796  [pdf, other]

    cs.CL cs.AI cs.LG

    Learning Dynamics in Continual Pre-Training for Large Language Models

    Authors: Xingjin Wang, Howe Tissue, Lu Wang, Linjing Li, Daniel Dajun Zeng

    Abstract: Continual Pre-Training (CPT) has become a popular and effective method to apply strong foundation models to specific downstream tasks. In this work, we explore the learning dynamics throughout the CPT process for large language models. We specifically focus on how general and downstream domain performance evolves at each training step, with domain performance measured via validation losses. We hav…

    Submitted 12 May, 2025; originally announced May 2025.

    Comments: Accepted to ICML2025 (spotlight)

  6. arXiv:2505.07608  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining

    Authors: Xiaomi LLM-Core Team: Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai, et al. (40 additional authors not shown)

    Abstract: We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with additional Multi-Token Prediction objective…

    Submitted 12 May, 2025; originally announced May 2025.

  7. arXiv:2505.07062  [pdf, ps, other]

    cs.CV cs.AI

    Seed1.5-VL Technical Report

    Authors: Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, Jingji Chen, Jingjia Huang, Kang Lei, Liping Yuan, Lishu Luo, Pengfei Liu, Qinghao Ye, Rui Qian, Shen Yan, Shixiong Zhao, Shuai Peng, Shuangye Li, Sihang Yuan, Sijin Wu, Tianheng Cheng, et al. (172 additional authors not shown)

    Abstract: We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed of a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluati…

    Submitted 11 May, 2025; originally announced May 2025.

  8. arXiv:2505.06680  [pdf, ps, other]

    cs.AI cs.HC cs.LG eess.SY physics.soc-ph

    A Survey on Data-Driven Modeling of Human Drivers' Lane-Changing Decisions

    Authors: Linxuan Huang, Dong-Fan Xie, Li Li, Zhengbing He

    Abstract: Lane-changing (LC) behavior, a critical yet complex driving maneuver, significantly influences driving safety and traffic dynamics. Traditional analytical LC decision (LCD) models, while effective in specific environments, often oversimplify behavioral heterogeneity and complex interactions, limiting their capacity to capture real LCD. Data-driven approaches address these gaps by leveraging rich e…

    Submitted 10 May, 2025; originally announced May 2025.

  9. arXiv:2505.06302  [pdf, other]

    cs.LG cs.AI

    QiMeng-TensorOp: Automatically Generating High-Performance Tensor Operators with Hardware Primitives

    Authors: Xuzhi Zhang, Shaohui Peng, Qirui Zhou, Yuanbo Wen, Qi Guo, Ruizhi Chen, Xinguo Zhu, Weiqiang Xiong, Haixin Chen, Congying Ma, Ke Gao, Chen Zhao, Yanjun Wu, Yunji Chen, Ling Li

    Abstract: Computation-intensive tensor operators constitute over 90% of the computations in Large Language Models (LLMs) and Deep Neural Networks. Automatically and efficiently generating high-performance tensor operators with hardware primitives is crucial for diverse and ever-evolving hardware architectures like RISC-V, ARM, and GPUs, as manually optimized implementation takes at least months and lacks po…

    Submitted 7 May, 2025; originally announced May 2025.

    Comments: 10 pages, 5 figures

    ACM Class: I.2.2

  10. arXiv:2505.06283  [pdf, other]

    cs.LG q-bio.QM stat.ML

    Soft causal learning for generalized molecule property prediction: An environment perspective

    Authors: Limin Li, Kuo Yang, Wenjie Du, Pengkun Wang, Zhengyang Zhou, Yang Wang

    Abstract: Learning on molecule graphs has become an increasingly important topic in AI for science, which takes full advantage of AI to facilitate scientific discovery. Existing solutions on modeling molecules utilize Graph Neural Networks (GNNs) to achieve representations but they mostly fail to adapt models to out-of-distribution (OOD) samples. Although recent advances on OOD-oriented graph learning have…

    Submitted 7 May, 2025; originally announced May 2025.

    Comments: 23 pages, 7 figures, 3 tables

    ACM Class: I.2.4

  11. arXiv:2505.05472  [pdf, other]

    cs.CV

    Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

    Authors: Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, Weilin Huang

    Abstract: Recent progress in unified models for image understanding and generation has been impressive, yet most approaches remain limited to single-modal generation conditioned on multiple modalities. In this paper, we present Mogao, a unified framework that advances this paradigm by enabling interleaved multi-modal generation through a causal approach. Mogao integrates a set of key technical improvements…

    Submitted 11 May, 2025; v1 submitted 8 May, 2025; originally announced May 2025.

    Comments: Mogao Technical Report

  12. arXiv:2505.05309  [pdf, other]

    eess.IV cs.CV

    Augmented Deep Contexts for Spatially Embedded Video Coding

    Authors: Yifan Bian, Chuanbo Tang, Li Li, Dong Liu

    Abstract: Most Neural Video Codecs (NVCs) only employ temporal references to generate temporal-only contexts and latent prior. These temporal-only NVCs fail to handle large motions or emerging objects due to limited contexts and misaligned latent prior. To relieve the limitations, we propose a Spatially Embedded Video Codec (SEVC), in which the low-resolution video is compressed for spatial references. Firs…

    Submitted 8 May, 2025; originally announced May 2025.

    Comments: 15 pages, CVPR

  13. arXiv:2505.05035  [pdf, ps, other]

    cs.IR

    Divide-and-Conquer: Cold-Start Bundle Recommendation via Mixture of Diffusion Experts

    Authors: Ming Li, Lin Li, Xiaohui Tao, Dong Zhang, Jimmy Xiangji Huang

    Abstract: Cold-start bundle recommendation focuses on modeling new bundles with insufficient information to provide recommendations. Advanced bundle recommendation models usually learn bundle representations from multiple views (e.g., interaction view) at both the bundle and item levels. Consequently, the cold-start problem for bundles is more challenging than that for traditional items due to the dual-leve…

    Submitted 8 May, 2025; originally announced May 2025.

  14. arXiv:2505.04519  [pdf, other]

    cs.CL

    Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs

    Authors: Yehui Tang, Yichun Yin, Yaoyuan Wang, Hang Zhou, Yu Pan, Wei Guo, Ziyang Zhang, Miao Rang, Fangcheng Liu, Naifu Zhang, Binghan Li, Yonghan Dong, Xiaojun Meng, Yasheng Wang, Dong Li, Yin Li, Dandan Tu, Can Chen, Youliang Yan, Fisher Yu, Ruiming Tang, Yunhe Wang, Botian Huang, Bo Wang, Boxiao Liu, et al. (49 additional authors not shown)

    Abstract: Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of most capable language models. However, the massive model scale poses significant challenges for the underlying software and hardware systems. In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs. The key goals are better usage of the computing r…

    Submitted 7 May, 2025; originally announced May 2025.

  15. BuildingBlock: A Hybrid Approach for Structured Building Generation

    Authors: Junming Huang, Chi Wang, Letian Li, Changxin Huang, Qiang Dai, Weiwei Xu

    Abstract: Three-dimensional building generation is vital for applications in gaming, virtual reality, and digital twins, yet current methods face challenges in producing diverse, structured, and hierarchically coherent buildings. We propose BuildingBlock, a hybrid approach that integrates generative models, procedural content generation (PCG), and large language models (LLMs) to address these limitations. S…

    Submitted 6 May, 2025; originally announced May 2025.

    Comments: SIGGRAPH 2025 (Conference Track)

  16. arXiv:2505.03739  [pdf, other]

    cs.CL cs.AI

    VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model

    Authors: Zuwei Long, Yunhang Shen, Chaoyou Fu, Heting Gao, Lijiang Li, Peixian Chen, Mengdan Zhang, Hang Shao, Jian Li, Jinlong Peng, Haoyu Cao, Ke Li, Rongrong Ji, Xing Sun

    Abstract: With the growing requirement for natural human-computer interaction, speech-based systems receive increasing attention as speech is one of the most common forms of daily communication. However, the existing speech models still experience high latency when generating the first audio token during streaming, which poses a significant bottleneck for deployment. To address this issue, we propose VITA-A…

    Submitted 6 May, 2025; originally announced May 2025.

    Comments: Training and Inference Codes: https://github.com/VITA-MLLM/VITA-Audio

  17. arXiv:2505.03463  [pdf, other]

    cs.CV physics.med-ph

    Nonperiodic dynamic CT reconstruction using backward-warping INR with regularization of diffeomorphism (BIRD)

    Authors: Muge Du, Zhuozhao Zheng, Wenying Wang, Guotao Quan, Wuliang Shi, Le Shen, Li Zhang, Liang Li, Yinong Liu, Yuxiang Xing

    Abstract: Dynamic computed tomography (CT) reconstruction faces significant challenges in addressing motion artifacts, particularly for nonperiodic rapid movements such as cardiac imaging with fast heart rates. Traditional methods struggle with the extreme limited-angle problems inherent in nonperiodic cases. Deep learning methods have improved performance but face generalization challenges. Recent implicit…

    Submitted 6 May, 2025; originally announced May 2025.

  18. arXiv:2505.03195  [pdf, other]

    cs.AR

    QiMeng-CPU-v2: Automated Superscalar Processor Design by Learning Data Dependencies

    Authors: Shuyao Cheng, Rui Zhang, Wenkai He, Pengwei Jin, Chongxiao Li, Zidong Du, Xing Hu, Yifan Hao, Guanglin Xu, Yuanbo Wen, Ling Li, Qi Guo, Yunji Chen

    Abstract: Automated processor design, which can significantly reduce human efforts and accelerate design cycles, has received considerable attention. While recent advancements have automatically designed single-cycle processors that execute one instruction per cycle, their performance cannot compete with modern superscalar processors that execute multiple instructions per cycle. Previous methods fail on sup…

    Submitted 6 May, 2025; originally announced May 2025.

    Comments: 8 pages, 3 figures

  19. arXiv:2505.02831  [pdf, other]

    cs.CV

    No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves

    Authors: Dengyang Jiang, Mengmeng Wang, Liuzhuozheng Li, Lei Zhang, Haoyu Wang, Wei Wei, Guang Dai, Yanning Zhang, Jingdong Wang

    Abstract: Recent studies have demonstrated that learning a meaningful internal representation can both accelerate generative training and enhance the generation quality of diffusion transformers. However, existing approaches need to either introduce an external and complex representation training framework or rely on a large-scale, pre-trained representation foundation model to provide representation…

    Submitted 13 May, 2025; v1 submitted 5 May, 2025; originally announced May 2025.

    Comments: Self-Representation Alignment for Diffusion Transformers. Code: https://github.com/vvvvvjdy/SRA

  20. arXiv:2505.02784  [pdf, other]

    cs.CV

    Advances in Automated Fetal Brain MRI Segmentation and Biometry: Insights from the FeTA 2024 Challenge

    Authors: Vladyslav Zalevskyi, Thomas Sanchez, Misha Kaandorp, Margaux Roulet, Diego Fajardo-Rojas, Liu Li, Jana Hutter, Hongwei Bran Li, Matthew Barkovich, Hui Ji, Luca Wilhelmi, Aline Dändliker, Céline Steger, Mériam Koob, Yvan Gomez, Anton Jakovčić, Melita Klaić, Ana Adžić, Pavel Marković, Gracia Grabarić, Milan Rados, Jordina Aviles Verdera, Gregor Kasprian, Gregor Dovjak, Raphael Gaubert-Rachmühl, et al. (45 additional authors not shown)

    Abstract: Accurate fetal brain tissue segmentation and biometric analysis are essential for studying brain development in utero. The FeTA Challenge 2024 advanced automated fetal brain MRI analysis by introducing biometry prediction as a new task alongside tissue segmentation. For the first time, our diverse multi-centric test set included data from a new low-field (0.55T) MRI dataset. Evaluation metrics wer…

    Submitted 8 May, 2025; v1 submitted 5 May, 2025; originally announced May 2025.

  21. arXiv:2505.01263  [pdf, other]

    cs.MM cs.CV cs.SD eess.AS

    FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing

    Authors: Gaoxiang Cong, Liang Li, Jiadong Pan, Zhedong Zhang, Amin Beheshti, Anton van den Hengel, Yuankai Qi, Qingming Huang

    Abstract: Movie Dubbing aims to convert scripts into speeches that align with the given movie clip in both temporal and emotional aspects while preserving the vocal timbre of a given brief reference audio. Existing methods focus primarily on reducing the word error rate while ignoring the importance of lip-sync and acoustic quality. To address these issues, we propose a large language model (LLM) based flow…

    Submitted 2 May, 2025; originally announced May 2025.

  22. arXiv:2504.21619  [pdf, other]

    cs.RO

    3D Hand-Eye Calibration for Collaborative Robot Arm: Look at Robot Base Once

    Authors: Leihui Li, Lixuepiao Wan, Volker Krueger, Xuping Zhang

    Abstract: Hand-eye calibration is a common problem in the field of collaborative robotics, involving the determination of the transformation matrix between the visual sensor and the robot flange to enable vision-based robotic tasks. However, this process typically requires multiple movements of the robot arm and an external calibration object, making it both time-consuming and inconvenient, especially in sc…

    Submitted 9 May, 2025; v1 submitted 30 April, 2025; originally announced April 2025.

    Comments: updated

  23. arXiv:2504.21248  [pdf, other]

    cs.CV

    Multi-modal Transfer Learning for Dynamic Facial Emotion Recognition in the Wild

    Authors: Ezra Engel, Lishan Li, Chris Hudy, Robert Schleusner

    Abstract: Facial expression recognition (FER) is a subset of computer vision with important applications for human-computer-interaction, healthcare, and customer service. FER represents a challenging problem-space because accurate classification requires a model to differentiate between subtle changes in facial features. In this paper, we examine the use of multi-modal transfer learning to improve performan…

    Submitted 29 April, 2025; originally announced April 2025.

    Comments: 8 pages, 6 figures

  24. arXiv:2504.20409  [pdf, other]

    cs.CV

    GarmentX: Autoregressive Parametric Representations for High-Fidelity 3D Garment Generation

    Authors: Jingfeng Guo, Jinnan Chen, Weikai Chen, Zhenyu Sun, Lanjiong Li, Baozhu Zhao, Lingting Zhu, Xin Wang, Qi Liu

    Abstract: This work presents GarmentX, a novel framework for generating diverse, high-fidelity, and wearable 3D garments from a single input image. Traditional garment reconstruction methods directly predict 2D pattern edges and their connectivity, an overly unconstrained approach that often leads to severe self-intersections and physically implausible garment structures. In contrast, GarmentX introduces a…

    Submitted 29 April, 2025; originally announced April 2025.

  25. arXiv:2504.20113  [pdf, other]

    cs.AI cs.LG

    Transforming Evidence Synthesis: A Systematic Review of the Evolution of Automated Meta-Analysis in the Age of AI

    Authors: Lingbo Li, Anuradha Mathrani, Teo Susnjak

    Abstract: Exponential growth in scientific literature has heightened the demand for efficient evidence-based synthesis, driving the rise of the field of Automated Meta-analysis (AMA) powered by natural language processing and machine learning. This PRISMA systematic review introduces a structured framework for assessing the current state of AMA, based on screening 978 papers from 2006 to 2024, and analyzing…

    Submitted 27 April, 2025; originally announced April 2025.

  26. arXiv:2504.20073  [pdf, other]

    cs.LG cs.AI cs.CL

    RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

    Authors: Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Monica Lam, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, Manling Li

    Abstract: Training large language models (LLMs) as interactive agents presents unique challenges including long-horizon decision making and interacting with stochastic environment feedback. While reinforcement learning (RL) has enabled progress in static tasks, multi-turn agent RL training remains underexplored. We propose StarPO (State-Thinking-Actions-Reward Policy Optimization), a general framework for t…

    Submitted 24 April, 2025; originally announced April 2025.

  27. arXiv:2504.20016  [pdf, other]

    cs.HC cs.CY cs.MM

    Applying LLM-Powered Virtual Humans to Child Interviews in Child-Centered Design

    Authors: Linshi Li, Hanlin Cai

    Abstract: In child-centered design, directly engaging children is crucial for deeply understanding their experiences. However, current research often prioritizes adult perspectives, as interviewing children involves unique challenges such as environmental sensitivities and the need for trust-building. AI-powered virtual humans (VHs) offer a promising approach to facilitate engaging and multimodal interactio…

    Submitted 28 April, 2025; originally announced April 2025.

    Comments: This paper has been accepted as a Work-in-Progress (WiP) paper in the 24th annual ACM Interaction Design and Children (IDC) Conference

  28. arXiv:2504.19838  [pdf, other]

    cs.HC

    LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects

    Authors: Guangyi Liu, Pengxiang Zhao, Liang Liu, Yaxuan Guo, Han Xiao, Weifeng Lin, Yuxiang Chai, Yue Han, Shuai Ren, Hao Wang, Xiaoyu Liang, Wenhao Wang, Tianze Wu, Linghao Li, Hao Wang, Guanjing Xiong, Yong Liu, Hongsheng Li

    Abstract: With the rapid rise of large language models (LLMs), phone automation has undergone transformative changes. This paper systematically reviews LLM-driven phone GUI agents, highlighting their evolution from script-based automation to intelligent, adaptive systems. We first contextualize key challenges, (i) limited generality, (ii) high maintenance overhead, and (iii) weak intent comprehension, and s…

    Submitted 28 April, 2025; originally announced April 2025.

    Comments: 37 pages, 10 figures, 7 tables, Project Homepage: https://github.com/PhoneLLM/Awesome-LLM-Powered-Phone-GUI-Agents

  29. arXiv:2504.19110  [pdf, other]

    cs.CL

    APE-Bench I: Towards File-level Automated Proof Engineering of Formal Math Libraries

    Authors: Huajian Xin, Luming Li, Xiaoran Jin, Jacques Fleuriot, Wenda Li

    Abstract: Recent progress in large language models (LLMs) has shown promise in formal theorem proving, yet existing benchmarks remain limited to isolated, static proof tasks, failing to capture the iterative, engineering-intensive workflows of real-world formal mathematics libraries. Motivated by analogous advances in software engineering, we introduce the paradigm of Automated Proof Engineering (APE), whic…

    Submitted 27 April, 2025; originally announced April 2025.

  30. "I Would Have Written My Code Differently": Beginners Struggle to Understand LLM-Generated Code

    Authors: Yangtian Zi, Luisa Li, Arjun Guha, Carolyn Jane Anderson, Molly Q Feldman

    Abstract: Large language models (LLMs) are being increasingly adopted for programming work. Prior work shows that while LLMs accelerate task completion for professional programmers, beginning programmers struggle to prompt models effectively. However, prompting is just half of the code generation process -- when code is generated, it must be read, evaluated, and integrated (or rejected). How accessible are…

    Submitted 26 April, 2025; originally announced April 2025.

    Comments: To appear in 33rd ACM International Conference on the Foundations of Software Engineering (FSE Companion '25), June 23-28, 2025, Trondheim, Norway

  31. arXiv:2504.18398  [pdf, other]

    eess.IV cs.CV

    Partition Map-Based Fast Block Partitioning for VVC Inter Coding

    Authors: Xinmin Feng, Zhuoyuan Li, Li Li, Dong Liu, Feng Wu

    Abstract: Among the new techniques of Versatile Video Coding (VVC), the quadtree with nested multi-type tree (QT+MTT) block structure yields significant coding gains by providing more flexible block partitioning patterns. However, the recursive partition search in the VVC encoder increases the encoder complexity substantially. To address this issue, we propose a partition map-based algorithm to pursue fast…

    Submitted 25 April, 2025; originally announced April 2025.

    Comments: 23 pages, 26 figures. Project page: https://github.com/ustc-ivclab/IPM

  32. arXiv:2504.18158  [pdf, other]

    cs.CV

    E-InMeMo: Enhanced Prompting for Visual In-Context Learning

    Authors: Jiahao Zhang, Bowen Wang, Hong Liu, Liangzhi Li, Yuta Nakashima, Hajime Nagahara

    Abstract: Large-scale models trained on extensive datasets have become the standard due to their strong generalizability across diverse tasks. In-context learning (ICL), widely used in natural language processing, leverages these models by providing task-specific prompts without modifying their parameters. This paradigm is increasingly being adapted for computer vision, where models receive an input-output…

    Submitted 25 April, 2025; originally announced April 2025.

    Comments: Preprint

  33. arXiv:2504.17343  [pdf, other]

    cs.CV

    TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos

    Authors: Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, Lingpeng Kong, Qi Liu, Yuanxing Zhang, Xu Sun

    Abstract: The rapid growth of online video platforms, particularly live streaming services, has created an urgent need for real-time video understanding systems. These systems must process continuous video streams and respond to user queries instantaneously, presenting unique challenges for current Video Large Language Models (VideoLLMs). While existing VideoLLMs excel at processing complete videos, they fa…

    Submitted 24 April, 2025; originally announced April 2025.

  34. arXiv:2504.17261  [pdf, other]

    cs.LG cs.AI

    Symbolic Representation for Any-to-Any Generative Tasks

    Authors: Jiaqi Chen, Xiaoye Zhu, Yue Wang, Tianyang Liu, Xinhui Chen, Ying Chen, Chak Tou Leong, Yifei Ke, Joseph Liu, Yiwen Yuan, Julian McAuley, Li-jia Li

    Abstract: We propose a symbolic generative task description language and a corresponding inference engine capable of representing arbitrary multimodal tasks as structured symbolic flows. Unlike conventional generative models that rely on large-scale training and implicit neural representations to learn cross-modal mappings, often at high computational cost and with limited flexibility, our framework introdu…

    Submitted 24 April, 2025; originally announced April 2025.

  35. arXiv:2504.16586  [pdf, other]

    cs.MM

    Learning Switchable Priors for Neural Image Compression

    Authors: Haotian Zhang, Yuqi Li, Li Li, Dong Liu

    Abstract: Neural image compression (NIC) usually adopts a predefined family of probabilistic distributions as the prior of the latent variables, and meanwhile relies on entropy models to estimate the parameters for the probabilistic family. More complex probabilistic distributions may fit the latent variables more accurately, but also incur higher complexity of the entropy models, limiting their practical v…

    Submitted 23 April, 2025; originally announced April 2025.

    Comments: 18 pages, 15 figures

  36. arXiv:2504.16546  [pdf, other]

    cs.CY cs.HC

    Tinkering Against Scaling

    Authors: Bolun Zhang, Yang Shen, Linzhuo Li, Yu Ji, Di Wu, Tongyu Wu, Lianghao Dai

    Abstract: The ascent of scaling in artificial intelligence research has revolutionized the field over the past decade, yet it presents significant challenges for academic researchers, particularly in computational social science and critical algorithm studies. The dominance of large language models, characterized by their extensive parameters and costly training processes, creates a disparity where only ind…

    Submitted 23 April, 2025; originally announced April 2025.

    Comments: 43 pages, 4 figures

  37. arXiv:2504.16273  [pdf, other]

    cs.AI cs.HC

    Investigating LLMs in Clinical Triage: Promising Capabilities, Persistent Intersectional Biases

    Authors: Joseph Lee, Tianqi Shang, Jae Young Baik, Duy Duong-Tran, Shu Yang, Lingyao Li, Li Shen

    Abstract: Large Language Models (LLMs) have shown promise in clinical decision support, yet their application to triage remains underexplored. We systematically investigate the capabilities of LLMs in emergency department triage through two key dimensions: (1) robustness to distribution shifts and missing data, and (2) counterfactual analysis of intersectional biases across sex and race. We assess multiple…

    Submitted 22 April, 2025; originally announced April 2025.

    Comments: Accepted to GenAI4Health Workshop @ AAAI 2025

  38. arXiv:2504.15817  [pdf, other]

    cs.CR cs.AR

    EFFACT: A Highly Efficient Full-Stack FHE Acceleration Platform

    Authors: Yi Huang, Xinsheng Gong, Xiangyu Kong, Dibei Chen, Jianfeng Zhu, Wenping Zhu, Liangwei Li, Mingyu Gao, Shaojun Wei, Aoyang Zhang, Leibo Liu

    Abstract: Fully Homomorphic Encryption (FHE) is a set of powerful cryptographic schemes that allows computation to be performed directly on encrypted data with an unlimited depth. Despite FHE's promise in privacy-preserving computing, in most FHE schemes the ciphertext generally blows up thousands of times compared to the original message, and the massive amount of data load from off-chip memory for boot…

    Submitted 22 April, 2025; originally announced April 2025.

    Comments: Accepted by HPCA 2025

  39. arXiv:2504.15720  [pdf, other]

    cs.DC

    SeaLLM: Service-Aware and Latency-Optimized Resource Sharing for Large Language Model Inference

    Authors: Yihao Zhao, Jiadun Chen, Peng Sun, Lei Li, Xuanzhe Liu, Xin Jin

    Abstract: Large language models (LLMs) with different architectures and sizes have been developed. Serving each LLM with dedicated GPUs leads to resource waste and service inefficiency due to the varying demand of LLM requests. A common practice is to share multiple LLMs. However, existing sharing systems either do not consider the autoregressive pattern of LLM services, or only focus on improving the throu…

    Submitted 22 April, 2025; originally announced April 2025.

  40. arXiv:2504.15681  [pdf, other]

    cs.CV

    Vidi: Large Multimodal Models for Video Understanding and Editing

    Authors: Vidi Team, Celong Liu, Chia-Wen Kuo, Dawei Du, Fan Chen, Guang Chen, Jiamin Yuan, Lingxi Zhang, Lu Guo, Lusha Li, Longyin Wen, Qingyu Chen, Rachel Deng, Sijie Zhu, Stuart Siew, Tong Jin, Wei Lu, Wen Zhong, Xiaohui Shen, Xin Gu, Xing Mei, Xueqiong Qu

    Abstract: Humans naturally share information with those they are connected to, and video has become one of the dominant mediums for communication and expression on the Internet. To support the creation of high-quality large-scale video content, a modern pipeline requires a comprehensive understanding of both the raw input materials (e.g., the unedited footage captured by cameras) and the editing components…

    Submitted 24 April, 2025; v1 submitted 22 April, 2025; originally announced April 2025.

  41. arXiv:2504.15524  [pdf, other]

    cs.CL cs.AI

    IPBench: Benchmarking the Knowledge of Large Language Models in Intellectual Property

    Authors: Qiyao Wang, Guhong Chen, Hongbo Wang, Huaren Liu, Minghui Zhu, Zhifei Qin, Linwei Li, Yilin Yue, Shiqiang Wang, Jiayan Li, Yihang Wu, Ziqiang Liu, Longze Chen, Run Luo, Liyang Fan, Jiaming Li, Lei Zhang, Kan Xu, Hongfei Lin, Hamid Alinejad-Rokny, Shiwen Ni, Yuan Lin, Min Yang

    Abstract: Intellectual Property (IP) is a unique domain that integrates technical and legal knowledge, making it inherently complex and knowledge-intensive. As large language models (LLMs) continue to advance, they show great potential for processing IP tasks, enabling more efficient analysis, understanding, and generation of IP-related content. However, existing datasets and benchmarks either focus narrowl…

    Submitted 21 April, 2025; originally announced April 2025.

    Comments: 89 pages, 75 figures, 55 tables

  42. arXiv:2504.15275  [pdf, other]

    cs.AI cs.LG

    Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning

    Authors: Jie Cheng, Ruixi Qiao, Lijun Li, Chao Guo, Junle Wang, Gang Xiong, Yisheng Lv, Fei-Yue Wang

    Abstract: Process reward models (PRMs) have proven effective for test-time scaling of Large Language Models (LLMs) on challenging reasoning tasks. However, reward hacking issues with PRMs limit their successful application in reinforcement fine-tuning. In this paper, we identify the main cause of PRM-induced reward hacking: the canonical summation-form credit assignment in reinforcement learning (RL), which…

    Submitted 21 April, 2025; originally announced April 2025.

  43. arXiv:2504.15192  [pdf]

    cs.CV cs.AI

    Breast density in MRI: an AI-based quantification and relationship to assessment in mammography

    Authors: Yaqian Chen, Lin Li, Hanxue Gu, Haoyu Dong, Derek L. Nguyen, Allan D. Kirk, Maciej A. Mazurowski, E. Shelley Hwang

    Abstract: Mammographic breast density is a well-established risk factor for breast cancer. Recently there has been interest in breast MRI as an adjunct to mammography, as this modality provides an orthogonal and highly quantitative assessment of breast tissue. However, its 3D nature poses analytic challenges related to delineating and aggregating complex structures across slices. Here, we applied an in-hous…

    Submitted 21 April, 2025; originally announced April 2025.

    Comments: 13 pages, 5 figures

  44. arXiv:2504.14642  [pdf, other]

    cs.CV

    Relation-R1: Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relational Comprehension

    Authors: Lin Li, Wei Chen, Jiahui Li, Long Chen

    Abstract: Recent advances in multi-modal large language models (MLLMs) have significantly improved object-level grounding and region captioning, but remain limited in visual relation understanding (e.g., scene graph generation), particularly in modeling N-ary relationships that identify multiple semantic roles among an action event. Such a lack of semantic dependencies modeling among multi-…

    Submitted 20 April, 2025; originally announced April 2025.

    Comments: Ongoing project

  45. arXiv:2504.14603  [pdf, other]

    cs.AI cs.HC cs.OS

    UFO2: The Desktop AgentOS

    Authors: Chaoyun Zhang, He Huang, Chiming Ni, Jian Mu, Si Qin, Shilin He, Lu Wang, Fangkai Yang, Pu Zhao, Chao Du, Liqun Li, Yu Kang, Zhao Jiang, Suzhen Zheng, Rujia Wang, Jiaxu Qian, Minghua Ma, Jian-Guang Lou, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang

    Abstract: Recent Computer-Using Agents (CUAs), powered by multimodal large language models (LLMs), offer a promising direction for automating complex desktop workflows through natural language. However, most existing CUAs remain conceptual prototypes, hindered by shallow OS integration, fragile screenshot-based interaction, and disruptive execution. We present UFO2, a multiagent AgentOS for Windows deskto…

    Submitted 25 April, 2025; v1 submitted 20 April, 2025; originally announced April 2025.

    Comments: The source code of UFO2 is publicly available at https://github.com/microsoft/UFO/, with comprehensive documentation provided at https://microsoft.github.io/UFO/

  46. arXiv:2504.14582  [pdf, other]

    cs.CV

    NTIRE 2025 Challenge on Image Super-Resolution ($\times$4): Methods and Results

    Authors: Zheng Chen, Kai Liu, Jue Gong, Jingkai Wang, Lei Sun, Zongwei Wu, Radu Timofte, Yulun Zhang, Xiangyu Kong, Xiaoxuan Yu, Hyunhee Park, Suejin Han, Hakjae Jeon, Dafeng Zhang, Hyung-Ju Chun, Donghun Ryou, Inju Ha, Bohyung Han, Lu Zhao, Yuyi Zhang, Pengyu Yan, Jiawei Hu, Pengwei Liu, Fengjun Guo, Hongyuan Yu, et al. (86 additional authors not shown)

    Abstract: This paper presents the NTIRE 2025 image super-resolution ($\times$4) challenge, one of the associated competitions of the 10th NTIRE Workshop at CVPR 2025. The challenge aims to recover high-resolution (HR) images from low-resolution (LR) counterparts generated through bicubic downsampling with a $\times$4 scaling factor. The objective is to develop effective network designs or solutions that ach…

    Submitted 28 April, 2025; v1 submitted 20 April, 2025; originally announced April 2025.

    Comments: NTIRE 2025 webpage: https://www.cvlai.net/ntire/2025. Code: https://github.com/zhengchen1999/NTIRE2025_ImageSR_x4

  47. arXiv:2504.14560  [pdf, other]

    cs.AR cs.AI

    ReasoningV: Efficient Verilog Code Generation with Adaptive Hybrid Reasoning Model

    Authors: Haiyan Qin, Zhiwei Xie, Jingjing Li, Liangchen Li, Xiaotong Feng, Junzhan Liu, Wang Kang

    Abstract: Large Language Models (LLMs) have advanced Verilog code generation significantly, yet face challenges in data quality, reasoning capabilities, and computational efficiency. This paper presents ReasoningV, a novel model employing a hybrid reasoning strategy that integrates trained intrinsic capabilities with dynamic inference adaptation for Verilog code generation. Our framework introduces three co…

    Submitted 30 April, 2025; v1 submitted 20 April, 2025; originally announced April 2025.

    Comments: 9 pages, 4 figures

  48. arXiv:2504.14512  [pdf]

    cs.DL

    Revisiting the field normalization approaches/practices

    Authors: Xinyue Lu, Li Li, Zhesi Shen

    Abstract: Field normalization plays a crucial role in scientometrics to ensure fair comparisons across different disciplines. In this paper, we revisit the effectiveness of several widely used field normalization methods. Our findings indicate that source-side normalization (as employed in SNIP) does not fully eliminate citation bias across different fields and the imbalanced paper growth rates across field…

    Submitted 20 April, 2025; originally announced April 2025.

  49. arXiv:2504.13914  [pdf, other]

    cs.CL

    Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning

    Authors: ByteDance Seed, Jiaze Chen, Tiantian Fan, Xin Liu, Lingjun Liu, Zhiqi Lin, Mingxuan Wang, Chengyi Wang, Xiangpeng Wei, Wenyuan Xu, Yufeng Yuan, Yu Yue, Lin Yan, Qiying Yu, Xiaochen Zuo, Chi Zhang, Ruofei Zhu, Zhecheng An, Zhihao Bai, Yu Bao, Xingyan Bin, Jiangjie Chen, Feng Chen, Hongmin Chen, et al. (249 additional authors not shown)

    Abstract: We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For in…

    Submitted 29 April, 2025; v1 submitted 10 April, 2025; originally announced April 2025.

  50. arXiv:2504.13517  [pdf, other]

    cs.AI

    Optimizing Electric Vehicle Charging Station Locations: A Data-driven System with Multi-source Fusion

    Authors: Lihuan Li, Du Yin, Hao Xue, David Lillo-Trynes, Flora Salim

    Abstract: With the growing electric vehicles (EVs) charging demand, urban planners face the challenges of providing charging infrastructure at optimal locations. For example, range anxiety during long-distance travel and the inadequate distribution of residential charging stations are the major issues many cities face. To achieve reasonable estimation and deployment of the charging demand, we develop a data…

    Submitted 18 April, 2025; originally announced April 2025.

    Comments: 4-page short paper