-
ListenNet: A Lightweight Spatio-Temporal Enhancement Nested Network for Auditory Attention Detection
Authors:
Cunhang Fan,
Xiaoke Yang,
Hongyu Zhang,
Ying Chen,
Lu Li,
Jian Zhou,
Zhao Lv
Abstract:
Auditory attention detection (AAD) aims to identify the direction of the attended speaker in multi-speaker environments from brain signals, such as Electroencephalography (EEG) signals. However, existing EEG-based AAD methods overlook the spatio-temporal dependencies of EEG signals, limiting their decoding and generalization abilities. To address these issues, this paper proposes a Lightweight Spa…
▽ More
Auditory attention detection (AAD) aims to identify the direction of the attended speaker in multi-speaker environments from brain signals, such as Electroencephalography (EEG) signals. However, existing EEG-based AAD methods overlook the spatio-temporal dependencies of EEG signals, limiting their decoding and generalization abilities. To address these issues, this paper proposes a Lightweight Spatio-Temporal Enhancement Nested Network (ListenNet) for AAD. The ListenNet has three key components: Spatio-temporal Dependency Encoder (STDE), Multi-scale Temporal Enhancement (MSTE), and Cross-Nested Attention (CNA). The STDE reconstructs dependencies between consecutive time windows across channels, improving the robustness of dynamic pattern extraction. The MSTE captures temporal features at multiple scales to represent both fine-grained and long-range temporal patterns. In addition, the CNA integrates hierarchical features more effectively through novel dynamic attention mechanisms to capture deep spatio-temporal correlations. Experimental results on three public datasets demonstrate the superiority of ListenNet over state-of-the-art methods in both subject-dependent and challenging subject-independent settings, while reducing the trainable parameter count by approximately 7 times. Code is available at:https://github.com/fchest/ListenNet.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
Rhomboid Tiling for Geometric Graph Deep Learning
Authors:
Yipeng Zhang,
Longlong Li,
Kelin Xia
Abstract:
Graph Neural Networks (GNNs) have proven effective for learning from graph-structured data through their neighborhood-based message passing framework. Many hierarchical graph clustering pooling methods modify this framework by introducing clustering-based strategies, enabling the construction of more expressive and powerful models. However, all of these message passing framework heavily rely on th…
▽ More
Graph Neural Networks (GNNs) have proven effective for learning from graph-structured data through their neighborhood-based message passing framework. Many hierarchical graph clustering pooling methods modify this framework by introducing clustering-based strategies, enabling the construction of more expressive and powerful models. However, all of these message passing framework heavily rely on the connectivity structure of graphs, limiting their ability to capture the rich geometric features inherent in geometric graphs. To address this, we propose Rhomboid Tiling (RT) clustering, a novel clustering method based on the rhomboid tiling structure, which performs clustering by leveraging the complex geometric information of the data and effectively extracts its higher-order geometric structures. Moreover, we design RTPool, a hierarchical graph clustering pooling model based on RT clustering for graph classification tasks. The proposed model demonstrates superior performance, outperforming 21 state-of-the-art competitors on all the 7 benchmark datasets.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning
Authors:
Zhaochen Su,
Linjie Li,
Mingyang Song,
Yunzhuo Hao,
Zhengyuan Yang,
Jun Zhang,
Guanjie Chen,
Jiawei Gu,
Juntao Li,
Xiaoye Qu,
Yu Cheng
Abstract:
While humans can flexibly leverage interactive visual cognition for complex problem-solving, enabling Large Vision-Language Models (LVLMs) to learn similarly adaptive behaviors with visual tools remains challenging. A significant hurdle is the current lack of standardized infrastructure, which hinders integrating diverse tools, generating rich interaction data, and training robust agents effective…
▽ More
While humans can flexibly leverage interactive visual cognition for complex problem-solving, enabling Large Vision-Language Models (LVLMs) to learn similarly adaptive behaviors with visual tools remains challenging. A significant hurdle is the current lack of standardized infrastructure, which hinders integrating diverse tools, generating rich interaction data, and training robust agents effectively. To address these gaps, we introduce OpenThinkIMG, the first open-source, comprehensive end-to-end framework for tool-augmented LVLMs. It features standardized vision tool interfaces, scalable trajectory generation for policy initialization, and a flexible training environment. Furthermore, considering supervised fine-tuning (SFT) on static demonstrations offers limited policy generalization for dynamic tool invocation, we propose a novel reinforcement learning (RL) framework V-ToolRL to train LVLMs to learn adaptive policies for invoking external vision tools. V-ToolRL enables LVLMs to autonomously discover optimal tool-usage strategies by directly optimizing for task success using feedback from tool interactions. We empirically validate V-ToolRL on challenging chart reasoning tasks. Our RL-trained agent, built upon a Qwen2-VL-2B, significantly outperforms its SFT-initialized counterpart (+28.83 points) and surpasses established supervised tool-learning baselines like Taco and CogCom by an average of +12.7 points. Notably, it also surpasses prominent closed-source models like GPT-4.1 by +8.68 accuracy points. We hope OpenThinkIMG can serve as a foundational framework for advancing dynamic, tool-augmented visual reasoning, helping the community develop AI agents that can genuinely "think with images".
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Unveiling the Best Practices for Applying Speech Foundation Models to Speech Intelligibility Prediction for Hearing-Impaired People
Authors:
Haoshuai Zhou,
Boxuan Cao,
Changgeng Mo,
Linkai Li,
Shan Xiang Wang
Abstract:
Speech foundation models (SFMs) have demonstrated strong performance across a variety of downstream tasks, including speech intelligibility prediction for hearing-impaired people (SIP-HI). However, optimizing SFMs for SIP-HI has been insufficiently explored. In this paper, we conduct a comprehensive study to identify key design factors affecting SIP-HI performance with 5 SFMs, focusing on encoder…
▽ More
Speech foundation models (SFMs) have demonstrated strong performance across a variety of downstream tasks, including speech intelligibility prediction for hearing-impaired people (SIP-HI). However, optimizing SFMs for SIP-HI has been insufficiently explored. In this paper, we conduct a comprehensive study to identify key design factors affecting SIP-HI performance with 5 SFMs, focusing on encoder layer selection, prediction head architecture, and ensemble configurations. Our findings show that, contrary to traditional use-all-layers methods, selecting a single encoder layer yields better results. Additionally, temporal modeling is crucial for effective prediction heads. We also demonstrate that ensembling multiple SFMs improves performance, with stronger individual models providing greater benefit. Finally, we explore the relationship between key SFM attributes and their impact on SIP-HI performance. Our study offers practical insights into effectively adapting SFMs for speech intelligibility prediction for hearing-impaired populations.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Learning Dynamics in Continual Pre-Training for Large Language Models
Authors:
Xingjin Wang,
Howe Tissue,
Lu Wang,
Linjing Li,
Daniel Dajun Zeng
Abstract:
Continual Pre-Training (CPT) has become a popular and effective method to apply strong foundation models to specific downstream tasks. In this work, we explore the learning dynamics throughout the CPT process for large language models. We specifically focus on how general and downstream domain performance evolves at each training step, with domain performance measured via validation losses. We hav…
▽ More
Continual Pre-Training (CPT) has become a popular and effective method to apply strong foundation models to specific downstream tasks. In this work, we explore the learning dynamics throughout the CPT process for large language models. We specifically focus on how general and downstream domain performance evolves at each training step, with domain performance measured via validation losses. We have observed that the CPT loss curve fundamentally characterizes the transition from one curve to another hidden curve, and could be described by decoupling the effects of distribution shift and learning rate annealing. We derive a CPT scaling law that combines the two factors, enabling the prediction of loss at any (continual) training steps and across learning rate schedules (LRS) in CPT. Our formulation presents a comprehensive understanding of several critical factors in CPT, including loss potential, peak learning rate, training steps, replay ratio, etc. Moreover, our approach can be adapted to customize training hyper-parameters to different CPT goals such as balancing general and domain-specific performance. Extensive experiments demonstrate that our scaling law holds across various CPT datasets and training hyper-parameters.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining
Authors:
Xiaomi LLM-Core Team,
:,
Bingquan Xia,
Bowen Shen,
Cici,
Dawei Zhu,
Di Zhang,
Gang Wang,
Hailin Zhang,
Huaqiu Liu,
Jiebao Xiao,
Jinhao Dong,
Liang Zhao,
Peidian Li,
Peng Wang,
Shihua Yu,
Shimao Chen,
Weikun Wang,
Wenhan Ma,
Xiangwei Deng,
Yi Huang,
Yifan Song,
Zihan Jiang,
Bowen Ye,
Can Cai
, et al. (40 additional authors not shown)
Abstract:
We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with additional Multi-Token Prediction objective…
▽ More
We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with additional Multi-Token Prediction objective for enhanced performance and accelerated inference speed. During post-training, we curate a dataset of 130K verifiable mathematics and programming problems for reinforcement learning, integrating a test-difficulty-driven code-reward scheme to alleviate sparse-reward issues and employing strategic data resampling to stabilize training. Extensive evaluations show that MiMo-7B-Base possesses exceptional reasoning potential, outperforming even much larger 32B models. The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code and general reasoning tasks, surpassing the performance of OpenAI o1-mini. The model checkpoints are available at https://github.com/xiaomimimo/MiMo.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
Seed1.5-VL Technical Report
Authors:
Dong Guo,
Faming Wu,
Feida Zhu,
Fuxing Leng,
Guang Shi,
Haobin Chen,
Haoqi Fan,
Jian Wang,
Jianyu Jiang,
Jiawei Wang,
Jingji Chen,
Jingjia Huang,
Kang Lei,
Liping Yuan,
Lishu Luo,
Pengfei Liu,
Qinghao Ye,
Rui Qian,
Shen Yan,
Shixiong Zhao,
Shuai Peng,
Shuangye Li,
Sihang Yuan,
Sijin Wu,
Tianheng Cheng
, et al. (172 additional authors not shown)
Abstract:
We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluati…
▽ More
We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving the state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428)
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
A Survey on Data-Driven Modeling of Human Drivers' Lane-Changing Decisions
Authors:
Linxuan Huang,
Dong-Fan Xie,
Li Li,
Zhengbing He
Abstract:
Lane-changing (LC) behavior, a critical yet complex driving maneuver, significantly influences driving safety and traffic dynamics. Traditional analytical LC decision (LCD) models, while effective in specific environments, often oversimplify behavioral heterogeneity and complex interactions, limiting their capacity to capture real LCD. Data-driven approaches address these gaps by leveraging rich e…
▽ More
Lane-changing (LC) behavior, a critical yet complex driving maneuver, significantly influences driving safety and traffic dynamics. Traditional analytical LC decision (LCD) models, while effective in specific environments, often oversimplify behavioral heterogeneity and complex interactions, limiting their capacity to capture real LCD. Data-driven approaches address these gaps by leveraging rich empirical data and machine learning to decode latent decision-making patterns, enabling adaptive LCD modeling in dynamic environments. In light of the rapid development of artificial intelligence and the demand for data-driven models oriented towards connected vehicles and autonomous vehicles, this paper presents a comprehensive survey of data-driven LCD models, with a particular focus on human drivers LC decision-making. It systematically reviews the modeling framework, covering data sources and preprocessing, model inputs and outputs, objectives, structures, and validation methods. This survey further discusses the opportunities and challenges faced by data-driven LCD models, including driving safety, uncertainty, as well as the integration and improvement of technical frameworks.
△ Less
Submitted 10 May, 2025;
originally announced May 2025.
-
QiMeng-TensorOp: Automatically Generating High-Performance Tensor Operators with Hardware Primitives
Authors:
Xuzhi Zhang,
Shaohui Peng,
Qirui Zhou,
Yuanbo Wen,
Qi Guo,
Ruizhi Chen,
Xinguo Zhu,
Weiqiang Xiong,
Haixin Chen,
Congying Ma,
Ke Gao,
Chen Zhao,
Yanjun Wu,
Yunji Chen,
Ling Li
Abstract:
Computation-intensive tensor operators constitute over 90\% of the computations in Large Language Models (LLMs) and Deep Neural Networks.Automatically and efficiently generating high-performance tensor operators with hardware primitives is crucial for diverse and ever-evolving hardware architectures like RISC-V, ARM, and GPUs, as manually optimized implementation takes at least months and lacks po…
▽ More
Computation-intensive tensor operators constitute over 90\% of the computations in Large Language Models (LLMs) and Deep Neural Networks.Automatically and efficiently generating high-performance tensor operators with hardware primitives is crucial for diverse and ever-evolving hardware architectures like RISC-V, ARM, and GPUs, as manually optimized implementation takes at least months and lacks portability.LLMs excel at generating high-level language codes, but they struggle to fully comprehend hardware characteristics and produce high-performance tensor operators. We introduce a tensor-operator auto-generation framework with a one-line user prompt (QiMeng-TensorOp), which enables LLMs to automatically exploit hardware characteristics to generate tensor operators with hardware primitives, and tune parameters for optimal performance across diverse hardware. Experimental results on various hardware platforms, SOTA LLMs, and typical tensor operators demonstrate that QiMeng-TensorOp effectively unleashes the computing capability of various hardware platforms, and automatically generates tensor operators of superior performance. Compared with vanilla LLMs, QiMeng-TensorOp achieves up to $1291 \times$ performance improvement. Even compared with human experts, QiMeng-TensorOp could reach $251 \%$ of OpenBLAS on RISC-V CPUs, and $124 \%$ of cuBLAS on NVIDIA GPUs. Additionally, QiMeng-TensorOp also significantly reduces development costs by $200 \times$ compared with human experts.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
Soft causal learning for generalized molecule property prediction: An environment perspective
Authors:
Limin Li,
Kuo Yang,
Wenjie Du,
Pengkun Wang,
Zhengyang Zhou,
Yang Wang
Abstract:
Learning on molecule graphs has become an increasingly important topic in AI for science, which takes full advantage of AI to facilitate scientific discovery. Existing solutions on modeling molecules utilize Graph Neural Networks (GNNs) to achieve representations but they mostly fail to adapt models to out-of-distribution (OOD) samples. Although recent advances on OOD-oriented graph learning have…
▽ More
Learning on molecule graphs has become an increasingly important topic in AI for science, which takes full advantage of AI to facilitate scientific discovery. Existing solutions on modeling molecules utilize Graph Neural Networks (GNNs) to achieve representations but they mostly fail to adapt models to out-of-distribution (OOD) samples. Although recent advances on OOD-oriented graph learning have discovered the invariant rationale on graphs, they still ignore three important issues, i.e., 1) the expanding atom patterns regarding environments on graphs lead to failures of invariant rationale based models, 2) the associations between discovered molecular subgraphs and corresponding properties are complex where causal substructures cannot fully interpret the labels. 3) the interactions between environments and invariances can influence with each other thus are challenging to be modeled. To this end, we propose a soft causal learning framework, to tackle the unresolved OOD challenge in molecular science, from the perspective of fully modeling the molecule environments and bypassing the invariant subgraphs. Specifically, we first incorporate chemistry theories into our graph growth generator to imitate expaned environments, and then devise an GIB-based objective to disentangle environment from whole graphs and finally introduce a cross-attention based soft causal interaction, which allows dynamic interactions between environments and invariances. We perform experiments on seven datasets by imitating different kinds of OOD generalization scenarios. Extensive comparison, ablation experiments as well as visualized case studies demonstrate well generalization ability of our proposal.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
Authors:
Chao Liao,
Liyang Liu,
Xun Wang,
Zhengxiong Luo,
Xinyu Zhang,
Wenliang Zhao,
Jie Wu,
Liang Li,
Zhi Tian,
Weilin Huang
Abstract:
Recent progress in unified models for image understanding and generation has been impressive, yet most approaches remain limited to single-modal generation conditioned on multiple modalities. In this paper, we present Mogao, a unified framework that advances this paradigm by enabling interleaved multi-modal generation through a causal approach. Mogao integrates a set of key technical improvements…
▽ More
Recent progress in unified models for image understanding and generation has been impressive, yet most approaches remain limited to single-modal generation conditioned on multiple modalities. In this paper, we present Mogao, a unified framework that advances this paradigm by enabling interleaved multi-modal generation through a causal approach. Mogao integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance, which allow it to harness the strengths of both autoregressive models for text generation and diffusion models for high-quality image synthesis. These practical improvements also make Mogao particularly effective to process interleaved sequences of text and images arbitrarily. To further unlock the potential of unified models, we introduce an efficient training strategy on a large-scale, in-house dataset specifically curated for joint text and image generation. Extensive experiments show that Mogao not only achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs. Its emergent capabilities in zero-shot image editing and compositional generation highlight Mogao as a practical omni-modal foundation model, paving the way for future development and scaling the unified multi-modal systems.
△ Less
Submitted 11 May, 2025; v1 submitted 8 May, 2025;
originally announced May 2025.
-
Augmented Deep Contexts for Spatially Embedded Video Coding
Authors:
Yifan Bian,
Chuanbo Tang,
Li Li,
Dong Liu
Abstract:
Most Neural Video Codecs (NVCs) only employ temporal references to generate temporal-only contexts and latent prior. These temporal-only NVCs fail to handle large motions or emerging objects due to limited contexts and misaligned latent prior. To relieve the limitations, we propose a Spatially Embedded Video Codec (SEVC), in which the low-resolution video is compressed for spatial references. Firs…
▽ More
Most Neural Video Codecs (NVCs) only employ temporal references to generate temporal-only contexts and latent prior. These temporal-only NVCs fail to handle large motions or emerging objects due to limited contexts and misaligned latent prior. To relieve the limitations, we propose a Spatially Embedded Video Codec (SEVC), in which the low-resolution video is compressed for spatial references. Firstly, our SEVC leverages both spatial and temporal references to generate augmented motion vectors and hybrid spatial-temporal contexts. Secondly, to address the misalignment issue in latent prior and enrich the prior information, we introduce a spatial-guided latent prior augmented by multiple temporal latent representations. At last, we design a joint spatial-temporal optimization to learn quality-adaptive bit allocation for spatial references, further boosting rate-distortion performance. Experimental results show that our SEVC effectively alleviates the limitations in handling large motions or emerging objects, and also reduces 11.9% more bitrate than the previous state-of-the-art NVC while providing an additional low-resolution bitstream. Our code and model are available at https://github.com/EsakaK/SEVC.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
Divide-and-Conquer: Cold-Start Bundle Recommendation via Mixture of Diffusion Experts
Authors:
Ming Li,
Lin Li,
Xiaohui Tao,
Dong Zhang,
Jimmy Xiangji Huang
Abstract:
Cold-start bundle recommendation focuses on modeling new bundles with insufficient information to provide recommendations. Advanced bundle recommendation models usually learn bundle representations from multiple views (e.g., interaction view) at both the bundle and item levels. Consequently, the cold-start problem for bundles is more challenging than that for traditional items due to the dual-leve…
▽ More
Cold-start bundle recommendation focuses on modeling new bundles with insufficient information to provide recommendations. Advanced bundle recommendation models usually learn bundle representations from multiple views (e.g., interaction view) at both the bundle and item levels. Consequently, the cold-start problem for bundles is more challenging than that for traditional items due to the dual-level multi-view complexity. In this paper, we propose a novel Mixture of Diffusion Experts (MoDiffE) framework, which employs a divide-and-conquer strategy for cold-start bundle recommendation and follows three steps:(1) Divide: The bundle cold-start problem is divided into independent but similar sub-problems sequentially by level and view, which can be summarized as the poor representation of feature-missing bundles in prior-embedding models. (2) Conquer: Beyond prior-embedding models that fundamentally provide the embedded representations, we introduce a diffusion-based method to solve all sub-problems in a unified way, which directly generates diffusion representations using diffusion models without depending on specific features. (3) Combine: A cold-aware hierarchical Mixture of Experts (MoE) is employed to combine results of the sub-problems for final recommendations, where the two models for each view serve as experts and are adaptively fused for different bundles in a multi-layer manner. Additionally, MoDiffE adopts a multi-stage decoupled training pipeline and introduces a cold-start gating augmentation method to enable the training of gating for cold bundles. Through extensive experiments on three real-world datasets, we demonstrate that MoDiffE significantly outperforms existing solutions in handling cold-start bundle recommendation. It achieves up to a 0.1027 absolute gain in Recall@20 in cold-start scenarios and up to a 47.43\% relative improvement in all-bundle scenarios.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs
Authors:
Yehui Tang,
Yichun Yin,
Yaoyuan Wang,
Hang Zhou,
Yu Pan,
Wei Guo,
Ziyang Zhang,
Miao Rang,
Fangcheng Liu,
Naifu Zhang,
Binghan Li,
Yonghan Dong,
Xiaojun Meng,
Yasheng Wang,
Dong Li,
Yin Li,
Dandan Tu,
Can Chen,
Youliang Yan,
Fisher Yu,
Ruiming Tang,
Yunhe Wang,
Botian Huang,
Bo Wang,
Boxiao Liu
, et al. (49 additional authors not shown)
Abstract:
Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of most capable language models. However, the massive model scale poses significant challenges for the underlying software and hardware systems. In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs. The key goals are better usage of the computing r…
▽ More
Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of most capable language models. However, the massive model scale poses significant challenges for the underlying software and hardware systems. In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs. The key goals are better usage of the computing resources under the dynamic sparse model structures and materializing the expected performance gain on the actual hardware. To select model configurations suitable for Ascend NPUs without repeatedly running the expensive experiments, we leverage simulation to compare the trade-off of various model hyperparameters. This study led to Pangu Ultra MoE, a sparse LLM with 718 billion parameters, and we conducted experiments on the model to verify the simulation results. On the system side, we dig into Expert Parallelism to optimize the communication between NPU devices to reduce the synchronization overhead. We also optimize the memory efficiency within the devices to further reduce the parameter and activation management overhead. In the end, we achieve an MFU of 30.0% when training Pangu Ultra MoE, with performance comparable to that of DeepSeek R1, on 6K Ascend NPUs, and demonstrate that the Ascend system is capable of harnessing all the training stages of the state-of-the-art language models. Extensive experiments indicate that our recipe can lead to efficient training of large-scale sparse language models with MoE. We also study the behaviors of such models for future reference.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
BuildingBlock: A Hybrid Approach for Structured Building Generation
Authors:
Junming Huang,
Chi Wang,
Letian Li,
Changxin Huang,
Qiang Dai,
Weiwei Xu
Abstract:
Three-dimensional building generation is vital for applications in gaming, virtual reality, and digital twins, yet current methods face challenges in producing diverse, structured, and hierarchically coherent buildings. We propose BuildingBlock, a hybrid approach that integrates generative models, procedural content generation (PCG), and large language models (LLMs) to address these limitations. S…
▽ More
Three-dimensional building generation is vital for applications in gaming, virtual reality, and digital twins, yet current methods face challenges in producing diverse, structured, and hierarchically coherent buildings. We propose BuildingBlock, a hybrid approach that integrates generative models, procedural content generation (PCG), and large language models (LLMs) to address these limitations. Specifically, our method introduces a two-phase pipeline: the Layout Generation Phase (LGP) and the Building Construction Phase (BCP).
LGP reframes box-based layout generation as a point-cloud generation task, utilizing a newly constructed architectural dataset and a Transformer-based diffusion model to create globally consistent layouts. With LLMs, these layouts are extended into rule-based hierarchical designs, seamlessly incorporating component styles and spatial structures.
The BCP leverages these layouts to guide PCG, enabling local-customizable, high-quality structured building generation. Experimental results demonstrate BuildingBlock's effectiveness in generating diverse and hierarchically structured buildings, achieving state-of-the-art results on multiple benchmarks, and paving the way for scalable and intuitive architectural workflows.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model
Authors:
Zuwei Long,
Yunhang Shen,
Chaoyou Fu,
Heting Gao,
Lijiang Li,
Peixian Chen,
Mengdan Zhang,
Hang Shao,
Jian Li,
Jinlong Peng,
Haoyu Cao,
Ke Li,
Rongrong Ji,
Xing Sun
Abstract:
With the growing requirement for natural human-computer interaction, speech-based systems receive increasing attention as speech is one of the most common forms of daily communication. However, the existing speech models still experience high latency when generating the first audio token during streaming, which poses a significant bottleneck for deployment. To address this issue, we propose VITA-A…
▽ More
With the growing requirement for natural human-computer interaction, speech-based systems receive increasing attention as speech is one of the most common forms of daily communication. However, the existing speech models still experience high latency when generating the first audio token during streaming, which poses a significant bottleneck for deployment. To address this issue, we propose VITA-Audio, an end-to-end large speech model with fast audio-text token generation. Specifically, we introduce a lightweight Multiple Cross-modal Token Prediction (MCTP) module that efficiently generates multiple audio tokens within a single model forward pass, which not only accelerates the inference but also significantly reduces the latency for generating the first audio in streaming scenarios. In addition, a four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality. To our knowledge, VITA-Audio is the first multi-modal large language model capable of generating audio output during the first forward pass, enabling real-time conversational capabilities with minimal latency. VITA-Audio is fully reproducible and is trained on open-source data only. Experimental results demonstrate that our model achieves an inference speedup of 3~5x at the 7B parameter scale, but also significantly outperforms open-source models of similar model size on multiple benchmarks for automatic speech recognition (ASR), text-to-speech (TTS), and spoken question answering (SQA) tasks.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
Nonperiodic dynamic CT reconstruction using backward-warping INR with regularization of diffeomorphism (BIRD)
Authors:
Muge Du,
Zhuozhao Zheng,
Wenying Wang,
Guotao Quan,
Wuliang Shi,
Le Shen,
Li Zhang,
Liang Li,
Yinong Liu,
Yuxiang Xing
Abstract:
Dynamic computed tomography (CT) reconstruction faces significant challenges in addressing motion artifacts, particularly for nonperiodic rapid movements such as cardiac imaging with fast heart rates. Traditional methods struggle with the extreme limited-angle problems inherent in nonperiodic cases. Deep learning methods have improved performance but face generalization challenges. Recent implicit…
▽ More
Dynamic computed tomography (CT) reconstruction faces significant challenges in addressing motion artifacts, particularly for nonperiodic rapid movements such as cardiac imaging with fast heart rates. Traditional methods struggle with the extreme limited-angle problems inherent in nonperiodic cases. Deep learning methods have improved performance but face generalization challenges. Recent implicit neural representation (INR) techniques show promise through self-supervised deep learning, but have critical limitations: computational inefficiency due to forward-warping modeling, difficulty balancing DVF complexity with anatomical plausibility, and challenges in preserving fine details without additional patient-specific pre-scans. This paper presents a novel INR-based framework, BIRD, for nonperiodic dynamic CT reconstruction. It addresses these challenges through four key contributions: (1) backward-warping deformation that enables direct computation of each dynamic voxel with significantly reduced computational cost, (2) diffeomorphism-based DVF regularization that ensures anatomically plausible deformations while maintaining representational capacity, (3) motion-compensated analytical reconstruction that enhances fine details without requiring additional pre-scans, and (4) dimensional-reduction design for efficient 4D coordinate encoding. Through various simulations and practical studies, including digital and physical phantoms and retrospective patient data, we demonstrate the effectiveness of our approach for nonperiodic dynamic CT reconstruction with enhanced details and reduced motion artifacts. The proposed framework enables more accurate dynamic CT reconstruction with potential clinical applications, such as one-beat cardiac reconstruction, cinematic image sequences for functional imaging, and motion artifact reduction in conventional CT scans.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
QiMeng-CPU-v2: Automated Superscalar Processor Design by Learning Data Dependencies
Authors:
Shuyao Cheng,
Rui Zhang,
Wenkai He,
Pengwei Jin,
Chongxiao Li,
Zidong Du,
Xing Hu,
Yifan Hao,
Guanglin Xu,
Yuanbo Wen,
Ling Li,
Qi Guo,
Yunji Chen
Abstract:
Automated processor design, which can significantly reduce human efforts and accelerate design cycles, has received considerable attention. While recent advancements have automatically designed single-cycle processors that execute one instruction per cycle, their performance cannot compete with modern superscalar processors that execute multiple instructions per cycle. Previous methods fail on sup…
▽ More
Automated processor design, which can significantly reduce human efforts and accelerate design cycles, has received considerable attention. While recent advancements have automatically designed single-cycle processors that execute one instruction per cycle, their performance cannot compete with modern superscalar processors that execute multiple instructions per cycle. Previous methods fail on superscalar processor design because they cannot address inter-instruction data dependencies, leading to inefficient sequential instruction execution.
This paper proposes a novel approach to automatically designing superscalar processors using a hardware-friendly model called the Stateful Binary Speculation Diagram (State-BSD). We observe that processor parallelism can be enhanced through on-the-fly inter-instruction dependent data predictors, reusing the processor's internal states to learn the data dependency. To meet the challenge of both hardware-resource limitation and design functional correctness, State-BSD consists of two components: 1) a lightweight state-selector trained by the simulated annealing method to detect the most reusable processor states and store them in a small buffer; and 2) a highly precise state-speculator trained by the BSD expansion method to predict the inter-instruction dependent data using the selected states. It is the first work to achieve the automated superscalar processor design, i.e. QiMeng-CPU-v2, which improves the performance by about $380\times$ than the state-of-the-art automated design and is comparable to human-designed superscalar processors such as ARM Cortex A53.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves
Authors:
Dengyang Jiang,
Mengmeng Wang,
Liuzhuozheng Li,
Lei Zhang,
Haoyu Wang,
Wei Wei,
Guang Dai,
Yanning Zhang,
Jingdong Wang
Abstract:
Recent studies have demonstrated that learning a meaningful internal representation can both accelerate generative training and enhance the generation quality of diffusion transformers. However, existing approaches necessitate to either introduce an external and complex representation training framework or rely on a large-scale, pre-trained representation foundation model to provide representation…
▽ More
Recent studies have demonstrated that learning a meaningful internal representation can both accelerate generative training and enhance the generation quality of diffusion transformers. However, existing approaches necessitate to either introduce an external and complex representation training framework or rely on a large-scale, pre-trained representation foundation model to provide representation guidance during the original generative training process. In this study, we posit that the unique discriminative process inherent to diffusion transformers enables them to offer such guidance without requiring external representation components. We therefore propose Self-Representation Alignment (SRA), a simple yet straightforward method that obtains representation guidance through a self-distillation manner. Specifically, SRA aligns the output latent representation of the diffusion transformer in the earlier layer with higher noise to that in the later layer with lower noise to progressively enhance the overall representation learning during only the generative training process. Experimental results indicate that applying SRA to DiTs and SiTs yields consistent performance improvements. Moreover, SRA not only significantly outperforms approaches relying on auxiliary, complex representation training frameworks but also achieves performance comparable to methods that are heavily dependent on powerful external representation priors.
△ Less
Submitted 13 May, 2025; v1 submitted 5 May, 2025;
originally announced May 2025.
-
Advances in Automated Fetal Brain MRI Segmentation and Biometry: Insights from the FeTA 2024 Challenge
Authors:
Vladyslav Zalevskyi,
Thomas Sanchez,
Misha Kaandorp,
Margaux Roulet,
Diego Fajardo-Rojas,
Liu Li,
Jana Hutter,
Hongwei Bran Li,
Matthew Barkovich,
Hui Ji,
Luca Wilhelmi,
Aline Dändliker,
Céline Steger,
Mériam Koob,
Yvan Gomez,
Anton Jakovčić,
Melita Klaić,
Ana Adžić,
Pavel Marković,
Gracia Grabarić,
Milan Rados,
Jordina Aviles Verdera,
Gregor Kasprian,
Gregor Dovjak,
Raphael Gaubert-Rachmühl
, et al. (45 additional authors not shown)
Abstract:
Accurate fetal brain tissue segmentation and biometric analysis are essential for studying brain development in utero. The FeTA Challenge 2024 advanced automated fetal brain MRI analysis by introducing biometry prediction as a new task alongside tissue segmentation. For the first time, our diverse multi-centric test set included data from a new low-field (0.55T) MRI dataset. Evaluation metrics wer…
▽ More
Accurate fetal brain tissue segmentation and biometric analysis are essential for studying brain development in utero. The FeTA Challenge 2024 advanced automated fetal brain MRI analysis by introducing biometry prediction as a new task alongside tissue segmentation. For the first time, our diverse multi-centric test set included data from a new low-field (0.55T) MRI dataset. Evaluation metrics were also expanded to include the topology-specific Euler characteristic difference (ED). Sixteen teams submitted segmentation methods, most of which performed consistently across both high- and low-field scans. However, longitudinal trends indicate that segmentation accuracy may be reaching a plateau, with results now approaching inter-rater variability. The ED metric uncovered topological differences that were missed by conventional metrics, while the low-field dataset achieved the highest segmentation scores, highlighting the potential of affordable imaging systems when paired with high-quality reconstruction. Seven teams participated in the biometry task, but most methods failed to outperform a simple baseline that predicted measurements based solely on gestational age, underscoring the challenge of extracting reliable biometric estimates from image data alone. Domain shift analysis identified image quality as the most significant factor affecting model generalization, with super-resolution pipelines also playing a substantial role. Other factors, such as gestational age, pathology, and acquisition site, had smaller, though still measurable, effects. Overall, FeTA 2024 offers a comprehensive benchmark for multi-class segmentation and biometry estimation in fetal brain MRI, underscoring the need for data-centric approaches, improved topological evaluation, and greater dataset diversity to enable clinically robust and generalizable AI tools.
△ Less
Submitted 8 May, 2025; v1 submitted 5 May, 2025;
originally announced May 2025.
-
FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing
Authors:
Gaoxiang Cong,
Liang Li,
Jiadong Pan,
Zhedong Zhang,
Amin Beheshti,
Anton van den Hengel,
Yuankai Qi,
Qingming Huang
Abstract:
Movie Dubbing aims to convert scripts into speeches that align with the given movie clip in both temporal and emotional aspects while preserving the vocal timbre of a given brief reference audio. Existing methods focus primarily on reducing the word error rate while ignoring the importance of lip-sync and acoustic quality. To address these issues, we propose a large language model (LLM) based flow…
▽ More
Movie Dubbing aims to convert scripts into speeches that align with the given movie clip in both temporal and emotional aspects while preserving the vocal timbre of a given brief reference audio. Existing methods focus primarily on reducing the word error rate while ignoring the importance of lip-sync and acoustic quality. To address these issues, we propose a large language model (LLM) based flow matching architecture for dubbing, named FlowDubber, which achieves high-quality audio-visual sync and pronunciation by incorporating a large speech language model and dual contrastive aligning while achieving better acoustic quality via the proposed voice-enhanced flow matching than previous works. First, we introduce Qwen2.5 as the backbone of LLM to learn the in-context sequence from movie scripts and reference audio. Then, the proposed semantic-aware learning focuses on capturing LLM semantic knowledge at the phoneme level. Next, dual contrastive aligning (DCA) boosts mutual alignment with lip movement, reducing ambiguities where similar phonemes might be confused. Finally, the proposed Flow-based Voice Enhancing (FVE) improves acoustic quality in two aspects, which introduces an LLM-based acoustics flow matching guidance to strengthen clarity and uses affine style prior to enhance identity when recovering noise into mel-spectrograms via gradient vector field prediction. Extensive experiments demonstrate that our method outperforms several state-of-the-art methods on two primary benchmarks. The demos are available at {\href{https://galaxycong.github.io/LLM-Flow-Dubber/}{\textcolor{red}{https://galaxycong.github.io/LLM-Flow-Dubber/}}}.
△ Less
Submitted 2 May, 2025;
originally announced May 2025.
-
3D Hand-Eye Calibration for Collaborative Robot Arm: Look at Robot Base Once
Authors:
Leihui Li,
Lixuepiao Wan,
Volker Krueger,
Xuping Zhang
Abstract:
Hand-eye calibration is a common problem in the field of collaborative robotics, involving the determination of the transformation matrix between the visual sensor and the robot flange to enable vision-based robotic tasks. However, this process typically requires multiple movements of the robot arm and an external calibration object, making it both time-consuming and inconvenient, especially in sc…
▽ More
Hand-eye calibration is a common problem in the field of collaborative robotics, involving the determination of the transformation matrix between the visual sensor and the robot flange to enable vision-based robotic tasks. However, this process typically requires multiple movements of the robot arm and an external calibration object, making it both time-consuming and inconvenient, especially in scenarios where frequent recalibration is necessary. In this work, we extend our previous method which eliminates the need for external calibration objects such as a chessboard. We propose a generic dataset generation approach for point cloud registration, focusing on aligning the robot base point cloud with the scanned data. Furthermore, a more detailed simulation study is conducted involving several different collaborative robot arms, followed by real-world experiments in an industrial setting. Our improved method is simulated and evaluated using a total of 14 robotic arms from 9 different brands, including KUKA, Universal Robots, UFACTORY, and Franka Emika, all of which are widely used in the field of collaborative robotics. Physical experiments demonstrate that our extended approach achieves performance comparable to existing commercial hand-eye calibration solutions, while completing the entire calibration procedure in just a few seconds. In addition, we provide a user-friendly hand-eye calibration solution, with the code publicly available at github.com/leihui6/LRBO.
△ Less
Submitted 9 May, 2025; v1 submitted 30 April, 2025;
originally announced April 2025.
-
Multi-modal Transfer Learning for Dynamic Facial Emotion Recognition in the Wild
Authors:
Ezra Engel,
Lishan Li,
Chris Hudy,
Robert Schleusner
Abstract:
Facial expression recognition (FER) is a subset of computer vision with important applications for human-computer-interaction, healthcare, and customer service. FER represents a challenging problem-space because accurate classification requires a model to differentiate between subtle changes in facial features. In this paper, we examine the use of multi-modal transfer learning to improve performan…
▽ More
Facial expression recognition (FER) is a subset of computer vision with important applications for human-computer-interaction, healthcare, and customer service. FER represents a challenging problem-space because accurate classification requires a model to differentiate between subtle changes in facial features. In this paper, we examine the use of multi-modal transfer learning to improve performance on a challenging video-based FER dataset, Dynamic Facial Expression in-the-Wild (DFEW). Using a combination of pretrained ResNets, OpenPose, and OmniVec networks, we explore the impact of cross-temporal, multi-modal features on classification accuracy. Ultimately, we find that these finely-tuned multi-modal feature generators modestly improve accuracy of our transformer-based classification model.
△ Less
Submitted 29 April, 2025;
originally announced April 2025.
-
GarmentX: Autoregressive Parametric Representations for High-Fidelity 3D Garment Generation
Authors:
Jingfeng Guo,
Jinnan Chen,
Weikai Chen,
Zhenyu Sun,
Lanjiong Li,
Baozhu Zhao,
Lingting Zhu,
Xin Wang,
Qi Liu
Abstract:
This work presents GarmentX, a novel framework for generating diverse, high-fidelity, and wearable 3D garments from a single input image. Traditional garment reconstruction methods directly predict 2D pattern edges and their connectivity, an overly unconstrained approach that often leads to severe self-intersections and physically implausible garment structures. In contrast, GarmentX introduces a…
▽ More
This work presents GarmentX, a novel framework for generating diverse, high-fidelity, and wearable 3D garments from a single input image. Traditional garment reconstruction methods directly predict 2D pattern edges and their connectivity, an overly unconstrained approach that often leads to severe self-intersections and physically implausible garment structures. In contrast, GarmentX introduces a structured and editable parametric representation compatible with GarmentCode, ensuring that the decoded sewing patterns always form valid, simulation-ready 3D garments while allowing for intuitive modifications of garment shape and style. To achieve this, we employ a masked autoregressive model that sequentially predicts garment parameters, leveraging autoregressive modeling for structured generation while mitigating inconsistencies in direct pattern prediction. Additionally, we introduce GarmentX dataset, a large-scale dataset of 378,682 garment parameter-image pairs, constructed through an automatic data generation pipeline that synthesizes diverse and high-quality garment images conditioned on parametric garment representations. Through integrating our method with GarmentX dataset, we achieve state-of-the-art performance in geometric fidelity and input image alignment, significantly outperforming prior approaches. We will release GarmentX dataset upon publication.
△ Less
Submitted 29 April, 2025;
originally announced April 2025.
-
Transforming Evidence Synthesis: A Systematic Review of the Evolution of Automated Meta-Analysis in the Age of AI
Authors:
Lingbo Li,
Anuradha Mathrani,
Teo Susnjak
Abstract:
Exponential growth in scientific literature has heightened the demand for efficient evidence-based synthesis, driving the rise of the field of Automated Meta-analysis (AMA) powered by natural language processing and machine learning. This PRISMA systematic review introduces a structured framework for assessing the current state of AMA, based on screening 978 papers from 2006 to 2024, and analyzing…
▽ More
Exponential growth in scientific literature has heightened the demand for efficient evidence-based synthesis, driving the rise of the field of Automated Meta-analysis (AMA) powered by natural language processing and machine learning. This PRISMA systematic review introduces a structured framework for assessing the current state of AMA, based on screening 978 papers from 2006 to 2024, and analyzing 54 studies across diverse domains. Findings reveal a predominant focus on automating data processing (57%), such as extraction and statistical modeling, while only 17% address advanced synthesis stages. Just one study (2%) explored preliminary full-process automation, highlighting a critical gap that limits AMA's capacity for comprehensive synthesis. Despite recent breakthroughs in large language models (LLMs) and advanced AI, their integration into statistical modeling and higher-order synthesis, such as heterogeneity assessment and bias evaluation, remains underdeveloped. This has constrained AMA's potential for fully autonomous meta-analysis. From our dataset spanning medical (67%) and non-medical (33%) applications, we found that AMA has exhibited distinct implementation patterns and varying degrees of effectiveness in actually improving efficiency, scalability, and reproducibility. While automation has enhanced specific meta-analytic tasks, achieving seamless, end-to-end automation remains an open challenge. As AI systems advance in reasoning and contextual understanding, addressing these gaps is now imperative. Future efforts must focus on bridging automation across all meta-analysis stages, refining interpretability, and ensuring methodological robustness to fully realize AMA's potential for scalable, domain-agnostic synthesis.
△ Less
Submitted 27 April, 2025;
originally announced April 2025.
-
RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning
Authors:
Zihan Wang,
Kangrui Wang,
Qineng Wang,
Pingyue Zhang,
Linjie Li,
Zhengyuan Yang,
Kefan Yu,
Minh Nhat Nguyen,
Licheng Liu,
Eli Gottlieb,
Monica Lam,
Yiping Lu,
Kyunghyun Cho,
Jiajun Wu,
Li Fei-Fei,
Lijuan Wang,
Yejin Choi,
Manling Li
Abstract:
Training large language models (LLMs) as interactive agents presents unique challenges including long-horizon decision making and interacting with stochastic environment feedback. While reinforcement learning (RL) has enabled progress in static tasks, multi-turn agent RL training remains underexplored. We propose StarPO (State-Thinking-Actions-Reward Policy Optimization), a general framework for t…
▽ More
Training large language models (LLMs) as interactive agents presents unique challenges including long-horizon decision making and interacting with stochastic environment feedback. While reinforcement learning (RL) has enabled progress in static tasks, multi-turn agent RL training remains underexplored. We propose StarPO (State-Thinking-Actions-Reward Policy Optimization), a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents. Our study on three stylized environments reveals three core findings. First, our agent RL training shows a recurring mode of Echo Trap where reward variance cliffs and gradient spikes; we address this with StarPO-S, a stabilized variant with trajectory filtering, critic incorporation, and decoupled clipping. Second, we find the shaping of RL rollouts would benefit from diverse initial states, medium interaction granularity and more frequent sampling. Third, we show that without fine-grained, reasoning-aware reward signals, agent reasoning hardly emerge through multi-turn RL and they may show shallow strategies or hallucinated thoughts. Code and environments are available at https://github.com/RAGEN-AI/RAGEN.
△ Less
Submitted 24 April, 2025;
originally announced April 2025.
-
Applying LLM-Powered Virtual Humans to Child Interviews in Child-Centered Design
Authors:
Linshi Li,
Hanlin Cai
Abstract:
In child-centered design, directly engaging children is crucial for deeply understanding their experiences. However, current research often prioritizes adult perspectives, as interviewing children involves unique challenges such as environmental sensitivities and the need for trust-building. AI-powered virtual humans (VHs) offer a promising approach to facilitate engaging and multimodal interactio…
▽ More
In child-centered design, directly engaging children is crucial for deeply understanding their experiences. However, current research often prioritizes adult perspectives, as interviewing children involves unique challenges such as environmental sensitivities and the need for trust-building. AI-powered virtual humans (VHs) offer a promising approach to facilitate engaging and multimodal interactions with children. This study establishes key design guidelines for LLM-powered virtual humans tailored to child interviews, standardizing multimodal elements including color schemes, voice characteristics, facial features, expressions, head movements, and gestures. Using ChatGPT-based prompt engineering, we developed three distinct Human-AI workflows (LLM-Auto, LLM-Interview, and LLM-Analyze) and conducted a user study involving 15 children aged 6 to 12. The results indicated that the LLM-Analyze workflow outperformed the others by eliciting longer responses, achieving higher user experience ratings, and promoting more effective child engagement.
△ Less
Submitted 28 April, 2025;
originally announced April 2025.
-
LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects
Authors:
Guangyi Liu,
Pengxiang Zhao,
Liang Liu,
Yaxuan Guo,
Han Xiao,
Weifeng Lin,
Yuxiang Chai,
Yue Han,
Shuai Ren,
Hao Wang,
Xiaoyu Liang,
Wenhao Wang,
Tianze Wu,
Linghao Li,
Hao Wang,
Guanjing Xiong,
Yong Liu,
Hongsheng Li
Abstract:
With the rapid rise of large language models (LLMs), phone automation has undergone transformative changes. This paper systematically reviews LLM-driven phone GUI agents, highlighting their evolution from script-based automation to intelligent, adaptive systems. We first contextualize key challenges, (i) limited generality, (ii) high maintenance overhead, and (iii) weak intent comprehension, and s…
▽ More
With the rapid rise of large language models (LLMs), phone automation has undergone transformative changes. This paper systematically reviews LLM-driven phone GUI agents, highlighting their evolution from script-based automation to intelligent, adaptive systems. We first contextualize key challenges, (i) limited generality, (ii) high maintenance overhead, and (iii) weak intent comprehension, and show how LLMs address these issues through advanced language understanding, multimodal perception, and robust decision-making. We then propose a taxonomy covering fundamental agent frameworks (single-agent, multi-agent, plan-then-act), modeling approaches (prompt engineering, training-based), and essential datasets and benchmarks. Furthermore, we detail task-specific architectures, supervised fine-tuning, and reinforcement learning strategies that bridge user intent and GUI operations. Finally, we discuss open challenges such as dataset diversity, on-device deployment efficiency, user-centric adaptation, and security concerns, offering forward-looking insights into this rapidly evolving field. By providing a structured overview and identifying pressing research gaps, this paper serves as a definitive reference for researchers and practitioners seeking to harness LLMs in designing scalable, user-friendly phone GUI agents.
△ Less
Submitted 28 April, 2025;
originally announced April 2025.
-
APE-Bench I: Towards File-level Automated Proof Engineering of Formal Math Libraries
Authors:
Huajian Xin,
Luming Li,
Xiaoran Jin,
Jacques Fleuriot,
Wenda Li
Abstract:
Recent progress in large language models (LLMs) has shown promise in formal theorem proving, yet existing benchmarks remain limited to isolated, static proof tasks, failing to capture the iterative, engineering-intensive workflows of real-world formal mathematics libraries. Motivated by analogous advances in software engineering, we introduce the paradigm of Automated Proof Engineering (APE), whic…
▽ More
Recent progress in large language models (LLMs) has shown promise in formal theorem proving, yet existing benchmarks remain limited to isolated, static proof tasks, failing to capture the iterative, engineering-intensive workflows of real-world formal mathematics libraries. Motivated by analogous advances in software engineering, we introduce the paradigm of Automated Proof Engineering (APE), which aims to automate proof engineering tasks such as feature addition, proof refactoring, and bug fixing using LLMs. To facilitate research in this direction, we present APE-Bench I, the first realistic benchmark built from real-world commit histories of Mathlib4, featuring diverse file-level tasks described in natural language and verified via a hybrid approach combining the Lean compiler and LLM-as-a-Judge. We further develop Eleanstic, a scalable parallel verification infrastructure optimized for proof checking across multiple versions of Mathlib. Empirical results on state-of-the-art LLMs demonstrate strong performance on localized edits but substantial degradation on handling complex proof engineering. This work lays the foundation for developing agentic workflows in proof engineering, with future benchmarks targeting multi-file coordination, project-scale verification, and autonomous agents capable of planning, editing, and repairing formal libraries.
△ Less
Submitted 27 April, 2025;
originally announced April 2025.
-
"I Would Have Written My Code Differently'': Beginners Struggle to Understand LLM-Generated Code
Authors:
Yangtian Zi,
Luisa Li,
Arjun Guha,
Carolyn Jane Anderson,
Molly Q Feldman
Abstract:
Large language models (LLMs) are being increasingly adopted for programming work. Prior work shows that while LLMs accelerate task completion for professional programmers, beginning programmers struggle to prompt models effectively. However, prompting is just half of the code generation process -- when code is generated, it must be read, evaluated, and integrated (or rejected). How accessible are…
▽ More
Large language models (LLMs) are being increasingly adopted for programming work. Prior work shows that while LLMs accelerate task completion for professional programmers, beginning programmers struggle to prompt models effectively. However, prompting is just half of the code generation process -- when code is generated, it must be read, evaluated, and integrated (or rejected). How accessible are these tasks for beginning programmers?
This paper measures how well beginners comprehend LLM-generated code and explores the challenges students face in judging code correctness. We compare how well students understand natural language descriptions of functions and LLM-generated implementations, studying 32 CS1 students on 160 task instances. Our results show a low per-task success rate of 32.5\%, with indiscriminate struggles across demographic populations. Key challenges include barriers for non-native English speakers, unfamiliarity with Python syntax, and automation bias. Our findings highlight the barrier that code comprehension presents to beginning programmers seeking to write code with LLMs.
△ Less
Submitted 26 April, 2025;
originally announced April 2025.
-
Partition Map-Based Fast Block Partitioning for VVC Inter Coding
Authors:
Xinmin Feng,
Zhuoyuan Li,
Li Li,
Dong Liu,
Feng Wu
Abstract:
Among the new techniques of Versatile Video Coding (VVC), the quadtree with nested multi-type tree (QT+MTT) block structure yields significant coding gains by providing more flexible block partitioning patterns. However, the recursive partition search in the VVC encoder increases the encoder complexity substantially. To address this issue, we propose a partition map-based algorithm to pursue fast…
▽ More
Among the new techniques of Versatile Video Coding (VVC), the quadtree with nested multi-type tree (QT+MTT) block structure yields significant coding gains by providing more flexible block partitioning patterns. However, the recursive partition search in the VVC encoder increases the encoder complexity substantially. To address this issue, we propose a partition map-based algorithm to pursue fast block partitioning in inter coding. Based on our previous work on partition map-based methods for intra coding, we analyze the characteristics of VVC inter coding, and thus improve the partition map by incorporating an MTT mask for early termination. Next, we develop a neural network that uses both spatial and temporal features to predict the partition map. It consists of several special designs including stacked top-down and bottom-up processing, quantization parameter modulation layers, and partitioning-adaptive warping. Furthermore, we present a dual-threshold decision scheme to achieve a fine-grained trade-off between complexity reduction and rate-distortion (RD) performance loss. The experimental results demonstrate that the proposed method achieves an average 51.30% encoding time saving with a 2.12% Bjontegaard Delta Bit Rate (BDBR) under the random access configuration.
△ Less
Submitted 25 April, 2025;
originally announced April 2025.
-
E-InMeMo: Enhanced Prompting for Visual In-Context Learning
Authors:
Jiahao Zhang,
Bowen Wang,
Hong Liu,
Liangzhi Li,
Yuta Nakashima,
Hajime Nagahara
Abstract:
Large-scale models trained on extensive datasets have become the standard due to their strong generalizability across diverse tasks. In-context learning (ICL), widely used in natural language processing, leverages these models by providing task-specific prompts without modifying their parameters. This paradigm is increasingly being adapted for computer vision, where models receive an input-output…
▽ More
Large-scale models trained on extensive datasets have become the standard due to their strong generalizability across diverse tasks. In-context learning (ICL), widely used in natural language processing, leverages these models by providing task-specific prompts without modifying their parameters. This paradigm is increasingly being adapted for computer vision, where models receive an input-output image pair, known as an in-context pair, alongside a query image to illustrate the desired output. However, the success of visual ICL largely hinges on the quality of these prompts. To address this, we propose Enhanced Instruct Me More (E-InMeMo), a novel approach that incorporates learnable perturbations into in-context pairs to optimize prompting. Through extensive experiments on standard vision tasks, E-InMeMo demonstrates superior performance over existing state-of-the-art methods. Notably, it improves mIoU scores by 7.99 for foreground segmentation and by 17.04 for single object detection when compared to the baseline without learnable prompts. These results highlight E-InMeMo as a lightweight yet effective strategy for enhancing visual ICL. Code is publicly available at: https://github.com/Jackieam/E-InMeMo
△ Less
Submitted 25 April, 2025;
originally announced April 2025.
-
TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos
Authors:
Linli Yao,
Yicheng Li,
Yuancheng Wei,
Lei Li,
Shuhuai Ren,
Yuanxin Liu,
Kun Ouyang,
Lean Wang,
Shicheng Li,
Sida Li,
Lingpeng Kong,
Qi Liu,
Yuanxing Zhang,
Xu Sun
Abstract:
The rapid growth of online video platforms, particularly live streaming services, has created an urgent need for real-time video understanding systems. These systems must process continuous video streams and respond to user queries instantaneously, presenting unique challenges for current Video Large Language Models (VideoLLMs). While existing VideoLLMs excel at processing complete videos, they fa…
▽ More
The rapid growth of online video platforms, particularly live streaming services, has created an urgent need for real-time video understanding systems. These systems must process continuous video streams and respond to user queries instantaneously, presenting unique challenges for current Video Large Language Models (VideoLLMs). While existing VideoLLMs excel at processing complete videos, they face significant limitations in streaming scenarios due to their inability to handle dense, redundant frames efficiently. We introduce TimeChat-Online, a novel online VideoLLM that revolutionizes real-time video interaction. At its core lies our innovative Differential Token Drop (DTD) module, which addresses the fundamental challenge of visual redundancy in streaming videos. Drawing inspiration from human visual perception's Change Blindness phenomenon, DTD preserves meaningful temporal changes while filtering out static, redundant content between frames. Remarkably, our experiments demonstrate that DTD achieves an 82.8% reduction in video tokens while maintaining 98% performance on StreamingBench, revealing that over 80% of visual content in streaming videos is naturally redundant without requiring language guidance. To enable seamless real-time interaction, we present TimeChat-Online-139K, a comprehensive streaming video dataset featuring diverse interaction patterns including backward-tracing, current-perception, and future-responding scenarios. TimeChat-Online's unique Proactive Response capability, naturally achieved through continuous monitoring of video scene transitions via DTD, sets it apart from conventional approaches. Our extensive evaluation demonstrates TimeChat-Online's superior performance on streaming benchmarks (StreamingBench and OvOBench) and maintaining competitive results on long-form video tasks such as Video-MME and MLVU.
△ Less
Submitted 24 April, 2025;
originally announced April 2025.
-
Symbolic Representation for Any-to-Any Generative Tasks
Authors:
Jiaqi Chen,
Xiaoye Zhu,
Yue Wang,
Tianyang Liu,
Xinhui Chen,
Ying Chen,
Chak Tou Leong,
Yifei Ke,
Joseph Liu,
Yiwen Yuan,
Julian McAuley,
Li-jia Li
Abstract:
We propose a symbolic generative task description language and a corresponding inference engine capable of representing arbitrary multimodal tasks as structured symbolic flows. Unlike conventional generative models that rely on large-scale training and implicit neural representations to learn cross-modal mappings, often at high computational cost and with limited flexibility, our framework introdu…
▽ More
We propose a symbolic generative task description language and a corresponding inference engine capable of representing arbitrary multimodal tasks as structured symbolic flows. Unlike conventional generative models that rely on large-scale training and implicit neural representations to learn cross-modal mappings, often at high computational cost and with limited flexibility, our framework introduces an explicit symbolic representation comprising three core primitives: functions, parameters, and topological logic. Leveraging a pre-trained language model, our inference engine maps natural language instructions directly to symbolic workflows in a training-free manner. Our framework successfully performs over 12 diverse multimodal generative tasks, demonstrating strong performance and flexibility without the need for task-specific tuning. Experiments show that our method not only matches or outperforms existing state-of-the-art unified models in content quality, but also offers greater efficiency, editability, and interruptibility. We believe that symbolic task representations provide a cost-effective and extensible foundation for advancing the capabilities of generative AI.
△ Less
Submitted 24 April, 2025;
originally announced April 2025.
-
Learning Switchable Priors for Neural Image Compression
Authors:
Haotian Zhang,
Yuqi Li,
Li Li,
Dong Liu
Abstract:
Neural image compression (NIC) usually adopts a predefined family of probabilistic distributions as the prior of the latent variables, and meanwhile relies on entropy models to estimate the parameters for the probabilistic family. More complex probabilistic distributions may fit the latent variables more accurately, but also incur higher complexity of the entropy models, limiting their practical v…
▽ More
Neural image compression (NIC) usually adopts a predefined family of probabilistic distributions as the prior of the latent variables, and meanwhile relies on entropy models to estimate the parameters for the probabilistic family. More complex probabilistic distributions may fit the latent variables more accurately, but also incur higher complexity of the entropy models, limiting their practical value. To address this dilemma, we propose a solution to decouple the entropy model complexity from the prior distributions. We use a finite set of trainable priors that correspond to samples of the parametric probabilistic distributions. We train the entropy model to predict the index of the appropriate prior within the set, rather than the specific parameters. Switching between the trained priors further enables us to embrace a skip mode into the prior set, which simply omits a latent variable during the entropy coding. To demonstrate the practical value of our solution, we present a lightweight NIC model, namely FastNIC, together with the learning of switchable priors. FastNIC obtains a better trade-off between compression efficiency and computational complexity for neural image compression. We also implanted the switchable priors into state-of-the-art NIC models and observed improved compression efficiency with a significant reduction of entropy coding complexity.
△ Less
Submitted 23 April, 2025;
originally announced April 2025.
-
Tinkering Against Scaling
Authors:
Bolun Zhang,
Yang Shen,
Linzhuo Li,
Yu Ji,
Di Wu,
Tongyu Wu,
Lianghao Dai
Abstract:
The ascent of scaling in artificial intelligence research has revolutionized the field over the past decade, yet it presents significant challenges for academic researchers, particularly in computational social science and critical algorithm studies. The dominance of large language models, characterized by their extensive parameters and costly training processes, creates a disparity where only ind…
▽ More
The ascent of scaling in artificial intelligence research has revolutionized the field over the past decade, yet it presents significant challenges for academic researchers, particularly in computational social science and critical algorithm studies. The dominance of large language models, characterized by their extensive parameters and costly training processes, creates a disparity where only industry-affiliated researchers can access these resources. This imbalance restricts academic researchers from fully understanding their tools, leading to issues like reproducibility in computational social science and a reliance on black-box metaphors in critical studies.
To address these challenges, we propose a "tinkering" approach that is inspired by existing works. This method involves engaging with smaller models or components that are manageable for ordinary researchers, fostering hands-on interaction with algorithms. We argue that tinkering is both a way of making and knowing for computational social science and a way of knowing for critical studies, and fundamentally, it is a way of caring that has broader implications for both fields.
△ Less
Submitted 23 April, 2025;
originally announced April 2025.
-
Investigating LLMs in Clinical Triage: Promising Capabilities, Persistent Intersectional Biases
Authors:
Joseph Lee,
Tianqi Shang,
Jae Young Baik,
Duy Duong-Tran,
Shu Yang,
Lingyao Li,
Li Shen
Abstract:
Large Language Models (LLMs) have shown promise in clinical decision support, yet their application to triage remains underexplored. We systematically investigate the capabilities of LLMs in emergency department triage through two key dimensions: (1) robustness to distribution shifts and missing data, and (2) counterfactual analysis of intersectional biases across sex and race. We assess multiple…
▽ More
Large Language Models (LLMs) have shown promise in clinical decision support, yet their application to triage remains underexplored. We systematically investigate the capabilities of LLMs in emergency department triage through two key dimensions: (1) robustness to distribution shifts and missing data, and (2) counterfactual analysis of intersectional biases across sex and race. We assess multiple LLM-based approaches, ranging from continued pre-training to in-context learning, as well as machine learning approaches. Our results indicate that LLMs exhibit superior robustness, and we investigate the key factors contributing to the promising LLM-based approaches. Furthermore, in this setting, we identify gaps in LLM preferences that emerge in particular intersections of sex and race. LLMs generally exhibit sex-based differences, but they are most pronounced in certain racial groups. These findings suggest that LLMs encode demographic preferences that may emerge in specific clinical contexts or particular combinations of characteristics.
△ Less
Submitted 22 April, 2025;
originally announced April 2025.
-
EFFACT: A Highly Efficient Full-Stack FHE Acceleration Platform
Authors:
Yi Huang,
Xinsheng Gong,
Xiangyu Kong,
Dibei Chen,
Jianfeng Zhu,
Wenping Zhu,
Liangwei Li,
Mingyu Gao,
Shaojun Wei,
Aoyang Zhang,
Leibo Liu
Abstract:
Fully Homomorphic Encryption (FHE) is a set of powerful cryptographic schemes that allows computation to be performed directly on encrypted data with an unlimited depth. Despite FHE's promising in privacy-preserving computing, yet in most FHE schemes, ciphertext generally blows up thousands of times compared to the original message, and the massive amount of data load from off-chip memory for boot…
▽ More
Fully Homomorphic Encryption (FHE) is a set of powerful cryptographic schemes that allows computation to be performed directly on encrypted data with an unlimited depth. Despite FHE's promising in privacy-preserving computing, yet in most FHE schemes, ciphertext generally blows up thousands of times compared to the original message, and the massive amount of data load from off-chip memory for bootstrapping and privacy-preserving machine learning applications (such as HELR, ResNet-20), both degrade the performance of FHE-based computation. Several hardware designs have been proposed to address this issue, however, most of them require enormous resources and power. An acceleration platform with easy programmability, high efficiency, and low overhead is a prerequisite for practical application.
This paper proposes EFFACT, a highly efficient full-stack FHE acceleration platform with a compiler that provides comprehensive optimizations and vector-friendly hardware. We start by examining the computational overhead across different real-world benchmarks to highlight the potential benefits of reallocating computing resources for efficiency enhancement. Then we make a design space exploration to find an optimal SRAM size with high utilization and low cost. On the other hand, EFFACT features a novel optimization named streaming memory access which is proposed to enable high throughput with limited SRAMs. Regarding the software-side optimization, we also propose a circuit-level function unit reuse scheme, to substantially reduce the computing resources without performance degradation. Moreover, we design novel NTT and automorphism units that are suitable for a cost-sensitive and highly efficient architecture, leading to low area. For generality, EFFACT is also equipped with an ISA and a compiler backend that can support several FHE schemes like CKKS, BGV, and BFV.
△ Less
Submitted 22 April, 2025;
originally announced April 2025.
-
SeaLLM: Service-Aware and Latency-Optimized Resource Sharing for Large Language Model Inference
Authors:
Yihao Zhao,
Jiadun Chen,
Peng Sun,
Lei Li,
Xuanzhe Liu,
Xin Jin
Abstract:
Large language models (LLMs) with different architectures and sizes have been developed. Serving each LLM with dedicated GPUs leads to resource waste and service inefficiency due to the varying demand of LLM requests. A common practice is to share multiple LLMs. However, existing sharing systems either do not consider the autoregressive pattern of LLM services, or only focus on improving the throu…
▽ More
Large language models (LLMs) with different architectures and sizes have been developed. Serving each LLM with dedicated GPUs leads to resource waste and service inefficiency due to the varying demand of LLM requests. A common practice is to share multiple LLMs. However, existing sharing systems either do not consider the autoregressive pattern of LLM services, or only focus on improving the throughput, which impairs the sharing performance, especially the serving latency. We present SeaLLM, which enables service-aware and latency-optimized LLM sharing. SeaLLM improves the overall sharing performance by (1) a latency-optimized scheduling algorithm utilizing the characteristics of LLM services, (2) a placement algorithm to determine the placement plan and an adaptive replacement algorithm to decide the replacement interval, and (3) a unified key-value cache to share GPU memory among LLM services efficiently. Our evaluation under real-world traces and LLM services demonstrates that SeaLLM improves the normalized latency by up to $13.60\times$, the tail latency by up to $18.69\times$, and the SLO attainment by up to $3.64\times$ compared to existing solutions.
△ Less
Submitted 22 April, 2025;
originally announced April 2025.
-
Vidi: Large Multimodal Models for Video Understanding and Editing
Authors:
Vidi Team,
Celong Liu,
Chia-Wen Kuo,
Dawei Du,
Fan Chen,
Guang Chen,
Jiamin Yuan,
Lingxi Zhang,
Lu Guo,
Lusha Li,
Longyin Wen,
Qingyu Chen,
Rachel Deng,
Sijie Zhu,
Stuart Siew,
Tong Jin,
Wei Lu,
Wen Zhong,
Xiaohui Shen,
Xin Gu,
Xing Mei,
Xueqiong Qu
Abstract:
Humans naturally share information with those they are connected to, and video has become one of the dominant mediums for communication and expression on the Internet. To support the creation of high-quality large-scale video content, a modern pipeline requires a comprehensive understanding of both the raw input materials (e.g., the unedited footage captured by cameras) and the editing components…
▽ More
Humans naturally share information with those they are connected to, and video has become one of the dominant mediums for communication and expression on the Internet. To support the creation of high-quality large-scale video content, a modern pipeline requires a comprehensive understanding of both the raw input materials (e.g., the unedited footage captured by cameras) and the editing components (e.g., visual effects). In video editing scenarios, models must process multiple modalities (e.g., vision, audio, text) with strong background knowledge and handle flexible input lengths (e.g., hour-long raw videos), which poses significant challenges for traditional models. In this report, we introduce Vidi, a family of Large Multimodal Models (LMMs) for a wide range of video understand editing scenarios. The first release focuses on temporal retrieval, i.e., identifying the time ranges within the input videos corresponding to a given text query, which plays a critical role in intelligent editing. The model is capable of processing hour-long videos with strong temporal understanding capability, e.g., retrieve time ranges for certain queries. To support a comprehensive evaluation in real-world scenarios, we also present the VUE-TR benchmark, which introduces five key advancements. 1) Video duration: significantly longer than videos of existing temporal retrival datasets, 2) Audio support: includes audio-based queries, 3) Query format: diverse query lengths/formats, 4) Annotation quality: ground-truth time ranges are manually annotated. 5) Evaluation metric: a refined IoU metric to support evaluation over multiple time ranges. Remarkably, Vidi significantly outperforms leading proprietary models, e.g., GPT-4o and Gemini, on the temporal retrieval task, indicating its superiority in video editing scenarios.
△ Less
Submitted 24 April, 2025; v1 submitted 22 April, 2025;
originally announced April 2025.
-
IPBench: Benchmarking the Knowledge of Large Language Models in Intellectual Property
Authors:
Qiyao Wang,
Guhong Chen,
Hongbo Wang,
Huaren Liu,
Minghui Zhu,
Zhifei Qin,
Linwei Li,
Yilin Yue,
Shiqiang Wang,
Jiayan Li,
Yihang Wu,
Ziqiang Liu,
Longze Chen,
Run Luo,
Liyang Fan,
Jiaming Li,
Lei Zhang,
Kan Xu,
Hongfei Lin,
Hamid Alinejad-Rokny,
Shiwen Ni,
Yuan Lin,
Min Yang
Abstract:
Intellectual Property (IP) is a unique domain that integrates technical and legal knowledge, making it inherently complex and knowledge-intensive. As large language models (LLMs) continue to advance, they show great potential for processing IP tasks, enabling more efficient analysis, understanding, and generation of IP-related content. However, existing datasets and benchmarks either focus narrowl…
▽ More
Intellectual Property (IP) is a unique domain that integrates technical and legal knowledge, making it inherently complex and knowledge-intensive. As large language models (LLMs) continue to advance, they show great potential for processing IP tasks, enabling more efficient analysis, understanding, and generation of IP-related content. However, existing datasets and benchmarks either focus narrowly on patents or cover limited aspects of the IP field, lacking alignment with real-world scenarios. To bridge this gap, we introduce the first comprehensive IP task taxonomy and a large, diverse bilingual benchmark, IPBench, covering 8 IP mechanisms and 20 tasks. This benchmark is designed to evaluate LLMs in real-world intellectual property applications, encompassing both understanding and generation. We benchmark 16 LLMs, ranging from general-purpose to domain-specific models, and find that even the best-performing model achieves only 75.8% accuracy, revealing substantial room for improvement. Notably, open-source IP and law-oriented models lag behind closed-source general-purpose models. We publicly release all data and code of IPBench and will continue to update it with additional IP-related tasks to better reflect real-world challenges in the intellectual property domain.
△ Less
Submitted 21 April, 2025;
originally announced April 2025.
-
Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning
Authors:
Jie Cheng,
Ruixi Qiao,
Lijun Li,
Chao Guo,
Junle Wang,
Gang Xiong,
Yisheng Lv,
Fei-Yue Wang
Abstract:
Process reward models (PRMs) have proven effective for test-time scaling of Large Language Models (LLMs) on challenging reasoning tasks. However, reward hacking issues with PRMs limit their successful application in reinforcement fine-tuning. In this paper, we identify the main cause of PRM-induced reward hacking: the canonical summation-form credit assignment in reinforcement learning (RL), which…
▽ More
Process reward models (PRMs) have proven effective for test-time scaling of Large Language Models (LLMs) on challenging reasoning tasks. However, reward hacking issues with PRMs limit their successful application in reinforcement fine-tuning. In this paper, we identify the main cause of PRM-induced reward hacking: the canonical summation-form credit assignment in reinforcement learning (RL), which defines the value as cumulative gamma-decayed future rewards, easily induces LLMs to hack steps with high rewards. To address this, we propose PURE: Process sUpervised Reinforcement lEarning. The key innovation of PURE is a min-form credit assignment that formulates the value function as the minimum of future rewards. This method significantly alleviates reward hacking by limiting the value function range and distributing advantages more reasonably. Through extensive experiments on 3 base models, we show that PRM-based approaches enabling min-form credit assignment achieve comparable reasoning performance to verifiable reward-based methods within only 30% steps. In contrast, the canonical sum-form credit assignment collapses training even at the beginning! Additionally, when we supplement PRM-based fine-tuning with just 10% verifiable rewards, we further alleviate reward hacking and produce the best fine-tuned model based on Qwen2.5-Math-7B in our experiments, achieving 82.5% accuracy on AMC23 and 53.3% average accuracy across 5 benchmarks. Moreover, we summarize the observed reward hacking cases and analyze the causes of training collapse. Code and models are available at https://github.com/CJReinforce/PURE.
△ Less
Submitted 21 April, 2025;
originally announced April 2025.
-
Breast density in MRI: an AI-based quantification and relationship to assessment in mammography
Authors:
Yaqian Chen,
Lin Li,
Hanxue Gu,
Haoyu Dong,
Derek L. Nguyen,
Allan D. Kirk,
Maciej A. Mazurowski,
E. Shelley Hwang
Abstract:
Mammographic breast density is a well-established risk factor for breast cancer. Recently there has been interest in breast MRI as an adjunct to mammography, as this modality provides an orthogonal and highly quantitative assessment of breast tissue. However, its 3D nature poses analytic challenges related to delineating and aggregating complex structures across slices. Here, we applied an in-hous…
▽ More
Mammographic breast density is a well-established risk factor for breast cancer. Recently there has been interest in breast MRI as an adjunct to mammography, as this modality provides an orthogonal and highly quantitative assessment of breast tissue. However, its 3D nature poses analytic challenges related to delineating and aggregating complex structures across slices. Here, we applied an in-house machine-learning algorithm to assess breast density on normal breasts in three MRI datasets. Breast density was consistent across different datasets (0.104 - 0.114). Analysis across different age groups also demonstrated strong consistency across datasets and confirmed a trend of decreasing density with age as reported in previous studies. MR breast density was correlated with mammographic breast density, although some notable differences suggest that certain breast density components are captured only on MRI. Future work will determine how to integrate MR breast density with current tools to improve future breast cancer risk prediction.
△ Less
Submitted 21 April, 2025;
originally announced April 2025.
-
Relation-R1: Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relational Comprehension
Authors:
Lin Li,
Wei Chen,
Jiahui Li,
Long Chen
Abstract:
Recent advances in multi-modal large language models (MLLMs) have significantly improved object-level grounding and region captioning, but remain limited in visual relation understanding (\eg, scene graph generation), particularly in modeling \textit{N}-ary relationships that identify multiple semantic roles among an action event. Such a lack of \textit{semantic dependencies} modeling among multi-…
▽ More
Recent advances in multi-modal large language models (MLLMs) have significantly improved object-level grounding and region captioning, but remain limited in visual relation understanding (\eg, scene graph generation), particularly in modeling \textit{N}-ary relationships that identify multiple semantic roles among an action event. Such a lack of \textit{semantic dependencies} modeling among multi-entities leads to unreliable outputs, intensifying MLLMs' hallucinations and over-reliance on language priors. To this end, we propose Relation-R1, the first unified relational comprehension framework that explicitly integrates cognitive chain-of-thought (CoT)-guided Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) within a reinforcement learning (RL) paradigm. Specifically, we first establish foundational reasoning capabilities via SFT, enforcing structured outputs with thinking processes. Then, GRPO is utilized to refine these outputs via multi-reward optimization, prioritizing visual-semantic grounding over language-induced biases, thereby improving generalization capability. Extensive experiments on widely-used PSG and SWiG datasets demonstrate that Relation-R1 achieves state-of-the-art performance in both binary and \textit{N}-ary relation understanding.
△ Less
Submitted 20 April, 2025;
originally announced April 2025.
-
UFO2: The Desktop AgentOS
Authors:
Chaoyun Zhang,
He Huang,
Chiming Ni,
Jian Mu,
Si Qin,
Shilin He,
Lu Wang,
Fangkai Yang,
Pu Zhao,
Chao Du,
Liqun Li,
Yu Kang,
Zhao Jiang,
Suzhen Zheng,
Rujia Wang,
Jiaxu Qian,
Minghua Ma,
Jian-Guang Lou,
Qingwei Lin,
Saravan Rajmohan,
Dongmei Zhang
Abstract:
Recent Computer-Using Agents (CUAs), powered by multimodal large language models (LLMs), offer a promising direction for automating complex desktop workflows through natural language. However, most existing CUAs remain conceptual prototypes, hindered by shallow OS integration, fragile screenshot-based interaction, and disruptive execution.
We present UFO2, a multiagent AgentOS for Windows deskto…
▽ More
Recent Computer-Using Agents (CUAs), powered by multimodal large language models (LLMs), offer a promising direction for automating complex desktop workflows through natural language. However, most existing CUAs remain conceptual prototypes, hindered by shallow OS integration, fragile screenshot-based interaction, and disruptive execution.
We present UFO2, a multiagent AgentOS for Windows desktops that elevates CUAs into practical, system-level automation. UFO2 features a centralized HostAgent for task decomposition and coordination, alongside a collection of application-specialized AppAgent equipped with native APIs, domain-specific knowledge, and a unified GUI--API action layer. This architecture enables robust task execution while preserving modularity and extensibility. A hybrid control detection pipeline fuses Windows UI Automation (UIA) with vision-based parsing to support diverse interface styles. Runtime efficiency is further enhanced through speculative multi-action planning, reducing per-step LLM overhead. Finally, a Picture-in-Picture (PiP) interface enables automation within an isolated virtual desktop, allowing agents and users to operate concurrently without interference.
We evaluate UFO2 across over 20 real-world Windows applications, demonstrating substantial improvements in robustness and execution accuracy over prior CUAs. Our results show that deep OS integration unlocks a scalable path toward reliable, user-aligned desktop automation.
△ Less
Submitted 25 April, 2025; v1 submitted 20 April, 2025;
originally announced April 2025.
-
NTIRE 2025 Challenge on Image Super-Resolution ($\times$4): Methods and Results
Authors:
Zheng Chen,
Kai Liu,
Jue Gong,
Jingkai Wang,
Lei Sun,
Zongwei Wu,
Radu Timofte,
Yulun Zhang,
Xiangyu Kong,
Xiaoxuan Yu,
Hyunhee Park,
Suejin Han,
Hakjae Jeon,
Dafeng Zhang,
Hyung-Ju Chun,
Donghun Ryou,
Inju Ha,
Bohyung Han,
Lu Zhao,
Yuyi Zhang,
Pengyu Yan,
Jiawei Hu,
Pengwei Liu,
Fengjun Guo,
Hongyuan Yu
, et al. (86 additional authors not shown)
Abstract:
This paper presents the NTIRE 2025 image super-resolution ($\times$4) challenge, one of the associated competitions of the 10th NTIRE Workshop at CVPR 2025. The challenge aims to recover high-resolution (HR) images from low-resolution (LR) counterparts generated through bicubic downsampling with a $\times$4 scaling factor. The objective is to develop effective network designs or solutions that ach…
▽ More
This paper presents the NTIRE 2025 image super-resolution ($\times$4) challenge, one of the associated competitions of the 10th NTIRE Workshop at CVPR 2025. The challenge aims to recover high-resolution (HR) images from low-resolution (LR) counterparts generated through bicubic downsampling with a $\times$4 scaling factor. The objective is to develop effective network designs or solutions that achieve state-of-the-art SR performance. To reflect the dual objectives of image SR research, the challenge includes two sub-tracks: (1) a restoration track, emphasizes pixel-wise accuracy and ranks submissions based on PSNR; (2) a perceptual track, focuses on visual realism and ranks results by a perceptual score. A total of 286 participants registered for the competition, with 25 teams submitting valid entries. This report summarizes the challenge design, datasets, evaluation protocol, the main results, and methods of each team. The challenge serves as a benchmark to advance the state of the art and foster progress in image SR.
△ Less
Submitted 28 April, 2025; v1 submitted 20 April, 2025;
originally announced April 2025.
-
ReasoningV: Efficient Verilog Code Generation with Adaptive Hybrid Reasoning Model
Authors:
Haiyan Qin,
Zhiwei Xie,
Jingjing Li,
Liangchen Li,
Xiaotong Feng,
Junzhan Liu,
Wang Kang
Abstract:
Large Language Models (LLMs) have advanced Verilog code generation significantly, yet face challenges in data quality, reasoning capabilities, and computational efficiency. This paper presents ReasoningV, a novel model employing a hybrid reasoning strategy that integrates trained intrinsic capabilities with dynamic inference adaptation for Verilog code generation. Our framework introduces three co…
▽ More
Large Language Models (LLMs) have advanced Verilog code generation significantly, yet face challenges in data quality, reasoning capabilities, and computational efficiency. This paper presents ReasoningV, a novel model employing a hybrid reasoning strategy that integrates trained intrinsic capabilities with dynamic inference adaptation for Verilog code generation. Our framework introduces three complementary innovations: (1) ReasoningV-5K, a high-quality dataset of 5,000 functionally verified instances with reasoning paths created through multi-dimensional filtering of PyraNet samples; (2) a two-stage training approach combining parameter-efficient fine-tuning for foundational knowledge with full-parameter optimization for enhanced reasoning; and (3) an adaptive reasoning mechanism that dynamically adjusts reasoning depth based on problem complexity, reducing token consumption by up to 75\% while preserving performance. Experimental results demonstrate ReasoningV's effectiveness with a pass@1 accuracy of 57.8\% on VerilogEval-human, achieving performance competitive with leading commercial models like Gemini-2.0-flash (59.5\%) and exceeding the previous best open-source model by 10.4 percentage points. ReasoningV offers a more reliable and accessible pathway for advancing AI-driven hardware design automation, with our model, data, and code available at https://github.com/BUAA-CLab/ReasoningV.
△ Less
Submitted 30 April, 2025; v1 submitted 20 April, 2025;
originally announced April 2025.
-
Revisiting the field normalization approaches/practices
Authors:
Xinyue Lu,
Li Li,
Zhesi Shen
Abstract:
Field normalization plays a crucial role in scientometrics to ensure fair comparisons across different disciplines. In this paper, we revisit the effectiveness of several widely used field normalization methods. Our findings indicate that source-side normalization (as employed in SNIP) does not fully eliminate citation bias across different fields and the imbalanced paper growth rates across field…
▽ More
Field normalization plays a crucial role in scientometrics to ensure fair comparisons across different disciplines. In this paper, we revisit the effectiveness of several widely used field normalization methods. Our findings indicate that source-side normalization (as employed in SNIP) does not fully eliminate citation bias across different fields and the imbalanced paper growth rates across fields are a key factor for this phenomenon. To address the issue of skewness, logarithmic transformation has been applied. Recently, a combination of logarithmic transformation and mean-based normalization, expressed as ln(c+1)/mu, has gained popularity. However, our analysis shows that this approach does not yield satisfactory results. Instead, we find that combining logarithmic transformation (ln(c+1)) with z-score normalization provides a better alternative. Furthermore, our study suggests that the better performance is achieved when combining both source-side and target-side field normalization methods.
△ Less
Submitted 20 April, 2025;
originally announced April 2025.
-
Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning
Authors:
ByteDance Seed,
:,
Jiaze Chen,
Tiantian Fan,
Xin Liu,
Lingjun Liu,
Zhiqi Lin,
Mingxuan Wang,
Chengyi Wang,
Xiangpeng Wei,
Wenyuan Xu,
Yufeng Yuan,
Yu Yue,
Lin Yan,
Qiying Yu,
Xiaochen Zuo,
Chi Zhang,
Ruofei Zhu,
Zhecheng An,
Zhihao Bai,
Yu Bao,
Xingyan Bin,
Jiangjie Chen,
Feng Chen,
Hongmin Chen
, et al. (249 additional authors not shown)
Abstract:
We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For in…
▽ More
We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For instance, it surpasses DeepSeek R1 by 8% in win rate on non-reasoning tasks, indicating its broader applicability. Compared to other state-of-the-art reasoning models, Seed1.5-Thinking is a Mixture-of-Experts (MoE) model with a relatively small size, featuring 20B activated and 200B total parameters. As part of our effort to assess generalized reasoning, we develop two internal benchmarks, BeyondAIME and Codeforces, both of which will be publicly released to support future research. Model trial link: https://www.volcengine.com/experience/ark.
△ Less
Submitted 29 April, 2025; v1 submitted 10 April, 2025;
originally announced April 2025.
-
Optimizing Electric Vehicle Charging Station Locations: A Data-driven System with Multi-source Fusion
Authors:
Lihuan Li,
Du Yin,
Hao Xue,
David Lillo-Trynes,
Flora Salim
Abstract:
With the growing electric vehicles (EVs) charging demand, urban planners face the challenges of providing charging infrastructure at optimal locations. For example, range anxiety during long-distance travel and the inadequate distribution of residential charging stations are the major issues many cities face. To achieve reasonable estimation and deployment of the charging demand, we develop a data…
▽ More
With the growing electric vehicles (EVs) charging demand, urban planners face the challenges of providing charging infrastructure at optimal locations. For example, range anxiety during long-distance travel and the inadequate distribution of residential charging stations are the major issues many cities face. To achieve reasonable estimation and deployment of the charging demand, we develop a data-driven system based on existing EV trips in New South Wales (NSW) state, Australia, incorporating multiple factors that enhance the geographical feasibility of recommended charging stations. Our system integrates data sources including EV trip data, geographical data such as route data and Local Government Area (LGA) boundaries, as well as features like fire and flood risks, and Points of Interest (POIs). We visualize our results to intuitively demonstrate the findings from our data-driven, multi-source fusion system, and evaluate them through case studies. The outcome of this work can provide a platform for discussion to develop new insights that could be used to give guidance on where to position future EV charging stations.
△ Less
Submitted 18 April, 2025;
originally announced April 2025.