-
ContextPRM: Leveraging Contextual Coherence for multi-domain Test-Time Scaling
Authors:
Haotian Zhang,
Liu Liu,
Baosheng Yu,
Jiayan Qiu,
Likang Xiao,
Yanwei Ren,
Quan Chen,
Xianglong Liu
Abstract:
Process reward models (PRMs) have demonstrated significant efficacy in enhancing the mathematical reasoning capabilities of large language models (LLMs) by leveraging test-time scaling (TTS). However, while most PRMs exhibit substantial gains in mathematical domains, the scarcity of domain-specific training data and knowledge-based learning patterns limits their generalization ability when faced w…
▽ More
Process reward models (PRMs) have demonstrated significant efficacy in enhancing the mathematical reasoning capabilities of large language models (LLMs) by leveraging test-time scaling (TTS). However, while most PRMs exhibit substantial gains in mathematical domains, the scarcity of domain-specific training data and knowledge-based learning patterns limits their generalization ability when faced with other domains. To address this limitation, we shift the learning objective from verifying domain-specific knowledge to modeling domain-agnostic logical flow. Centering on contextual coherence between chain-of-thought (CoT) steps, our approach is realized through a novel data annotation and training framework, which enhances the model's generalization capabilities across diverse domains. For instance, our resulting model, ContextPRM, achieves a notable 6.5% average accuracy improvement over the majority voting baseline via weighted majority voting across nine non-mathematical domains in MMLU-Pro, including law, history, and philosophy, significantly surpassing the 2.2% improvement from VersaPRM and 0.5% gains from other mathematics-focused PRMs, demonstrating consistent performance across both mathematical and non-mathematical domains.
△ Less
Submitted 29 September, 2025;
originally announced September 2025.
-
VeriLLM: A Lightweight Framework for Publicly Verifiable Decentralized Inference
Authors:
Ke Wang,
Felix Qu,
Libin Xia,
Zishuo Zhao,
Chris Tong,
Lynn Ai,
Eric Yang
Abstract:
Decentralized inference is an appealing paradigm for serving large language models (LLMs), offering strong security, high efficiency, and lower operating costs. Yet the permissionless setting admits no a priori trust in participating nodes, making output verifiability a prerequisite for secure deployment. We present VeriLLM, a publicly verifiable protocol for decentralized LLM inference that (i) a…
▽ More
Decentralized inference is an appealing paradigm for serving large language models (LLMs), offering strong security, high efficiency, and lower operating costs. Yet the permissionless setting admits no a priori trust in participating nodes, making output verifiability a prerequisite for secure deployment. We present VeriLLM, a publicly verifiable protocol for decentralized LLM inference that (i) achieves security under a one-honest-verifier assumption, (ii) attains near-negligible verification cost (about 1% of the underlying inference) via a lightweight verification algorithm designed explicitly for LLMs, and (iii) enforces honest checking through a peer-prediction mechanism that mitigates lazy verification in naive voting. We further introduce an isomorphic inference-verification network that multiplexes both roles on the same set of GPU workers. This architecture (i) increases GPU utilization and thereby improves end-to-end throughput for both inference and verification, (ii) expands the effective pool of available validators, strengthening robustness and security, and (iii) enforces task indistinguishability at the worker boundary to prevent job-type-conditioned behavior. Finally, we provide a formal game-theoretic analysis and prove that, under our incentives, honest inference and verification constitute a Nash equilibrium, ensuring incentive compatibility against rational adversaries. To our knowledge, this is the first decentralized inference verification protocol with an end-to-end game-theoretic security proof.
△ Less
Submitted 29 September, 2025;
originally announced September 2025.
-
Easy Turn: Integrating Acoustic and Linguistic Modalities for Robust Turn-Taking in Full-Duplex Spoken Dialogue Systems
Authors:
Guojian Li,
Chengyou Wang,
Hongfei Xue,
Shuiyuan Wang,
Dehui Gao,
Zihan Zhang,
Yuke Lin,
Wenjie Li,
Longshuai Xiao,
Zhonghua Fu,
Lei Xie
Abstract:
Full-duplex interaction is crucial for natural human-machine communication, yet remains challenging as it requires robust turn-taking detection to decide when the system should speak, listen, or remain silent. Existing solutions either rely on dedicated turn-taking models, most of which are not open-sourced. The few available ones are limited by their large parameter size or by supporting only a s…
▽ More
Full-duplex interaction is crucial for natural human-machine communication, yet remains challenging as it requires robust turn-taking detection to decide when the system should speak, listen, or remain silent. Existing solutions either rely on dedicated turn-taking models, most of which are not open-sourced. The few available ones are limited by their large parameter size or by supporting only a single modality, such as acoustic or linguistic. Alternatively, some approaches finetune LLM backbones to enable full-duplex capability, but this requires large amounts of full-duplex data, which remain scarce in open-source form. To address these issues, we propose Easy Turn, an open-source, modular turn-taking detection model that integrates acoustic and linguistic bimodal information to predict four dialogue turn states: complete, incomplete, backchannel, and wait, accompanied by the release of Easy Turn trainset, a 1,145-hour speech dataset designed for training turn-taking detection models. Compared to existing open-source models like TEN Turn Detection and Smart Turn V2, our model achieves state-of-the-art turn-taking detection accuracy on our open-source Easy Turn testset. The data and model will be made publicly available on GitHub.
△ Less
Submitted 28 September, 2025;
originally announced September 2025.
-
Sparse-Up: Learnable Sparse Upsampling for 3D Generation with High-Fidelity Textures
Authors:
Lu Xiao,
Jiale Zhang,
Yang Liu,
Taicheng Huang,
Xin Tian
Abstract:
The creation of high-fidelity 3D assets is often hindered by a 'pixel-level pain point': the loss of high-frequency details. Existing methods often trade off one aspect for another: either sacrificing cross-view consistency, resulting in torn or drifting textures, or remaining trapped by the resolution ceiling of explicit voxels, forfeiting fine texture detail. In this work, we propose Sparse-Up,…
▽ More
The creation of high-fidelity 3D assets is often hindered by a 'pixel-level pain point': the loss of high-frequency details. Existing methods often trade off one aspect for another: either sacrificing cross-view consistency, resulting in torn or drifting textures, or remaining trapped by the resolution ceiling of explicit voxels, forfeiting fine texture detail. In this work, we propose Sparse-Up, a memory-efficient, high-fidelity texture modeling framework that effectively preserves high-frequency details. We use sparse voxels to guide texture reconstruction and ensure multi-view consistency, while leveraging surface anchoring and view-domain partitioning to break through resolution constraints. Surface anchoring employs a learnable upsampling strategy to constrain voxels to the mesh surface, eliminating over 70% of redundant voxels present in traditional voxel upsampling. View-domain partitioning introduces an image patch-guided voxel partitioning scheme, supervising and back-propagating gradients only on visible local patches. Through these two strategies, we can significantly reduce memory consumption during high-resolution voxel training without sacrificing geometric consistency, while preserving high-frequency details in textures.
△ Less
Submitted 28 September, 2025;
originally announced September 2025.
-
DentVLM: A Multimodal Vision-Language Model for Comprehensive Dental Diagnosis and Enhanced Clinical Practice
Authors:
Zijie Meng,
Jin Hao,
Xiwei Dai,
Yang Feng,
Jiaxiang Liu,
Bin Feng,
Huikai Wu,
Xiaotang Gai,
Hengchuan Zhu,
Tianxiang Hu,
Yangyang Wu,
Hongxia Xu,
Jin Li,
Jun Xiao,
Xiaoqiang Liu,
Joey Tianyi Zhou,
Fudong Zhu,
Zhihe Zhao,
Lunguo Xia,
Bing Fang,
Jimeng Sun,
Jian Wu,
Zuozhu Liu
Abstract:
Diagnosing and managing oral diseases necessitate advanced visual interpretation across diverse imaging modalities and integrated information synthesis. While current AI models excel at isolated tasks, they often fall short in addressing the complex, multimodal requirements of comprehensive clinical dental practice. Here we introduce DentVLM, a multimodal vision-language model engineered for exper…
▽ More
Diagnosing and managing oral diseases necessitate advanced visual interpretation across diverse imaging modalities and integrated information synthesis. While current AI models excel at isolated tasks, they often fall short in addressing the complex, multimodal requirements of comprehensive clinical dental practice. Here we introduce DentVLM, a multimodal vision-language model engineered for expert-level oral disease diagnosis. DentVLM was developed using a comprehensive, large-scale, bilingual dataset of 110,447 images and 2.46 million visual question-answering (VQA) pairs. The model is capable of interpreting seven 2D oral imaging modalities across 36 diagnostic tasks, significantly outperforming leading proprietary and open-source models by 19.6% higher accuracy for oral diseases and 27.9% for malocclusions. In a clinical study involving 25 dentists, evaluating 1,946 patients and encompassing 3,105 QA pairs, DentVLM surpassed the diagnostic performance of 13 junior dentists on 21 of 36 tasks and exceeded that of 12 senior dentists on 12 of 36 tasks. When integrated into a collaborative workflow, DentVLM elevated junior dentists' performance to senior levels and reduced diagnostic time for all practitioners by 15-22%. Furthermore, DentVLM exhibited promising performance across three practical utility scenarios, including home-based dental health management, hospital-based intelligent diagnosis and multi-agent collaborative interaction. These findings establish DentVLM as a robust clinical decision support tool, poised to enhance primary dental care, mitigate provider-patient imbalances, and democratize access to specialized medical expertise within the field of dentistry.
△ Less
Submitted 27 September, 2025;
originally announced September 2025.
-
MeanFlowSE: One-Step Generative Speech Enhancement via MeanFlow
Authors:
Yike Zhu,
Boyi Kang,
Ziqian Wang,
Xingchen Li,
Zihan Zhang,
Wenjie Li,
Longshuai Xiao,
Wei Xue,
Lei Xie
Abstract:
Speech enhancement (SE) recovers clean speech from noisy signals and is vital for applications such as telecommunications and automatic speech recognition (ASR). While generative approaches achieve strong perceptual quality, they often rely on multi-step sampling (diffusion/flow-matching) or large language models, limiting real-time deployment. To mitigate these constraints, we present MeanFlowSE,…
▽ More
Speech enhancement (SE) recovers clean speech from noisy signals and is vital for applications such as telecommunications and automatic speech recognition (ASR). While generative approaches achieve strong perceptual quality, they often rely on multi-step sampling (diffusion/flow-matching) or large language models, limiting real-time deployment. To mitigate these constraints, we present MeanFlowSE, a one-step generative SE framework. It adopts MeanFlow to predict an average-velocity field for one-step latent refinement and conditions the model on self-supervised learning (SSL) representations rather than VAE latents. This design accelerates inference and provides robust acoustic-semantic guidance during training. In the Interspeech 2020 DNS Challenge blind test set and simulated test set, MeanFlowSE attains state-of-the-art (SOTA) level perceptual quality and competitive intelligibility while significantly lowering both real-time factor (RTF) and model size compared with recent generative competitors, making it suitable for practical use. The code will be released upon publication at https://github.com/Hello3orld/MeanFlowSE.
△ Less
Submitted 30 September, 2025; v1 submitted 27 September, 2025;
originally announced September 2025.
-
Non-Linear Trajectory Modeling for Multi-Step Gradient Inversion Attacks in Federated Learning
Authors:
Li Xia,
Zheng Liu,
Sili Huang,
Wei Tang,
Xuan Liu
Abstract:
Federated Learning (FL) preserves privacy by keeping raw data local, yet Gradient Inversion Attacks (GIAs) pose significant threats. In FedAVG multi-step scenarios, attackers observe only aggregated gradients, making data reconstruction challenging. Existing surrogate model methods like SME assume linear parameter trajectories, but we demonstrate this severely underestimates SGD's nonlinear comple…
▽ More
Federated Learning (FL) preserves privacy by keeping raw data local, yet Gradient Inversion Attacks (GIAs) pose significant threats. In FedAVG multi-step scenarios, attackers observe only aggregated gradients, making data reconstruction challenging. Existing surrogate model methods like SME assume linear parameter trajectories, but we demonstrate this severely underestimates SGD's nonlinear complexity, fundamentally limiting attack effectiveness. We propose Non-Linear Surrogate Model Extension (NL-SME), the first method to introduce nonlinear parametric trajectory modeling for GIAs. Our approach replaces linear interpolation with learnable quadratic Bézier curves that capture SGD's curved characteristics through control points, combined with regularization and dvec scaling mechanisms for enhanced expressiveness. Extensive experiments on CIFAR-100 and FEMNIST datasets show NL-SME significantly outperforms baselines across all metrics, achieving order-of-magnitude improvements in cosine similarity loss while maintaining computational efficiency.This work exposes heightened privacy vulnerabilities in FL's multi-step update paradigm and offers novel perspectives for developing robust defense strategies.
△ Less
Submitted 26 September, 2025;
originally announced September 2025.
-
Discovering and Analyzing Stochastic Processes to Reduce Waste in Food Retail
Authors:
Anna Kalenkova,
Lu Xia,
Dirk Neumann
Abstract:
This paper proposes a novel method for analyzing food retail processes with a focus on reducing food waste. The approach integrates object-centric process mining (OCPM) with stochastic process discovery and analysis. First, a stochastic process in the form of a continuous-time Markov chain is discovered from grocery store sales data. This model is then extended with supply activities. Finally, a w…
▽ More
This paper proposes a novel method for analyzing food retail processes with a focus on reducing food waste. The approach integrates object-centric process mining (OCPM) with stochastic process discovery and analysis. First, a stochastic process in the form of a continuous-time Markov chain is discovered from grocery store sales data. This model is then extended with supply activities. Finally, a what-if analysis is conducted to evaluate how the quantity of products in the store evolves over time. This enables the identification of an optimal balance between customer purchasing behavior and supply strategies, helping to prevent both food waste due to oversupply and product shortages.
△ Less
Submitted 16 August, 2025;
originally announced September 2025.
-
MLF-4DRCNet: Multi-Level Fusion with 4D Radar and Camera for 3D Object Detection in Autonomous Driving
Authors:
Yuzhi Wu,
Li Xiao,
Jun Liu,
Guangfeng Jiang,
XiangGen Xia
Abstract:
The emerging 4D millimeter-wave radar, measuring the range, azimuth, elevation, and Doppler velocity of objects, is recognized for its cost-effectiveness and robustness in autonomous driving. Nevertheless, its point clouds exhibit significant sparsity and noise, restricting its standalone application in 3D object detection. Recent 4D radar-camera fusion methods have provided effective perception.…
▽ More
The emerging 4D millimeter-wave radar, measuring the range, azimuth, elevation, and Doppler velocity of objects, is recognized for its cost-effectiveness and robustness in autonomous driving. Nevertheless, its point clouds exhibit significant sparsity and noise, restricting its standalone application in 3D object detection. Recent 4D radar-camera fusion methods have provided effective perception. Most existing approaches, however, adopt explicit Bird's-Eye-View fusion paradigms originally designed for LiDAR-camera fusion, neglecting radar's inherent drawbacks. Specifically, they overlook the sparse and incomplete geometry of radar point clouds and restrict fusion to coarse scene-level integration. To address these problems, we propose MLF-4DRCNet, a novel two-stage framework for 3D object detection via multi-level fusion of 4D radar and camera images. Our model incorporates the point-, scene-, and proposal-level multi-modal information, enabling comprehensive feature representation. It comprises three crucial components: the Enhanced Radar Point Encoder (ERPE) module, the Hierarchical Scene Fusion Pooling (HSFP) module, and the Proposal-Level Fusion Enhancement (PLFE) module. Operating at the point-level, ERPE densities radar point clouds with 2D image instances and encodes them into voxels via the proposed Triple-Attention Voxel Feature Encoder. HSFP dynamically integrates multi-scale voxel features with 2D image features using deformable attention to capture scene context and adopts pooling to the fused features. PLFE refines region proposals by fusing image features, and further integrates with the pooled features from HSFP. Experimental results on the View-of-Delft (VoD) and TJ4DRadSet datasets demonstrate that MLF-4DRCNet achieves the state-of-the-art performance. Notably, it attains performance comparable to LiDAR-based models on the VoD dataset.
△ Less
Submitted 23 September, 2025;
originally announced September 2025.
-
Impact of Phonetics on Speaker Identity in Adversarial Voice Attack
Authors:
Daniyal Kabir Dar,
Qiben Yan,
Li Xiao,
Arun Ross
Abstract:
Adversarial perturbations in speech pose a serious threat to automatic speech recognition (ASR) and speaker verification by introducing subtle waveform modifications that remain imperceptible to humans but can significantly alter system outputs. While targeted attacks on end-to-end ASR models have been widely studied, the phonetic basis of these perturbations and their effect on speaker identity r…
▽ More
Adversarial perturbations in speech pose a serious threat to automatic speech recognition (ASR) and speaker verification by introducing subtle waveform modifications that remain imperceptible to humans but can significantly alter system outputs. While targeted attacks on end-to-end ASR models have been widely studied, the phonetic basis of these perturbations and their effect on speaker identity remain underexplored. In this work, we analyze adversarial audio at the phonetic level and show that perturbations exploit systematic confusions such as vowel centralization and consonant substitutions. These distortions not only mislead transcription but also degrade phonetic cues critical for speaker verification, leading to identity drift. Using DeepSpeech as our ASR target, we generate targeted adversarial examples and evaluate their impact on speaker embeddings across genuine and impostor samples. Results across 16 phonetically diverse target phrases demonstrate that adversarial audio induces both transcription errors and identity drift, highlighting the need for phonetic-aware defenses to ensure the robustness of ASR and speaker recognition systems.
△ Less
Submitted 18 September, 2025;
originally announced September 2025.
-
Summary on The Multilingual Conversational Speech Language Model Challenge: Datasets, Tasks, Baselines, and Methods
Authors:
Bingshen Mu,
Pengcheng Guo,
Zhaokai Sun,
Shuai Wang,
Hexin Liu,
Mingchen Shao,
Lei Xie,
Eng Siong Chng,
Longshuai Xiao,
Qiangze Feng,
Daliang Wang
Abstract:
This paper summarizes the Interspeech2025 Multilingual Conversational Speech Language Model (MLC-SLM) challenge, which aims to advance the exploration of building effective multilingual conversational speech LLMs (SLLMs). We provide a detailed description of the task settings for the MLC-SLM challenge, the released real-world multilingual conversational speech dataset totaling approximately 1,604…
▽ More
This paper summarizes the Interspeech2025 Multilingual Conversational Speech Language Model (MLC-SLM) challenge, which aims to advance the exploration of building effective multilingual conversational speech LLMs (SLLMs). We provide a detailed description of the task settings for the MLC-SLM challenge, the released real-world multilingual conversational speech dataset totaling approximately 1,604 hours, and the baseline systems for participants. The MLC-SLM challenge attracts 78 teams from 13 countries to participate, with 489 valid leaderboard results and 14 technical reports for the two tasks. We distill valuable insights on building multilingual conversational SLLMs based on submissions from participants, aiming to contribute to the advancement of the community.
△ Less
Submitted 17 September, 2025;
originally announced September 2025.
-
Cooperative Base Station Assignment and Resource Allocation for 6G ISAC Network
Authors:
Jiajia Liao,
Luping Xiang,
Shida Zhong,
Lixia Xiao,
Haochen Liu,
Kun Yang
Abstract:
In the upcoming 6G networks, integrated sensing and communications (ISAC) will be able to provide a performance boost in both perception and wireless connectivity. This paper considers a multiple base station (BS) architecture to support the comprehensive services of data transmission and multi-target sensing. In this context, a cooperative BS assignment and resource allocation (CBARA) strategy is…
▽ More
In the upcoming 6G networks, integrated sensing and communications (ISAC) will be able to provide a performance boost in both perception and wireless connectivity. This paper considers a multiple base station (BS) architecture to support the comprehensive services of data transmission and multi-target sensing. In this context, a cooperative BS assignment and resource allocation (CBARA) strategy is proposed in this paper, aiming at jointly optimizing the communication and sensing (C&S) performance. The posterior Cramer-Rao lower bound and the achievable rate with respect to transmit power and bandwidth are derived and utilized as optimization criteria for the CBARA scheme. We develop a heuristic alternating optimization algorithm to obtain an effective sub-optimal solution for the non-convex optimization problem caused by multiple coupled variables. Numerical results show the effectiveness of the proposed solution, which achieves a performance improvement of 117% in communication rate and 40% in sensing accuracy, compared to the classic scheme.
△ Less
Submitted 12 September, 2025;
originally announced September 2025.
-
Bregman Douglas-Rachford Splitting Method
Authors:
Shiqian Ma,
Lin Xiao,
Renbo Zhao
Abstract:
In this paper, we propose the Bregman Douglas-Rachford splitting (BDRS) method and its variant Bregman Peaceman-Rachford splitting method for solving maximal monotone inclusion problem. We show that BDRS is equivalent to a Bregman alternating direction method of multipliers (ADMM) when applied to the dual of the problem. A special case of the Bregman ADMM is an alternating direction version of the…
▽ More
In this paper, we propose the Bregman Douglas-Rachford splitting (BDRS) method and its variant Bregman Peaceman-Rachford splitting method for solving maximal monotone inclusion problem. We show that BDRS is equivalent to a Bregman alternating direction method of multipliers (ADMM) when applied to the dual of the problem. A special case of the Bregman ADMM is an alternating direction version of the exponential multiplier method. To the best of our knowledge, algorithms proposed in this paper are new to the literature. We also discuss how to use our algorithms to solve the discrete optimal transport (OT) problem. We prove the convergence of the algorithms under certain assumptions, though we point out that one assumption does not apply to the OT problem.
△ Less
Submitted 10 September, 2025;
originally announced September 2025.
-
Understanding the Video Content Creation Journey of Creators with Sensory Impairment in Kenya
Authors:
Lan Xiao,
Maryam Bandukda,
Franklin Mingzhe Li,
Mark Colley,
Catherine Holloway
Abstract:
Video content creation offers vital opportunities for expression and participation, yet remains largely inaccessible to creators with sensory impairments, especially in low-resource settings. We conducted interviews with 20 video creators with visual and hearing impairments in Kenya to examine their tools, challenges, and collaborative practices. Our findings show that accessibility barriers and i…
▽ More
Video content creation offers vital opportunities for expression and participation, yet remains largely inaccessible to creators with sensory impairments, especially in low-resource settings. We conducted interviews with 20 video creators with visual and hearing impairments in Kenya to examine their tools, challenges, and collaborative practices. Our findings show that accessibility barriers and infrastructural limitations shape video creation as a staged, collaborative process involving trusted human partners and emerging AI tools. Across workflows, creators actively negotiated agency and trust, maintaining creative control while bridging sensory gaps. We discuss the need for flexible, interdependent collaboration models, inclusive human-AI workflows, and diverse storytelling practices. This work broadens accessibility research in HCI by examining how technology and social factors intersect in low-resource contexts, suggesting ways to better support disabled creators globally.
△ Less
Submitted 9 September, 2025;
originally announced September 2025.
-
Sharpe Ratio Optimization in Markov Decision Processes
Authors:
Shuai Ma,
Guangwu Liu,
Li Xia
Abstract:
Sharpe ratio (also known as reward-to-variability ratio) is a widely-used metric in finance, which measures the additional return at the cost of per unit of increased risk (standard deviation of return). However, the optimization of Sharpe ratio in Markov decision processes (MDPs) is challenging, because there exist two difficulties hindering the application of dynamic programming. One is that dyn…
▽ More
Sharpe ratio (also known as reward-to-variability ratio) is a widely-used metric in finance, which measures the additional return at the cost of per unit of increased risk (standard deviation of return). However, the optimization of Sharpe ratio in Markov decision processes (MDPs) is challenging, because there exist two difficulties hindering the application of dynamic programming. One is that dynamic programming does not work for fractional objectives, and the other is that dynamic programming is invalid for risk metrics. In this paper, we study the Sharpe ratio optimization in infinite-horizon MDPs, considering both the long-run average and discounted settings. We address the first challenge with the Dinkelbachs transform, which converts the Sharpe ratio objective to a mean-squared-variance (M2V) objective. It is shown that the M2V optimization and the original Sharpe ratio optimization share the same optimal policy when the risk-sensitive parameter is equal to the optimal Sharpe ratio. For the second challenge, we develop an iterative algorithm to solve the M2V optimization which is similar to a mean-variance optimization in MDPs. We iteratively solve the M2V problem and obtain the associated Sharpe ratio that is used to update the risk-sensitive parameter in the next iteration of M2V problems. We show that such a sequence of Sharpe ratios derived is monotonically increasing and converges to the optimal Sharpe ratio. For both average and discounted MDP settings, we develop a policy iteration procedure and prove its convergence to the optimum. Numerical experiments are conducted for validation. To the best of our knowledge, our approach is the first that solves the Sharpe ratio optimization in MDPs with dynamic programming type algorithms. We believe that the proposed algorithm can shed light on solving MDPs with other fractional objectives.
△ Less
Submitted 31 August, 2025;
originally announced September 2025.
-
Igniting Creative Writing in Small Language Models: LLM-as-a-Judge versus Multi-Agent Refined Rewards
Authors:
Xiaolong Wei,
Bo Lu,
Xingyu Zhang,
Zhejun Zhao,
Dongdong Shen,
Long Xia,
Dawei Yin
Abstract:
Large Language Models (LLMs) have demonstrated remarkable creative writing capabilities, yet their substantial computational demands hinder widespread use. Enhancing Small Language Models (SLMs) offers a promising alternative, but current methods like Supervised Fine-Tuning (SFT) struggle with novelty, and Reinforcement Learning from Human Feedback (RLHF) is costly. This paper explores two distinc…
▽ More
Large Language Models (LLMs) have demonstrated remarkable creative writing capabilities, yet their substantial computational demands hinder widespread use. Enhancing Small Language Models (SLMs) offers a promising alternative, but current methods like Supervised Fine-Tuning (SFT) struggle with novelty, and Reinforcement Learning from Human Feedback (RLHF) is costly. This paper explores two distinct AI-driven reward strategies within a Reinforcement Learning from AI Feedback (RLAIF) framework to ignite the creative writing of a 7B-parameter SLM, specifically for generating Chinese greetings. The first strategy employs a RM trained on high-quality preference data curated by a novel multi-agent rejection sampling framework designed for creative tasks. The second, more novel strategy utilizes a principle-guided LLM-as-a-Judge, whose reward function is optimized via an adversarial training scheme with a reflection mechanism, to directly provide reward signals. Comprehensive experiments reveal that while both approaches significantly enhance creative output over baselines, the principle-guided LLM-as-a-Judge demonstrably yields superior generation quality. Furthermore, it offers notable advantages in training efficiency and reduced dependency on human-annotated data, presenting a more scalable and effective path towards creative SLMs. Our automated evaluation methods also exhibit strong alignment with human judgments. Our code and data are publicly available at https://github.com/weixiaolong94-hub/Igniting-Creative-Writing-in-Small-Language-Models.
△ Less
Submitted 29 August, 2025;
originally announced August 2025.
-
Stack Trace-Based Crash Deduplication with Transformer Adaptation
Authors:
Md Afif Al Mamun,
Gias Uddin,
Lan Xia,
Longyu Zhang
Abstract:
Automated crash reporting systems generate large volumes of duplicate reports, overwhelming issue-tracking systems and increasing developer workload. Traditional stack trace-based deduplication methods, relying on string similarity, rule-based heuristics, or deep learning (DL) models, often fail to capture the contextual and structural relationships within stack traces. We propose dedupT, a transf…
▽ More
Automated crash reporting systems generate large volumes of duplicate reports, overwhelming issue-tracking systems and increasing developer workload. Traditional stack trace-based deduplication methods, relying on string similarity, rule-based heuristics, or deep learning (DL) models, often fail to capture the contextual and structural relationships within stack traces. We propose dedupT, a transformer-based approach that models stack traces holistically rather than as isolated frames. dedupT first adapts a pretrained language model (PLM) to stack traces, then uses its embeddings to train a fully-connected network (FCN) to rank duplicate crashes effectively. Extensive experiments on real-world datasets show that dedupT outperforms existing DL and traditional methods (e.g., sequence alignment and information retrieval techniques) in both duplicate ranking and unique crash detection, significantly reducing manual triage effort. On four public datasets, dedupT improves Mean Reciprocal Rank (MRR) often by over 15% compared to the best DL baseline and up to 9% over traditional methods while achieving higher Receiver Operating Characteristic Area Under the Curve (ROC-AUC) in detecting unique crash reports. Our work advances the integration of modern natural language processing (NLP) techniques into software engineering, providing an effective solution for stack trace-based crash deduplication.
△ Less
Submitted 26 August, 2025;
originally announced August 2025.
-
TriagerX: Dual Transformers for Bug Triaging Tasks with Content and Interaction Based Rankings
Authors:
Md Afif Al Mamun,
Gias Uddin,
Lan Xia,
Longyu Zhang
Abstract:
Pretrained Language Models or PLMs are transformer-based architectures that can be used in bug triaging tasks. PLMs can better capture token semantics than traditional Machine Learning (ML) models that rely on statistical features (e.g., TF-IDF, bag of words). However, PLMs may still attend to less relevant tokens in a bug report, which can impact their effectiveness. In addition, the model can be…
▽ More
Pretrained Language Models or PLMs are transformer-based architectures that can be used in bug triaging tasks. PLMs can better capture token semantics than traditional Machine Learning (ML) models that rely on statistical features (e.g., TF-IDF, bag of words). However, PLMs may still attend to less relevant tokens in a bug report, which can impact their effectiveness. In addition, the model can be sub-optimal with its recommendations when the interaction history of developers around similar bugs is not taken into account. We designed TriagerX to address these limitations. First, to assess token semantics more reliably, we leverage a dual-transformer architecture. Unlike current state-of-the-art (SOTA) baselines that employ a single transformer architecture, TriagerX collects recommendations from two transformers with each offering recommendations via its last three layers. This setup generates a robust content-based ranking of candidate developers. TriagerX then refines this ranking by employing a novel interaction-based ranking methodology, which considers developers' historical interactions with similar fixed bugs. Across five datasets, TriagerX surpasses all nine transformer-based methods, including SOTA baselines, often improving Top-1 and Top-3 developer recommendation accuracy by over 10%. We worked with our large industry partner to successfully deploy TriagerX in their development environment. The partner required both developer and component recommendations, with components acting as proxies for team assignments-particularly useful in cases of developer turnover or team changes. We trained TriagerX on the partner's dataset for both tasks, and it outperformed SOTA baselines by up to 10% for component recommendations and 54% for developer recommendations.
△ Less
Submitted 22 August, 2025;
originally announced August 2025.
-
AdapSNE: Adaptive Fireworks-Optimized and Entropy-Guided Dataset Sampling for Edge DNN Training
Authors:
Boran Zhao,
Hetian Liu,
Zihang Yuan,
Li Zhu,
Fan Yang,
Lina Xie Tian Xia,
Wenzhe Zhao,
Pengju Ren
Abstract:
Training deep neural networks (DNNs) directly on edge devices has attracted increasing attention, as it offers promising solutions to challenges such as domain adaptation and privacy preservation. However, conventional DNN training typically requires large-scale datasets, which imposes prohibitive overhead on edge devices-particularly for emerging large language model (LLM) tasks. To address this…
▽ More
Training deep neural networks (DNNs) directly on edge devices has attracted increasing attention, as it offers promising solutions to challenges such as domain adaptation and privacy preservation. However, conventional DNN training typically requires large-scale datasets, which imposes prohibitive overhead on edge devices-particularly for emerging large language model (LLM) tasks. To address this challenge, a DNN-free method (ie., dataset sampling without DNN), named NMS (Near-Memory Sampling), has been introduced. By first conducting dimensionality reduction of the dataset and then performing exemplar sampling in the reduced space, NMS avoids the architectural bias inherent in DNN-based methods and thus achieves better generalization. However, The state-of-the-art, NMS, suffers from two limitations: (1) The mismatch between the search method and the non-monotonic property of the perplexity error function leads to the emergence of outliers in the reduced representation; (2) Key parameter (ie., target perplexity) is selected empirically, introducing arbitrariness and leading to uneven sampling. These two issues lead to representative bias of examplars, resulting in degraded accuracy. To address these issues, we propose AdapSNE, which integrates an efficient non-monotonic search method-namely, the Fireworks Algorithm (FWA)-to suppress outliers, and employs entropy-guided optimization to enforce uniform sampling, thereby ensuring representative training samples and consequently boosting training accuracy. To cut the edge-side cost arising from the iterative computations of FWA search and entropy-guided optimization, we design an accelerator with custom dataflow and time-multiplexing markedly reducing on-device training energy and area.
△ Less
Submitted 19 August, 2025;
originally announced August 2025.
-
A Unified Voxel Diffusion Module for Point Cloud 3D Object Detection
Authors:
Qifeng Liu,
Dawei Zhao,
Yabo Dong,
Linzhi Shang,
Liang Xiao,
Juan Wang,
Kunkong Zhao,
Dongming Lu,
Qi Zhu
Abstract:
Recent advances in point cloud object detection have increasingly adopted Transformer-based and State Space Models (SSMs), demonstrating strong performance. However, voxelbased representations in these models require strict consistency in input and output dimensions due to their serialized processing, which limits the spatial diffusion capability typically offered by convolutional operations. This…
▽ More
Recent advances in point cloud object detection have increasingly adopted Transformer-based and State Space Models (SSMs), demonstrating strong performance. However, voxelbased representations in these models require strict consistency in input and output dimensions due to their serialized processing, which limits the spatial diffusion capability typically offered by convolutional operations. This limitation significantly affects detection accuracy. Inspired by CNN-based object detection architectures, we propose a novel Voxel Diffusion Module (VDM) to enhance voxel-level representation and diffusion in point cloud data. VDM is composed of sparse 3D convolutions, submanifold sparse convolutions, and residual connections. To ensure computational efficiency, the output feature maps are downsampled to one-fourth of the original input resolution. VDM serves two primary functions: (1) diffusing foreground voxel features through sparse 3D convolutions to enrich spatial context, and (2) aggregating fine-grained spatial information to strengthen voxelwise feature representation. The enhanced voxel features produced by VDM can be seamlessly integrated into mainstream Transformer- or SSM-based detection models for accurate object classification and localization, highlighting the generalizability of our method. We evaluate VDM on several benchmark datasets by embedding it into both Transformerbased and SSM-based models. Experimental results show that our approach consistently improves detection accuracy over baseline models. Specifically, VDM-SSMs achieve 74.7 mAPH (L2) on Waymo, 72.9 NDS on nuScenes, 42.3 mAP on Argoverse 2, and 67.6 mAP on ONCE, setting new stateof-the-art performance across all datasets. Our code will be made publicly available.
△ Less
Submitted 21 August, 2025;
originally announced August 2025.
-
Intern-S1: A Scientific Multimodal Foundation Model
Authors:
Lei Bai,
Zhongrui Cai,
Yuhang Cao,
Maosong Cao,
Weihan Cao,
Chiyu Chen,
Haojiong Chen,
Kai Chen,
Pengcheng Chen,
Ying Chen,
Yongkang Chen,
Yu Cheng,
Pei Chu,
Tao Chu,
Erfei Cui,
Ganqu Cui,
Long Cui,
Ziyun Cui,
Nianchen Deng,
Ning Ding,
Nanqing Dong,
Peijie Dong,
Shihan Dou,
Sinan Du,
Haodong Duan
, et al. (152 additional authors not shown)
Abstract:
In recent years, a plethora of open-source foundation models have emerged, achieving remarkable progress in some widely attended fields, with performance being quite close to that of closed-source models. However, in high-value but more challenging scientific professional fields, either the fields still rely on expert models, or the progress of general foundation models lags significantly compared…
▽ More
In recent years, a plethora of open-source foundation models have emerged, achieving remarkable progress in some widely attended fields, with performance being quite close to that of closed-source models. However, in high-value but more challenging scientific professional fields, either the fields still rely on expert models, or the progress of general foundation models lags significantly compared to those in popular areas, far from sufficient for transforming scientific research and leaving substantial gap between open-source models and closed-source models in these scientific domains. To mitigate this gap and explore a step further toward Artificial General Intelligence (AGI), we introduce Intern-S1, a specialized generalist equipped with general understanding and reasoning capabilities with expertise to analyze multiple science modal data. Intern-S1 is a multimodal Mixture-of-Experts (MoE) model with 28 billion activated parameters and 241 billion total parameters, continually pre-trained on 5T tokens, including over 2.5T tokens from scientific domains. In the post-training stage, Intern-S1 undergoes offline and then online reinforcement learning (RL) in InternBootCamp, where we propose Mixture-of-Rewards (MoR) to synergize the RL training on more than 1000 tasks simultaneously. Through integrated innovations in algorithms, data, and training systems, Intern-S1 achieved top-tier performance in online RL training. On comprehensive evaluation benchmarks, Intern-S1 demonstrates competitive performance on general reasoning tasks among open-source models and significantly outperforms open-source models in scientific domains, surpassing closed-source state-of-the-art models in professional tasks, such as molecular synthesis planning, reaction condition prediction, predicting thermodynamic stabilities for crystals. Our models are available at https://huggingface.co/internlm/Intern-S1.
△ Less
Submitted 24 August, 2025; v1 submitted 21 August, 2025;
originally announced August 2025.
-
Large Foundation Model for Ads Recommendation
Authors:
Shangyu Zhang,
Shijie Quan,
Zhongren Wang,
Junwei Pan,
Tianqu Zhuang,
Bo Fu,
Yilong Sun,
Jieying Lin,
Jushuo Chen,
Xiaotian Li,
Zhixiang Feng,
Xian Hu,
Huiting Deng,
Hua Lu,
Jinpeng Wang,
Boqi Dai,
Xiaoyu Chen,
Bin Hu,
Lili Huang,
Yanwen Wu,
Yeshou Cai,
Qi Zhou,
Huang Tang,
Chunfeng Yang,
Chengguo Yin
, et al. (8 additional authors not shown)
Abstract:
Online advertising relies on accurate recommendation models, with recent advances using pre-trained large-scale foundation models (LFMs) to capture users' general interests across multiple scenarios and tasks. However, existing methods have critical limitations: they extract and transfer only user representations (URs), ignoring valuable item representations (IRs) and user-item cross representatio…
▽ More
Online advertising relies on accurate recommendation models, with recent advances using pre-trained large-scale foundation models (LFMs) to capture users' general interests across multiple scenarios and tasks. However, existing methods have critical limitations: they extract and transfer only user representations (URs), ignoring valuable item representations (IRs) and user-item cross representations (CRs); and they simply use a UR as a feature in downstream applications, which fails to bridge upstream-downstream gaps and overlooks more transfer granularities. In this paper, we propose LFM4Ads, an All-Representation Multi-Granularity transfer framework for ads recommendation. It first comprehensively transfers URs, IRs, and CRs, i.e., all available representations in the pre-trained foundation model. To effectively utilize the CRs, it identifies the optimal extraction layer and aggregates them into transferable coarse-grained forms. Furthermore, we enhance the transferability via multi-granularity mechanisms: non-linear adapters for feature-level transfer, an Isomorphic Interaction Module for module-level transfer, and Standalone Retrieval for model-level transfer. LFM4Ads has been successfully deployed in Tencent's industrial-scale advertising platform, processing tens of billions of daily samples while maintaining terabyte-scale model parameters with billions of sparse embedding keys across approximately two thousand features. Since its production deployment in Q4 2024, LFM4Ads has achieved 10+ successful production launches across various advertising scenarios, including primary ones like Weixin Moments and Channels. These launches achieve an overall GMV lift of 2.45% across the entire platform, translating to estimated annual revenue increases in the hundreds of millions of dollars.
△ Less
Submitted 20 August, 2025;
originally announced August 2025.
-
PepThink-R1: LLM for Interpretable Cyclic Peptide Optimization with CoT SFT and Reinforcement Learning
Authors:
Ruheng Wang,
Hang Zhang,
Trieu Nguyen,
Shasha Feng,
Hao-Wei Pang,
Xiang Yu,
Li Xiao,
Peter Zhiping Zhang
Abstract:
Designing therapeutic peptides with tailored properties is hindered by the vastness of sequence space, limited experimental data, and poor interpretability of current generative models. To address these challenges, we introduce PepThink-R1, a generative framework that integrates large language models (LLMs) with chain-of-thought (CoT) supervised fine-tuning and reinforcement learning (RL). Unlike…
▽ More
Designing therapeutic peptides with tailored properties is hindered by the vastness of sequence space, limited experimental data, and poor interpretability of current generative models. To address these challenges, we introduce PepThink-R1, a generative framework that integrates large language models (LLMs) with chain-of-thought (CoT) supervised fine-tuning and reinforcement learning (RL). Unlike prior approaches, PepThink-R1 explicitly reasons about monomer-level modifications during sequence generation, enabling interpretable design choices while optimizing for multiple pharmacological properties. Guided by a tailored reward function balancing chemical validity and property improvements, the model autonomously explores diverse sequence variants. We demonstrate that PepThink-R1 generates cyclic peptides with significantly enhanced lipophilicity, stability, and exposure, outperforming existing general LLMs (e.g., GPT-5) and domain-specific baseline in both optimization success and interpretability. To our knowledge, this is the first LLM-based peptide design framework that combines explicit reasoning with RL-driven property control, marking a step toward reliable and transparent peptide optimization for therapeutic discovery.
△ Less
Submitted 20 August, 2025;
originally announced August 2025.
-
UniDCF: A Foundation Model for Comprehensive Dentocraniofacial Hard Tissue Reconstruction
Authors:
Chunxia Ren,
Ning Zhu,
Yue Lai,
Gui Chen,
Ruijie Wang,
Yangyi Hu,
Suyao Liu,
Shuwen Mao,
Hong Su,
Yu Zhang,
Li Xiao
Abstract:
Dentocraniofacial hard tissue defects profoundly affect patients' physiological functions, facial aesthetics, and psychological well-being, posing significant challenges for precise reconstruction. Current deep learning models are limited to single-tissue scenarios and modality-specific imaging inputs, resulting in poor generalizability and trade-offs between anatomical fidelity, computational eff…
▽ More
Dentocraniofacial hard tissue defects profoundly affect patients' physiological functions, facial aesthetics, and psychological well-being, posing significant challenges for precise reconstruction. Current deep learning models are limited to single-tissue scenarios and modality-specific imaging inputs, resulting in poor generalizability and trade-offs between anatomical fidelity, computational efficiency, and cross-tissue adaptability. Here we introduce UniDCF, a unified framework capable of reconstructing multiple dentocraniofacial hard tissues through multimodal fusion encoding of point clouds and multi-view images. By leveraging the complementary strengths of each modality and incorporating a score-based denoising module to refine surface smoothness, UniDCF overcomes the limitations of prior single-modality approaches. We curated the largest multimodal dataset, comprising intraoral scans, CBCT, and CT from 6,609 patients, resulting in 54,555 annotated instances. Evaluations demonstrate that UniDCF outperforms existing state-of-the-art methods in terms of geometric precision, structural completeness, and spatial accuracy. Clinical simulations indicate UniDCF reduces reconstruction design time by 99% and achieves clinician-rated acceptability exceeding 94%. Overall, UniDCF enables rapid, automated, and high-fidelity reconstruction, supporting personalized and precise restorative treatments, streamlining clinical workflows, and enhancing patient outcomes.
△ Less
Submitted 15 August, 2025;
originally announced August 2025.
-
Quantization through Piecewise-Affine Regularization: Optimization and Statistical Guarantees
Authors:
Jianhao Ma,
Lin Xiao
Abstract:
Optimization problems over discrete or quantized variables are very challenging in general due to the combinatorial nature of their search space. Piecewise-affine regularization (PAR) provides a flexible modeling and computational framework for quantization based on continuous optimization. In this work, we focus on the setting of supervised learning and investigate the theoretical foundations of…
▽ More
Optimization problems over discrete or quantized variables are very challenging in general due to the combinatorial nature of their search space. Piecewise-affine regularization (PAR) provides a flexible modeling and computational framework for quantization based on continuous optimization. In this work, we focus on the setting of supervised learning and investigate the theoretical foundations of PAR from optimization and statistical perspectives. First, we show that in the overparameterized regime, where the number of parameters exceeds the number of samples, every critical point of the PAR-regularized loss function exhibits a high degree of quantization. Second, we derive closed-form proximal mappings for various (convex, quasi-convex, and non-convex) PARs and show how to solve PAR-regularized problems using the proximal gradient method, its accelerated variant, and the Alternating Direction Method of Multipliers. Third, we study statistical guarantees of PAR-regularized linear regression problems; specifically, we can approximate classical formulations of $\ell_1$-, squared $\ell_2$-, and nonconvex regularizations using PAR and obtain similar statistical guarantees with quantized solutions.
△ Less
Submitted 14 August, 2025;
originally announced August 2025.
-
A learning-driven automatic planning framework for proton PBS treatments of H&N cancers
Authors:
Qingqing Wang,
Liqiang Xiao,
Chang Chang
Abstract:
Proton pencil beam scanning (PBS) treatment planning for head & neck (H&N) cancers involves numerous conflicting objectives, requiring iterative objective parameter adjustments to balance multiple clinical goals. We propose a learning-driven inverse optimizer and integrate it into a proximal policy optimization (PPO)-based planning framework to automatically generate high-quality plans for patient…
▽ More
Proton pencil beam scanning (PBS) treatment planning for head & neck (H&N) cancers involves numerous conflicting objectives, requiring iterative objective parameter adjustments to balance multiple clinical goals. We propose a learning-driven inverse optimizer and integrate it into a proximal policy optimization (PPO)-based planning framework to automatically generate high-quality plans for patients with diverse treatment requirements. The inverse optimizer is a learning-to-optimize (L2O) method that predicts update steps by learning from task-specific data distributions. For the first time, long-context processing techniques developed for large language models (LLMs) are utilized to address the scalability limitations of existing L2O methods, enabling simultaneous optimization over a substantially large set of variables. The PPO framework functions as an outer-loop virtual planner, autonomously adjusting objective parameters through a policy network, and the inner-loop L2O inverse optimizer computes machine-deliverable spot monitor unit (MU) values based on the PPO-refined objectives. Moreover, a Swin UnetR dose predictor is trained with prescription- and beam-specific information to estimate the initial objective parameters. In our experiments, total 97 patients with bilateral or ipsilateral H&N cancers are collected for training and testing. Compared with the second-order gradient-based methods, our L2O optimizer improves the effectiveness and efficiency of the time-consuming inverse optimization by 22.97% and 36.41%, respectively, and in conjunction with the PPO-based virtual planner, plans are generated within clinically acceptable times, i.e. 2.55 hours in average, and shows improved or comparable organs-at-risk sparing with superior target coverage compared with human-generated plans.
△ Less
Submitted 15 September, 2025; v1 submitted 14 August, 2025;
originally announced August 2025.
-
Preview WB-DH: Towards Whole Body Digital Human Bench for the Generation of Whole-body Talking Avatar Videos
Authors:
Chaoyi Wang,
Yifan Yang,
Jun Pei,
Lijie Xia,
Jianpo Liu,
Xiaobing Yuan,
Xinhan Di
Abstract:
Creating realistic, fully animatable whole-body avatars from a single portrait is challenging due to limitations in capturing subtle expressions, body movements, and dynamic backgrounds. Current evaluation datasets and metrics fall short in addressing these complexities. To bridge this gap, we introduce the Whole-Body Benchmark Dataset (WB-DH), an open-source, multi-modal benchmark designed for ev…
▽ More
Creating realistic, fully animatable whole-body avatars from a single portrait is challenging due to limitations in capturing subtle expressions, body movements, and dynamic backgrounds. Current evaluation datasets and metrics fall short in addressing these complexities. To bridge this gap, we introduce the Whole-Body Benchmark Dataset (WB-DH), an open-source, multi-modal benchmark designed for evaluating whole-body animatable avatar generation. Key features include: (1) detailed multi-modal annotations for fine-grained guidance, (2) a versatile evaluation framework, and (3) public access to the dataset and tools at https://github.com/deepreasonings/WholeBodyBenchmark.
△ Less
Submitted 12 August, 2025;
originally announced August 2025.
-
Being-M0.5: A Real-Time Controllable Vision-Language-Motion Model
Authors:
Bin Cao,
Sipeng Zheng,
Ye Wang,
Lujie Xia,
Qianshan Wei,
Qin Jin,
Jing Liu,
Zongqing Lu
Abstract:
Human motion generation has emerged as a critical technology with transformative potential for real-world applications. However, existing vision-language-motion models (VLMMs) face significant limitations that hinder their practical deployment. We identify controllability as a main bottleneck, manifesting in five key aspects: inadequate response to diverse human commands, limited pose initializati…
▽ More
Human motion generation has emerged as a critical technology with transformative potential for real-world applications. However, existing vision-language-motion models (VLMMs) face significant limitations that hinder their practical deployment. We identify controllability as a main bottleneck, manifesting in five key aspects: inadequate response to diverse human commands, limited pose initialization capabilities, poor performance on long-term sequences, insufficient handling of unseen scenarios, and lack of fine-grained control over individual body parts. To overcome these limitations, we present Being-M0.5, the first real-time, controllable VLMM that achieves state-of-the-art performance across multiple motion generation tasks. Our approach is built upon HuMo100M, the largest and most comprehensive human motion dataset to date, comprising over 5 million self-collected motion sequences, 100 million multi-task instructional instances, and detailed part-level annotations that address a critical gap in existing datasets. We introduce a novel part-aware residual quantization technique for motion tokenization that enables precise, granular control over individual body parts during generation. Extensive experimental validation demonstrates Being-M0.5's superior performance across diverse motion benchmarks, while comprehensive efficiency analysis confirms its real-time capabilities. Our contributions include design insights and detailed computational analysis to guide future development of practical motion generators. We believe that HuMo100M and Being-M0.5 represent significant advances that will accelerate the adoption of motion generation technologies in real-world applications. The project page is available at https://beingbeyond.github.io/Being-M0.5.
△ Less
Submitted 11 August, 2025;
originally announced August 2025.
-
Many-Turn Jailbreaking
Authors:
Xianjun Yang,
Liqiang Xiao,
Shiyang Li,
Faisal Ladhak,
Hyokun Yun,
Linda Ruth Petzold,
Yi Xu,
William Yang Wang
Abstract:
Current jailbreaking work on large language models (LLMs) aims to elicit unsafe outputs from given prompts. However, it only focuses on single-turn jailbreaking targeting one specific query. On the contrary, the advanced LLMs are designed to handle extremely long contexts and can thus conduct multi-turn conversations. So, we propose exploring multi-turn jailbreaking, in which the jailbroken LLMs a…
▽ More
Current jailbreaking work on large language models (LLMs) aims to elicit unsafe outputs from given prompts. However, it only focuses on single-turn jailbreaking targeting one specific query. On the contrary, the advanced LLMs are designed to handle extremely long contexts and can thus conduct multi-turn conversations. So, we propose exploring multi-turn jailbreaking, in which the jailbroken LLMs are continuously tested on more than the first-turn conversation or a single target query. This is an even more serious threat because 1) it is common for users to continue asking relevant follow-up questions to clarify certain jailbroken details, and 2) it is also possible that the initial round of jailbreaking causes the LLMs to respond to additional irrelevant questions consistently. As the first step (First draft done at June 2024) in exploring multi-turn jailbreaking, we construct a Multi-Turn Jailbreak Benchmark (MTJ-Bench) for benchmarking this setting on a series of open- and closed-source models and provide novel insights into this new safety threat. By revealing this new vulnerability, we aim to call for community efforts to build safer LLMs and pave the way for a more in-depth understanding of jailbreaking LLMs.
△ Less
Submitted 8 August, 2025;
originally announced August 2025.
-
TURA: Tool-Augmented Unified Retrieval Agent for AI Search
Authors:
Zhejun Zhao,
Yuehu Dong,
Alley Liu,
Lixue Zheng,
Pingsheng Liu,
Dongdong Shen,
Long Xia,
Jiashu Zhao,
Dawei Yin
Abstract:
The advent of Large Language Models (LLMs) is transforming search engines into conversational AI search products, primarily using Retrieval-Augmented Generation (RAG) on web corpora. However, this paradigm has significant industrial limitations. Traditional RAG approaches struggle with real-time needs and structured queries that require accessing dynamically generated content like ticket availabil…
▽ More
The advent of Large Language Models (LLMs) is transforming search engines into conversational AI search products, primarily using Retrieval-Augmented Generation (RAG) on web corpora. However, this paradigm has significant industrial limitations. Traditional RAG approaches struggle with real-time needs and structured queries that require accessing dynamically generated content like ticket availability or inventory. Limited to indexing static pages, search engines cannot perform the interactive queries needed for such time-sensitive data. Academic research has focused on optimizing RAG for static content, overlooking complex intents and the need for dynamic sources like databases and real-time APIs. To bridge this gap, we introduce TURA (Tool-Augmented Unified Retrieval Agent for AI Search), a novel three-stage framework that combines RAG with agentic tool-use to access both static content and dynamic, real-time information. TURA has three key components: an Intent-Aware Retrieval module to decompose queries and retrieve information sources encapsulated as Model Context Protocol (MCP) Servers, a DAG-based Task Planner that models task dependencies as a Directed Acyclic Graph (DAG) for optimal parallel execution, and a lightweight Distilled Agent Executor for efficient tool calling. TURA is the first architecture to systematically bridge the gap between static RAG and dynamic information sources for a world-class AI search product. Serving tens of millions of users, it leverages an agentic framework to deliver robust, real-time answers while meeting the low-latency demands of a large-scale industrial system.
△ Less
Submitted 6 August, 2025;
originally announced August 2025.
-
CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward
Authors:
Shudong Liu,
Hongwei Liu,
Junnan Liu,
Linchen Xiao,
Songyang Gao,
Chengqi Lyu,
Yuzhe Gu,
Wenwei Zhang,
Derek F. Wong,
Songyang Zhang,
Kai Chen
Abstract:
Answer verification is crucial not only for evaluating large language models (LLMs) by matching their unstructured outputs against standard answers, but also serves as the reward model to guide LLM optimization. Most evaluation frameworks rely on regularized matching or employ general LLMs for answer verification, which demands extensive, repetitive customization for regex rules or evaluation prom…
▽ More
Answer verification is crucial not only for evaluating large language models (LLMs) by matching their unstructured outputs against standard answers, but also serves as the reward model to guide LLM optimization. Most evaluation frameworks rely on regularized matching or employ general LLMs for answer verification, which demands extensive, repetitive customization for regex rules or evaluation prompts. Two fundamental limitations persist in current methodologies: 1) the absence of comprehensive benchmarks that systematically evaluate verification capabilities across different LLMs; and 2) the nascent stage of verifier development, where existing approaches lack both the robustness to handle complex edge cases and the generalizability across different domains. In this work, we develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types, including multi-subproblems, formulas, and sequence answers, while effectively identifying abnormal/invalid responses. We introduce VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of metaerror patterns to enhance CompassVerifier. We anticipate that CompassVerifier and VerifierBench will facilitate answer verification, evaluation protocols, and reinforcement learning research. Code and dataset are available at https://github.com/open-compass/CompassVerifier.
△ Less
Submitted 5 August, 2025;
originally announced August 2025.
-
Robust Single-Stage Fully Sparse 3D Object Detection via Detachable Latent Diffusion
Authors:
Wentao Qu,
Guofeng Mei,
Jing Wang,
Yujiao Wu,
Xiaoshui Huang,
Liang Xiao
Abstract:
Denoising Diffusion Probabilistic Models (DDPMs) have shown success in robust 3D object detection tasks. Existing methods often rely on the score matching from 3D boxes or pre-trained diffusion priors. However, they typically require multi-step iterations in inference, which limits efficiency. To address this, we propose a Robust single-stage fully Sparse 3D object Detection Network with a Detacha…
▽ More
Denoising Diffusion Probabilistic Models (DDPMs) have shown success in robust 3D object detection tasks. Existing methods often rely on the score matching from 3D boxes or pre-trained diffusion priors. However, they typically require multi-step iterations in inference, which limits efficiency. To address this, we propose a Robust single-stage fully Sparse 3D object Detection Network with a Detachable Latent Framework (DLF) of DDPMs, named RSDNet. Specifically, RSDNet learns the denoising process in latent feature spaces through lightweight denoising networks like multi-level denoising autoencoders (DAEs). This enables RSDNet to effectively understand scene distributions under multi-level perturbations, achieving robust and reliable detection. Meanwhile, we reformulate the noising and denoising mechanisms of DDPMs, enabling DLF to construct multi-type and multi-level noise samples and targets, enhancing RSDNet robustness to multiple perturbations. Furthermore, a semantic-geometric conditional guidance is introduced to perceive the object boundaries and shapes, alleviating the center feature missing problem in sparse representations, enabling RSDNet to perform in a fully sparse detection pipeline. Moreover, the detachable denoising network design of DLF enables RSDNet to perform single-step detection in inference, further enhancing detection efficiency. Extensive experiments on public benchmarks show that RSDNet can outperform existing methods, achieving state-of-the-art detection.
△ Less
Submitted 27 August, 2025; v1 submitted 5 August, 2025;
originally announced August 2025.
-
Fast and Memory-efficient Non-line-of-sight Imaging with Quasi-Fresnel Transform
Authors:
Yijun Wei,
Jianyu Wang,
Leping Xiao,
Zuoqiang Shi,
Xing Fu,
Lingyun Qiu
Abstract:
Non-line-of-sight (NLOS) imaging seeks to reconstruct hidden objects by analyzing reflections from intermediary surfaces. Existing methods typically model both the measurement data and the hidden scene in three dimensions, overlooking the inherently two-dimensional nature of most hidden objects. This oversight leads to high computational costs and substantial memory consumption, limiting practical…
▽ More
Non-line-of-sight (NLOS) imaging seeks to reconstruct hidden objects by analyzing reflections from intermediary surfaces. Existing methods typically model both the measurement data and the hidden scene in three dimensions, overlooking the inherently two-dimensional nature of most hidden objects. This oversight leads to high computational costs and substantial memory consumption, limiting practical applications and making real-time, high-resolution NLOS imaging on lightweight devices challenging. In this paper, we introduce a novel approach that represents the hidden scene using two-dimensional functions and employs a Quasi-Fresnel transform to establish a direct inversion formula between the measurement data and the hidden scene. This transformation leverages the two-dimensional characteristics of the problem to significantly reduce computational complexity and memory requirements. Our algorithm efficiently performs fast transformations between these two-dimensional aggregated data, enabling rapid reconstruction of hidden objects with minimal memory usage. Compared to existing methods, our approach reduces runtime and memory demands by several orders of magnitude while maintaining imaging quality. The substantial reduction in memory usage not only enhances computational efficiency but also enables NLOS imaging on lightweight devices such as mobile and embedded systems. We anticipate that this method will facilitate real-time, high-resolution NLOS imaging and broaden its applicability across a wider range of platforms.
△ Less
Submitted 3 August, 2025;
originally announced August 2025.
-
Details Matter for Indoor Open-vocabulary 3D Instance Segmentation
Authors:
Sanghun Jung,
Jingjing Zheng,
Ke Zhang,
Nan Qiao,
Albert Y. C. Chen,
Lu Xia,
Chi Liu,
Yuyin Sun,
Xiao Zeng,
Hsiang-Wei Huang,
Byron Boots,
Min Sun,
Cheng-Hao Kuo
Abstract:
Unlike closed-vocabulary 3D instance segmentation that is often trained end-to-end, open-vocabulary 3D instance segmentation (OV-3DIS) often leverages vision-language models (VLMs) to generate 3D instance proposals and classify them. While various concepts have been proposed from existing research, we observe that these individual concepts are not mutually exclusive but complementary. In this pape…
▽ More
Unlike closed-vocabulary 3D instance segmentation that is often trained end-to-end, open-vocabulary 3D instance segmentation (OV-3DIS) often leverages vision-language models (VLMs) to generate 3D instance proposals and classify them. While various concepts have been proposed from existing research, we observe that these individual concepts are not mutually exclusive but complementary. In this paper, we propose a new state-of-the-art solution for OV-3DIS by carefully designing a recipe to combine the concepts together and refining them to address key challenges. Our solution follows the two-stage scheme: 3D proposal generation and instance classification. We employ robust 3D tracking-based proposal aggregation to generate 3D proposals and remove overlapped or partial proposals by iterative merging/removal. For the classification stage, we replace the standard CLIP model with Alpha-CLIP, which incorporates object masks as an alpha channel to reduce background noise and obtain object-centric representation. Additionally, we introduce the standardized maximum similarity (SMS) score to normalize text-to-proposal similarity, effectively filtering out false positives and boosting precision. Our framework achieves state-of-the-art performance on ScanNet200 and S3DIS across all AP and AR metrics, even surpassing an end-to-end closed-vocabulary method.
△ Less
Submitted 30 July, 2025;
originally announced July 2025.
-
Mean-Variance Optimization and Algorithm for Finite-Horizon Markov Decision Processes
Authors:
Li Xia,
Zhihui Yu
Abstract:
Multi-period mean-variance optimization is a long-standing problem, caused by the failure of dynamic programming principle. This paper studies the mean-variance optimization in a setting of finite-horizon discrete-time Markov decision processes (MDPs), where the objective is to maximize the combined metrics of mean and variance of the accumulated rewards at terminal stage. By introducing the conce…
▽ More
Multi-period mean-variance optimization is a long-standing problem, caused by the failure of dynamic programming principle. This paper studies the mean-variance optimization in a setting of finite-horizon discrete-time Markov decision processes (MDPs), where the objective is to maximize the combined metrics of mean and variance of the accumulated rewards at terminal stage. By introducing the concepts of pseudo mean and pseudo variance, we convert the original mean-variance MDP to a bilevel MDP, where the outer is a single parameter optimization of the pseudo mean and the inner is a standard finite-horizon MDP with an augmented state space by adding an auxiliary state of accumulated rewards. We further study the properties of this bilevel MDP, including the optimality of history-dependent deterministic policies and the piecewise quadratic concavity of the inner MDPs' optimal values with respect to the pseudo mean. To efficiently solve this bilevel MDP, we propose an iterative algorithm that alternatingly updates the inner optimal policy and the outer pseudo mean. We prove that this algorithm converges to a local optimum. We also derive a sufficient condition under which our algorithm converges to the global optimum. Furthermore, we apply this approach to study the mean-variance optimization of multi-period portfolio selection problem, which shows that our approach exactly coincides with the classical result by Li and Ng (2000) in financial engineering. Our approach builds a new avenue to solve mean-variance optimization problems and has wide applicability to any problem modeled by MDPs, which is further demonstrated by examples of mean-variance optimization for queueing control and inventory management.
△ Less
Submitted 29 July, 2025;
originally announced July 2025.
-
Adversarial Demonstration Learning for Low-resource NER Using Dual Similarity
Authors:
Guowen Yuan,
Tien-Hsuan Wu,
Lianghao Xia,
Ben Kao
Abstract:
We study the problem of named entity recognition (NER) based on demonstration learning in low-resource scenarios. We identify two issues in demonstration construction and model training. Firstly, existing methods for selecting demonstration examples primarily rely on semantic similarity; We show that feature similarity can provide significant performance improvement. Secondly, we show that the NER…
▽ More
We study the problem of named entity recognition (NER) based on demonstration learning in low-resource scenarios. We identify two issues in demonstration construction and model training. Firstly, existing methods for selecting demonstration examples primarily rely on semantic similarity; We show that feature similarity can provide significant performance improvement. Secondly, we show that the NER tagger's ability to reference demonstration examples is generally inadequate. We propose a demonstration and training approach that effectively addresses these issues. For the first issue, we propose to select examples by dual similarity, which comprises both semantic similarity and feature similarity. For the second issue, we propose to train an NER model with adversarial demonstration such that the model is forced to refer to the demonstrations when performing the tagging task. We conduct comprehensive experiments in low-resource NER tasks, and the results demonstrate that our method outperforms a range of methods.
△ Less
Submitted 13 July, 2025;
originally announced July 2025.
-
Stochastic Approximation with Block Coordinate Optimal Stepsizes
Authors:
Tao Jiang,
Lin Xiao
Abstract:
We consider stochastic approximation with block-coordinate stepsizes and propose adaptive stepsize rules that aim to minimize the expected distance from the next iterate to an optimal point. These stepsize rules employ online estimates of the second moment of the search direction along each block coordinate. The popular Adam algorithm can be interpreted as a particular heuristic for such estimatio…
▽ More
We consider stochastic approximation with block-coordinate stepsizes and propose adaptive stepsize rules that aim to minimize the expected distance from the next iterate to an optimal point. These stepsize rules employ online estimates of the second moment of the search direction along each block coordinate. The popular Adam algorithm can be interpreted as a particular heuristic for such estimation. By leveraging a simple conditional estimator, we derive a new method that obtains comparable performance as Adam but requires less memory and fewer hyper-parameters. We prove that this family of methods converges almost surely to a small neighborhood of the optimal point, and the radius of the neighborhood depends on the bias and variance of the second-moment estimator. Our analysis relies on a simple aiming condition that assumes neither convexity nor smoothness, thus has broad applicability.
△ Less
Submitted 11 July, 2025;
originally announced July 2025.
-
Advancing Multimodal LLMs by Large-Scale 3D Visual Instruction Dataset Generation
Authors:
Liu He,
Xiao Zeng,
Yizhi Song,
Albert Y. C. Chen,
Lu Xia,
Shashwat Verma,
Sankalp Dayal,
Min Sun,
Cheng-Hao Kuo,
Daniel Aliaga
Abstract:
Multimodal Large Language Models (MLLMs) struggle with accurately capturing camera-object relations, especially for object orientation, camera viewpoint, and camera shots. This stems from the fact that existing MLLMs are trained on images with limited diverse camera-object relations and corresponding textual descriptions. To address this, we propose a synthetic generation pipeline to create large-…
▽ More
Multimodal Large Language Models (MLLMs) struggle with accurately capturing camera-object relations, especially for object orientation, camera viewpoint, and camera shots. This stems from the fact that existing MLLMs are trained on images with limited diverse camera-object relations and corresponding textual descriptions. To address this, we propose a synthetic generation pipeline to create large-scale 3D visual instruction datasets. Our framework takes 3D assets as input and uses rendering and diffusion-based image generation models to create photorealistic images preserving precise camera-object relations. Additionally, large language models (LLMs) are used to generate text prompts for guiding visual instruction tuning and controlling image generation. We create Ultimate3D, a dataset of 240K VQAs with precise camera-object annotations, and corresponding benchmark. MLLMs fine-tuned on our proposed dataset outperform commercial models by a large margin, achieving an average accuracy improvement of 33.4% on camera-object relation recognition tasks. Our code, dataset, and benchmark will contribute to broad MLLM applications.
△ Less
Submitted 23 July, 2025; v1 submitted 11 July, 2025;
originally announced July 2025.
-
Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data
Authors:
Ke Fan,
Shunlin Lu,
Minyue Dai,
Runyi Yu,
Lixing Xiao,
Zhiyang Dou,
Junting Dong,
Lizhuang Ma,
Jingbo Wang
Abstract:
Generating diverse and natural human motion sequences based on textual descriptions constitutes a fundamental and challenging research area within the domains of computer vision, graphics, and robotics. Despite significant advancements in this field, current methodologies often face challenges regarding zero-shot generalization capabilities, largely attributable to the limited size of training dat…
▽ More
Generating diverse and natural human motion sequences based on textual descriptions constitutes a fundamental and challenging research area within the domains of computer vision, graphics, and robotics. Despite significant advancements in this field, current methodologies often face challenges regarding zero-shot generalization capabilities, largely attributable to the limited size of training datasets. Moreover, the lack of a comprehensive evaluation framework impedes the advancement of this task by failing to identify directions for improvement. In this work, we aim to push text-to-motion into a new era, that is, to achieve the generalization ability of zero-shot. To this end, firstly, we develop an efficient annotation pipeline and introduce MotionMillion-the largest human motion dataset to date, featuring over 2,000 hours and 2 million high-quality motion sequences. Additionally, we propose MotionMillion-Eval, the most comprehensive benchmark for evaluating zero-shot motion generation. Leveraging a scalable architecture, we scale our model to 7B parameters and validate its performance on MotionMillion-Eval. Our results demonstrate strong generalization to out-of-domain and complex compositional motions, marking a significant step toward zero-shot human motion generation. The code is available at https://github.com/VankouF/MotionMillion-Codes.
△ Less
Submitted 9 July, 2025;
originally announced July 2025.
-
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Authors:
Gheorghe Comanici,
Eric Bieber,
Mike Schaekermann,
Ice Pasupat,
Noveen Sachdeva,
Inderjit Dhillon,
Marcel Blistein,
Ori Ram,
Dan Zhang,
Evan Rosen,
Luke Marris,
Sam Petulla,
Colin Gaffney,
Asaf Aharoni,
Nathan Lintz,
Tiago Cardal Pais,
Henrik Jacobsson,
Idan Szpektor,
Nan-Jiang Jiang,
Krishna Haridasan,
Ahmed Omran,
Nikunj Saunshi,
Dara Bahri,
Gaurav Mishra,
Eric Chu
, et al. (3284 additional authors not shown)
Abstract:
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal unde…
▽ More
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
△ Less
Submitted 22 July, 2025; v1 submitted 7 July, 2025;
originally announced July 2025.
-
Affective-ROPTester: Capability and Bias Analysis of LLMs in Predicting Retinopathy of Prematurity
Authors:
Shuai Zhao,
Yulin Zhang,
Luwei Xiao,
Xinyi Wu,
Yanhao Jia,
Zhongliang Guo,
Xiaobao Wu,
Cong-Duy Nguyen,
Guoming Zhang,
Anh Tuan Luu
Abstract:
Despite the remarkable progress of large language models (LLMs) across various domains, their capacity to predict retinopathy of prematurity (ROP) risk remains largely unexplored. To address this gap, we introduce a novel Chinese benchmark dataset, termed CROP, comprising 993 admission records annotated with low, medium, and high-risk labels. To systematically examine the predictive capabilities a…
▽ More
Despite the remarkable progress of large language models (LLMs) across various domains, their capacity to predict retinopathy of prematurity (ROP) risk remains largely unexplored. To address this gap, we introduce a novel Chinese benchmark dataset, termed CROP, comprising 993 admission records annotated with low, medium, and high-risk labels. To systematically examine the predictive capabilities and affective biases of LLMs in ROP risk stratification, we propose Affective-ROPTester, an automated evaluation framework incorporating three prompting strategies: Instruction-based, Chain-of-Thought (CoT), and In-Context Learning (ICL). The Instruction scheme assesses LLMs' intrinsic knowledge and associated biases, whereas the CoT and ICL schemes leverage external medical knowledge to enhance predictive accuracy. Crucially, we integrate emotional elements at the prompt level to investigate how different affective framings influence the model's ability to predict ROP and its bias patterns. Empirical results derived from the CROP dataset yield two principal observations. First, LLMs demonstrate limited efficacy in ROP risk prediction when operating solely on intrinsic knowledge, yet exhibit marked performance gains when augmented with structured external inputs. Second, affective biases are evident in the model outputs, with a consistent inclination toward overestimating medium- and high-risk cases. Third, compared to negative emotions, positive emotional framing contributes to mitigating predictive bias in model outputs. These findings highlight the critical role of affect-sensitive prompt engineering in enhancing diagnostic reliability and emphasize the utility of Affective-ROPTester as a framework for evaluating and mitigating affective bias in clinical language modeling systems.
△ Less
Submitted 8 July, 2025;
originally announced July 2025.
-
SRefiner: Soft-Braid Attention for Multi-Agent Trajectory Refinement
Authors:
Liwen Xiao,
Zhiyu Pan,
Zhicheng Wang,
Zhiguo Cao,
Wei Li
Abstract:
Accurate prediction of multi-agent future trajectories is crucial for autonomous driving systems to make safe and efficient decisions. Trajectory refinement has emerged as a key strategy to enhance prediction accuracy. However, existing refinement methods often overlook the topological relationships between trajectories, which are vital for improving prediction precision. Inspired by braid theory,…
▽ More
Accurate prediction of multi-agent future trajectories is crucial for autonomous driving systems to make safe and efficient decisions. Trajectory refinement has emerged as a key strategy to enhance prediction accuracy. However, existing refinement methods often overlook the topological relationships between trajectories, which are vital for improving prediction precision. Inspired by braid theory, we propose a novel trajectory refinement approach, Soft-Braid Refiner (SRefiner), guided by the soft-braid topological structure of trajectories using Soft-Braid Attention. Soft-Braid Attention captures spatio-temporal topological relationships between trajectories by considering both spatial proximity and vehicle motion states at ``soft intersection points". Additionally, we extend this approach to model interactions between trajectories and lanes, further improving the prediction accuracy. SRefiner is a multi-iteration, multi-agent framework that iteratively refines trajectories, incorporating topological information to enhance interactions within traffic scenarios. SRefiner achieves significant performance improvements over four baseline methods across two datasets, establishing a new state-of-the-art in trajectory refinement. Code is here https://github.com/Liwen-Xiao/SRefiner.
△ Less
Submitted 6 July, 2025;
originally announced July 2025.
-
From Query to Explanation: Uni-RAG for Multi-Modal Retrieval-Augmented Learning in STEM
Authors:
Xinyi Wu,
Yanhao Jia,
Luwei Xiao,
Shuai Zhao,
Fengkuang Chiang,
Erik Cambria
Abstract:
In AI-facilitated teaching, leveraging various query styles to interpret abstract educational content is crucial for delivering effective and accessible learning experiences. However, existing retrieval systems predominantly focus on natural text-image matching and lack the capacity to address the diversity and ambiguity inherent in real-world educational scenarios. To address this limitation, we…
▽ More
In AI-facilitated teaching, leveraging various query styles to interpret abstract educational content is crucial for delivering effective and accessible learning experiences. However, existing retrieval systems predominantly focus on natural text-image matching and lack the capacity to address the diversity and ambiguity inherent in real-world educational scenarios. To address this limitation, we develop a lightweight and efficient multi-modal retrieval module, named Uni-Retrieval, which extracts query-style prototypes and dynamically matches them with tokens from a continually updated Prompt Bank. This Prompt Bank encodes and stores domain-specific knowledge by leveraging a Mixture-of-Expert Low-Rank Adaptation (MoE-LoRA) module and can be adapted to enhance Uni-Retrieval's capability to accommodate unseen query types at test time. To enable natural language educational content generation, we integrate the original Uni-Retrieval with a compact instruction-tuned language model, forming a complete retrieval-augmented generation pipeline named Uni-RAG. Given a style-conditioned query, Uni-RAG first retrieves relevant educational materials and then generates human-readable explanations, feedback, or instructional content aligned with the learning objective. Experimental results on SER and other multi-modal benchmarks show that Uni-RAG outperforms baseline retrieval and RAG systems in both retrieval accuracy and generation quality, while maintaining low computational cost. Our framework provides a scalable, pedagogically grounded solution for intelligent educational systems, bridging retrieval and generation to support personalized, explainable, and efficient learning assistance across diverse STEM scenarios.
△ Less
Submitted 4 July, 2025;
originally announced July 2025.
-
Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks
Authors:
Shikai Qiu,
Lechao Xiao,
Andrew Gordon Wilson,
Jeffrey Pennington,
Atish Agarwala
Abstract:
What scaling limits govern neural network training dynamics when model size and training time grow in tandem? We show that despite the complex interactions between architecture, training algorithms, and data, compute-optimally trained models exhibit a remarkably precise universality. Specifically, loss curves from models of varying sizes collapse onto a single universal curve when training compute…
▽ More
What scaling limits govern neural network training dynamics when model size and training time grow in tandem? We show that despite the complex interactions between architecture, training algorithms, and data, compute-optimally trained models exhibit a remarkably precise universality. Specifically, loss curves from models of varying sizes collapse onto a single universal curve when training compute and loss are normalized to unity at the end of training. With learning rate decay, the collapse becomes so tight that differences in the normalized curves across models fall below the noise floor of individual loss curves across random seeds, a phenomenon we term supercollapse. We observe supercollapse across learning rate schedules, datasets, and architectures, including transformers trained on next-token prediction, and find it breaks down when hyperparameters are scaled suboptimally, providing a precise and practical indicator of good scaling. We explain these phenomena by connecting collapse to the power-law structure in typical neural scaling laws, and analyzing a simple yet surprisingly effective model of SGD noise dynamics that accurately predicts loss curves across various learning rate schedules and quantitatively explains the origin of supercollapse.
△ Less
Submitted 7 July, 2025; v1 submitted 2 July, 2025;
originally announced July 2025.
-
Fast Clifford Neural Layers
Authors:
Tianxiang Xia,
Max Neuwinger,
Lin Xiao
Abstract:
Clifford Neural Layers improve PDE modeling by introducing Clifford Algebra into neural networks. In this project we focus on optimizing the inference of 2/3D Clifford convolutional layers and multivector activation layers for one core CPU performance.
Overall, by testing on a real network block involving Clifford convolutional layers and multivector activation layers, we observe that our implem…
▽ More
Clifford Neural Layers improve PDE modeling by introducing Clifford Algebra into neural networks. In this project we focus on optimizing the inference of 2/3D Clifford convolutional layers and multivector activation layers for one core CPU performance.
Overall, by testing on a real network block involving Clifford convolutional layers and multivector activation layers, we observe that our implementation is 30% faster than standard PyTorch implementation in relatively large data + network size (>L2 cache).
We open source our code base at https://github.com/egretwAlker/c-opt-clifford-layers
△ Less
Submitted 22 June, 2025;
originally announced July 2025.
-
Act-With-Think: Chunk Auto-Regressive Modeling for Generative Recommendation
Authors:
Yifan Wang,
Weinan Gan,
Longtao Xiao,
Jieming Zhu,
Heng Chang,
Haozhao Wang,
Rui Zhang,
Zhenhua Dong,
Ruiming Tang,
Ruixuan Li
Abstract:
Generative recommendation (GR) typically encodes behavioral or semantic aspects of item information into discrete tokens, leveraging the standard autoregressive (AR) generation paradigm to make predictions. However, existing methods tend to overlook their intrinsic relationship, that is, the semantic usually provides some reasonable explainability "$\textbf{why}$" for the behavior "…
▽ More
Generative recommendation (GR) typically encodes behavioral or semantic aspects of item information into discrete tokens, leveraging the standard autoregressive (AR) generation paradigm to make predictions. However, existing methods tend to overlook their intrinsic relationship, that is, the semantic usually provides some reasonable explainability "$\textbf{why}$" for the behavior "$\textbf{what}$", which may constrain the full potential of GR. To this end, we present Chunk AutoRegressive Modeling (CAR), a new generation paradigm following the decision pattern that users usually think semantic aspects of items (e.g. brand) and then take actions on target items (e.g. purchase). Our CAR, for the $\textit{first time}$, incorporates semantics (SIDs) and behavior (UID) into a single autoregressive transformer from an ``act-with-think'' dual perspective via chunk-level autoregression. Specifically, CAR packs SIDs and UID into a conceptual chunk for item unified representation, allowing each decoding step to make a holistic prediction. Experiments show that our CAR significantly outperforms existing methods based on traditional AR, improving Recall@5 by 7.93% to 22.30%. Furthermore, we verify the scaling effect between model performance and SIDs bit number, demonstrating that CAR preliminary emulates a kind of slow-thinking style mechanism akin to the reasoning processes observed in large language models (LLMs).
△ Less
Submitted 30 June, 2025;
originally announced June 2025.
-
Exploration from a Primal-Dual Lens: Value-Incentivized Actor-Critic Methods for Sample-Efficient Online RL
Authors:
Tong Yang,
Bo Dai,
Lin Xiao,
Yuejie Chi
Abstract:
Online reinforcement learning (RL) with complex function approximations such as transformers and deep neural networks plays a significant role in the modern practice of artificial intelligence. Despite its popularity and importance, balancing the fundamental trade-off between exploration and exploitation remains a long-standing challenge; in particular, we are still in lack of efficient and practi…
▽ More
Online reinforcement learning (RL) with complex function approximations such as transformers and deep neural networks plays a significant role in the modern practice of artificial intelligence. Despite its popularity and importance, balancing the fundamental trade-off between exploration and exploitation remains a long-standing challenge; in particular, we are still in lack of efficient and practical schemes that are backed by theoretical performance guarantees. Motivated by recent developments in exploration via optimistic regularization, this paper provides an interpretation of the principle of optimism through the lens of primal-dual optimization. From this fresh perspective, we set forth a new value-incentivized actor-critic (VAC) method, which optimizes a single easy-to-optimize objective integrating exploration and exploitation -- it promotes state-action and policy estimates that are both consistent with collected data transitions and result in higher value functions. Theoretically, the proposed VAC method has near-optimal regret guarantees under linear Markov decision processes (MDPs) in both finite-horizon and infinite-horizon settings, which can be extended to the general function approximation setting under appropriate assumptions.
△ Less
Submitted 27 June, 2025;
originally announced June 2025.
-
CodeMorph: Mitigating Data Leakage in Large Language Model Assessment
Authors:
Hongzhou Rao,
Yanjie Zhao,
Wenjie Zhu,
Ling Xiao,
Meizhen Wang,
Haoyu Wang
Abstract:
Concerns about benchmark leakage in large language models for code (Code LLMs) have raised issues of data contamination and inflated evaluation metrics. The diversity and inaccessibility of many training datasets make it difficult to prevent data leakage entirely, even with time lag strategies. Consequently, generating new datasets through code perturbation has become essential. However, existing…
▽ More
Concerns about benchmark leakage in large language models for code (Code LLMs) have raised issues of data contamination and inflated evaluation metrics. The diversity and inaccessibility of many training datasets make it difficult to prevent data leakage entirely, even with time lag strategies. Consequently, generating new datasets through code perturbation has become essential. However, existing methods often fail to produce complex and diverse variations, struggle with complex cross-file dependencies, and lack support for multiple programming languages, which limits their effectiveness in enhancing LLM evaluations for coding tasks. To fill this gap, we propose CodeMorph, an approach designed to support multiple programming languages while preserving cross-file dependencies to mitigate data leakage. CodeMorph consists of two main components that work together to enhance the perturbation process. The first component employs 26 semantic-preserving transformation methods to iteratively perturb code, generating diverse variations while ensuring that the modified code remains compilable. The second component introduces a genetic algorithm-based selection algorithm, PESO, to identify the more effective perturbation method for each iteration by targeting lower similarity scores between the perturbed and original code, thereby enhancing overall perturbation effectiveness. Experimental results demonstrate that after applying CodeMorph, the accuracy of the LLM on code completion tasks across five programming languages decreased by an average of 24.67%, with Python showing the most significant reduction at 45%. The similarity score of code optimized by PESO is, on average, 7.01% lower than that of randomly perturbed code, peaking at a reduction of 42.86%.
△ Less
Submitted 21 June, 2025;
originally announced June 2025.
-
Towards AI Search Paradigm
Authors:
Yuchen Li,
Hengyi Cai,
Rui Kong,
Xinran Chen,
Jiamin Chen,
Jun Yang,
Haojie Zhang,
Jiayi Li,
Jiayi Wu,
Yiqun Chen,
Changle Qu,
Keyi Kong,
Wenwen Ye,
Lixin Su,
Xinyu Ma,
Long Xia,
Daiting Shi,
Jiashu Zhao,
Haoyi Xiong,
Shuaiqiang Wang,
Dawei Yin
Abstract:
In this paper, we introduce the AI Search Paradigm, a comprehensive blueprint for next-generation search systems capable of emulating human information processing and decision-making. The paradigm employs a modular architecture of four LLM-powered agents (Master, Planner, Executor and Writer) that dynamically adapt to the full spectrum of information needs, from simple factual queries to complex m…
▽ More
In this paper, we introduce the AI Search Paradigm, a comprehensive blueprint for next-generation search systems capable of emulating human information processing and decision-making. The paradigm employs a modular architecture of four LLM-powered agents (Master, Planner, Executor and Writer) that dynamically adapt to the full spectrum of information needs, from simple factual queries to complex multi-stage reasoning tasks. These agents collaborate dynamically through coordinated workflows to evaluate query complexity, decompose problems into executable plans, and orchestrate tool usage, task execution, and content synthesis. We systematically present key methodologies for realizing this paradigm, including task planning and tool integration, execution strategies, aligned and robust retrieval-augmented generation, and efficient LLM inference, spanning both algorithmic techniques and infrastructure-level optimizations. By providing an in-depth guide to these foundational components, this work aims to inform the development of trustworthy, adaptive, and scalable AI search systems.
△ Less
Submitted 20 June, 2025;
originally announced June 2025.
-
Large Language Models for Unit Testing: A Systematic Literature Review
Authors:
Quanjun Zhang,
Chunrong Fang,
Siqi Gu,
Ye Shang,
Zhenyu Chen,
Liang Xiao
Abstract:
Unit testing is a fundamental practice in modern software engineering, with the aim of ensuring the correctness, maintainability, and reliability of individual software components. Very recently, with the advances in Large Language Models (LLMs), a rapidly growing body of research has leveraged LLMs to automate various unit testing tasks, demonstrating remarkable performance and significantly redu…
▽ More
Unit testing is a fundamental practice in modern software engineering, with the aim of ensuring the correctness, maintainability, and reliability of individual software components. Very recently, with the advances in Large Language Models (LLMs), a rapidly growing body of research has leveraged LLMs to automate various unit testing tasks, demonstrating remarkable performance and significantly reducing manual effort. However, due to ongoing explorations in the LLM-based unit testing field, it is challenging for researchers to understand existing achievements, open challenges, and future opportunities. This paper presents the first systematic literature review on the application of LLMs in unit testing until March 2025. We analyze \numpaper{} relevant papers from the perspectives of both unit testing and LLMs. We first categorize existing unit testing tasks that benefit from LLMs, e.g., test generation and oracle generation. We then discuss several critical aspects of integrating LLMs into unit testing research, including model usage, adaptation strategies, and hybrid approaches. We further summarize key challenges that remain unresolved and outline promising directions to guide future research in this area. Overall, our paper provides a systematic overview of the research landscape to the unit testing community, helping researchers gain a comprehensive understanding of achievements and promote future research. Our artifacts are publicly available at the GitHub repository: https://github.com/iSEngLab/AwesomeLLM4UT.
△ Less
Submitted 18 June, 2025;
originally announced June 2025.