-
A Survey on Open-Source Edge Computing Simulators and Emulators: The Computing and Networking Convergence Perspective
Authors:
Jianpeng Qi,
Chao Liu,
Xiao Zhang,
Lei Wang,
Rui Wang,
Junyu Dong,
Yanwei Yu
Abstract:
Edge computing, with its low latency, dynamic scalability, and location awareness, along with the convergence of computing and communication paradigms, has been successfully applied in critical domains such as industrial IoT, smart healthcare, smart homes, and public safety. This paper provides a comprehensive survey of open-source edge computing simulators and emulators, presented in our GitHub r…
▽ More
Edge computing, with its low latency, dynamic scalability, and location awareness, along with the convergence of computing and communication paradigms, has been successfully applied in critical domains such as industrial IoT, smart healthcare, smart homes, and public safety. This paper provides a comprehensive survey of open-source edge computing simulators and emulators, presented in our GitHub repository (https://github.com/qijianpeng/awesome-edge-computing), emphasizing the convergence of computing and networking paradigms. By examining more than 40 tools, including CloudSim, NS-3, and others, we identify the strengths and limitations in simulating and emulating edge environments. This survey classifies these tools into three categories: packet-level, application-level, and emulators. Furthermore, we evaluate them across five dimensions, ranging from resource representation to resource utilization. The survey highlights the integration of different computing paradigms, packet processing capabilities, support for edge environments, user-defined metric interfaces, and scenario visualization. The findings aim to guide researchers in selecting appropriate tools for developing and validating advanced computing and networking technologies.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
UICopilot: Automating UI Synthesis via Hierarchical Code Generation from Webpage Designs
Authors:
Yi Gui,
Yao Wan,
Zhen Li,
Zhongyi Zhang,
Dongping Chen,
Hongyu Zhang,
Yi Su,
Bohua Chen,
Xing Zhou,
Wenbin Jiang,
Xiangliang Zhang
Abstract:
Automating the synthesis of User Interfaces (UIs) plays a crucial role in enhancing productivity and accelerating the development lifecycle, reducing both development time and manual effort. Recently, the rapid development of Multimodal Large Language Models (MLLMs) has made it possible to generate front-end Hypertext Markup Language (HTML) code directly from webpage designs. However, real-world w…
▽ More
Automating the synthesis of User Interfaces (UIs) plays a crucial role in enhancing productivity and accelerating the development lifecycle, reducing both development time and manual effort. Recently, the rapid development of Multimodal Large Language Models (MLLMs) has made it possible to generate front-end Hypertext Markup Language (HTML) code directly from webpage designs. However, real-world webpages encompass not only a diverse array of HTML tags but also complex stylesheets, resulting in significantly lengthy code. The lengthy code poses challenges for the performance and efficiency of MLLMs, especially in capturing the structural information of UI designs. To address these challenges, this paper proposes UICopilot, a novel approach to automating UI synthesis via hierarchical code generation from webpage designs. The core idea of UICopilot is to decompose the generation process into two stages: first, generating the coarse-grained HTML hierarchical structure, followed by the generation of fine-grained code. To validate the effectiveness of UICopilot, we conduct experiments on a real-world dataset, i.e., WebCode2M. Experimental results demonstrate that UICopilot significantly outperforms existing baselines in both automatic evaluation metrics and human evaluations. Specifically, statistical analysis reveals that the majority of human annotators prefer the webpages generated by UICopilot over those produced by GPT-4V.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Self-Consuming Generative Models with Adversarially Curated Data
Authors:
Xiukun Wei,
Xueru Zhang
Abstract:
Recent advances in generative models have made it increasingly difficult to distinguish real data from model-generated synthetic data. Using synthetic data for successive training of future model generations creates "self-consuming loops", which may lead to model collapse or training instability. Furthermore, synthetic data is often subject to human feedback and curated by users based on their pre…
▽ More
Recent advances in generative models have made it increasingly difficult to distinguish real data from model-generated synthetic data. Using synthetic data for successive training of future model generations creates "self-consuming loops", which may lead to model collapse or training instability. Furthermore, synthetic data is often subject to human feedback and curated by users based on their preferences. Ferbach et al. (2024) recently showed that when data is curated according to user preferences, the self-consuming retraining loop drives the model to converge toward a distribution that optimizes those preferences. However, in practice, data curation is often noisy or adversarially manipulated. For example, competing platforms may recruit malicious users to adversarially curate data and disrupt rival models. In this paper, we study how generative models evolve under self-consuming retraining loops with noisy and adversarially curated data. We theoretically analyze the impact of such noisy data curation on generative models and identify conditions for the robustness of the retraining process. Building on this analysis, we design attack algorithms for competitive adversarial scenarios, where a platform with a limited budget employs malicious users to misalign a rival's model from actual user preferences. Experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed algorithms.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Qwen3 Technical Report
Authors:
An Yang,
Anfeng Li,
Baosong Yang,
Beichen Zhang,
Binyuan Hui,
Bo Zheng,
Bowen Yu,
Chang Gao,
Chengen Huang,
Chenxu Lv,
Chujie Zheng,
Dayiheng Liu,
Fan Zhou,
Fei Huang,
Feng Hu,
Hao Ge,
Haoran Wei,
Huan Lin,
Jialong Tang,
Jian Yang,
Jianhong Tu,
Jianwei Zhang,
Jianxin Yang,
Jiaxi Yang,
Jing Zhou
, et al. (35 additional authors not shown)
Abstract:
In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration…
▽ More
In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration of thinking mode (for complex, multi-step reasoning) and non-thinking mode (for rapid, context-driven responses) into a unified framework. This eliminates the need to switch between different models--such as chat-optimized models (e.g., GPT-4o) and dedicated reasoning models (e.g., QwQ-32B)--and enables dynamic mode switching based on user queries or chat templates. Meanwhile, Qwen3 introduces a thinking budget mechanism, allowing users to allocate computational resources adaptively during inference, thereby balancing latency and performance based on task complexity. Moreover, by leveraging the knowledge from the flagship models, we significantly reduce the computational resources required to build smaller-scale models, while ensuring their highly competitive performance. Empirical evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse benchmarks, including tasks in code generation, mathematical reasoning, agent tasks, etc., competitive against larger MoE models and proprietary models. Compared to its predecessor Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages and dialects, enhancing global accessibility through improved cross-lingual understanding and generation capabilities. To facilitate reproducibility and community-driven research and development, all Qwen3 models are publicly accessible under Apache 2.0.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Strategic Jenga Play via Graph Based Dynamics Modeling
Authors:
Kavya Puthuveetil,
Xinyi Zhang,
Kazuto Yokoyama,
Tetsuya Narita
Abstract:
Controlled manipulation of multiple objects whose dynamics are closely linked is a challenging problem within contact-rich manipulation, requiring an understanding of how the movement of one will impact the others. Using the Jenga game as a testbed to explore this problem, we graph-based modeling to tackle two different aspects of the task: 1) block selection and 2) block extraction. For block sel…
▽ More
Controlled manipulation of multiple objects whose dynamics are closely linked is a challenging problem within contact-rich manipulation, requiring an understanding of how the movement of one will impact the others. Using the Jenga game as a testbed to explore this problem, we graph-based modeling to tackle two different aspects of the task: 1) block selection and 2) block extraction. For block selection, we construct graphs of the Jenga tower and attempt to classify, based on the tower's structure, whether removing a given block will cause the tower to collapse. For block extraction, we train a dynamics model that predicts how all the blocks in the tower will move at each timestep in an extraction trajectory, which we then use in a sampling-based model predictive control loop to safely pull blocks out of the tower with a general-purpose parallel-jaw gripper. We train and evaluate our methods in simulation, demonstrating promising results towards block selection and block extraction on a challenging set of full-sized Jenga towers, even at advanced stages of the game.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
SingNet: Towards a Large-Scale, Diverse, and In-the-Wild Singing Voice Dataset
Authors:
Yicheng Gu,
Chaoren Wang,
Junan Zhang,
Xueyao Zhang,
Zihao Fang,
Haorui He,
Zhizheng Wu
Abstract:
The lack of a publicly-available large-scale and diverse dataset has long been a significant bottleneck for singing voice applications like Singing Voice Synthesis (SVS) and Singing Voice Conversion (SVC). To tackle this problem, we present SingNet, an extensive, diverse, and in-the-wild singing voice dataset. Specifically, we propose a data processing pipeline to extract ready-to-use training dat…
▽ More
The lack of a publicly-available large-scale and diverse dataset has long been a significant bottleneck for singing voice applications like Singing Voice Synthesis (SVS) and Singing Voice Conversion (SVC). To tackle this problem, we present SingNet, an extensive, diverse, and in-the-wild singing voice dataset. Specifically, we propose a data processing pipeline to extract ready-to-use training data from sample packs and songs on the internet, forming 3000 hours of singing voices in various languages and styles. Furthermore, to facilitate the use and demonstrate the effectiveness of SingNet, we pre-train and open-source various state-of-the-art (SOTA) models on Wav2vec2, BigVGAN, and NSF-HiFiGAN based on our collected singing voice data. We also conduct benchmark experiments on Automatic Lyric Transcription (ALT), Neural Vocoder, and Singing Voice Conversion (SVC). Audio demos are available at: https://singnet-dataset.github.io/.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
Controllable Image Colorization with Instance-aware Texts and Masks
Authors:
Yanru An,
Ling Gui,
Qiang Hu,
Chunlei Cai,
Tianxiao Ye,
Xiaoyun Zhang,
Yanfeng Wang
Abstract:
Recently, the application of deep learning in image colorization has received widespread attention. The maturation of diffusion models has further advanced the development of image colorization models. However, current mainstream image colorization models still face issues such as color bleeding and color binding errors, and cannot colorize images at the instance level. In this paper, we propose a…
▽ More
Recently, the application of deep learning in image colorization has received widespread attention. The maturation of diffusion models has further advanced the development of image colorization models. However, current mainstream image colorization models still face issues such as color bleeding and color binding errors, and cannot colorize images at the instance level. In this paper, we propose a diffusion-based colorization method MT-Color to achieve precise instance-aware colorization with use-provided guidance. To tackle color bleeding issue, we design a pixel-level mask attention mechanism that integrates latent features and conditional gray image features through cross-attention. We use segmentation masks to construct cross-attention masks, preventing pixel information from exchanging between different instances. We also introduce an instance mask and text guidance module that extracts instance masks and text representations of each instance, which are then fused with latent features through self-attention, utilizing instance masks to form self-attention masks to prevent instance texts from guiding the colorization of other areas, thus mitigating color binding errors. Furthermore, we apply a multi-instance sampling strategy, which involves sampling each instance region separately and then fusing the results. Additionally, we have created a specialized dataset for instance-level colorization tasks, GPT-color, by leveraging large visual language models on existing image datasets. Qualitative and quantitative experiments show that our model and dataset outperform previous methods and datasets.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
The Truth Becomes Clearer Through Debate! Multi-Agent Systems with Large Language Models Unmask Fake News
Authors:
Yuhan Liu,
Yuxuan Liu,
Xiaoqing Zhang,
Xiuying Chen,
Rui Yan
Abstract:
In today's digital environment, the rapid propagation of fake news via social networks poses significant social challenges. Most existing detection methods either employ traditional classification models, which suffer from low interpretability and limited generalization capabilities, or craft specific prompts for large language models (LLMs) to produce explanations and results directly, failing to…
▽ More
In today's digital environment, the rapid propagation of fake news via social networks poses significant social challenges. Most existing detection methods either employ traditional classification models, which suffer from low interpretability and limited generalization capabilities, or craft specific prompts for large language models (LLMs) to produce explanations and results directly, failing to leverage LLMs' reasoning abilities fully. Inspired by the saying that "truth becomes clearer through debate," our study introduces a novel multi-agent system with LLMs named TruEDebate (TED) to enhance the interpretability and effectiveness of fake news detection. TED employs a rigorous debate process inspired by formal debate settings. Central to our approach are two innovative components: the DebateFlow Agents and the InsightFlow Agents. The DebateFlow Agents organize agents into two teams, where one supports and the other challenges the truth of the news. These agents engage in opening statements, cross-examination, rebuttal, and closing statements, simulating a rigorous debate process akin to human discourse analysis, allowing for a thorough evaluation of news content. Concurrently, the InsightFlow Agents consist of two specialized sub-agents: the Synthesis Agent and the Analysis Agent. The Synthesis Agent summarizes the debates and provides an overarching viewpoint, ensuring a coherent and comprehensive evaluation. The Analysis Agent, which includes a role-aware encoder and a debate graph, integrates role embeddings and models the interactions between debate roles and arguments using an attention mechanism, providing the final judgment.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
MA-ROESL: Motion-aware Rapid Reward Optimization for Efficient Robot Skill Learning from Single Videos
Authors:
Xianghui Wang,
Xinming Zhang,
Yanjun Chen,
Xiaoyu Shen,
Wei Zhang
Abstract:
Vision-language models (VLMs) have demonstrated excellent high-level planning capabilities, enabling locomotion skill learning from video demonstrations without the need for meticulous human-level reward design. However, the improper frame sampling method and low training efficiency of current methods remain a critical bottleneck, resulting in substantial computational overhead and time costs. To…
▽ More
Vision-language models (VLMs) have demonstrated excellent high-level planning capabilities, enabling locomotion skill learning from video demonstrations without the need for meticulous human-level reward design. However, the improper frame sampling method and low training efficiency of current methods remain a critical bottleneck, resulting in substantial computational overhead and time costs. To address this limitation, we propose Motion-aware Rapid Reward Optimization for Efficient Robot Skill Learning from Single Videos (MA-ROESL). MA-ROESL integrates a motion-aware frame selection method to implicitly enhance the quality of VLM-generated reward functions. It further employs a hybrid three-phase training pipeline that improves training efficiency via rapid reward optimization and derives the final policy through online fine-tuning. Experimental results demonstrate that MA-ROESL significantly enhances training efficiency while faithfully reproducing locomotion skills in both simulated and real-world settings, thereby underscoring its potential as a robust and scalable framework for efficient robot locomotion skill learning from video demonstrations.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Benchmarking AI scientists in omics data-driven biological research
Authors:
Erpai Luo,
Jinmeng Jia,
Yifan Xiong,
Xiangyu Li,
Xiaobo Guo,
Baoqi Yu,
Lei Wei,
Xuegong Zhang
Abstract:
The rise of large language models and multi-agent systems has sparked growing interest in AI scientists capable of autonomous biological research. However, existing benchmarks either focus on reasoning without data or on data analysis with predefined statistical answers, lacking realistic, data-driven evaluation settings. Here, we introduce the Biological AI Scientist Benchmark (BaisBench), a benc…
▽ More
The rise of large language models and multi-agent systems has sparked growing interest in AI scientists capable of autonomous biological research. However, existing benchmarks either focus on reasoning without data or on data analysis with predefined statistical answers, lacking realistic, data-driven evaluation settings. Here, we introduce the Biological AI Scientist Benchmark (BaisBench), a benchmark designed to assess AI scientists' ability to generate biological discoveries through data analysis and reasoning with external knowledge. BaisBench comprises two tasks: cell type annotation on 31 expert-labeled single-cell datasets, and scientific discovery through answering 198 multiple-choice questions derived from the biological insights of 41 recent single-cell studies. Systematic experiments on state-of-the-art AI scientists and LLM agents showed that while promising, current models still substantially underperform human experts on both tasks. We hope BaisBench will fill this gap and serve as a foundation for advancing and evaluating AI models for scientific discovery. The benchmark can be found at: https://github.com/EperLuo/BaisBench.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
On Analysis of Superimposed Pilot in Multi-User Massive MIMO with Massive Connectivity
Authors:
Shuxiao Ye,
Xianchao Zhang,
Neng Ye
Abstract:
The simultaneous transmission of numerous users presents substantial challenges due to the inherent trade-off between channel estimation and information transmission in multi-user multiple-input multiple-output (MIMO) system. In this paper, we explore the use of the superimposed pilot (SP) scheme to tackle the large transmitting users, where the number of users may exceed the coherent time. SP sch…
▽ More
The simultaneous transmission of numerous users presents substantial challenges due to the inherent trade-off between channel estimation and information transmission in multi-user multiple-input multiple-output (MIMO) system. In this paper, we explore the use of the superimposed pilot (SP) scheme to tackle the large transmitting users, where the number of users may exceed the coherent time. SP scheme incorporates both transmitted data and noise in the channel estimation process, which is significant different from the counterpart of RP scheme. We provide an in-depth analysis of the interaction between interference caused by channel estimation errors and noise. We then derive the explicit expression for the scaling law of the mutual information lower bound (MILB) in relation to the number of users and the levels of transmitted power. Besides, the optimal power allocation between pilots and data transmission is also derived analytically. The analytical results demonstrate that the SP scheme significantly improves performance compared to traditional RP scheme in our consider case. Numerical results are also presented to validate our theoretical derivations.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
On the Account Security Risks Posed by Password Strength Meters
Authors:
Ming Xu,
Weili Han,
Jitao Yu,
Jing Liu,
Xinyi Zhang,
Yun Lin,
Jin Song Dong
Abstract:
Password strength meters (PSMs) have been widely used by websites to gauge password strength, encouraging users to create stronger passwords. Popular data-driven PSMs, e.g., based on Markov, Probabilistic Context-free Grammar (PCFG) and neural networks, alarm strength based on a model learned from real passwords. Despite their proven effectiveness, the secure utility that arises from the leakage o…
▽ More
Password strength meters (PSMs) have been widely used by websites to gauge password strength, encouraging users to create stronger passwords. Popular data-driven PSMs, e.g., based on Markov, Probabilistic Context-free Grammar (PCFG) and neural networks, alarm strength based on a model learned from real passwords. Despite their proven effectiveness, the secure utility that arises from the leakage of trained passwords remains largely overlooked. To address this gap, we analyze 11 PSMs and find that 5 data-driven meters are vulnerable to membership inference attacks that expose their trained passwords, and seriously, 3 rule-based meters openly disclose their blocked passwords. We specifically design a PSM privacy leakage evaluation approach, and uncover that a series of general data-driven meters are vulnerable to leaking between 10^4 to 10^5 trained passwords, with the PCFG-based models being more vulnerable than other counterparts; furthermore, we aid in deriving insights that the inherent utility-privacy tradeoff is not as severe as previously thought. To further exploit the risks, we develop novel meter-aware attacks when a clever attacker can filter the used passwords during compromising accounts on websites using the meter, and experimentally show that attackers targeting websites that deployed the popular Zxcvbn meter can compromise an additional 5.84% user accounts within 10 attempts, demonstrating the urgent need for privacy-preserving PSMs that protect the confidentiality of the meter's used passwords. Finally, we sketch some counter-measures to mitigate these threats.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Ultra Lowrate Image Compression with Semantic Residual Coding and Compression-aware Diffusion
Authors:
Anle Ke,
Xu Zhang,
Tong Chen,
Ming Lu,
Chao Zhou,
Jiawen Gu,
Zhan Ma
Abstract:
Existing multimodal large model-based image compression frameworks often rely on a fragmented integration of semantic retrieval, latent compression, and generative models, resulting in suboptimal performance in both reconstruction fidelity and coding efficiency. To address these challenges, we propose a residual-guided ultra lowrate image compression named ResULIC, which incorporates residual sign…
▽ More
Existing multimodal large model-based image compression frameworks often rely on a fragmented integration of semantic retrieval, latent compression, and generative models, resulting in suboptimal performance in both reconstruction fidelity and coding efficiency. To address these challenges, we propose a residual-guided ultra lowrate image compression named ResULIC, which incorporates residual signals into both semantic retrieval and the diffusion-based generation process. Specifically, we introduce Semantic Residual Coding (SRC) to capture the semantic disparity between the original image and its compressed latent representation. A perceptual fidelity optimizer is further applied for superior reconstruction quality. Additionally, we present the Compression-aware Diffusion Model (CDM), which establishes an optimal alignment between bitrates and diffusion time steps, improving compression-reconstruction synergy. Extensive experiments demonstrate the effectiveness of ResULIC, achieving superior objective and subjective performance compared to state-of-the-art diffusion-based methods with - 80.7%, -66.3% BD-rate saving in terms of LPIPS and FID. Project page is available at https: //njuvision.github.io/ResULIC/.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement
Authors:
Haoran Ye,
Jing Jin,
Yuhang Xie,
Xin Zhang,
Guojie Song
Abstract:
The rapid advancement of large language models (LLMs) has outpaced traditional evaluation methodologies. It presents novel challenges, such as measuring human-like psychological constructs, navigating beyond static and task-specific benchmarks, and establishing human-centered evaluation. These challenges intersect with Psychometrics, the science of quantifying the intangible aspects of human psych…
▽ More
The rapid advancement of large language models (LLMs) has outpaced traditional evaluation methodologies. It presents novel challenges, such as measuring human-like psychological constructs, navigating beyond static and task-specific benchmarks, and establishing human-centered evaluation. These challenges intersect with Psychometrics, the science of quantifying the intangible aspects of human psychology, such as personality, values, and intelligence. This survey introduces and synthesizes an emerging interdisciplinary field of LLM Psychometrics, which leverages psychometric instruments, theories, and principles to evaluate, understand, and enhance LLMs. We systematically explore the role of Psychometrics in shaping benchmarking principles, broadening evaluation scopes, refining methodologies, validating results, and advancing LLM capabilities. This paper integrates diverse perspectives to provide a structured framework for researchers across disciplines, enabling a more comprehensive understanding of this nascent field. Ultimately, we aim to provide actionable insights for developing future evaluation paradigms that align with human-level AI and promote the advancement of human-centered AI systems for societal benefit. A curated repository of LLM psychometric resources is available at https://github.com/valuebyte-ai/Awesome-LLM-Psychometrics.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Making Small Language Models Efficient Reasoners: Intervention, Supervision, Reinforcement
Authors:
Xuechen Zhang,
Zijian Huang,
Chenshun Ni,
Ziyang Xiong,
Jiasi Chen,
Samet Oymak
Abstract:
Recent research enhances language model reasoning by scaling test-time compute via longer chain-of-thought traces. This often improves accuracy but also introduces redundancy and high computational cost, especially for small language models distilled with supervised fine-tuning (SFT). In this work, we propose new algorithms to improve token-efficient reasoning with small-scale models by effectivel…
▽ More
Recent research enhances language model reasoning by scaling test-time compute via longer chain-of-thought traces. This often improves accuracy but also introduces redundancy and high computational cost, especially for small language models distilled with supervised fine-tuning (SFT). In this work, we propose new algorithms to improve token-efficient reasoning with small-scale models by effectively trading off accuracy and computation. We first show that the post-SFT model fails to determine the optimal stopping point of the reasoning process, resulting in verbose and repetitive outputs. Verbosity also significantly varies across wrong vs correct responses. To address these issues, we propose two solutions: (1) Temperature scaling (TS) to control the stopping point for the thinking phase and thereby trace length, and (2) TLDR: a length-regularized reinforcement learning method based on GRPO that facilitates multi-level trace length control (e.g. short, medium, long reasoning). Experiments on four reasoning benchmarks, MATH500, AMC, AIME24 and OlympiadBench, demonstrate that TS is highly effective compared to s1's budget forcing approach and TLDR significantly improves token efficiency by about 50% with minimal to no accuracy loss over the SFT baseline. Moreover, TLDR also facilitates flexible control over the response length, offering a practical and effective solution for token-efficient reasoning in small models. Ultimately, our work reveals the importance of stopping time control, highlights shortcomings of pure SFT, and provides effective algorithmic recipes.
△ Less
Submitted 13 May, 2025; v1 submitted 12 May, 2025;
originally announced May 2025.
-
BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning
Authors:
Yuyang Liu,
Liuzhenghao Lv,
Xiancheng Zhang,
Li Yuan,
Yonghong Tian
Abstract:
Biological protocols are fundamental to reproducible and safe life science research. While LLMs excel on general tasks, their systematic evaluation on these highly specialized, accuracy-critical, and inherently procedural texts remains limited. In this work, we present BioProBench, the first large-scale, integrated multi-task benchmark for biological protocol understanding and reasoning. While lim…
▽ More
Biological protocols are fundamental to reproducible and safe life science research. While LLMs excel on general tasks, their systematic evaluation on these highly specialized, accuracy-critical, and inherently procedural texts remains limited. In this work, we present BioProBench, the first large-scale, integrated multi-task benchmark for biological protocol understanding and reasoning. While limited benchmarks have touched upon specific aspects like protocol QA, BioProBench provides a comprehensive suite of five core tasks: Protocol Question Answering, Step Ordering, Error Correction, Protocol Generation, and Protocol Reasoning, enabling a holistic evaluation of LLMs on procedural biological texts. Built upon 27K original protocols, it yields nearly 556K high-quality structured instances. We evaluate 12 mainstream open/closed-source LLMs on BioProBench. Experimental results reveal that while top models preform well on surface understanding tasks, struggle significantly with deep reasoning and structured generation tasks like ordering and generation. Furthermore, model comparisons reveal diverse performance: certain open-source models approach closed-source levels on some tasks, yet bio-specific small models lag behind general LLMs, indicating limitations on complex procedural content. Overall, our findings underscore that procedural reasoning within biological protocols represents a significant challenge for current LLMs. BioProBench serves as a standardized framework to diagnose these specific limitations and guide the development of AI systems better equipped for safely automating complex scientific procedures. The code and data are available at: https://github.com/YuyangSunshine/bioprotocolbench and https://huggingface.co/datasets/GreatCaptainNemo/BioProBench.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
QoSBERT: An Uncertainty-Aware Approach based on Pre-trained Language Models for Service Quality Prediction
Authors:
Ziliang Wang,
Xiaohong Zhang,
Ze Shi Li,
Meng Yan
Abstract:
Accurate prediction of Quality of Service (QoS) metrics is fundamental for selecting and managing cloud based services. Traditional QoS models rely on manual feature engineering and yield only point estimates, offering no insight into the confidence of their predictions. In this paper, we propose QoSBERT, the first framework that reformulates QoS prediction as a semantic regression task based on p…
▽ More
Accurate prediction of Quality of Service (QoS) metrics is fundamental for selecting and managing cloud based services. Traditional QoS models rely on manual feature engineering and yield only point estimates, offering no insight into the confidence of their predictions. In this paper, we propose QoSBERT, the first framework that reformulates QoS prediction as a semantic regression task based on pre trained language models. Unlike previous approaches relying on sparse numerical features, QoSBERT automatically encodes user service metadata into natural language descriptions, enabling deep semantic understanding. Furthermore, we integrate a Monte Carlo Dropout based uncertainty estimation module, allowing for trustworthy and risk-aware service quality prediction, which is crucial yet underexplored in existing QoS models. QoSBERT applies attentive pooling over contextualized embeddings and a lightweight multilayer perceptron regressor, fine tuned jointly to minimize absolute error. We further exploit the resulting uncertainty estimates to select high quality training samples, improving robustness in low resource settings. On standard QoS benchmark datasets, QoSBERT achieves an average reduction of 11.7% in MAE and 6.7% in RMSE for response time prediction, and 6.9% in MAE for throughput prediction compared to the strongest baselines, while providing well calibrated confidence intervals for robust and trustworthy service quality estimation. Our approach not only advances the accuracy of service quality prediction but also delivers reliable uncertainty quantification, paving the way for more trustworthy, data driven service selection and optimization.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets
Authors:
Weiyu Li,
Xuanyang Zhang,
Zheng Sun,
Di Qi,
Hao Li,
Wei Cheng,
Weiwei Cai,
Shihao Wu,
Jiarui Liu,
Zihao Wang,
Xiao Chen,
Feipeng Tian,
Jianxiong Pan,
Zeming Li,
Gang Yu,
Xiangyu Zhang,
Daxin Jiang,
Ping Tan
Abstract:
While generative artificial intelligence has advanced significantly across text, image, audio, and video domains, 3D generation remains comparatively underdeveloped due to fundamental challenges such as data scarcity, algorithmic limitations, and ecosystem fragmentation. To this end, we present Step1X-3D, an open framework addressing these challenges through: (1) a rigorous data curation pipeline…
▽ More
While generative artificial intelligence has advanced significantly across text, image, audio, and video domains, 3D generation remains comparatively underdeveloped due to fundamental challenges such as data scarcity, algorithmic limitations, and ecosystem fragmentation. To this end, we present Step1X-3D, an open framework addressing these challenges through: (1) a rigorous data curation pipeline processing >5M assets to create a 2M high-quality dataset with standardized geometric and textural properties; (2) a two-stage 3D-native architecture combining a hybrid VAE-DiT geometry generator with an diffusion-based texture synthesis module; and (3) the full open-source release of models, training code, and adaptation modules. For geometry generation, the hybrid VAE-DiT component produces TSDF representations by employing perceiver-based latent encoding with sharp edge sampling for detail preservation. The diffusion-based texture synthesis module then ensures cross-view consistency through geometric conditioning and latent-space synchronization. Benchmark results demonstrate state-of-the-art performance that exceeds existing open-source methods, while also achieving competitive quality with proprietary solutions. Notably, the framework uniquely bridges the 2D and 3D generation paradigms by supporting direct transfer of 2D control techniques~(e.g., LoRA) to 3D synthesis. By simultaneously advancing data quality, algorithmic fidelity, and reproducibility, Step1X-3D aims to establish new standards for open research in controllable 3D asset generation.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining
Authors:
Xiaomi LLM-Core Team,
:,
Bingquan Xia,
Bowen Shen,
Cici,
Dawei Zhu,
Di Zhang,
Gang Wang,
Hailin Zhang,
Huaqiu Liu,
Jiebao Xiao,
Jinhao Dong,
Liang Zhao,
Peidian Li,
Peng Wang,
Shihua Yu,
Shimao Chen,
Weikun Wang,
Wenhan Ma,
Xiangwei Deng,
Yi Huang,
Yifan Song,
Zihan Jiang,
Bowen Ye,
Can Cai
, et al. (40 additional authors not shown)
Abstract:
We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with additional Multi-Token Prediction objective…
▽ More
We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with additional Multi-Token Prediction objective for enhanced performance and accelerated inference speed. During post-training, we curate a dataset of 130K verifiable mathematics and programming problems for reinforcement learning, integrating a test-difficulty-driven code-reward scheme to alleviate sparse-reward issues and employing strategic data resampling to stabilize training. Extensive evaluations show that MiMo-7B-Base possesses exceptional reasoning potential, outperforming even much larger 32B models. The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code and general reasoning tasks, surpassing the performance of OpenAI o1-mini. The model checkpoints are available at https://github.com/xiaomimimo/MiMo.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
Seed1.5-VL Technical Report
Authors:
Dong Guo,
Faming Wu,
Feida Zhu,
Fuxing Leng,
Guang Shi,
Haobin Chen,
Haoqi Fan,
Jian Wang,
Jianyu Jiang,
Jiawei Wang,
Jingji Chen,
Jingjia Huang,
Kang Lei,
Liping Yuan,
Lishu Luo,
Pengfei Liu,
Qinghao Ye,
Rui Qian,
Shen Yan,
Shixiong Zhao,
Shuai Peng,
Shuangye Li,
Sihang Yuan,
Sijin Wu,
Tianheng Cheng
, et al. (172 additional authors not shown)
Abstract:
We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluati…
▽ More
We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving the state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428)
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
YOPOv2-Tracker: An End-to-End Agile Tracking and Navigation Framework from Perception to Action
Authors:
Junjie Lu,
Yulin Hui,
Xuewei Zhang,
Wencan Feng,
Hongming Shen,
Zhiyu Li,
Bailing Tian
Abstract:
Traditional target tracking pipelines including detection, mapping, navigation, and control are comprehensive but introduce high latency, limitting the agility of quadrotors. On the contrary, we follow the design principle of "less is more", striving to simplify the process while maintaining effectiveness. In this work, we propose an end-to-end agile tracking and navigation framework for quadrotor…
▽ More
Traditional target tracking pipelines including detection, mapping, navigation, and control are comprehensive but introduce high latency, limitting the agility of quadrotors. On the contrary, we follow the design principle of "less is more", striving to simplify the process while maintaining effectiveness. In this work, we propose an end-to-end agile tracking and navigation framework for quadrotors that directly maps the sensory observations to control commands. Importantly, leveraging the multimodal nature of navigation and detection tasks, our network maintains interpretability by explicitly integrating the independent modules of the traditional pipeline, rather than a crude action regression. In detail, we adopt a set of motion primitives as anchors to cover the searching space regarding the feasible region and potential target. Then we reformulate the trajectory optimization as regression of primitive offsets and associated costs considering the safety, smoothness, and other metrics. For tracking task, the trajectories are expected to approach the target and additional objectness scores are predicted. Subsequently, the predictions, after compensation for the estimated lumped disturbance, are transformed into thrust and attitude as control commands for swift response. During training, we seamlessly integrate traditional motion planning with deep learning by directly back-propagating the gradients of trajectory costs to the network, eliminating the need for expert demonstration in imitation learning and providing more direct guidance than reinforcement learning. Finally, we deploy the algorithm on a compact quadrotor and conduct real-world validations in both forest and building environments to demonstrate the efficiency of the proposed method.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
RCOMPSs: A Scalable Runtime System for R Code Execution on Manycore Systems
Authors:
Xiran Zhang,
Javier Conejero,
Sameh Abdulah,
Jorge Ejarque,
Ying Sun,
Rosa M. Badia,
David E. Keyes,
Marc G. Genton
Abstract:
R has become a cornerstone of scientific and statistical computing due to its extensive package ecosystem, expressive syntax, and strong support for reproducible analysis. However, as data sizes and computational demands grow, native R parallelism support remains limited. This paper presents RCOMPSs, a scalable runtime system that enables efficient parallel execution of R applications on multicore…
▽ More
R has become a cornerstone of scientific and statistical computing due to its extensive package ecosystem, expressive syntax, and strong support for reproducible analysis. However, as data sizes and computational demands grow, native R parallelism support remains limited. This paper presents RCOMPSs, a scalable runtime system that enables efficient parallel execution of R applications on multicore and manycore systems. RCOMPSs adopts a dynamic, task-based programming model, allowing users to write code in a sequential style, while the runtime automatically handles asynchronous task execution, dependency tracking, and scheduling across available resources. We present RCOMPSs using three representative data analysis algorithms, i.e., K-nearest neighbors (KNN) classification, K-means clustering, and linear regression and evaluate their performance on two modern HPC systems: KAUST Shaheen-III and Barcelona Supercomputing Center (BSC) MareNostrum 5. Experimental results reveal that RCOMPSs demonstrates both strong and weak scalability on up to 128 cores per node and across 32 nodes. For KNN and K-means, parallel efficiency remains above 70% in most settings, while linear regression maintains acceptable performance under shared and distributed memory configurations despite its deeper task dependencies. Overall, RCOMPSs significantly enhances the parallel capabilities of R with minimal, automated, and runtime-aware user intervention, making it a practical solution for large-scale data analytics in high-performance environments.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
Beyond Identity: A Generalizable Approach for Deepfake Audio Detection
Authors:
Yasaman Ahmadiadli,
Xiao-Ping Zhang,
Naimul Khan
Abstract:
Deepfake audio presents a growing threat to digital security, due to its potential for social engineering, fraud, and identity misuse. However, existing detection models suffer from poor generalization across datasets, due to implicit identity leakage, where models inadvertently learn speaker-specific features instead of manipulation artifacts. To the best of our knowledge, this is the first study…
▽ More
Deepfake audio presents a growing threat to digital security, due to its potential for social engineering, fraud, and identity misuse. However, existing detection models suffer from poor generalization across datasets, due to implicit identity leakage, where models inadvertently learn speaker-specific features instead of manipulation artifacts. To the best of our knowledge, this is the first study to explicitly analyze and address identity leakage in the audio deepfake detection domain. This work proposes an identity-independent audio deepfake detection framework that mitigates identity leakage by encouraging the model to focus on forgery-specific artifacts instead of overfitting to speaker traits. Our approach leverages Artifact Detection Modules (ADMs) to isolate synthetic artifacts in both time and frequency domains, enhancing cross-dataset generalization. We introduce novel dynamic artifact generation techniques, including frequency domain swaps, time domain manipulations, and background noise augmentation, to enforce learning of dataset-invariant features. Extensive experiments conducted on ASVspoof2019, ADD 2022, FoR, and In-The-Wild datasets demonstrate that the proposed ADM-enhanced models achieve F1 scores of 0.230 (ADD 2022), 0.604 (FoR), and 0.813 (In-The-Wild), consistently outperforming the baseline. Dynamic Frequency Swap proves to be the most effective strategy across diverse conditions. These findings emphasize the value of artifact-based learning in mitigating implicit identity leakage for more generalizable audio deepfake detection.
△ Less
Submitted 10 May, 2025;
originally announced May 2025.
-
MultiTaskVIF: Segmentation-oriented visible and infrared image fusion via multi-task learning
Authors:
Zixian Zhao,
Andrew Howes,
Xingchen Zhang
Abstract:
Visible and infrared image fusion (VIF) has attracted significant attention in recent years. Traditional VIF methods primarily focus on generating fused images with high visual quality, while recent advancements increasingly emphasize incorporating semantic information into the fusion model during training. However, most existing segmentation-oriented VIF methods adopt a cascade structure comprisi…
▽ More
Visible and infrared image fusion (VIF) has attracted significant attention in recent years. Traditional VIF methods primarily focus on generating fused images with high visual quality, while recent advancements increasingly emphasize incorporating semantic information into the fusion model during training. However, most existing segmentation-oriented VIF methods adopt a cascade structure comprising separate fusion and segmentation models, leading to increased network complexity and redundancy. This raises a critical question: can we design a more concise and efficient structure to integrate semantic information directly into the fusion model during training-Inspired by multi-task learning, we propose a concise and universal training framework, MultiTaskVIF, for segmentation-oriented VIF models. In this framework, we introduce a multi-task head decoder (MTH) to simultaneously output both the fused image and the segmentation result during training. Unlike previous cascade training frameworks that necessitate joint training with a complete segmentation model, MultiTaskVIF enables the fusion model to learn semantic features by simply replacing its decoder with MTH. Extensive experimental evaluations validate the effectiveness of the proposed method. Our code will be released upon acceptance.
△ Less
Submitted 10 May, 2025;
originally announced May 2025.
-
QiMeng-TensorOp: Automatically Generating High-Performance Tensor Operators with Hardware Primitives
Authors:
Xuzhi Zhang,
Shaohui Peng,
Qirui Zhou,
Yuanbo Wen,
Qi Guo,
Ruizhi Chen,
Xinguo Zhu,
Weiqiang Xiong,
Haixin Chen,
Congying Ma,
Ke Gao,
Chen Zhao,
Yanjun Wu,
Yunji Chen,
Ling Li
Abstract:
Computation-intensive tensor operators constitute over 90\% of the computations in Large Language Models (LLMs) and Deep Neural Networks.Automatically and efficiently generating high-performance tensor operators with hardware primitives is crucial for diverse and ever-evolving hardware architectures like RISC-V, ARM, and GPUs, as manually optimized implementation takes at least months and lacks po…
▽ More
Computation-intensive tensor operators constitute over 90\% of the computations in Large Language Models (LLMs) and Deep Neural Networks.Automatically and efficiently generating high-performance tensor operators with hardware primitives is crucial for diverse and ever-evolving hardware architectures like RISC-V, ARM, and GPUs, as manually optimized implementation takes at least months and lacks portability.LLMs excel at generating high-level language codes, but they struggle to fully comprehend hardware characteristics and produce high-performance tensor operators. We introduce a tensor-operator auto-generation framework with a one-line user prompt (QiMeng-TensorOp), which enables LLMs to automatically exploit hardware characteristics to generate tensor operators with hardware primitives, and tune parameters for optimal performance across diverse hardware. Experimental results on various hardware platforms, SOTA LLMs, and typical tensor operators demonstrate that QiMeng-TensorOp effectively unleashes the computing capability of various hardware platforms, and automatically generates tensor operators of superior performance. Compared with vanilla LLMs, QiMeng-TensorOp achieves up to $1291 \times$ performance improvement. Even compared with human experts, QiMeng-TensorOp could reach $251 \%$ of OpenBLAS on RISC-V CPUs, and $124 \%$ of cuBLAS on NVIDIA GPUs. Additionally, QiMeng-TensorOp also significantly reduces development costs by $200 \times$ compared with human experts.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
A machine learning model for skillful climate system prediction
Authors:
Chenguang Zhou,
Lei Chen,
Xiaohui Zhong,
Bo Lu,
Hao Li,
Libo Wu,
Jie Wu,
Jiahui Hu,
Zesheng Dou,
Pang-Chi Hsu,
Xiaoye Zhang
Abstract:
Climate system models (CSMs), through integrating cross-sphere interactions among the atmosphere, ocean, land, and cryosphere, have emerged as pivotal tools for deciphering climate dynamics and improving forecasting capabilities. Recent breakthroughs in artificial intelligence (AI)-driven meteorological modeling have demonstrated remarkable success in single-sphere systems and partially spheres co…
▽ More
Climate system models (CSMs), through integrating cross-sphere interactions among the atmosphere, ocean, land, and cryosphere, have emerged as pivotal tools for deciphering climate dynamics and improving forecasting capabilities. Recent breakthroughs in artificial intelligence (AI)-driven meteorological modeling have demonstrated remarkable success in single-sphere systems and partially spheres coupled systems. However, the development of a fully coupled AI-based climate system model encompassing atmosphere-ocean-land-sea ice interactions has remained an unresolved challenge. This paper introduces FengShun-CSM, an AI-based CSM model that provides 60-day global daily forecasts for 29 critical variables across atmospheric, oceanic, terrestrial, and cryospheric domains. The model significantly outperforms the European Centre for Medium-Range Weather Forecasts (ECMWF) subseasonal-to-seasonal (S2S) model in predicting most variables, particularly precipitation, land surface, and oceanic components. This enhanced capability is primarily attributed to its improved representation of intra-seasonal variability modes, most notably the Madden-Julian Oscillation (MJO). Remarkably, FengShun-CSM exhibits substantial potential in predicting subseasonal extreme events. Such breakthroughs will advance its applications in meteorological disaster mitigation, marine ecosystem conservation, and agricultural productivity enhancement. Furthermore, it validates the feasibility of developing AI-powered CSMs through machine learning technologies, establishing a transformative paradigm for next-generation Earth system modeling.
△ Less
Submitted 5 May, 2025;
originally announced May 2025.
-
BrainSegDMlF: A Dynamic Fusion-enhanced SAM for Brain Lesion Segmentation
Authors:
Hongming Wang,
Yifeng Wu,
Huimin Huang,
Hongtao Wu,
Jia-Xuan Jiang,
Xiaodong Zhang,
Hao Zheng,
Xian Wu,
Yefeng Zheng,
Jinping Xu,
Jing Cheng
Abstract:
The segmentation of substantial brain lesions is a significant and challenging task in the field of medical image segmentation. Substantial brain lesions in brain imaging exhibit high heterogeneity, with indistinct boundaries between lesion regions and normal brain tissue. Small lesions in single slices are difficult to identify, making the accurate and reproducible segmentation of abnormal region…
▽ More
The segmentation of substantial brain lesions is a significant and challenging task in the field of medical image segmentation. Substantial brain lesions in brain imaging exhibit high heterogeneity, with indistinct boundaries between lesion regions and normal brain tissue. Small lesions in single slices are difficult to identify, making the accurate and reproducible segmentation of abnormal regions, as well as their feature description, highly complex. Existing methods have the following limitations: 1) They rely solely on single-modal information for learning, neglecting the multi-modal information commonly used in diagnosis. This hampers the ability to comprehensively acquire brain lesion information from multiple perspectives and prevents the effective integration and utilization of multi-modal data inputs, thereby limiting a holistic understanding of lesions. 2) They are constrained by the amount of data available, leading to low sensitivity to small lesions and difficulty in detecting subtle pathological changes. 3) Current SAM-based models rely on external prompts, which cannot achieve automatic segmentation and, to some extent, affect diagnostic efficiency.To address these issues, we have developed a large-scale fully automated segmentation model specifically designed for brain lesion segmentation, named BrainSegDMLF. This model has the following features: 1) Dynamic Modal Interactive Fusion (DMIF) module that processes and integrates multi-modal data during the encoding process, providing the SAM encoder with more comprehensive modal information. 2) Layer-by-Layer Upsampling Decoder, enabling the model to extract rich low-level and high-level features even with limited data, thereby detecting the presence of small lesions. 3) Automatic segmentation masks, allowing the model to generate lesion masks automatically without requiring manual prompts.
△ Less
Submitted 9 May, 2025;
originally announced May 2025.
-
Versatile Distributed Maneuvering with Generalized Formations using Guiding Vector Fields
Authors:
Yang Lu,
Sha Luo,
Pengming Zhu,
Weijia Yao,
Hector Garcia de Marina,
Xinglong Zhang,
Xin Xu
Abstract:
This paper presents a unified approach to realize versatile distributed maneuvering with generalized formations. Specifically, we decompose the robots' maneuvers into two independent components, i.e., interception and enclosing, which are parameterized by two independent virtual coordinates. Treating these two virtual coordinates as dimensions of an abstract manifold, we derive the corresponding s…
▽ More
This paper presents a unified approach to realize versatile distributed maneuvering with generalized formations. Specifically, we decompose the robots' maneuvers into two independent components, i.e., interception and enclosing, which are parameterized by two independent virtual coordinates. Treating these two virtual coordinates as dimensions of an abstract manifold, we derive the corresponding singularity-free guiding vector field (GVF), which, along with a distributed coordination mechanism based on the consensus theory, guides robots to achieve various motions (i.e., versatile maneuvering), including (a) formation tracking, (b) target enclosing, and (c) circumnavigation. Additional motion parameters can generate more complex cooperative robot motions. Based on GVFs, we design a controller for a nonholonomic robot model. Besides the theoretical results, extensive simulations and experiments are performed to validate the effectiveness of the approach.
△ Less
Submitted 9 May, 2025;
originally announced May 2025.
-
MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design
Authors:
Haojie Duanmu,
Xiuhong Li,
Zhihang Yuan,
Size Zheng,
Jiangfei Duan,
Xingcheng Zhang,
Dahua Lin
Abstract:
Mixture-of-Experts (MoE) models face deployment challenges due to their large parameter counts and computational demands. We explore quantization for MoE models and highlight two key insights: 1) linear blocks exhibit varying quantization sensitivity, and 2) divergent expert activation frequencies create heterogeneous computational characteristics. Based on these observations, we introduce MxMoE,…
▽ More
Mixture-of-Experts (MoE) models face deployment challenges due to their large parameter counts and computational demands. We explore quantization for MoE models and highlight two key insights: 1) linear blocks exhibit varying quantization sensitivity, and 2) divergent expert activation frequencies create heterogeneous computational characteristics. Based on these observations, we introduce MxMoE, a mixed-precision optimization framework for MoE models that considers both algorithmic and system perspectives. MxMoE navigates the design space defined by parameter sensitivity, expert activation dynamics, and hardware resources to derive efficient mixed-precision configurations. Additionally, MxMoE automatically generates optimized mixed-precision GroupGEMM kernels, enabling parallel execution of GEMMs with different precisions. Evaluations show that MxMoE outperforms existing methods, achieving 2.4 lower Wikitext-2 perplexity than GPTQ at 2.25-bit and delivering up to 3.4x speedup over full precision, as well as up to 29.4% speedup over uniform quantization at equivalent accuracy with 5-bit weight-activation quantization. Our code is available at https://github.com/cat538/MxMoE.
△ Less
Submitted 9 May, 2025;
originally announced May 2025.
-
Smart Starts: Accelerating Convergence through Uncommon Region Exploration
Authors:
Xinyu Zhang,
Mário Antunes,
Tyler Estro,
Erez Zadok,
Klaus Mueller
Abstract:
Initialization profoundly affects evolutionary algorithm (EA) efficacy by dictating search trajectories and convergence. This study introduces a hybrid initialization strategy combining empty-space search algorithm (ESA) and opposition-based learning (OBL). OBL initially generates a diverse population, subsequently augmented by ESA, which identifies under-explored regions. This synergy enhances po…
▽ More
Initialization profoundly affects evolutionary algorithm (EA) efficacy by dictating search trajectories and convergence. This study introduces a hybrid initialization strategy combining empty-space search algorithm (ESA) and opposition-based learning (OBL). OBL initially generates a diverse population, subsequently augmented by ESA, which identifies under-explored regions. This synergy enhances population diversity, accelerates convergence, and improves EA performance on complex, high-dimensional optimization problems. Benchmark results demonstrate the proposed method's superiority in solution quality and convergence speed compared to conventional initialization techniques.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
Image Restoration via Multi-domain Learning
Authors:
Xingyu Jiang,
Ning Gao,
Xiuhui Zhang,
Hongkun Dou,
Shaowen Fu,
Xiaoqing Zhong,
Hongjue Li,
Yue Deng
Abstract:
Due to adverse atmospheric and imaging conditions, natural images suffer from various degradation phenomena. Consequently, image restoration has emerged as a key solution and garnered substantial attention. Although recent Transformer architectures have demonstrated impressive success across various restoration tasks, their considerable model complexity poses significant challenges for both traini…
▽ More
Due to adverse atmospheric and imaging conditions, natural images suffer from various degradation phenomena. Consequently, image restoration has emerged as a key solution and garnered substantial attention. Although recent Transformer architectures have demonstrated impressive success across various restoration tasks, their considerable model complexity poses significant challenges for both training and real-time deployment. Furthermore, instead of investigating the commonalities among different degradations, most existing restoration methods focus on modifying Transformer under limited restoration priors. In this work, we first review various degradation phenomena under multi-domain perspective, identifying common priors. Then, we introduce a novel restoration framework, which integrates multi-domain learning into Transformer. Specifically, in Token Mixer, we propose a Spatial-Wavelet-Fourier multi-domain structure that facilitates local-region-global multi-receptive field modeling to replace vanilla self-attention. Additionally, in Feed-Forward Network, we incorporate multi-scale learning to fuse multi-domain features at different resolutions. Comprehensive experimental results across ten restoration tasks, such as dehazing, desnowing, motion deblurring, defocus deblurring, rain streak/raindrop removal, cloud removal, shadow removal, underwater enhancement and low-light enhancement, demonstrate that our proposed model outperforms state-of-the-art methods and achieves a favorable trade-off among restoration performance, parameter size, computational cost and inference latency. The code is available at: https://github.com/deng-ai-lab/SWFormer.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
Authors:
Chao Liao,
Liyang Liu,
Xun Wang,
Zhengxiong Luo,
Xinyu Zhang,
Wenliang Zhao,
Jie Wu,
Liang Li,
Zhi Tian,
Weilin Huang
Abstract:
Recent progress in unified models for image understanding and generation has been impressive, yet most approaches remain limited to single-modal generation conditioned on multiple modalities. In this paper, we present Mogao, a unified framework that advances this paradigm by enabling interleaved multi-modal generation through a causal approach. Mogao integrates a set of key technical improvements…
▽ More
Recent progress in unified models for image understanding and generation has been impressive, yet most approaches remain limited to single-modal generation conditioned on multiple modalities. In this paper, we present Mogao, a unified framework that advances this paradigm by enabling interleaved multi-modal generation through a causal approach. Mogao integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance, which allow it to harness the strengths of both autoregressive models for text generation and diffusion models for high-quality image synthesis. These practical improvements also make Mogao particularly effective to process interleaved sequences of text and images arbitrarily. To further unlock the potential of unified models, we introduce an efficient training strategy on a large-scale, in-house dataset specifically curated for joint text and image generation. Extensive experiments show that Mogao not only achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs. Its emergent capabilities in zero-shot image editing and compositional generation highlight Mogao as a practical omni-modal foundation model, paving the way for future development and scaling the unified multi-modal systems.
△ Less
Submitted 11 May, 2025; v1 submitted 8 May, 2025;
originally announced May 2025.
-
CottonSim: Development of an autonomous visual-guided robotic cotton-picking system in the Gazebo
Authors:
Thevathayarajh Thayananthan,
Xin Zhang,
Yanbo Huang,
Jingdao Chen,
Nuwan K. Wijewardane,
Vitor S. Martins,
Gary D. Chesser,
Christopher T. Goodin
Abstract:
In this study, an autonomous visual-guided robotic cotton-picking system, built on a Clearpath's Husky robot platform and the Cotton-Eye perception system, was developed in the Gazebo robotic simulator. Furthermore, a virtual cotton farm was designed and developed as a Robot Operating System (ROS 1) package to deploy the robotic cotton picker in the Gazebo environment for simulating autonomous fie…
▽ More
In this study, an autonomous visual-guided robotic cotton-picking system, built on a Clearpath's Husky robot platform and the Cotton-Eye perception system, was developed in the Gazebo robotic simulator. Furthermore, a virtual cotton farm was designed and developed as a Robot Operating System (ROS 1) package to deploy the robotic cotton picker in the Gazebo environment for simulating autonomous field navigation. The navigation was assisted by the map coordinates and an RGB-depth camera, while the ROS navigation algorithm utilized a trained YOLOv8n-seg model for instance segmentation. The model achieved a desired mean Average Precision (mAP) of 85.2%, a recall of 88.9%, and a precision of 93.0% for scene segmentation. The developed ROS navigation packages enabled our robotic cotton-picking system to autonomously navigate through the cotton field using map-based and GPS-based approaches, visually aided by a deep learning-based perception system. The GPS-based navigation approach achieved a 100% completion rate (CR) with a threshold of 5 x 10^-6 degrees, while the map-based navigation approach attained a 96.7% CR with a threshold of 0.25 m. This study establishes a fundamental baseline of simulation for future agricultural robotics and autonomous vehicles in cotton farming and beyond. CottonSim code and data are released to the research community via GitHub: https://github.com/imtheva/CottonSim
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
Software Development Life Cycle Perspective: A Survey of Benchmarks for Code Large Language Models and Agents
Authors:
Kaixin Wang,
Tianlin Li,
Xiaoyu Zhang,
Chong Wang,
Weisong Sun,
Yang Liu,
Bin Shi
Abstract:
Code large language models (CodeLLMs) and agents have shown great promise in tackling complex software engineering tasks.Compared to traditional software engineering methods, CodeLLMs and agents offer stronger abilities, and can flexibly process inputs and outputs in both natural and code. Benchmarking plays a crucial role in evaluating the capabilities of CodeLLMs and agents, guiding their develo…
▽ More
Code large language models (CodeLLMs) and agents have shown great promise in tackling complex software engineering tasks.Compared to traditional software engineering methods, CodeLLMs and agents offer stronger abilities, and can flexibly process inputs and outputs in both natural and code. Benchmarking plays a crucial role in evaluating the capabilities of CodeLLMs and agents, guiding their development and deployment. However, despite their growing significance, there remains a lack of comprehensive reviews of benchmarks for CodeLLMs and agents. To bridge this gap, this paper provides a comprehensive review of existing benchmarks for CodeLLMs and agents, studying and analyzing 181 benchmarks from 461 relevant papers, covering the different phases of the software development life cycle (SDLC). Our findings reveal a notable imbalance in the coverage of current benchmarks, with approximately 60% focused on the software development phase in SDLC, while requirements engineering and software design phases receive minimal attention at only 5% and 3%, respectively. Additionally, Python emerges as the dominant programming language across the reviewed benchmarks. Finally, this paper highlights the challenges of current research and proposes future directions, aiming to narrow the gap between the theoretical capabilities of CodeLLMs and agents and their application in real-world scenarios.
△ Less
Submitted 8 May, 2025; v1 submitted 8 May, 2025;
originally announced May 2025.
-
PADriver: Towards Personalized Autonomous Driving
Authors:
Genghua Kou,
Fan Jia,
Weixin Mao,
Yingfei Liu,
Yucheng Zhao,
Ziheng Zhang,
Osamu Yoshie,
Tiancai Wang,
Ying Li,
Xiangyu Zhang
Abstract:
In this paper, we propose PADriver, a novel closed-loop framework for personalized autonomous driving (PAD). Built upon Multi-modal Large Language Model (MLLM), PADriver takes streaming frames and personalized textual prompts as inputs. It autoaggressively performs scene understanding, danger level estimation and action decision. The predicted danger level reflects the risk of the potential action…
▽ More
In this paper, we propose PADriver, a novel closed-loop framework for personalized autonomous driving (PAD). Built upon Multi-modal Large Language Model (MLLM), PADriver takes streaming frames and personalized textual prompts as inputs. It autoaggressively performs scene understanding, danger level estimation and action decision. The predicted danger level reflects the risk of the potential action and provides an explicit reference for the final action, which corresponds to the preset personalized prompt. Moreover, we construct a closed-loop benchmark named PAD-Highway based on Highway-Env simulator to comprehensively evaluate the decision performance under traffic rules. The dataset contains 250 hours videos with high-quality annotation to facilitate the development of PAD behavior analysis. Experimental results on the constructed benchmark show that PADriver outperforms state-of-the-art approaches on different evaluation metrics, and enables various driving modes.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
Efficient Parallel Ising Samplers via Localization Schemes
Authors:
Xiaoyu Chen,
Hongyang Liu,
Yitong Yin,
Xinyuan Zhang
Abstract:
We introduce efficient parallel algorithms for sampling from the Gibbs distribution and estimating the partition function of Ising models. These algorithms achieve parallel efficiency, with polylogarithmic depth and polynomial total work, and are applicable to Ising models in the following regimes: (1) Ferromagnetic Ising models with external fields; (2) Ising models with interaction matrix $J$ of…
▽ More
We introduce efficient parallel algorithms for sampling from the Gibbs distribution and estimating the partition function of Ising models. These algorithms achieve parallel efficiency, with polylogarithmic depth and polynomial total work, and are applicable to Ising models in the following regimes: (1) Ferromagnetic Ising models with external fields; (2) Ising models with interaction matrix $J$ of operator norm $\|J\|_2<1$.
Our parallel Gibbs sampling approaches are based on localization schemes, which have proven highly effective in establishing rapid mixing of Gibbs sampling. In this work, we employ two such localization schemes to obtain efficient parallel Ising samplers: the \emph{field dynamics} induced by \emph{negative-field localization}, and \emph{restricted Gaussian dynamics} induced by \emph{stochastic localization}. This shows that localization schemes are powerful tools, not only for achieving rapid mixing but also for the efficient parallelization of Gibbs sampling.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
Listen to Extract: Onset-Prompted Target Speaker Extraction
Authors:
Pengjie Shen,
Kangrui Chen,
Shulin He,
Pengru Chen,
Shuqi Yuan,
He Kong,
Xueliang Zhang,
Zhong-Qiu Wang
Abstract:
We propose $\textit{listen to extract}$ (LExt), a highly-effective while extremely-simple algorithm for monaural target speaker extraction (TSE). Given an enrollment utterance of a target speaker, LExt aims at extracting the target speaker from the speaker's mixed speech with other speakers. For each mixture, LExt concatenates an enrollment utterance of the target speaker to the mixture signal at…
▽ More
We propose $\textit{listen to extract}$ (LExt), a highly-effective while extremely-simple algorithm for monaural target speaker extraction (TSE). Given an enrollment utterance of a target speaker, LExt aims at extracting the target speaker from the speaker's mixed speech with other speakers. For each mixture, LExt concatenates an enrollment utterance of the target speaker to the mixture signal at the waveform level, and trains deep neural networks (DNN) to extract the target speech based on the concatenated mixture signal. The rationale is that, this way, an artificial speech onset is created for the target speaker and it could prompt the DNN (a) which speaker is the target to extract; and (b) spectral-temporal patterns of the target speaker that could help extraction. This simple approach produces strong TSE performance on multiple public TSE datasets including WSJ0-2mix, WHAM! and WHAMR!.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
Scalable Multi-Stage Influence Function for Large Language Models via Eigenvalue-Corrected Kronecker-Factored Parameterization
Authors:
Yuntai Bao,
Xuhong Zhang,
Tianyu Du,
Xinkui Zhao,
Jiang Zong,
Hao Peng,
Jianwei Yin
Abstract:
Pre-trained large language models (LLMs) are commonly fine-tuned to adapt to downstream tasks. Since the majority of knowledge is acquired during pre-training, attributing the predictions of fine-tuned LLMs to their pre-training data may provide valuable insights. Influence functions have been proposed as a means to explain model predictions based on training data. However, existing approaches fai…
▽ More
Pre-trained large language models (LLMs) are commonly fine-tuned to adapt to downstream tasks. Since the majority of knowledge is acquired during pre-training, attributing the predictions of fine-tuned LLMs to their pre-training data may provide valuable insights. Influence functions have been proposed as a means to explain model predictions based on training data. However, existing approaches fail to compute ``multi-stage'' influence and lack scalability to billion-scale LLMs.
In this paper, we propose the multi-stage influence function to attribute the downstream predictions of fine-tuned LLMs to pre-training data under the full-parameter fine-tuning paradigm. To enhance the efficiency and practicality of our multi-stage influence function, we leverage Eigenvalue-corrected Kronecker-Factored (EK-FAC) parameterization for efficient approximation. Empirical results validate the superior scalability of EK-FAC approximation and the effectiveness of our multi-stage influence function. Additionally, case studies on a real-world LLM, dolly-v2-3b, demonstrate its interpretive power, with exemplars illustrating insights provided by multi-stage influence estimates. Our code is public at https://github.com/colored-dye/multi_stage_influence_function.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
Learning Item Representations Directly from Multimodal Features for Effective Recommendation
Authors:
Xin Zhou,
Xiaoxiong Zhang,
Dusit Niyato,
Zhiqi Shen
Abstract:
Conventional multimodal recommender systems predominantly leverage Bayesian Personalized Ranking (BPR) optimization to learn item representations by amalgamating item identity (ID) embeddings with multimodal features. Nevertheless, our empirical and theoretical findings unequivocally demonstrate a pronounced optimization gradient bias in favor of acquiring representations from multimodal features…
▽ More
Conventional multimodal recommender systems predominantly leverage Bayesian Personalized Ranking (BPR) optimization to learn item representations by amalgamating item identity (ID) embeddings with multimodal features. Nevertheless, our empirical and theoretical findings unequivocally demonstrate a pronounced optimization gradient bias in favor of acquiring representations from multimodal features over item ID embeddings. As a consequence, item ID embeddings frequently exhibit suboptimal characteristics despite the convergence of multimodal feature parameters. Given the rich informational content inherent in multimodal features, in this paper, we propose a novel model (i.e., LIRDRec) that learns item representations directly from these features to augment recommendation performance. Recognizing that features derived from each modality may capture disparate yet correlated aspects of items, we propose a multimodal transformation mechanism, integrated with modality-specific encoders, to effectively fuse features from all modalities. Moreover, to differentiate the influence of diverse modality types, we devise a progressive weight copying fusion module within LIRDRec. This module incrementally learns the weight assigned to each modality in synthesizing the final user or item representations. Finally, we utilize the powerful visual understanding of Multimodal Large Language Models (MLLMs) to convert the item images into texts and extract semantics embeddings upon the texts via LLMs. Empirical evaluations conducted on five real-world datasets validate the superiority of our approach relative to competing baselines. It is worth noting the proposed model, equipped with embeddings extracted from MLLMs and LLMs, can further improve the recommendation accuracy of NDCG@20 by an average of 4.21% compared to the original embeddings.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
Authors:
Yunxin Li,
Zhenyu Liu,
Zitao Li,
Xuanyu Zhang,
Zhenran Xu,
Xinyu Chen,
Haoyuan Shi,
Shenyuan Jiang,
Xintong Wang,
Jifang Wang,
Shouzheng Huang,
Xinping Zhao,
Borui Jiang,
Lanqing Hong,
Longyue Wang,
Zhuotao Tian,
Baoxing Huai,
Wenhan Luo,
Weihua Luo,
Zheng Zhang,
Baotian Hu,
Min Zhang
Abstract:
Reasoning lies at the heart of intelligence, shaping the ability to make decisions, draw conclusions, and generalize across domains. In artificial intelligence, as systems increasingly operate in open, uncertain, and multimodal environments, reasoning becomes essential for enabling robust and adaptive behavior. Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integra…
▽ More
Reasoning lies at the heart of intelligence, shaping the ability to make decisions, draw conclusions, and generalize across domains. In artificial intelligence, as systems increasingly operate in open, uncertain, and multimodal environments, reasoning becomes essential for enabling robust and adaptive behavior. Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integrating modalities such as text, images, audio, and video to support complex reasoning capabilities and aiming to achieve comprehensive perception, precise understanding, and deep reasoning. As research advances, multimodal reasoning has rapidly evolved from modular, perception-driven pipelines to unified, language-centric frameworks that offer more coherent cross-modal understanding. While instruction tuning and reinforcement learning have improved model reasoning, significant challenges remain in omni-modal generalization, reasoning depth, and agentic behavior. To address these issues, we present a comprehensive and structured survey of multimodal reasoning research, organized around a four-stage developmental roadmap that reflects the field's shifting design philosophies and emerging capabilities. First, we review early efforts based on task-specific modules, where reasoning was implicitly embedded across stages of representation, alignment, and fusion. Next, we examine recent approaches that unify reasoning into multimodal LLMs, with advances such as Multimodal Chain-of-Thought (MCoT) and multimodal reinforcement learning enabling richer and more structured reasoning chains. Finally, drawing on empirical insights from challenging benchmarks and experimental cases of OpenAI O3 and O4-mini, we discuss the conceptual direction of native large multimodal reasoning models (N-LMRMs), which aim to support scalable, agentic, and adaptive reasoning and planning in complex, real-world environments.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
PR2: Peephole Raw Pointer Rewriting with LLMs for Translating C to Safer Rust
Authors:
Yifei Gao,
Chengpeng Wang,
Pengxiang Huang,
Xuwei Liu,
Mingwei Zheng,
Xiangyu Zhang
Abstract:
There has been a growing interest in translating C code to Rust due to Rust's robust memory and thread safety guarantees. Tools such as C2RUST enable syntax-guided transpilation from C to semantically equivalent Rust code. However, the resulting Rust programs often rely heavily on unsafe constructs--particularly raw pointers--which undermines Rust's safety guarantees. This paper aims to improve th…
▽ More
There has been a growing interest in translating C code to Rust due to Rust's robust memory and thread safety guarantees. Tools such as C2RUST enable syntax-guided transpilation from C to semantically equivalent Rust code. However, the resulting Rust programs often rely heavily on unsafe constructs--particularly raw pointers--which undermines Rust's safety guarantees. This paper aims to improve the memory safety of Rust programs generated by C2RUST by eliminating raw pointers. Specifically, we propose a peephole raw pointer rewriting technique that lifts raw pointers in individual functions to appropriate Rust data structures. Technically, PR2 employs decision-tree-based prompting to guide the pointer lifting process. Additionally, it leverages code change analysis to guide the repair of errors introduced during rewriting, effectively addressing errors encountered during compilation and test case execution. We implement PR2 as a prototype and evaluate it using gpt-4o-mini on 28 real-world C projects. The results show that PR2 successfully eliminates 13.22% of local raw pointers across these projects, significantly enhancing the safety of the translated Rust code. On average, PR2 completes the transformation of a project in 5.44 hours, at an average cost of $1.46.
△ Less
Submitted 9 May, 2025; v1 submitted 7 May, 2025;
originally announced May 2025.
-
Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers
Authors:
Divyansh Srivastava,
Xiang Zhang,
He Wen,
Chenru Wen,
Zhuowen Tu
Abstract:
We present Lay-Your-Scene (shorthand LayouSyn), a novel text-to-layout generation pipeline for natural scenes. Prior scene layout generation methods are either closed-vocabulary or use proprietary large language models for open-vocabulary generation, limiting their modeling capabilities and broader applicability in controllable image generation. In this work, we propose to use lightweight open-sou…
▽ More
We present Lay-Your-Scene (shorthand LayouSyn), a novel text-to-layout generation pipeline for natural scenes. Prior scene layout generation methods are either closed-vocabulary or use proprietary large language models for open-vocabulary generation, limiting their modeling capabilities and broader applicability in controllable image generation. In this work, we propose to use lightweight open-source language models to obtain scene elements from text prompts and a novel aspect-aware diffusion Transformer architecture trained in an open-vocabulary manner for conditional layout generation. Extensive experiments demonstrate that LayouSyn outperforms existing methods and achieves state-of-the-art performance on challenging spatial and numerical reasoning benchmarks. Additionally, we present two applications of LayouSyn. First, we show that coarse initialization from large language models can be seamlessly combined with our method to achieve better results. Second, we present a pipeline for adding objects to images, demonstrating the potential of LayouSyn in image editing applications.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
Person Recognition at Altitude and Range: Fusion of Face, Body Shape and Gait
Authors:
Feng Liu,
Nicholas Chimitt,
Lanqing Guo,
Jitesh Jain,
Aditya Kane,
Minchul Kim,
Wes Robbins,
Yiyang Su,
Dingqiang Ye,
Xingguang Zhang,
Jie Zhu,
Siddharth Satyakam,
Christopher Perry,
Stanley H. Chan,
Arun Ross,
Humphrey Shi,
Zhangyang Wang,
Anil Jain,
Xiaoming Liu
Abstract:
We address the problem of whole-body person recognition in unconstrained environments. This problem arises in surveillance scenarios such as those in the IARPA Biometric Recognition and Identification at Altitude and Range (BRIAR) program, where biometric data is captured at long standoff distances, elevated viewing angles, and under adverse atmospheric conditions (e.g., turbulence and high wind v…
▽ More
We address the problem of whole-body person recognition in unconstrained environments. This problem arises in surveillance scenarios such as those in the IARPA Biometric Recognition and Identification at Altitude and Range (BRIAR) program, where biometric data is captured at long standoff distances, elevated viewing angles, and under adverse atmospheric conditions (e.g., turbulence and high wind velocity). To this end, we propose FarSight, a unified end-to-end system for person recognition that integrates complementary biometric cues across face, gait, and body shape modalities. FarSight incorporates novel algorithms across four core modules: multi-subject detection and tracking, recognition-aware video restoration, modality-specific biometric feature encoding, and quality-guided multi-modal fusion. These components are designed to work cohesively under degraded image conditions, large pose and scale variations, and cross-domain gaps. Extensive experiments on the BRIAR dataset, one of the most comprehensive benchmarks for long-range, multi-modal biometric recognition, demonstrate the effectiveness of FarSight. Compared to our preliminary system, this system achieves a 34.1% absolute gain in 1:1 verification accuracy ([email protected]% FAR), a 17.8% increase in closed-set identification (Rank-20), and a 34.3% reduction in open-set identification errors (FNIR@1% FPIR). Furthermore, FarSight was evaluated in the 2025 NIST RTE Face in Video Evaluation (FIVE), which conducts standardized face recognition testing on the BRIAR dataset. These results establish FarSight as a state-of-the-art solution for operational biometric recognition in challenging real-world conditions.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
WDMamba: When Wavelet Degradation Prior Meets Vision Mamba for Image Dehazing
Authors:
Jie Sun,
Heng Liu,
Yongzhen Wang,
Xiao-Ping Zhang,
Mingqiang Wei
Abstract:
In this paper, we reveal a novel haze-specific wavelet degradation prior observed through wavelet transform analysis, which shows that haze-related information predominantly resides in low-frequency components. Exploiting this insight, we propose a novel dehazing framework, WDMamba, which decomposes the image dehazing task into two sequential stages: low-frequency restoration followed by detail en…
▽ More
In this paper, we reveal a novel haze-specific wavelet degradation prior observed through wavelet transform analysis, which shows that haze-related information predominantly resides in low-frequency components. Exploiting this insight, we propose a novel dehazing framework, WDMamba, which decomposes the image dehazing task into two sequential stages: low-frequency restoration followed by detail enhancement. This coarse-to-fine strategy enables WDMamba to effectively capture features specific to each stage of the dehazing process, resulting in high-quality restored images. Specifically, in the low-frequency restoration stage, we integrate Mamba blocks to reconstruct global structures with linear complexity, efficiently removing overall haze and producing a coarse restored image. Thereafter, the detail enhancement stage reinstates fine-grained information that may have been overlooked during the previous phase, culminating in the final dehazed output. Furthermore, to enhance detail retention and achieve more natural dehazing, we introduce a self-guided contrastive regularization during network training. By utilizing the coarse restored output as a hard negative example, our model learns more discriminative representations, substantially boosting the overall dehazing performance. Extensive evaluations on public dehazing benchmarks demonstrate that our method surpasses state-of-the-art approaches both qualitatively and quantitatively. Code is available at https://github.com/SunJ000/WDMamba.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
A Dataset and Toolkit for Multiparameter Cardiovascular Physiology Sensing on Rings
Authors:
Jiankai Tang,
Kegang Wang,
Yingke Ding,
Jiatong Ji,
Zeyu Wang,
Xiyuxing Zhang,
Ping Chen,
Yuanchun Shi,
Yuntao Wang
Abstract:
Smart rings offer a convenient way to continuously and unobtrusively monitor cardiovascular physiological signals. However, a gap remains between the ring hardware and reliable methods for estimating cardiovascular parameters, partly due to the lack of publicly available datasets and standardized analysis tools. In this work, we present $τ$-Ring, the first open-source ring-based dataset designed f…
▽ More
Smart rings offer a convenient way to continuously and unobtrusively monitor cardiovascular physiological signals. However, a gap remains between the ring hardware and reliable methods for estimating cardiovascular parameters, partly due to the lack of publicly available datasets and standardized analysis tools. In this work, we present $τ$-Ring, the first open-source ring-based dataset designed for cardiovascular physiological sensing. The dataset comprises photoplethysmography signals (infrared and red channels) and 3-axis accelerometer data collected from two rings (reflective and transmissive optical paths), with 28.21 hours of raw data from 34 subjects across seven activities. $τ$-Ring encompasses both stationary and motion scenarios, as well as stimulus-evoked abnormal physiological states, annotated with four ground-truth labels: heart rate, respiratory rate, oxygen saturation, and blood pressure. Using our proposed RingTool toolkit, we evaluated three widely-used physics-based methods and four cutting-edge deep learning approaches. Our results show superior performance compared to commercial rings, achieving best MAE values of 5.18 BPM for heart rate, 2.98 BPM for respiratory rate, 3.22\% for oxygen saturation, and 13.33/7.56 mmHg for systolic/diastolic blood pressure estimation. The open-sourced dataset and toolkit aim to foster further research and community-driven advances in ring-based cardiovascular health sensing.
△ Less
Submitted 8 May, 2025; v1 submitted 7 May, 2025;
originally announced May 2025.
-
A Framework to Prevent Biometric Data Leakage in the Immersive Technologies Domain
Authors:
Keshav Sood,
Iynkaran Natgunanathan,
Uthayasanker Thayasivam,
Vithurabiman Senthuran,
Xiaoning Zhang,
Shui Yu
Abstract:
Doubtlessly, the immersive technologies have potential to ease people's life and uplift economy, however the obvious data privacy risks cannot be ignored. For example, a participant wears a 3D headset device which detects participant's head motion to track the pose of participant's head to match the orientation of camera with participant's eyes positions in the real-world. In a preliminary study,…
▽ More
Doubtlessly, the immersive technologies have potential to ease people's life and uplift economy, however the obvious data privacy risks cannot be ignored. For example, a participant wears a 3D headset device which detects participant's head motion to track the pose of participant's head to match the orientation of camera with participant's eyes positions in the real-world. In a preliminary study, researchers have proved that the voice command features on such headsets could lead to major privacy leakages. By analyzing the facial dynamics captured with the motion sensors, the headsets suffer security vulnerabilities revealing a user's sensitive speech without user's consent. The psychography data (such as voice command features, facial dynamics, etc.) is sensitive data and it should not be leaked out of the device without users consent else it is a privacy breach. To the best of our literature review, the work done in this particular research problem is very limited. Motivated from this, we develop a simple technical framework to mitigate sensitive data (or biometric data) privacy leaks in immersive technology domain. The performance evaluation is conducted in a robust way using six data sets, to show that the proposed solution is effective and feasible to prevent this issue.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
RFNNS: Robust Fixed Neural Network Steganography with Popular Deep Generative Models
Authors:
Yu Cheng,
Jiuan Zhou,
Jiawei Chen,
Zhaoxia Yin,
Xinpeng Zhang
Abstract:
Image steganography is a technique that conceals secret information in a cover image to achieve covert communication. Recent research has demonstrated that Fixed Neural Network Steganography (FNNS) exhibits significant practical advantages, as it enables stable and efficient steganographic embedding and extraction without requiring neural network training. However, the stego image generated by exi…
▽ More
Image steganography is a technique that conceals secret information in a cover image to achieve covert communication. Recent research has demonstrated that Fixed Neural Network Steganography (FNNS) exhibits significant practical advantages, as it enables stable and efficient steganographic embedding and extraction without requiring neural network training. However, the stego image generated by existing FNNS methods suffers from considerable distortion and exhibits poor robustness, severely reducing the security and practicality of steganography. To address the aforementioned issues, we propose a Robust Fixed Neural Network Steganography (RFNNS). In RFNNS, we introduce a texture-aware localization technique to add perturbations carrying secret image information to complex texture areas that are less perceptible to the human eye, thereby ensuring the quality of the stego image. To enhance robustness, a robust steganographic perturbation generation (RSPG) strategy is designed, which enables slight perturbations to be accurately decoded even after common image attacks. Subsequently, the generated robust perturbations are combined with the AI-generated cover image to produce the stego image. The receiver only needs to share the secret key and employ the same decoding network structure to accurately extract the secret image from the attacked stego image. Experimental results demonstrate that RFNNS achieves enhanced performance in terms of security, including imperceptibility and anti-steganalysis performance. Furthermore, RFNNS demonstrates superior robustness against common image attacks, such as JPEG compression, Gaussian noise, and contrast adjustment, across diverse embedding capacities, outperforming existing SOTA FNNS methods.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment
Authors:
Xueyao Zhang,
Yuancheng Wang,
Chaoren Wang,
Ziniu Li,
Zhuo Chen,
Zhizheng Wu
Abstract:
Modern zero-shot text-to-speech (TTS) systems, despite using extensive pre-training, often struggle in challenging scenarios such as tongue twisters, repeated words, code-switching, and cross-lingual synthesis, leading to intelligibility issues. To address these limitations, this paper leverages preference alignment techniques, which enable targeted construction of out-of-pretraining-distribution…
▽ More
Modern zero-shot text-to-speech (TTS) systems, despite using extensive pre-training, often struggle in challenging scenarios such as tongue twisters, repeated words, code-switching, and cross-lingual synthesis, leading to intelligibility issues. To address these limitations, this paper leverages preference alignment techniques, which enable targeted construction of out-of-pretraining-distribution data to enhance performance. We introduce a new dataset, named the Intelligibility Preference Speech Dataset (INTP), and extend the Direct Preference Optimization (DPO) framework to accommodate diverse TTS architectures. After INTP alignment, in addition to intelligibility, we observe overall improvements including naturalness, similarity, and audio quality for multiple TTS models across diverse domains. Based on that, we also verify the weak-to-strong generalization ability of INTP for more intelligible models such as CosyVoice 2 and Ints. Moreover, we showcase the potential for further improvements through iterative alignment based on Ints. Audio samples are available at https://intalign.github.io/.
△ Less
Submitted 7 May, 2025;
originally announced May 2025.
-
Shadow Wireless Intelligence: Large Language Model-Driven Reasoning in Covert Communications
Authors:
Yuanai Xie,
Zhaozhi Liu,
Xiao Zhang,
Shihua Zhang,
Rui Hou,
Minrui Xu,
Ruichen Zhang,
Dusit Niyato
Abstract:
Covert Communications (CC) can secure sensitive transmissions in industrial, military, and mission-critical applications within 6G wireless networks. However, traditional optimization methods based on Artificial Noise (AN), power control, and channel manipulation might not adapt to dynamic and adversarial environments due to the high dimensionality, nonlinearity, and stringent real-time covertness…
▽ More
Covert Communications (CC) can secure sensitive transmissions in industrial, military, and mission-critical applications within 6G wireless networks. However, traditional optimization methods based on Artificial Noise (AN), power control, and channel manipulation might not adapt to dynamic and adversarial environments due to the high dimensionality, nonlinearity, and stringent real-time covertness requirements. To bridge this gap, we introduce Shadow Wireless Intelligence (SWI), which integrates the reasoning capabilities of Large Language Models (LLMs) with retrieval-augmented generation to enable intelligent decision-making in covert wireless systems. Specifically, we utilize DeepSeek-R1, a mixture-of-experts-based LLM with RL-enhanced reasoning, combined with real-time retrieval of domain-specific knowledge to improve context accuracy and mitigate hallucinations. Our approach develops a structured CC knowledge base, supports context-aware retrieval, and performs semantic optimization, allowing LLMs to generate and adapt CC strategies in real time. In a case study on optimizing AN power in a full-duplex CC scenario, DeepSeek-R1 achieves 85% symbolic derivation accuracy and 94% correctness in the generation of simulation code, outperforming baseline models. These results validate SWI as a robust, interpretable, and adaptive foundation for LLM-driven intelligent covert wireless systems in 6G networks.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.
-
Novel Extraction of Discriminative Fine-Grained Feature to Improve Retinal Vessel Segmentation
Authors:
Shuang Zeng,
Chee Hong Lee,
Micky C Nnamdi,
Wenqi Shi,
J Ben Tamo,
Lei Zhu,
Hangzhou He,
Xinliang Zhang,
Qian Chen,
May D. Wang,
Yanye Lu,
Qiushi Ren
Abstract:
Retinal vessel segmentation is a vital early detection method for several severe ocular diseases. Despite significant progress in retinal vessel segmentation with the advancement of Neural Networks, there are still challenges to overcome. Specifically, retinal vessel segmentation aims to predict the class label for every pixel within a fundus image, with a primary focus on intra-image discriminati…
▽ More
Retinal vessel segmentation is a vital early detection method for several severe ocular diseases. Despite significant progress in retinal vessel segmentation with the advancement of Neural Networks, there are still challenges to overcome. Specifically, retinal vessel segmentation aims to predict the class label for every pixel within a fundus image, with a primary focus on intra-image discrimination, making it vital for models to extract more discriminative features. Nevertheless, existing methods primarily focus on minimizing the difference between the output from the decoder and the label, but ignore fully using feature-level fine-grained representations from the encoder. To address these issues, we propose a novel Attention U-shaped Kolmogorov-Arnold Network named AttUKAN along with a novel Label-guided Pixel-wise Contrastive Loss for retinal vessel segmentation. Specifically, we implement Attention Gates into Kolmogorov-Arnold Networks to enhance model sensitivity by suppressing irrelevant feature activations and model interpretability by non-linear modeling of KAN blocks. Additionally, we also design a novel Label-guided Pixel-wise Contrastive Loss to supervise our proposed AttUKAN to extract more discriminative features by distinguishing between foreground vessel-pixel pairs and background pairs. Experiments are conducted across four public datasets including DRIVE, STARE, CHASE_DB1, HRF and our private dataset. AttUKAN achieves F1 scores of 82.50%, 81.14%, 81.34%, 80.21% and 80.09%, along with MIoU scores of 70.24%, 68.64%, 68.59%, 67.21% and 66.94% in the above datasets, which are the highest compared to 11 networks for retinal vessel segmentation. Quantitative and qualitative results show that our AttUKAN achieves state-of-the-art performance and outperforms existing retinal vessel segmentation methods. Our code will be available at https://github.com/stevezs315/AttUKAN.
△ Less
Submitted 6 May, 2025;
originally announced May 2025.