Search | arXiv e-print repository

A Deterministic Partition Tree and Applications

Abstract: In this paper, we present a deterministic variant of Chan's randomized partition tree [Discret. Comput. Geom., 2012]. This result leads to numerous applications. In particular, for $d$-dimensional simplex range counting (for any constant $d \ge 2$), we construct a data structure using $O(n)$ space and $O(n^{1+ε})$ preprocessing time, such that each query can be answered in $o(n^{1-1/d})$ time (spe… ▽ More In this paper, we present a deterministic variant of Chan's randomized partition tree [Discret. Comput. Geom., 2012]. This result leads to numerous applications. In particular, for $d$-dimensional simplex range counting (for any constant $d \ge 2$), we construct a data structure using $O(n)$ space and $O(n^{1+ε})$ preprocessing time, such that each query can be answered in $o(n^{1-1/d})$ time (specifically, $O(n^{1-1/d} / \log^{Ω(1)} n)$ time), thereby breaking an $Ω(n^{1-1/d})$ lower bound known for the semigroup setting. Notably, our approach does not rely on any bit-packing techniques. We also obtain deterministic improvements for several other classical problems, including simplex range stabbing counting and reporting, segment intersection detection, counting and reporting, ray-shooting among segments, and more. Similar to Chan's original randomized partition tree, we expect that additional applications will emerge in the future, especially in situations where deterministic results are preferred. △ Less

Submitted 2 July, 2025; originally announced July 2025.

Comments: To appear in ESA 2025

arXiv:2507.01773 [pdf, ps, other]

Frontiers of Generative AI for Network Optimization: Theories, Limits, and Visions

Authors: Bo Yang, Ruihuai Liang, Weixin Li, Han Wang, Xuelin Cao, Zhiwen Yu, Samson Lasaulce, Mérouane Debbah, Mohamed-Slim Alouini, H. Vincent Poor, Chau Yuen

Abstract: While interest in the application of generative AI (GenAI) in network optimization has surged in recent years, its rapid progress has often overshadowed critical limitations intrinsic to generative models that remain insufficiently examined in existing literature. This survey provides a comprehensive review and critical analysis of GenAI in network optimization. We focus on the two dominant paradi… ▽ More While interest in the application of generative AI (GenAI) in network optimization has surged in recent years, its rapid progress has often overshadowed critical limitations intrinsic to generative models that remain insufficiently examined in existing literature. This survey provides a comprehensive review and critical analysis of GenAI in network optimization. We focus on the two dominant paradigms of GenAI including generative diffusion models (GDMs) and large pre-trained models (LPTMs), and organize our discussion around a categorization we introduce, dividing network optimization problems into two primary formulations: one-shot optimization and Markov decision process (MDP). We first trace key works, including foundational contributions from the AI community, and categorize current efforts in network optimization. We also review frontier applications of GDMs and LPTMs in other networking tasks, providing additional context. Furthermore, we present theoretical generalization bounds for GDMs in both one-shot and MDP settings, offering insights into the fundamental factors affecting model performance. Most importantly, we reflect on the overestimated perception of GenAI's general capabilities and caution against the all-in-one illusion it may convey. We highlight critical limitations, including difficulties in constraint satisfying, limited concept understanding, and the inherent probabilistic nature of outputs. We also propose key future directions, such as bridging the gap between generation and optimization. Although they are increasingly integrated in implementations, they differ fundamentally in both objectives and underlying mechanisms, necessitating a deeper understanding of their theoretical connections. Ultimately, this survey aims to provide a structured overview and a deeper insight into the strengths, limitations, and potential of GenAI in network optimization. △ Less

Submitted 2 July, 2025; originally announced July 2025.

arXiv:2507.01694 [pdf, ps, other]

Graph Representation-based Model Poisoning on Federated LLMs in CyberEdge Networks

Authors: Hanlin Cai, Haofan Dong, Houtianfu Wang, Kai Li, Ozgur B. Akan

Abstract: Federated large language models (FedLLMs) provide powerful generative capabilities in CyberEdge networks while protecting data privacy. However, FedLLMs remains highly vulnerable to model poisoning attacks. This article first reviews recent model poisoning techniques and existing defense mechanisms for FedLLMs, highlighting critical limitations, particularly under non-IID text distributions. In pa… ▽ More Federated large language models (FedLLMs) provide powerful generative capabilities in CyberEdge networks while protecting data privacy. However, FedLLMs remains highly vulnerable to model poisoning attacks. This article first reviews recent model poisoning techniques and existing defense mechanisms for FedLLMs, highlighting critical limitations, particularly under non-IID text distributions. In particular, current defenses primarily utilize distance-based outlier detection or norm constraints, operating under the assumption that adversarial updates significantly diverge from benign statistics. This assumption can fail when facing adaptive attackers targeting billionparameter LLMs. Next, this article investigates emerging Graph Representation-Based Model Poisoning (GRMP), a novel attack paradigm that leverages higher-order correlations among honest client gradients to synthesize malicious updates indistinguishable from legitimate model updates. GRMP can effectively evade advanced defenses, resulting in substantial accuracy loss and performance degradation. Moreover, this article outlines a research roadmap emphasizing the importance of graph-aware secure aggregation methods, FedLLMs-specific vulnerability metrics, and evaluation frameworks to strengthen the robustness of future federated language model deployments. △ Less

Submitted 2 July, 2025; originally announced July 2025.

Comments: 7 pages, 5 figures

arXiv:2507.01663 [pdf, ps, other]

AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training

Authors: Zhenyu Han, Ansheng You, Haibo Wang, Kui Luo, Guang Yang, Wenqi Shi, Menglong Chen, Sicheng Zhang, Zeshun Lan, Chunshi Deng, Huazhong Ji, Wenjie Liu, Yu Huang, Yixiang Zhang, Chenyi Pan, Jing Wang, Xin Huang, Chunsheng Li, Jianping Wu

Abstract: Reinforcement learning (RL) has become a pivotal technology in the post-training phase of large language models (LLMs). Traditional task-colocated RL frameworks suffer from significant scalability bottlenecks, while task-separated RL frameworks face challenges in complex dataflows and the corresponding resource idling and workload imbalance. Moreover, most existing frameworks are tightly coupled w… ▽ More Reinforcement learning (RL) has become a pivotal technology in the post-training phase of large language models (LLMs). Traditional task-colocated RL frameworks suffer from significant scalability bottlenecks, while task-separated RL frameworks face challenges in complex dataflows and the corresponding resource idling and workload imbalance. Moreover, most existing frameworks are tightly coupled with LLM training or inference engines, making it difficult to support custom-designed engines. To address these challenges, we propose AsyncFlow, an asynchronous streaming RL framework for efficient post-training. Specifically, we introduce a distributed data storage and transfer module that provides a unified data management and fine-grained scheduling capability in a fully streamed manner. This architecture inherently facilitates automated pipeline overlapping among RL tasks and dynamic load balancing. Moreover, we propose a producer-consumer-based asynchronous workflow engineered to minimize computational idleness by strategically deferring parameter update process within staleness thresholds. Finally, the core capability of AsynFlow is architecturally decoupled from underlying training and inference engines and encapsulated by service-oriented user interfaces, offering a modular and customizable user experience. Extensive experiments demonstrate an average of 1.59 throughput improvement compared with state-of-the-art baseline. The presented architecture in this work provides actionable insights for next-generation RL training system designs. △ Less

Submitted 2 July, 2025; originally announced July 2025.

arXiv:2507.01586 [pdf, ps, other]

SketchColour: Channel Concat Guided DiT-based Sketch-to-Colour Pipeline for 2D Animation

Authors: Bryan Constantine Sadihin, Michael Hua Wang, Shei Pern Chua, Hang Su

Abstract: The production of high-quality 2D animation is highly labor-intensive process, as animators are currently required to draw and color a large number of frames by hand. We present SketchColour, the first sketch-to-colour pipeline for 2D animation built on a diffusion transformer (DiT) backbone. By replacing the conventional U-Net denoiser with a DiT-style architecture and injecting sketch informatio… ▽ More The production of high-quality 2D animation is highly labor-intensive process, as animators are currently required to draw and color a large number of frames by hand. We present SketchColour, the first sketch-to-colour pipeline for 2D animation built on a diffusion transformer (DiT) backbone. By replacing the conventional U-Net denoiser with a DiT-style architecture and injecting sketch information via lightweight channel-concatenation adapters accompanied with LoRA finetuning, our method natively integrates conditioning without the parameter and memory bloat of a duplicated ControlNet, greatly reducing parameter count and GPU memory usage. Evaluated on the SAKUGA dataset, SketchColour outperforms previous state-of-the-art video colourization methods across all metrics, despite using only half the training data of competing models. Our approach produces temporally coherent animations with minimal artifacts such as colour bleeding or object deformation. Our code is available at: https://bconstantine.github.io/SketchColour . △ Less

Submitted 2 July, 2025; originally announced July 2025.

Comments: Project page and code: https://bconstantine.github.io/SketchColour

arXiv:2507.01573 [pdf, ps, other]

A Gift from the Integration of Discriminative and Diffusion-based Generative Learning: Boundary Refinement Remote Sensing Semantic Segmentation

Authors: Hao Wang, Keyan Hu, Xin Guo, Haifeng Li, Chao Tao

Abstract: Remote sensing semantic segmentation must address both what the ground objects are within an image and where they are located. Consequently, segmentation models must ensure not only the semantic correctness of large-scale patches (low-frequency information) but also the precise localization of boundaries between patches (high-frequency information). However, most existing approaches rely heavily o… ▽ More Remote sensing semantic segmentation must address both what the ground objects are within an image and where they are located. Consequently, segmentation models must ensure not only the semantic correctness of large-scale patches (low-frequency information) but also the precise localization of boundaries between patches (high-frequency information). However, most existing approaches rely heavily on discriminative learning, which excels at capturing low-frequency features, while overlooking its inherent limitations in learning high-frequency features for semantic segmentation. Recent studies have revealed that diffusion generative models excel at generating high-frequency details. Our theoretical analysis confirms that the diffusion denoising process significantly enhances the model's ability to learn high-frequency features; however, we also observe that these models exhibit insufficient semantic inference for low-frequency features when guided solely by the original image. Therefore, we integrate the strengths of both discriminative and generative learning, proposing the Integration of Discriminative and diffusion-based Generative learning for Boundary Refinement (IDGBR) framework. The framework first generates a coarse segmentation map using a discriminative backbone model. This map and the original image are fed into a conditioning guidance network to jointly learn a guidance representation subsequently leveraged by an iterative denoising diffusion process refining the coarse segmentation. Extensive experiments across five remote sensing semantic segmentation datasets (binary and multi-class segmentation) confirm our framework's capability of consistent boundary refinement for coarse results from diverse discriminative architectures. The source code will be available at https://github.com/KeyanHu-git/IDGBR. △ Less

Submitted 2 July, 2025; originally announced July 2025.

Comments: 20 pages, 14 figures

arXiv:2507.01561 [pdf, ps, other]

Self-Closing Suction Grippers for Industrial Grasping via Form-Flexible Design

Authors: Huijiang Wang, Holger Kunz, Timon Adler, Fumiya Iida

Abstract: Shape-morphing robots have shown benefits in industrial grasping. We propose form-flexible grippers for adaptive grasping. The design is based on the hybrid jamming and suction mechanism, which deforms to handle objects that vary significantly in size from the aperture, including both larger and smaller parts. Compared with traditional grippers, the gripper achieves self-closing to form an airtigh… ▽ More Shape-morphing robots have shown benefits in industrial grasping. We propose form-flexible grippers for adaptive grasping. The design is based on the hybrid jamming and suction mechanism, which deforms to handle objects that vary significantly in size from the aperture, including both larger and smaller parts. Compared with traditional grippers, the gripper achieves self-closing to form an airtight seal. Under a vacuum, a wide range of grasping is realized through the passive morphing mechanism at the interface that harmonizes pressure and flow rate. This hybrid gripper showcases the capability to securely grasp an egg, as small as 54.5% of its aperture, while achieving a maximum load-to-mass ratio of 94.3. △ Less

Submitted 2 July, 2025; originally announced July 2025.

Comments: This manuscript has been submitted for potential consideration at IEEE publication venues

arXiv:2507.01378 [pdf, ps, other]

RALLY: Role-Adaptive LLM-Driven Yoked Navigation for Agentic UAV Swarms

Authors: Ziyao Wang, Rongpeng Li, Sizhao Li, Yuming Xiang, Haiping Wang, Zhifeng Zhao, Honggang Zhang

Abstract: Intelligent control of Unmanned Aerial Vehicles (UAVs) swarms has emerged as a critical research focus, and it typically requires the swarm to navigate effectively while avoiding obstacles and achieving continuous coverage over multiple mission targets. Although traditional Multi-Agent Reinforcement Learning (MARL) approaches offer dynamic adaptability, they are hindered by the semantic gap in num… ▽ More Intelligent control of Unmanned Aerial Vehicles (UAVs) swarms has emerged as a critical research focus, and it typically requires the swarm to navigate effectively while avoiding obstacles and achieving continuous coverage over multiple mission targets. Although traditional Multi-Agent Reinforcement Learning (MARL) approaches offer dynamic adaptability, they are hindered by the semantic gap in numerical communication and the rigidity of homogeneous role structures, resulting in poor generalization and limited task scalability. Recent advances in Large Language Model (LLM)-based control frameworks demonstrate strong semantic reasoning capabilities by leveraging extensive prior knowledge. However, due to the lack of online learning and over-reliance on static priors, these works often struggle with effective exploration, leading to reduced individual potential and overall system performance. To address these limitations, we propose a Role-Adaptive LLM-Driven Yoked navigation algorithm RALLY. Specifically, we first develop an LLM-driven semantic decision framework that uses structured natural language for efficient semantic communication and collaborative reasoning. Afterward, we introduce a dynamic role-heterogeneity mechanism for adaptive role switching and personalized decision-making. Furthermore, we propose a Role-value Mixing Network (RMIX)-based assignment strategy that integrates LLM offline priors with MARL online policies to enable semi-offline training of role selection strategies. Experiments in the Multi-Agent Particle Environment (MPE) environment and a Software-In-The-Loop (SITL) platform demonstrate that RALLY outperforms conventional approaches in terms of task coverage, convergence speed, and generalization, highlighting its strong potential for collaborative navigation in agentic multi-UAV systems. △ Less

Submitted 2 July, 2025; originally announced July 2025.

arXiv:2507.01216 [pdf, ps, other]

PAE MobiLLM: Privacy-Aware and Efficient LLM Fine-Tuning on the Mobile Device via Additive Side-Tuning

Authors: Xingke Yang, Liang Li, Zhiyi Wan, Sicong Li, Hao Wang, Xiaoqi Qi, Jiang Liu, Tomoaki Ohtsuki, Xin Fu, Miao Pan

Abstract: There is a huge gap between numerous intriguing applications fostered by on-device large language model (LLM) fine-tuning (FT) from fresh mobile data and the limited resources of a mobile device. While existing server-assisted methods (e.g., split learning or side-tuning) may enable LLM FT on the local mobile device, they suffer from heavy communication burdens of activation transmissions, and may… ▽ More There is a huge gap between numerous intriguing applications fostered by on-device large language model (LLM) fine-tuning (FT) from fresh mobile data and the limited resources of a mobile device. While existing server-assisted methods (e.g., split learning or side-tuning) may enable LLM FT on the local mobile device, they suffer from heavy communication burdens of activation transmissions, and may disclose data, labels or fine-tuned models to the server. To address those issues, we develop PAE MobiLLM, a privacy-aware and efficient LLM FT method which can be deployed on the mobile device via server-assisted additive side-tuning. To further accelerate FT convergence and improve computing efficiency, PAE MobiLLM integrates activation caching on the server side, which allows the server to reuse historical activations and saves the mobile device from repeatedly computing forward passes for the recurring data samples. Besides, to reduce communication cost, PAE MobiLLM develops a one-token (i.e., ``pivot'' token) activation shortcut that transmits only a single activation dimension instead of full activation matrices to guide the side network tuning. Last but not least, PAE MobiLLM introduces the additive adapter side-network design which makes the server train the adapter modules based on device-defined prediction differences rather than raw ground-truth labels. In this way, the server can only assist device-defined side-network computing, and learn nothing about data, labels or fine-tuned models. △ Less

Submitted 1 July, 2025; originally announced July 2025.

arXiv:2507.01017 [pdf, ps, other]

A Comprehensive Review of Human Error in Risk-Informed Decision Making: Integrating Human Reliability Assessment, Artificial Intelligence, and Human Performance Models

Authors: Xingyu Xiao, Hongxu Zhu, Jingang Liang, Jiejuan Tong, Haitao Wang

Abstract: Human error remains a dominant risk driver in safety-critical sectors such as nuclear power, aviation, and healthcare, where seemingly minor mistakes can cascade into catastrophic outcomes. Although decades of research have produced a rich repertoire of mitigation techniques, persistent limitations: scarce high-quality data, algorithmic opacity, and residual reliance on expert judgment, continue t… ▽ More Human error remains a dominant risk driver in safety-critical sectors such as nuclear power, aviation, and healthcare, where seemingly minor mistakes can cascade into catastrophic outcomes. Although decades of research have produced a rich repertoire of mitigation techniques, persistent limitations: scarce high-quality data, algorithmic opacity, and residual reliance on expert judgment, continue to constrain progress. This review synthesizes recent advances at the intersection of risk-informed decision making, human reliability assessment (HRA), artificial intelligence (AI), and cognitive science to clarify how their convergence can curb human-error risk. We first categorize the principal forms of human error observed in complex sociotechnical environments and outline their quantitative impact on system reliability. Next, we examine risk-informed frameworks that embed HRA within probabilistic and data-driven methodologies, highlighting successes and gaps. We then survey cognitive and human-performance models, detailing how mechanistic accounts of perception, memory, and decision-making enrich error prediction and complement HRA metrics. Building on these foundations, we critically assess AI-enabled techniques for real-time error detection, operator-state estimation, and AI-augmented HRA workflows. Across these strands, a recurring insight emerges: integrating cognitive models with AI-based analytics inside risk-informed HRA pipelines markedly enhances predictive fidelity, yet doing so demands richer datasets, transparent algorithms, and rigorous validation. Finally, we identify promising research directions, coupling resilience engineering concepts with grounded theory, operationalizing the iceberg model of incident causation, and establishing cross-domain data consortia, to foster a multidisciplinary paradigm that elevates human reliability in high-stakes systems. △ Less

Submitted 10 June, 2025; originally announced July 2025.

arXiv:2507.00915 [pdf, ps, other]

MichelangeRoll: Sculpting Rational Distributions Exactly and Efficiently

Authors: Jui-Hsiang Shao, Hsin-Po Wang

Abstract: Simulating an arbitrary discrete distribution $D \in [0, 1]^n$ using fair coin tosses incurs trade-offs between entropy complexity and space and time complexity. Shannon's theory suggests that $H(D)$ tosses are necessary and sufficient, but does not guarantee exact distribution. Knuth and Yao showed that a decision tree consumes fewer than $H(D) + 2$ tosses for one exact sample. Drapper and Saad's… ▽ More Simulating an arbitrary discrete distribution $D \in [0, 1]^n$ using fair coin tosses incurs trade-offs between entropy complexity and space and time complexity. Shannon's theory suggests that $H(D)$ tosses are necessary and sufficient, but does not guarantee exact distribution. Knuth and Yao showed that a decision tree consumes fewer than $H(D) + 2$ tosses for one exact sample. Drapper and Saad's recent work addresses the space and time aspect, showing that $H(D) + 2$ tosses, $O(n \log(n) \log(m))$ memory, and $O(H(D))$ operations are all it costs, where $m$ is the common denominator of the probability masses in $D$ and $n$ is the number of possible outcomes. In this paper, MichelangeRoll recycles leftover entropy to break the "$+2$" barrier. With $O((n + 1/\varepsilon) \log(m/\varepsilon))$ memory, the entropy cost of generating a ongoing sequence of $D$ is reduced to $H(D) + \varepsilon$ per sample. △ Less

Submitted 1 July, 2025; originally announced July 2025.

Comments: 13 pages, 7 figures, RANDOM says no so here

arXiv:2507.00790 [pdf, ps, other]

LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling

Authors: Huaqiu Li, Yong Wang, Tongwen Huang, Hailang Huang, Haoqian Wang, Xiangxiang Chu

Abstract: Unified image restoration is a significantly challenging task in low-level vision. Existing methods either make tailored designs for specific tasks, limiting their generalizability across various types of degradation, or rely on training with paired datasets, thereby suffering from closed-set constraints. To address these issues, we propose a novel, dataset-free, and unified approach through recur… ▽ More Unified image restoration is a significantly challenging task in low-level vision. Existing methods either make tailored designs for specific tasks, limiting their generalizability across various types of degradation, or rely on training with paired datasets, thereby suffering from closed-set constraints. To address these issues, we propose a novel, dataset-free, and unified approach through recurrent posterior sampling utilizing a pretrained latent diffusion model. Our method incorporates the multimodal understanding model to provide sematic priors for the generative model under a task-blind condition. Furthermore, it utilizes a lightweight module to align the degraded input with the generated preference of the diffusion model, and employs recurrent refinement for posterior sampling. Extensive experiments demonstrate that our method outperforms state-of-the-art methods, validating its effectiveness and robustness. Our code and data will be available at https://github.com/AMAP-ML/LD-RPS. △ Less

Submitted 1 July, 2025; originally announced July 2025.

arXiv:2507.00066 [pdf, other]

InSight-R: A Framework for Risk-informed Human Failure Event Identification and Interface-Induced Risk Assessment Driven by AutoGraph

Authors: Xingyu Xiao, Jiejuan Tong, Peng Chen, Jun Sun, Zhe Sui, Jingang Liang, Hongru Zhao, Jun Zhao, Haitao Wang

Abstract: Human reliability remains a critical concern in safety-critical domains such as nuclear power, where operational failures are often linked to human error. While conventional human reliability analysis (HRA) methods have been widely adopted, they rely heavily on expert judgment for identifying human failure events (HFEs) and assigning performance influencing factors (PIFs). This reliance introduces… ▽ More Human reliability remains a critical concern in safety-critical domains such as nuclear power, where operational failures are often linked to human error. While conventional human reliability analysis (HRA) methods have been widely adopted, they rely heavily on expert judgment for identifying human failure events (HFEs) and assigning performance influencing factors (PIFs). This reliance introduces challenges related to reproducibility, subjectivity, and limited integration of interface-level data. In particular, current approaches lack the capacity to rigorously assess how human-machine interface design contributes to operator performance variability and error susceptibility. To address these limitations, this study proposes a framework for risk-informed human failure event identification and interface-induced risk assessment driven by AutoGraph (InSight-R). By linking empirical behavioral data to the interface-embedded knowledge graph (IE-KG) constructed by the automated graph-based execution framework (AutoGraph), the InSight-R framework enables automated HFE identification based on both error-prone and time-deviated operational paths. Furthermore, we discuss the relationship between designer-user conflicts and human error. The results demonstrate that InSight-R not only enhances the objectivity and interpretability of HFE identification but also provides a scalable pathway toward dynamic, real-time human reliability assessment in digitalized control environments. This framework offers actionable insights for interface design optimization and contributes to the advancement of mechanism-driven HRA methodologies. △ Less

Submitted 27 June, 2025; originally announced July 2025.

arXiv:2506.24123 [pdf, ps, other]

Calligrapher: Freestyle Text Image Customization

Authors: Yue Ma, Qingyan Bai, Hao Ouyang, Ka Leong Cheng, Qiuyu Wang, Hongyu Liu, Zichen Liu, Haofan Wang, Jingye Chen, Yujun Shen, Qifeng Chen

Abstract: We introduce Calligrapher, a novel diffusion-based framework that innovatively integrates advanced text customization with artistic typography for digital calligraphy and design applications. Addressing the challenges of precise style control and data dependency in typographic customization, our framework incorporates three key technical contributions. First, we develop a self-distillation mechani… ▽ More We introduce Calligrapher, a novel diffusion-based framework that innovatively integrates advanced text customization with artistic typography for digital calligraphy and design applications. Addressing the challenges of precise style control and data dependency in typographic customization, our framework incorporates three key technical contributions. First, we develop a self-distillation mechanism that leverages the pre-trained text-to-image generative model itself alongside the large language model to automatically construct a style-centric typography benchmark. Second, we introduce a localized style injection framework via a trainable style encoder, which comprises both Qformer and linear layers, to extract robust style features from reference images. An in-context generation mechanism is also employed to directly embed reference images into the denoising process, further enhancing the refined alignment of target styles. Extensive quantitative and qualitative evaluations across diverse fonts and design contexts confirm Calligrapher's accurate reproduction of intricate stylistic details and precise glyph positioning. By automating high-quality, visually consistent typography, Calligrapher surpasses traditional models, empowering creative practitioners in digital art, branding, and contextual typographic design. △ Less

Submitted 30 June, 2025; originally announced June 2025.

Comments: Project page: https://calligrapher2025.github.io/Calligrapher Code: https://github.com/Calligrapher2025/Calligrapher

arXiv:2506.24005 [pdf, ps, other]

Provably Efficient and Agile Randomized Q-Learning

Authors: He Wang, Xingyu Xu, Yuejie Chi

Abstract: While Bayesian-based exploration often demonstrates superior empirical performance compared to bonus-based methods in model-based reinforcement learning (RL), its theoretical understanding remains limited for model-free settings. Existing provable algorithms either suffer from computational intractability or rely on stage-wise policy updates which reduce responsiveness and slow down the learning p… ▽ More While Bayesian-based exploration often demonstrates superior empirical performance compared to bonus-based methods in model-based reinforcement learning (RL), its theoretical understanding remains limited for model-free settings. Existing provable algorithms either suffer from computational intractability or rely on stage-wise policy updates which reduce responsiveness and slow down the learning process. In this paper, we propose a novel variant of Q-learning algorithm, refereed to as RandomizedQ, which integrates sampling-based exploration with agile, step-wise, policy updates, for episodic tabular RL. We establish an $\widetilde{O}(\sqrt{H^5SAT})$ regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the episode length, and $T$ is the total number of episodes. In addition, we present a logarithmic regret bound under a mild positive sub-optimality condition on the optimal Q-function. Empirically, RandomizedQ exhibits outstanding performance compared to existing Q-learning variants with both bonus-based and Bayesian-based exploration on standard benchmarks. △ Less

Submitted 30 June, 2025; originally announced June 2025.

arXiv:2506.23986 [pdf, ps, other]

StreamFlow: Streaming Flow Matching with Block-wise Guided Attention Mask for Speech Token Decoding

Authors: Dake Guo, Jixun Yao, Linhan Ma, He Wang, Lei Xie

Abstract: Recent advancements in discrete token-based speech generation have highlighted the importance of token-to-waveform generation for audio quality, particularly in real-time interactions. Traditional frameworks integrating semantic tokens with flow matching (FM) struggle with streaming capabilities due to their reliance on a global receptive field. Additionally, directly implementing token-by-token s… ▽ More Recent advancements in discrete token-based speech generation have highlighted the importance of token-to-waveform generation for audio quality, particularly in real-time interactions. Traditional frameworks integrating semantic tokens with flow matching (FM) struggle with streaming capabilities due to their reliance on a global receptive field. Additionally, directly implementing token-by-token streaming speech generation often results in degraded audio quality. To address these challenges, we propose StreamFlow, a novel neural architecture that facilitates streaming flow matching with diffusion transformers (DiT). To mitigate the long-sequence extrapolation issues arising from lengthy historical dependencies, we design a local block-wise receptive field strategy. Specifically, the sequence is first segmented into blocks, and we introduce block-wise attention masks that enable the current block to receive information from the previous or subsequent block. These attention masks are combined hierarchically across different DiT-blocks to regulate the receptive field of DiTs. Both subjective and objective experimental results demonstrate that our approach achieves performance comparable to non-streaming methods while surpassing other streaming methods in terms of speech quality, all the while effectively managing inference time during long-sequence generation. Furthermore, our method achieves a notable first-packet latency of only 180 ms.\footnote{Speech samples: https://dukguo.github.io/StreamFlow/} △ Less

Submitted 1 July, 2025; v1 submitted 30 June, 2025; originally announced June 2025.

arXiv:2506.23762 [pdf, ps, other]

Software Engineering for Large Language Models: Research Status, Challenges and the Road Ahead

Authors: Hongzhou Rao, Yanjie Zhao, Xinyi Hou, Shenao Wang, Haoyu Wang

Abstract: The rapid advancement of large language models (LLMs) has redefined artificial intelligence (AI), pushing the boundaries of AI research and enabling unbounded possibilities for both academia and the industry. However, LLM development faces increasingly complex challenges throughout its lifecycle, yet no existing research systematically explores these challenges and solutions from the perspective o… ▽ More The rapid advancement of large language models (LLMs) has redefined artificial intelligence (AI), pushing the boundaries of AI research and enabling unbounded possibilities for both academia and the industry. However, LLM development faces increasingly complex challenges throughout its lifecycle, yet no existing research systematically explores these challenges and solutions from the perspective of software engineering (SE) approaches. To fill the gap, we systematically analyze research status throughout the LLM development lifecycle, divided into six phases: requirements engineering, dataset construction, model development and enhancement, testing and evaluation, deployment and operations, and maintenance and evolution. We then conclude by identifying the key challenges for each phase and presenting potential research directions to address these challenges. In general, we provide valuable insights from an SE perspective to facilitate future advances in LLM development. △ Less

Submitted 30 June, 2025; originally announced June 2025.

arXiv:2506.23643 [pdf, ps, other]

Act-With-Think: Chunk Auto-Regressive Modeling for Generative Recommendation

Authors: Yifan Wang, Weinan Gan, Longtao Xiao, Jieming Zhu, Heng Chang, Haozhao Wang, Rui Zhang, Zhenhua Dong, Ruiming Tang, Ruixuan Li

Abstract: Generative recommendation (GR) typically encodes behavioral or semantic aspects of item information into discrete tokens, leveraging the standard autoregressive (AR) generation paradigm to make predictions. However, existing methods tend to overlook their intrinsic relationship, that is, the semantic usually provides some reasonable explainability "$\textbf{why}$" for the behavior "… ▽ More Generative recommendation (GR) typically encodes behavioral or semantic aspects of item information into discrete tokens, leveraging the standard autoregressive (AR) generation paradigm to make predictions. However, existing methods tend to overlook their intrinsic relationship, that is, the semantic usually provides some reasonable explainability "$\textbf{why}$" for the behavior "$\textbf{what}$", which may constrain the full potential of GR. To this end, we present Chunk AutoRegressive Modeling (CAR), a new generation paradigm following the decision pattern that users usually think semantic aspects of items (e.g. brand) and then take actions on target items (e.g. purchase). Our CAR, for the $\textit{first time}$, incorporates semantics (SIDs) and behavior (UID) into a single autoregressive transformer from an ``act-with-think'' dual perspective via chunk-level autoregression. Specifically, CAR packs SIDs and UID into a conceptual chunk for item unified representation, allowing each decoding step to make a holistic prediction. Experiments show that our CAR significantly outperforms existing methods based on traditional AR, improving Recall@5 by 7.93% to 22.30%. Furthermore, we verify the scaling effect between model performance and SIDs bit number, demonstrating that CAR preliminary emulates a kind of slow-thinking style mechanism akin to the reasoning processes observed in large language models (LLMs). △ Less

Submitted 30 June, 2025; originally announced June 2025.

Comments: 9 pages, 2 figures

arXiv:2506.23622 [pdf, ps, other]

Privacy-Preserving Federated Learning Scheme with Mitigating Model Poisoning Attacks: Vulnerabilities and Countermeasures

Authors: Jiahui Wu, Fucai Luo, Tiecheng Sun, Haiyan Wang, Weizhe Zhang

Abstract: The privacy-preserving federated learning schemes based on the setting of two honest-but-curious and non-colluding servers offer promising solutions in terms of security and efficiency. However, our investigation reveals that these schemes still suffer from privacy leakage when considering model poisoning attacks from malicious users. Specifically, we demonstrate that the privacy-preserving comput… ▽ More The privacy-preserving federated learning schemes based on the setting of two honest-but-curious and non-colluding servers offer promising solutions in terms of security and efficiency. However, our investigation reveals that these schemes still suffer from privacy leakage when considering model poisoning attacks from malicious users. Specifically, we demonstrate that the privacy-preserving computation process for defending against model poisoning attacks inadvertently leaks privacy to one of the honest-but-curious servers, enabling it to access users' gradients in plaintext. To address both privacy leakage and model poisoning attacks, we propose an enhanced privacy-preserving and Byzantine-robust federated learning (PBFL) scheme, comprising three components: (1) a two-trapdoor fully homomorphic encryption (FHE) scheme to bolster users' privacy protection; (2) a novel secure normalization judgment method to preemptively thwart gradient poisoning; and (3) an innovative secure cosine similarity measurement method for detecting model poisoning attacks without compromising data privacy. Our scheme guarantees privacy preservation and resilience against model poisoning attacks, even in scenarios with heterogeneous, non-IID (Independently and Identically Distributed) datasets. Theoretical analyses substantiate the security and efficiency of our scheme, and extensive experiments corroborate the efficacy of our private attacks. Furthermore, the experimental results demonstrate that our scheme accelerates training speed while reducing communication overhead compared to the state-of-the-art PBFL schemes. △ Less

Submitted 30 June, 2025; originally announced June 2025.

arXiv:2506.23549 [pdf, ps, other]

CooT: Learning to Coordinate In-Context with Coordination Transformers

Authors: Huai-Chih Wang, Hsiang-Chun Chuang, Hsi-Chun Cheng, Dai-Jie Wu, Shao-Hua Sun

Abstract: Effective coordination among artificial agents in dynamic and uncertain environments remains a significant challenge in multi-agent systems. Existing approaches, such as self-play and population-based methods, either generalize poorly to unseen partners or require extensive training. To overcome these limitations, we propose Coordination Transformers (CooT), a novel in-context coordination framewo… ▽ More Effective coordination among artificial agents in dynamic and uncertain environments remains a significant challenge in multi-agent systems. Existing approaches, such as self-play and population-based methods, either generalize poorly to unseen partners or require extensive training. To overcome these limitations, we propose Coordination Transformers (CooT), a novel in-context coordination framework that uses recent interaction histories to adapt to unseen partners rapidly. Unlike previous approaches that primarily aim to increase the diversity of training partners, CooT explicitly focuses on adapting to new partner behaviors by predicting actions aligned with observed partner interactions. Trained on interaction trajectories collected from diverse pairs of agents with complementary behaviors, CooT quickly learns effective coordination strategies without explicit supervision or fine-tuning. Evaluations on the Overcooked benchmark demonstrate that CooT significantly outperforms baseline methods in coordination tasks involving previously unseen partners. Human evaluations further confirm CooT as the most effective collaborative partner, while extensive ablations highlight its robustness, flexibility, and sensitivity to context in multi-agent scenarios. △ Less

Submitted 30 June, 2025; originally announced June 2025.

Comments: 23 pages, 10 tables, 8 figures

arXiv:2506.23485 [pdf, ps, other]

Thought-Augmented Planning for LLM-Powered Interactive Recommender Agent

Authors: Haocheng Yu, Yaxiong Wu, Hao Wang, Wei Guo, Yong Liu, Yawen Li, Yuyang Ye, Junping Du, Enhong Chen

Abstract: Interactive recommendation is a typical information-seeking task that allows users to interactively express their needs through natural language and obtain personalized recommendations. Large language model-powered (LLM-powered) agents have become a new paradigm in interactive recommendations, effectively capturing users' real-time needs and enhancing personalized experiences. However, due to limi… ▽ More Interactive recommendation is a typical information-seeking task that allows users to interactively express their needs through natural language and obtain personalized recommendations. Large language model-powered (LLM-powered) agents have become a new paradigm in interactive recommendations, effectively capturing users' real-time needs and enhancing personalized experiences. However, due to limited planning and generalization capabilities, existing formulations of LLM-powered interactive recommender agents struggle to effectively address diverse and complex user intents, such as intuitive, unrefined, or occasionally ambiguous requests. To tackle this challenge, we propose a novel thought-augmented interactive recommender agent system (TAIRA) that addresses complex user intents through distilled thought patterns. Specifically, TAIRA is designed as an LLM-powered multi-agent system featuring a manager agent that orchestrates recommendation tasks by decomposing user needs and planning subtasks, with its planning capacity strengthened through Thought Pattern Distillation (TPD), a thought-augmentation method that extracts high-level thoughts from the agent's and human experts' experiences. Moreover, we designed a set of user simulation schemes to generate personalized queries of different difficulties and evaluate the recommendations based on specific datasets. Through comprehensive experiments conducted across multiple datasets, TAIRA exhibits significantly enhanced performance compared to existing methods. Notably, TAIRA shows a greater advantage on more challenging tasks while generalizing effectively on novel tasks, further validating its superiority in managing complex user intents within interactive recommendation systems. The code is publicly available at:https://github.com/Alcein/TAIRA. △ Less

Submitted 29 June, 2025; originally announced June 2025.

arXiv:2506.23351 [pdf, ps, other]

Benchmarking Generalizable Bimanual Manipulation: RoboTwin Dual-Arm Collaboration Challenge at CVPR 2025 MEIS Workshop

Authors: Tianxing Chen, Kaixuan Wang, Zhaohui Yang, Yuhao Zhang, Zanxin Chen, Baijun Chen, Wanxi Dong, Ziyuan Liu, Dong Chen, Tianshuo Yang, Haibao Yu, Xiaokang Yang, Yusen Qin, Zhiqiang Xie, Yao Mu, Ping Luo, Tian Nian, Weiliang Deng, Yiheng Ge, Yibin Liu, Zixuan Li, Dehui Wang, Zhixuan Liang, Haohui Xie, Rijie Zeng , et al. (74 additional authors not shown)

Abstract: Embodied Artificial Intelligence (Embodied AI) is an emerging frontier in robotics, driven by the need for autonomous systems that can perceive, reason, and act in complex physical environments. While single-arm systems have shown strong task performance, collaborative dual-arm systems are essential for handling more intricate tasks involving rigid, deformable, and tactile-sensitive objects. To ad… ▽ More Embodied Artificial Intelligence (Embodied AI) is an emerging frontier in robotics, driven by the need for autonomous systems that can perceive, reason, and act in complex physical environments. While single-arm systems have shown strong task performance, collaborative dual-arm systems are essential for handling more intricate tasks involving rigid, deformable, and tactile-sensitive objects. To advance this goal, we launched the RoboTwin Dual-Arm Collaboration Challenge at the 2nd MEIS Workshop, CVPR 2025. Built on the RoboTwin Simulation platform (1.0 and 2.0) and the AgileX COBOT-Magic Robot platform, the competition consisted of three stages: Simulation Round 1, Simulation Round 2, and a final Real-World Round. Participants totally tackled 17 dual-arm manipulation tasks, covering rigid, deformable, and tactile-based scenarios. The challenge attracted 64 global teams and over 400 participants, producing top-performing solutions like SEM and AnchorDP3 and generating valuable insights into generalizable bimanual policy learning. This report outlines the competition setup, task design, evaluation methodology, key findings and future direction, aiming to support future research on robust and generalizable bimanual manipulation policies. The Challenge Webpage is available at https://robotwin-benchmark.github.io/cvpr-2025-challenge/. △ Less

Submitted 29 June, 2025; originally announced June 2025.

Comments: Challenge Webpage: https://robotwin-benchmark.github.io/cvpr-2025-challenge/

arXiv:2506.23346 [pdf, ps, other]

Safe and Performant Deployment of Autonomous Systems via Model Predictive Control and Hamilton-Jacobi Reachability Analysis

Authors: Hao Wang, Armand Jordana, Ludovic Righetti, Somil Bansal

Abstract: While we have made significant algorithmic developments to enable autonomous systems to perform sophisticated tasks, it remains difficult for them to perform tasks effective and safely. Most existing approaches either fail to provide any safety assurances or substantially compromise task performance for safety. In this work, we develop a framework, based on model predictive control (MPC) and Hamil… ▽ More While we have made significant algorithmic developments to enable autonomous systems to perform sophisticated tasks, it remains difficult for them to perform tasks effective and safely. Most existing approaches either fail to provide any safety assurances or substantially compromise task performance for safety. In this work, we develop a framework, based on model predictive control (MPC) and Hamilton-Jacobi (HJ) reachability, to optimize task performance for autonomous systems while respecting the safety constraints. Our framework guarantees recursive feasibility for the MPC controller, and it is scalable to high-dimensional systems. We demonstrate the effectiveness of our framework with two simulation studies using a 4D Dubins Car and a 6 Dof Kuka iiwa manipulator, and the experiments show that our framework significantly improves the safety constraints satisfaction of the systems over the baselines. △ Less

Submitted 29 June, 2025; originally announced June 2025.

Comments: RSS 2025 Workshop on Reliable Robotics

arXiv:2506.23270 [pdf, ps, other]

Token Activation Map to Visually Explain Multimodal LLMs

Authors: Yi Li, Hualiang Wang, Xinpeng Ding, Haonan Wang, Xiaomeng Li

Abstract: Multimodal large language models (MLLMs) are broadly empowering various fields. Despite their advancements, the explainability of MLLMs remains less explored, hindering deeper understanding, model credibility, and effective visualization. Unlike conventional vision models (e.g., CNNs, ViTs, CLIP) that produce a single output, MLLMs generate sequences of tokens progressively, where each generated t… ▽ More Multimodal large language models (MLLMs) are broadly empowering various fields. Despite their advancements, the explainability of MLLMs remains less explored, hindering deeper understanding, model credibility, and effective visualization. Unlike conventional vision models (e.g., CNNs, ViTs, CLIP) that produce a single output, MLLMs generate sequences of tokens progressively, where each generated token depends on the previous context. Therefore, earlier context tokens can introduce redundant activations that interfere with the explanation of later tokens beyond their original information. Existing studies often overlook this issue, but our observations reveal that these redundant correlations can significantly hurt the reliability of explanations. To address this, we propose an estimated causal inference method to mitigate the interference of context to achieve high-quality MLLM explanation, with a novel rank Gaussian filter to further reduce activation noises. We term this method Token Activation Map (TAM) to highlight the consideration of interactions between tokens. TAM also indicates that it excels at explaining multiple tokens of MLLM, which is different from the Class Activation Map (CAM) for a single prediction. Our TAM method significantly outperforms existing SoTA methods, showcasing high-quality visualization results that can be utilized for various scenarios, such as object localization, failure case analysis, video visualization, MLLMs visual comparison, and model understanding (e.g., color, shape, action, location, visual reasoning, multi-turn conversation, etc). The code is available atgithub.com/xmed-lab/TAM. △ Less

Submitted 29 June, 2025; originally announced June 2025.

Comments: ICCV2025 Accepted

arXiv:2506.23157 [pdf, ps, other]

STD-GS: Exploring Frame-Event Interaction for SpatioTemporal-Disentangled Gaussian Splatting to Reconstruct High-Dynamic Scene

Authors: Hanyu Zhou, Haonan Wang, Haoyue Liu, Yuxing Duan, Luxin Yan, Gim Hee Lee

Abstract: High-dynamic scene reconstruction aims to represent static background with rigid spatial features and dynamic objects with deformed continuous spatiotemporal features. Typically, existing methods adopt unified representation model (e.g., Gaussian) to directly match the spatiotemporal features of dynamic scene from frame camera. However, this unified paradigm fails in the potential discontinuous te… ▽ More High-dynamic scene reconstruction aims to represent static background with rigid spatial features and dynamic objects with deformed continuous spatiotemporal features. Typically, existing methods adopt unified representation model (e.g., Gaussian) to directly match the spatiotemporal features of dynamic scene from frame camera. However, this unified paradigm fails in the potential discontinuous temporal features of objects due to frame imaging and the heterogeneous spatial features between background and objects. To address this issue, we disentangle the spatiotemporal features into various latent representations to alleviate the spatiotemporal mismatching between background and objects. In this work, we introduce event camera to compensate for frame camera, and propose a spatiotemporal-disentangled Gaussian splatting framework for high-dynamic scene reconstruction. As for dynamic scene, we figure out that background and objects have appearance discrepancy in frame-based spatial features and motion discrepancy in event-based temporal features, which motivates us to distinguish the spatiotemporal features between background and objects via clustering. As for dynamic object, we discover that Gaussian representations and event data share the consistent spatiotemporal characteristic, which could serve as a prior to guide the spatiotemporal disentanglement of object Gaussians. Within Gaussian splatting framework, the cumulative scene-object disentanglement can improve the spatiotemporal discrimination between background and objects to render the time-continuous dynamic scene. Extensive experiments have been performed to verify the superiority of the proposed method. △ Less

Submitted 29 June, 2025; originally announced June 2025.

arXiv:2506.22637 [pdf, ps, other]

CaO$_2$: Rectifying Inconsistencies in Diffusion-Based Dataset Distillation

Authors: Haoxuan Wang, Zhenghao Zhao, Junyi Wu, Yuzhang Shang, Gaowen Liu, Yan Yan

Abstract: The recent introduction of diffusion models in dataset distillation has shown promising potential in creating compact surrogate datasets for large, high-resolution target datasets, offering improved efficiency and performance over traditional bi-level/uni-level optimization methods. However, current diffusion-based dataset distillation approaches overlook the evaluation process and exhibit two cri… ▽ More The recent introduction of diffusion models in dataset distillation has shown promising potential in creating compact surrogate datasets for large, high-resolution target datasets, offering improved efficiency and performance over traditional bi-level/uni-level optimization methods. However, current diffusion-based dataset distillation approaches overlook the evaluation process and exhibit two critical inconsistencies in the distillation process: (1) Objective Inconsistency, where the distillation process diverges from the evaluation objective, and (2) Condition Inconsistency, leading to mismatches between generated images and their corresponding conditions. To resolve these issues, we introduce Condition-aware Optimization with Objective-guided Sampling (CaO$_2$), a two-stage diffusion-based framework that aligns the distillation process with the evaluation objective. The first stage employs a probability-informed sample selection pipeline, while the second stage refines the corresponding latent representations to improve conditional likelihood. CaO$_2$ achieves state-of-the-art performance on ImageNet and its subsets, surpassing the best-performing baselines by an average of 2.3% accuracy. △ Less

Submitted 27 June, 2025; originally announced June 2025.

Comments: ICCV 2025. Code is available at https://github.com/hatchetProject/CaO2

arXiv:2506.22037 [pdf]

KARMA Approach supporting Development Process Reconstruction in Model-based Systems Engineering

Authors: Jiawei Li, Zan Liang, Guoxin Wang, Jinzhi Lu, Yan Yan, Shouxuan Wu, Hao Wang

Abstract: Model reconstruction is a method used to drive the development of complex system development processes in model-based systems engineering. Currently, during the iterative design process of a system, there is a lack of an effective method to manage changes in development requirements, such as development cycle requirements and cost requirements, and to realize the reconstruction of the system devel… ▽ More Model reconstruction is a method used to drive the development of complex system development processes in model-based systems engineering. Currently, during the iterative design process of a system, there is a lack of an effective method to manage changes in development requirements, such as development cycle requirements and cost requirements, and to realize the reconstruction of the system development process model. To address these issues, this paper proposes a model reconstruction method to support the development process model. Firstly, the KARMA language, based on the GOPPRR-E metamodeling method, is utilized to uniformly formalize the process models constructed based on different modeling languages. Secondly, a model reconstruction framework is introduced. This framework takes a structured development requirements based natural language as input, employs natural language processing techniques to analyze the development requirements text, and extracts structural and optimization constraint information. Then, after structural reorganization and algorithm optimization, a development process model that meets the development requirements is obtained. Finally, as a case study, the development process of the aircraft onboard maintenance system is reconstructed. The results demonstrate that this method can significantly enhance the design efficiency of the development process. △ Less

Submitted 27 June, 2025; originally announced June 2025.

Comments: 12 pages, 9 figures, submitted to the 15th international Complex Systems Design & Management (CSD&M) conference

arXiv:2506.22027 [pdf, ps, other]

Cross-modal Ship Re-Identification via Optical and SAR Imagery: A Novel Dataset and Method

Authors: Han Wang, Shengyang Li, Jian Yang, Yuxuan Liu, Yixuan Lv, Zhuang Zhou

Abstract: Detecting and tracking ground objects using earth observation imagery remains a significant challenge in the field of remote sensing. Continuous maritime ship tracking is crucial for applications such as maritime search and rescue, law enforcement, and shipping analysis. However, most current ship tracking methods rely on geostationary satellites or video satellites. The former offer low resolutio… ▽ More Detecting and tracking ground objects using earth observation imagery remains a significant challenge in the field of remote sensing. Continuous maritime ship tracking is crucial for applications such as maritime search and rescue, law enforcement, and shipping analysis. However, most current ship tracking methods rely on geostationary satellites or video satellites. The former offer low resolution and are susceptible to weather conditions, while the latter have short filming durations and limited coverage areas, making them less suitable for the real-world requirements of ship tracking. To address these limitations, we present the Hybrid Optical and Synthetic Aperture Radar (SAR) Ship Re-Identification Dataset (HOSS ReID dataset), designed to evaluate the effectiveness of ship tracking using low-Earth orbit constellations of optical and SAR sensors. This approach ensures shorter re-imaging cycles and enables all-weather tracking. HOSS ReID dataset includes images of the same ship captured over extended periods under diverse conditions, using different satellites of different modalities at varying times and angles. Furthermore, we propose a baseline method for cross-modal ship re-identification, TransOSS, which is built on the Vision Transformer architecture. It refines the patch embedding structure to better accommodate cross-modal tasks, incorporates additional embeddings to introduce more reference information, and employs contrastive learning to pre-train on large-scale optical-SAR image pairs, ensuring the model's ability to extract modality-invariant features. Our dataset and baseline method are publicly available on https://github.com/Alioth2000/Hoss-ReID. △ Less

Submitted 27 June, 2025; originally announced June 2025.

Comments: Accepted to ICCV 2025

arXiv:2506.22023 [pdf, ps, other]

Robust and Efficient Autoregressive Speech Synthesis with Dynamic Chunk-wise Prediction Policy

Authors: Bohan Li, Zhihan Li, Haoran Wang, Hanglei Zhang, Yiwei Guo, Hankun Wang, Xie Chen, Kai Yu

Abstract: Recently, autoregressive (AR) language models have emerged as a dominant approach in speech synthesis, offering expressive generation and scalable training. However, conventional AR speech synthesis models relying on the next-token prediction paradigm often encounter significant challenges when handling long speech sequences. These models often struggle to construct stable frame-to-frame attention… ▽ More Recently, autoregressive (AR) language models have emerged as a dominant approach in speech synthesis, offering expressive generation and scalable training. However, conventional AR speech synthesis models relying on the next-token prediction paradigm often encounter significant challenges when handling long speech sequences. These models often struggle to construct stable frame-to-frame attention, leading to increased latency and degraded synthesis quality, thereby limiting their feasibility for real-time applications. To address these limitations, we introduce a novel dynamic chunk-wise autoregressive synthesis framework, termed DCAR, designed to enhance both efficiency and intelligibility robustness in AR speech generation. DCAR introduces a chunk-to-frame attention mechanism through training with multi-token prediction, enabling dynamic chunk prediction in variable speech contexts using a lightweight module trained on-policy. DCAR dynamically adjusts the token prediction span, significantly reducing the sequence length dependency while obtaining high synthesis quality. Comprehensive empirical evaluations demonstrate that DCAR substantially outperforms traditional next-token prediction models, achieving up to 72.27% intelligibility improvement and 2.61x inference speedup simultaneously on the test set. Furthermore, we conduct comprehensive analysis to support it as a versatile foundation for next-generation speech synthesis systems. △ Less

Submitted 27 June, 2025; originally announced June 2025.

Comments: 17 pages, 8 figures, 5 tables

arXiv:2506.21952 [pdf, ps, other]

Physics-informed network paradigm with data generation and background noise removal for diverse distributed acoustic sensing applications

Authors: Yangyang Wan, Haotian Wang, Xuhui Yu, Jiageng Chen, Xinyu Fan, Zuyuan He

Abstract: Distributed acoustic sensing (DAS) has attracted considerable attention across various fields and artificial intelligence (AI) technology plays an important role in DAS applications to realize event recognition and denoising. Existing AI models require real-world data (RWD), whether labeled or not, for training, which is contradictory to the fact of limited available event data in real-world scena… ▽ More Distributed acoustic sensing (DAS) has attracted considerable attention across various fields and artificial intelligence (AI) technology plays an important role in DAS applications to realize event recognition and denoising. Existing AI models require real-world data (RWD), whether labeled or not, for training, which is contradictory to the fact of limited available event data in real-world scenarios. Here, a physics-informed DAS neural network paradigm is proposed, which does not need real-world events data for training. By physically modeling target events and the constraints of real world and DAS system, physical functions are derived to train a generative network for generation of DAS events data. DAS debackground net is trained by using the generated DAS events data to eliminate background noise in DAS data. The effectiveness of the proposed paradigm is verified in event identification application based on a public dataset of DAS spatiotemporal data and in belt conveyor fault monitoring application based on DAS time-frequency data, and achieved comparable or better performance than data-driven networks trained with RWD. Owing to the introduction of physical information and capability of background noise removal, the paradigm demonstrates generalization in same application on different sites. A fault diagnosis accuracy of 91.8% is achieved in belt conveyor field with networks which transferred from simulation test site without any fault events data of test site and field for training. The proposed paradigm is a prospective solution to address significant obstacles of data acquisition and intense noise in practical DAS applications and explore more potential fields for DAS. △ Less

Submitted 27 June, 2025; originally announced June 2025.

arXiv:2506.21926 [pdf, ps, other]

Computing Maximum Cliques in Unit Disk Graphs

Authors: Anastasiia Tkachenko, Haitao Wang

Abstract: Given a set $P$ of $n$ points in the plane, the unit-disk graph $G(P)$ is a graph with $P$ as its vertex set such that two points of $P$ have an edge if their Euclidean distance is at most $1$. We consider the problem of computing a maximum clique in $G(P)$. The previously best algorithm for the problem runs in $O(n^{7/3+o(1)})$ time. We show that the problem can be solved in… ▽ More Given a set $P$ of $n$ points in the plane, the unit-disk graph $G(P)$ is a graph with $P$ as its vertex set such that two points of $P$ have an edge if their Euclidean distance is at most $1$. We consider the problem of computing a maximum clique in $G(P)$. The previously best algorithm for the problem runs in $O(n^{7/3+o(1)})$ time. We show that the problem can be solved in $O(n \log n + n K^{4/3+o(1)})$ time, where $K$ is the maximum clique size. The algorithm is faster than the previous one when $K=o(n)$. In addition, if $P$ is in convex position, we give a randomized algorithm that runs in $O(n^{15/7+o(1)})= O(n^{2.143})$ worst-case time and the algorithm can compute a maximum clique with high probability. For points in convex position, one special case we solve is when a point in the maximum clique is given; we present an $O(n^2\log n)$ time (deterministic) algorithm for this special case. △ Less

Submitted 27 June, 2025; originally announced June 2025.

Comments: To appear in CCCG 2025

arXiv:2506.21865 [pdf, ps, other]

RiverEcho: Real-Time Interactive Digital System for Ancient Yellow River Culture

Authors: Haofeng Wang, Yilin Guo, Zehao Li, Tong Yue, Yizong Wang, Enci Zhang, Rongqun Lin, Feng Gao, Shiqi Wang, Siwei Ma

Abstract: The Yellow River is China's mother river and a cradle of human civilization. The ancient Yellow River culture is, moreover, an indispensable part of human art history. To conserve and inherit the ancient Yellow River culture, we designed RiverEcho, a real-time interactive system that responds to voice queries using a large language model and a cultural knowledge dataset, delivering explanations th… ▽ More The Yellow River is China's mother river and a cradle of human civilization. The ancient Yellow River culture is, moreover, an indispensable part of human art history. To conserve and inherit the ancient Yellow River culture, we designed RiverEcho, a real-time interactive system that responds to voice queries using a large language model and a cultural knowledge dataset, delivering explanations through a talking-head digital human. Specifically, we built a knowledge database focused on the ancient Yellow River culture, including the collection of historical texts and the processing pipeline. Experimental results demonstrate that leveraging Retrieval-Augmented Generation (RAG) on the proposed dataset enhances the response quality of the Large Language Model(LLM), enabling the system to generate more professional and informative responses. Our work not only diversifies the means of promoting Yellow River culture but also provides users with deeper cultural insights. △ Less

Submitted 26 June, 2025; originally announced June 2025.

Comments: IEEE International Conference on Multimedia and Expo Workshop, 2025.(Accepted)

arXiv:2506.21851 [pdf, ps, other]

End-to-End RGB-IR Joint Image Compression With Channel-wise Cross-modality Entropy Model

Authors: Haofeng Wang, Fangtao Zhou, Qi Zhang, Zeyuan Chen, Enci Zhang, Zhao Wang, Xiaofeng Huang, Siwei Ma

Abstract: RGB-IR(RGB-Infrared) image pairs are frequently applied simultaneously in various applications like intelligent surveillance. However, as the number of modalities increases, the required data storage and transmission costs also double. Therefore, efficient RGB-IR data compression is essential. This work proposes a joint compression framework for RGB-IR image pair. Specifically, to fully utilize cr… ▽ More RGB-IR(RGB-Infrared) image pairs are frequently applied simultaneously in various applications like intelligent surveillance. However, as the number of modalities increases, the required data storage and transmission costs also double. Therefore, efficient RGB-IR data compression is essential. This work proposes a joint compression framework for RGB-IR image pair. Specifically, to fully utilize cross-modality prior information for accurate context probability modeling within and between modalities, we propose a Channel-wise Cross-modality Entropy Model (CCEM). Among CCEM, a Low-frequency Context Extraction Block (LCEB) and a Low-frequency Context Fusion Block (LCFB) are designed for extracting and aggregating the global low-frequency information from both modalities, which assist the model in predicting entropy parameters more accurately. Experimental results demonstrate that our approach outperforms existing RGB-IR image pair and single-modality compression methods on LLVIP and KAIST datasets. For instance, the proposed framework achieves a 23.1% bit rate saving on LLVIP dataset compared to the state-of-the-art RGB-IR image codec presented at CVPR 2022. △ Less

Submitted 26 June, 2025; originally announced June 2025.

Comments: IEEE International Conference on Systems, Man, and Cybernetics 2025. (SMC), under review

arXiv:2506.21611 [pdf, ps, other]

Does Multimodality Lead to Better Time Series Forecasting?

Authors: Xiyuan Zhang, Boran Han, Haoyang Fang, Abdul Fatir Ansari, Shuai Zhang, Danielle C. Maddix, Cuixiong Hu, Andrew Gordon Wilson, Michael W. Mahoney, Hao Wang, Yan Liu, Huzefa Rangwala, George Karypis, Bernie Wang

Abstract: Recently, there has been growing interest in incorporating textual information into foundation models for time series forecasting. However, it remains unclear whether and under what conditions such multimodal integration consistently yields gains. We systematically investigate these questions across a diverse benchmark of 14 forecasting tasks spanning 7 domains, including health, environment, and… ▽ More Recently, there has been growing interest in incorporating textual information into foundation models for time series forecasting. However, it remains unclear whether and under what conditions such multimodal integration consistently yields gains. We systematically investigate these questions across a diverse benchmark of 14 forecasting tasks spanning 7 domains, including health, environment, and economics. We evaluate two popular multimodal forecasting paradigms: aligning-based methods, which align time series and text representations; and prompting-based methods, which directly prompt large language models for forecasting. Although prior works report gains from multimodal input, we find these effects are not universal across datasets and models, and multimodal methods sometimes do not outperform the strongest unimodal baselines. To understand when textual information helps, we disentangle the effects of model architectural properties and data characteristics. Our findings highlight that on the modeling side, incorporating text information is most helpful given (1) high-capacity text models, (2) comparatively weaker time series models, and (3) appropriate aligning strategies. On the data side, performance gains are more likely when (4) sufficient training data is available and (5) the text offers complementary predictive signal beyond what is already captured from the time series alone. Our empirical findings offer practical guidelines for when multimodality can be expected to aid forecasting tasks, and when it does not. △ Less

Submitted 20 June, 2025; originally announced June 2025.

arXiv:2506.21349 [pdf, ps, other]

Generalizable Neural Electromagnetic Inverse Scattering

Authors: Yizhe Cheng, Chunxun Tian, Haoru Wang, Wentao Zhu, Xiaoxuan Ma, Yizhou Wang

Abstract: Solving Electromagnetic Inverse Scattering Problems (EISP) is fundamental in applications such as medical imaging, where the goal is to reconstruct the relative permittivity from scattered electromagnetic field. This inverse process is inherently ill-posed and highly nonlinear, making it particularly challenging. A recent machine learning-based approach, Img-Interiors, shows promising results by l… ▽ More Solving Electromagnetic Inverse Scattering Problems (EISP) is fundamental in applications such as medical imaging, where the goal is to reconstruct the relative permittivity from scattered electromagnetic field. This inverse process is inherently ill-posed and highly nonlinear, making it particularly challenging. A recent machine learning-based approach, Img-Interiors, shows promising results by leveraging continuous implicit functions. However, it requires case-specific optimization, lacks generalization to unseen data, and fails under sparse transmitter setups (e.g., with only one transmitter). To address these limitations, we revisit EISP from a physics-informed perspective, reformulating it as a two stage inverse transmission-scattering process. This formulation reveals the induced current as a generalizable intermediate representation, effectively decoupling the nonlinear scattering process from the ill-posed inverse problem. Built on this insight, we propose the first generalizable physics-driven framework for EISP, comprising a current estimator and a permittivity solver, working in an end-to-end manner. The current estimator explicitly learns the induced current as a physical bridge between the incident and scattered field, while the permittivity solver computes the relative permittivity directly from the estimated induced current. This design enables data-driven training and generalizable feed-forward prediction of relative permittivity on unseen data while maintaining strong robustness to transmitter sparsity. Extensive experiments show that our method outperforms state-of-the-art approaches in reconstruction accuracy, generalization, and robustness. This work offers a fundamentally new perspective on electromagnetic inverse scattering and represents a major step toward cost-effective practical solutions for electromagnetic imaging. △ Less

Submitted 1 July, 2025; v1 submitted 26 June, 2025; originally announced June 2025.

arXiv:2506.21322 [pdf, ps, other]

"Who Should I Believe?": User Interpretation and Decision-Making When a Family Healthcare Robot Contradicts Human Memory

Authors: Hong Wang, Natalia Calvo-Barajas, Katie Winkle, Ginevra Castellano

Abstract: Advancements in robotic capabilities for providing physical assistance, psychological support, and daily health management are making the deployment of intelligent healthcare robots in home environments increasingly feasible in the near future. However, challenges arise when the information provided by these robots contradicts users' memory, raising concerns about user trust and decision-making. T… ▽ More Advancements in robotic capabilities for providing physical assistance, psychological support, and daily health management are making the deployment of intelligent healthcare robots in home environments increasingly feasible in the near future. However, challenges arise when the information provided by these robots contradicts users' memory, raising concerns about user trust and decision-making. This paper presents a study that examines how varying a robot's level of transparency and sociability influences user interpretation, decision-making and perceived trust when faced with conflicting information from a robot. In a 2 x 2 between-subjects online study, 176 participants watched videos of a Furhat robot acting as a family healthcare assistant and suggesting a fictional user to take medication at a different time from that remembered by the user. Results indicate that robot transparency influenced users' interpretation of information discrepancies: with a low transparency robot, the most frequent assumption was that the user had not correctly remembered the time, while with the high transparency robot, participants were more likely to attribute the discrepancy to external factors, such as a partner or another household member modifying the robot's information. Additionally, participants exhibited a tendency toward overtrust, often prioritizing the robot's recommendations over the user's memory, even when suspecting system malfunctions or third-party interference. These findings highlight the impact of transparency mechanisms in robotic systems, the complexity and importance associated with system access control for multi-user robots deployed in home environments, and the potential risks of users' over reliance on robots in sensitive domains such as healthcare. △ Less

Submitted 26 June, 2025; originally announced June 2025.

Comments: 8 pages

arXiv:2506.21140 [pdf, ps, other]

DBConformer: Dual-Branch Convolutional Transformer for EEG Decoding

Authors: Ziwei Wang, Hongbin Wang, Tianwang Jia, Xingyi He, Siyang Li, Dongrui Wu

Abstract: Electroencephalography (EEG)-based brain-computer interfaces (BCIs) transform spontaneous/evoked neural activity into control commands for external communication. While convolutional neural networks (CNNs) remain the mainstream backbone for EEG decoding, their inherently short receptive field makes it difficult to capture long-range temporal dependencies and global inter-channel relationships. Rec… ▽ More Electroencephalography (EEG)-based brain-computer interfaces (BCIs) transform spontaneous/evoked neural activity into control commands for external communication. While convolutional neural networks (CNNs) remain the mainstream backbone for EEG decoding, their inherently short receptive field makes it difficult to capture long-range temporal dependencies and global inter-channel relationships. Recent CNN-Transformer (Conformers) hybrids partially address this issue, but most adopt a serial design, resulting in suboptimal integration of local and global features, and often overlook explicit channel-wise modeling. To address these limitations, we propose DBConformer, a dual-branch convolutional Transformer network tailored for EEG decoding. It integrates a temporal Conformer to model long-range temporal dependencies and a spatial Conformer to extract inter-channel interactions, capturing both temporal dynamics and spatial patterns in EEG signals. A lightweight channel attention module further refines spatial representations by assigning data-driven importance to EEG channels. Extensive experiments on five motor imagery (MI) datasets and two seizure detection datasets under three evaluation settings demonstrate that DBConformer consistently outperforms 10 competitive baseline models, with over eight times fewer parameters than the high-capacity EEG Conformer baseline. Further, the visualization results confirm that the features extracted by DBConformer are physiologically interpretable and aligned with sensorimotor priors in MI. The superior performance and interpretability of DBConformer make it reliable for robust and explainable EEG decoding. Code is publicized at https://github.com/wzwvv/DBConformer. △ Less

Submitted 26 June, 2025; originally announced June 2025.

Comments: 12 pages, 6 figures

arXiv:2506.21096 [pdf, ps, other]

DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning

Authors: Kang He, Yuzhe Ding, Haining Wang, Fei Li, Chong Teng, Donghong Ji

Abstract: Previous multimodal sentence representation learning methods have achieved impressive performance. However, most approaches focus on aligning images and text at a coarse level, facing two critical challenges:cross-modal misalignment bias and intra-modal semantic divergence, which significantly degrade sentence representation quality. To address these challenges, we propose DALR (Dual-level Alignme… ▽ More Previous multimodal sentence representation learning methods have achieved impressive performance. However, most approaches focus on aligning images and text at a coarse level, facing two critical challenges:cross-modal misalignment bias and intra-modal semantic divergence, which significantly degrade sentence representation quality. To address these challenges, we propose DALR (Dual-level Alignment Learning for Multimodal Sentence Representation). For cross-modal alignment, we propose a consistency learning module that softens negative samples and utilizes semantic similarity from an auxiliary task to achieve fine-grained cross-modal alignment. Additionally, we contend that sentence relationships go beyond binary positive-negative labels, exhibiting a more intricate ranking structure. To better capture these relationships and enhance representation quality, we integrate ranking distillation with global intra-modal alignment learning. Comprehensive experiments on semantic textual similarity (STS) and transfer (TR) tasks validate the effectiveness of our approach, consistently demonstrating its superiority over state-of-the-art baselines. △ Less

Submitted 30 June, 2025; v1 submitted 26 June, 2025; originally announced June 2025.

Comments: Accepted by ACL 2025 Findings

arXiv:2506.21074 [pdf, ps, other]

CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate

Authors: Hankun Wang, Yiwei Guo, Chongtian Shao, Bohan Li, Xie Chen, Kai Yu

Abstract: Neural speech codecs have been widely used in audio compression and various downstream tasks. Current mainstream codecs are fixed-frame-rate (FFR), which allocate the same number of tokens to every equal-duration slice. However, speech is inherently non-uniform in temporal information density. As a result, many tokens are wasted on steady-state segments like long vowels and silences. To address th… ▽ More Neural speech codecs have been widely used in audio compression and various downstream tasks. Current mainstream codecs are fixed-frame-rate (FFR), which allocate the same number of tokens to every equal-duration slice. However, speech is inherently non-uniform in temporal information density. As a result, many tokens are wasted on steady-state segments like long vowels and silences. To address this mismatch, we present CodecSlime, a plugin-style method for compressing temporal redundancy through supporting dynamic frame rate (DFR) on neural speech codecs for the first time. Our method is unsupervised and architecture-agnostic, combining two key innovations, ScheDFR and Melt-and-Cool, for adapting inference and training, respectively. When integrated into a typical VQ-GAN codec backbone and operating at 40 Hz DFR ($\approx$ 600 bps), the reconstruction WER of CodecSlime is reduced by up to 46% relative to conventional FFR baselines with the same model architecture and similar bitrates, while other metrics are also competitive. CodecSlime also enables flexible trade-offs between reconstruction quality and bitrate: a single model supports inference at multiple frame rates and consistently outperforms FFR models at the corresponding frame rates. Audio samples are available at https://acadarmeria.github.io/codecslime/. △ Less

Submitted 26 June, 2025; originally announced June 2025.

Comments: 16 pages, 5 figures, 9 tables

arXiv:2506.21012 [pdf, ps, other]

doi 10.1145/3711896.3736957

FedSC: Federated Learning with Semantic-Aware Collaboration

Authors: Huan Wang, Haoran Li, Huaming Chen, Jun Yan, Jiahua Shi, Jun Shen

Abstract: Federated learning (FL) aims to train models collaboratively across clients without sharing data for privacy-preserving. However, one major challenge is the data heterogeneity issue, which refers to the biased labeling preferences at multiple clients. A number of existing FL methods attempt to tackle data heterogeneity locally (e.g., regularizing local models) or globally (e.g., fine-tuning global… ▽ More Federated learning (FL) aims to train models collaboratively across clients without sharing data for privacy-preserving. However, one major challenge is the data heterogeneity issue, which refers to the biased labeling preferences at multiple clients. A number of existing FL methods attempt to tackle data heterogeneity locally (e.g., regularizing local models) or globally (e.g., fine-tuning global model), often neglecting inherent semantic information contained in each client. To explore the possibility of using intra-client semantically meaningful knowledge in handling data heterogeneity, in this paper, we propose Federated Learning with Semantic-Aware Collaboration (FedSC) to capture client-specific and class-relevant knowledge across heterogeneous clients. The core idea of FedSC is to construct relational prototypes and consistent prototypes at semantic-level, aiming to provide fruitful class underlying knowledge and stable convergence signals in a prototype-wise collaborative way. On the one hand, FedSC introduces an inter-contrastive learning strategy to bring instance-level embeddings closer to relational prototypes with the same semantics and away from distinct classes. On the other hand, FedSC devises consistent prototypes via a discrepancy aggregation manner, as a regularization penalty to constrain the optimization region of the local model. Moreover, a theoretical analysis for FedSC is provided to ensure a convergence guarantee. Experimental results on various challenging scenarios demonstrate the effectiveness of FedSC and the efficiency of crucial components. △ Less

Submitted 26 June, 2025; originally announced June 2025.

Comments: 12 pages, KDD 2025

arXiv:2506.20923 [pdf, ps, other]

KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model

Authors: Xinping Zhao, Xinshuo Hu, Zifei Shan, Shouzheng Huang, Yao Zhou, Zetian Sun, Zhenyu Liu, Dongfang Li, Xinyuan Wei, Qian Chen, Youcheng Pan, Yang Xiang, Meishan Zhang, Haofen Wang, Jun Yu, Baotian Hu, Min Zhang

Abstract: In this paper, we propose KaLM-Embedding-V2, a versatile and compact embedding model, which achieves impressive performance in general-purpose text embedding tasks by leveraging superior training techniques and data. Our key innovations include: (1) To better align the architecture with representation learning, we remove the causal attention mask and adopt a fully bidirectional transformer with si… ▽ More In this paper, we propose KaLM-Embedding-V2, a versatile and compact embedding model, which achieves impressive performance in general-purpose text embedding tasks by leveraging superior training techniques and data. Our key innovations include: (1) To better align the architecture with representation learning, we remove the causal attention mask and adopt a fully bidirectional transformer with simple yet effective mean-pooling to produce fixed-length embeddings; (2) We employ a multi-stage training pipeline: (i) pre-training on large-scale weakly supervised open-source corpora; (ii) fine-tuning on high-quality retrieval and non-retrieval datasets; and (iii) model-soup parameter averaging for robust generalization. Besides, we introduce a focal-style reweighting mechanism that concentrates learning on difficult samples and an online hard-negative mixing strategy to continuously enrich hard negatives without expensive offline mining; (3) We collect over 20 categories of data for pre-training and 100 categories of data for fine-tuning, to boost both the performance and generalization of the embedding model. Extensive evaluations on the Massive Text Embedding Benchmark (MTEB) Chinese and English show that our model significantly outperforms others of comparable size, and competes with 3x, 14x, 18x, and 26x larger embedding models, setting a new standard for a versatile and compact embedding model with less than 1B parameters. △ Less

Submitted 25 June, 2025; originally announced June 2025.

Comments: Technical Report; 26 pages 12 tables 1 figure. arXiv admin note: substantial text overlap with arXiv:2501.01028

arXiv:2506.20606 [pdf, ps, other]

Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm

Authors: Baixiang Huang, Zhen Tan, Haoran Wang, Zijie Liu, Dawei Li, Ali Payani, Huan Liu, Tianlong Chen, Kai Shu

Abstract: Agents based on Large Language Models (LLMs) have demonstrated strong capabilities across a wide range of tasks. However, deploying LLM-based agents in high-stakes domains comes with significant safety and ethical risks. Unethical behavior by these agents can directly result in serious real-world consequences, including physical harm and financial loss. To efficiently steer the ethical behavior of… ▽ More Agents based on Large Language Models (LLMs) have demonstrated strong capabilities across a wide range of tasks. However, deploying LLM-based agents in high-stakes domains comes with significant safety and ethical risks. Unethical behavior by these agents can directly result in serious real-world consequences, including physical harm and financial loss. To efficiently steer the ethical behavior of agents, we frame agent behavior steering as a model editing task, which we term Behavior Editing. Model editing is an emerging area of research that enables precise and efficient modifications to LLMs while preserving their overall capabilities. To systematically study and evaluate this approach, we introduce BehaviorBench, a multi-tier benchmark grounded in psychological moral theories. This benchmark supports both the evaluation and editing of agent behaviors across a variety of scenarios, with each tier introducing more complex and ambiguous scenarios. We first demonstrate that Behavior Editing can dynamically steer agents toward the target behavior within specific scenarios. Moreover, Behavior Editing enables not only scenario-specific local adjustments but also more extensive shifts in an agent's global moral alignment. We demonstrate that Behavior Editing can be used to promote ethical and benevolent behavior or, conversely, to induce harmful or malicious behavior. Through comprehensive evaluations on agents based on frontier LLMs, BehaviorBench shows the effectiveness of Behavior Editing across different models and scenarios. Our findings offer key insights into a new paradigm for steering agent behavior, highlighting both the promise and perils of Behavior Editing. △ Less

Submitted 25 June, 2025; originally announced June 2025.

Comments: Main paper: 9 pages; total: 18 pages (including appendix). Code, data, results, and additional resources are available at: https://model-editing.github.io

arXiv:2506.20487 [pdf, ps, other]

Behavior Foundation Model: Towards Next-Generation Whole-Body Control System of Humanoid Robots

Authors: Mingqi Yuan, Tao Yu, Wenqi Ge, Xiuyong Yao, Dapeng Li, Huijiang Wang, Jiayu Chen, Xin Jin, Bo Li, Hua Chen, Wei Zhang, Wenjun Zeng

Abstract: Humanoid robots are drawing significant attention as versatile platforms for complex motor control, human-robot interaction, and general-purpose physical intelligence. However, achieving efficient whole-body control (WBC) in humanoids remains a fundamental challenge due to sophisticated dynamics, underactuation, and diverse task requirements. While learning-based controllers have shown promise for… ▽ More Humanoid robots are drawing significant attention as versatile platforms for complex motor control, human-robot interaction, and general-purpose physical intelligence. However, achieving efficient whole-body control (WBC) in humanoids remains a fundamental challenge due to sophisticated dynamics, underactuation, and diverse task requirements. While learning-based controllers have shown promise for complex tasks, their reliance on labor-intensive and costly retraining for new scenarios limits real-world applicability. To address these limitations, behavior(al) foundation models (BFMs) have emerged as a new paradigm that leverages large-scale pretraining to learn reusable primitive skills and behavioral priors, enabling zero-shot or rapid adaptation to a wide range of downstream tasks. In this paper, we present a comprehensive overview of BFMs for humanoid WBC, tracing their development across diverse pre-training pipelines. Furthermore, we discuss real-world applications, current limitations, urgent challenges, and future opportunities, positioning BFMs as a key approach toward scalable and general-purpose humanoid intelligence. Finally, we provide a curated and long-term list of BFM papers and projects to facilitate more subsequent research, which is available at https://github.com/yuanmingqi/awesome-bfm-papers. △ Less

Submitted 25 June, 2025; originally announced June 2025.

Comments: 19 pages, 8 figures

arXiv:2506.20388 [pdf]

A Novel Large Vision Foundation Model (LVFM)-based Approach for Generating High-Resolution Canopy Height Maps in Plantations for Precision Forestry Management

Authors: Shen Tan, Xin Zhang, Liangxiu Han, Huaguo Huang, Han Wang

Abstract: Accurate, cost-effective monitoring of plantation aboveground biomass (AGB) is crucial for supporting local livelihoods and carbon sequestration initiatives like the China Certified Emission Reduction (CCER) program. High-resolution canopy height maps (CHMs) are essential for this, but standard lidar-based methods are expensive. While deep learning with RGB imagery offers an alternative, accuratel… ▽ More Accurate, cost-effective monitoring of plantation aboveground biomass (AGB) is crucial for supporting local livelihoods and carbon sequestration initiatives like the China Certified Emission Reduction (CCER) program. High-resolution canopy height maps (CHMs) are essential for this, but standard lidar-based methods are expensive. While deep learning with RGB imagery offers an alternative, accurately extracting canopy height features remains challenging. To address this, we developed a novel model for high-resolution CHM generation using a Large Vision Foundation Model (LVFM). Our model integrates a feature extractor, a self-supervised feature enhancement module to preserve spatial details, and a height estimator. Tested in Beijing's Fangshan District using 1-meter Google Earth imagery, our model outperformed existing methods, including conventional CNNs. It achieved a mean absolute error of 0.09 m, a root mean square error of 0.24 m, and a correlation of 0.78 against lidar-based CHMs. The resulting CHMs enabled over 90% success in individual tree detection, high accuracy in AGB estimation, and effective tracking of plantation growth, demonstrating strong generalization to non-training areas. This approach presents a promising, scalable tool for evaluating carbon sequestration in both plantations and natural forests. △ Less

Submitted 25 June, 2025; originally announced June 2025.

arXiv:2506.20101 [pdf, ps, other]

Secure Multi-Key Homomorphic Encryption with Application to Privacy-Preserving Federated Learning

Authors: Jiahui Wu, Tiecheng Sun, Fucai Luo, Haiyan Wang, Weizhe Zhang

Abstract: Multi-Key Homomorphic Encryption (MKHE), proposed by Lopez-Alt et al. (STOC 2012), allows for performing arithmetic computations directly on ciphertexts encrypted under distinct keys. Subsequent works by Chen and Dai et al. (CCS 2019) and Kim and Song et al. (CCS 2023) extended this concept by proposing multi-key BFV/CKKS variants, referred to as the CDKS scheme. These variants incorporate asympto… ▽ More Multi-Key Homomorphic Encryption (MKHE), proposed by Lopez-Alt et al. (STOC 2012), allows for performing arithmetic computations directly on ciphertexts encrypted under distinct keys. Subsequent works by Chen and Dai et al. (CCS 2019) and Kim and Song et al. (CCS 2023) extended this concept by proposing multi-key BFV/CKKS variants, referred to as the CDKS scheme. These variants incorporate asymptotically optimal techniques to facilitate secure computation across multiple data providers. In this paper, we identify a critical security vulnerability in the CDKS scheme when applied to multiparty secure computation tasks, such as privacy-preserving federated learning (PPFL). In particular, we show that CDKS may inadvertently leak plaintext information from one party to others. To mitigate this issue, we propose a new scheme, SMHE (Secure Multi-Key Homomorphic Encryption), which incorporates a novel masking mechanism into the multi-key BFV and CKKS frameworks to ensure that plaintexts remain confidential throughout the computation. We implement a PPFL application using SMHE and demonstrate that it provides significantly improved security with only a modest overhead in homomorphic evaluation. For instance, our PPFL model based on multi-key CKKS incurs less than a 2\times runtime and communication traffic increase compared to the CDKS-based PPFL model. The code is publicly available at https://github.com/JiahuiWu2022/SMHE.git. △ Less

Submitted 24 June, 2025; originally announced June 2025.

arXiv:2506.19816 [pdf, ps, other]

CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation

Authors: Hao Li, Shuai Yang, Yilun Chen, Yang Tian, Xiaoda Yang, Xinyi Chen, Hanqing Wang, Tai Wang, Feng Zhao, Dahua Lin, Jiangmiao Pang

Abstract: Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong generalization across manipulation tasks. However, they remain constrained by a single-frame observation paradigm and cannot fully benefit from the motion information offered by aggregated multi-frame historical observations, as the large vision-language backbone introduces substan… ▽ More Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong generalization across manipulation tasks. However, they remain constrained by a single-frame observation paradigm and cannot fully benefit from the motion information offered by aggregated multi-frame historical observations, as the large vision-language backbone introduces substantial computational cost and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm through an efficient post-training stage. CronusVLA comprises three key components: (1) single-frame pretraining on large-scale embodied datasets with autoregressive action tokens prediction, which establishes an embodied vision-language foundation; (2) multi-frame encoding, adapting the prediction of vision-language backbones from discrete action tokens to motion features during post-training, and aggregating motion features from historical frames into a feature chunking; (3) cross-frame decoding, which maps the feature chunking to accurate actions via a shared decoder with cross-attention. By reducing redundant token computation and caching past motion features, CronusVLA achieves efficient inference. As an application of motion features, we further propose an action adaptation mechanism based on feature-action retrieval to improve model performance during finetuning. CronusVLA achieves state-of-the-art performance on SimplerEnv with 70.9% success rate, and 12.7% improvement over OpenVLA on LIBERO. Real-world Franka experiments also show the strong performance and robustness. △ Less

Submitted 24 June, 2025; originally announced June 2025.

Comments: 36 pages, 21 figures

arXiv:2506.19469 [pdf, ps, other]

Surgery-R1: Advancing Surgical-VQLA with Reasoning Multimodal Large Language Model via Reinforcement Learning

Authors: Pengfei Hao, Shuaibo Li, Hongqiu Wang, Zhizhuo Kou, Junhang Zhang, Guang Yang, Lei Zhu

Abstract: In recent years, significant progress has been made in the field of surgical scene understanding, particularly in the task of Visual Question Localized-Answering in robotic surgery (Surgical-VQLA). However, existing Surgical-VQLA models lack deep reasoning capabilities and interpretability in surgical scenes, which limits their reliability and potential for development in clinical applications. To… ▽ More In recent years, significant progress has been made in the field of surgical scene understanding, particularly in the task of Visual Question Localized-Answering in robotic surgery (Surgical-VQLA). However, existing Surgical-VQLA models lack deep reasoning capabilities and interpretability in surgical scenes, which limits their reliability and potential for development in clinical applications. To address this issue, inspired by the development of Reasoning Multimodal Large Language Models (MLLMs), we first build the Surgery-R1-54k dataset, including paired data for Visual-QA, Grounding-QA, and Chain-of-Thought (CoT). Then, we propose the first Reasoning MLLM for Surgical-VQLA (Surgery-R1). In our Surgery-R1, we design a two-stage fine-tuning mechanism to enable the basic MLLM with complex reasoning abilities by utilizing supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). Furthermore, for an efficient and high-quality rule-based reward system in our RFT, we design a Multimodal Coherence reward mechanism to mitigate positional illusions that may arise in surgical scenarios. Experiment results demonstrate that Surgery-R1 outperforms other existing state-of-the-art (SOTA) models in the Surgical-VQLA task and widely-used MLLMs, while also validating its reasoning capabilities and the effectiveness of our approach. The code and dataset will be organized in https://github.com/FiFi-HAO467/Surgery-R1. △ Less

Submitted 24 June, 2025; originally announced June 2025.

arXiv:2506.19246 [pdf]

Behavioral Anomaly Detection in Distributed Systems via Federated Contrastive Learning

Authors: Renzi Meng, Heyi Wang, Yumeng Sun, Qiyuan Wu, Lian Lian, Renhan Zhang

Abstract: This paper addresses the increasingly prominent problem of anomaly detection in distributed systems. It proposes a detection method based on federated contrastive learning. The goal is to overcome the limitations of traditional centralized approaches in terms of data privacy, node heterogeneity, and anomaly pattern recognition. The proposed method combines the distributed collaborative modeling ca… ▽ More This paper addresses the increasingly prominent problem of anomaly detection in distributed systems. It proposes a detection method based on federated contrastive learning. The goal is to overcome the limitations of traditional centralized approaches in terms of data privacy, node heterogeneity, and anomaly pattern recognition. The proposed method combines the distributed collaborative modeling capabilities of federated learning with the feature discrimination enhancement of contrastive learning. It builds embedding representations on local nodes and constructs positive and negative sample pairs to guide the model in learning a more discriminative feature space. Without exposing raw data, the method optimizes a global model through a federated aggregation strategy. Specifically, the method uses an encoder to represent local behavior data in high-dimensional space. This includes system logs, operational metrics, and system calls. The model is trained using both contrastive loss and classification loss to improve its ability to detect fine-grained anomaly patterns. The method is evaluated under multiple typical attack types. It is also tested in a simulated real-time data stream scenario to examine its responsiveness. Experimental results show that the proposed method outperforms existing approaches across multiple performance metrics. It demonstrates strong detection accuracy and adaptability, effectively addressing complex anomalies in distributed environments. Through careful design of key modules and optimization of the training mechanism, the proposed method achieves a balance between privacy preservation and detection performance. It offers a feasible technical path for intelligent security management in distributed systems. △ Less

Submitted 23 June, 2025; originally announced June 2025.

arXiv:2506.18898 [pdf, ps, other]

Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations

Authors: Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, Lu Jiang

Abstract: This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary. By integrating vision and text into a unified space with an expan… ▽ More This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary. By integrating vision and text into a unified space with an expanded vocabulary, our multimodal LLM, Tar, enables cross-modal input and output through a shared interface, without the need for modality-specific designs. Additionally, we propose scale-adaptive encoding and decoding to balance efficiency and visual detail, along with a generative de-tokenizer to produce high-fidelity visual outputs. To address diverse decoding needs, we utilize two complementary de-tokenizers: a fast autoregressive model and a diffusion-based model. To enhance modality fusion, we investigate advanced pre-training tasks, demonstrating improvements in both visual understanding and generation. Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency. Code, models, and data are available at https://tar.csuhan.com △ Less

Submitted 23 June, 2025; originally announced June 2025.

Comments: Project page: https://tar.csuhan.com

arXiv:2506.18727 [pdf, other]

AutoGraph: A Knowledge-Graph Framework for Modeling Interface Interaction and Automating Procedure Execution in Digital Nuclear Control Rooms

Authors: Xingyu Xiao, Jiejuan Tong, Jun Sun, Zhe Sui, Jingang Liang, Hongru Zhao, Jun Zhao, Haitao Wang

Abstract: Digitalization in nuclear power plant (NPP) control rooms is reshaping how operators interact with procedures and interface elements. However, existing computer-based procedures (CBPs) often lack semantic integration with human-system interfaces (HSIs), limiting their capacity to support intelligent automation and increasing the risk of human error, particularly under dynamic or complex operating… ▽ More Digitalization in nuclear power plant (NPP) control rooms is reshaping how operators interact with procedures and interface elements. However, existing computer-based procedures (CBPs) often lack semantic integration with human-system interfaces (HSIs), limiting their capacity to support intelligent automation and increasing the risk of human error, particularly under dynamic or complex operating conditions. In this study, we present AutoGraph, a knowledge-graph-based framework designed to formalize and automate procedure execution in digitalized NPP environments.AutoGraph integrates (1) a proposed HTRPM tracking module to capture operator interactions and interface element locations; (2) an Interface Element Knowledge Graph (IE-KG) encoding spatial, semantic, and structural properties of HSIs; (3) automatic mapping from textual procedures to executable interface paths; and (4) an execution engine that maps textual procedures to executable interface paths. This enables the identification of cognitively demanding multi-action steps and supports fully automated execution with minimal operator input. We validate the framework through representative control room scenarios, demonstrating significant reductions in task completion time and the potential to support real-time human reliability assessment. Further integration into dynamic HRA frameworks (e.g., COGMIF) and real-time decision support systems (e.g., DRIF) illustrates AutoGraph extensibility in enhancing procedural safety and cognitive performance in complex socio-technical systems. △ Less

Submitted 26 May, 2025; originally announced June 2025.

Showing 1–50 of 6,792 results for author: Wang, H