Search | arXiv e-print repository

AVC-DPO: Aligned Video Captioning via Direct Preference Optimization

Authors: Jiyang Tang, Hengyi Li, Yifan Du, Wayne Xin Zhao

Abstract: Although video multimodal large language models (video MLLMs) have achieved substantial progress in video captioning tasks, it remains challenging to adjust the focal emphasis of video captions according to human preferences. To address this limitation, we propose Aligned Video Captioning via Direct Preference Optimization (AVC-DPO), a post-training framework designed to enhance captioning capabil… ▽ More Although video multimodal large language models (video MLLMs) have achieved substantial progress in video captioning tasks, it remains challenging to adjust the focal emphasis of video captions according to human preferences. To address this limitation, we propose Aligned Video Captioning via Direct Preference Optimization (AVC-DPO), a post-training framework designed to enhance captioning capabilities in video MLLMs through preference alignment. Our approach designs enhanced prompts that specifically target temporal dynamics and spatial information-two key factors that humans care about when watching a video-thereby incorporating human-centric preferences. AVC-DPO leverages the same foundation model's caption generation responses under varied prompt conditions to conduct preference-aware training and caption alignment. Using this framework, we have achieved exceptional performance in the LOVE@CVPR'25 Workshop Track 1A: Video Detailed Captioning Challenge, achieving first place on the Video Detailed Captioning (VDC) benchmark according to the VDCSCORE evaluation metric. △ Less

Submitted 2 July, 2025; originally announced July 2025.

arXiv:2507.01449 [pdf, ps, other]

LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation

Authors: Tianyu Liu, Qitan Lv, Hao Li, Xing Gao, Xiao Sun

Abstract: Speculative decoding (SD), where a small draft model is employed to propose draft tokens in advance and then the target model validates them in parallel, has emerged as a promising technique for LLM inference acceleration. Many endeavors to improve SD are to eliminate the need for a draft model and generate draft tokens in a retrieval-based manner in order to further alleviate the drafting overhea… ▽ More Speculative decoding (SD), where a small draft model is employed to propose draft tokens in advance and then the target model validates them in parallel, has emerged as a promising technique for LLM inference acceleration. Many endeavors to improve SD are to eliminate the need for a draft model and generate draft tokens in a retrieval-based manner in order to further alleviate the drafting overhead and significantly reduce the difficulty in deployment and applications. However, retrieval-based SD relies on a matching paradigm to retrieval the most relevant reference as the draft tokens, where these methods often fail to find matched and accurate draft tokens. To address this challenge, we propose LogitSpec to effectively expand the retrieval range and find the most relevant reference as drafts. Our LogitSpec is motivated by the observation that the logit of the last token can not only predict the next token, but also speculate the next next token. Specifically, LogitSpec generates draft tokens in two steps: (1) utilizing the last logit to speculate the next next token; (2) retrieving relevant reference for both the next token and the next next token. LogitSpec is training-free and plug-and-play, which can be easily integrated into existing LLM inference frameworks. Extensive experiments on a wide range of text generation benchmarks demonstrate that LogitSpec can achieve up to 2.61 $\times$ speedup and 3.28 mean accepted tokens per decoding step. Our code is available at https://github.com/smart-lty/LogitSpec. △ Less

Submitted 2 July, 2025; originally announced July 2025.

arXiv:2507.01291 [pdf, ps, other]

PanTS: The Pancreatic Tumor Segmentation Dataset

Authors: Wenxuan Li, Xinze Zhou, Qi Chen, Tianyu Lin, Pedro R. A. S. Bassi, Szymon Plotka, Jaroslaw B. Cwikla, Xiaoxi Chen, Chen Ye, Zheren Zhu, Kai Ding, Heng Li, Kang Wang, Yang Yang, Yucheng Tang, Daguang Xu, Alan L. Yuille, Zongwei Zhou

Abstract: PanTS is a large-scale, multi-institutional dataset curated to advance research in pancreatic CT analysis. It contains 36,390 CT scans from 145 medical centers, with expert-validated, voxel-wise annotations of over 993,000 anatomical structures, covering pancreatic tumors, pancreas head, body, and tail, and 24 surrounding anatomical structures such as vascular/skeletal structures and abdominal/tho… ▽ More PanTS is a large-scale, multi-institutional dataset curated to advance research in pancreatic CT analysis. It contains 36,390 CT scans from 145 medical centers, with expert-validated, voxel-wise annotations of over 993,000 anatomical structures, covering pancreatic tumors, pancreas head, body, and tail, and 24 surrounding anatomical structures such as vascular/skeletal structures and abdominal/thoracic organs. Each scan includes metadata such as patient age, sex, diagnosis, contrast phase, in-plane spacing, slice thickness, etc. AI models trained on PanTS achieve significantly better performance in pancreatic tumor detection, localization, and segmentation compared to those trained on existing public datasets. Our analysis indicates that these gains are directly attributable to the 16x larger-scale tumor annotations and indirectly supported by the 24 additional surrounding anatomical structures. As the largest and most comprehensive resource of its kind, PanTS offers a new benchmark for developing and evaluating AI models in pancreatic CT analysis. △ Less

Submitted 1 July, 2025; originally announced July 2025.

arXiv:2507.01249 [pdf, ps, other]

Search for an Axion-Like Particle in $B\rightarrow K^{(*)} a (\rightarrowγγ)$ Decays at Belle

Authors: Belle, Belle II Collaborations, :, I. Adachi, L. Aggarwal, H. Ahmed, Y. Ahn, H. Aihara, N. Akopov, S. Alghamdi, M. Alhakami, A. Aloisio, N. Althubiti, K. Amos, M. Angelsmark, N. Anh Ky, C. Antonioli, D. M. Asner, H. Atmacan, T. Aushev, V. Aushev, M. Aversano, R. Ayad, V. Babu, H. Bae , et al. (400 additional authors not shown)

Abstract: We report a search for an axion-like particle $a$ in $B\rightarrow K^{(*)} a (\rightarrowγγ)$ decays using data collected with the Belle detector at the KEKB asymmetric energy electron-positron collider. The search is based on a $711 \mathrm{fb^{-1}}$ data sample collected at the $Υ4S$ resonance energy, corresponding to a sample of $772\times10^6$ $Υ4S$ events. In this study, we search for the dec… ▽ More We report a search for an axion-like particle $a$ in $B\rightarrow K^{(*)} a (\rightarrowγγ)$ decays using data collected with the Belle detector at the KEKB asymmetric energy electron-positron collider. The search is based on a $711 \mathrm{fb^{-1}}$ data sample collected at the $Υ4S$ resonance energy, corresponding to a sample of $772\times10^6$ $Υ4S$ events. In this study, we search for the decay of the axion-like particle into a pair of photons, $a \rightarrow γγ$. We scan the two-photon invariant mass in the range $0.16\ \mathrm{GeV/}c^2-4.50\ \mathrm{GeV}/c^2$ for the $K$ modes and $0.16\ \mathrm{GeV/}c^2-4.20\ \mathrm{GeV}/c^2$ for the $K^{*}$ modes. No significant signal is observed in any of the modes, and 90\% confidence level upper limits are established on the coupling to the $W$ boson, $g_aW$, as a function of $a$ mass. The limits range from $3 \times 10^{-6} \mathrm{GeV}^{-1}$ to $3 \times 10^{-5} \mathrm{GeV}^{-1}$, improving the current constraints on $g_aW$ by a factor of two over the most stringent previous experimental results. △ Less

Submitted 3 July, 2025; v1 submitted 1 July, 2025; originally announced July 2025.

Comments: 26 pages, 15 Figures

Report number: Belle II Preprint: 2025-017 KEK Preprint: 2025-16

arXiv:2507.00790 [pdf, ps, other]

LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling

Authors: Huaqiu Li, Yong Wang, Tongwen Huang, Hailang Huang, Haoqian Wang, Xiangxiang Chu

Abstract: Unified image restoration is a significantly challenging task in low-level vision. Existing methods either make tailored designs for specific tasks, limiting their generalizability across various types of degradation, or rely on training with paired datasets, thereby suffering from closed-set constraints. To address these issues, we propose a novel, dataset-free, and unified approach through recur… ▽ More Unified image restoration is a significantly challenging task in low-level vision. Existing methods either make tailored designs for specific tasks, limiting their generalizability across various types of degradation, or rely on training with paired datasets, thereby suffering from closed-set constraints. To address these issues, we propose a novel, dataset-free, and unified approach through recurrent posterior sampling utilizing a pretrained latent diffusion model. Our method incorporates the multimodal understanding model to provide sematic priors for the generative model under a task-blind condition. Furthermore, it utilizes a lightweight module to align the degraded input with the generated preference of the diffusion model, and employs recurrent refinement for posterior sampling. Extensive experiments demonstrate that our method outperforms state-of-the-art methods, validating its effectiveness and robustness. Our code and data will be available at https://github.com/AMAP-ML/LD-RPS. △ Less

Submitted 4 July, 2025; v1 submitted 1 July, 2025; originally announced July 2025.

arXiv:2507.00748 [pdf, ps, other]

Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning

Authors: Bob Zhang, Haoran Li, Tao Zhang, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yanbin Hao

Abstract: Recently, Multimodal Large Language Models (MLLMs) excel at visual grounding in single-image scenarios with textual references. However, their performance degrades when handling real-world applications involving complex multi-image compositions and multimodal instructions, which reveals limitations in cross-image reasoning and generalization. To address these challenges, we adopt a Reinforcement L… ▽ More Recently, Multimodal Large Language Models (MLLMs) excel at visual grounding in single-image scenarios with textual references. However, their performance degrades when handling real-world applications involving complex multi-image compositions and multimodal instructions, which reveals limitations in cross-image reasoning and generalization. To address these challenges, we adopt a Reinforcement Learning (RL) based post-training strategy to improve the reasoning performance of MLLMs in multi-image grounding tasks. Our approach begins with synthesizing high-quality chain-of-thought (CoT) data for cold-start initialization, followed by supervised fine-tuning (SFT) using low-rank adaptation (LoRA). The cold-start training stage enables the model to identify correct solutions. Subsequently, we perform rejection sampling using the merged SFT model to curate high-quality RL data and leverage rule-based RL to guide the model toward optimal reasoning paths. Extensive experimental results demonstrate the effectiveness of our approach, achieving +9.04\% improvements on MIG-Bench and +4.98\% improvements on several out-of-domain reasoning grounding benchmarks over the SFT baseline. Furthermore, our approach exhibits strong generalization in multi-image perception, with gains of +3.1\% and +2.4\% over the base model on subsets of the BLINK and MMIU benchmarks, respectively. △ Less

Submitted 1 July, 2025; originally announced July 2025.

Comments: 11 pages

arXiv:2507.00557 [pdf, ps, other]

Advancing Local Search in SMT-NRA with MCSAT Integration

Authors: Tianyi Ding, Haokun Li, Xinpeng Ni, Bican Xia, Tianqi Zhao

Abstract: In this paper, we advance local search for Satisfiability Modulo the Theory of Nonlinear Real Arithmetic (SMT-NRA for short). First, we introduce a two-dimensional cell-jump move, called \emph{$2d$-cell-jump}, generalizing the key operation, cell-jump, of the local search method for SMT-NRA. Then, we propose an extended local search framework, named \emph{$2d$-LS} (following the local search frame… ▽ More In this paper, we advance local search for Satisfiability Modulo the Theory of Nonlinear Real Arithmetic (SMT-NRA for short). First, we introduce a two-dimensional cell-jump move, called \emph{$2d$-cell-jump}, generalizing the key operation, cell-jump, of the local search method for SMT-NRA. Then, we propose an extended local search framework, named \emph{$2d$-LS} (following the local search framework, LS, for SMT-NRA), integrating the model constructing satisfiability calculus (MCSAT) framework to improve search efficiency. To further improve the efficiency of MCSAT, we implement a recently proposed technique called \emph{sample-cell projection operator} for MCSAT, which is well suited for CDCL-style search in the real domain and helps guide the search away from conflicting states. Finally, we design a hybrid framework for SMT-NRA combining MCSAT, $2d$-LS and OpenCAD, to improve search efficiency through information exchange. The experimental results demonstrate improvements in local search performance, highlighting the effectiveness of the proposed methods. △ Less

Submitted 1 July, 2025; originally announced July 2025.

arXiv:2507.00356 [pdf]

CGEarthEye:A High-Resolution Remote Sensing Vision Foundation Model Based on the Jilin-1 Satellite Constellation

Authors: Zhiwei Yi, Xin Cheng, Jingyu Ma, Ruifei Zhu, Junwei Tian, Yuanxiu Zhou, Xinge Zhao, Hongzhe Li

Abstract: Deep learning methods have significantly advanced the development of intelligent rinterpretation in remote sensing (RS), with foundational model research based on large-scale pre-training paradigms rapidly reshaping various domains of Earth Observation (EO). However, compared to the open accessibility and high spatiotemporal coverage of medium-resolution data, the limited acquisition channels for… ▽ More Deep learning methods have significantly advanced the development of intelligent rinterpretation in remote sensing (RS), with foundational model research based on large-scale pre-training paradigms rapidly reshaping various domains of Earth Observation (EO). However, compared to the open accessibility and high spatiotemporal coverage of medium-resolution data, the limited acquisition channels for ultra-high-resolution optical RS imagery have constrained the progress of high-resolution remote sensing vision foundation models (RSVFM). As the world's largest sub-meter-level commercial RS satellite constellation, the Jilin-1 constellation possesses abundant sub-meter-level image resources. This study proposes CGEarthEye, a RSVFM framework specifically designed for Jilin-1 satellite characteristics, comprising five backbones with different parameter scales with totaling 2.1 billion parameters. To enhance the representational capacity of the foundation model, we developed JLSSD, the first 15-million-scale multi-temporal self-supervised learning (SSL) dataset featuring global coverage with quarterly temporal sampling within a single year, constructed through multi-level representation clustering and sampling strategies. The framework integrates seasonal contrast, augmentation-based contrast, and masked patch token contrastive strategies for pre-training. Comprehensive evaluations across 10 benchmark datasets covering four typical RS tasks demonstrate that the CGEarthEye consistently achieves state-of-the-art (SOTA) performance. Further analysis reveals CGEarthEye's superior characteristics in feature visualization, model convergence, parameter efficiency, and practical mapping applications. This study anticipates that the exceptional representation capabilities of CGEarthEye will facilitate broader and more efficient applications of Jilin-1 data in traditional EO application. △ Less

Submitted 30 June, 2025; originally announced July 2025.

Comments: A Remote Sensing Fundation Model for Very High Resolution Images

arXiv:2507.00111 [pdf, ps, other]

Sudakov evolution without unitarity

Authors: Javira Altmann, Hai Tao Li, Ludovic Scyboz, Peter Skands

Abstract: We present a method for sampling singular functions defined on (nested) multi-particle phase spaces, based on a generalisation of parton-shower phase-space generation techniques. At the heart of the method are three key ingredients: 1) the Sudakov sampling by which shower-style calculations sweep across phase space in an ordered manner, from hard to soft; 2) the sequential nesting of multiparticle… ▽ More We present a method for sampling singular functions defined on (nested) multi-particle phase spaces, based on a generalisation of parton-shower phase-space generation techniques. At the heart of the method are three key ingredients: 1) the Sudakov sampling by which shower-style calculations sweep across phase space in an ordered manner, from hard to soft; 2) the sequential nesting of multiparticle phase spaces; and 3) the factorisations obeyed by singular multiparton amplitudes on the edges of these phase spaces. We demonstrate a C++ implementation of the proposed algorithm, dubbed Sunshine, for hadronic Z decays, and use it to test the tree-level accuracy of the Vincia sector shower through $\mathcal{O}(α_s^2)$. △ Less

Submitted 30 June, 2025; originally announced July 2025.

Comments: 30 pages, 13 figures, 1 appendix

arXiv:2507.00025 [pdf, ps, other]

Generalizing to New Dynamical Systems via Frequency Domain Adaptation

Authors: Tiexin Qin, Hong Yan, Haoliang Li

Abstract: Learning the underlying dynamics from data with deep neural networks has shown remarkable potential in modeling various complex physical dynamics. However, current approaches are constrained in their ability to make reliable predictions in a specific domain and struggle with generalizing to unseen systems that are governed by the same general dynamics but differ in environmental characteristics. I… ▽ More Learning the underlying dynamics from data with deep neural networks has shown remarkable potential in modeling various complex physical dynamics. However, current approaches are constrained in their ability to make reliable predictions in a specific domain and struggle with generalizing to unseen systems that are governed by the same general dynamics but differ in environmental characteristics. In this work, we formulate a parameter-efficient method, Fourier Neural Simulator for Dynamical Adaptation (FNSDA), that can readily generalize to new dynamics via adaptation in the Fourier space. Specifically, FNSDA identifies the shareable dynamics based on the known environments using an automatic partition in Fourier modes and learns to adjust the modes specific for each new environment by conditioning on low-dimensional latent systematic parameters for efficient generalization. We evaluate our approach on four representative families of dynamic systems, and the results show that FNSDA can achieve superior or competitive generalization performance compared to existing methods with a significantly reduced parameter cost. Our code is available at https://github.com/WonderSeven/FNSDA. △ Less

Submitted 17 June, 2025; originally announced July 2025.

Comments: Accepted by TPAMI 2025

arXiv:2506.24118 [pdf, ps, other]

Scaling Human Judgment in Community Notes with LLMs

Authors: Haiwen Li, Soham De, Manon Revel, Andreas Haupt, Brad Miller, Keith Coleman, Jay Baxter, Martin Saveski, Michiel A. Bakker

Abstract: This paper argues for a new paradigm for Community Notes in the LLM era: an open ecosystem where both humans and LLMs can write notes, and the decision of which notes are helpful enough to show remains in the hands of humans. This approach can accelerate the delivery of notes, while maintaining trust and legitimacy through Community Notes' foundational principle: A community of diverse human rater… ▽ More This paper argues for a new paradigm for Community Notes in the LLM era: an open ecosystem where both humans and LLMs can write notes, and the decision of which notes are helpful enough to show remains in the hands of humans. This approach can accelerate the delivery of notes, while maintaining trust and legitimacy through Community Notes' foundational principle: A community of diverse human raters collectively serve as the ultimate evaluator and arbiter of what is helpful. Further, the feedback from this diverse community can be used to improve LLMs' ability to produce accurate, unbiased, broadly helpful notes--what we term Reinforcement Learning from Community Feedback (RLCF). This becomes a two-way street: LLMs serve as an asset to humans--helping deliver context quickly and with minimal effort--while human feedback, in turn, enhances the performance of LLMs. This paper describes how such a system can work, its benefits, key new risks and challenges it introduces, and a research agenda to solve those challenges and realize the potential of this approach. △ Less

Submitted 30 June, 2025; originally announced June 2025.

arXiv:2506.24045 [pdf, ps, other]

Agent.xpu: Efficient Scheduling of Agentic LLM Workloads on Heterogeneous SoC

Authors: Xinming Wei, Jiahao Zhang, Haoran Li, Jiayu Chen, Rui Qu, Maoliang Li, Xiang Chen, Guojie Luo

Abstract: The proliferation of agentic Large Language Models (LLMs) on personal devices introduces a new class of workloads characterized by a dichotomy of objectives. Reactive tasks, initiated by users, demand immediate, low-latency responses, while proactive tasks operate invisibly and prioritize throughput. Existing on-device LLM engines, designed for isolated inferences, fail to efficiently manage these… ▽ More The proliferation of agentic Large Language Models (LLMs) on personal devices introduces a new class of workloads characterized by a dichotomy of objectives. Reactive tasks, initiated by users, demand immediate, low-latency responses, while proactive tasks operate invisibly and prioritize throughput. Existing on-device LLM engines, designed for isolated inferences, fail to efficiently manage these concurrent and conflicting requests on consumer-grade heterogeneous SoCs with CPU, integrated GPU, and NPU. This paper introduces Agent.xpu, an efficient serving system for agentic LLM workloads on memory-unified heterogeneous SoCs. With dedicated offline profiling, Agent.xpu first constructs a heterogeneous execution graph, which fuses and chunks model kernels for affinity-guided, elastic accelerator mapping with predictive kernel annotation. At runtime, its online scheduler enables fine-grained, kernel-level preemption to guarantee the responsiveness of reactive tasks. To maximize SoC utilization, it adopts slack-aware kernel backfill to opportunistically append proactive tasks, and mitigates NPU-iGPU contention via bandwidth-aware dispatch. Evaluation on an Intel Core Ultra SoC shows that Agent.xpu achieves 4.6$\times$ lower latency for reactive tasks and sustains 1.6$\times$-6.8$\times$ higher throughput for proactive tasks compared to state-of-the-art inference engines. △ Less

Submitted 30 June, 2025; originally announced June 2025.

arXiv:2506.23673 [pdf, ps, other]

HASD: Hierarchical Adaption for pathology Slide-level Domain-shift

Authors: Jingsong Liu, Han Li, Chen Yang, Michael Deutges, Ario Sadafi, Xin You, Katharina Breininger, Nassir Navab, Peter J. Schüffler

Abstract: Domain shift is a critical problem for pathology AI as pathology data is heavily influenced by center-specific conditions. Current pathology domain adaptation methods focus on image patches rather than WSI, thus failing to capture global WSI features required in typical clinical scenarios. In this work, we address the challenges of slide-level domain shift by proposing a Hierarchical Adaptation fr… ▽ More Domain shift is a critical problem for pathology AI as pathology data is heavily influenced by center-specific conditions. Current pathology domain adaptation methods focus on image patches rather than WSI, thus failing to capture global WSI features required in typical clinical scenarios. In this work, we address the challenges of slide-level domain shift by proposing a Hierarchical Adaptation framework for Slide-level Domain-shift (HASD). HASD achieves multi-scale feature consistency and computationally efficient slide-level domain adaptation through two key components: (1) a hierarchical adaptation framework that integrates a Domain-level Alignment Solver for feature alignment, a Slide-level Geometric Invariance Regularization to preserve the morphological structure, and a Patch-level Attention Consistency Regularization to maintain local critical diagnostic cues; and (2) a prototype selection mechanism that reduces computational overhead. We validate our method on two slide-level tasks across five datasets, achieving a 4.1\% AUROC improvement in a Breast Cancer HER2 Grading cohort and a 3.9\% C-index gain in a UCEC survival prediction cohort. Our method provides a practical and reliable slide-level domain adaption solution for pathology institutions, minimizing both computational and annotation costs. △ Less

Submitted 30 June, 2025; originally announced June 2025.

arXiv:2506.23623 [pdf, ps, other]

Revisiting Audio-Visual Segmentation with Vision-Centric Transformer

Authors: Shaofei Huang, Rui Ling, Tianrui Hui, Hongyu Li, Xu Zhou, Shifeng Zhang, Si Liu, Richang Hong, Meng Wang

Abstract: Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in video frames based on the associated audio signal. Prevailing AVS methods typically adopt an audio-centric Transformer architecture, where object queries are derived from audio features. However, audio-centric Transformers suffer from two limitations: perception ambiguity caused by the mixed nature of audio, and weakened de… ▽ More Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in video frames based on the associated audio signal. Prevailing AVS methods typically adopt an audio-centric Transformer architecture, where object queries are derived from audio features. However, audio-centric Transformers suffer from two limitations: perception ambiguity caused by the mixed nature of audio, and weakened dense prediction ability due to visual detail loss. To address these limitations, we propose a new Vision-Centric Transformer (VCT) framework that leverages vision-derived queries to iteratively fetch corresponding audio and visual information, enabling queries to better distinguish between different sounding objects from mixed audio and accurately delineate their contours. Additionally, we also introduce a Prototype Prompted Query Generation (PPQG) module within our VCT framework to generate vision-derived queries that are both semantically aware and visually rich through audio prototype prompting and pixel context grouping, facilitating audio-visual information aggregation. Extensive experiments demonstrate that our VCT framework achieves new state-of-the-art performances on three subsets of the AVSBench dataset. The code is available at https://github.com/spyflying/VCT_AVS. △ Less

Submitted 30 June, 2025; originally announced June 2025.

Comments: Accepted by CVPR 2025; Code: https://github.com/spyflying/VCT_AVS; Models: https://huggingface.co/nowherespyfly/VCT_AVS

arXiv:2506.23543 [pdf, ps, other]

Pyramidal Patchification Flow for Visual Generation

Authors: Hui Li, Baoyou Chen, Liwei Zhang, Jiaye Li, Jingdong Wang, Siyu Zhu

Abstract: Diffusion transformers (DiTs) adopt Patchify, mapping patch representations to token representations through linear projections, to adjust the number of tokens input to DiT blocks and thus the computation cost. Instead of a single patch size for all the timesteps, we introduce a Pyramidal Patchification Flow (PPFlow) approach: Large patch sizes are used for high noise timesteps and small patch siz… ▽ More Diffusion transformers (DiTs) adopt Patchify, mapping patch representations to token representations through linear projections, to adjust the number of tokens input to DiT blocks and thus the computation cost. Instead of a single patch size for all the timesteps, we introduce a Pyramidal Patchification Flow (PPFlow) approach: Large patch sizes are used for high noise timesteps and small patch sizes for low noise timesteps; Linear projections are learned for each patch size; and Unpatchify is accordingly modified. Unlike Pyramidal Flow, our approach operates over full latent representations other than pyramid representations, and adopts the normal denoising process without requiring the renoising trick. We demonstrate the effectiveness of our approach through two training manners. Training from scratch achieves a $1.6\times$ ($2.0\times$) inference speed over SiT-B/2 for 2-level (3-level) pyramid patchification with slightly lower training FLOPs and similar image generation performance. Training from pretrained normal DiTs achieves even better performance with small training time. The code and checkpoint are at https://github.com/fudan-generative-vision/PPFlow. △ Less

Submitted 30 June, 2025; originally announced June 2025.

Comments: 10 pages, 9figures

arXiv:2506.23487 [pdf, ps, other]

Test of partial effects for Frechet regression on Bures-Wasserstein manifolds

Authors: Haoshu Xu, Hongzhe Li

Abstract: We propose a novel test for assessing partial effects in Frechet regression on Bures Wasserstein manifolds. Our approach employs a sample splitting strategy: the first subsample is used to fit the Frechet regression model, yielding estimates of the covariance matrices and their associated optimal transport maps, while the second subsample is used to construct the test statistic. We prove that this… ▽ More We propose a novel test for assessing partial effects in Frechet regression on Bures Wasserstein manifolds. Our approach employs a sample splitting strategy: the first subsample is used to fit the Frechet regression model, yielding estimates of the covariance matrices and their associated optimal transport maps, while the second subsample is used to construct the test statistic. We prove that this statistic converges in distribution to a weighted mixture of chi squared components, where the weights correspond to the eigenvalues of an integral operator defined by an appropriate RKHS kernel. We establish that our procedure achieves the nominal asymptotic size and demonstrate that its worst-case power converges uniformly to one. Through extensive simulations and a real data application, we illustrate the test's finite-sample accuracy and practical utility. △ Less

Submitted 29 June, 2025; originally announced June 2025.

arXiv:2506.23201 [pdf, ps, other]

External Data-Enhanced Meta-Representation for Adaptive Probabilistic Load Forecasting

Authors: Haoran Li, Muhao Guo, Marija Ilic, Yang Weng, Guangchun Ruan

Abstract: Accurate residential load forecasting is critical for power system reliability with rising renewable integration and demand-side flexibility. However, most statistical and machine learning models treat external factors, such as weather, calendar effects, and pricing, as extra input, ignoring their heterogeneity, and thus limiting the extraction of useful external information. We propose a paradigm… ▽ More Accurate residential load forecasting is critical for power system reliability with rising renewable integration and demand-side flexibility. However, most statistical and machine learning models treat external factors, such as weather, calendar effects, and pricing, as extra input, ignoring their heterogeneity, and thus limiting the extraction of useful external information. We propose a paradigm shift: external data should serve as meta-knowledge to dynamically adapt the forecasting model itself. Based on this idea, we design a meta-representation framework using hypernetworks that modulate selected parameters of a base Deep Learning (DL) model in response to external conditions. This provides both expressivity and adaptability. We further integrate a Mixture-of-Experts (MoE) mechanism to enhance efficiency through selective expert activation, while improving robustness by filtering redundant external inputs. The resulting model, dubbed as a Meta Mixture of Experts for External data (M2oE2), achieves substantial improvements in accuracy and robustness with limited additional overhead, outperforming existing state-of-the-art methods in diverse load datasets. The dataset and source code are publicly available at https://github.com/haorandd/M2oE2\_load\_forecast.git. △ Less

Submitted 29 June, 2025; originally announced June 2025.

Comments: 10 pages

arXiv:2506.23125 [pdf, ps, other]

Learning Motion Skills with Adaptive Assistive Curriculum Force in Humanoid Robots

Authors: Zhanxiang Cao, Yang Zhang, Buqing Nie, Huangxuan Lin, Haoyang Li, Yue Gao

Abstract: Learning policies for complex humanoid tasks remains both challenging and compelling. Inspired by how infants and athletes rely on external support--such as parental walkers or coach-applied guidance--to acquire skills like walking, dancing, and performing acrobatic flips, we propose A2CF: Adaptive Assistive Curriculum Force for humanoid motion learning. A2CF trains a dual-agent system, in which a… ▽ More Learning policies for complex humanoid tasks remains both challenging and compelling. Inspired by how infants and athletes rely on external support--such as parental walkers or coach-applied guidance--to acquire skills like walking, dancing, and performing acrobatic flips, we propose A2CF: Adaptive Assistive Curriculum Force for humanoid motion learning. A2CF trains a dual-agent system, in which a dedicated assistive force agent applies state-dependent forces to guide the robot through difficult initial motions and gradually reduces assistance as the robot's proficiency improves. Across three benchmarks--bipedal walking, choreographed dancing, and backflip--A2CF achieves convergence 30% faster than baseline methods, lowers failure rates by over 40%, and ultimately produces robust, support-free policies. Real-world experiments further demonstrate that adaptively applied assistive forces significantly accelerate the acquisition of complex skills in high-dimensional robotic control. △ Less

Submitted 29 June, 2025; originally announced June 2025.

Comments: 8 pages, 8 figures

arXiv:2506.23086 [pdf, ps, other]

Frequency-enhanced Multi-granularity Context Network for Efficient Vertebrae Segmentation

Authors: Jian Shi, Tianqi You, Pingping Zhang, Hongli Zhang, Rui Xu, Haojie Li

Abstract: Automated and accurate segmentation of individual vertebra in 3D CT and MRI images is essential for various clinical applications. Due to the limitations of current imaging techniques and the complexity of spinal structures, existing methods still struggle with reducing the impact of image blurring and distinguishing similar vertebrae. To alleviate these issues, we introduce a Frequency-enhanced M… ▽ More Automated and accurate segmentation of individual vertebra in 3D CT and MRI images is essential for various clinical applications. Due to the limitations of current imaging techniques and the complexity of spinal structures, existing methods still struggle with reducing the impact of image blurring and distinguishing similar vertebrae. To alleviate these issues, we introduce a Frequency-enhanced Multi-granularity Context Network (FMC-Net) to improve the accuracy of vertebrae segmentation. Specifically, we first apply wavelet transform for lossless downsampling to reduce the feature distortion in blurred images. The decomposed high and low-frequency components are then processed separately. For the high-frequency components, we apply a High-frequency Feature Refinement (HFR) to amplify the prominence of key features and filter out noises, restoring fine-grained details in blurred images. For the low-frequency components, we use a Multi-granularity State Space Model (MG-SSM) to aggregate feature representations with different receptive fields, extracting spatially-varying contexts while capturing long-range dependencies with linear complexity. The utilization of multi-granularity contexts is essential for distinguishing similar vertebrae and improving segmentation accuracy. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches on both CT and MRI vertebrae segmentation datasets. The source code is publicly available at https://github.com/anaanaa/FMCNet. △ Less

Submitted 29 June, 2025; originally announced June 2025.

Comments: Accepted by MICCAI2025. More modifications my be performed

arXiv:2506.23068 [pdf, ps, other]

Curious Causality-Seeking Agents Learn Meta Causal World

Authors: Zhiyu Zhao, Haoxuan Li, Haifeng Zhang, Jun Wang, Francesco Faccio, Jürgen Schmidhuber, Mengyue Yang

Abstract: When building a world model, a common assumption is that the environment has a single, unchanging underlying causal rule, like applying Newton's laws to every situation. In reality, what appears as a drifting causal mechanism is often the manifestation of a fixed underlying mechanism seen through a narrow observational window. This brings about a problem that, when building a world model, even sub… ▽ More When building a world model, a common assumption is that the environment has a single, unchanging underlying causal rule, like applying Newton's laws to every situation. In reality, what appears as a drifting causal mechanism is often the manifestation of a fixed underlying mechanism seen through a narrow observational window. This brings about a problem that, when building a world model, even subtle shifts in policy or environment states can alter the very observed causal mechanisms. In this work, we introduce the \textbf{Meta-Causal Graph} as world models, a minimal unified representation that efficiently encodes the transformation rules governing how causal structures shift across different latent world states. A single Meta-Causal Graph is composed of multiple causal subgraphs, each triggered by meta state, which is in the latent state space. Building on this representation, we introduce a \textbf{Causality-Seeking Agent} whose objectives are to (1) identify the meta states that trigger each subgraph, (2) discover the corresponding causal relationships by agent curiosity-driven intervention policy, and (3) iteratively refine the Meta-Causal Graph through ongoing curiosity-driven exploration and agent experiences. Experiments on both synthetic tasks and a challenging robot arm manipulation task demonstrate that our method robustly captures shifts in causal dynamics and generalizes effectively to previously unseen contexts. △ Less

Submitted 28 June, 2025; originally announced June 2025.

Comments: 33 pages

arXiv:2506.22923 [pdf, ps, other]

Energy-Aware Model Predictive Control for Batch Manufacturing System Scheduling Under Different Electricity Pricing Strategies

Authors: Hongliang Li, Herschel C. Pangborn, Ilya Kovalenko

Abstract: Manufacturing industries are among the highest energy-consuming sectors, facing increasing pressure to reduce energy costs. This paper presents an energy-aware Model Predictive Control (MPC) framework to dynamically schedule manufacturing processes in response to time-varying electricity prices without compromising production goals or violating production constraints. A network-based manufacturing… ▽ More Manufacturing industries are among the highest energy-consuming sectors, facing increasing pressure to reduce energy costs. This paper presents an energy-aware Model Predictive Control (MPC) framework to dynamically schedule manufacturing processes in response to time-varying electricity prices without compromising production goals or violating production constraints. A network-based manufacturing system model is developed to capture complex material flows, batch processing, and capacities of buffers and machines. The scheduling problem is formulated as a Mixed-Integer Quadratic Program (MIQP) that balances energy costs, buffer levels, and production requirements. A case study evaluates the proposed MPC framework under four industrial electricity pricing schemes. Numerical results demonstrate that the approach reduces energy usage expenses while satisfying production goals and adhering to production constraints. The findings highlight the importance of considering the detailed electricity cost structure in manufacturing scheduling decisions and provide practical insights for manufacturers when selecting among different electricity pricing strategies. △ Less

Submitted 28 June, 2025; originally announced June 2025.

arXiv:2506.22824 [pdf, ps, other]

Sensing Security Oriented OFDM-ISAC Against Multi-Intercept Threats

Authors: Lingyun Xu, Bowen Wang, Huiyong Li, Ziyang Cheng

Abstract: In recent years, security has emerged as a critical aspect of integrated sensing and communication (ISAC) systems. While significant research has focused on secure communications, particularly in ensuring physical layer security, the issue of sensing security has received comparatively less attention. This paper addresses the sensing security problem in ISAC, particularly under the threat of multi… ▽ More In recent years, security has emerged as a critical aspect of integrated sensing and communication (ISAC) systems. While significant research has focused on secure communications, particularly in ensuring physical layer security, the issue of sensing security has received comparatively less attention. This paper addresses the sensing security problem in ISAC, particularly under the threat of multi-intercept adversaries. We consider a realistic scenario in which the sensing target is an advanced electronic reconnaissance aircraft capable of employing multiple signal interception techniques, such as power detection (PD) and cyclostationary analysis (CA). To evaluate sensing security under such sophisticated threats, we analyze two critical features of the transmitted signal: (i) power distribution and (ii) cyclic spectrum. Further, we introduce a novel ergodic cyclic spectrum metric which leverages the intrinsic mathematical structure of cyclostationary signals to more comprehensively characterize their behavior. Building on this analysis, we formulate a new ISAC design problem that explicitly considers sensing security, and we develop a low-complexity, efficient optimization approach to solve it. Simulation results demonstrate that the proposed metric is both effective and insightful, and that our ISAC design significantly enhances sensing security performance in the presence of multi-intercept threats. △ Less

Submitted 28 June, 2025; originally announced June 2025.

arXiv:2506.22815 [pdf, ps, other]

Memory as a Service (MaaS): Rethinking Contextual Memory as Service-Oriented Modules for Collaborative Agents

Authors: Haichang Li

Abstract: This position paper aims to rethink the role and design of memory in Large Language Model (LLM)-based agent systems. We observe that while current memory practices have begun to transcend the limitations of single interactions, they remain conceptually grounded in "bound memory" in terms of design concept-where memory is treated as local state attached to specific context or entities, forming "mem… ▽ More This position paper aims to rethink the role and design of memory in Large Language Model (LLM)-based agent systems. We observe that while current memory practices have begun to transcend the limitations of single interactions, they remain conceptually grounded in "bound memory" in terms of design concept-where memory is treated as local state attached to specific context or entities, forming "memory silos" that impede cross-entity collaboration. To overcome this architectural bottleneck, this paper proposes the timely design perspective of "Memory as a Service" (MaaS). MaaS advocates decoupling memory from its conventional role as an interaction byproduct and encapsulating it as a modular service that can be independently callable, dynamically composable, and finely governed. At its core, MaaS leverages the duality of memory-its inherently private nature and its potential for public service-to grant memory controlled, on-demand interoperability across entities. This paper introduces a two-dimensional design space defined by entity structure and service type, illustrating how MaaS aligns with current memory practices while naturally extending them to cross-entity collaborative scenarios. Finally, we outline an open research agenda spanning governance, security, and ethical ecosystems, and call upon the broader research community to explore this shift toward service-oriented memory for collaborative agents operating across entity boundaries. △ Less

Submitted 28 June, 2025; originally announced June 2025.

Comments: Position Paper for workshop. This is an initial version for discussion purposes

ACM Class: H.5.0

arXiv:2506.22736 [pdf, ps, other]

UniFuse: A Unified All-in-One Framework for Multi-Modal Medical Image Fusion Under Diverse Degradations and Misalignments

Authors: Dayong Su, Yafei Zhang, Huafeng Li, Jinxing Li, Yu Liu

Abstract: Current multimodal medical image fusion typically assumes that source images are of high quality and perfectly aligned at the pixel level. Its effectiveness heavily relies on these conditions and often deteriorates when handling misaligned or degraded medical images. To address this, we propose UniFuse, a general fusion framework. By embedding a degradation-aware prompt learning module, UniFuse se… ▽ More Current multimodal medical image fusion typically assumes that source images are of high quality and perfectly aligned at the pixel level. Its effectiveness heavily relies on these conditions and often deteriorates when handling misaligned or degraded medical images. To address this, we propose UniFuse, a general fusion framework. By embedding a degradation-aware prompt learning module, UniFuse seamlessly integrates multi-directional information from input images and correlates cross-modal alignment with restoration, enabling joint optimization of both tasks within a unified framework. Additionally, we design an Omni Unified Feature Representation scheme, which leverages Spatial Mamba to encode multi-directional features and mitigate modality differences in feature alignment. To enable simultaneous restoration and fusion within an All-in-One configuration, we propose a Universal Feature Restoration & Fusion module, incorporating the Adaptive LoRA Synergistic Network (ALSN) based on LoRA principles. By leveraging ALSN's adaptive feature representation along with degradation-type guidance, we enable joint restoration and fusion within a single-stage framework. Compared to staged approaches, UniFuse unifies alignment, restoration, and fusion within a single framework. Experimental results across multiple datasets demonstrate the method's effectiveness and significant advantages over existing approaches. △ Less

Submitted 27 June, 2025; originally announced June 2025.

Comments: Accepted by ICCV2025

arXiv:2506.22554 [pdf, ps, other]

Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset

Authors: Vasu Agrawal, Akinniyi Akinyemi, Kathryn Alvero, Morteza Behrooz, Julia Buffalini, Fabio Maria Carlucci, Joy Chen, Junming Chen, Zhang Chen, Shiyang Cheng, Praveen Chowdary, Joe Chuang, Antony D'Avirro, Jon Daly, Ning Dong, Mark Duppenthaler, Cynthia Gao, Jeff Girard, Martin Gleize, Sahir Gomez, Hongyu Gong, Srivathsan Govindarajan, Brandon Han, Sen He, Denise Hernandez , et al. (59 additional authors not shown)

Abstract: Human communication involves a complex interplay of verbal and nonverbal signals, essential for conveying meaning and achieving interpersonal goals. To develop socially intelligent AI technologies, it is crucial to develop models that can both comprehend and generate dyadic behavioral dynamics. To this end, we introduce the Seamless Interaction Dataset, a large-scale collection of over 4,000 hours… ▽ More Human communication involves a complex interplay of verbal and nonverbal signals, essential for conveying meaning and achieving interpersonal goals. To develop socially intelligent AI technologies, it is crucial to develop models that can both comprehend and generate dyadic behavioral dynamics. To this end, we introduce the Seamless Interaction Dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage from over 4,000 participants in diverse contexts. This dataset enables the development of AI technologies that understand dyadic embodied dynamics, unlocking breakthroughs in virtual agents, telepresence experiences, and multimodal content analysis tools. We also develop a suite of models that utilize the dataset to generate dyadic motion gestures and facial expressions aligned with human speech. These models can take as input both the speech and visual behavior of their interlocutors. We present a variant with speech from an LLM model and integrations with 2D and 3D rendering methods, bringing us closer to interactive virtual agents. Additionally, we describe controllable variants of our motion models that can adapt emotional responses and expressivity levels, as well as generating more semantically-relevant gestures. Finally, we discuss methods for assessing the quality of these dyadic motion models, which are demonstrating the potential for more intuitive and responsive human-AI interactions. △ Less

Submitted 30 June, 2025; v1 submitted 27 June, 2025; originally announced June 2025.

arXiv:2506.22465 [pdf, ps, other]

Preconditioned Conjugate Gradient for MIMO-AFDM System

Authors: Jun Zhu, Yin Xu, Dazhi He, Haoyang Li, Yunfeng Guan, Wenjun Zhang

Abstract: Affine frequency division multiplexing (AFDM) is a promising chirp-assisted multicarrier waveform for future high mobility communications. A significant challenge in MIMO-AFDM systems is the multi-user interference (MUI), which can be effectively addressed by employing precoding techniques. However, the complexity introduced by AFDM makes the precoding process computationally expensive and challen… ▽ More Affine frequency division multiplexing (AFDM) is a promising chirp-assisted multicarrier waveform for future high mobility communications. A significant challenge in MIMO-AFDM systems is the multi-user interference (MUI), which can be effectively addressed by employing precoding techniques. However, the complexity introduced by AFDM makes the precoding process computationally expensive and challenging. To overcome this issue, We combine AFDM channel sparse property and using Preconditioned Conjugate Gradient (PCG) method to iteratively process the precoding, thereby reducing the complexity of the precoding design. Simulation results demonstrate that the proposed sparsification approach, coupled with the PCG method, achieving quite precoding performance while significantly reducing computational complexity. This makes the application of AFDM more feasible and efficient for high-mobility communication scenarios, paving the way for its broader implementation in next-generation communication systems. △ Less

Submitted 18 June, 2025; originally announced June 2025.

Comments: arXiv admin note: text overlap with arXiv:2503.10525

arXiv:2506.22161 [pdf, ps, other]

Attention-disentangled Uniform Orthogonal Feature Space Optimization for Few-shot Object Detection

Authors: Taijin Zhao, Heqian Qiu, Yu Dai, Lanxiao Wang, Fanman Meng, Qingbo Wu, Hongliang Li

Abstract: Few-shot object detection (FSOD) aims to detect objects with limited samples for novel classes, while relying on abundant data for base classes. Existing FSOD approaches, predominantly built on the Faster R-CNN detector, entangle objectness recognition and foreground classification within shared feature spaces. This paradigm inherently establishes class-specific objectness criteria and suffers fro… ▽ More Few-shot object detection (FSOD) aims to detect objects with limited samples for novel classes, while relying on abundant data for base classes. Existing FSOD approaches, predominantly built on the Faster R-CNN detector, entangle objectness recognition and foreground classification within shared feature spaces. This paradigm inherently establishes class-specific objectness criteria and suffers from unrepresentative novel class samples. To resolve this limitation, we propose a Uniform Orthogonal Feature Space (UOFS) optimization framework. First, UOFS decouples the feature space into two orthogonal components, where magnitude encodes objectness and angle encodes classification. This decoupling enables transferring class-agnostic objectness knowledge from base classes to novel classes. Moreover, implementing the disentanglement requires careful attention to two challenges: (1) Base set images contain unlabeled foreground instances, causing confusion between potential novel class instances and backgrounds. (2) Angular optimization depends exclusively on base class foreground instances, inducing overfitting of angular distributions to base classes. To address these challenges, we propose a Hybrid Background Optimization (HBO) strategy: (1) Constructing a pure background base set by removing unlabeled instances in original images to provide unbiased magnitude-based objectness supervision. (2) Incorporating unlabeled foreground instances in the original base set into angular optimization to enhance distribution uniformity. Additionally, we propose a Spatial-wise Attention Disentanglement and Association (SADA) module to address task conflicts between class-agnostic and class-specific tasks. Experiments demonstrate that our method significantly outperforms existing approaches based on entangled feature spaces. △ Less

Submitted 27 June, 2025; originally announced June 2025.

arXiv:2506.22090 [pdf, ps, other]

Updated measurement of $CP$ violation and polarisation in $B^0_s \rightarrow J/ψ\overline{K}{}^{*}\kern-1pt(892)^{0}$ decays

Authors: LHCb collaboration, R. Aaij, A. S. W. Abdelmotteleb, C. Abellan Beteta, F. Abudinén, T. Ackernley, A. A. Adefisoye, B. Adeva, M. Adinolfi, P. Adlarson, C. Agapopoulou, C. A. Aidala, Z. Ajaltouni, S. Akar, K. Akiba, P. Albicocco, J. Albrecht, R. Aleksiejunas, F. Alessio, Z. Aliouche, P. Alvarez Cartelle, R. Amalric, S. Amato, J. L. Amey, Y. Amhis , et al. (1168 additional authors not shown)

Abstract: A time-integrated angular analysis of the decay $B^0_s \rightarrow J/ψ\overline{K}{}^{*}\kern-1pt(892)^{0}$, with $J/ψ\rightarrow μ^{+} μ^{-}$ and $\overline{K}{}^{*}\kern-1pt(892)^{0} \rightarrow K^{-} π^{+}$, is presented. The analysis employs a sample of proton-proton collision data collected by the LHCb experiment during 2015-2018 at a centre-of-mass energy of $13 \text{TeV}$, corresponding to… ▽ More A time-integrated angular analysis of the decay $B^0_s \rightarrow J/ψ\overline{K}{}^{*}\kern-1pt(892)^{0}$, with $J/ψ\rightarrow μ^{+} μ^{-}$ and $\overline{K}{}^{*}\kern-1pt(892)^{0} \rightarrow K^{-} π^{+}$, is presented. The analysis employs a sample of proton-proton collision data collected by the LHCb experiment during 2015-2018 at a centre-of-mass energy of $13 \text{TeV}$, corresponding to an integrated luminosity of $6 \text{fb}^{-1}$. A simultaneous maximum-likelihood fit is performed to the angular distributions in bins of the $K^{-} π^{+}$ mass. This fit yields measurements of the $CP$-averaged polarisation fractions and $CP$ asymmetries for the P-wave component of the $K^{-} π^{+}$ system. The longitudinal and parallel polarisation fractions are determined to be $f_{0} = 0.534 \pm 0.012 \pm 0.009$ and $f_{\parallel} = 0.211 \pm 0.014 \pm 0.005$, respectively, where the first uncertainty is statistical and the second is systematic. The $CP$ asymmetries are measured with $3$-$7\%$ precision and are found to be consistent with zero. These measurements, along with an updated determination of the branching fraction relative to the $B^0 \rightarrow J/ψK^{*0}$ decay, are combined with previous LHCb results, providing the most precise values for these observables to date. △ Less

Submitted 27 June, 2025; originally announced June 2025.

Comments: All figures and tables, along with machine-readable versions and any supplementary material and additional information, are available at https://lbfence.cern.ch/alcm/public/analysis/full-details/4457/ (LHCb public pages)

Report number: LHCb-PAPER-2025-020, CERN-EP-2025-131

arXiv:2506.21682 [pdf, ps, other]

Do We Really Need GNNs with Explicit Structural Modeling? MLPs Suffice for Language Model Representations

Authors: Li Zhou, Hao Jiang, Junjie Li, Zefeng Zhao, Feng Jiang, Wenyu Chen, Haizhou Li

Abstract: Explicit structural information has been proven to be encoded by Graph Neural Networks (GNNs), serving as auxiliary knowledge to enhance model capabilities and improve performance in downstream NLP tasks. However, recent studies indicate that GNNs fail to fully utilize structural information, whereas Multi-Layer Perceptrons (MLPs), despite lacking the message-passing mechanisms inherent to GNNs, e… ▽ More Explicit structural information has been proven to be encoded by Graph Neural Networks (GNNs), serving as auxiliary knowledge to enhance model capabilities and improve performance in downstream NLP tasks. However, recent studies indicate that GNNs fail to fully utilize structural information, whereas Multi-Layer Perceptrons (MLPs), despite lacking the message-passing mechanisms inherent to GNNs, exhibit a surprising ability in structure-aware tasks. Motivated by these findings, this paper introduces a comprehensive probing framework from an information-theoretic perspective. The framework is designed to systematically assess the role of explicit structural modeling in enhancing language model (LM) representations and to investigate the potential of MLPs as efficient and scalable alternatives to GNNs. We extend traditional probing classifiers by incorporating a control module that allows for selective use of either the full GNN model or its decoupled components, specifically, the message-passing and feature-transformation operations.This modular approach isolates and assesses the individual contributions of these operations, avoiding confounding effects from the complete GNN architecture. Using the Edge Probing Suite, a diagnostic tool for evaluating the linguistic knowledge encoded in LMs, we find that MLPs, when used as feature-transformation modules, consistently improve the linguistic knowledge captured in LM representations across different architectures. They effectively encode both syntactic and semantic patterns. Similarly, GNNs that incorporate feature-transformation operations show beneficial effects. In contrast, models that rely solely on message-passing operations tend to underperform, often leading to negative impacts on probing task performance. △ Less

Submitted 26 June, 2025; originally announced June 2025.

Comments: Graph Neural Networks, Multi-Layer Perceptrons, Explicit Structural Modeling, Probing Classifier

arXiv:2506.21420 [pdf, ps, other]

EndoFlow-SLAM: Real-Time Endoscopic SLAM with Flow-Constrained Gaussian Splatting

Authors: Taoyu Wu, Yiyi Miao, Zhuoxiao Li, Haocheng Zhao, Kang Dang, Jionglong Su, Limin Yu, Haoang Li

Abstract: Efficient three-dimensional reconstruction and real-time visualization are critical in surgical scenarios such as endoscopy. In recent years, 3D Gaussian Splatting (3DGS) has demonstrated remarkable performance in efficient 3D reconstruction and rendering. Most 3DGS-based Simultaneous Localization and Mapping (SLAM) methods only rely on the appearance constraints for optimizing both 3DGS and camer… ▽ More Efficient three-dimensional reconstruction and real-time visualization are critical in surgical scenarios such as endoscopy. In recent years, 3D Gaussian Splatting (3DGS) has demonstrated remarkable performance in efficient 3D reconstruction and rendering. Most 3DGS-based Simultaneous Localization and Mapping (SLAM) methods only rely on the appearance constraints for optimizing both 3DGS and camera poses. However, in endoscopic scenarios, the challenges include photometric inconsistencies caused by non-Lambertian surfaces and dynamic motion from breathing affects the performance of SLAM systems. To address these issues, we additionally introduce optical flow loss as a geometric constraint, which effectively constrains both the 3D structure of the scene and the camera motion. Furthermore, we propose a depth regularisation strategy to mitigate the problem of photometric inconsistencies and ensure the validity of 3DGS depth rendering in endoscopic scenes. In addition, to improve scene representation in the SLAM system, we improve the 3DGS refinement strategy by focusing on viewpoints corresponding to Keyframes with suboptimal rendering quality frames, achieving better rendering results. Extensive experiments on the C3VD static dataset and the StereoMIS dynamic dataset demonstrate that our method outperforms existing state-of-the-art methods in novel view synthesis and pose estimation, exhibiting high performance in both static and dynamic surgical scenes. △ Less

Submitted 5 July, 2025; v1 submitted 26 June, 2025; originally announced June 2025.

Comments: This paper has been accepted at MICCAI2025

arXiv:2506.21215 [pdf, ps, other]

Unveiling Causal Reasoning in Large Language Models: Reality or Mirage?

Authors: Haoang Chi, He Li, Wenjing Yang, Feng Liu, Long Lan, Xiaoguang Ren, Tongliang Liu, Bo Han

Abstract: Causal reasoning capability is critical in advancing large language models (LLMs) toward strong artificial intelligence. While versatile LLMs appear to have demonstrated capabilities in understanding contextual causality and providing responses that obey the laws of causality, it remains unclear whether they perform genuine causal reasoning akin to humans. However, current evidence indicates the c… ▽ More Causal reasoning capability is critical in advancing large language models (LLMs) toward strong artificial intelligence. While versatile LLMs appear to have demonstrated capabilities in understanding contextual causality and providing responses that obey the laws of causality, it remains unclear whether they perform genuine causal reasoning akin to humans. However, current evidence indicates the contrary. Specifically, LLMs are only capable of performing shallow (level-1) causal reasoning, primarily attributed to the causal knowledge embedded in their parameters, but they lack the capacity for genuine human-like (level-2) causal reasoning. To support this hypothesis, methodologically, we delve into the autoregression mechanism of transformer-based LLMs, revealing that it is not inherently causal. Empirically, we introduce a new causal Q&A benchmark called CausalProbe-2024, whose corpora are fresh and nearly unseen for the studied LLMs. The LLMs exhibit a significant performance drop on CausalProbe-2024 compared to earlier benchmarks, indicating the fact that they primarily engage in level-1 causal reasoning. To bridge the gap towards level-2 causal reasoning, we draw inspiration from the fact that human reasoning is usually facilitated by general knowledge and intended goals. We propose G^2-Reasoner, a method that incorporates general knowledge and goal-oriented prompts into LLMs' causal reasoning processes. Experiments demonstrate that G^2-Reasoner significantly enhances LLMs' causal reasoning capability, particularly in fresh and counterfactual contexts. This work sheds light on a new path for LLMs to advance towards genuine causal reasoning, going beyond level-1 and making strides towards level-2. △ Less

Submitted 26 June, 2025; originally announced June 2025.

Comments: 24 pages, accepted at NeurIPS 2024

Journal ref: Advances in Neural Information Processing Systems, 2024, 37: 96640-96670

arXiv:2506.21154 [pdf, ps, other]

Transformer-Based Spatial-Temporal Counterfactual Outcomes Estimation

Authors: He Li, Haoang Chi, Mingyu Liu, Wanrong Huang, Liyang Xu, Wenjing Yang

Abstract: The real world naturally has dimensions of time and space. Therefore, estimating the counterfactual outcomes with spatial-temporal attributes is a crucial problem. However, previous methods are based on classical statistical models, which still have limitations in performance and generalization. This paper proposes a novel framework for estimating counterfactual outcomes with spatial-temporal attr… ▽ More The real world naturally has dimensions of time and space. Therefore, estimating the counterfactual outcomes with spatial-temporal attributes is a crucial problem. However, previous methods are based on classical statistical models, which still have limitations in performance and generalization. This paper proposes a novel framework for estimating counterfactual outcomes with spatial-temporal attributes using the Transformer, exhibiting stronger estimation ability. Under mild assumptions, the proposed estimator within this framework is consistent and asymptotically normal. To validate the effectiveness of our approach, we conduct simulation experiments and real data experiments. Simulation experiments show that our estimator has a stronger estimation capability than baseline methods. Real data experiments provide a valuable conclusion to the causal effect of conflicts on forest loss in Colombia. The source code is available at https://github.com/lihe-maxsize/DeppSTCI_Release_Version-master. △ Less

Submitted 26 June, 2025; originally announced June 2025.

Comments: 24 pages, accepted at ICML 2025

arXiv:2506.21032 [pdf, ps, other]

RecCoT: Enhancing Recommendation via Chain-of-Thought

Authors: Shuo Yang, Jiangxia Cao, Haipeng Li, Yuqi Mao, Shuchao Pang

Abstract: In real-world applications, users always interact with items in multiple aspects, such as through implicit binary feedback (e.g., clicks, dislikes, long views) and explicit feedback (e.g., comments, reviews). Modern recommendation systems (RecSys) learn user-item collaborative signals from these implicit feedback signals as a large-scale binary data-streaming, subsequently recommending other highl… ▽ More In real-world applications, users always interact with items in multiple aspects, such as through implicit binary feedback (e.g., clicks, dislikes, long views) and explicit feedback (e.g., comments, reviews). Modern recommendation systems (RecSys) learn user-item collaborative signals from these implicit feedback signals as a large-scale binary data-streaming, subsequently recommending other highly similar items based on users' personalized historical interactions. However, from this collaborative-connection perspective, the RecSys does not focus on the actual content of the items themselves but instead prioritizes higher-probability signals of behavioral co-occurrence among items. Consequently, under this binary learning paradigm, the RecSys struggles to understand why a user likes or dislikes certain items. To alleviate it, some works attempt to utilize the content-based reviews to capture the semantic knowledge to enhance recommender models. However, most of these methods focus on predicting the ratings of reviews, but do not provide a human-understandable explanation. △ Less

Submitted 26 June, 2025; originally announced June 2025.

Comments: Work in progress

arXiv:2506.21012 [pdf, ps, other]

doi 10.1145/3711896.3736957

FedSC: Federated Learning with Semantic-Aware Collaboration

Authors: Huan Wang, Haoran Li, Huaming Chen, Jun Yan, Jiahua Shi, Jun Shen

Abstract: Federated learning (FL) aims to train models collaboratively across clients without sharing data for privacy-preserving. However, one major challenge is the data heterogeneity issue, which refers to the biased labeling preferences at multiple clients. A number of existing FL methods attempt to tackle data heterogeneity locally (e.g., regularizing local models) or globally (e.g., fine-tuning global… ▽ More Federated learning (FL) aims to train models collaboratively across clients without sharing data for privacy-preserving. However, one major challenge is the data heterogeneity issue, which refers to the biased labeling preferences at multiple clients. A number of existing FL methods attempt to tackle data heterogeneity locally (e.g., regularizing local models) or globally (e.g., fine-tuning global model), often neglecting inherent semantic information contained in each client. To explore the possibility of using intra-client semantically meaningful knowledge in handling data heterogeneity, in this paper, we propose Federated Learning with Semantic-Aware Collaboration (FedSC) to capture client-specific and class-relevant knowledge across heterogeneous clients. The core idea of FedSC is to construct relational prototypes and consistent prototypes at semantic-level, aiming to provide fruitful class underlying knowledge and stable convergence signals in a prototype-wise collaborative way. On the one hand, FedSC introduces an inter-contrastive learning strategy to bring instance-level embeddings closer to relational prototypes with the same semantics and away from distinct classes. On the other hand, FedSC devises consistent prototypes via a discrepancy aggregation manner, as a regularization penalty to constrain the optimization region of the local model. Moreover, a theoretical analysis for FedSC is provided to ensure a convergence guarantee. Experimental results on various challenging scenarios demonstrate the effectiveness of FedSC and the efficiency of crucial components. △ Less

Submitted 26 June, 2025; originally announced June 2025.

Comments: 12 pages, KDD 2025

arXiv:2506.20926 [pdf, ps, other]

Towards Generalized and Stealthy Watermarking for Generative Code Models

Authors: Haoxuan Li, Jiale Zhang, Xiaobing Sun, Xiapu Luo

Abstract: Generative code models (GCMs) significantly enhance development efficiency through automated code generation and code summarization. However, building and training these models require computational resources and time, necessitating effective digital copyright protection to prevent unauthorized leaks and misuse. Backdoor watermarking, by embedding hidden identifiers, simplifies copyright verificat… ▽ More Generative code models (GCMs) significantly enhance development efficiency through automated code generation and code summarization. However, building and training these models require computational resources and time, necessitating effective digital copyright protection to prevent unauthorized leaks and misuse. Backdoor watermarking, by embedding hidden identifiers, simplifies copyright verification by breaking the model's black-box nature. Current backdoor watermarking techniques face two main challenges: first, limited generalization across different tasks and datasets, causing fluctuating verification rates; second, insufficient stealthiness, as watermarks are easily detected and removed by automated methods. To address these issues, we propose CodeGuard, a novel watermarking method combining attention mechanisms with distributed trigger embedding strategies. Specifically, CodeGuard employs attention mechanisms to identify watermark embedding positions, ensuring verifiability. Moreover, by using homomorphic character replacement, it avoids manual detection, while distributed trigger embedding reduces the likelihood of automated detection. Experimental results demonstrate that CodeGuard achieves up to 100% watermark verification rates in both code summarization and code generation tasks, with no impact on the primary task performance. In terms of stealthiness, CodeGuard performs exceptionally, with a maximum detection rate of only 0.078 against ONION detection methods, significantly lower than baseline methods. △ Less

Submitted 29 June, 2025; v1 submitted 25 June, 2025; originally announced June 2025.

Comments: 13 pages

arXiv:2506.20788 [pdf, ps, other]

Robust and Flexible Microtransit Design: Chance-Constrained Dial-a-Ride Problem with Soft Time Windows

Authors: Hongli Li, Zengxiang Lei, Xinwu Qian, Satish V. Ukkusuri

Abstract: Microtransit offers a promising blend of rideshare flexibility and public transit efficiency. In practice, it faces unanticipated but spatially aligned requests, passengers seeking to join ongoing schedules, leading to underutilized capacity and degraded service if not properly managed. At the same time, it must accommodate diverse passenger needs, from routine errands to time-sensitive trips such… ▽ More Microtransit offers a promising blend of rideshare flexibility and public transit efficiency. In practice, it faces unanticipated but spatially aligned requests, passengers seeking to join ongoing schedules, leading to underutilized capacity and degraded service if not properly managed. At the same time, it must accommodate diverse passenger needs, from routine errands to time-sensitive trips such as medical appointments. To meet these expectations, incorporating time flexibility is essential. However, existing models seldom consider both spontaneous and heterogeneous demand, limiting their real-world applicability. We propose a robust and flexible microtransit framework that integrates time flexibility and demand uncertainty via a Chance-Constrained Dial-A-Ride Problem with Soft Time Windows (CCDARP-STW). Demand uncertainty is captured through nonlinear chance constraints with controllable violation probabilities, while time flexibility is modeled with soft time windows and penalized cost. We develop a bounded-support relaxation using limited statistical information to linearize the chance constraints and solve the model using a tailored Branch-and-Cut-and-Price (BCP) algorithm with a probabilistic dominance rule. This rule improves computational efficiency, reducing explored labels by 17.40% and CPU time by 22.27% in robust cases. A case study based on real-world Chicago data shows our framework yields 11.55 minutes and 11.13 miles of savings versus conventional microtransit, and achieves the highest service reliability (96.46%) among robust models. △ Less

Submitted 25 June, 2025; originally announced June 2025.

Comments: 27 pages, 10 figures, arXiv:2402.01265 [math.OC] (v1, Feb 2 2024); Plan to submit to Journal

arXiv:2506.20756 [pdf, ps, other]

StereoDiff: Stereo-Diffusion Synergy for Video Depth Estimation

Authors: Haodong Li, Chen Wang, Jiahui Lei, Kostas Daniilidis, Lingjie Liu

Abstract: Recent video depth estimation methods achieve great performance by following the paradigm of image depth estimation, i.e., typically fine-tuning pre-trained video diffusion models with massive data. However, we argue that video depth estimation is not a naive extension of image depth estimation. The temporal consistency requirements for dynamic and static regions in videos are fundamentally differ… ▽ More Recent video depth estimation methods achieve great performance by following the paradigm of image depth estimation, i.e., typically fine-tuning pre-trained video diffusion models with massive data. However, we argue that video depth estimation is not a naive extension of image depth estimation. The temporal consistency requirements for dynamic and static regions in videos are fundamentally different. Consistent video depth in static regions, typically backgrounds, can be more effectively achieved via stereo matching across all frames, which provides much stronger global 3D cues. While the consistency for dynamic regions still should be learned from large-scale video depth data to ensure smooth transitions, due to the violation of triangulation constraints. Based on these insights, we introduce StereoDiff, a two-stage video depth estimator that synergizes stereo matching for mainly the static areas with video depth diffusion for maintaining consistent depth transitions in dynamic areas. We mathematically demonstrate how stereo matching and video depth diffusion offer complementary strengths through frequency domain analysis, highlighting the effectiveness of their synergy in capturing the advantages of both. Experimental results on zero-shot, real-world, dynamic video depth benchmarks, both indoor and outdoor, demonstrate StereoDiff's SoTA performance, showcasing its superior consistency and accuracy in video depth estimation. △ Less

Submitted 30 June, 2025; v1 submitted 25 June, 2025; originally announced June 2025.

Comments: Work done in Nov 2024, during an internship at the University of Pennsylvania. Project page: https://stereodiff.github.io/

arXiv:2506.20661 [pdf, ps, other]

Architectural mechanisms of a universal fault-tolerant quantum computer

Authors: Dolev Bluvstein, Alexandra A. Geim, Sophie H. Li, Simon J. Evered, J. Pablo Bonilla Ataides, Gefen Baranes, Andi Gu, Tom Manovitz, Muqing Xu, Marcin Kalinowski, Shayan Majidy, Christian Kokail, Nishad Maskara, Elias C. Trapp, Luke M. Stewart, Simon Hollerith, Hengyun Zhou, Michael J. Gullans, Susanne F. Yelin, Markus Greiner, Vladan Vuletic, Madelyn Cain, Mikhail D. Lukin

Abstract: Quantum error correction (QEC) is believed to be essential for the realization of large-scale quantum computers. However, due to the complexity of operating on the encoded `logical' qubits, understanding the physical principles for building fault-tolerant quantum devices and combining them into efficient architectures is an outstanding scientific challenge. Here we utilize reconfigurable arrays of… ▽ More Quantum error correction (QEC) is believed to be essential for the realization of large-scale quantum computers. However, due to the complexity of operating on the encoded `logical' qubits, understanding the physical principles for building fault-tolerant quantum devices and combining them into efficient architectures is an outstanding scientific challenge. Here we utilize reconfigurable arrays of up to 448 neutral atoms to implement all key elements of a universal, fault-tolerant quantum processing architecture and experimentally explore their underlying working mechanisms. We first employ surface codes to study how repeated QEC suppresses errors, demonstrating 2.14(13)x below-threshold performance in a four-round characterization circuit by leveraging atom loss detection and machine learning decoding. We then investigate logical entanglement using transversal gates and lattice surgery, and extend it to universal logic through transversal teleportation with 3D [[15,1,3]] codes, enabling arbitrary-angle synthesis with logarithmic overhead. Finally, we develop mid-circuit qubit re-use, increasing experimental cycle rates by two orders of magnitude and enabling deep-circuit protocols with dozens of logical qubits and hundreds of logical teleportations with [[7,1,3]] and high-rate [[16,6,4]] codes while maintaining constant internal entropy. Our experiments reveal key principles for efficient architecture design, involving the interplay between quantum logic and entropy removal, judiciously using physical entanglement in logic gates and magic state generation, and leveraging teleportations for universality and physical qubit reset. These results establish foundations for scalable, universal error-corrected processing and its practical implementation with neutral atom systems. △ Less

Submitted 25 June, 2025; originally announced June 2025.

Comments: Main text + Methods. Ancillary files: 3 movies, error model, raw experimental commands

arXiv:2506.20660 [pdf, ps, other]

Continuous operation of a coherent 3,000-qubit system

Authors: Neng-Chun Chiu, Elias C. Trapp, Jinen Guo, Mohamed H. Abobeih, Luke M. Stewart, Simon Hollerith, Pavel Stroganov, Marcin Kalinowski, Alexandra A. Geim, Simon J. Evered, Sophie H. Li, Lisa M. Peters, Dolev Bluvstein, Tout T. Wang, Markus Greiner, Vladan Vuletić, Mikhail D. Lukin

Abstract: Neutral atoms are a promising platform for quantum science, enabling advances in areas ranging from quantum simulations and computation to metrology, atomic clocks and quantum networking. While atom losses typically limit these systems to a pulsed mode, continuous operation could significantly enhance cycle rates, remove bottlenecks in metrology, and enable deep-circuit quantum evolution through q… ▽ More Neutral atoms are a promising platform for quantum science, enabling advances in areas ranging from quantum simulations and computation to metrology, atomic clocks and quantum networking. While atom losses typically limit these systems to a pulsed mode, continuous operation could significantly enhance cycle rates, remove bottlenecks in metrology, and enable deep-circuit quantum evolution through quantum error correction. Here we demonstrate an experimental architecture for high-rate, continuous reloading and operation of a large-scale atom array system while realizing coherent storage and manipulation of quantum information. Our approach utilizes a series of two optical lattice conveyor belts to transport atom reservoirs into the science region, where atoms are repeatedly extracted into optical tweezers without affecting the coherence of qubits stored nearby. Using a reloading rate of 300,000 atoms in tweezers per second, we create over 30,000 initialized qubits per second, which we leverage to assemble and maintain an array of over 3,000 atoms for more than two hours. Furthermore, we demonstrate persistent refilling of the array with atomic qubits in either a spin-polarized or a coherent superposition state while preserving the quantum state of stored qubits. Our results pave the way for realization of large-scale continuously operated atomic clocks, sensors, and fault-tolerant quantum computers. △ Less

Submitted 25 June, 2025; originally announced June 2025.

Comments: Main text: 8 pages, 4 figures. Methods: 7 pages, 10 figures. Ancillary files: one supplementary movie and caption

arXiv:2506.20644 [pdf, ps, other]

Efficient Federated Learning with Encrypted Data Sharing for Data-Heterogeneous Edge Devices

Authors: Hangyu Li, Hongyue Wu, Guodong Fan, Zhen Zhang, Shizhan Chen, Zhiyong Feng

Abstract: As privacy protection gains increasing importance, more models are being trained on edge devices and subsequently merged into the central server through Federated Learning (FL). However, current research overlooks the impact of network topology, physical distance, and data heterogeneity on edge devices, leading to issues such as increased latency and degraded model performance. To address these is… ▽ More As privacy protection gains increasing importance, more models are being trained on edge devices and subsequently merged into the central server through Federated Learning (FL). However, current research overlooks the impact of network topology, physical distance, and data heterogeneity on edge devices, leading to issues such as increased latency and degraded model performance. To address these issues, we propose a new federated learning scheme on edge devices that called Federated Learning with Encrypted Data Sharing(FedEDS). FedEDS uses the client model and the model's stochastic layer to train the data encryptor. The data encryptor generates encrypted data and shares it with other clients. The client uses the corresponding client's stochastic layer and encrypted data to train and adjust the local model. FedEDS uses the client's local private data and encrypted shared data from other clients to train the model. This approach accelerates the convergence speed of federated learning training and mitigates the negative impact of data heterogeneity, making it suitable for application services deployed on edge devices requiring rapid convergence. Experiments results show the efficacy of FedEDS in promoting model performance. △ Less

Submitted 25 June, 2025; originally announced June 2025.

Comments: Accepted by ICWS 2025

arXiv:2506.20599 [pdf, ps, other]

SFNet: Fusion of Spatial and Frequency-Domain Features for Remote Sensing Image Forgery Detection

Authors: Ji Qi, Xinchang Zhang, Dingqi Ye, Yongjia Ruan, Xin Guo, Shaowen Wang, Haifeng Li

Abstract: The rapid advancement of generative artificial intelligence is producing fake remote sensing imagery (RSI) that is increasingly difficult to detect, potentially leading to erroneous intelligence, fake news, and even conspiracy theories. Existing forgery detection methods typically rely on single visual features to capture predefined artifacts, such as spatial-domain cues to detect forged objects l… ▽ More The rapid advancement of generative artificial intelligence is producing fake remote sensing imagery (RSI) that is increasingly difficult to detect, potentially leading to erroneous intelligence, fake news, and even conspiracy theories. Existing forgery detection methods typically rely on single visual features to capture predefined artifacts, such as spatial-domain cues to detect forged objects like roads or buildings in RSI, or frequency-domain features to identify artifacts from up-sampling operations in adversarial generative networks (GANs). However, the nature of artifacts can significantly differ depending on geographic terrain, land cover types, or specific features within the RSI. Moreover, these complex artifacts evolve as generative models become more sophisticated. In short, over-reliance on a single visual cue makes existing forgery detectors struggle to generalize across diverse remote sensing data. This paper proposed a novel forgery detection framework called SFNet, designed to identify fake images in diverse remote sensing data by leveraging spatial and frequency domain features. Specifically, to obtain rich and comprehensive visual information, SFNet employs two independent feature extractors to capture spatial and frequency domain features from input RSIs. To fully utilize the complementary domain features, the domain feature mapping module and the hybrid domain feature refinement module(CBAM attention) of SFNet are designed to successively align and fuse the multi-domain features while suppressing redundant information. Experiments on three datasets show that SFNet achieves an accuracy improvement of 4%-15.18% over the state-of-the-art RS forgery detection methods and exhibits robust generalization capabilities. The code is available at https://github.com/GeoX-Lab/RSTI/tree/main/SFNet. △ Less

Submitted 25 June, 2025; originally announced June 2025.

arXiv:2506.20590 [pdf, ps, other]

WonderFree: Enhancing Novel View Quality and Cross-View Consistency for 3D Scene Exploration

Authors: Chaojun Ni, Jie Li, Haoyun Li, Hengyu Liu, Xiaofeng Wang, Zheng Zhu, Guosheng Zhao, Boyuan Wang, Chenxin Li, Guan Huang, Wenjun Mei

Abstract: Interactive 3D scene generation from a single image has gained significant attention due to its potential to create immersive virtual worlds. However, a key challenge in current 3D generation methods is the limited explorability, which cannot render high-quality images during larger maneuvers beyond the original viewpoint, particularly when attempting to move forward into unseen areas. To address… ▽ More Interactive 3D scene generation from a single image has gained significant attention due to its potential to create immersive virtual worlds. However, a key challenge in current 3D generation methods is the limited explorability, which cannot render high-quality images during larger maneuvers beyond the original viewpoint, particularly when attempting to move forward into unseen areas. To address this challenge, we propose WonderFree, the first model that enables users to interactively generate 3D worlds with the freedom to explore from arbitrary angles and directions. Specifically, we decouple this challenge into two key subproblems: novel view quality, which addresses visual artifacts and floating issues in novel views, and cross-view consistency, which ensures spatial consistency across different viewpoints. To enhance rendering quality in novel views, we introduce WorldRestorer, a data-driven video restoration model designed to eliminate floaters and artifacts. In addition, a data collection pipeline is presented to automatically gather training data for WorldRestorer, ensuring it can handle scenes with varying styles needed for 3D scene generation. Furthermore, to improve cross-view consistency, we propose ConsistView, a multi-view joint restoration mechanism that simultaneously restores multiple perspectives while maintaining spatiotemporal coherence. Experimental results demonstrate that WonderFree not only enhances rendering quality across diverse viewpoints but also significantly improves global coherence and consistency. These improvements are confirmed by CLIP-based metrics and a user study showing a 77.20% preference for WonderFree over WonderWorld enabling a seamless and immersive 3D exploration experience. The code, model, and data will be publicly available. △ Less

Submitted 25 June, 2025; originally announced June 2025.

arXiv:2506.20502 [pdf]

Probing Solar Polar Regions

Authors: Yuanyong Deng, Hui Tian, Jie Jiang, Shuhong Yang, Hao Li, Robert Cameron, Laurent Gizon, Louise Harra, Robert F. Wimmer-Schweingruber, Frédéric Auchère, Xianyong Bai, Luis Bellot Rubio, Linjie Chen, Pengfei Chen, Lakshmi Pradeep Chitta, Jackie Davies, Fabio Favata, Li Feng, Xueshang Feng, Weiqun Gan, Don Hassler, Jiansen He, Junfeng Hou, Zhenyong Hou, Chunlan Jin , et al. (23 additional authors not shown)

Abstract: The magnetic fields and dynamical processes in the solar polar regions play a crucial role in the solar magnetic cycle and in supplying mass and energy to the fast solar wind, ultimately being vital in controlling solar activities and driving space weather. Despite numerous efforts to explore these regions, to date no imaging observations of the Sun's poles have been achieved from vantage points o… ▽ More The magnetic fields and dynamical processes in the solar polar regions play a crucial role in the solar magnetic cycle and in supplying mass and energy to the fast solar wind, ultimately being vital in controlling solar activities and driving space weather. Despite numerous efforts to explore these regions, to date no imaging observations of the Sun's poles have been achieved from vantage points out of the ecliptic plane, leaving their behavior and evolution poorly understood. This observation gap has left three top-level scientific questions unanswered, 1) How does the solar dynamo work and drive the solar magnetic cycle? 2) What drives the fast solar wind? 3) How do space weather processes globally originate from the Sun and propagate throughout the solar system? The Solar Polar-orbit Observatory (SPO) mission, a solar polar exploration spacecraft, is proposed to address these three unanswered scientific questions by imaging the Sun's poles from high heliolatitudes. In order to achieve its scientific goals, SPO will carry six remote-sensing and four in-situ instruments to measure the vector magnetic fields and Doppler velocity fields in the photosphere, to observed the Sun in the extreme ultraviolet, X-ray, and radio wavelengths, to image the corona and the heliosphere up to 45 $R_\odot$, and to perform in-situ detection of magnetic fields, and low- and high-energy particles in the solar wind. △ Less

Submitted 28 June, 2025; v1 submitted 25 June, 2025; originally announced June 2025.

Comments: Accepted for publication in Chinese Journal of Space Science

arXiv:2506.20443 [pdf, ps, other]

MHD simulation of tilt instability during the dynamic FRC magnetic compression process

Authors: Yiming Ma, Ping Zhu, Bo Rao, Haolong Li

Abstract: The nonlinear evolution of the tilt instability in a field reversed configuration (FRC) during the dynamic magnetic compression process has been investigated using magnetohydrodynamic (MHD) simulations with the NIMROD code [C. R. Sovinec \textit{et al.}, J. Comput. Phys. \textbf{195}, 355 (2004)]. The tilt mode induces significant deformations in the linear growth phase and results in complete con… ▽ More The nonlinear evolution of the tilt instability in a field reversed configuration (FRC) during the dynamic magnetic compression process has been investigated using magnetohydrodynamic (MHD) simulations with the NIMROD code [C. R. Sovinec \textit{et al.}, J. Comput. Phys. \textbf{195}, 355 (2004)]. The tilt mode induces significant deformations in the linear growth phase and results in complete confinement loss of the FRC in the nonlinear phase, with no evidence of dynamic nonlinear stabilization. The growth rate of the tilt mode increases with the compression field ramping rate and approaches an asymptotic value. Toroidal flow can reduce both the growth rate and the nonlinear saturation amplitude of the tilt mode. The stabilizing effect of the toroidal rotation is enhanced with higher compression field ramping rates due to the spontaneous toroidal field generation and increased flow shear during compression. Although the tilt mode remains unstable with a toroidal rotation Mach number close to 0.5, the onset of tilt distortion can be delayed, allowing a magnetic compression ratio up to 5.3 before the compressional heating terminates. △ Less

Submitted 25 June, 2025; originally announced June 2025.

arXiv:2506.20231 [pdf, ps, other]

Sensing-Aware Transmit Waveform/Receive Filter Design for OFDM-MBS Systems

Authors: Xinghe Li, Kainan Cheng, Hongzhi Guo, Huiyong Li, Ziyang Cheng

Abstract: In this letter, we study the problem of cooperative sensing design for an orthogonal frequency division multiplexing (OFDM) multiple base stations (MBS) system. We consider a practical scenario where the base stations (BSs) exploit certain subcarriers to realize a sensing function. Since the high sidelobe level (SLL) of OFDM waveforms degrades radar detection for weak targets, and the cross-correl… ▽ More In this letter, we study the problem of cooperative sensing design for an orthogonal frequency division multiplexing (OFDM) multiple base stations (MBS) system. We consider a practical scenario where the base stations (BSs) exploit certain subcarriers to realize a sensing function. Since the high sidelobe level (SLL) of OFDM waveforms degrades radar detection for weak targets, and the cross-correlation generated by other BSs further exacerbates detection performance, we devise a joint design scheme for OFDM sequence and receive filter by minimizing the integrated sidelobe level (ISL) while satisfying mainlobe level, peak-to-average power ratio (PAPR) and spectrum allocation constraints. To address this non-convex problem, we propose an alternating optimization (AO)-based algorithm. Numerical simulations validate the effectiveness of the proposed method, demonstrating the superiority of SSL reduction in the MBS system over the matched filtering method. △ Less

Submitted 30 June, 2025; v1 submitted 25 June, 2025; originally announced June 2025.

arXiv:2506.19816 [pdf, ps, other]

CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation

Authors: Hao Li, Shuai Yang, Yilun Chen, Yang Tian, Xiaoda Yang, Xinyi Chen, Hanqing Wang, Tai Wang, Feng Zhao, Dahua Lin, Jiangmiao Pang

Abstract: Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong generalization across manipulation tasks. However, they remain constrained by a single-frame observation paradigm and cannot fully benefit from the motion information offered by aggregated multi-frame historical observations, as the large vision-language backbone introduces substan… ▽ More Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong generalization across manipulation tasks. However, they remain constrained by a single-frame observation paradigm and cannot fully benefit from the motion information offered by aggregated multi-frame historical observations, as the large vision-language backbone introduces substantial computational cost and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm through an efficient post-training stage. CronusVLA comprises three key components: (1) single-frame pretraining on large-scale embodied datasets with autoregressive action tokens prediction, which establishes an embodied vision-language foundation; (2) multi-frame encoding, adapting the prediction of vision-language backbones from discrete action tokens to motion features during post-training, and aggregating motion features from historical frames into a feature chunking; (3) cross-frame decoding, which maps the feature chunking to accurate actions via a shared decoder with cross-attention. By reducing redundant token computation and caching past motion features, CronusVLA achieves efficient inference. As an application of motion features, we further propose an action adaptation mechanism based on feature-action retrieval to improve model performance during finetuning. CronusVLA achieves state-of-the-art performance on SimplerEnv with 70.9% success rate, and 12.7% improvement over OpenVLA on LIBERO. Real-world Franka experiments also show the strong performance and robustness. △ Less

Submitted 24 June, 2025; originally announced June 2025.

Comments: 36 pages, 21 figures

arXiv:2506.19742 [pdf, ps, other]

NeRF-based CBCT Reconstruction needs Normalization and Initialization

Authors: Zhuowei Xu, Han Li, Dai Sun, Zhicheng Li, Yujia Li, Qingpeng Kong, Zhiwei Cheng, Nassir Navab, S. Kevin Zhou

Abstract: Cone Beam Computed Tomography (CBCT) is widely used in medical imaging. However, the limited number and intensity of X-ray projections make reconstruction an ill-posed problem with severe artifacts. NeRF-based methods have achieved great success in this task. However, they suffer from a local-global training mismatch between their two key components: the hash encoder and the neural network. Specif… ▽ More Cone Beam Computed Tomography (CBCT) is widely used in medical imaging. However, the limited number and intensity of X-ray projections make reconstruction an ill-posed problem with severe artifacts. NeRF-based methods have achieved great success in this task. However, they suffer from a local-global training mismatch between their two key components: the hash encoder and the neural network. Specifically, in each training step, only a subset of the hash encoder's parameters is used (local sparse), whereas all parameters in the neural network participate (global dense). Consequently, hash features generated in each step are highly misaligned, as they come from different subsets of the hash encoder. These misalignments from different training steps are then fed into the neural network, causing repeated inconsistent global updates in training, which leads to unstable training, slower convergence, and degraded reconstruction quality. Aiming to alleviate the impact of this local-global optimization mismatch, we introduce a Normalized Hash Encoder, which enhances feature consistency and mitigates the mismatch. Additionally, we propose a Mapping Consistency Initialization(MCI) strategy that initializes the neural network before training by leveraging the global mapping property from a well-trained model. The initialized neural network exhibits improved stability during early training, enabling faster convergence and enhanced reconstruction performance. Our method is simple yet effective, requiring only a few lines of code while substantially improving training efficiency on 128 CT cases collected from 4 different datasets, covering 7 distinct anatomical regions. △ Less

Submitted 24 June, 2025; originally announced June 2025.

arXiv:2506.19545 [pdf, ps, other]

Fast convergence of a primal-dual dynamical system with implicit Hessian damping and Tikhonov regularization

Authors: Hong-lu Li, Xin He, Yi-bin Xiao

Abstract: This paper proposes two primal-dual dynamical systems for solving linear equality constrained convex optimization problems: one with implicit Hessian damping only, and the other further incorporating Tikhonov regularization. We analyze the fast convergence properties of both dynamical systems and show that they achieve the same convergence rates. Moreover, we show that the trajectory generated by… ▽ More This paper proposes two primal-dual dynamical systems for solving linear equality constrained convex optimization problems: one with implicit Hessian damping only, and the other further incorporating Tikhonov regularization. We analyze the fast convergence properties of both dynamical systems and show that they achieve the same convergence rates. Moreover, we show that the trajectory generated by the dynamical system with Tikhonov regularization converges strongly to the minimum-norm solution of the underlying problem. Finally, numerical experiments are conducted to validate the theoretical findings. Interestingly, the trajectories exhibit smooth behavior even when the objective function is only continuously differentiable. △ Less

Submitted 26 June, 2025; v1 submitted 24 June, 2025; originally announced June 2025.

Comments: 25 pages, 9 figures

arXiv:2506.19180 [pdf, ps, other]

Precise Measurement of the $Λ$ Electric Dipole Moment through the Entangled Strange Baryon-Antibaryon System

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai , et al. (696 additional authors not shown)

Abstract: The dominance of matter over antimatter in the universe has consistently driven the pursuit of new physics beyond the Standard Model that violates charge-parity symmetry. Unlike the well-constrained electrons and neutrons, strange baryons (hyperons) remain a largely unexplored territory, in which interactions between hyperons and particles from new physics could induce a non-trivial electric dipol… ▽ More The dominance of matter over antimatter in the universe has consistently driven the pursuit of new physics beyond the Standard Model that violates charge-parity symmetry. Unlike the well-constrained electrons and neutrons, strange baryons (hyperons) remain a largely unexplored territory, in which interactions between hyperons and particles from new physics could induce a non-trivial electric dipole moment (EDM). However, direct measurements of hyperon EDMs through spin precession are highly challenging due to their short lifetimes. In this paper, we present a novel method to extract the EDM of the lightest hyperon, $Λ$, using the entangled $Λ$$\overlineΛ$ system. Our result is consistent with zero, achieving a three-order-of-magnitude improvement over the previous upper limit established in the 1980s with comparable statistics, providing stringent constraints on potential new physics. △ Less

Submitted 28 June, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

arXiv:2506.18882 [pdf, ps, other]

Light of Normals: Unified Feature Representation for Universal Photometric Stereo

Authors: Hong Li, Houyuan Chen, Chongjie Ye, Zhaoxi Chen, Bohan Li, Shaocong Xu, Xianda Guo, Xuhui Liu, Yikai Wang, Baochang Zhang, Satoshi Ikehata, Boxin Shi, Anyi Rao, Hao Zhao

Abstract: Universal photometric stereo (PS) aims to recover high-quality surface normals from objects under arbitrary lighting conditions without relying on specific illumination models. Despite recent advances such as SDM-UniPS and Uni MS-PS, two fundamental challenges persist: 1) the deep coupling between varying illumination and surface normal features, where ambiguity in observed intensity makes it diff… ▽ More Universal photometric stereo (PS) aims to recover high-quality surface normals from objects under arbitrary lighting conditions without relying on specific illumination models. Despite recent advances such as SDM-UniPS and Uni MS-PS, two fundamental challenges persist: 1) the deep coupling between varying illumination and surface normal features, where ambiguity in observed intensity makes it difficult to determine whether brightness variations stem from lighting changes or surface orientation; and 2) the preservation of high-frequency geometric details in complex surfaces, where intricate geometries create self-shadowing, inter-reflections, and subtle normal variations that conventional feature processing operations struggle to capture accurately. △ Less

Submitted 24 June, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

Comments: Home: https://houyuanchen111.github.io/lino.github.io Github: https://github.com/houyuanchen111/LINO_UniPS HuggingFace Demo: https://huggingface.co/spaces/houyuanchen/lino

Showing 51–100 of 12,295 results for author: Li, h