Search | arXiv e-print repository

RePO: Replay-Enhanced Policy Optimization

Authors: Siheng Li, Zhanhui Zhou, Wai Lam, Chao Yang, Chaochao Lu

Abstract: Reinforcement learning (RL) is vital for optimizing large language models (LLMs). Recent Group Relative Policy Optimization (GRPO) estimates advantages using multiple on-policy outputs per prompt, leading to high computational costs and low data efficiency. To address this, we introduce Replay-Enhanced Policy Optimization (RePO), which leverages diverse replay strategies to retrieve off-policy samples from a replay buffer, allowing policy optimization based on a broader and more diverse set of samples for each prompt. Experiments on five LLMs across seven mathematical reasoning benchmarks demonstrate that RePO achieves absolute average performance gains of $18.4$ and $4.1$ points for Qwen2.5-Math-1.5B and Qwen3-1.7B, respectively, compared to GRPO. Further analysis indicates that RePO increases computational cost by $15\%$ while raising the number of effective optimization steps by $48\%$ for Qwen3-1.7B, with both on-policy and off-policy sample numbers set to $8$. The repository can be accessed at https://github.com/SihengLi99/RePO.

Submitted 10 June, 2025; originally announced June 2025.

Comments: Project Page: https://github.com/SihengLi99/RePO
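
Illustration for the RePO entry above: the abstract says GRPO estimates advantages from a group of outputs per prompt and that RePO adds off-policy samples retrieved from a replay buffer to that group. The sketch below shows only the standard group-relative normalization applied to a combined group. The function name, the reward values, and the way the two sets are pooled are assumptions made for illustration; the abstract does not spell out RePO's objective or replay strategies, and a real off-policy update would need further corrections that are not shown here.

```python
from statistics import mean, pstdev


def group_relative_advantages(on_policy_rewards, replay_rewards, eps=1e-6):
    """Normalize rewards within one prompt's group (hypothetical helper).

    The group is the concatenation of fresh on-policy rewards and rewards
    of samples pulled from a replay buffer (pooling is an assumption).
    """
    rewards = list(on_policy_rewards) + list(replay_rewards)
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for one prompt: 8 fresh samples that all failed,
# plus 8 replayed samples of which two succeeded.
on_policy = [0.0] * 8
replayed = [1.0, 1.0] + [0.0] * 6

only_fresh = group_relative_advantages(on_policy, [])
combined = group_relative_advantages(on_policy, replayed)
```

In this toy case the on-policy group alone has identical rewards, so every advantage is zero and the step carries no learning signal; pooling in replayed samples makes the advantages nonzero. This is one plausible reading of why replay could raise the number of effective optimization steps, not a claim about the authors' implementation.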

arXiv:2506.09095 [pdf, ps, other]

Foundation Models in Medical Imaging -- A Review and Outlook

Authors: Vivien van Veldhuizen, Vanessa Botha, Chunyao Lu, Melis Erdal Cesur, Kevin Groot Lipman, Edwin D. de Jong, Hugo Horlings, Clárisa Sanchez, Cees Snoek, Ritse Mann, Eric Marcus, Jonas Teuwen

Abstract: Foundation models (FMs) are changing the way medical images are analyzed by learning from large collections of unlabeled data. Instead of relying on manually annotated examples, FMs are pre-trained to learn general-purpose visual features that can later be adapted to specific clinical tasks with little additional supervision. In this review, we examine how FMs are being developed and applied in pathology, radiology, and ophthalmology, drawing on evidence from over 150 studies. We explain the core components of FM pipelines, including model architectures, self-supervised learning methods, and strategies for downstream adaptation. We also review how FMs are being used in each imaging domain and compare design choices across applications. Finally, we discuss key challenges and open questions to guide future research.

Submitted 10 June, 2025; originally announced June 2025.

arXiv:2506.08399 [pdf, ps, other]

SafeCoT: Improving VLM Safety with Minimal Reasoning

Authors: Jiachen Ma, Zhanhui Zhou, Chao Yang, Chaochao Lu

Abstract: Ensuring safe and appropriate responses from vision-language models (VLMs) remains a critical challenge, particularly in high-risk or ambiguous scenarios. We introduce SafeCoT, a lightweight, interpretable framework that leverages rule-based chain-of-thought (CoT) supervision to improve refusal behavior in VLMs. Unlike prior methods that rely on large-scale safety annotations or complex modeling, SafeCoT uses minimal supervision to help models reason about safety risks and make context-aware refusals. Experiments across multiple benchmarks show that SafeCoT significantly reduces overrefusal and enhances generalization, even with limited training data. Our approach offers a scalable solution for aligning VLMs with safety-critical objectives.

Submitted 11 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.08334 [pdf, ps, other]

Generalizable Articulated Object Reconstruction from Casually Captured RGBD Videos

Authors: Weikun Peng, Jun Lv, Cewu Lu, Manolis Savva

Abstract: Articulated objects are prevalent in daily life. Understanding their kinematic structure and reconstructing them have numerous applications in embodied AI and robotics. However, current methods require carefully captured data for training or inference, preventing practical, scalable, and generalizable reconstruction of articulated objects. We focus on reconstruction of an articulated object from a casually captured RGBD video shot with a hand-held camera. A casually captured video of an interaction with an articulated object is easy to acquire at scale using smartphones. However, this setting is quite challenging, as the object and camera move simultaneously and there are significant occlusions as the person interacts with the object. To tackle these challenges, we introduce a coarse-to-fine framework that infers joint parameters and segments movable parts of the object from a dynamic RGBD video. To evaluate our method under this new setting, we build a 20$\times$ larger synthetic dataset of 784 videos containing 284 objects across 11 categories. We compare our approach with existing methods that also take video as input. Experiments show that our method can reconstruct synthetic and real articulated objects across different categories from dynamic RGBD videos, outperforming existing methods significantly.

Submitted 9 June, 2025; originally announced June 2025.

Comments: Project website can be found at https://3dlg-hcvc.github.io/video2articulation/

arXiv:2506.07664 [pdf, ps, other]

Synthesis by Design: Controlled Data Generation via Structural Guidance

Authors: Lei Xu, Sirui Chen, Yuxuan Huang, Chaochao Lu

Abstract: Mathematical reasoning remains challenging for LLMs due to complex logic and the need for precise computation. Existing methods enhance LLM reasoning by synthesizing datasets through problem rephrasing, but face issues with generation quality and problem complexity. To address this, we propose to extract structural information with generated problem-solving code from mathematical reasoning and guide data generation with structured solutions. Applied to MATH and GSM8K, our approach produces 39K problems with labeled intermediate steps and a 6.1K-problem benchmark of higher difficulty. Results on our benchmark show that model performance declines as reasoning length increases. Additionally, we conducted fine-tuning experiments using the proposed training data on a range of LLMs, and the results validate the effectiveness of our dataset. We hope the proposed method and dataset will contribute to future research in enhancing LLM reasoning capabilities. Our code and data are available at https://github.com/OpenCausaLab/StructuralGeneration.

Submitted 10 June, 2025; v1 submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.07639 [pdf, ps, other]

Fast ECoT: Efficient Embodied Chain-of-Thought via Thoughts Reuse

Authors: Zhekai Duan, Yuan Zhang, Shikai Geng, Gaowen Liu, Joschka Boedecker, Chris Xiaoxuan Lu

Abstract: Embodied Chain-of-Thought (ECoT) reasoning enhances vision-language-action (VLA) models by improving performance and interpretability through intermediate reasoning steps. However, its sequential autoregressive token generation introduces significant inference latency, limiting real-time deployment. We propose Fast ECoT, an inference-time acceleration method that exploits the structured and repetitive nature of ECoT to (1) cache and reuse high-level reasoning across timesteps and (2) parallelise the generation of modular reasoning steps. Additionally, we introduce an asynchronous scheduler that decouples reasoning from action decoding, further boosting responsiveness. Fast ECoT requires no model changes or additional training and integrates easily into existing VLA pipelines. Experiments in both simulation (LIBERO) and real-world robot tasks show up to a 7.5% reduction in latency with comparable or improved task success rate and reasoning faithfulness, bringing ECoT policies closer to practical real-time deployment.

Submitted 9 June, 2025; originally announced June 2025.

arXiv:2506.06729 [pdf, other]

Mitigating Object Hallucination via Robust Local Perception Search

Authors: Zixian Gao, Chao Yang, Zhanhui Zhou, Xing Xu, Chaochao Lu

Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have enabled them to effectively integrate vision and language, addressing a variety of downstream tasks. However, despite their significant success, these models still exhibit hallucination phenomena, where the outputs appear plausible but do not align with the content of the images. To mitigate this issue, we introduce Local Perception Search (LPS), a decoding method during inference that is both simple and training-free, yet effectively suppresses hallucinations. This method leverages local visual prior information as a value function to correct the decoding process. Additionally, we observe that the impact of the local visual prior on model performance is more pronounced in scenarios with high levels of image noise. Notably, LPS is a plug-and-play approach that is compatible with various models. Extensive experiments on widely used hallucination benchmarks and noisy data demonstrate that LPS significantly reduces the incidence of hallucinations compared to the baseline, showing exceptional performance, particularly in noisy settings.

Submitted 7 June, 2025; originally announced June 2025.

arXiv:2506.05445 [pdf, ps, other]

Causal Policy Learning in Reinforcement Learning: Backdoor-Adjusted Soft Actor-Critic

Authors: Thanh Vinh Vo, Young Lee, Haozhe Ma, Chien Lu, Tze-Yun Leong

Abstract: Hidden confounders that influence both states and actions can bias policy learning in reinforcement learning (RL), leading to suboptimal or non-generalizable behavior. Most RL algorithms ignore this issue, learning policies from observational trajectories based solely on statistical associations rather than causal effects. We propose DoSAC (Do-Calculus Soft Actor-Critic with Backdoor Adjustment), a principled extension of the SAC algorithm that corrects for hidden confounding via causal intervention estimation. DoSAC estimates the interventional policy $π(a | \mathrm{do}(s))$ using the backdoor criterion, without requiring access to true confounders or causal labels. To achieve this, we introduce a learnable Backdoor Reconstructor that infers pseudo-past variables (previous state and action) from the current state to enable backdoor adjustment from observational data. This module is integrated into a soft actor-critic framework to compute both the interventional policy and its entropy. Empirical results on continuous control benchmarks show that DoSAC outperforms baselines under confounded settings, with improved robustness, generalization, and policy reliability.

Submitted 5 June, 2025; originally announced June 2025.

Comments: Preprint
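
Illustration for the DoSAC entry above: the abstract refers to estimating $π(a | \mathrm{do}(s))$ via the backdoor criterion. For orientation, the textbook backdoor adjustment formula is written out below. Here $z$ is a generic adjustment set that satisfies the backdoor criterion; reading $z$ as the inferred pseudo-past (previous state and action) is an assumption based on the abstract, and the paper's actual estimator may differ.

```latex
% Generic backdoor adjustment (discrete z); for continuous z the sum becomes an integral.
\pi\bigl(a \mid \mathrm{do}(s)\bigr) \;=\; \sum_{z} \pi(a \mid s, z)\, p(z)
```

The difference from the observational policy is the weighting: conditioning on $s$ would weight by $p(z \mid s)$, whereas the adjustment weights by the marginal $p(z)$.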

arXiv:2506.03614 [pdf, ps, other]

VLMs Can Aggregate Scattered Training Patches

Authors: Zhanhui Zhou, Lingjie Chen, Chao Yang, Chaochao Lu

Abstract: One way to mitigate risks in vision-language models (VLMs) is to remove dangerous samples in their training data. However, such data moderation can be easily bypassed when harmful images are split into small, benign-looking patches, scattered across many training samples. VLMs may then learn to piece these fragments together during training and generate harmful responses at inference, either from full images or text references. For instance, if trained on image patches from a bloody scene paired with the description "safe," VLMs may later describe the full image or a text reference to the scene as "safe." We define the core ability of VLMs enabling this attack as $\textit{visual stitching}$ -- the ability to integrate visual information spread across multiple training samples that share the same textual descriptions. In our work, we first demonstrate visual stitching abilities in common open-source VLMs on three datasets where each image is labeled with a unique synthetic ID: we split each $(\texttt{image}, \texttt{ID})$ pair into $\{(\texttt{patch}, \texttt{ID})\}$ pairs at different granularity for finetuning, and we find that tuned models can verbalize the correct IDs from full images or text references. Building on this, we simulate the adversarial data poisoning scenario mentioned above by using patches from dangerous images and replacing IDs with text descriptions like "safe" or "unsafe", demonstrating how harmful content can evade moderation in patches and later be reconstructed through visual stitching, posing serious VLM safety risks. Code is available at https://github.com/ZHZisZZ/visual-stitching.

Submitted 4 June, 2025; originally announced June 2025.

arXiv:2506.03106 [pdf, ps, other]

Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

Authors: Xiaoying Zhang, Hao Sun, Yipeng Zhang, Kaituo Feng, Chaochao Lu, Chao Yang, Helen Meng

Abstract: Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of large language models (LLMs). Despite this success, we identify three key challenges encountered by RL with solely numerical feedback: performance plateaus, limited effectiveness of self-reflection, and persistent failures. We then demonstrate that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided refinements simultaneously while maintaining exploration. Extensive experiments using Qwen2.5-7B-Base and Qwen3-8B-Base show that Critique-GRPO consistently outperforms supervised learning-based and RL-based fine-tuning approaches across eight challenging mathematical, STEM, and general reasoning tasks, improving average pass@1 scores by approximately 4.5% and 5%, respectively. Notably, Critique-GRPO surpasses a strong baseline that incorporates expert demonstrations within online RL. Further analysis reveals two critical insights about policy exploration: (1) higher entropy does not always guarantee efficient learning from exploration, and (2) longer responses do not necessarily lead to more effective exploration.

Submitted 4 June, 2025; v1 submitted 3 June, 2025; originally announced June 2025.

Comments: 38 pages

arXiv:2506.02860 [pdf, ps, other]

Tru-POMDP: Task Planning Under Uncertainty via Tree of Hypotheses and Open-Ended POMDPs

Authors: Wenjing Tang, Xinyu He, Yongxi Huang, Yunxiao Xiao, Cewu Lu, Panpan Cai

Abstract: Task planning under uncertainty is essential for home-service robots operating in the real world. Tasks involve ambiguous human instructions, hidden or unknown object locations, and open-vocabulary object types, leading to significant open-ended uncertainty and a boundlessly large planning space. To address these challenges, we propose Tru-POMDP, a planner that combines structured belief generation using Large Language Models (LLMs) with principled POMDP planning. Tru-POMDP introduces a hierarchical Tree of Hypotheses (TOH), which systematically queries an LLM to construct high-quality particle beliefs over possible world states and human goals. We further formulate an open-ended POMDP model that enables rigorous Bayesian belief tracking and efficient belief-space planning over these LLM-generated hypotheses. Experiments on complex object rearrangement tasks across diverse kitchen environments show that Tru-POMDP significantly outperforms state-of-the-art LLM-based and LLM-tree-search hybrid planners, achieving higher success rates with significantly better plans, stronger robustness to ambiguity and occlusion, and greater planning efficiency.

Submitted 3 June, 2025; originally announced June 2025.

arXiv:2506.02449 [pdf, ps, other]

IP-Dialog: Evaluating Implicit Personalization in Dialogue Systems with Synthetic Data

Authors: Bo Peng, Zhiheng Wang, Heyang Gong, Chaochao Lu

Abstract: In modern dialogue systems, the ability to implicitly infer user backgrounds from conversations and leverage this information for personalized assistance is crucial. However, the scarcity of high-quality data remains a fundamental challenge to evaluating and improving this capability. Traditional dataset construction methods are labor-intensive, resource-demanding, and raise privacy concerns. To address these issues, we propose a novel approach for automatic synthetic data generation and introduce the Implicit Personalized Dialogue (IP-Dialog) benchmark along with a training dataset, covering 10 tasks and 12 user attribute types. Additionally, we develop a systematic evaluation framework with four metrics to assess both attribute awareness and reasoning capabilities. We further propose five causal graphs to elucidate models' reasoning pathways during implicit personalization. Extensive experiments yield insightful observations and prove the reliability of our dataset.

Submitted 3 June, 2025; originally announced June 2025.

arXiv:2506.01687 [pdf, ps, other]

StochasTok: Improving Fine-Grained Subword Understanding in LLMs

Authors: Anya Sims, Thom Foster, Klara Kaleb, Tuan-Duy H. Nguyen, Joseph Lee, Jakob N. Foerster, Yee Whye Teh, Cong Lu

Abstract: Subword-level understanding is integral to numerous tasks, including understanding multi-digit numbers, spelling mistakes, abbreviations, rhyming, and wordplay. Despite this, current large language models (LLMs) still often struggle with seemingly simple subword-level tasks like "How many 'r's in 'strawberry'?". A key factor behind these failures is tokenization, which obscures the fine-grained structure of words. Current alternatives, such as character-level and dropout tokenization methods, significantly increase computational costs and provide inconsistent improvements. In this paper we revisit tokenization and introduce StochasTok, a simple, efficient stochastic tokenization scheme that randomly splits tokens during training, allowing LLMs to 'see' their internal structure. Our experiments show that pretraining with StochasTok substantially improves LLMs' downstream performance across multiple subword-level language games, including character counting, substring identification, and math tasks. Furthermore, StochasTok's simplicity allows seamless integration at any stage of the training pipeline, and we demonstrate that post-training with StochasTok can instill improved subword understanding into existing pretrained models, thus avoiding costly pretraining from scratch. These dramatic improvements achieved with a minimal change suggest StochasTok holds exciting potential when applied to larger, more capable models. Code open-sourced at: https://github.com/anyasims/stochastok.

Submitted 10 June, 2025; v1 submitted 2 June, 2025; originally announced June 2025.
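
Illustration for the StochasTok entry above: the abstract describes a scheme that "randomly splits tokens during training". A minimal sketch of that idea on string tokens follows. The function name, the toy vocabulary, the split probability `p`, and the rule that both halves must already be in the vocabulary are all assumptions made for illustration; the linked repository is the reference for the actual procedure.

```python
import random

# Toy vocabulary (made up for this example).
VOCAB = {"straw", "berry", "s", "traw", "st", "raw"}


def stochastic_split(tokens, vocab, p=0.1, rng=None):
    """With probability p, replace a token by two shorter in-vocabulary pieces."""
    rng = rng or random.Random()
    out = []
    for tok in tokens:
        # Every cut position where both halves are themselves valid tokens.
        cuts = [i for i in range(1, len(tok))
                if tok[:i] in vocab and tok[i:] in vocab]
        if cuts and rng.random() < p:
            i = rng.choice(cuts)
            out.extend([tok[:i], tok[i:]])
        else:
            out.append(tok)  # no valid cut, or not selected this time
    return out


tokens = ["straw", "berry"]
unchanged = stochastic_split(tokens, VOCAB, p=0.0)
split_once = stochastic_split(tokens, VOCAB, p=1.0, rng=random.Random(0))
```

Whatever the random draw, the output tokens concatenate back to the same text, so the training target is unchanged; only the segmentation the model sees varies.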

arXiv:2506.01551 [pdf, ps, other]

EvolveNav: Self-Improving Embodied Reasoning for LLM-Based Vision-Language Navigation

Authors: Bingqian Lin, Yunshuang Nie, Khun Loun Zai, Ziming Wei, Mingfei Han, Rongtao Xu, Minzhe Niu, Jianhua Han, Liang Lin, Cewu Lu, Xiaodan Liang

Abstract: Building Vision-Language Navigation (VLN) agents which can navigate following natural language instructions is a long-standing goal in human-robot interaction applications. Recent studies have revealed the potential of training open-source Large Language Models (LLMs) to unleash LLMs' reasoning ability for improving navigation, and simultaneously mitigate the domain gap between LLMs' training corpus and the VLN task. However, these approaches primarily adopt direct input-output mapping paradigms, making the mapping difficult to learn and the navigational decisions unexplainable. Chain-of-Thought (CoT) training is a promising way to improve both navigational decision accuracy and interpretability, but the complexity of the navigation task makes perfect CoT labels unavailable, and pure CoT supervised fine-tuning may lead to overfitting. In this paper, we propose a novel sElf-improving embodied reasoning framework for boosting LLM-based vision-language Navigation, dubbed EvolveNav. Our EvolveNav consists of two stages: (1) Formalized CoT Supervised Fine-Tuning, where we train the model with formalized CoT labels to both activate the model's navigational reasoning capabilities and increase the reasoning speed; (2) Self-Reflective Post-Training, where the model is iteratively trained with its own reasoning outputs as self-enriched CoT labels to enhance the supervision diversity. A self-reflective auxiliary task is also introduced to encourage learning correct reasoning patterns by contrasting with wrong ones. Experimental results on the popular VLN benchmarks demonstrate the superiority of EvolveNav over previous LLM-based VLN approaches. Code is available at https://github.com/expectorlin/EvolveNav.

Submitted 2 June, 2025; originally announced June 2025.

arXiv:2506.00902 [pdf]

doi 10.1103/PhysRevLett.134.016708

Observation of universal topological magnetoelectric switching in multiferroic GdMn2O5

Authors: Haowen Wang, Fan Wang, Ming Yang, Yuting Chang, Mengyi Shi, Liang Li, Jun-Ming Liu, Junfeng Wang, Shuai Dong, Chengliang Lu

Abstract: Topological magnetoelectricity was recently revealed as an emergent topic, which opens a unique route to precisely control magnetoelectric functionality. Here we report the synchronous magnetic-electric-cycle operation of topological magnetoelectric switching in GdMn2O5. Compared with pure magnetic-cycle operation, this topological winding can be accessed in a much broader parameter space, i.e., the orientation of the magnetic field is not limited to the magic angle and the effect can persist up to the Curie temperature. The fine tuning of the free energy landscape is responsible for this topological behavior.

Submitted 1 June, 2025; originally announced June 2025.

Journal ref: Phys. Rev. Lett. 134, 016708 (2025)

arXiv:2506.00765 [pdf, ps, other]

HouseTS: A Large-Scale, Multimodal Spatiotemporal U.S. Housing Dataset

Authors: Shengkun Wang, Yanshen Sun, Fanglan Chen, Linhan Wang, Naren Ramakrishnan, Chang-Tien Lu, Yinlin Chen

Abstract: Accurate house-price forecasting is essential for investors, planners, and researchers. However, reproducible benchmarks with sufficient spatiotemporal depth and contextual richness for long-horizon prediction remain scarce. To address this, we introduce HouseTS, a large-scale, multimodal dataset covering monthly house prices from March 2012 to December 2023 across 6,000 ZIP codes in 30 major U.S. metropolitan areas. The dataset includes over 890K records, enriched with points of interest (POI), socioeconomic indicators, and detailed real estate metrics. To establish standardized performance baselines, we evaluate 14 models, spanning classical statistical approaches, deep neural networks (DNNs), and pretrained time-series foundation models. We further demonstrate the value of HouseTS in a multimodal case study, where a vision language model extracts structured textual descriptions of geographic change from time-stamped satellite imagery. This enables interpretable, grounded insights into urban evolution. HouseTS is hosted on Kaggle, while all preprocessing pipelines, benchmark code, and documentation are openly maintained on GitHub to ensure full reproducibility and easy adoption.

Submitted 31 May, 2025; originally announced June 2025.

arXiv:2506.00298 [pdf, ps, other]

Millimeter-wave observations of Euclid Deep Field South using the South Pole Telescope: A data release of temperature maps and catalogs

Authors: M. Archipley, A. Hryciuk, L. E. Bleem, K. Kornoelje, M. Klein, A. J. Anderson, B. Ansarinejad, M. Aravena, L. Balkenhol, P. S. Barry, K. Benabed, A. N. Bender, B. A. Benson, F. Bianchini, S. Bocquet, F. R. Bouchet, E. Camphuis, M. G. Campitiello, J. E. Carlstrom, J. Cathey, C. L. Chang, S. C. Chapman, P. Chaubal, P. M. Chichura, A. Chokshi , et al. (86 additional authors not shown)

Abstract: Context. The South Pole Telescope third-generation camera (SPT-3G) has observed over 10,000 square degrees of sky at 95, 150, and 220 GHz (3.3, 2.0, 1.4 mm, respectively) overlapping the ongoing 14,000 square-degree Euclid Wide Survey. The Euclid collaboration recently released Euclid Deep Field observations in the first quick data release (Q1). Aims. With the goal of releasing complementary millimeter-wave data and encouraging legacy science, we performed dedicated observations of a 57-square-degree field overlapping the Euclid Deep Field South (EDF-S). Methods. The observing time totaled 20 days and we reached noise depths of 4.3, 3.8, and 13.2 $μ$K-arcmin at 95, 150, and 220 GHz, respectively. Results. In this work we present the temperature maps and two catalogs constructed from these data. The emissive source catalog contains 601 objects (334 inside EDF-S) with 54% synchrotron-dominated sources and 46% thermal dust emission-dominated sources. The 5$σ$ detection thresholds are 1.7, 2.0, and 6.5 mJy in the three bands. The cluster catalog contains 217 cluster candidates (121 inside EDF-S) with median mass $M_{500c}=2.12 \times 10^{14} M_{\odot}/h_{70}$ and median redshift $z$ = 0.70, corresponding to an order-of-magnitude improvement in cluster density over previous tSZ-selected catalogs in this region (3.81 clusters per square degree). Conclusions. The overlap between SPT and Euclid data will enable a range of multiwavelength studies of the aforementioned source populations. This work serves as the first step towards joint projects between SPT and Euclid and provides a rich dataset containing information on galaxies, clusters, and their environments.

Submitted 30 May, 2025; originally announced June 2025.

Comments: 26 pages, 12 figures, to be submitted to A&A

arXiv:2505.24833 [pdf]

Cryogenic scanning photocurrent spectroscopy for materials responses to structured optical fields

Authors: Duxing Hao, Chun-I Lu, Ziqi Sun, Yu-Chen Chang, Wen-Hao Chang, Ye-Ru Chen, Akiyoshi Park, Beining Rao, Siyuan Qiu, Yann-Wen Lan, Ting-Hua Lu, Nai-Chang Yeh

Abstract: Circular dichroism spectroscopy is known to provide important insights into the interplay of different degrees of freedom in quantum materials, and yet spectroscopic study of the optoelectronic responses of quantum materials to structured optical fields, such as light with finite spin and orbital angular momentum, has not yet been widely explored, particularly at cryogenic temperature. Here we demonstrate the design and application of a novel instrument that integrates scanning spectroscopic photocurrent measurements with structured light of controlled spin and orbital angular momentum. For structured photons with wavelengths between 500 nm and 700 nm, this instrument can perform spatially resolved photocurrent measurements of two-dimensional materials or thin crystals under magnetic fields up to $\pm$ 14 Tesla, at temperatures from 300 K down to 3 K, with either spin angular momentum $\pm \hbar$ or orbital angular momentum $\pm \ell \hbar$ (where $\ell$=1,2,3... is the topological charge), and over a (35 $\times$ 25) $μm^2$ area with ~ 1 $μm$ spatial resolution. These capabilities of the instrument are exemplified by magneto-photocurrent spectroscopic measurements of monolayer 2H-$MoS_2$ field-effect transistors, which not only reveal the excitonic spectra but also demonstrate monotonically increasing photocurrents with increasing |$\ell $| as well as excitonic Zeeman splitting and an enhanced Landé g-factor due to the enhanced formation of intervalley dark excitons under magnetic field.
These studies thus demonstrate the versatility of the scanning photocurrent spectrometry for investigating excitonic physics, optical selection rules, and optoelectronic responses of novel quantum materials and engineered quantum devices to structured light.

Submitted 30 May, 2025; originally announced May 2025.

arXiv:2505.24810 [pdf, ps, other]

New Physics Search at the CEPC: a General Perspective

Authors: Stefan Antusch, Peter Athron, Daniele Barducci, Long Chen, Mingshui Chen, Xiang Chen, Huajie Cheng, Kingman Cheung, Joao Guimaraes da Costa, Arindam Das, Frank F. Deppisch, P. S. Bhupal Dev, Xiaokang Du, Yong Du, Yaquan Fang, Andrew Fowlie, Yu Gao, Bruce Mellado Garcia, Shao-Feng Ge, Jiayin Gu, Yu-Chen Guo, Jan Hajer, Chengcheng Han, Tao Han, Sven Heinemeyer, et al. (68 additional authors not shown)

Abstract: The Circular Electron-Positron Collider (CEPC), a proposed next-generation Higgs factory, provides new opportunities to explore physics beyond the Standard Model (SM). With its clean electron-positron collision environment and the ability to collect large samples of Higgs, W, and Z bosons, the CEPC enables precision measurements and searches for new physics. This white paper outlines the CEPC's discovery potential, including studies of exotic decays of the Higgs, Z, and top quarks, dark matter and dark sector phenomena, long-lived particles, supersymmetry, and neutrino-related signatures. Advanced detector technologies and reconstruction techniques, such as one-to-one correspondence reconstruction and jet origin identification, significantly improve sensitivity to rare and weakly interacting processes. The CEPC is particularly well suited to probe the electroweak phase transition and test models of electroweak baryogenesis and dark sector interactions. In addition, global fit analyses highlight the CEPC's complementary role in constraining a wide range of new physics scenarios. These features position the CEPC as a powerful tool for exploring the next frontier in fundamental particle physics in the post-Higgs discovery era.

Submitted 30 May, 2025; originally announced May 2025.

arXiv:2505.24678 [pdf]

All-optical diode via nonreciprocal nonlinear absorption and interfacial charge transfer in two-dimensional van der Waals heterostructures

Authors: Erkang Li, Jinhong Liu, Yanqing Ge, Mingjian Shi, Yijie Wang, Chunhui Lu, Yixuan Zhou, Xinlong Xu

Abstract: Nonreciprocity is fundamental to photonic and optoelectronic devices such as all-optical diodes for ultrafast optical signal processing. However, previous nonreciprocity is mainly based on linear optical response instead of nonlinear optical response based on recently developed two-dimensional (2D) van der Waals heterostructures. Herein, an all-optical diode prototype based on nonreciprocal nonlinear absorption and interfacial charge transfer is proposed and designed by both simulation and experiment based on ready van der Waals heterostructures. The giant saturable absorption from 2D MXenes (NbC) and reverse saturable absorption from 2D chalcogenides (GaS) play a synergistic role in the designed all-optical diodes, which is characterized by a femtosecond laser based Z-scan system. The comprehensive physical mechanism of this all-optical diode based on 2D van der Waals NbC/GaS heterostructure designed by simulations, is consistent with experiments under the consideration of both nonreciprocal nonlinear absorption and interfacial effect. This all-optical diode based on the 2D van der Waals heterostructure features the simplicity, scalability, stability, integration, and compatibility with the complementary planar fabrication technology, which can further extend and miniaturize the nonlinear photonic and optoelectric devices.

Submitted 30 May, 2025; originally announced May 2025.

arXiv:2505.24369 [pdf, ps, other]

Adversarial Preference Learning for Robust LLM Alignment

Authors: Yuanfu Wang, Pengyu Wang, Chenyang Xi, Bo Tang, Junyi Zhu, Wenqiang Wei, Chen Chen, Chao Yang, Jingfeng Zhang, Chaochao Lu, Yijun Niu, Keming Mao, Zhiyu Li, Feiyu Xiong, Jie Hu, Mingchuan Yang

Abstract: Modern language models often rely on Reinforcement Learning from Human Feedback (RLHF) to encourage safe behaviors. However, they remain vulnerable to adversarial attacks due to three key limitations: (1) the inefficiency and high cost of human annotation, (2) the vast diversity of potential adversarial attacks, and (3) the risk of feedback bias and reward hacking. To address these challenges, we introduce Adversarial Preference Learning (APL), an iterative adversarial training method incorporating three key innovations. First, a direct harmfulness metric based on the model's intrinsic preference probabilities, eliminating reliance on external assessment. Second, a conditional generative attacker that synthesizes input-specific adversarial variations. Third, an iterative framework with automated closed-loop feedback, enabling continuous adaptation through vulnerability discovery and mitigation. Experiments on Mistral-7B-Instruct-v0.3 demonstrate that APL significantly enhances robustness, achieving 83.33% harmlessness win rate over the base model (evaluated by GPT-4o), reducing harmful outputs from 5.88% to 0.43% (measured by LLaMA-Guard), and lowering attack success rate by up to 65% according to HarmBench. Notably, APL maintains competitive utility, with an MT-Bench score of 6.59 (comparable to the baseline 6.78) and an LC-WinRate of 46.52% against the base model.

Submitted 30 May, 2025; originally announced May 2025.

Comments: Accepted at ACL2025 Findings

arXiv:2505.22954 [pdf, ps, other]

Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents

Authors: Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, Jeff Clune

Abstract: Today's AI systems have human-designed, fixed architectures and cannot autonomously and continuously improve themselves. The advance of AI could itself be automated. If done safely, that would accelerate AI development and allow us to reap its benefits much sooner. Meta-learning can automate the discovery of novel algorithms, but is limited by first-order improvements and the human design of a suitable search space. The Gödel machine proposed a theoretical alternative: a self-improving AI that repeatedly modifies itself in a provably beneficial manner. Unfortunately, proving that most changes are net beneficial is impossible in practice. We introduce the Darwin Gödel Machine (DGM), a self-improving system that iteratively modifies its own code (thereby also improving its ability to modify its own codebase) and empirically validates each change using coding benchmarks. Inspired by Darwinian evolution and open-endedness research, the DGM maintains an archive of generated coding agents. It grows the archive by sampling an agent from it and using a foundation model to create a new, interesting, version of the sampled agent. This open-ended exploration forms a growing tree of diverse, high-quality agents and allows the parallel exploration of many different paths through the search space. Empirically, the DGM automatically improves its coding capabilities (e.g., better code editing tools, long-context window management, peer-review mechanisms), increasing performance on SWE-bench from 20.0% to 50.0%, and on Polyglot from 14.2% to 30.7%.
Furthermore, the DGM significantly outperforms baselines without self-improvement or open-ended exploration. All experiments were done with safety precautions (e.g., sandboxing, human oversight). The DGM is a significant step toward self-improving AI, capable of gathering its own stepping stones along paths that unfold into endless innovation.

Submitted 28 May, 2025; originally announced May 2025.

Comments: Code at https://github.com/jennyzzt/dgm

arXiv:2505.22170 [pdf, ps, other]

Attention-Enhanced Prompt Decision Transformers for UAV-Assisted Communications with AoI

Authors: Chi Lu, Yiyang Ni, Zhe Wang, Xiaoli Shi, Jun Li, Shi Jin

Abstract: Decision Transformer (DT) has recently demonstrated strong generalizability in dynamic resource allocation within unmanned aerial vehicle (UAV) networks, compared to conventional deep reinforcement learning (DRL). However, its performance is hindered by zero-padding for varying state dimensions, an inability to manage a long-term energy constraint, and challenges in acquiring expert samples for few-shot fine-tuning in new scenarios. To overcome these limitations, we propose an attention-enhanced prompt Decision Transformer (APDT) framework to optimize trajectory planning and user scheduling, aiming to minimize the average age of information (AoI) under a long-term energy constraint in UAV-assisted Internet of Things (IoT) networks. Specifically, we enhance the conventional DT framework by incorporating an attention mechanism to accommodate varying numbers of terrestrial users, introducing a prompt mechanism based on short trajectory demonstrations for rapid adaptation to new scenarios, and designing a token-assisted method to address the UAV's long-term energy constraint. The APDT framework is first pre-trained on offline datasets and then efficiently generalized to new scenarios. Simulations demonstrate that APDT converges twice as fast and reduces average AoI by $8\%$ compared to conventional DT.

Submitted 28 May, 2025; originally announced May 2025.

arXiv:2505.22159 [pdf, ps, other]

ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation

Authors: Jiawen Yu, Hairuo Liu, Qiaojun Yu, Jieji Ren, Ce Hao, Haitong Ding, Guangyu Huang, Guofan Huang, Yan Song, Panpan Cai, Cewu Lu, Wenqiang Zhang

Abstract: Vision-Language-Action (VLA) models have advanced general-purpose robotic manipulation by leveraging pretrained visual and linguistic representations. However, they struggle with contact-rich tasks that require fine-grained control involving force, especially under visual occlusion or dynamic uncertainty. To address these limitations, we propose ForceVLA, a novel end-to-end manipulation framework that treats external force sensing as a first-class modality within VLA systems. ForceVLA introduces FVLMoE, a force-aware Mixture-of-Experts fusion module that dynamically integrates pretrained visual-language embeddings with real-time 6-axis force feedback during action decoding. This enables context-aware routing across modality-specific experts, enhancing the robot's ability to adapt to subtle contact dynamics. We also introduce ForceVLA-Data, a new dataset comprising synchronized vision, proprioception, and force-torque signals across five contact-rich manipulation tasks. ForceVLA improves average task success by 23.2% over strong $π_0$-based baselines, achieving up to 80% success in tasks such as plug insertion. Our approach highlights the importance of multimodal integration for dexterous manipulation and sets a new benchmark for physically intelligent robotic control. Code and data will be released at https://sites.google.com/view/forcevla2025.

Submitted 28 May, 2025; originally announced May 2025.

arXiv:2505.20709 [pdf, ps, other]

Fractional order derivative characterizations of Besov-Morrey type spaces with applications

Authors: Chen Lu, Mingjin Li, Jianren Long

Abstract: On the one hand, the fractional order derivative characterization of the Besov-Morrey type space $B_{p}^{K}(s)$ is established by $K$-Carleson measures, and it was also shown that $f \in B_{p}^{K}(s_1) \Leftrightarrow f^{\left(\frac{s_2 - s_1}{p}\right)} \in B_{p}^{K}(s_2)$, which extended the results of Sun et al. on the fractional derivative of Morrey type space. On the other hand, some sufficient conditions for the growth of solutions to linear complex differential equations have been obtained by using $n$th derivative criterion.

Submitted 27 May, 2025; originally announced May 2025.

MSC Class: Primary 32A37; 32K15; Second 32M10

arXiv:2505.20678 [pdf, ps, other]

PromptEVC: Controllable Emotional Voice Conversion with Natural Language Prompts

Authors: Tianhua Qi, Shiyan Wang, Cheng Lu, Tengfei Song, Hao Yang, Zhanglin Wu, Wenming Zheng

Abstract: Controllable emotional voice conversion (EVC) aims to manipulate emotional expressions to increase the diversity of synthesized speech. Existing methods typically rely on predefined labels, reference audios, or prespecified factor values, often overlooking individual differences in emotion perception and expression. In this paper, we introduce PromptEVC that utilizes natural language prompts for precise and flexible emotion control. To bridge text descriptions with emotional speech, we propose emotion descriptor and prompt mapper to generate fine-grained emotion embeddings, trained jointly with reference embeddings. To enhance naturalness, we present a prosody modeling and control pipeline that adjusts the rhythm based on linguistic content and emotional cues. Additionally, a speaker encoder is incorporated to preserve identity. Experimental results demonstrate that PromptEVC outperforms state-of-the-art controllable EVC methods in emotion conversion, intensity control, mixed emotion synthesis, and prosody manipulation. Speech samples are available at https://jeremychee4.github.io/PromptEVC/.