Search | arXiv e-print repository

BREAD: Branched Rollouts from Expert Anchors Bridge SFT & RL for Reasoning

Authors: Xuechen Zhang, Zijian Huang, Yingcong Li, Chenshun Ni, Jiasi Chen, Samet Oymak

Abstract: Small language models (SLMs) struggle to learn complex reasoning behaviors, especially when high-quality traces are scarce or difficult to learn from. The standard training approach combines a supervised fine-tuning (SFT) stage, often to distill capabilities of a larger model, followed by a reinforcement learning (RL) stage such as Group Relative Policy Optimization (GRPO). In this paper, we investigate the fundamental limitations of this SFT + RL paradigm and propose methods to overcome them. Under a suitable theoretical model, we demonstrate that the SFT + RL strategy can fail completely when (1) the expert's traces are too difficult for the small model to express, or (2) the small model's initialization has exponentially small likelihood of success. To address these, we introduce BREAD: a GRPO variant that unifies the SFT and RL stages via partial expert guidance and branched rollouts. When self-generated traces fail, BREAD adaptively inserts short expert prefixes/hints, allowing the small model to complete the rest of the reasoning path, and ensuring that each update includes at least one successful trace. This mechanism both densifies the reward signal and induces a natural learning curriculum. BREAD requires fewer than 40% of ground-truth traces, consistently outperforming standard GRPO while speeding up the training by about 3 times. Importantly, we demonstrate that BREAD helps the model solve problems that are otherwise unsolvable by the SFT + RL strategy, highlighting how branched rollouts and expert guidance can substantially boost SLM reasoning.

Submitted 20 June, 2025; originally announced June 2025.
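
The BREAD abstract above notes that an update needs at least one successful trace. A minimal sketch of why, assuming the commonly used GRPO-style advantage (reward minus group mean, divided by group standard deviation): when every rollout in a group fails, all advantages are zero and the update carries no signal. The function name, the `eps` constant, and the toy rewards are illustrative assumptions, not the authors' code, and the branching procedure itself is not reproduced here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own rollout group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Every self-generated rollout fails: all advantages are zero.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Hypothetical case: one rollout, completed from an expert prefix, succeeds.
with_hint = group_relative_advantages([0.0, 0.0, 0.0, 1.0])
```

In the second group the successful rollout receives a positive advantage and the failures a negative one, which is the denser signal the abstract refers to.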

arXiv:2506.17206 [pdf, ps, other]

DreamCube: 3D Panorama Generation via Multi-plane Synchronization

Authors: Yukun Huang, Yanning Zhou, Jianan Wang, Kaiyi Huang, Xihui Liu

Abstract: 3D panorama synthesis is a promising yet challenging task that demands high-quality and diverse visual appearance and geometry of the generated omnidirectional content. Existing methods leverage rich image priors from pre-trained 2D foundation models to circumvent the scarcity of 3D panoramic data, but the incompatibility between 3D panoramas and 2D single views limits their effectiveness. In this work, we demonstrate that by applying multi-plane synchronization to the operators from 2D foundation models, their capabilities can be seamlessly extended to the omnidirectional domain. Based on this design, we further introduce DreamCube, a multi-plane RGB-D diffusion model for 3D panorama generation, which maximizes the reuse of 2D foundation model priors to achieve diverse appearances and accurate geometry while maintaining multi-view consistency. Extensive experiments demonstrate the effectiveness of our approach in panoramic image generation, panoramic depth estimation, and 3D scene generation.

Submitted 20 June, 2025; originally announced June 2025.

Comments: Project page: https://yukun-huang.github.io/DreamCube/

arXiv:2506.17110 [pdf, ps, other]

Monocular One-Shot Metric-Depth Alignment for RGB-Based Robot Grasping

Authors: Teng Guo, Baichuan Huang, Jingjin Yu

Abstract: Accurate 6D object pose estimation is a prerequisite for successfully completing robotic prehensile and non-prehensile manipulation tasks. At present, 6D pose estimation for robotic manipulation generally relies on depth sensors based on, e.g., structured light, time-of-flight, and stereo-vision, which can be expensive, produce noisy output (as compared with RGB cameras), and fail to handle transparent objects. On the other hand, state-of-the-art monocular depth estimation models (MDEMs) provide only affine-invariant depths up to an unknown scale and shift. Metric MDEMs achieve some successful zero-shot results on public datasets, but fail to generalize. We propose a novel framework, Monocular One-shot Metric-depth Alignment (MOMA), to recover metric depth from a single RGB image, through a one-shot adaptation building on MDEM techniques. MOMA performs scale-rotation-shift alignments during camera calibration, guided by sparse ground-truth depth points, enabling accurate depth estimation without additional data collection or model retraining on the testing setup. MOMA supports fine-tuning the MDEM on transparent objects, demonstrating strong generalization capabilities. Real-world experiments on tabletop 2-finger grasping and suction-based bin-picking applications show MOMA achieves high success rates in diverse tasks, confirming its effectiveness.

Submitted 20 June, 2025; originally announced June 2025.

Comments: Accepted to IROS 2025
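
The MOMA abstract above says affine-invariant depth is known only up to a scale and shift, to be recovered from sparse ground-truth points. A minimal sketch of that idea is an ordinary least-squares fit of `z ≈ scale * d + shift`. This covers only the scale and shift terms; the rotation component mentioned in the abstract is omitted, and the function name and sample values are assumptions for illustration rather than the authors' procedure.

```python
def fit_scale_shift(relative, metric):
    """Least-squares scale and shift minimizing sum((s * d + t - z) ** 2)."""
    n = len(relative)
    mean_d = sum(relative) / n
    mean_z = sum(metric) / n
    var_d = sum((d - mean_d) ** 2 for d in relative)
    cov = sum((d - mean_d) * (z - mean_z) for d, z in zip(relative, metric))
    scale = cov / var_d              # requires at least two distinct depths
    shift = mean_z - scale * mean_d
    return scale, shift

# Synthetic sparse points: metric depth generated as 2.0 * d + 0.5.
relative_depth = [0.10, 0.25, 0.40, 0.80]
metric_depth = [2.0 * d + 0.5 for d in relative_depth]
scale, shift = fit_scale_shift(relative_depth, metric_depth)
```

Once fitted, the same `scale` and `shift` would be applied to every pixel of the relative depth map to obtain metric depth.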

arXiv:2506.17046 [pdf, ps, other]

MUCAR: Benchmarking Multilingual Cross-Modal Ambiguity Resolution for Multimodal Large Language Models

Authors: Xiaolong Wang, Zhaolu Kang, Wangyuxuan Zhai, Xinyue Lou, Yunghwei Lai, Ziyue Wang, Yawen Wang, Kaiyu Huang, Yile Wang, Peng Li, Yang Liu

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant advances across numerous vision-language tasks. Due to their strong image-text alignment capability, MLLMs can effectively understand image-text pairs with clear meanings. However, effectively resolving the inherent ambiguities in natural language and visual contexts remains challenging. Existing multimodal benchmarks typically overlook linguistic and visual ambiguities, relying mainly on unimodal context for disambiguation and thus failing to exploit the mutual clarification potential between modalities. To bridge this gap, we introduce MUCAR, a novel and challenging benchmark designed explicitly for evaluating multimodal ambiguity resolution across multilingual and cross-modal scenarios. MUCAR includes: (1) a multilingual dataset where ambiguous textual expressions are uniquely resolved by corresponding visual contexts, and (2) a dual-ambiguity dataset that systematically pairs ambiguous images with ambiguous textual contexts, with each combination carefully constructed to yield a single, clear interpretation through mutual disambiguation. Extensive evaluations involving 19 state-of-the-art multimodal models--encompassing both open-source and proprietary architectures--reveal substantial gaps compared to human-level performance, highlighting the need for future research into more sophisticated cross-modal ambiguity comprehension methods, further pushing the boundaries of multimodal reasoning.

Submitted 20 June, 2025; originally announced June 2025.

arXiv:2506.16957 [pdf, ps, other]

Wi-Fi Sensing Tool Release: Gathering 802.11ax Channel State Information from a Commercial Wi-Fi Access Point

Authors: Zisheng Wang, Feng Li, Hangbin Zhao, Zihuan Mao, Yaodong Zhang, Qisheng Huang, Bo Cao, Mingming Cao, Baolin He, Qilin Hou

Abstract: Wi-Fi sensing has emerged as a powerful technology, leveraging channel state information (CSI) extracted from wireless data packets to enable diverse applications, ranging from human presence detection to gesture recognition and health monitoring. However, tools for extracting CSI from commercial Wi-Fi access points are lacking or out of date. This paper introduces ZTECSITool, a toolkit designed to capture high-resolution CSI measurements from commercial Wi-Fi 6 (802.11ax) access points, supporting bandwidths up to 160 MHz and 512 subcarriers. ZTECSITool bridges a critical gap in Wi-Fi sensing research, facilitating the development of next-generation sensing systems. The toolkit includes customized firmware and open-source software tools for configuring, collecting, and parsing CSI data, offering researchers a robust platform for advanced sensing applications. We detail the command protocols for CSI extraction, including band selection, STA filtering, and report configuration, and provide insights into the data structure of the reported CSI. Additionally, we present a Python-based graphical interface for real-time CSI visualization and analysis.

Submitted 20 June, 2025; originally announced June 2025.

Comments: This work has been submitted to the IEEE for possible publication

arXiv:2506.16955 [pdf, ps, other]

Search for the in-situ production of $^{77}$Ge in the GERDA neutrinoless double-beta decay experiment

Authors: M. Agostini, A. Alexander, G. Araujo, A. M. Bakalyarov, M. Balata, I. Barabanov, L. Baudis, C. Bauer, S. Belogurov, A. Bettini, L. Bezrukov, V. Biancacci, E. Bossio, V. Bothe, R. Brugnera, A. Caldwell, S. Calgaro, C. Cattadori, A. Chernogorov, P. -J. Chiu, T. Comellato, V. D'Andrea, E. V. Demidova, N. Di Marco, E. Doroshkevich, et al. (86 additional authors not shown)

Abstract: The beta decay of $^{77}$Ge and $^{77\mathrm{m}}$Ge, both produced by neutron capture on $^{76}$Ge, is a potential background for germanium-based neutrinoless double-beta decay search experiments such as GERDA or the LEGEND experiment. In this work we present a search for $^{77}$Ge decays in the full GERDA Phase II data set. A delayed coincidence method was employed to identify the decay of $^{77}$Ge via the isomeric state of $^{77}$As (9/2$^+$, 475 keV, ${T_{1/2} = 114}\,μ$s, $^{77\mathrm{m}}$As). New digital signal processing methods were employed to select and analyze pile-up signals. No signal was observed, and an upper limit on the production rate of $^{77}$Ge was set at $<0.216$ nuc/(kg$\cdot$yr) (90% CL). This corresponds to a total production rate of $^{77}$Ge and $^{77\mathrm{m}}$Ge of $<0.38$ nuc/(kg$\cdot$yr) (90% CL), assuming equal production rates. A previous Monte Carlo study predicted a value for in-situ $^{77}$Ge and $^{77\mathrm{m}}$Ge production of (0.21$\pm$0.07) nuc/(kg$\cdot$yr), a prediction that is now further corroborated by our experimental limit. Moreover, tagging the isomeric state of $^{77\mathrm{m}}$As can be utilised to further suppress the $^{77}$Ge background. Considering the similar experimental configurations of LEGEND-1000 and GERDA, the cosmogenic background in LEGEND-1000 at LNGS is estimated to remain at a sub-dominant level.

Submitted 20 June, 2025; originally announced June 2025.

Comments: 11 pages, 7 figures

arXiv:2506.16953 [pdf, ps, other]

Dimensions of compositions modulo a prime

Authors: Jia Huang

Abstract: The (ordinary) representation theory of the symmetric group is fascinating and has rich connections to combinatorics, including the Frobenius correspondence to the self-dual graded Hopf algebra of symmetric functions. The $0$-Hecke algebra (of type $A$) is a deformation of the group algebra of the symmetric group, and its representation theory has an analogous correspondence to the dual graded Hopf algebras of quasisymmetric functions and noncommutative symmetric functions. Macdonald used the hook length formula for the number of standard Young tableaux of a fixed shape to determine how many irreducible representations of the symmetric group have dimensions indivisible by a prime $p$. In this paper, we study the dimensions of the projective indecomposable modules of the $0$-Hecke algebra modulo $p$; such a module is indexed by a composition and its dimension is given by a ribbon number, i.e., the cardinality of a descent class. Applying a result of Dickson on the congruence of multinomial coefficients, we count how many ribbon numbers belong to each congruence class modulo $p$. We also extend the result to other finite Coxeter groups.

Submitted 20 June, 2025; originally announced June 2025.

Comments: 18 pages

MSC Class: 05E10
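
The abstract above defines a ribbon number as the cardinality of a descent class indexed by a composition. A brute-force sketch makes the definition concrete for small cases: for a composition of n, count the permutations of 1..n whose descent set equals the partial sums of the composition (excluding n). The function names are illustrative; this enumeration is only an aid to reading the definition and is not the counting method of the paper, which relies on Dickson's congruence.

```python
from itertools import permutations

def descent_set(composition):
    """Partial sums a_1, a_1 + a_2, ... of the composition, excluding the total."""
    total, out = 0, set()
    for part in composition[:-1]:
        total += part
        out.add(total)
    return out

def ribbon_number(composition):
    """Count permutations of 1..n whose descent set matches the composition."""
    n = sum(composition)
    target = descent_set(composition)
    count = 0
    for perm in permutations(range(1, n + 1)):
        # Position i (1-indexed) is a descent when perm[i] > perm[i + 1].
        descents = {i + 1 for i in range(n - 1) if perm[i] > perm[i + 1]}
        if descents == target:
            count += 1
    return count

def ribbon_number_mod(composition, p):
    """Residue of the ribbon number modulo a prime p."""
    return ribbon_number(composition) % p
```

For example, `ribbon_number((2, 1))` counts the permutations 132 and 231, and `ribbon_number_mod((2, 1), 2)` gives the residue of that count modulo 2. The enumeration is factorial in n, so it is suitable only for small compositions.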

arXiv:2506.16934 [pdf]

PET Tracer Separation Using Conditional Diffusion Transformer with Multi-latent Space Learning

Authors: Bin Huang, Feihong Xu, Xinchong Shi, Shan Huang, Binxuan Li, Fei Li, Qiegen Liu

Abstract: In clinical practice, single-radiotracer positron emission tomography (PET) is commonly used for imaging. Although multi-tracer PET imaging can provide supplementary information of radiotracers that are sensitive to physiological function changes, enabling a more comprehensive characterization of physiological and pathological states, the gamma-photon pairs generated by positron annihilation reactions of different tracers in PET imaging have the same energy, making it difficult to distinguish the tracer signals. In this study, a multi-latent space guided texture conditional diffusion transformer model (MS-CDT) is proposed for PET tracer separation. To the best of our knowledge, this is the first attempt to use texture condition and multi-latent space for tracer separation in PET imaging. The proposed model integrates diffusion and transformer architectures into a unified optimization framework, with the novel addition of texture masks as conditional inputs to enhance image details. By leveraging multi-latent space prior derived from different tracers, the model captures multi-level feature representations, aiming to balance computational efficiency and detail preservation. The texture masks, serving as conditional guidance, help the model focus on salient structural patterns, thereby improving the extraction and utilization of fine-grained image textures. When combined with the diffusion transformer backbone, this conditioning mechanism contributes to more accurate and robust tracer separation. To evaluate its effectiveness, the proposed MS-CDT is compared with several advanced methods on two types of 3D PET datasets: brain and chest scans. Experimental results indicate that MS-CDT achieved competitive performance in terms of image quality and preservation of clinically relevant information. Code is available at: https://github.com/yqx7150/MS-CDT.

Submitted 20 June, 2025; originally announced June 2025.

arXiv:2506.16922 [pdf, ps, other]

Low-Energy Supernova Constraints on Lepton Flavor Violating Axions

Authors: Zi-Miao Huang, Zuowei Liu

Abstract: The extreme conditions within the supernova core, a high-temperature and high-density environment, create an ideal laboratory for the search for new physics beyond the Standard Model. Of particular interest are low-energy supernovae, characterized by their low explosion energies, which place strong constraints on the new-physics energy transfer from the core to the mantle. We compute low-energy supernova constraints on lepton-flavor-violating axions and axion-like particles that couple to both electrons and muons. For axion mass above the muon mass, the electron-muon coalescence and the axion decay are dominant production and reabsorption processes, respectively. We find that the low-energy supernovae provide the most stringent constraints on the axions in the mass range of $\sim (110,550)$ MeV, probing the coupling constant down to $g_{aeμ} \simeq {\cal O}(10^{-11})$.

Submitted 20 June, 2025; originally announced June 2025.

Comments: 12 pages, 8 figures

arXiv:2506.16796 [pdf, ps, other]

RealSR-R1: Reinforcement Learning for Real-World Image Super-Resolution with Vision-Language Chain-of-Thought

Authors: Junbo Qiao, Miaomiao Cai, Wei Li, Yutong Liu, Xudong Huang, Gaoqi He, Jiao Xie, Jie Hu, Xinghao Chen, Shaohui Lin

Abstract: Real-World Image Super-Resolution is one of the most challenging tasks in image restoration. However, existing methods struggle with an accurate understanding of degraded image content, leading to reconstructed results that are both low-fidelity and unnatural. We present RealSR-R1 in this work, which empowers the RealSR models with understanding and reasoning capabilities. Inspired by the success of Chain of Thought (CoT) in large language models (LLMs), we simulate the human process of handling degraded images and propose the VLCoT framework, which integrates vision and language reasoning. The framework aims to precisely restore image details by progressively generating more comprehensive text and higher-resolution images. To overcome the challenge of traditional supervised learning CoT failing to generalize to real-world scenarios, we introduce, for the first time, Group Relative Policy Optimization (GRPO) into the Real-World Image Super-Resolution task. We propose VLCoT-GRPO as a solution, which designs four reward functions: (1) Format reward, used to standardize the CoT process; (2) Degradation reward, to incentivize accurate degradation estimation; (3) Understanding reward, to ensure the accuracy of the generated content; and (4) Generation reward, where we propose using a visual expert model to evaluate the quality of generated images, encouraging the model to generate more realistic images. Extensive experiments demonstrate that our proposed RealSR-R1 can generate realistic details and accurately understand image content, particularly in semantically rich scenes or images with severe degradation.

Submitted 20 June, 2025; originally announced June 2025.

arXiv:2506.16756 [pdf, ps, other]

doi 10.1609/aaai.v39i2.32116

SocialSim: Towards Socialized Simulation of Emotional Support Conversation

Authors: Zhuang Chen, Yaru Cao, Guanqun Bi, Jincenzi Wu, Jinfeng Zhou, Xiyao Xiao, Si Chen, Hongning Wang, Minlie Huang

Abstract: Emotional support conversation (ESC) helps reduce people's psychological stress and provide emotional value through interactive dialogues. Due to the high cost of crowdsourcing a large ESC corpus, recent attempts use large language models for dialogue augmentation. However, existing approaches largely overlook the social dynamics inherent in ESC, leading to less effective simulations. In this paper, we introduce SocialSim, a novel framework that simulates ESC by integrating key aspects of social interactions: social disclosure and social awareness. On the seeker side, we facilitate social disclosure by constructing a comprehensive persona bank that captures diverse and authentic help-seeking scenarios. On the supporter side, we enhance social awareness by eliciting cognitive reasoning to generate logical and supportive responses. Building upon SocialSim, we construct SSConv, a large-scale synthetic ESC corpus whose quality can even surpass crowdsourced ESC data. We further train a chatbot on SSConv and demonstrate its state-of-the-art performance in both automatic and human evaluations. We believe SocialSim offers a scalable way to synthesize ESC, making emotional care more accessible and practical.

Submitted 20 June, 2025; originally announced June 2025.

Comments: AAAI 2025 Paper #32116 (Without Publication Edits)

Journal ref: Proceedings of the AAAI Conference on Artificial Intelligence, 39(2), 1274-1282, 2025

arXiv:2506.16728 [pdf, ps, other]

Few-Shot Generalized Category Discovery With Retrieval-Guided Decision Boundary Enhancement

Authors: Yunhan Ren, Feng Luo, Siyu Huang

Abstract: While existing Generalized Category Discovery (GCD) models have achieved significant success, their performance with limited labeled samples and a small number of known categories remains largely unexplored. In this work, we introduce the task of Few-shot Generalized Category Discovery (FSGCD), aiming to achieve competitive performance in GCD tasks under conditions of known information scarcity. To tackle this challenge, we propose a decision boundary enhancement framework with affinity-based retrieval. Our framework is designed to learn the decision boundaries of known categories and transfer these boundaries to unknown categories. First, we use a decision boundary pre-training module to mitigate the overfitting of pre-trained information on known category boundaries and improve the learning of these decision boundaries using labeled samples. Second, we implement a two-stage retrieval-guided decision boundary optimization strategy. Specifically, this strategy further enhances the severely limited known boundaries by using affinity-retrieved pseudo-labeled samples. Then, these refined boundaries are applied to unknown clusters via guidance from affinity-based feature retrieval. Experimental results demonstrate that our proposed method outperforms existing methods on six public GCD benchmarks under the FSGCD setting. The codes are available at: https://github.com/Ryh1218/FSGCD

Submitted 19 June, 2025; originally announced June 2025.

Comments: Accepted by ICMR 2025

arXiv:2506.16718 [pdf, ps, other]

Generalizable Agent Modeling for Agent Collaboration-Competition Adaptation with Multi-Retrieval and Dynamic Generation

Authors: Chenxu Wang, Yonggang Jin, Cheng Hu, Youpeng Zhao, Zipeng Dai, Jian Zhao, Shiyu Huang, Liuyu Xiang, Junge Zhang, Zhaofeng He

Abstract: Adapting a single agent to a new multi-agent system brings challenges, necessitating adjustments across various tasks, environments, and interactions with unknown teammates and opponents. Addressing this challenge is highly complex, and researchers have proposed two simplified scenarios, Multi-agent reinforcement learning for zero-shot learning and Ad-Hoc Teamwork. Building on these foundations, we propose a more comprehensive setting, Agent Collaborative-Competitive Adaptation (ACCA), which evaluates an agent's ability to generalize across diverse scenarios, tasks, and interactions with both unfamiliar opponents and teammates. In ACCA, agents adjust to task and environmental changes, collaborate with unseen teammates, and compete against unknown opponents. We introduce a new modeling approach, Multi-Retrieval and Dynamic Generation (MRDG), that effectively models both teammates and opponents using their behavioral trajectories. This method incorporates a positional encoder for varying team sizes and a hypernetwork module to boost agents' learning and adaptive capabilities. Additionally, a viewpoint alignment module harmonizes the observational perspectives of retrieved teammates and opponents with the learning agent. Extensive tests in benchmark scenarios like SMAC, Overcooked-AI, and Melting Pot show that MRDG significantly improves robust collaboration and competition with unseen teammates and opponents, surpassing established baselines. Our code is available at: https://github.com/vcis-wangchenxu/MRDG.git

Submitted 19 June, 2025; originally announced June 2025.

Comments: This manuscript is under submission to Neurocomputing

Report number: NEUCOM-D-25-02272R1

arXiv:2506.16695 [pdf]

Crystal Growth of Chalcogenides and Oxy-Chalcogenides Using Chloride Exchange Reaction

Authors: Shantanu Singh, Boyang Zhao, Christopher E. Stevens, Mythili Surendran, Tzu-Chi Huang, Bi-Hsuan Lin, Joshua R. Hendrickson, Jayakanth Ravichandran

Abstract: Chalcogenides and oxy-chalcogenides, including complex chalcogenides and transition metal dichalcogenides, are emerging semiconductors with direct or indirect band gaps within the visible spectrum. These materials are being explored for various photonic and electronic applications, such as photodetectors, photovoltaics, and phase-change electronics. Understanding the fundamental properties of these materials is crucial for optimizing their functionalities. Therefore, the availability of large, high-quality single crystals of chalcogenides and oxy-chalcogenides is essential for a better comprehension of their structure and properties. In this study, we present a novel crystal growth method that utilizes the exchange reaction between BaS and ZrCl$_4$/ HfCl$_4$. By carefully controlling the stoichiometric ratio of the binary sulfide to the chloride, we can grow single crystals of several materials, such as ZrS$_2$, HfS$_2$, BaZrS$_3$, and ZrOS. This method results in large single crystals with a short reaction time of 24 to 48 hours. High-resolution thin film diffraction and single-crystal X-ray diffraction confirm the quality of the crystals produced through this exchange reaction. We also report the optical properties of these materials investigated using photoluminescence and Raman measurements. The chloride exchange reaction method paves the way for the synthesis of single crystals of chalcogenides and oxy-chalcogenide systems with a short reaction time but with low mosaicity and can be an alternative growth technique for single crystals of materials that are difficult to synthesize using conventional growth techniques.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.16691 [pdf, ps, other]

LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation

Authors: Tongtian Yue, Longteng Guo, Yepeng Tang, Zijia Zhao, Xinxin Zhu, Hua Huang, Jing Liu

Abstract: Despite the impressive advancements of Large Vision-Language Models (LVLMs), existing approaches suffer from a fundamental bottleneck: inefficient visual-language integration. Current methods either disrupt the model's inherent structure or introduce severe long-context computational burden, severely limiting scalability and efficiency. In this paper, we rethink multimodal integration and present LaVi, a novel LVLM that enables seamless and efficient vision-language fusion through internal feature modulation within the Large Language Models (LLMs). Unlike dominant LVLMs that rely on visual token concatenation, LaVi bypasses long-context expansion by introducing a lightweight and adaptive transformation, which incorporates visual context by injecting token-wise vision-conditioned deltas into the affine parameters of layer normalization. This mechanism directly modulates linguistic hidden states based on visual input, ensuring precise vision-language alignment while preserving the LLM's linguistic priors and drastically reducing computational costs. Extensive evaluations across 15 image and video benchmarks demonstrate that LaVi not only achieves state-of-the-art multimodal performance but also dramatically enhances efficiency. Compared to LLaVA-OV-7B, LaVi reduces FLOPs by 94.0%, improves inference speed by 3.1 times, and cuts memory usage in half - establishing LaVi as a scalable and practical solution for real-time multimodal reasoning. The code and models will be released soon.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.16683 [pdf, ps, other]

A Simple Contrastive Framework Of Item Tokenization For Generative Recommendation

Authors: Penglong Zhai, Yifang Yuan, Fanyi Di, Jie Li, Yue Liu, Chen Li, Jie Huang, Sicong Wang, Yao Xu, Xin Li

Abstract: Generative retrieval-based recommendation has emerged as a promising paradigm aiming at directly generating the identifiers of the target candidates. However, in large-scale recommendation systems, this approach becomes increasingly cumbersome due to the redundancy and sheer scale of the token space. To overcome these limitations, recent research has explored the use of semantic tokens as an alternative to ID tokens, which typically leveraged reconstruction-based strategies, like RQ-VAE, to quantize content embeddings and significantly reduce the embedding size. However, reconstructive quantization aims for the precise reconstruction of each item embedding independently, which conflicts with the goal of generative retrieval tasks focusing more on differentiating among items. Moreover, multi-modal side information of items, such as descriptive text and images, geographical knowledge in location-based recommendation services, has been shown to be effective in improving recommendations by providing richer contexts for interactions. Nevertheless, effectively integrating such complementary knowledge into existing generative recommendation frameworks remains challenging. To overcome these challenges, we propose a novel unsupervised deep quantization exclusively based on contrastive learning, named SimCIT (a Simple Contrastive Item Tokenization framework).
Specifically, different from existing reconstruction-based strategies, SimCIT proposes to use a learnable residual quantization module to align with the signals from different modalities of the items, which combines multi-modal knowledge alignment and semantic tokenization in a mutually beneficial contrastive learning framework. Extensive experiments across public datasets and a large-scale industrial dataset from various domains demonstrate SimCIT's effectiveness in LLM-based generative recommendation.

Submitted 19 June, 2025; originally announced June 2025.

Comments: 12 pages, 7 figures

arXiv:2506.16654 [pdf, ps, other]

Relational Deep Learning: Challenges, Foundations and Next-Generation Architectures

Authors: Vijay Prakash Dwivedi, Charilaos Kanatsoulis, Shenyang Huang, Jure Leskovec

Abstract: Graph machine learning has led to a significant increase in the capabilities of models that learn on arbitrary graph-structured data and has been applied to molecules, social networks, recommendation systems, and transportation, among other domains. Data in multi-tabular relational databases can also be constructed as 'relational entity graphs' for Relational Deep Learning (RDL) - a new blueprint that enables end-to-end representation learning without traditional feature engineering. Compared to arbitrary graph-structured data, relational entity graphs have key properties: (i) their structure is defined by primary-foreign key relationships between entities in different tables, (ii) the structural connectivity is a function of the relational schema defining a database, and (iii) the graph connectivity is temporal and heterogeneous in nature. In this paper, we provide a comprehensive review of RDL by first introducing the representation of relational databases as relational entity graphs, and then reviewing public benchmark datasets that have been used to develop and evaluate recent GNN-based RDL models. We discuss key challenges including large-scale multi-table integration and the complexities of modeling temporal dynamics and heterogeneous data, while also surveying foundational neural network methods and recent architectural advances specialized for relational entity graphs.
Finally, we explore opportunities to unify these distinct modeling challenges, highlighting how RDL converges multiple sub-fields in graph machine learning towards the design of foundation models that can transform the processing of relational data.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.16633 [pdf, ps, other]

GeoGuess: Multimodal Reasoning based on Hierarchy of Visual Information in Street View

Authors: Fenghua Cheng, Jinxiang Wang, Sen Wang, Zi Huang, Xue Li

Abstract: Multimodal reasoning is a process of understanding, integrating and inferring information across different data modalities. It has recently attracted surging academic attention as a benchmark for Artificial Intelligence (AI). Although there are various tasks for evaluating multimodal reasoning ability, they still have limitations. Lack of reasoning on hierarchical visual clues at different levels of granularity, e.g., local details and global context, is of little discussion, despite its frequent involvement in real scenarios. To bridge the gap, we introduce a novel and challenging task for multimodal reasoning, namely GeoGuess. Given a street view image, the task is to identify its location and provide a detailed explanation. A system that succeeds in GeoGuess should be able to detect tiny visual clues, perceive the broader landscape, and associate with vast geographic knowledge. Therefore, GeoGuess would require the ability to reason between hierarchical visual information and geographic knowledge. In this work, we establish a benchmark for GeoGuess by introducing a specially curated dataset GeoExplain which consists of panoramas-geocoordinates-explanation tuples. Additionally, we present a multimodal and multilevel reasoning method, namely SightSense which can make prediction and generate comprehensive explanation based on hierarchy of visual information and external knowledge. Our analysis and experiments demonstrate their outstanding performance in GeoGuess.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.16595 [pdf]

Optimizing Time-resolved Magneto-optical Kerr Effect for High-fidelity Magnetic Characterization

Authors: Yun Kim, Dingbin Huang, Deyuan Lyu, Haoyue Sun, Jian-Ping Wang, Paul A. Crowell, Xiaojia Wang

Abstract: Spintronics has emerged as a key technology for fast and non-volatile memory with great CMOS compatibility. As the building blocks for these cutting-edge devices, magnetic materials require precise characterization of their critical properties, such as the effective anisotropy field ($H_{\rm{k,eff}}$, related to magnetic stability) and damping ($α$, a key factor in device energy efficiency). Accurate measurements of these properties are essential for designing and fabricating high-performance spintronic devices. Among advanced metrology techniques, Time-resolved Magneto-Optical Kerr Effect (TR-MOKE) stands out for its superb temporal and spatial resolutions, surpassing traditional methods like ferromagnetic resonance (FMR). However, the full potential of TR-MOKE has not yet been fully pledged due to the lack of systematic optimization and robust operational guidelines. In this study, we address this gap by developing experimentally validated guidelines for optimizing TR-MOKE metrology across materials with perpendicular magnetic anisotropy (PMA) and in-plane magnetic anisotropy (IMA). Our work identifies the optimal ranges of the field angle to simultaneously achieve high signal amplitudes and improve measurement sensitivities to $H_{\rm{k,eff}}$ and $α$. By suppressing the influence of inhomogeneities and boosting sensitivity, our work significantly enhances TR-MOKE capability to extract magnetic properties with high accuracy and reliability.
This optimization framework positions TR-MOKE as an indispensable tool for advancing spintronics, paving the way for energy-efficient and high-speed devices that will redefine the landscape of modern computing and memory technologies.

Submitted 19 June, 2025; originally announced June 2025.

Comments: Submitted to Appl. Phys. Lett. Manuscript: 16 pages, 5 figures; Supplementary Materials: 18 pages, 12 figures

arXiv:2506.16594 [pdf, ps, other]

A Scoping Review of Synthetic Data Generation for Biomedical Research and Applications

Authors: Hanshu Rao, Weisi Liu, Haohan Wang, I-Chan Huang, Zhe He, Xiaolei Huang

Abstract: Synthetic data generation--mitigating data scarcity, privacy concerns, and data quality challenges in biomedical fields--has been facilitated by rapid advances of large language models (LLMs). This scoping review follows PRISMA-ScR guidelines and synthesizes 59 studies, published between 2020 and 2025 and collected from PubMed, ACM, Web of Science, and Google Scholar. The review systematically examines biomedical research and application trends in synthetic data generation, emphasizing clinical applications, methodologies, and evaluations. Our analysis identifies data modalities of unstructured texts (78.0%), tabular data (13.6%), and multimodal sources (8.4%); generation methods of prompting (72.9%), fine-tuning (22.0%) LLMs and specialized model (5.1%); and heterogeneous evaluations of intrinsic metrics (27.1%), human-in-the-loop assessments (55.9%), and LLM-based evaluations (13.6%). The analysis addresses current limitations in what, where, and how health professionals can leverage synthetic data generation for biomedical domains. Our review also highlights challenges in adaption across clinical domains, resource and model accessibility, and evaluation standardizations.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.16578 [pdf, ps, other]

SafeTriage: Facial Video De-identification for Privacy-Preserving Stroke Triage

Authors: Tongan Cai, Haomiao Ni, Wenchao Ma, Yuan Xue, Qian Ma, Rachel Leicht, Kelvin Wong, John Volpi, Stephen T. C. Wong, James Z. Wang, Sharon X. Huang

Abstract: Effective stroke triage in emergency settings often relies on clinicians' ability to identify subtle abnormalities in facial muscle coordination. While recent AI models have shown promise in detecting such patterns from patient facial videos, their reliance on real patient data raises significant ethical and privacy challenges -- especially when training robust and generalizable models across institutions. To address these concerns, we propose SafeTriage, a novel method designed to de-identify patient facial videos while preserving essential motion cues crucial for stroke diagnosis. SafeTriage leverages a pretrained video motion transfer (VMT) model to map the motion characteristics of real patient faces onto synthetic identities. This approach retains diagnostically relevant facial dynamics without revealing the patients' identities. To mitigate the distribution shift between normal population pre-training videos and patient population test videos, we introduce a conditional generative model for visual prompt tuning, which adapts the input space of the VMT model to ensure accurate motion transfer without needing to fine-tune the VMT model backbone. Comprehensive evaluation, including quantitative metrics and clinical expert assessments, demonstrates that SafeTriage-produced synthetic videos effectively preserve stroke-relevant facial patterns, enabling reliable AI-based triage.
Our evaluations also show that SafeTriage provides robust privacy protection while maintaining diagnostic accuracy, offering a secure and ethically sound foundation for data sharing and AI-driven clinical analysis in neurological disorders.

Submitted 19 June, 2025; originally announced June 2025.

Comments: IPMI 2025

arXiv:2506.16531 [pdf, ps, other]

How Hard Is Snow? A Paired Domain Adaptation Dataset for Clear and Snowy Weather: CADC+

Authors: Mei Qi Tang, Sean Sedwards, Chengjie Huang, Krzysztof Czarnecki

Abstract: The impact of snowfall on 3D object detection performance remains underexplored. Conducting such an evaluation requires a dataset with sufficient labelled data from both weather conditions, ideally captured in the same driving environment. Current driving datasets with LiDAR point clouds either do not provide enough labelled data in both snowy and clear weather conditions, or rely on de-snowing methods to generate synthetic clear weather. Synthetic data often lacks realism and introduces an additional domain shift that confounds accurate evaluations. To address these challenges, we present CADC+, the first paired weather domain adaptation dataset for autonomous driving in winter conditions. CADC+ extends the Canadian Adverse Driving Conditions dataset (CADC) using clear weather data that was recorded on the same roads and in the same period as CADC. To create CADC+, we pair each CADC sequence with a clear weather sequence that matches the snowy sequence as closely as possible. CADC+ thus minimizes the domain shift resulting from factors unrelated to the presence of snow. We also present some preliminary results using CADC+ to evaluate the effect of snow on 3D object detection performance. We observe that snow introduces a combination of aleatoric and epistemic uncertainties, acting as both noise and a distinct data domain.

Submitted 19 June, 2025; originally announced June 2025.

Comments: IEEE IV 2025

arXiv:2506.16504 [pdf, ps, other]

Hunyuan3D 2.5: Towards High-Fidelity 3D Assets Generation with Ultimate Details

Authors: Zeqiang Lai, Yunfei Zhao, Haolin Liu, Zibo Zhao, Qingxiang Lin, Huiwen Shi, Xianghui Yang, Mingxin Yang, Shuhui Yang, Yifei Feng, Sheng Zhang, Xin Huang, Di Luo, Fan Yang, Fang Yang, Lifu Wang, Sicong Liu, Yixuan Tang, Yulin Cai, Zebin He, Tian Liu, Yuhong Liu, Jie Jiang, Linus, Jingwei Huang, et al. (1 additional author not shown)

Abstract: In this report, we present Hunyuan3D 2.5, a robust suite of 3D diffusion models aimed at generating high-fidelity and detailed textured 3D assets. Hunyuan3D 2.5 follows the two-stage pipeline of its previous version Hunyuan3D 2.0, while demonstrating substantial advancements in both shape and texture generation. In terms of shape generation, we introduce a new shape foundation model -- LATTICE, which is trained with scaled high-quality datasets, model-size, and compute. Our largest model reaches 10B parameters and generates sharp and detailed 3D shape with precise image-3D following while keeping mesh surface clean and smooth, significantly closing the gap between generated and handcrafted 3D shapes. In terms of texture generation, it is upgraded with physically based rendering (PBR) via a novel multi-view architecture extended from Hunyuan3D 2.0 Paint model. Our extensive evaluation shows that Hunyuan3D 2.5 significantly outperforms previous methods in both shape and end-to-end texture generation.

Submitted 19 June, 2025; originally announced June 2025.

Comments: Technical report

arXiv:2506.16481 [pdf, ps, other]

SO emission in the dynamically perturbed protoplanetary disks around CQ Tau and MWC 758

Authors: Francesco Zagaria, Haochang Jiang, Gianni Cataldi, Stefano Facchini, Myriam Benisty, Yuri Aikawa, Sean Andrews, Jaehan Bae, Marcelo Barraza-Alfaro, Pietro Curone, Ian Czekala, Daniele Fasano, Cassandra Hall, Iain Hammond, Jane Huang, John D. Ilee, Andrés F. Izquierdo, Jensen Lawrence, Giuseppe Lodato, François Ménard, Christophe Pinte, Giovanni P. Rosotti, Jochen Stadler, Richard Teague, Leonardo Testi, et al. (3 additional authors not shown)

Abstract: We report the serendipitous detection of the SO $J_N=6_5-5_4$ (219.949 GHz) rotational transition in archival Atacama Large Millimeter/submillimeter Array (ALMA) observations of the spiral hosting protoplanetary disks around CQ Tau (with $\approx4.9σ$ significance) and MWC 758 (with $\approx3.4σ$ significance). In the former, the SO emission comes in the shape of a ring, arises from the edge of the continuum cavity, and is qualitatively consistent, at the currently available spectral resolution, with being in Keplerian rotation. In the latter, instead, while arising primarily from inside the continuum cavity, the SO emission also extends to the continuum ring(s), and its morphology and kinematics are less clear. We put these sources in the context of the other protoplanetary disks where SO detections have been previously reported in the literature and discuss the possible origins of SO in terms of (thermal) desorption or formation in the gas phase. We argue that these processes might be fostered by dynamical perturbations caused by unseen embedded massive companions, shadows, or late-time infall, thus suggesting a possible link between perturbed dynamics and SO emission in (these) protoplanetary disks. If confirmed, our interpretation would imply that chemical evolution timescales could be significantly shorter in these systems than is commonly assumed, indicating that dynamical perturbations might influence the composition of newborn (proto-)planets by altering the volatile makeup of their formation environment.

Submitted 19 June, 2025; originally announced June 2025.

Comments: Accepted for publication in ApJ. 23 pages, 7 figures

arXiv:2506.16447 [pdf, ps, other]

Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models

Authors: Biao Yi, Tiansheng Huang, Sishuo Chen, Tong Li, Zheli Liu, Zhixuan Chu, Yiming Li

Abstract: Backdoor unalignment attacks against Large Language Models (LLMs) enable the stealthy compromise of safety alignment using a hidden trigger while evading normal safety auditing. These attacks pose significant threats to the applications of LLMs in the real-world Large Language Model as a Service (LLMaaS) setting, where the deployed model is a fully black-box system that can only interact through text. Furthermore, the sample-dependent nature of the attack target exacerbates the threat. Instead of outputting a fixed label, the backdoored LLM follows the semantics of any malicious command with the hidden trigger, significantly expanding the target space. In this paper, we introduce BEAT, a black-box defense that detects triggered samples during inference to deactivate the backdoor. It is motivated by an intriguing observation (dubbed the probe concatenate effect), where concatenated triggered samples significantly reduce the refusal rate of the backdoored LLM towards a malicious probe, while non-triggered samples have little effect. Specifically, BEAT identifies whether an input is triggered by measuring the degree of distortion in the output distribution of the probe before and after concatenation with the input. Our method addresses the challenges of sample-dependent targets from an opposite perspective. It captures the impact of the trigger on the refusal signal (which is sample-independent) instead of sample-specific successful attack behaviors. It overcomes black-box access limitations by using multiple sampling to approximate the output distribution.
Extensive experiments are conducted on various backdoor attacks and LLMs (including the closed-source GPT-3.5-turbo), verifying the effectiveness and efficiency of our defense. Besides, we also preliminarily verify that BEAT can effectively defend against popular jailbreak attacks, as they can be regarded as 'natural backdoors'.

Submitted 19 June, 2025; originally announced June 2025.

Comments: Accepted at ICLR 2025

Journal ref: Proceedings of The Thirteenth International Conference on Learning Representations (ICLR 2025)

arXiv:2506.16398 [pdf, ps, other]

HyperPath: Knowledge-Guided Hyperbolic Semantic Hierarchy Modeling for WSI Analysis

Authors: Peixiang Huang, Yanyan Huang, Weiqin Zhao, Junjun He, Lequan Yu

Abstract: Pathology is essential for cancer diagnosis, with multiple instance learning (MIL) widely used for whole slide image (WSI) analysis. WSIs exhibit a natural hierarchy -- patches, regions, and slides -- with distinct semantic associations. While some methods attempt to leverage this hierarchy for improved representation, they predominantly rely on Euclidean embeddings, which struggle to fully capture semantic hierarchies. To address this limitation, we propose HyperPath, a novel method that integrates knowledge from textual descriptions to guide the modeling of semantic hierarchies of WSIs in hyperbolic space, thereby enhancing WSI classification. Our approach adapts both visual and textual features extracted by pathology vision-language foundation models to the hyperbolic space. We design an Angular Modality Alignment Loss to ensure robust cross-modal alignment, while a Semantic Hierarchy Consistency Loss further refines feature hierarchies through entailment and contradiction relationships, thus enhancing semantic coherence. The classification is performed with geodesic distance, which measures the similarity between entities in the hyperbolic semantic hierarchy. This eliminates the need for linear classifiers and enables a geometry-aware approach to WSI analysis. Extensive experiments show that our method achieves superior performance across tasks compared to existing methods, highlighting the potential of hyperbolic embeddings for WSI analysis.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.16381 [pdf, ps, other]

InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems

Authors: Kexin Huang, Qian Tu, Liwei Fan, Chenchen Yang, Dong Zhang, Shimin Li, Zhaoye Fei, Qinyuan Cheng, Xipeng Qiu

Abstract: In modern speech synthesis, paralinguistic information--such as a speaker's vocal timbre, emotional state, and dynamic prosody--plays a critical role in conveying nuance beyond mere semantics. Traditional Text-to-Speech (TTS) systems rely on fixed style labels or inserting a speech prompt to control these cues, which severely limits flexibility. Recent attempts seek to employ natural-language instructions to modulate paralinguistic features, substantially improving the generalization of instruction-driven TTS models. Although many TTS systems now support customized synthesis via textual description, their actual ability to interpret and execute complex instructions remains largely unexplored. In addition, there is still a shortage of high-quality benchmarks and automated evaluation metrics specifically designed for instruction-based TTS, which hinders accurate assessment and iterative optimization of these models. To address these limitations, we introduce InstructTTSEval, a benchmark for measuring the capability of complex natural-language style control. We introduce three tasks, namely Acoustic-Parameter Specification, Descriptive-Style Directive, and Role-Play, including English and Chinese subsets, each with 1k test cases (6k in total) paired with reference audio. We leverage Gemini as an automatic judge to assess their instruction-following abilities. Our evaluation of accessible instruction-following TTS systems highlights substantial room for further improvement.
We anticipate that InstructTTSEval will drive progress toward more powerful, flexible, and accurate instruction-following TTS.

Submitted 19 June, 2025; originally announced June 2025.

Comments: 19 pages, 9 figures

arXiv:2506.16346 [pdf]

Preferred Synthesis of Armchair SnS2 Nanotubes

Authors: Abid, Luneng Zhao, Ju Huang, Yongjia Zheng, Yuta Sato, Qingyun Lin, Zhen Han, Chunxia Yang, Tianyu Wang, Bill Herve Nduwarugira, Yicheng Ma, Lingfeng Wang, Yige Zheng, Hang Wang, Salman Ullah, Afzal Khan, Qi Zhang, Wenbin Li, Junfeng Gao, Bingfeng Ju, Feng Ding, Yan Li, Kazu Suenaga, Shigeo Maruyama, Huayong Yang, et al. (1 additional author not shown)

Abstract: In this work, we present the synthesis of tin disulfide (SnS2) nanotubes (NTs) with preferred chiral angle. A sacrificial template is used to create channels of boron nitride nanotubes (BNNTs) with an optimized diameter of 4-5 nm, inside of which SnS2 NTs are formed with high yield and structural purity. Atomic resolution imaging and nano-area electron diffraction reveal that these synthesized SnS2 NTs prefer to have an armchair configuration with a probability of approximately 85%. Calculations using density functional theory (DFT) reveal a negligible difference in the formation energy between armchair and zigzag NTs, suggesting that structural stability does not play a key role in this chirality-selective growth. However, a detailed TEM investigation revealed that some SnS2 nanoribbons are found connected to the ends of SnS2 NTs, and that these nanoribbons primarily have a zigzag configuration. Subsequent DFT and machine learning potential molecular dynamic simulations verify that nanoribbons with zigzag configurations are more stable than armchair ones, and indeed zigzag nanoribbons aligned along the BNNT axis tend to roll up to form armchair SnS2 NTs. Finally, this "zigzag nanoribbon to armchair nanotube" transition hypothesis is verified by in-situ high-resolution transmission electron microscopy, in which the transformation of SnS2 nanoribbons into a nanotube is reproduced in real time. This work is the first demonstration of preferred-chirality growth of transition metal dichalcogenide nanotubes.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.16336 [pdf, ps, other]

Goal-conditioned Hierarchical Reinforcement Learning for Sample-efficient and Safe Autonomous Driving at Intersections

Authors: Yiou Huang

Abstract: Reinforcement learning (RL) exhibits remarkable potential in addressing autonomous driving tasks. However, it is difficult to train a sample-efficient and safe policy in complex scenarios. In this article, we propose a novel hierarchical reinforcement learning (HRL) framework with a goal-conditioned collision prediction (GCCP) module. In the hierarchical structure, the GCCP module predicts collision risks according to different potential subgoals of the ego vehicle. A high-level decision-maker chooses the best safe subgoal. A low-level motion-planner interacts with the environment according to the subgoal. Compared to traditional RL methods, our algorithm is more sample-efficient, since its hierarchical structure allows reusing the policies of subgoals across similar tasks for various navigation scenarios. In addition, the GCCP module's ability to predict both the ego vehicle's and surrounding vehicles' future actions according to different subgoals ensures the safety of the ego vehicle throughout the decision-making process. Experimental results demonstrate that the proposed method converges to an optimal policy faster and achieves higher safety than traditional RL methods.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.16317 [pdf, ps, other]

Two loop QCD corrections to $e^+ e^- \to J/ψ+ η_c$ in asymptotic expansion

Authors: Cong Li, Xu-Dong Huang, Wen-Long Sang

Abstract: Within the framework of NRQCD, the short-distance coefficients (SDCs) for the process $e^+e^-\to J/ψ+η_c$ have been obtained up to NNLO in asymptotic expansions over $r={16m_c^2}/{s}$ up to $r^{15}$. Although these asymptotic expressions deviate from the full results near the threshold $r= 1$, they provide excellent approximations to the full results for $r<0.8$, with deviations less than $3\%$. Therefore, these asymptotic expressions offer reliable applications for phenomenological predictions across a wide range of center-of-mass energies $\sqrt{s}$. Utilizing these asymptotic expressions, we present phenomenological predictions for the cross sections in both the on-shell mass scheme and the $\overline{\rm MS}$ mass scheme, with the uncertainty arising from the renormalization scale $μ_R$ included. The $μ_R$ uncertainty for predictions from the $\overline{\rm MS}$ mass scheme is slightly larger than that from the on-shell mass scheme, which is partly attributed to the helicity flip in the process $e^+e^-\to J/ψ+η_c$. We observe that both mass schemes yield quite similar predictions, and our theoretical results are consistent with the available experimental data.

Submitted 19 June, 2025; originally announced June 2025.

Comments: 18 pages, 4 figures, 1 tables, 1 attached file

arXiv:2506.16305 [pdf, ps, other]

A remark for fully non-linear elliptic equations on compact almost Hermitian manifolds

Authors: Liding Huang

Abstract: In this paper, we generalize the definition of sub-slope, introduced by Guo-Song, to almost Hermitian manifolds and prove the existence of solutions for a general class of fully non-linear equations on compact almost Hermitian manifolds. As an application, we solve the complex Hessian quotient equation and the deformed Hermitian-Yang-Mills equation in the almost Hermitian setting.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.16265 [pdf, ps, other]

Dense 3D Displacement Estimation for Landslide Monitoring via Fusion of TLS Point Clouds and Embedded RGB Images

Authors: Zhaoyi Wang, Jemil Avers Butt, Shengyu Huang, Tomislav Medic, Andreas Wieser

Abstract: Landslide monitoring is essential for understanding geohazards and mitigating associated risks. However, existing point cloud-based methods typically rely on either geometric or radiometric information and often yield sparse or non-3D displacement estimates. In this paper, we propose a hierarchical partition-based coarse-to-fine approach that fuses 3D point clouds and co-registered RGB images to estimate dense 3D displacement vector fields. We construct patch-level matches using both 3D geometry and 2D image features. These matches are refined via geometric consistency checks, followed by rigid transformation estimation per match. Experimental results on two real-world landslide datasets demonstrate that our method produces 3D displacement estimates with high spatial coverage (79% and 97%) and high accuracy. Deviations in displacement magnitude with respect to external measurements (total station or GNSS observations) are 0.15 m and 0.25 m on the two datasets, respectively, and only 0.07 m and 0.20 m compared to manually derived references. These values are below the average scan resolutions (0.08 m and 0.30 m). Our method outperforms the state-of-the-art method F2S3 in spatial coverage while maintaining comparable accuracy. Our approach offers a practical and adaptable solution for TLS-based landslide monitoring and is extensible to other types of point clouds and monitoring tasks. Our example data and source code are publicly available at https://github.com/zhaoyiww/fusion4landslide.

Submitted 19 June, 2025; originally announced June 2025.

Comments: 20 pages, 16 figures. Preprint under peer review. Example data and code available at [GitHub](https://github.com/zhaoyiww/fusion4landslide)

arXiv:2506.16261 [pdf, ps, other]

Global well-posedness for 2D compressible radially symmetric Navier-Stokes equations with swirl

Authors: Xiangdi Huang, Weili Meng

Abstract: In this paper, we consider the radially symmetric compressible Navier-Stokes equations with swirl in two-dimensional disks, where the shear viscosity coefficient $μ= \text{const}> 0$, and the bulk one $λ= ρ^β(β>0)$. When $β\geq 1$, we prove the global existence and asymptotic behavior of the large strong solutions for initial values that allow for vacuum. One of the key ingredients is to show the uniform boundedness of the density independent of the time. When $β\in(0,1)$, we prove the same conclusion holds when the initial value satisfies $\|ρ_0\|_{L^\infty} \leq a_0$, where $a_0$ is given by \eqref{def a_0} as in Theorem \ref{Thm3}. To the best of our knowledge, this is the first result on the global existence of large strong solutions for 2D compressible Navier-Stokes equation with real non-slip (non Navier-slip) boundary conditions when $β\ge1$ and the first result on the global existence of strong solutions when $β\in(0,1)$.

Submitted 19 June, 2025; originally announced June 2025.

Comments: 44 pages

MSC Class: 35Q30; 76N10

arXiv:2506.16250 [pdf, ps, other]

Graph-Cover-based Characterization of the Bethe Partition Function of Double-Edge Factor Graphs

Authors: Yuwen Huang, Pascal O. Vontobel

Abstract: For standard factor graphs (S-FGs) with non-negative real-valued local functions, Vontobel provided a combinatorial characterization of the Bethe approximation of the partition function, also known as the Bethe partition function, using finite graph covers. The proof of this characterization, i.e., the graph-cover theorem for S-FGs, heavily relied on the method of types. In this paper, we study double-edge factor graphs (DE-FGs), a class of factor graphs where each local function takes complex values and satisfies some positive semi-definiteness constraints. DE-FGs and their partition functions are particularly relevant for quantum information processing. Approximating the partition function of a DE-FG is more difficult than for an S-FG, as it involves summing complex values instead of non-negative real values. We develop the sum-product algorithm (SPA) fixed-point-based Bethe approximation of the partition function. However, one cannot directly apply the method of types to prove a similar combinatorial characterization as in the case of S-FGs. We provide a combinatorial characterization of the Bethe partition function in terms of finite graph covers for a class of DE-FGs that satisfy a specific, easily checkable condition. Towards proving this characterization, we apply a suitable loop-calculus transform (LCT) to these graphs. Originally, the LCT was introduced by Chertkov and Chernyak as a special linear transform for S-FGs and later extended by Mori. Our proposed LCT is applicable for both DE-FGs and S-FGs and generalizes prior versions by handling zero-valued SPA fixed-point message components, which are common in DE-FGs. Supported by numerical results, we conjecture that this combinatorial characterization of the Bethe partition function in terms of finite graph covers holds more broadly for DE-FGs.

Submitted 19 June, 2025; originally announced June 2025.

Comments: arXiv admin note: substantial text overlap with arXiv:2412.05942

arXiv:2506.16233 [pdf, ps, other]

Can AI Dream of Unseen Galaxies? Conditional Diffusion Model for Galaxy Morphology Augmentation

Authors: Chenrui Ma, Zechang Sun, Tao Jing, Zheng Cai, Yuan-Sen Ting, Song Huang, Mingyu Li

Abstract: Observational astronomy relies on visual feature identification to detect critical astrophysical phenomena. While machine learning (ML) increasingly automates this process, models often struggle with generalization in large-scale surveys due to the limited representativeness of labeled datasets -- whether from simulations or human annotation -- a challenge pronounced for rare yet scientifically valuable objects. To address this, we propose a conditional diffusion model to synthesize realistic galaxy images for augmenting ML training data. Leveraging the Galaxy Zoo 2 dataset, which contains visual feature -- galaxy image pairs from volunteer annotation, we demonstrate that our model generates diverse, high-fidelity galaxy images that closely adhere to the specified morphological feature conditions. Moreover, this model enables generative extrapolation to project well-annotated data into unseen domains and advance rare object detection. Integrating synthesized images into ML pipelines improves performance in standard morphology classification, boosting completeness and purity by up to 30\% across key metrics. For rare object detection, using early-type galaxies with prominent dust lane features ($\sim$0.1\% in GZ2 dataset) as a test case, our approach doubled the number of detected instances from 352 to 872, compared to previous studies based on visual inspection. This study highlights the power of generative models to bridge gaps between scarce labeled data and the vast, uncharted parameter space of observational astronomy and sheds insight for future astrophysical foundation model developments. Our project homepage is available at https://galaxysd-webpage.streamlit.app/.

Submitted 19 June, 2025; originally announced June 2025.

Comments: We have submitted to AAS journals. See another independent work for further reference -- Category-based Galaxy Image Generation via Diffusion Models (Fan, Tang et al.). Comments are welcome

arXiv:2506.16211 [pdf, ps, other]

ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models

Authors: Puhao Li, Yingying Wu, Ziheng Xi, Wanlin Li, Yuzhe Huang, Zhiyuan Zhang, Yinghan Chen, Jianan Wang, Song-Chun Zhu, Tengyu Liu, Siyuan Huang

Abstract: Learning real-world robotic manipulation is challenging, particularly when limited demonstrations are available. Existing methods for few-shot manipulation often rely on simulation-augmented data or pre-built modules like grasping and pose estimation, which struggle with sim-to-real gaps and lack extensibility. While large-scale imitation pre-training shows promise, adapting these general-purpose policies to specific tasks in data-scarce settings remains unexplored. To achieve this, we propose ControlVLA, a novel framework that bridges pre-trained VLA models with object-centric representations via a ControlNet-style architecture for efficient fine-tuning. Specifically, to introduce object-centric conditions without overwriting prior knowledge, ControlVLA zero-initializes a set of projection layers, allowing them to gradually adapt the pre-trained manipulation policies. In real-world experiments across 6 diverse tasks, including pouring cubes and folding clothes, our method achieves a 76.7% success rate while requiring only 10-20 demonstrations -- a significant improvement over traditional approaches that require more than 100 demonstrations to achieve comparable success. Additional experiments highlight ControlVLA's extensibility to long-horizon tasks and robustness to unseen objects and backgrounds.

Submitted 19 June, 2025; originally announced June 2025.

Comments: Website: https://controlvla.github.io

arXiv:2506.16210 [pdf, ps, other]

From Coarse to Continuous: Progressive Refinement Implicit Neural Representation for Motion-Robust Anisotropic MRI Reconstruction

Authors: Zhenxuan Zhang, Lipei Zhang, Yanqi Cheng, Zi Wang, Fanwen Wang, Haosen Zhang, Yue Yang, Yinzhe Wu, Jiahao Huang, Angelica I Aviles-Rivero, Zhifan Gao, Guang Yang, Peter J. Lally

Abstract: In motion-robust magnetic resonance imaging (MRI), slice-to-volume reconstruction is critical for recovering anatomically consistent 3D brain volumes from 2D slices, especially under accelerated acquisitions or patient motion. However, this task remains challenging due to hierarchical structural disruptions. It includes local detail loss from k-space undersampling, global structural aliasing caused by motion, and volumetric anisotropy. Therefore, we propose a progressive refinement implicit neural representation (PR-INR) framework. Our PR-INR unifies motion correction, structural refinement, and volumetric synthesis within a geometry-aware coordinate space. Specifically, a motion-aware diffusion module is first employed to generate coarse volumetric reconstructions that suppress motion artifacts and preserve global anatomical structures. Then, we introduce an implicit detail restoration module that performs residual refinement by aligning spatial coordinates with visual features. It corrects local structures and enhances boundary precision. Further, a voxel continuous-aware representation module represents the image as a continuous function over 3D coordinates. It enables accurate inter-slice completion and high-frequency detail recovery. We evaluate PR-INR on five public MRI datasets under various motion conditions (3% and 5% displacement), undersampling rates (4x and 8x) and slice resolutions (scale = 5). Experimental results demonstrate that PR-INR outperforms state-of-the-art methods in both quantitative reconstruction metrics and visual quality. It further shows generalization and robustness across diverse unseen domains.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.16136 [pdf, ps, other]

Seeing is Fixing: Cross-Modal Reasoning with Multimodal LLMs for Visual Software Issue Fixing

Authors: Kai Huang, Jian Zhang, Xiaofei Xie, Chunyang Chen

Abstract: Large language model (LLM)-based automated program repair (APR) techniques have shown promising results in resolving real-world GitHub issue tasks. Existing APR systems are primarily evaluated in unimodal settings (e.g., SWE-bench). However, these autonomous systems struggle to resolve multimodal problem scenarios (e.g., SWE-bench M) due to limitations in interpreting and leveraging visual information. In multimodal scenarios, LLMs need to rely on visual information in the graphical user interface (GUI) to understand bugs and generate fixes. To bridge this gap, we propose GUIRepair, a cross-modal reasoning approach for resolving multimodal issue scenarios by understanding and capturing visual information. Specifically, GUIRepair integrates two key components, Image2Code and Code2Image, to enhance fault comprehension and patch validation. Image2Code extracts relevant project documents based on the issue report, then applies this domain knowledge to generate the reproduced code responsible for the visual symptoms, effectively translating GUI images into executable context for better fault comprehension. Code2Image replays the visual issue scenario using the reproduced code and captures GUI renderings of the patched program to assess whether the fix visually resolves the issue, providing feedback for patch validation. We evaluate GUIRepair on SWE-bench M, and the approach demonstrates significant effectiveness. When utilizing GPT-4o as the base model, GUIRepair solves 157 instances, outperforming the best open-source baseline by 26 instances. Furthermore, when using o4-mini as the base model, GUIRepair can achieve even better results and solve 175 instances, outperforming the top commercial system by 22 instances. This emphasizes the success of our new perspective on incorporating cross-modal reasoning by understanding and capturing visual information to resolve multimodal issues.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.16112 [pdf, ps, other]

AutoV: Learning to Retrieve Visual Prompt for Large Vision-Language Models

Authors: Yuan Zhang, Chun-Kai Fan, Tao Huang, Ming Lu, Sicheng Yu, Junwen Pan, Kuan Cheng, Qi She, Shanghang Zhang

Abstract: Inspired by text prompts in large language models (LLMs), visual prompts have been explored to enhance the reasoning capabilities of large vision-language models (LVLMs). Current methods design heuristic visual prompts, such as overlaying a text-query-guided attention heatmap on the original input image. However, designing effective prompts manually is challenging and time-consuming, and it often fails to explore the benefits of different visual prompts, leading to sub-optimal performance. To this end, we propose AutoV that learns to automatically select the optimal visual prompt from various candidates based on given textual queries and the input image. To train AutoV, we developed an automatic data collection and labeling pipeline that evaluates various visual prompts with a pre-trained LVLM. We input a set of visual prompts into the LVLM and rank them according to the prediction losses generated by the model. Using the ranking as a supervision signal, we train AutoV to automatically choose the optimal visual prompt from various visual prompts for LVLMs. Experimental results indicate that AutoV enhances the performance of various LVLMs across multiple popular image understanding tasks. For instance, LLaVA-OV with AutoV achieves $1.7\%$ accuracy gain on LLaVA$^{\text{Wild}}$, and AutoV boosts Qwen2.5-VL by $1.9\%$ on MMMU, highlighting its potential as an optimal visual prompting method for LVLMs.

Submitted 19 June, 2025; originally announced June 2025.

Comments: 19 pages

arXiv:2506.16102 [pdf, ps, other]

Fast Training-free Perceptual Image Compression

Authors: Ziran Zhu, Tongda Xu, Minye Huang, Dailan He, Xingtong Ge, Xinjie Zhang, Ling Li, Yan Wang

Abstract: Training-free perceptual image codecs adopt a pre-trained unconditional generative model during decoding to avoid training a new conditional generative model. However, they rely heavily on diffusion inversion or sample communication, which take from 1 min to an intractable amount of time to decode a single image. In this paper, we propose a training-free algorithm that improves the perceptual quality of any existing codec with a theoretical guarantee. We further propose different implementations for optimal perceptual quality when the decoding time budget is $\approx 0.1$s, $0.1-10$s and $\ge 10$s. Our approach: (1) improves the decoding time of training-free codecs from 1 min to $0.1-10$s with comparable perceptual quality; (2) can be applied to non-differentiable codecs such as VTM; (3) can be used to improve previous perceptual codecs, such as MS-ILLM; (4) can easily achieve a perception-distortion trade-off. Empirically, we show that our approach successfully improves the perceptual quality of ELIC, VTM and MS-ILLM with fast decoding. Our approach achieves comparable FID to previous training-free codecs with significantly less decoding time. Our approach also outperforms previous conditional generative model based codecs such as HiFiC and MS-ILLM in terms of FID. The source code is provided in the supplementary material.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.16100 [pdf, ps, other]

Seesaw Portal to Super Heavy Dark Matter with $Z_3$ Symmetry

Authors: Cai-Xia Yang, Zhi-Long Han, Fei Huang, Yi Jin, Honglei Li

Abstract: Right-handed neutrinos $N$ are introduced to explain the origin of the tiny neutrino masses via the seesaw mechanism. Required by relatively large Yukawa coupling and leptogenesis, masses of right-handed neutrinos are beyond $10^{9}$ GeV. Such heavy right-handed neutrinos can mediate the production of super heavy dark matter $χ$ via the freeze-in mechanism. In the minimal $Z_2$ symmetric model, the right-handed neutrino portal interaction is $y_N φ\barχ N$ with the dark scalar $φ$. One drawback of the $Z_2$ symmetric model is that the mass ordering $m_N>m_φ$ with long-lived $φ$ is almost ruled out by Big Bang Nucleosynthesis. In this paper, we propose that by extending the dark symmetry to $Z_3$, one additional interaction $y_χφ\barχ^c χ$ is further allowed. In this way, the new decay mode $φ\to χχ$ would lead to the dark scalar $φ$ being short-lived even with a feeble $y_χ$, thus it is allowed by the cosmological constraints. The phenomenology of the $Z_3$ symmetric super heavy dark matter model is also studied in this paper.

Submitted 19 June, 2025; originally announced June 2025.

Comments: 19 pages, 7 figures

arXiv:2506.16078 [pdf, ps, other]

Probing the Robustness of Large Language Models Safety to Latent Perturbations

Authors: Tianle Gu, Kexin Huang, Zongqi Wang, Yixu Wang, Jie Li, Yuanqi Yao, Yang Yao, Yujiu Yang, Yan Teng, Yingchun Wang

Abstract: Safety alignment is a key requirement for building reliable Artificial General Intelligence. Despite significant advances in safety alignment, we observe that minor latent shifts can still trigger unsafe responses in aligned models. We argue that this stems from the shallow nature of existing alignment methods, which focus on surface-level refusal behaviors without sufficiently altering internal representations. Consequently, small shifts in hidden activations can re-trigger harmful behaviors embedded in the latent space. To explore the robustness of safety alignment to latent perturbations, we introduce a probing method that measures the Negative Log-Likelihood of the original response generated by the model. This probe quantifies local sensitivity in the latent space, serving as a diagnostic tool for identifying vulnerable directions. Based on this signal, we construct effective jailbreak trajectories, giving rise to the Activation Steering Attack (ASA). More importantly, these insights offer a principled foundation for improving alignment robustness. To this end, we introduce Layer-wise Adversarial Patch Training (LAPT), a fine-tuning strategy that injects controlled perturbations into hidden representations during training. Experimental results highlight that LAPT strengthens alignment robustness without compromising general capabilities. Our findings reveal fundamental flaws in current alignment paradigms and call for representation-level training strategies that move beyond surface-level behavior supervision. Codes and results are available at https://github.com/Carol-gutianle/LatentSafety.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.16037 [pdf, ps, other]

Enhancing Document-Level Question Answering via Multi-Hop Retrieval-Augmented Generation with LLaMA 3

Authors: Xinyue Huang, Ziqi Lin, Fang Sun, Wenchao Zhang, Kejian Tong, Yunbo Liu

Abstract: This paper presents a novel Retrieval-Augmented Generation (RAG) framework tailored for complex question answering tasks, addressing challenges in multi-hop reasoning and contextual understanding across lengthy documents. Built upon LLaMA 3, the framework integrates a dense retrieval module with advanced context fusion and multi-hop reasoning mechanisms, enabling more accurate and coherent response generation. A joint optimization strategy combining retrieval likelihood and generation cross-entropy improves the model's robustness and adaptability. Experimental results show that the proposed system outperforms existing retrieval-augmented and generative baselines, confirming its effectiveness in delivering precise, contextually grounded answers.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.16031 [pdf, ps, other]

Longtime Monitoring of TeV Radio Galaxies with HAWC

Authors: R. Alfaro, C. Alvarez, E. Anita-Rangel, J. C. Arteaga-Velázquez, D. Avila Rojas, H. A. Ayala Solares, R. Babu, P. Bangale, E. Belmont-Moreno, A. Bernal, K. S. Caballero-Mora, T. Capistrán, A. Carramiñana, F. Carreón, S. Casanova, U. Cotti, J. Cotzomi, S. Coutiño de León, E. De la Fuente, D. Depaoli, P. Desiati, N. Di Lalla, R. Diaz Hernandez, M. A. DuVernois, J. C. Díaz-Vélez , et al. (63 additional authors not shown)

Abstract: We present the monitoring of the TeV-emitting radio galaxies M87, NGC 1275, 3C 264, and IC 310 with the High Altitude Water Cherenkov Observatory (HAWC) over a period of approximately $7.5$ years. The analysis includes light curves at daily, weekly and monthly time scales for the four sources. We report the detection of gamma-ray emission from M87 with a significance exceeding 5$σ$. Due to its significant detection, this work reports the integrated TeV spectrum of M87 from the longest temporal coverage to date. The source is well described as a point-like source modeled by a power law spectrum with spectral index $α= 2.53\pm0.29$ and a flux of $(7.09\pm 1.24)\times10^{-13}$ $\rm{cm}^{-2}\,{s}^{-1}\,{TeV}^{-1}$ at $1\,\rm{TeV}$. The maximum energy of the detected emission in M87, at 1$σ$ confidence level (C.L.), reaches 26.5 TeV. HAWC's observation of M87 reveals a low flux spectrum for the longest observation to date of this radio galaxy. 3C 264 is marginally detected with a significance slightly below 4$σ$, while NGC 1275 and IC 310 are not detected. The weekly light curves show an increased number of fluxes above $2σ$ for M87 starting in 2019, and for 3C 264 starting in 2018, which can be interpreted as the moment at which these sources start to exhibit an enhanced steady TeV emission. Overall, in the four radio galaxies, the cumulative significance over time indicates a behavior that resembles that of a gamma-ray variable active galaxy, such as the blazar Markarian 421. This supports the importance of monitoring radio galaxies to identify periods of higher activity and flares, enabling further multi-messenger studies.

Submitted 19 June, 2025; originally announced June 2025.

Comments: 14 pages, 1 table, 7 figures

arXiv:2506.16020 [pdf, ps, other]

VS-Singer: Vision-Guided Stereo Singing Voice Synthesis with Consistency Schrödinger Bridge

Authors: Zijing Zhao, Kai Wang, Hao Huang, Ying Hu, Liang He, Jichen Yang

Abstract: To explore the potential advantages of utilizing spatial cues from images for generating stereo singing voices with room reverberation, we introduce VS-Singer, a vision-guided model designed to produce stereo singing voices with room reverberation from scene images. VS-Singer comprises three modules: firstly, a modal interaction network integrates spatial features into text encoding to create a linguistic representation enriched with spatial information. Secondly, the decoder employs a consistency Schrödinger bridge to facilitate one-step sample generation. Moreover, we utilize the SFE module to improve the consistency of audio-visual matching. To our knowledge, this study is the first to combine stereo singing voice synthesis with visual acoustic matching within a unified framework. Experimental results demonstrate that VS-Singer can effectively generate stereo singing voices that align with the scene perspective in a single step.

Submitted 19 June, 2025; originally announced June 2025.

Comments: Accepted by Interspeech 2025

arXiv:2506.16001 [pdf, ps, other]

AutoHFormer: Efficient Hierarchical Autoregressive Transformer for Time Series Prediction

Authors: Qianru Zhang, Honggang Wen, Ming Li, Dong Huang, Siu-Ming Yiu, Christian S. Jensen, Pietro Liò

Abstract: Time series forecasting requires architectures that simultaneously achieve three competing objectives: (1) strict temporal causality for reliable predictions, (2) sub-quadratic complexity for practical scalability, and (3) multi-scale pattern recognition for accurate long-horizon forecasting. We introduce AutoHFormer, a hierarchical autoregressive transformer that addresses these challenges through three key innovations: 1) Hierarchical Temporal Modeling: Our architecture decomposes predictions into segment-level blocks processed in parallel, followed by intra-segment sequential refinement. This dual-scale approach maintains temporal coherence while enabling efficient computation. 2) Dynamic Windowed Attention: The attention mechanism employs learnable causal windows with exponential decay, reducing complexity while preserving precise temporal relationships. This design avoids both the anti-causal violations of standard transformers and the sequential bottlenecks of RNN hybrids. 3) Adaptive Temporal Encoding: A novel position encoding system is adopted to capture time patterns at multiple scales. It combines fixed oscillating patterns for short-term variations with learnable decay rates for long-term trends. Comprehensive experiments demonstrate that AutoHFormer achieves 10.76X faster training and 6.06X memory reduction compared to PatchTST on PEMS08, while maintaining consistent accuracy across 96-720 step horizons in most cases. These breakthroughs establish new benchmarks for efficient and precise time series modeling. Implementations of our method and all baselines in hierarchical autoregressive mechanism are available at https://github.com/lizzyhku/Autotime.

Submitted 18 June, 2025; originally announced June 2025.

Comments: 14 pages

arXiv:2506.15961 [pdf]

TrainVerify: Equivalence-Based Verification for Distributed LLM Training

Authors: Yunchi Lu, Youshan Miao, Cheng Tan, Peng Huang, Yi Zhu, Xian Zhang, Fan Yang

Abstract: Training large language models (LLMs) at scale requires parallel execution across thousands of devices, incurring enormous computational costs. Yet, these costly distributed trainings are rarely verified, leaving them prone to silent errors and potentially wasting millions of GPU hours. We introduce TrainVerify, a system for verifiable distributed training of LLMs. Given a deep learning model's logical specification as the ground truth, TrainVerify formally verifies that a distributed parallel execution plan is mathematically equivalent to it. Direct verification is notoriously difficult due to the sheer scale of LLMs which often involves billions of variables and highly intricate computation graphs. Therefore, TrainVerify introduces shape-reduction techniques and a stage-wise parallel verification algorithm that significantly reduces complexity while preserving formal correctness. TrainVerify scales to frontier LLMs, including the successful verification of the Llama3 (405B) and DeepSeek-V3 (671B) training plans.

Submitted 18 June, 2025; originally announced June 2025.

arXiv:2506.15956 [pdf, ps, other]

Scalable quantum current source on commercial 22-nm CMOS process technology

Authors: Ajit Dash, Suyash Pati Tripathi, Dimitrios Georgakopoulos, MengKe Feng, Steve Yianni, Ensar Vahapoglu, Md Mamunur Rahman, Shai Bonen, Owen Brace, Jonathan Y. Huang, Wee Han Lim, Kok Wai Chan, Will Gilbert, Arne Laucht, Andrea Morello, Andre Saraiva, Christopher C. Escott, Sorin P. Voinigescu, Andrew S. Dzurak, Tuomo Tanttu

Abstract: Utilizing quantum effects in nanoscopic devices has in the past mostly been accessible through academic cleanrooms and research foundries. Opening the quantum frontier for wider industrial applications likely requires the scale of well-established complementary metal-oxide-semiconductor (CMOS) foundries for manufacturing transistor-based quantum devices operable above subkelvin temperatures. Here, we operate a commercial 22-nm-node fully depleted silicon-on-insulator (FDSOI) CMOS device as dual parallel-connected charge-pumps for the implementation of a quantum current standard in the International System of Units (SI). We measure the accuracy of (1.2 +/- 0.1)E-3 A/A for this scalable architecture at 50 MHz with reference to SI-traceable voltage and resistance standards in a pumped helium system. Looking ahead we propose a practical monolithic CMOS chip that incorporates one million parallel-connected charge pumps along with on-chip control electronics. This can be operated as a table-top primary standard, generating quantum currents up to microampere levels.

Submitted 18 June, 2025; originally announced June 2025.

Comments: 16 pages, 4 figures, 3 extended data figures, 7 extended data tables

arXiv:2506.15943 [pdf, ps, other]

On the optimal regret of collaborative personalized linear bandits

Authors: Bruce Huang, Ruida Zhou, Lin F. Yang, Suhas Diggavi

Abstract: Stochastic linear bandits are a fundamental model for sequential decision making, where an agent selects a vector-valued action and receives a noisy reward with expected value given by an unknown linear function. Although well studied in the single-agent setting, many real-world scenarios involve multiple agents solving heterogeneous bandit problems, each with a different unknown parameter. Applying single-agent algorithms independently ignores cross-agent similarity and learning opportunities. This paper investigates the optimal regret achievable in collaborative personalized linear bandits. We provide an information-theoretic lower bound that characterizes how the number of agents, the interaction rounds, and the degree of heterogeneity jointly affect regret. We then propose a new two-stage collaborative algorithm that achieves the optimal regret. Our analysis models heterogeneity via a hierarchical Bayesian framework and introduces a novel information-theoretic technique for bounding regret. Our results offer a complete characterization of when and how collaboration helps with an optimal regret bound $\tilde{O}(d\sqrt{mn})$, $\tilde{O}(dm^{1-γ}\sqrt{n})$, $\tilde{O}(dm\sqrt{n})$ for the number of rounds $n$ in the range of $(0, \frac{d}{m σ^2})$, $[\frac{d}{m^{2γ} σ^2}, \frac{d}{σ^2}]$ and $(\frac{d}{σ^2}, \infty)$ respectively, where $σ$ measures the level of heterogeneity, $m$ is the number of agents, and $γ\in[0, 1/2]$ is an absolute constant. In contrast, agents without collaboration achieve a regret bound $O(dm\sqrt{n})$ at best.

Submitted 18 June, 2025; originally announced June 2025.

Comments: 30 pages, 4 figures

arXiv:2506.15873 [pdf, ps, other]

DeckFlow: Iterative Specification on a Multimodal Generative Canvas

Authors: Gregory Croisdale, Emily Huang, John Joon Young Chung, Anhong Guo, Xu Wang, Austin Z. Henley, Cyrus Omar

Abstract: Generative AI promises to allow people to create high-quality personalized media. Although powerful, we identify three fundamental design problems with existing tooling through a literature review. We introduce a multimodal generative AI tool, DeckFlow, to address these problems. First, DeckFlow supports task decomposition by allowing users to maintain multiple interconnected subtasks on an infinite canvas populated by cards connected through visual dataflow affordances. Second, DeckFlow supports a specification decomposition workflow where an initial goal is iteratively decomposed into smaller parts and combined using feature labels and clusters. Finally, DeckFlow supports generative space exploration by generating multiple prompt and output variations, presented in a grid, that can feed back recursively into the next design iteration. We evaluate DeckFlow for text-to-image generation against a state-of-practice conversational AI baseline for image generation tasks. We then add audio generation and investigate user behaviors in a more open-ended creative setting with text, image, and audio outputs.

Submitted 18 June, 2025; originally announced June 2025.

Showing 1–50 of 43,810 results for author: Huang