Search | arXiv e-print repository

doi 10.1609/aaai.v39i2.32116

SocialSim: Towards Socialized Simulation of Emotional Support Conversation

Authors: Zhuang Chen, Yaru Cao, Guanqun Bi, Jincenzi Wu, Jinfeng Zhou, Xiyao Xiao, Si Chen, Hongning Wang, Minlie Huang

Abstract: Emotional support conversation (ESC) helps reduce people's psychological stress and provide emotional value through interactive dialogues. Due to the high cost of crowdsourcing a large ESC corpus, recent attempts use large language models for dialogue augmentation. However, existing approaches largely overlook the social dynamics inherent in ESC, leading to less effective simulations. In this paper, we introduce SocialSim, a novel framework that simulates ESC by integrating key aspects of social interactions: social disclosure and social awareness. On the seeker side, we facilitate social disclosure by constructing a comprehensive persona bank that captures diverse and authentic help-seeking scenarios. On the supporter side, we enhance social awareness by eliciting cognitive reasoning to generate logical and supportive responses. Building upon SocialSim, we construct SSConv, a large-scale synthetic ESC corpus whose quality can even surpass crowdsourced ESC data. We further train a chatbot on SSConv and demonstrate its state-of-the-art performance in both automatic and human evaluations. We believe SocialSim offers a scalable way to synthesize ESC, making emotional care more accessible and practical.

Submitted 20 June, 2025; originally announced June 2025.

Comments: AAAI 2025 Paper #32116 (Without Publication Edits)

Journal ref: Proceedings of the AAAI Conference on Artificial Intelligence, 39(2), 1274-1282, 2025

arXiv:2506.16728 [pdf, ps, other]

Few-Shot Generalized Category Discovery With Retrieval-Guided Decision Boundary Enhancement

Authors: Yunhan Ren, Feng Luo, Siyu Huang

Abstract: While existing Generalized Category Discovery (GCD) models have achieved significant success, their performance with limited labeled samples and a small number of known categories remains largely unexplored. In this work, we introduce the task of Few-shot Generalized Category Discovery (FSGCD), aiming to achieve competitive performance in GCD tasks under conditions of known information scarcity. To tackle this challenge, we propose a decision boundary enhancement framework with affinity-based retrieval. Our framework is designed to learn the decision boundaries of known categories and transfer these boundaries to unknown categories. First, we use a decision boundary pre-training module to mitigate the overfitting of pre-trained information on known category boundaries and improve the learning of these decision boundaries using labeled samples. Second, we implement a two-stage retrieval-guided decision boundary optimization strategy. Specifically, this strategy further enhances the severely limited known boundaries by using affinity-retrieved pseudo-labeled samples. Then, these refined boundaries are applied to unknown clusters via guidance from affinity-based feature retrieval. Experimental results demonstrate that our proposed method outperforms existing methods on six public GCD benchmarks under the FSGCD setting. The codes are available at: https://github.com/Ryh1218/FSGCD

Submitted 19 June, 2025; originally announced June 2025.

Comments: Accepted by ICMR 2025

arXiv:2506.16718 [pdf, ps, other]

Generalizable Agent Modeling for Agent Collaboration-Competition Adaptation with Multi-Retrieval and Dynamic Generation

Authors: Chenxu Wang, Yonggang Jin, Cheng Hu, Youpeng Zhao, Zipeng Dai, Jian Zhao, Shiyu Huang, Liuyu Xiang, Junge Zhang, Zhaofeng He

Abstract: Adapting a single agent to a new multi-agent system brings challenges, necessitating adjustments across various tasks, environments, and interactions with unknown teammates and opponents. Addressing this challenge is highly complex, and researchers have proposed two simplified scenarios, Multi-agent reinforcement learning for zero-shot learning and Ad-Hoc Teamwork. Building on these foundations, we propose a more comprehensive setting, Agent Collaborative-Competitive Adaptation (ACCA), which evaluates an agent's ability to generalize across diverse scenarios, tasks, and interactions with both unfamiliar opponents and teammates. In ACCA, agents adjust to task and environmental changes, collaborate with unseen teammates, and compete against unknown opponents. We introduce a new modeling approach, Multi-Retrieval and Dynamic Generation (MRDG), that effectively models both teammates and opponents using their behavioral trajectories. This method incorporates a positional encoder for varying team sizes and a hypernetwork module to boost agents' learning and adaptive capabilities. Additionally, a viewpoint alignment module harmonizes the observational perspectives of retrieved teammates and opponents with the learning agent. Extensive tests in benchmark scenarios like SMAC, Overcooked-AI, and Melting Pot show that MRDG significantly improves robust collaboration and competition with unseen teammates and opponents, surpassing established baselines. Our code is available at: https://github.com/vcis-wangchenxu/MRDG.git

Submitted 19 June, 2025; originally announced June 2025.

Comments: This manuscript is under submission to Neurocomputing

Report number: NEUCOM-D-25-02272R1

arXiv:2506.16695 [pdf]

Crystal Growth of Chalcogenides and Oxy-Chalcogenides Using Chloride Exchange Reaction

Authors: Shantanu Singh, Boyang Zhao, Christopher E. Stevens, Mythili Surendran, Tzu-Chi Huang, Bi-Hsuan Lin, Joshua R. Hendrickson, Jayakanth Ravichandran

Abstract: Chalcogenides and oxy-chalcogenides, including complex chalcogenides and transition metal dichalcogenides, are emerging semiconductors with direct or indirect band gaps within the visible spectrum. These materials are being explored for various photonic and electronic applications, such as photodetectors, photovoltaics, and phase-change electronics. Understanding the fundamental properties of these materials is crucial for optimizing their functionalities. Therefore, the availability of large, high-quality single crystals of chalcogenides and oxy-chalcogenides is essential for a better comprehension of their structure and properties. In this study, we present a novel crystal growth method that utilizes the exchange reaction between BaS and ZrCl$_4$/HfCl$_4$. By carefully controlling the stoichiometric ratio of the binary sulfide to the chloride, we can grow single crystals of several materials, such as ZrS$_2$, HfS$_2$, BaZrS$_3$, and ZrOS. This method results in large single crystals with a short reaction time of 24 to 48 hours. High-resolution thin film diffraction and single-crystal X-ray diffraction confirm the quality of the crystals produced through this exchange reaction. We also report the optical properties of these materials investigated using photoluminescence and Raman measurements. The chloride exchange reaction method paves the way for the synthesis of single crystals of chalcogenides and oxy-chalcogenide systems with a short reaction time but with low mosaicity and can be an alternative growth technique for single crystals of materials that are difficult to synthesize using conventional growth techniques.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.16691 [pdf, ps, other]

LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation

Authors: Tongtian Yue, Longteng Guo, Yepeng Tang, Zijia Zhao, Xinxin Zhu, Hua Huang, Jing Liu

Abstract: Despite the impressive advancements of Large Vision-Language Models (LVLMs), existing approaches suffer from a fundamental bottleneck: inefficient visual-language integration. Current methods either disrupt the model's inherent structure or introduce severe long-context computational burden, severely limiting scalability and efficiency. In this paper, we rethink multimodal integration and present LaVi, a novel LVLM that enables seamless and efficient vision-language fusion through internal feature modulation within the Large Language Models (LLMs). Unlike dominant LVLMs that rely on visual token concatenation, LaVi bypasses long-context expansion by introducing a lightweight and adaptive transformation, which incorporates visual context by injecting token-wise vision-conditioned deltas into the affine parameters of layer normalization. This mechanism directly modulates linguistic hidden states based on visual input, ensuring precise vision-language alignment while preserving the LLM's linguistic priors and drastically reducing computational costs. Extensive evaluations across 15 image and video benchmarks demonstrate that LaVi not only achieves state-of-the-art multimodal performance but also dramatically enhances efficiency. Compared to LLaVA-OV-7B, LaVi reduces FLOPs by 94.0%, improves inference speed by 3.1 times, and cuts memory usage in half - establishing LaVi as a scalable and practical solution for real-time multimodal reasoning. The code and models will be released soon.

Submitted 19 June, 2025; originally announced June 2025.
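
The LaVi abstract above describes injecting token-wise vision-conditioned deltas into the affine parameters of layer normalization. The following is a minimal pure-Python sketch of that general idea, not the authors' implementation: the function names are hypothetical, and the deltas are passed in directly because the abstract does not say how LaVi computes them from the visual input.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Standard layer normalization of one token's hidden vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


def modulated_layer_norm(tokens, gamma, beta, d_gamma, d_beta):
    """Layer norm whose scale and shift are adjusted separately for each token.

    d_gamma[t] and d_beta[t] stand in for vision-conditioned deltas for
    token t (hypothetical inputs; how they are produced is an assumption).
    """
    out = []
    for x, dg, db in zip(tokens, d_gamma, d_beta):
        scale = [g + d for g, d in zip(gamma, dg)]
        shift = [b + d for b, d in zip(beta, db)]
        out.append(layer_norm(x, scale, shift))
    return out
```

With all deltas set to zero the function reduces to ordinary layer normalization, and the sequence length is unchanged because no visual tokens are appended to the text tokens.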

arXiv:2506.16683 [pdf, ps, other]

A Simple Contrastive Framework Of Item Tokenization For Generative Recommendation

Authors: Penglong Zhai, Yifang Yuan, Fanyi Di, Jie Li, Yue Liu, Chen Li, Jie Huang, Sicong Wang, Yao Xu, Xin Li

Abstract: Generative retrieval-based recommendation has emerged as a promising paradigm aiming at directly generating the identifiers of the target candidates. However, in large-scale recommendation systems, this approach becomes increasingly cumbersome due to the redundancy and sheer scale of the token space. To overcome these limitations, recent research has explored the use of semantic tokens as an alternative to ID tokens, which typically leverage reconstruction-based strategies, like RQ-VAE, to quantize content embeddings and significantly reduce the embedding size. However, reconstructive quantization aims for the precise reconstruction of each item embedding independently, which conflicts with the goal of generative retrieval tasks focusing more on differentiating among items. Moreover, multi-modal side information of items, such as descriptive text and images, and geographical knowledge in location-based recommendation services, has been shown to be effective in improving recommendations by providing richer contexts for interactions. Nevertheless, effectively integrating such complementary knowledge into existing generative recommendation frameworks remains challenging. To overcome these challenges, we propose a novel unsupervised deep quantization exclusively based on contrastive learning, named SimCIT (a Simple Contrastive Item Tokenization framework). Specifically, different from existing reconstruction-based strategies, SimCIT proposes to use a learnable residual quantization module to align with the signals from different modalities of the items, which combines multi-modal knowledge alignment and semantic tokenization in a mutually beneficial contrastive learning framework. Extensive experiments across public datasets and a large-scale industrial dataset from various domains demonstrate SimCIT's effectiveness in LLM-based generative recommendation.

Submitted 19 June, 2025; originally announced June 2025.

Comments: 12 pages, 7 figures
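
The SimCIT abstract above refers to residual quantization, which turns an item embedding into a short sequence of discrete tokens. The sketch below shows only the generic residual-quantization lookup with fixed codebooks; it does not reproduce SimCIT's contrastive training or multi-modal alignment, and the function names and toy codebooks are hypothetical.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec in squared Euclidean distance."""
    def sqdist(code):
        return sum((c - v) ** 2 for c, v in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))


def residual_quantize(vec, codebooks):
    """Quantize vec level by level.

    At each level the closest codeword is chosen and subtracted, and the
    next level quantizes what is left. Returns the token indices and the
    final residual.
    """
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, residual


# Two toy levels with two codewords each (illustrative values only).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.5, 0.0]],
]
tokens, residual = residual_quantize([1.4, 1.0], codebooks)
```

Here `tokens` is the item's semantic identifier, one index per level, and `residual` is the part of the embedding the codebooks did not capture.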

arXiv:2506.16654 [pdf, ps, other]

Relational Deep Learning: Challenges, Foundations and Next-Generation Architectures

Authors: Vijay Prakash Dwivedi, Charilaos Kanatsoulis, Shenyang Huang, Jure Leskovec

Abstract: Graph machine learning has led to a significant increase in the capabilities of models that learn on arbitrary graph-structured data and has been applied to molecules, social networks, recommendation systems, and transportation, among other domains. Data in multi-tabular relational databases can also be constructed as 'relational entity graphs' for Relational Deep Learning (RDL) - a new blueprint that enables end-to-end representation learning without traditional feature engineering. Compared to arbitrary graph-structured data, relational entity graphs have key properties: (i) their structure is defined by primary-foreign key relationships between entities in different tables, (ii) the structural connectivity is a function of the relational schema defining a database, and (iii) the graph connectivity is temporal and heterogeneous in nature. In this paper, we provide a comprehensive review of RDL by first introducing the representation of relational databases as relational entity graphs, and then reviewing public benchmark datasets that have been used to develop and evaluate recent GNN-based RDL models. We discuss key challenges including large-scale multi-table integration and the complexities of modeling temporal dynamics and heterogeneous data, while also surveying foundational neural network methods and recent architectural advances specialized for relational entity graphs. Finally, we explore opportunities to unify these distinct modeling challenges, highlighting how RDL converges multiple sub-fields in graph machine learning towards the design of foundation models that can transform the processing of relational data.

Submitted 19 June, 2025; originally announced June 2025.
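
The abstract above says a relational entity graph is defined by primary-foreign key relationships between rows in different tables. A minimal sketch of that construction follows, assuming a toy two-table schema; the table names, columns, and function name are hypothetical and not taken from the paper.

```python
def build_entity_graph(tables, foreign_keys):
    """Build a graph with one node per row and one edge per foreign-key link.

    tables: {table_name: {primary_key: row_dict}}
    foreign_keys: list of (child_table, fk_column, parent_table)
    Nodes are (table_name, primary_key) pairs, so the node type is the
    table name, which is what makes the graph heterogeneous.
    """
    nodes = [(name, pk) for name, rows in tables.items() for pk in rows]
    edges = []
    for child, column, parent in foreign_keys:
        for pk, row in tables[child].items():
            ref = row.get(column)
            if ref in tables[parent]:  # skip missing or dangling references
                edges.append(((child, pk), (parent, ref)))
    return nodes, edges


# Toy schema: each order row points at one customer row.
tables = {
    "customers": {1: {"name": "A"}, 2: {"name": "B"}},
    "orders": {
        10: {"customer_id": 1, "placed": "2025-01-03"},
        11: {"customer_id": 2, "placed": "2025-01-05"},
        12: {"customer_id": 1, "placed": "2025-02-01"},
    },
}
nodes, edges = build_entity_graph(tables, [("orders", "customer_id", "customers")])
```

The connectivity follows entirely from the schema and its key columns. A timestamp column such as the hypothetical `placed` field is the kind of row attribute that would carry the temporal information the abstract mentions.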

arXiv:2506.16633 [pdf, ps, other]

GeoGuess: Multimodal Reasoning based on Hierarchy of Visual Information in Street View

Authors: Fenghua Cheng, Jinxiang Wang, Sen Wang, Zi Huang, Xue Li

Abstract: Multimodal reasoning is a process of understanding, integrating and inferring information across different data modalities. It has recently attracted surging academic attention as a benchmark for Artificial Intelligence (AI). Although there are various tasks for evaluating multimodal reasoning ability, they still have limitations. The lack of reasoning over hierarchical visual clues at different levels of granularity, e.g., local details and global context, has received little discussion, despite its frequent involvement in real scenarios. To bridge the gap, we introduce a novel and challenging task for multimodal reasoning, namely GeoGuess. Given a street view image, the task is to identify its location and provide a detailed explanation. A system that succeeds in GeoGuess should be able to detect tiny visual clues, perceive the broader landscape, and associate with vast geographic knowledge. Therefore, GeoGuess would require the ability to reason between hierarchical visual information and geographic knowledge. In this work, we establish a benchmark for GeoGuess by introducing a specially curated dataset, GeoExplain, which consists of panoramas-geocoordinates-explanation tuples. Additionally, we present a multimodal and multilevel reasoning method, namely SightSense, which can make predictions and generate comprehensive explanations based on the hierarchy of visual information and external knowledge. Our analysis and experiments demonstrate their outstanding performance in GeoGuess.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.16595 [pdf]

Optimizing Time-resolved Magneto-optical Kerr Effect for High-fidelity Magnetic Characterization

Authors: Yun Kim, Dingbin Huang, Deyuan Lyu, Haoyue Sun, Jian-Ping Wang, Paul A. Crowell, Xiaojia Wang

Abstract: Spintronics has emerged as a key technology for fast and non-volatile memory with great CMOS compatibility. As the building blocks for these cutting-edge devices, magnetic materials require precise characterization of their critical properties, such as the effective anisotropy field ($H_{\rm{k,eff}}$, related to magnetic stability) and damping ($α$, a key factor in device energy efficiency). Accurate measurements of these properties are essential for designing and fabricating high-performance spintronic devices. Among advanced metrology techniques, Time-resolved Magneto-Optical Kerr Effect (TR-MOKE) stands out for its superb temporal and spatial resolutions, surpassing traditional methods like ferromagnetic resonance (FMR). However, the full potential of TR-MOKE has not yet been realized due to the lack of systematic optimization and robust operational guidelines. In this study, we address this gap by developing experimentally validated guidelines for optimizing TR-MOKE metrology across materials with perpendicular magnetic anisotropy (PMA) and in-plane magnetic anisotropy (IMA). Our work identifies the optimal ranges of the field angle to simultaneously achieve high signal amplitudes and improve measurement sensitivities to $H_{\rm{k,eff}}$ and $α$. By suppressing the influence of inhomogeneities and boosting sensitivity, our work significantly enhances TR-MOKE capability to extract magnetic properties with high accuracy and reliability. This optimization framework positions TR-MOKE as an indispensable tool for advancing spintronics, paving the way for energy-efficient and high-speed devices that will redefine the landscape of modern computing and memory technologies.

Submitted 19 June, 2025; originally announced June 2025.

Comments: Submitted to Appl. Phys. Lett. Manuscript: 16 pages, 5 figures; Supplementary Materials: 18 pages, 12 figures

arXiv:2506.16594 [pdf, ps, other]

A Scoping Review of Synthetic Data Generation for Biomedical Research and Applications

Authors: Hanshu Rao, Weisi Liu, Haohan Wang, I-Chan Huang, Zhe He, Xiaolei Huang

Abstract: Synthetic data generation--mitigating data scarcity, privacy concerns, and data quality challenges in biomedical fields--has been facilitated by rapid advances of large language models (LLMs). This scoping review follows PRISMA-ScR guidelines and synthesizes 59 studies, published between 2020 and 2025 and collected from PubMed, ACM, Web of Science, and Google Scholar. The review systematically examines biomedical research and application trends in synthetic data generation, emphasizing clinical applications, methodologies, and evaluations. Our analysis identifies data modalities of unstructured texts (78.0%), tabular data (13.6%), and multimodal sources (8.4%); generation methods of prompting (72.9%) and fine-tuning (22.0%) LLMs, and specialized models (5.1%); and heterogeneous evaluations of intrinsic metrics (27.1%), human-in-the-loop assessments (55.9%), and LLM-based evaluations (13.6%). The analysis addresses current limitations in what, where, and how health professionals can leverage synthetic data generation for biomedical domains. Our review also highlights challenges in adaptation across clinical domains, resource and model accessibility, and evaluation standardizations.

Submitted 19 June, 2025; originally announced June 2025.
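
The percentages in the abstract above can be turned into approximate study counts. The sketch below assumes every percentage is a share of the 59 included studies; the abstract does not state the denominators, so the counts are illustrative only, and the variable and function names are hypothetical.

```python
TOTAL_STUDIES = 59  # number of studies stated in the abstract

# Percentages as quoted in the abstract, grouped by the dimension they describe.
shares = {
    "modality": {"unstructured text": 78.0, "tabular": 13.6, "multimodal": 8.4},
    "method": {"prompting": 72.9, "fine-tuning": 22.0, "specialized model": 5.1},
    "evaluation": {"intrinsic metrics": 27.1, "human-in-the-loop": 55.9, "LLM-based": 13.6},
}


def implied_counts(group, total=TOTAL_STUDIES):
    """Nearest whole number of studies for each percentage (assumes a shared denominator)."""
    return {name: round(pct * total / 100) for name, pct in group.items()}


def percent_sum(group):
    """Sum of the quoted percentages, rounded to one decimal place."""
    return round(sum(group.values()), 1)


for dimension, group in shares.items():
    print(dimension, implied_counts(group), percent_sum(group))
```

Under that assumption the modality and method shares each sum to 100%, while the three evaluation shares sum to 96.6%, so the evaluation categories quoted in the abstract may not cover every study.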

arXiv:2506.16578 [pdf, ps, other]

SafeTriage: Facial Video De-identification for Privacy-Preserving Stroke Triage

Authors: Tongan Cai, Haomiao Ni, Wenchao Ma, Yuan Xue, Qian Ma, Rachel Leicht, Kelvin Wong, John Volpi, Stephen T. C. Wong, James Z. Wang, Sharon X. Huang

Abstract: Effective stroke triage in emergency settings often relies on clinicians' ability to identify subtle abnormalities in facial muscle coordination. While recent AI models have shown promise in detecting such patterns from patient facial videos, their reliance on real patient data raises significant ethical and privacy challenges -- especially when training robust and generalizable models across institutions. To address these concerns, we propose SafeTriage, a novel method designed to de-identify patient facial videos while preserving essential motion cues crucial for stroke diagnosis. SafeTriage leverages a pretrained video motion transfer (VMT) model to map the motion characteristics of real patient faces onto synthetic identities. This approach retains diagnostically relevant facial dynamics without revealing the patients' identities. To mitigate the distribution shift between normal population pre-training videos and patient population test videos, we introduce a conditional generative model for visual prompt tuning, which adapts the input space of the VMT model to ensure accurate motion transfer without needing to fine-tune the VMT model backbone. Comprehensive evaluation, including quantitative metrics and clinical expert assessments, demonstrates that SafeTriage-produced synthetic videos effectively preserve stroke-relevant facial patterns, enabling reliable AI-based triage. Our evaluations also show that SafeTriage provides robust privacy protection while maintaining diagnostic accuracy, offering a secure and ethically sound foundation for data sharing and AI-driven clinical analysis in neurological disorders.

Submitted 19 June, 2025; originally announced June 2025.

Comments: IPMI 2025

arXiv:2506.16531 [pdf, ps, other]

How Hard Is Snow? A Paired Domain Adaptation Dataset for Clear and Snowy Weather: CADC+

Authors: Mei Qi Tang, Sean Sedwards, Chengjie Huang, Krzysztof Czarnecki

Abstract: The impact of snowfall on 3D object detection performance remains underexplored. Conducting such an evaluation requires a dataset with sufficient labelled data from both weather conditions, ideally captured in the same driving environment. Current driving datasets with LiDAR point clouds either do not provide enough labelled data in both snowy and clear weather conditions, or rely on de-snowing methods to generate synthetic clear weather. Synthetic data often lacks realism and introduces an additional domain shift that confounds accurate evaluations. To address these challenges, we present CADC+, the first paired weather domain adaptation dataset for autonomous driving in winter conditions. CADC+ extends the Canadian Adverse Driving Conditions dataset (CADC) using clear weather data that was recorded on the same roads and in the same period as CADC. To create CADC+, we pair each CADC sequence with a clear weather sequence that matches the snowy sequence as closely as possible. CADC+ thus minimizes the domain shift resulting from factors unrelated to the presence of snow. We also present some preliminary results using CADC+ to evaluate the effect of snow on 3D object detection performance. We observe that snow introduces a combination of aleatoric and epistemic uncertainties, acting as both noise and a distinct data domain.

Submitted 19 June, 2025; originally announced June 2025.

Comments: IEEE IV 2025

arXiv:2506.16504 [pdf, ps, other]

Hunyuan3D 2.5: Towards High-Fidelity 3D Assets Generation with Ultimate Details

Authors: Zeqiang Lai, Yunfei Zhao, Haolin Liu, Zibo Zhao, Qingxiang Lin, Huiwen Shi, Xianghui Yang, Mingxin Yang, Shuhui Yang, Yifei Feng, Sheng Zhang, Xin Huang, Di Luo, Fan Yang, Fang Yang, Lifu Wang, Sicong Liu, Yixuan Tang, Yulin Cai, Zebin He, Tian Liu, Yuhong Liu, Jie Jiang, Linus, Jingwei Huang, et al. (1 additional author not shown)

Abstract: In this report, we present Hunyuan3D 2.5, a robust suite of 3D diffusion models aimed at generating high-fidelity and detailed textured 3D assets. Hunyuan3D 2.5 follows the two-stage pipeline of its previous version, Hunyuan3D 2.0, while demonstrating substantial advancements in both shape and texture generation. In terms of shape generation, we introduce a new shape foundation model -- LATTICE, which is trained with scaled high-quality datasets, model-size, and compute. Our largest model reaches 10B parameters and generates sharp and detailed 3D shapes with precise image-3D following while keeping the mesh surface clean and smooth, significantly closing the gap between generated and handcrafted 3D shapes. In terms of texture generation, it is upgraded with physically based rendering (PBR) via a novel multi-view architecture extended from the Hunyuan3D 2.0 Paint model. Our extensive evaluation shows that Hunyuan3D 2.5 significantly outperforms previous methods in both shape and end-to-end texture generation.

Submitted 19 June, 2025; originally announced June 2025.

Comments: Technical report

arXiv:2506.16481 [pdf, ps, other]

SO emission in the dynamically perturbed protoplanetary disks around CQ Tau and MWC 758

Authors: Francesco Zagaria, Haochang Jiang, Gianni Cataldi, Stefano Facchini, Myriam Benisty, Yuri Aikawa, Sean Andrews, Jaehan Bae, Marcelo Barraza-Alfaro, Pietro Curone, Ian Czekala, Daniele Fasano, Cassandra Hall, Iain Hammond, Jane Huang, John D. Ilee, Andrés F. Izquierdo, Jensen Lawrence, Giuseppe Lodato, François Ménard, Christophe Pinte, Giovanni P. Rosotti, Jochen Stadler, Richard Teague, Leonardo Testi, et al. (3 additional authors not shown)

Abstract: We report the serendipitous detection of the SO $J_N=6_5-5_4$ (219.949 GHz) rotational transition in archival Atacama Large Millimeter/submillimeter Array (ALMA) observations of the spiral-hosting protoplanetary disks around CQ Tau (with $\approx4.9σ$ significance) and MWC 758 (with $\approx3.4σ$ significance). In the former, the SO emission comes in the shape of a ring, arises from the edge of the continuum cavity, and is qualitatively consistent, at the currently available spectral resolution, with being in Keplerian rotation. In the latter, instead, while arising primarily from inside the continuum cavity, the SO emission also extends to the continuum ring(s), and its morphology and kinematics are less clear. We put these sources in the context of the other protoplanetary disks where SO detections have been previously reported in the literature and discuss the possible origins of SO in terms of (thermal) desorption or formation in the gas phase. We argue that these processes might be fostered by dynamical perturbations caused by unseen embedded massive companions, shadows, or late-time infall, thus suggesting a possible link between perturbed dynamics and SO emission in (these) protoplanetary disks. If confirmed, our interpretation would imply that chemical evolution timescales could be significantly shorter in these systems than is commonly assumed, indicating that dynamical perturbations might influence the composition of newborn (proto-)planets by altering the volatile makeup of their formation environment.

Submitted 19 June, 2025; originally announced June 2025.

Comments: Accepted for publication in ApJ. 23 pages, 7 figures

arXiv:2506.16447 [pdf, ps, other]

Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models

Authors: Biao Yi, Tiansheng Huang, Sishuo Chen, Tong Li, Zheli Liu, Zhixuan Chu, Yiming Li

Abstract: Backdoor unalignment attacks against Large Language Models (LLMs) enable the stealthy compromise of safety alignment using a hidden trigger while evading normal safety auditing. These attacks pose significant threats to the applications of LLMs in the real-world Large Language Model as a Service (LLMaaS) setting, where the deployed model is a fully black-box system that can only interact through text. Furthermore, the sample-dependent nature of the attack target exacerbates the threat. Instead of outputting a fixed label, the backdoored LLM follows the semantics of any malicious command with the hidden trigger, significantly expanding the target space. In this paper, we introduce BEAT, a black-box defense that detects triggered samples during inference to deactivate the backdoor. It is motivated by an intriguing observation (dubbed the probe concatenate effect), where concatenated triggered samples significantly reduce the refusal rate of the backdoored LLM towards a malicious probe, while non-triggered samples have little effect. Specifically, BEAT identifies whether an input is triggered by measuring the degree of distortion in the output distribution of the probe before and after concatenation with the input. Our method addresses the challenges of sample-dependent targets from an opposite perspective. It captures the impact of the trigger on the refusal signal (which is sample-independent) instead of sample-specific successful attack behaviors. It overcomes black-box access limitations by using multiple sampling to approximate the output distribution. Extensive experiments are conducted on various backdoor attacks and LLMs (including the closed-source GPT-3.5-turbo), verifying the effectiveness and efficiency of our defense. Besides, we also preliminarily verify that BEAT can effectively defend against popular jailbreak attacks, as they can be regarded as 'natural backdoors'.

Submitted 19 June, 2025; originally announced June 2025.

Comments: Accepted at ICLR 2025

Journal ref: Proceedings of The Thirteenth International Conference on Learning Representations (ICLR 2025)

arXiv:2506.16398 [pdf, ps, other]

HyperPath: Knowledge-Guided Hyperbolic Semantic Hierarchy Modeling for WSI Analysis

Authors: Peixiang Huang, Yanyan Huang, Weiqin Zhao, Junjun He, Lequan Yu

Abstract: Pathology is essential for cancer diagnosis, with multiple instance learning (MIL) widely used for whole slide image (WSI) analysis. WSIs exhibit a natural hierarchy -- patches, regions, and slides -- with distinct semantic associations. While some methods attempt to leverage this hierarchy for improved representation, they predominantly rely on Euclidean embeddings, which struggle to fully capture semantic hierarchies. To address this limitation, we propose HyperPath, a novel method that integrates knowledge from textual descriptions to guide the modeling of semantic hierarchies of WSIs in hyperbolic space, thereby enhancing WSI classification. Our approach adapts both visual and textual features extracted by pathology vision-language foundation models to the hyperbolic space. We design an Angular Modality Alignment Loss to ensure robust cross-modal alignment, while a Semantic Hierarchy Consistency Loss further refines feature hierarchies through entailment and contradiction relationships and thus enhances semantic coherence. The classification is performed with geodesic distance, which measures the similarity between entities in the hyperbolic semantic hierarchy. This eliminates the need for linear classifiers and enables a geometry-aware approach to WSI analysis. Extensive experiments show that our method achieves superior performance across tasks compared to existing methods, highlighting the potential of hyperbolic embeddings for WSI analysis.

Submitted 19 June, 2025; originally announced June 2025.
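
The geodesic-distance classification mentioned in the HyperPath abstract can be illustrated with a minimal sketch. It assumes the Poincaré ball model with curvature -1 (the abstract does not state which hyperbolic model is used), and `poincare_distance`, `classify`, and the toy prototypes are hypothetical names for illustration, not the authors' code.

```python
import math

def poincare_distance(x, y):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_norm_x = sum(a * a for a in x)
    sq_norm_y = sum(b * b for b in y)
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    # Standard closed form for curvature -1 (an assumed model choice).
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_x) * (1.0 - sq_norm_y))
    return math.acosh(arg)

def classify(embedding, prototypes):
    """Return the label whose prototype is nearest in geodesic distance."""
    return min(
        prototypes,
        key=lambda label: poincare_distance(embedding, prototypes[label]),
    )

# Hypothetical 2D class prototypes; real embeddings would be high-dimensional.
prototypes = {"tumor": (0.6, 0.0), "normal": (-0.6, 0.0)}
label = classify((0.3, 0.1), prototypes)
```

Picking the nearest prototype by geodesic distance replaces a learned linear classifier, which is the design choice the abstract describes.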

arXiv:2506.16381 [pdf, ps, other]

InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems

Authors: Kexin Huang, Qian Tu, Liwei Fan, Chenchen Yang, Dong Zhang, Shimin Li, Zhaoye Fei, Qinyuan Cheng, Xipeng Qiu

Abstract: In modern speech synthesis, paralinguistic information--such as a speaker's vocal timbre, emotional state, and dynamic prosody--plays a critical role in conveying nuance beyond mere semantics. Traditional Text-to-Speech (TTS) systems rely on fixed style labels or inserting a speech prompt to control these cues, which severely limits flexibility. Recent attempts seek to employ natural-language instructions to modulate paralinguistic features, substantially improving the generalization of instruction-driven TTS models. Although many TTS systems now support customized synthesis via textual description, their actual ability to interpret and execute complex instructions remains largely unexplored. In addition, there is still a shortage of high-quality benchmarks and automated evaluation metrics specifically designed for instruction-based TTS, which hinders accurate assessment and iterative optimization of these models. To address these limitations, we introduce InstructTTSEval, a benchmark for measuring the capability of complex natural-language style control. We introduce three tasks, namely Acoustic-Parameter Specification, Descriptive-Style Directive, and Role-Play, including English and Chinese subsets, each with 1k test cases (6k in total) paired with reference audio. We leverage Gemini as an automatic judge to assess their instruction-following abilities. Our evaluation of accessible instruction-following TTS systems highlights substantial room for further improvement. We anticipate that InstructTTSEval will drive progress toward more powerful, flexible, and accurate instruction-following TTS.

Submitted 19 June, 2025; originally announced June 2025.

Comments: 19 pages, 9 figures

arXiv:2506.16346 [pdf]

Preferred Synthesis of Armchair SnS2 Nanotubes

Authors: Abid, Luneng Zhao, Ju Huang, Yongjia Zheng, Yuta Sato, Qingyun Lin, Zhen Han, Chunxia Yang, Tianyu Wang, Bill Herve Nduwarugira, Yicheng Ma, Lingfeng Wang, Yige Zheng, Hang Wang, Salman Ullah, Afzal Khan, Qi Zhang, Wenbin Li, Junfeng Gao, Bingfeng Ju, Feng Ding, Yan Li, Kazu Suenaga, Shigeo Maruyama, Huayong Yang, et al. (1 additional author not shown)

Abstract: In this work, we present the synthesis of tin disulfide (SnS2) nanotubes (NTs) with a preferred chiral angle. A sacrificial template is used to create channels of boron nitride nanotubes (BNNTs) with an optimized diameter of 4-5 nm, inside of which SnS2 NTs are formed with high yield and structural purity. Atomic resolution imaging and nano-area electron diffraction reveal that these synthesized SnS2 NTs prefer to have an armchair configuration with a probability of approximately 85%. Calculations using density functional theory (DFT) reveal a negligible difference in the formation energy between armchair and zigzag NTs, suggesting that structural stability does not play a key role in this chirality-selective growth. However, a detailed TEM investigation revealed that some SnS2 nanoribbons are found connected to the ends of SnS2 NTs, and that these nanoribbons primarily have a zigzag configuration. Subsequent DFT and machine learning potential molecular dynamics simulations verify that nanoribbons with zigzag configurations are more stable than armchair ones, and indeed zigzag nanoribbons aligned along the BNNT axis tend to roll up to form armchair SnS2 NTs. Finally, this "zigzag nanoribbon to armchair nanotube" transition hypothesis is verified by in-situ high-resolution transmission electron microscopy, in which the transformation of SnS2 nanoribbons into a nanotube is reproduced in real time. This work is the first demonstration of preferred-chirality growth of transition metal dichalcogenide nanotubes.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.16336 [pdf, ps, other]

Goal-conditioned Hierarchical Reinforcement Learning for Sample-efficient and Safe Autonomous Driving at Intersections

Authors: Yiou Huang

Abstract: Reinforcement learning (RL) exhibits remarkable potential in addressing autonomous driving tasks. However, it is difficult to train a sample-efficient and safe policy in complex scenarios. In this article, we propose a novel hierarchical reinforcement learning (HRL) framework with a goal-conditioned collision prediction (GCCP) module. In the hierarchical structure, the GCCP module predicts collision risks according to different potential subgoals of the ego vehicle. A high-level decision-maker chooses the best safe subgoal. A low-level motion-planner interacts with the environment according to the subgoal. Compared to traditional RL methods, our algorithm is more sample-efficient, since its hierarchical structure allows reusing the policies of subgoals across similar tasks for various navigation scenarios. In addition, the GCCP module's ability to predict both the ego vehicle's and surrounding vehicles' future actions according to different subgoals ensures the safety of the ego vehicle throughout the decision-making process. Experimental results demonstrate that the proposed method converges to an optimal policy faster and achieves higher safety than traditional RL methods.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.16317 [pdf, ps, other]

Two loop QCD corrections to $e^+ e^- \to J/ψ+ η_c$ in asymptotic expansion

Authors: Cong Li, Xu-Dong Huang, Wen-Long Sang

Abstract: Within the framework of NRQCD, the short-distance coefficients (SDCs) for the process $e^+e^-\to J/ψ+η_c$ have been obtained up to NNLO in asymptotic expansions over $r={16m_c^2}/{s}$ up to $r^{15}$. Although these asymptotic expressions deviate from the full results near the threshold $r= 1$, they provide excellent approximations to the full results for $r<0.8$, with deviations less than $3\%$. Therefore, these asymptotic expressions offer reliable applications for phenomenological predictions across a wide range of center-of-mass energies $\sqrt{s}$. Utilizing these asymptotic expressions, we present phenomenological predictions for the cross sections in both the on-shell mass scheme and the $\overline{\rm MS}$ mass scheme, with the uncertainty arising from the renormalization scale $μ_R$ included. The $μ_R$ uncertainty for predictions from the $\overline{\rm MS}$ mass scheme is slightly larger than that from the on-shell mass scheme, which is partly attributed to the helicity flip in the process $e^+e^-\to J/ψ+η_c$. We observe that both mass schemes yield quite similar predictions, and our theoretical results are consistent with the available experimental data.

Submitted 19 June, 2025; originally announced June 2025.

Comments: 18 pages, 4 figures, 1 tables, 1 attached file
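
As a rough illustration of the expansion parameter $r = 16m_c^2/s$ defined in the abstract above, the sketch below evaluates it for one energy and checks whether it falls in the $r<0.8$ region where deviations below $3\%$ are quoted. The charm-quark mass and the center-of-mass energy are illustrative assumptions, not values taken from the paper, and `expansion_parameter` is a hypothetical helper name.

```python
def expansion_parameter(sqrt_s_gev, m_c_gev):
    """Return r = 16 m_c^2 / s, where s is the squared center-of-mass energy."""
    return 16.0 * m_c_gev ** 2 / sqrt_s_gev ** 2

M_C = 1.5       # GeV, illustrative charm-quark mass (assumption)
SQRT_S = 10.58  # GeV, illustrative center-of-mass energy (assumption)

r = expansion_parameter(SQRT_S, M_C)
# True when r lies in the region where the abstract quotes deviations below 3%.
within_quoted_range = r < 0.8
# The threshold r = 1 corresponds to sqrt(s) = 4 m_c.
r_at_threshold = expansion_parameter(4.0 * M_C, M_C)
```

With these assumed inputs, r is about 0.32, comfortably inside the quoted range.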

arXiv:2506.16305 [pdf, ps, other]

A remark for fully non-linear elliptic equations on compact almost Hermitian manifolds

Authors: Liding Huang

Abstract: In this paper, we generalize the definition of sub-slope, introduced by Guo-Song, to almost Hermitian manifolds and prove the existence of solutions for a general class of fully non-linear equations on compact almost Hermitian manifolds. As an application, we solve the complex Hessian quotient equation and the deformed Hermitian-Yang-Mills equation in the almost Hermitian setting.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.16265 [pdf, ps, other]

Dense 3D Displacement Estimation for Landslide Monitoring via Fusion of TLS Point Clouds and Embedded RGB Images

Authors: Zhaoyi Wang, Jemil Avers Butt, Shengyu Huang, Tomislav Medic, Andreas Wieser

Abstract: Landslide monitoring is essential for understanding geohazards and mitigating associated risks. However, existing point cloud-based methods typically rely on either geometric or radiometric information and often yield sparse or non-3D displacement estimates. In this paper, we propose a hierarchical partition-based coarse-to-fine approach that fuses 3D point clouds and co-registered RGB images to estimate dense 3D displacement vector fields. We construct patch-level matches using both 3D geometry and 2D image features. These matches are refined via geometric consistency checks, followed by rigid transformation estimation per match. Experimental results on two real-world landslide datasets demonstrate that our method produces 3D displacement estimates with high spatial coverage (79% and 97%) and high accuracy. Deviations in displacement magnitude with respect to external measurements (total station or GNSS observations) are 0.15 m and 0.25 m on the two datasets, respectively, and only 0.07 m and 0.20 m compared to manually derived references. These values are below the average scan resolutions (0.08 m and 0.30 m). Our method outperforms the state-of-the-art method F2S3 in spatial coverage while maintaining comparable accuracy. Our approach offers a practical and adaptable solution for TLS-based landslide monitoring and is extensible to other types of point clouds and monitoring tasks. Our example data and source code are publicly available at https://github.com/zhaoyiww/fusion4landslide.

Submitted 19 June, 2025; originally announced June 2025.

Comments: 20 pages, 16 figures. Preprint under peer review. Example data and code available on GitHub: https://github.com/zhaoyiww/fusion4landslide

arXiv:2506.16261 [pdf, ps, other]

Global well-posedness for 2D compressible radially symmetric Navier-Stokes equations with swirl

Authors: Xiangdi Huang, Weili Meng

Abstract: In this paper, we consider the radially symmetric compressible Navier-Stokes equations with swirl in two-dimensional disks, where the shear viscosity coefficient $μ= \text{const}> 0$, and the bulk one $λ= ρ^β(β>0)$. When $β\geq 1$, we prove the global existence and asymptotic behavior of the large strong solutions for initial values that allow for vacuum. One of the key ingredients is to show the uniform boundedness of the density independent of the time. When $β\in(0,1)$, we prove the same conclusion holds when the initial value satisfies $\norm{ρ_0}_{L^\infty} \leq a_0$, where $a_0$ is given by \eqref{def a_0} as in Theorem \ref{Thm3}. To the best of our knowledge, this is the first result on the global existence of large strong solutions for 2D compressible Navier-Stokes equation with real non-slip (non Navier-slip) boundary conditions when $β\ge1$ and the first result on the global existence of strong solutions when $β\in(0,1)$.

Submitted 19 June, 2025; originally announced June 2025.

Comments: 44 pages

MSC Class: 35Q30; 76N10

arXiv:2506.16250 [pdf, ps, other]

Graph-Cover-based Characterization of the Bethe Partition Function of Double-Edge Factor Graphs

Authors: Yuwen Huang, Pascal O. Vontobel

Abstract: For standard factor graphs (S-FGs) with non-negative real-valued local functions, Vontobel provided a combinatorial characterization of the Bethe approximation of the partition function, also known as the Bethe partition function, using finite graph covers. The proof of this characterization, i.e., the graph-cover theorem for S-FGs, heavily relied on the method of types. In this paper, we study double-edge factor graphs (DE-FGs), a class of factor graphs where each local function takes complex values and satisfies some positive semi-definiteness constraints. DE-FGs and their partition functions are particularly relevant for quantum information processing. Approximating the partition function of a DE-FG is more difficult than for an S-FG, as it involves summing complex values instead of non-negative real values. We develop the sum-product algorithm (SPA) fixed-point-based Bethe approximation of the partition function. However, one cannot directly apply the method of types to prove a similar combinatorial characterization as in the case of S-FGs. We provide a combinatorial characterization of the Bethe partition function in terms of finite graph covers for a class of DE-FGs that satisfy a specific, easily checkable condition. Towards proving this characterization, we apply a suitable loop-calculus transform (LCT) to these graphs. Originally, the LCT was introduced by Chertkov and Chernyak as a special linear transform for S-FGs and later extended by Mori. Our proposed LCT is applicable for both DE-FGs and S-FGs and generalizes prior versions by handling zero-valued SPA fixed-point message components, which are common in DE-FGs. Supported by numerical results, we conjecture that this combinatorial characterization of the Bethe partition function in terms of finite graph covers holds more broadly for DE-FGs.

Submitted 19 June, 2025; originally announced June 2025.

Comments: arXiv admin note: substantial text overlap with arXiv:2412.05942

arXiv:2506.16233 [pdf, ps, other]

Can AI Dream of Unseen Galaxies? Conditional Diffusion Model for Galaxy Morphology Augmentation

Authors: Chenrui Ma, Zechang Sun, Tao Jing, Zheng Cai, Yuan-Sen Ting, Song Huang, Mingyu Li

Abstract: Observational astronomy relies on visual feature identification to detect critical astrophysical phenomena. While machine learning (ML) increasingly automates this process, models often struggle with generalization in large-scale surveys due to the limited representativeness of labeled datasets -- whether from simulations or human annotation -- a challenge pronounced for rare yet scientifically valuable objects. To address this, we propose a conditional diffusion model to synthesize realistic galaxy images for augmenting ML training data. Leveraging the Galaxy Zoo 2 dataset, which contains visual feature -- galaxy image pairs from volunteer annotation, we demonstrate that our model generates diverse, high-fidelity galaxy images that closely adhere to the specified morphological feature conditions. Moreover, this model enables generative extrapolation to project well-annotated data into unseen domains and advance rare object detection. Integrating synthesized images into ML pipelines improves performance in standard morphology classification, boosting completeness and purity by up to 30\% across key metrics. For rare object detection, using early-type galaxies with prominent dust lane features ( $\sim$0.1\% in GZ2 dataset) as a test case, our approach doubled the number of detected instances from 352 to 872, compared to previous studies based on visual inspection. This study highlights the power of generative models to bridge gaps between scarce labeled data and the vast, uncharted parameter space of observational astronomy and offers insight for future astrophysical foundation model developments. Our project homepage is available at https://galaxysd-webpage.streamlit.app/.

Submitted 19 June, 2025; originally announced June 2025.

Comments: We have submitted to AAS journals. See another independent work for further reference -- Category-based Galaxy Image Generation via Diffusion Models (Fan, Tang et al.). Comments are welcome

arXiv:2506.16211 [pdf, ps, other]

ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models

Authors: Puhao Li, Yingying Wu, Ziheng Xi, Wanlin Li, Yuzhe Huang, Zhiyuan Zhang, Yinghan Chen, Jianan Wang, Song-Chun Zhu, Tengyu Liu, Siyuan Huang

Abstract: Learning real-world robotic manipulation is challenging, particularly when limited demonstrations are available. Existing methods for few-shot manipulation often rely on simulation-augmented data or pre-built modules like grasping and pose estimation, which struggle with sim-to-real gaps and lack extensibility. While large-scale imitation pre-training shows promise, adapting these general-purpose policies to specific tasks in data-scarce settings remains unexplored. To achieve this, we propose ControlVLA, a novel framework that bridges pre-trained VLA models with object-centric representations via a ControlNet-style architecture for efficient fine-tuning. Specifically, to introduce object-centric conditions without overwriting prior knowledge, ControlVLA zero-initializes a set of projection layers, allowing them to gradually adapt the pre-trained manipulation policies. In real-world experiments across 6 diverse tasks, including pouring cubes and folding clothes, our method achieves a 76.7% success rate while requiring only 10-20 demonstrations -- a significant improvement over traditional approaches that require more than 100 demonstrations to achieve comparable success. Additional experiments highlight ControlVLA's extensibility to long-horizon tasks and robustness to unseen objects and backgrounds.

Submitted 19 June, 2025; originally announced June 2025.

Comments: Website: https://controlvla.github.io
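
The zero-initialized projection layers mentioned in the ControlVLA abstract can be illustrated with a small sketch of the general ControlNet-style idea: a new conditioning branch whose weights start at zero leaves the pre-trained output unchanged until training moves those weights. The function names, vector sizes, and the additive fusion are assumptions for illustration; the abstract does not specify how ControlVLA wires the projection into the policy.

```python
def project(weights, condition):
    """Apply a linear projection (a list of weight rows) to a condition vector."""
    return [sum(w * c for w, c in zip(row, condition)) for row in weights]

def conditioned_features(features, weights, condition):
    """Add the projected condition to the pre-trained features (assumed fusion)."""
    return [f + p for f, p in zip(features, project(weights, condition))]

features = [0.2, -1.0, 0.5]  # stand-in for pre-trained policy features
condition = [1.0, 2.0]       # stand-in for an object-centric embedding

# Zero initialization: the new branch contributes nothing at the start.
zero_weights = [[0.0, 0.0] for _ in features]
initial_output = conditioned_features(features, zero_weights, condition)

# After (hypothetical) fine-tuning, non-zero weights inject the condition.
tuned_weights = [[0.5, 0.0], [0.0, 0.0], [0.0, 0.25]]
tuned_output = conditioned_features(features, tuned_weights, condition)
```

Starting from zero means fine-tuning begins exactly at the pre-trained behavior, which is how the prior knowledge avoids being overwritten at the first step.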

arXiv:2506.16210 [pdf, ps, other]

From Coarse to Continuous: Progressive Refinement Implicit Neural Representation for Motion-Robust Anisotropic MRI Reconstruction

Authors: Zhenxuan Zhang, Lipei Zhang, Yanqi Cheng, Zi Wang, Fanwen Wang, Haosen Zhang, Yue Yang, Yinzhe Wu, Jiahao Huang, Angelica I Aviles-Rivero, Zhifan Gao, Guang Yang, Peter J. Lally

Abstract: In motion-robust magnetic resonance imaging (MRI), slice-to-volume reconstruction is critical for recovering anatomically consistent 3D brain volumes from 2D slices, especially under accelerated acquisitions or patient motion. However, this task remains challenging due to hierarchical structural disruptions. It includes local detail loss from k-space undersampling, global structural aliasing caused by motion, and volumetric anisotropy. Therefore, we propose a progressive refinement implicit neural representation (PR-INR) framework. Our PR-INR unifies motion correction, structural refinement, and volumetric synthesis within a geometry-aware coordinate space. Specifically, a motion-aware diffusion module is first employed to generate coarse volumetric reconstructions that suppress motion artifacts and preserve global anatomical structures. Then, we introduce an implicit detail restoration module that performs residual refinement by aligning spatial coordinates with visual features. It corrects local structures and enhances boundary precision. Further, a voxel continuous-aware representation module represents the image as a continuous function over 3D coordinates. It enables accurate inter-slice completion and high-frequency detail recovery. We evaluate PR-INR on five public MRI datasets under various motion conditions (3% and 5% displacement), undersampling rates (4x and 8x) and slice resolutions (scale = 5). Experimental results demonstrate that PR-INR outperforms state-of-the-art methods in both quantitative reconstruction metrics and visual quality. It further shows generalization and robustness across diverse unseen domains.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.16136 [pdf, ps, other]

Seeing is Fixing: Cross-Modal Reasoning with Multimodal LLMs for Visual Software Issue Fixing

Authors: Kai Huang, Jian Zhang, Xiaofei Xie, Chunyang Chen

Abstract: Large language model (LLM)-based automated program repair (APR) techniques have shown promising results in resolving real-world GitHub issue tasks. Existing APR systems are primarily evaluated in unimodal settings (e.g., SWE-bench). However, these autonomous systems struggle to resolve multimodal problem scenarios (e.g., SWE-bench M) due to limitations in interpreting and leveraging visual information. In multimodal scenarios, LLMs need to rely on visual information in the graphical user interface (GUI) to understand bugs and generate fixes. To bridge this gap, we propose GUIRepair, a cross-modal reasoning approach for resolving multimodal issue scenarios by understanding and capturing visual information. Specifically, GUIRepair integrates two key components, Image2Code and Code2Image, to enhance fault comprehension and patch validation. Image2Code extracts relevant project documents based on the issue report, then applies this domain knowledge to generate the reproduced code responsible for the visual symptoms, effectively translating GUI images into executable context for better fault comprehension. Code2Image replays the visual issue scenario using the reproduced code and captures GUI renderings of the patched program to assess whether the fix visually resolves the issue, providing feedback for patch validation. We evaluate GUIRepair on SWE-bench M, and the approach demonstrates significant effectiveness. When utilizing GPT-4o as the base model, GUIRepair solves 157 instances, outperforming the best open-source baseline by 26 instances. Furthermore, when using o4-mini as the base model, GUIRepair can achieve even better results and solve 175 instances, outperforming the top commercial system by 22 instances. This emphasizes the success of our new perspective on incorporating cross-modal reasoning by understanding and capturing visual information to resolve multimodal issues.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.16112 [pdf, ps, other]

AutoV: Learning to Retrieve Visual Prompt for Large Vision-Language Models

Authors: Yuan Zhang, Chun-Kai Fan, Tao Huang, Ming Lu, Sicheng Yu, Junwen Pan, Kuan Cheng, Qi She, Shanghang Zhang

Abstract: Inspired by text prompts in large language models (LLMs), visual prompts have been explored to enhance the reasoning capabilities of large vision-language models (LVLMs). Current methods design heuristic visual prompts, such as overlaying a text-query-guided attention heatmap on the original input image. However, designing effective prompts manually is challenging and time-consuming, and it often fails to explore the benefits of different visual prompts, leading to sub-optimal performance. To this end, we propose AutoV that learns to automatically select the optimal visual prompt from various candidates based on given textual queries and the input image. To train AutoV, we developed an automatic data collection and labeling pipeline that evaluates various visual prompts with a pre-trained LVLM. We input a set of visual prompts into the LVLM and rank them according to the prediction losses generated by the model. Using the ranking as a supervision signal, we train AutoV to automatically choose the optimal visual prompt from various visual prompts for LVLMs. Experimental results indicate that AutoV enhances the performance of various LVLMs across multiple popular image understanding tasks. For instance, LLaVA-OV with AutoV achieves $1.7\%$ accuracy gain on LLaVA$^{\text{Wild}}$, and AutoV boosts Qwen2.5-VL by $1.9\%$ on MMMU, highlighting its potential as an optimal visual prompting method for LVLMs.

Submitted 19 June, 2025; originally announced June 2025.

Comments: 19 pages

arXiv:2506.16102 [pdf, ps, other]

Fast Training-free Perceptual Image Compression

Authors: Ziran Zhu, Tongda Xu, Minye Huang, Dailan He, Xingtong Ge, Xinjie Zhang, Ling Li, Yan Wang

Abstract: Training-free perceptual image codecs adopt a pre-trained unconditional generative model during decoding to avoid training a new conditional generative model. However, they rely heavily on diffusion inversion or sample communication, which take from 1 min to an intractable amount of time to decode a single image. In this paper, we propose a training-free algorithm that improves the perceptual quality of any existing codec with a theoretical guarantee. We further propose different implementations for optimal perceptual quality when the decoding time budget is $\approx 0.1$s, $0.1-10$s and $\ge 10$s. Our approach: 1) improves the decoding time of training-free codecs from 1 min to $0.1-10$s with comparable perceptual quality; 2) can be applied to non-differentiable codecs such as VTM; 3) can be used to improve previous perceptual codecs, such as MS-ILLM; 4) can easily achieve a perception-distortion trade-off. Empirically, we show that our approach successfully improves the perceptual quality of ELIC, VTM and MS-ILLM with fast decoding. Our approach achieves comparable FID to previous training-free codecs with significantly less decoding time, and it still outperforms previous conditional generative model based codecs such as HiFiC and MS-ILLM in terms of FID. The source code is provided in the supplementary material.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.16100 [pdf, ps, other]

Seesaw Portal to Super Heavy Dark Matter with $Z_3$ Symmetry

Authors: Cai-Xia Yang, Zhi-Long Han, Fei Huang, Yi Jin, Honglei Li

Abstract: Right-handed neutrinos $N$ are introduced to explain the origin of the tiny neutrino masses via the seesaw mechanism. As required by a relatively large Yukawa coupling and leptogenesis, the masses of right-handed neutrinos are beyond $10^{9}$ GeV. Such heavy right-handed neutrinos can mediate the production of super heavy dark matter $χ$ via the freeze-in mechanism. In the minimal $Z_2$ symmetric model, the right-handed neutrino portal interaction is $y_N φ\barχ N$ with the dark scalar $φ$. One drawback of the $Z_2$ symmetric model is that the mass ordering $m_N>m_φ$ with long-lived $φ$ is almost ruled out by Big Bang Nucleosynthesis. In this paper, we propose that by extending the dark symmetry to $Z_3$, one additional interaction $y_χφ\barχ^c χ$ is further allowed. In this way, the new decay mode $φ\to χχ$ would lead to the dark scalar $φ$ being short-lived even with a feeble $y_χ$, so it is allowed by the cosmological constraints. The phenomenology of the $Z_3$ symmetric super heavy dark matter model is also studied in this paper.

Submitted 19 June, 2025; originally announced June 2025.

Comments: 19 pages, 7 figures

arXiv:2506.16078 [pdf, ps, other]

Probing the Robustness of Large Language Models Safety to Latent Perturbations

Authors: Tianle Gu, Kexin Huang, Zongqi Wang, Yixu Wang, Jie Li, Yuanqi Yao, Yang Yao, Yujiu Yang, Yan Teng, Yingchun Wang

Abstract: Safety alignment is a key requirement for building reliable Artificial General Intelligence. Despite significant advances in safety alignment, we observe that minor latent shifts can still trigger unsafe responses in aligned models. We argue that this stems from the shallow nature of existing alignment methods, which focus on surface-level refusal behaviors without sufficiently altering internal representations. Consequently, small shifts in hidden activations can re-trigger harmful behaviors embedded in the latent space. To explore the robustness of safety alignment to latent perturbations, we introduce a probing method that measures the Negative Log-Likelihood of the original response generated by the model. This probe quantifies local sensitivity in the latent space, serving as a diagnostic tool for identifying vulnerable directions. Based on this signal, we construct effective jailbreak trajectories, giving rise to the Activation Steering Attack (ASA). More importantly, these insights offer a principled foundation for improving alignment robustness. To this end, we introduce Layer-wise Adversarial Patch Training (LAPT), a fine-tuning strategy that injects controlled perturbations into hidden representations during training. Experimental results highlight that LAPT strengthens alignment robustness without compromising general capabilities. Our findings reveal fundamental flaws in current alignment paradigms and call for representation-level training strategies that move beyond surface-level behavior supervision.
Code and results are available at https://github.com/Carol-gutianle/LatentSafety.
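As a rough illustration of the probe quantity named in this abstract, the sketch below computes the negative log-likelihood (NLL) of a fixed response from per-token probabilities, and the change in NLL after a perturbation. The probability values and the function names are hypothetical; how the paper obtains the probabilities or applies the latent perturbation is not stated in the abstract and is not modeled here.

```python
import math


def negative_log_likelihood(token_probs):
    """NLL of a response: minus the sum of log-probabilities of its tokens."""
    return -sum(math.log(p) for p in token_probs)


def nll_shift(original_probs, perturbed_probs):
    """Increase in NLL of the same response after a (hypothetical) perturbation.

    A large positive value means the model has become much less likely to
    reproduce its original response.
    """
    return negative_log_likelihood(perturbed_probs) - negative_log_likelihood(original_probs)


# Hypothetical per-token probabilities of the original response,
# before and after a small shift in hidden activations.
before = [0.9, 0.8, 0.95]
after = [0.5, 0.4, 0.6]

sensitivity = nll_shift(before, after)
```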

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.16037 [pdf, ps, other]

Enhancing Document-Level Question Answering via Multi-Hop Retrieval-Augmented Generation with LLaMA 3

Authors: Xinyue Huang, Ziqi Lin, Fang Sun, Wenchao Zhang, Kejian Tong, Yunbo Liu

Abstract: This paper presents a novel Retrieval-Augmented Generation (RAG) framework tailored for complex question answering tasks, addressing challenges in multi-hop reasoning and contextual understanding across lengthy documents. Built upon LLaMA 3, the framework integrates a dense retrieval module with advanced context fusion and multi-hop reasoning mechanisms, enabling more accurate and coherent response generation. A joint optimization strategy combining retrieval likelihood and generation cross-entropy improves the model's robustness and adaptability. Experimental results show that the proposed system outperforms existing retrieval-augmented and generative baselines, confirming its effectiveness in delivering precise, contextually grounded answers.

Submitted 19 June, 2025; originally announced June 2025.

arXiv:2506.16031 [pdf, ps, other]

Longtime Monitoring of TeV Radio Galaxies with HAWC

Authors: R. Alfaro, C. Alvarez, E. Anita-Rangel, J. C. Arteaga-Velázquez, D. Avila Rojas, H. A. Ayala Solares, R. Babu, P. Bangale, E. Belmont-Moreno, A. Bernal, K. S. Caballero-Mora, T. Capistrán, A. Carramiñana, F. Carreón, S. Casanova, U. Cotti, J. Cotzomi, S. Coutiño de León, E. De la Fuente, D. Depaoli, P. Desiati, N. Di Lalla, R. Diaz Hernandez, M. A. DuVernois, J. C. Díaz-Vélez, et al. (63 additional authors not shown)

Abstract: We present the monitoring of the TeV-emitting radio galaxies M87, NGC 1275, 3C 264, and IC 310 with the High Altitude Water Cherenkov Observatory (HAWC) over a period of approximately $7.5$ years. The analysis includes light curves at daily, weekly and monthly time scales for the four sources. We report the detection of gamma-ray emission from M87 with a significance exceeding 5$σ$. Owing to this significant detection, this work reports the integrated TeV spectrum of M87 from the longest temporal coverage to date. The source is well described as a point-like source modeled by a power law spectrum with spectral index $α= 2.53\pm0.29$ and a flux of $(7.09\pm 1.24)\times10^{-13}$ $\rm{cm}^{-2}\,{s}^{-1}\,{TeV}^{-1}$ at $1\,\rm{TeV}$. The maximum energy of the detected emission in M87, at 1$σ$ confidence level (C.L.), reaches 26.5 TeV. HAWC's observation of M87 reveals a low flux spectrum for the longest observation to date of this radio galaxy. 3C 264 is marginally detected with a significance slightly below 4$σ$, while NGC 1275 and IC 310 are not detected. The weekly light curves show an increased number of fluxes above $2σ$ for M87 starting in 2019, and for 3C 264 starting in 2018, which can be interpreted as the moment at which these sources start to exhibit an enhanced steady TeV emission. Overall, in the four radio galaxies, the cumulative significance over time indicates a behavior that resembles that of a gamma-ray variable active galaxy, such as the blazar Markarian 421.
This supports the importance of monitoring radio galaxies to identify periods of higher activity and flares, enabling further multi-messenger studies.

Submitted 19 June, 2025; originally announced June 2025.

Comments: 14 pages, 1 table, 7 figures

arXiv:2506.16020 [pdf, ps, other]

VS-Singer: Vision-Guided Stereo Singing Voice Synthesis with Consistency Schrödinger Bridge

Authors: Zijing Zhao, Kai Wang, Hao Huang, Ying Hu, Liang He, Jichen Yang

Abstract: To explore the potential advantages of utilizing spatial cues from images for generating stereo singing voices with room reverberation, we introduce VS-Singer, a vision-guided model designed to produce stereo singing voices with room reverberation from scene images. VS-Singer comprises three modules. First, a modal interaction network integrates spatial features into text encoding to create a linguistic representation enriched with spatial information. Second, the decoder employs a consistency Schrödinger bridge to facilitate one-step sample generation. Moreover, we utilize the SFE module to improve the consistency of audio-visual matching. To our knowledge, this study is the first to combine stereo singing voice synthesis with visual acoustic matching within a unified framework. Experimental results demonstrate that VS-Singer can effectively generate stereo singing voices that align with the scene perspective in a single step.

Submitted 19 June, 2025; originally announced June 2025.

Comments: Accepted by Interspeech 2025

arXiv:2506.16001 [pdf, ps, other]

AutoHFormer: Efficient Hierarchical Autoregressive Transformer for Time Series Prediction

Authors: Qianru Zhang, Honggang Wen, Ming Li, Dong Huang, Siu-Ming Yiu, Christian S. Jensen, Pietro Liò

Abstract: Time series forecasting requires architectures that simultaneously achieve three competing objectives: (1) strict temporal causality for reliable predictions, (2) sub-quadratic complexity for practical scalability, and (3) multi-scale pattern recognition for accurate long-horizon forecasting. We introduce AutoHFormer, a hierarchical autoregressive transformer that addresses these challenges through three key innovations: 1) Hierarchical Temporal Modeling: Our architecture decomposes predictions into segment-level blocks processed in parallel, followed by intra-segment sequential refinement. This dual-scale approach maintains temporal coherence while enabling efficient computation. 2) Dynamic Windowed Attention: The attention mechanism employs learnable causal windows with exponential decay, reducing complexity while preserving precise temporal relationships. This design avoids both the anti-causal violations of standard transformers and the sequential bottlenecks of RNN hybrids. 3) Adaptive Temporal Encoding: A novel position encoding system is adopted to capture time patterns at multiple scales. It combines fixed oscillating patterns for short-term variations with learnable decay rates for long-term trends. Comprehensive experiments demonstrate that AutoHFormer achieves 10.76X faster training and a 6.06X memory reduction compared to PatchTST on PEMS08, while maintaining consistent accuracy across 96-720 step horizons in most cases. These breakthroughs establish new benchmarks for efficient and precise time series modeling.
Implementations of our method and all baselines in hierarchical autoregressive mechanism are available at https://github.com/lizzyhku/Autotime.

Submitted 18 June, 2025; originally announced June 2025.

Comments: 14 pages

arXiv:2506.15961 [pdf]

TrainVerify: Equivalence-Based Verification for Distributed LLM Training

Authors: Yunchi Lu, Youshan Miao, Cheng Tan, Peng Huang, Yi Zhu, Xian Zhang, Fan Yang

Abstract: Training large language models (LLMs) at scale requires parallel execution across thousands of devices, incurring enormous computational costs. Yet, these costly distributed training runs are rarely verified, leaving them prone to silent errors and potentially wasting millions of GPU hours. We introduce TrainVerify, a system for verifiable distributed training of LLMs. Given a deep learning model's logical specification as the ground truth, TrainVerify formally verifies that a distributed parallel execution plan is mathematically equivalent to it. Direct verification is notoriously difficult due to the sheer scale of LLMs, which often involve billions of variables and highly intricate computation graphs. Therefore, TrainVerify introduces shape-reduction techniques and a stage-wise parallel verification algorithm that significantly reduce complexity while preserving formal correctness. TrainVerify scales to frontier LLMs, including the successful verification of the Llama3 (405B) and DeepSeek-V3 (671B) training plans.

Submitted 18 June, 2025; originally announced June 2025.

arXiv:2506.15956 [pdf, ps, other]

Scalable quantum current source on commercial 22-nm CMOS process technology

Authors: Ajit Dash, Suyash Pati Tripathi, Dimitrios Georgakopoulos, MengKe Feng, Steve Yianni, Ensar Vahapoglu, Md Mamunur Rahman, Shai Bonen, Owen Brace, Jonathan Y. Huang, Wee Han Lim, Kok Wai Chan, Will Gilbert, Arne Laucht, Andrea Morello, Andre Saraiva, Christopher C. Escott, Sorin P. Voinigescu, Andrew S. Dzurak, Tuomo Tanttu

Abstract: Utilizing quantum effects in nanoscopic devices has in the past mostly been accessible through academic cleanrooms and research foundries. Opening the quantum frontier for wider industrial applications likely requires the scale of well-established complementary metal-oxide-semiconductor (CMOS) foundries for manufacturing transistor-based quantum devices operable above subkelvin temperatures. Here, we operate a commercial 22-nm-node fully depleted silicon-on-insulator (FDSOI) CMOS device as dual parallel-connected charge pumps for the implementation of a quantum current standard in the International System of Units (SI). We measure an accuracy of $(1.2 \pm 0.1)\times10^{-3}$ A/A for this scalable architecture at 50 MHz with reference to SI-traceable voltage and resistance standards in a pumped helium system. Looking ahead, we propose a practical monolithic CMOS chip that incorporates one million parallel-connected charge pumps along with on-chip control electronics. This can be operated as a table-top primary standard, generating quantum currents up to microampere levels.
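The current scales quoted in this abstract can be checked with a short back-of-the-envelope sketch. It assumes the standard idealized charge-pump relation I = N·n·e·f (N pumps in parallel, n electrons transferred per cycle, drive frequency f) with n = 1, and it assumes the proposed million-pump chip is driven at the same 50 MHz as the measured device; neither assumption is stated in the abstract. The function name is illustrative.

```python
# Elementary charge in coulombs (exact by definition in the 2019 SI).
ELEMENTARY_CHARGE = 1.602176634e-19


def pump_current(frequency_hz, n_pumps=1, electrons_per_cycle=1):
    """Ideal quantized current of parallel charge pumps, in amperes.

    Assumes every pump transfers exactly `electrons_per_cycle` electrons
    per drive cycle (no pumping errors).
    """
    return n_pumps * electrons_per_cycle * ELEMENTARY_CHARGE * frequency_hz


single = pump_current(50e6)                       # about 8.01e-12 A (8 pA)
dual = pump_current(50e6, n_pumps=2)              # about 1.60e-11 A (16 pA)
million = pump_current(50e6, n_pumps=1_000_000)   # about 8.01e-6 A (8 uA)
```

Under these assumptions, one million pumps at 50 MHz give roughly 8 microamperes, which is consistent with the "microampere levels" mentioned for the proposed chip.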

Submitted 18 June, 2025; originally announced June 2025.

Comments: 16 pages, 4 figures, 3 extended data figures, 7 extended data tables

arXiv:2506.15943 [pdf, ps, other]

On the optimal regret of collaborative personalized linear bandits

Authors: Bruce Huang, Ruida Zhou, Lin F. Yang, Suhas Diggavi

Abstract: Stochastic linear bandits are a fundamental model for sequential decision making, where an agent selects a vector-valued action and receives a noisy reward with expected value given by an unknown linear function. Although well studied in the single-agent setting, many real-world scenarios involve multiple agents solving heterogeneous bandit problems, each with a different unknown parameter. Applying single-agent algorithms independently ignores cross-agent similarity and learning opportunities. This paper investigates the optimal regret achievable in collaborative personalized linear bandits. We provide an information-theoretic lower bound that characterizes how the number of agents, the interaction rounds, and the degree of heterogeneity jointly affect regret. We then propose a new two-stage collaborative algorithm that achieves the optimal regret. Our analysis models heterogeneity via a hierarchical Bayesian framework and introduces a novel information-theoretic technique for bounding regret. Our results offer a complete characterization of when and how collaboration helps, with an optimal regret bound of $\tilde{O}(d\sqrt{mn})$, $\tilde{O}(dm^{1-γ}\sqrt{n})$, $\tilde{O}(dm\sqrt{n})$ for the number of rounds $n$ in the range of $(0, \frac{d}{m σ^2})$, $[\frac{d}{m^{2γ} σ^2}, \frac{d}{σ^2}]$ and $(\frac{d}{σ^2}, \infty)$ respectively, where $σ$ measures the level of heterogeneity, $m$ is the number of agents, and $γ\in[0, 1/2]$ is an absolute constant. In contrast, agents without collaboration achieve a regret bound of $O(dm\sqrt{n})$ at best.

Submitted 18 June, 2025; originally announced June 2025.

Comments: 30 pages, 4 figures

arXiv:2506.15873 [pdf, ps, other]

DeckFlow: Iterative Specification on a Multimodal Generative Canvas

Authors: Gregory Croisdale, Emily Huang, John Joon Young Chung, Anhong Guo, Xu Wang, Austin Z. Henley, Cyrus Omar

Abstract: Generative AI promises to allow people to create high-quality personalized media. Although existing tooling is powerful, we identify three fundamental design problems with it through a literature review. We introduce a multimodal generative AI tool, DeckFlow, to address these problems. First, DeckFlow supports task decomposition by allowing users to maintain multiple interconnected subtasks on an infinite canvas populated by cards connected through visual dataflow affordances. Second, DeckFlow supports a specification decomposition workflow where an initial goal is iteratively decomposed into smaller parts and combined using feature labels and clusters. Finally, DeckFlow supports generative space exploration by generating multiple prompt and output variations, presented in a grid, that can feed back recursively into the next design iteration. We evaluate DeckFlow for text-to-image generation against a state-of-practice conversational AI baseline for image generation tasks. We then add audio generation and investigate user behaviors in a more open-ended creative setting with text, image, and audio outputs.

Submitted 18 June, 2025; originally announced June 2025.

arXiv:2506.15843 [pdf]

Optimized cerebral blood flow measurement in speckle contrast optical spectroscopy via refinement of noise calibration

Authors: Ninghe Liu, Yu Xi Huang, Simon Mahler, Changhuei Yang

Abstract: Speckle contrast optical spectroscopy (SCOS) offers a non-invasive and cost-effective method for monitoring cerebral blood flow (CBF). However, extracting accurate CBF from SCOS necessitates precise noise pre-calibration. Calibration errors can degrade CBF measurement fidelity, particularly when the overall signal level is low. Such errors primarily stem from residual speckle contrast associated with camera and shot noise, whose fluctuations exhibit a temporal structure that mimics cerebral blood volume (CBV) waveforms. We propose an optimization-based framework that performs an adaptive refinement of noise calibration, mitigating the CBV-mimicking artifacts by reducing the CBF-CBV waveform correlation. Validated on 10 human subjects, our approach effectively lowered the signal threshold for a reliable CBF signal from 97 to 26 electrons per pixel for a 1920x1200-pixel SCOS system. This improvement enables more accurate and robust CBF measurements in SCOS, especially at large source-detector (SD) distances for deeper tissue interrogation.

Submitted 18 June, 2025; originally announced June 2025.

Comments: 5 pages, 3 figures

arXiv:2506.15839 [pdf, ps, other]

Link Priority Buffer-Aided Relay Selection with Energy Storage from Energy Harvest

Authors: Mohammad Alkhawatrah, Yu Gong, Chong Huang, Gaojie Chen

Abstract: This paper proposes a novel relay selection scheme for buffer-aided wireless networks with relays equipped with both data buffers and energy storage. While buffer-aided relay networks have demonstrated significantly improved performance, energy harvesting has become an attractive solution in many wireless systems, garnering considerable attention when applied to buffer-aided relay networks. It is known that state-dependent selection rules must be used to achieve full diversity order in buffer-aided relay networks, requiring link priorities for data transmission to be set based on system states. This task becomes challenging when both data buffers and energy storage are involved. In this paper, we introduce a novel method for setting link priorities, which forms the basis for a new selection rule. The outage probability of the proposed selection scheme is derived. The simulation results demonstrate the superiority of our proposed algorithm, which achieves full diversity in buffer-aided relay selection with energy storage and consistently outperforms baseline approaches across various metrics.

Submitted 18 June, 2025; originally announced June 2025.

Comments: 11 pages

arXiv:2506.15786 [pdf, ps, other]

Graphics4Science: Computer Graphics for Scientific Impacts

Authors: Peter Yichen Chen, Minghao Guo, Hanspeter Pfister, Ming Lin, William Freeman, Qixing Huang, Han-Wei Shen, Wojciech Matusik

Abstract: Computer graphics, often associated with films, games, and visual effects, has long been a powerful tool for addressing scientific challenges--from its origins in 3D visualization for medical imaging to its role in modern computational modeling and simulation. This course explores the deep and evolving relationship between computer graphics and science, highlighting past achievements, ongoing contributions, and open questions that remain. We show how core methods, such as geometric reasoning and physical modeling, provide inductive biases that help address challenges in both fields, especially in data-scarce settings. To that end, we aim to reframe graphics as a modeling language for science by bridging vocabulary gaps between the two communities. Designed for both newcomers and experts, Graphics4Science invites the graphics community to engage with science, tackle high-impact problems where graphics expertise can make a difference, and contribute to the future of scientific discovery. Additional details are available on the course website: https://graphics4science.github.io

Submitted 18 June, 2025; originally announced June 2025.

arXiv:2506.15770 [pdf, ps, other]

Probing the pseudogap and beyond: examining single-particle properties of the hole- and electron-doped Hubbard model

Authors: Wen O. Wang, Edwin W. Huang, Brian Moritz, Thomas P. Devereaux

Abstract: We compute high-resolution angle-resolved photoemission spectroscopy of the Hubbard model using the unbiased determinant quantum Monte Carlo algorithm, revealing an asymmetry between electron and hole doping. Electron doping exhibits more coherent quasiparticles and stronger antiferromagnetic correlations compared to hole doping. At low doping, a nodal-antinodal dichotomy on the Fermi surface is observed, similar to cuprate experiments. The dichotomy reflects the momentum dependence of the Mott gap, as manifested in both the spectral function and the self-energy. For hole doping, we observe a transition towards the pseudogap, without a signature of pocket formation. The simulated nuclear magnetic resonance pseudogap temperatures do not necessarily agree with the temperature determined by spectroscopy. These findings collectively suggest the pseudogap is a smooth crossover driven by strong correlations.

Submitted 18 June, 2025; originally announced June 2025.

Comments: 16 pages, 14 figures. Appendix: 9 pages, 11 figures

arXiv:2506.15755 [pdf, ps, other]

VLMInferSlow: Evaluating the Efficiency Robustness of Large Vision-Language Models as a Service

Authors: Xiasi Wang, Tianliang Yao, Simin Chen, Runqi Wang, Lei YE, Kuofeng Gao, Yi Huang, Yuan Yao

Abstract: Vision-Language Models (VLMs) have demonstrated great potential in real-world applications. While existing research primarily focuses on improving their accuracy, the efficiency remains underexplored. Given the real-time demands of many applications and the high inference overhead of VLMs, efficiency robustness is a critical issue. However, previous studies evaluate efficiency robustness under unrealistic assumptions, requiring access to the model architecture and parameters -- an impractical scenario in ML-as-a-service settings, where VLMs are deployed via inference APIs. To address this gap, we propose VLMInferSlow, a novel approach for evaluating VLM efficiency robustness in a realistic black-box setting. VLMInferSlow incorporates fine-grained efficiency modeling tailored to VLM inference and leverages zero-order optimization to search for adversarial examples. Experimental results show that VLMInferSlow generates adversarial images with imperceptible perturbations, increasing the computational cost by up to 128.47%. We hope this research raises the community's awareness about the efficiency robustness of VLMs.

Submitted 18 June, 2025; originally announced June 2025.

Comments: Accepted by ACL 2025

arXiv:2506.15741 [pdf, ps, other]

OAgents: An Empirical Study of Building Effective Agents

Authors: He Zhu, Tianrui Qin, King Zhu, Heyuan Huang, Yeyi Guan, Jinxiang Xia, Yi Yao, Hanhao Li, Ningning Wang, Pai Liu, Tianhao Peng, Xin Gui, Xiaowan Li, Yuhui Liu, Yuchen Eleanor Jiang, Jun Wang, Changwang Zhang, Xiangru Tang, Ge Zhang, Jian Yang, Minghao Liu, Xitong Gao, Wangchunshu Zhou, Jiaheng Liu

Abstract: Recently, Agentic AI has become an increasingly popular research field. However, we argue that current agent research practices lack standardization and scientific rigor, making it hard to conduct fair comparisons among methods. As a result, it is still unclear how different design choices in agent frameworks affect effectiveness, and measuring their progress remains challenging. In this work, we conduct a systematic empirical study on the GAIA benchmark and BrowseComp to examine the impact of popular design choices in key agent components in a fair and rigorous manner. We find that the lack of a standard evaluation protocol makes previous works, even open-sourced ones, non-reproducible, with significant variance between random runs. Therefore, we introduce a more robust evaluation protocol to stabilize comparisons. Our study reveals which components and designs are crucial for effective agents, and which are redundant despite seeming logical. Based on our findings, we build and open-source OAgents, a new foundation agent framework that achieves state-of-the-art performance among open-source projects. OAgents offers a modular design for various agent components, promoting future research in Agentic AI.

Submitted 17 June, 2025; originally announced June 2025.

Comments: 28 pages

arXiv:2506.15617

The Compositional Architecture of Regret in Large Language Models

Authors: Xiangxiang Cui, Shu Yang, Tianjin Huang, Wanyu Lin, Lijie Hu, Di Wang

Abstract: Regret in Large Language Models refers to their explicit regret expression when presented with evidence contradicting their previously generated misinformation. Studying the regret mechanism is crucial for enhancing model reliability and helps in revealing how cognition is coded in neural networks. To understand this mechanism, we need to first identify regret expressions in model outputs, then analyze their internal representation. This analysis requires examining the model's hidden states, where information processing occurs at the neuron level. However, this faces three key challenges: (1) the absence of specialized datasets capturing regret expressions, (2) the lack of metrics to find the optimal regret representation layer, and (3) the lack of metrics for identifying and analyzing regret neurons. Addressing these limitations, we propose: (1) a workflow for constructing a comprehensive regret dataset through strategically designed prompting scenarios, (2) the Supervised Compression-Decoupling Index (S-CDI) metric to identify optimal regret representation layers, and (3) the Regret Dominance Score (RDS) metric to identify regret neurons and the Group Impact Coefficient (GIC) to analyze activation patterns. Our experimental results successfully identified the optimal regret representation layer using the S-CDI metric, which significantly enhanced performance in probe classification experiments. Additionally, we discovered an M-shaped decoupling pattern across model layers, revealing how information processing alternates between coupling and decoupling phases. Through the RDS metric, we categorized neurons into three distinct functional groups: regret neurons, non-regret neurons, and dual neurons.

Submitted 18 June, 2025; originally announced June 2025.

Comments: 23 pages

arXiv:2506.15565

Baltimore Atlas: FreqWeaver Adapter for Semi-supervised Ultra-high Spatial Resolution Land Cover Classification

Authors: Junhao Wu, Aboagye-Ntow Stephen, Chuyuan Wang, Gang Chen, Xin Huang

Abstract: Ultra-high Spatial Resolution Land Cover Classification is essential for fine-grained land cover analysis, yet it remains challenging due to the high cost of pixel-level annotations, significant scale variation, and the limited adaptability of large-scale vision models. Existing methods typically focus on 1-meter spatial resolution imagery and rely heavily on annotated data, whereas practical applications often require processing higher-resolution imagery under weak supervision. To address this, we propose a parameter-efficient semi-supervised segmentation framework for 0.3 m spatial resolution imagery, which leverages the knowledge of SAM2 and introduces a remote sensing-specific FreqWeaver Adapter to enhance fine-grained detail modeling while maintaining a lightweight design at only 5.96% of the total model parameters. By effectively leveraging unlabeled data and maintaining minimal parameter overhead, the proposed method delivers robust segmentation results with superior structural consistency, achieving a 1.78% improvement over existing parameter-efficient tuning strategies and a 3.44% gain compared to state-of-the-art high-resolution remote sensing segmentation approaches.

Submitted 18 June, 2025; originally announced June 2025.

arXiv:2506.15533

Measurements of the absolute branching fractions of the doubly Cabibbo-suppressed decays $D^+\to K^+π^0$, $D^+\to K^+η$ and $D^+\to K^+η^{\prime}$

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, H. Cai , et al. (697 additional authors not shown)

Abstract: Using $20.3\,\rm fb^{-1}$ of $e^+e^-$ collision data collected at a center-of-mass energy of 3.773\,GeV with the BESIII detector, we present improved measurements of the absolute branching fractions of the doubly Cabibbo-suppressed decays $D^+\to K^+π^0$, $D^+\to K^+η$ and $ D^+ \to K^+ η^{\prime}$ with the double-tag method. The statistical significance of each signal decay exceeds $10σ$. The branching fractions are determined to be ${\mathcal B}(D^+\to K^+ π^0) = (1.45 \pm 0.06 \pm 0.06)\times 10^{-4}$, ${\mathcal B}(D^+\to K^+ η) = (1.17 \pm 0.10 \pm 0.03)\times 10^{-4}$ and ${\mathcal B}(D^+\to K^+ η^{\prime}) = (1.88 \pm 0.15 \pm 0.06)\times 10^{-4}$, where the first uncertainties are statistical and the second systematic. These results are consistent with the world average values but with significantly improved precision.

Submitted 18 June, 2025; originally announced June 2025.

Comments: 20 pages, 4 figures

arXiv:2506.15492

LIT-LVM: Structured Regularization for Interaction Terms in Linear Predictors using Latent Variable Models

Authors: Mohammadreza Nemati, Zhipeng Huang, Kevin S. Xu

Abstract: Some of the simplest, yet most frequently used predictors in statistics and machine learning use weighted linear combinations of features. Such linear predictors can model non-linear relationships between features by adding interaction terms corresponding to the products of all pairs of features. We consider the problem of accurately estimating coefficients for interaction terms in linear predictors. We hypothesize that the coefficients for different interaction terms have an approximate low-dimensional structure and represent each feature by a latent vector in a low-dimensional space. This low-dimensional representation can be viewed as a structured regularization approach that further mitigates overfitting in high-dimensional settings beyond standard regularizers such as the lasso and elastic net. We demonstrate that our approach, called LIT-LVM, achieves superior prediction accuracy compared to elastic net and factorization machines on a wide variety of simulated and real data, particularly when the number of interaction terms is high compared to the number of samples. LIT-LVM also provides low-dimensional latent representations for features that are useful for visualizing and analyzing their relationships.

Submitted 18 June, 2025; originally announced June 2025.
