-
A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data
Authors:
Cheng-Kang Chou,
Chan-Jan Hsu,
Ho-Lam Chung,
Liang-Hsuan Tseng,
Hsi-Chun Cheng,
Yu-Kuan Fu,
Kuan Po Huang,
Hung-Yi Lee
Abstract:
We propose a self-refining framework that enhances ASR performance with only unlabeled datasets. The process starts with an existing ASR model generating pseudo-labels on unannotated speech, which are then used to train a high-fidelity text-to-speech (TTS) system. Then, synthesized speech text pairs are bootstrapped into the original ASR system, completing the closed-loop self-improvement cycle. W…
▽ More
We propose a self-refining framework that enhances ASR performance with only unlabeled datasets. The process starts with an existing ASR model generating pseudo-labels on unannotated speech, which are then used to train a high-fidelity text-to-speech (TTS) system. Then, synthesized speech text pairs are bootstrapped into the original ASR system, completing the closed-loop self-improvement cycle. We demonstrated the effectiveness of the framework on Taiwanese Mandarin speech. Leveraging 6,000 hours of unlabeled speech, a moderate amount of text data, and synthetic content from the AI models, we adapt Whisper-large-v2 into a specialized model, Twister. Twister reduces error rates by up to 20% on Mandarin and 50% on Mandarin-English code-switching benchmarks compared to Whisper. Results highlight the framework as a compelling alternative to pseudo-labeling self-distillation approaches and provides a practical pathway for improving ASR performance in low-resource or domain-specific settings.
△ Less
Submitted 16 June, 2025; v1 submitted 10 June, 2025;
originally announced June 2025.
-
GUIRoboTron-Speech: Towards Automated GUI Agents Based on Speech Instructions
Authors:
Wenkang Han,
Zhixiong Zeng,
Jing Huang,
Shu Jiang,
Liming Zheng,
Longrong Yang,
Haibo Qiu,
Chang Yao,
Jingyuan Chen,
Lin Ma
Abstract:
Autonomous agents for Graphical User Interfaces (GUIs) are revolutionizing human-computer interaction, yet their reliance on text-based instructions imposes limitations on accessibility and convenience, particularly in hands-free scenarios. To address this gap, we propose GUIRoboTron-Speech, the first end-to-end autonomous GUI agent that directly accepts speech instructions and on-device screensho…
▽ More
Autonomous agents for Graphical User Interfaces (GUIs) are revolutionizing human-computer interaction, yet their reliance on text-based instructions imposes limitations on accessibility and convenience, particularly in hands-free scenarios. To address this gap, we propose GUIRoboTron-Speech, the first end-to-end autonomous GUI agent that directly accepts speech instructions and on-device screenshots to predict actions. Confronted with the scarcity of speech-based GUI agent datasets, we initially generated high-quality speech instructions for training by leveraging a random timbre text-to-speech (TTS) model to convert existing text instructions. We then develop GUIRoboTron-Speech's capabilities through progressive grounding and planning training stages. A key contribution is a heuristic mixed-instruction training strategy designed to mitigate the modality imbalance inherent in pre-trained foundation models. Comprehensive experiments on several benchmark datasets validate the robust and superior performance of GUIRoboTron-Speech, demonstrating the significant potential and widespread applicability of speech as an effective instruction modality for driving GUI agents. Our code and datasets are available at https://github.com/GUIRoboTron/GUIRoboTron-Speech.
△ Less
Submitted 10 June, 2025;
originally announced June 2025.
-
SUTA-LM: Bridging Test-Time Adaptation and Language Model Rescoring for Robust ASR
Authors:
Wei-Ping Huang,
Guan-Ting Lin,
Hung-yi Lee
Abstract:
Despite progress in end-to-end ASR, real-world domain mismatches still cause performance drops, which Test-Time Adaptation (TTA) aims to mitigate by adjusting models during inference. Recent work explores combining TTA with external language models, using techniques like beam search rescoring or generative error correction. In this work, we identify a previously overlooked challenge: TTA can inter…
▽ More
Despite progress in end-to-end ASR, real-world domain mismatches still cause performance drops, which Test-Time Adaptation (TTA) aims to mitigate by adjusting models during inference. Recent work explores combining TTA with external language models, using techniques like beam search rescoring or generative error correction. In this work, we identify a previously overlooked challenge: TTA can interfere with language model rescoring, revealing the nontrivial nature of effectively combining the two methods. Based on this insight, we propose SUTA-LM, a simple yet effective extension of SUTA, an entropy-minimization-based TTA approach, with language model rescoring. SUTA-LM first applies a controlled adaptation process guided by an auto-step selection mechanism leveraging both acoustic and linguistic information, followed by language model rescoring to refine the outputs. Experiments on 18 diverse ASR datasets show that SUTA-LM achieves robust results across a wide range of domains.
△ Less
Submitted 9 June, 2025;
originally announced June 2025.
-
Graph-based RAG Enhancement via Global Query Disambiguation and Dependency-Aware Reranking
Authors:
Ningyuan Li,
Junrui Liu,
Yi Shan,
Minghui Huang,
Tong Li
Abstract:
Contemporary graph-based retrieval-augmented generation (RAG) methods typically begin by extracting entities from user queries and then leverage pre-constructed knowledge graphs to retrieve related relationships and metadata. However, this pipeline's exclusive reliance on entity-level extraction can lead to the misinterpretation or omission of latent yet critical information and relations. As a re…
▽ More
Contemporary graph-based retrieval-augmented generation (RAG) methods typically begin by extracting entities from user queries and then leverage pre-constructed knowledge graphs to retrieve related relationships and metadata. However, this pipeline's exclusive reliance on entity-level extraction can lead to the misinterpretation or omission of latent yet critical information and relations. As a result, retrieved content may be irrelevant or contradictory, and essential knowledge may be excluded, exacerbating hallucination risks and degrading the fidelity of generated responses. To address these limitations, we introduce PankRAG, a framework that combines a globally aware, hierarchical query-resolution strategy with a novel dependency-aware reranking mechanism. PankRAG first constructs a multi-level resolution path that captures both parallel and sequential interdependencies within a query, guiding large language models (LLMs) through structured reasoning. It then applies its dependency-aware reranker to exploit the dependency structure among resolved sub-questions, enriching and validating retrieval results for subsequent sub-questions. Empirical evaluations demonstrate that PankRAG consistently outperforms state-of-the-art approaches across multiple benchmarks, underscoring its robustness and generalizability.
△ Less
Submitted 7 June, 2025;
originally announced June 2025.
-
DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration
Authors:
Hanzhi Zhang,
Heng Fan,
Kewei Sha,
Yan Huang,
Yunhe Feng
Abstract:
Long-context understanding is crucial for many NLP applications, yet transformers struggle with efficiency due to the quadratic complexity of self-attention. Sparse attention methods alleviate this cost but often impose static, predefined masks, failing to capture heterogeneous attention patterns. This results in suboptimal token interactions, limiting adaptability and retrieval accuracy in long-s…
▽ More
Long-context understanding is crucial for many NLP applications, yet transformers struggle with efficiency due to the quadratic complexity of self-attention. Sparse attention methods alleviate this cost but often impose static, predefined masks, failing to capture heterogeneous attention patterns. This results in suboptimal token interactions, limiting adaptability and retrieval accuracy in long-sequence tasks. This work introduces a dynamic sparse attention mechanism that assigns adaptive masks at the attention-map level, preserving heterogeneous patterns across layers and heads. Unlike existing approaches, our method eliminates the need for fine-tuning and predefined mask structures while maintaining computational efficiency. By learning context-aware attention structures, it achieves high alignment with full-attention models, ensuring minimal performance degradation while reducing memory and compute overhead. This approach provides a scalable alternative to full attention, enabling the practical deployment of large-scale Large Language Models (LLMs) without sacrificing retrieval performance. DAM is available at: https://github.com/HanzhiZhang-Ulrica/DAM.
△ Less
Submitted 6 June, 2025;
originally announced June 2025.
-
The Scales of Justitia: A Comprehensive Survey on Safety Evaluation of LLMs
Authors:
Songyang Liu,
Chaozhuo Li,
Jiameng Qiu,
Xi Zhang,
Feiran Huang,
Litian Zhang,
Yiming Hei,
Philip S. Yu
Abstract:
With the rapid advancement of artificial intelligence technology, Large Language Models (LLMs) have demonstrated remarkable potential in the field of Natural Language Processing (NLP), including areas such as content generation, human-computer interaction, machine translation, and code generation, among others. However, their widespread deployment has also raised significant safety concerns. In re…
▽ More
With the rapid advancement of artificial intelligence technology, Large Language Models (LLMs) have demonstrated remarkable potential in the field of Natural Language Processing (NLP), including areas such as content generation, human-computer interaction, machine translation, and code generation, among others. However, their widespread deployment has also raised significant safety concerns. In recent years, LLM-generated content has occasionally exhibited unsafe elements like toxicity and bias, particularly in adversarial scenarios, which has garnered extensive attention from both academia and industry. While numerous efforts have been made to evaluate the safety risks associated with LLMs, there remains a lack of systematic reviews summarizing these research endeavors. This survey aims to provide a comprehensive and systematic overview of recent advancements in LLMs safety evaluation, focusing on several key aspects: (1) "Why evaluate" that explores the background of LLMs safety evaluation, how they differ from general LLMs evaluation, and the significance of such evaluation; (2) "What to evaluate" that examines and categorizes existing safety evaluation tasks based on key capabilities, including dimensions such as toxicity, robustness, ethics, bias and fairness, truthfulness, and so on; (3) "Where to evaluate" that summarizes the evaluation metrics, datasets and benchmarks currently used in safety evaluations; (4) "How to evaluate" that reviews existing evaluation toolkit, and categorizing mainstream evaluation methods based on the roles of the evaluators. Finally, we identify the challenges in LLMs safety evaluation and propose potential research directions to promote further advancement in this field. We emphasize the importance of prioritizing LLMs safety evaluation to ensure the safe deployment of these models in real-world applications.
△ Less
Submitted 6 June, 2025;
originally announced June 2025.
-
CLAIM: Mitigating Multilingual Object Hallucination in Large Vision-Language Models with Cross-Lingual Attention Intervention
Authors:
Zekai Ye,
Qiming Li,
Xiaocheng Feng,
Libo Qin,
Yichong Huang,
Baohang Li,
Kui Jiang,
Yang Xiang,
Zhirui Zhang,
Yunfei Lu,
Duyu Tang,
Dandan Tu,
Bing Qin
Abstract:
Large Vision-Language Models (LVLMs) have demonstrated impressive multimodal abilities but remain prone to multilingual object hallucination, with a higher likelihood of generating responses inconsistent with the visual input when utilizing queries in non-English languages compared to English. Most existing approaches to address these rely on pretraining or fine-tuning, which are resource-intensiv…
▽ More
Large Vision-Language Models (LVLMs) have demonstrated impressive multimodal abilities but remain prone to multilingual object hallucination, with a higher likelihood of generating responses inconsistent with the visual input when utilizing queries in non-English languages compared to English. Most existing approaches to address these rely on pretraining or fine-tuning, which are resource-intensive. In this paper, inspired by observing the disparities in cross-modal attention patterns across languages, we propose Cross-Lingual Attention Intervention for Mitigating multilingual object hallucination (CLAIM) in LVLMs, a novel near training-free method by aligning attention patterns. CLAIM first identifies language-specific cross-modal attention heads, then estimates language shift vectors from English to the target language, and finally intervenes in the attention outputs during inference to facilitate cross-lingual visual perception capability alignment. Extensive experiments demonstrate that CLAIM achieves an average improvement of 13.56% (up to 30% in Spanish) on the POPE and 21.75% on the hallucination subsets of the MME benchmark across various languages. Further analysis reveals that multilingual attention divergence is most prominent in intermediate layers, highlighting their critical role in multilingual scenarios.
△ Less
Submitted 3 June, 2025;
originally announced June 2025.
-
NSW-EPNews: A News-Augmented Benchmark for Electricity Price Forecasting with LLMs
Authors:
Zhaoge Bi,
Linghan Huang,
Haolin Jin,
Qingwen Zeng,
Huaming Chen
Abstract:
Electricity price forecasting is a critical component of modern energy-management systems, yet existing approaches heavily rely on numerical histories and ignore contemporaneous textual signals. We introduce NSW-EPNews, the first benchmark that jointly evaluates time-series models and large language models (LLMs) on real-world electricity-price prediction. The dataset includes over 175,000 half-ho…
▽ More
Electricity price forecasting is a critical component of modern energy-management systems, yet existing approaches heavily rely on numerical histories and ignore contemporaneous textual signals. We introduce NSW-EPNews, the first benchmark that jointly evaluates time-series models and large language models (LLMs) on real-world electricity-price prediction. The dataset includes over 175,000 half-hourly spot prices from New South Wales, Australia (2015-2024), daily temperature readings, and curated market-news summaries from WattClarity. We frame the task as 48-step-ahead forecasting, using multimodal input, including lagged prices, vectorized news and weather features for classical models, and prompt-engineered structured contexts for LLMs. Our datasets yields 3.6k multimodal prompt-output pairs for LLM evaluation using specific templates. Through compresive benchmark design, we identify that for traditional statistical and machine learning models, the benefits gain is marginal from news feature. For state-of-the-art LLMs, such as GPT-4o and Gemini 1.5 Pro, we observe modest performance increase while it also produce frequent hallucinations such as fabricated and malformed price sequences. NSW-EPNews provides a rigorous testbed for evaluating grounded numerical reasoning in multimodal settings, and highlights a critical gap between current LLM capabilities and the demands of high-stakes energy forecasting.
△ Less
Submitted 21 May, 2025;
originally announced June 2025.
-
ChemHGNN: A Hierarchical Hypergraph Neural Network for Reaction Virtual Screening and Discovery
Authors:
Xiaobao Huang,
Yihong Ma,
Anjali Gurajapu,
Jules Schleinitz,
Zhichun Guo,
Sarah E. Reisman,
Nitesh V. Chawla
Abstract:
Reaction virtual screening and discovery are fundamental challenges in chemistry and materials science, where traditional graph neural networks (GNNs) struggle to model multi-reactant interactions. In this work, we propose ChemHGNN, a hypergraph neural network (HGNN) framework that effectively captures high-order relationships in reaction networks. Unlike GNNs, which require constructing complete…
▽ More
Reaction virtual screening and discovery are fundamental challenges in chemistry and materials science, where traditional graph neural networks (GNNs) struggle to model multi-reactant interactions. In this work, we propose ChemHGNN, a hypergraph neural network (HGNN) framework that effectively captures high-order relationships in reaction networks. Unlike GNNs, which require constructing complete graphs for multi-reactant reactions, ChemHGNN naturally models multi-reactant reactions through hyperedges, enabling more expressive reaction representations. To address key challenges, such as combinatorial explosion, model collapse, and chemically invalid negative samples, we introduce a reaction center-aware negative sampling strategy (RCNS) and a hierarchical embedding approach combining molecule, reaction and hypergraph level features. Experiments on the USPTO dataset demonstrate that ChemHGNN significantly outperforms HGNN and GNN baselines, particularly in large-scale settings, while maintaining interpretability and chemical plausibility. Our work establishes HGNNs as a superior alternative to GNNs for reaction virtual screening and discovery, offering a chemically informed framework for accelerating reaction discovery.
△ Less
Submitted 21 May, 2025;
originally announced June 2025.
-
SceneCompleter: Dense 3D Scene Completion for Generative Novel View Synthesis
Authors:
Weiliang Chen,
Jiayi Bi,
Yuanhui Huang,
Wenzhao Zheng,
Yueqi Duan
Abstract:
Generative models have gained significant attention in novel view synthesis (NVS) by alleviating the reliance on dense multi-view captures. However, existing methods typically fall into a conventional paradigm, where generative models first complete missing areas in 2D, followed by 3D recovery techniques to reconstruct the scene, which often results in overly smooth surfaces and distorted geometry…
▽ More
Generative models have gained significant attention in novel view synthesis (NVS) by alleviating the reliance on dense multi-view captures. However, existing methods typically fall into a conventional paradigm, where generative models first complete missing areas in 2D, followed by 3D recovery techniques to reconstruct the scene, which often results in overly smooth surfaces and distorted geometry, as generative models struggle to infer 3D structure solely from RGB data. In this paper, we propose SceneCompleter, a novel framework that achieves 3D-consistent generative novel view synthesis through dense 3D scene completion. SceneCompleter achieves both visual coherence and 3D-consistent generative scene completion through two key components: (1) a geometry-appearance dual-stream diffusion model that jointly synthesizes novel views in RGBD space; (2) a scene embedder that encodes a more holistic scene understanding from the reference image. By effectively fusing structural and textural information, our method demonstrates superior coherence and plausibility in generative novel view synthesis across diverse datasets. Project Page: https://chen-wl20.github.io/SceneCompleter
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
GENMANIP: LLM-driven Simulation for Generalizable Instruction-Following Manipulation
Authors:
Ning Gao,
Yilun Chen,
Shuai Yang,
Xinyi Chen,
Yang Tian,
Hao Li,
Haifeng Huang,
Hanqing Wang,
Tai Wang,
Jiangmiao Pang
Abstract:
Robotic manipulation in real-world settings remains challenging, especially regarding robust generalization. Existing simulation platforms lack sufficient support for exploring how policies adapt to varied instructions and scenarios. Thus, they lag behind the growing interest in instruction-following foundation models like LLMs, whose adaptability is crucial yet remains underexplored in fair compa…
▽ More
Robotic manipulation in real-world settings remains challenging, especially regarding robust generalization. Existing simulation platforms lack sufficient support for exploring how policies adapt to varied instructions and scenarios. Thus, they lag behind the growing interest in instruction-following foundation models like LLMs, whose adaptability is crucial yet remains underexplored in fair comparisons. To bridge this gap, we introduce GenManip, a realistic tabletop simulation platform tailored for policy generalization studies. It features an automatic pipeline via LLM-driven task-oriented scene graph to synthesize large-scale, diverse tasks using 10K annotated 3D object assets. To systematically assess generalization, we present GenManip-Bench, a benchmark of 200 scenarios refined via human-in-the-loop corrections. We evaluate two policy types: (1) modular manipulation systems integrating foundation models for perception, reasoning, and planning, and (2) end-to-end policies trained through scalable data collection. Results show that while data scaling benefits end-to-end methods, modular systems enhanced with foundation models generalize more effectively across diverse scenarios. We anticipate this platform to facilitate critical insights for advancing policy generalization in realistic conditions. Project Page: https://genmanip.axi404.top/.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
Apparent inconsistency between Streda formula and Hall conductivity in reentrant integer quantum anomalous Hall effect in twisted MoTe$_2$
Authors:
Yi Huang,
Seth Musser,
Jihang Zhu,
Yang-Zhi Chou,
Sankar Das Sarma
Abstract:
Recent experiments in twisted bilayer MoTe$_2$ (tMoTe$_2$) have uncovered a rich landscape of correlated phases. In this work, we investigate the reentrant integer quantum anomalous Hall (RIQAH) states reported in F. Xu, et. al., arXiv:2504.06972 which displays a notable mismatch between the Hall conductivity measured via transport and that inferred from the Streda formula. We argue that this disc…
▽ More
Recent experiments in twisted bilayer MoTe$_2$ (tMoTe$_2$) have uncovered a rich landscape of correlated phases. In this work, we investigate the reentrant integer quantum anomalous Hall (RIQAH) states reported in F. Xu, et. al., arXiv:2504.06972 which displays a notable mismatch between the Hall conductivity measured via transport and that inferred from the Streda formula. We argue that this discrepancy can be explained if the RIQAH state is a quantum Hall bubble phase, analogous to similar well-established phenomena in 2D GaAs quantum wells. While this explains the RIQAH state at a filling $ν= -0.63$, the other RIQAH state at $ν= -0.7$ has a smaller slope necessitating a different interpretation. We propose that this discrepancy arises due to a nearby resistive peak masking the true slope. Furthermore, we identify this resistive peak as a signature of a magnetic phase transition near $ν= -0.75$, possibly driven by a Van Hove singularity. The anomalous Hall response and Landau fan evolution across this transition suggest a change in Fermi surface topology and a phase with zero Chern number, potentially corresponding to an inter-valley coherent state. These observations offer new insights into the nature of the RIQAH states and raise the possibility that the nearby superconducting phase may have an inter-valley coherent normal state.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
SpectralAR: Spectral Autoregressive Visual Generation
Authors:
Yuanhui Huang,
Weiliang Chen,
Wenzhao Zheng,
Yueqi Duan,
Jie Zhou,
Jiwen Lu
Abstract:
Autoregressive visual generation has garnered increasing attention due to its scalability and compatibility with other modalities compared with diffusion models. Most existing methods construct visual sequences as spatial patches for autoregressive generation. However, image patches are inherently parallel, contradicting the causal nature of autoregressive modeling. To address this, we propose a S…
▽ More
Autoregressive visual generation has garnered increasing attention due to its scalability and compatibility with other modalities compared with diffusion models. Most existing methods construct visual sequences as spatial patches for autoregressive generation. However, image patches are inherently parallel, contradicting the causal nature of autoregressive modeling. To address this, we propose a Spectral AutoRegressive (SpectralAR) visual generation framework, which realizes causality for visual sequences from the spectral perspective. Specifically, we first transform an image into ordered spectral tokens with Nested Spectral Tokenization, representing lower to higher frequency components. We then perform autoregressive generation in a coarse-to-fine manner with the sequences of spectral tokens. By considering different levels of detail in images, our SpectralAR achieves both sequence causality and token efficiency without bells and whistles. We conduct extensive experiments on ImageNet-1K for image reconstruction and autoregressive generation, and SpectralAR achieves 3.02 gFID with only 64 tokens and 310M parameters. Project page: https://huang-yh.github.io/spectralar/.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
Physics-informed Machine Learning Analysis for Nanoscale Grain Mapping by Synchrotron Laue Microdiffraction
Authors:
Ka Hung Chan,
Xinyue Huang,
Nobumichi Tamura,
Xian Chen
Abstract:
Understanding the grain morphology, orientation distribution, and crystal structure of nanocrystals is essential for optimizing the mechanical and physical properties of functional materials. Synchrotron X-ray Laue microdiffraction is a powerful technique for characterizing crystal structures and orientation mapping using focused X-rays. However, when grain sizes are smaller than the beam size, mi…
▽ More
Understanding the grain morphology, orientation distribution, and crystal structure of nanocrystals is essential for optimizing the mechanical and physical properties of functional materials. Synchrotron X-ray Laue microdiffraction is a powerful technique for characterizing crystal structures and orientation mapping using focused X-rays. However, when grain sizes are smaller than the beam size, mixed peaks in the Laue pattern from neighboring grains limit the resolution of grain morphology mapping. We propose a physics-informed machine learning (PIML) approach that combines a CNN feature extractor with a physics-informed filtering algorithm to overcome the spatial resolution limits of X-rays, achieving nanoscale resolution for grain mapping. Our PIML method successfully resolves the grain size, orientation distribution, and morphology of Au nanocrystals through synchrotron microdiffraction scans, showing good agreement with electron backscatter diffraction results. This PIML-assisted synchrotron microdiffraction analysis can be generalized to other diffraction-based probes, enabling the characterization of nanosized structures with micron-sized probes.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
M4V: Multi-Modal Mamba for Text-to-Video Generation
Authors:
Jiancheng Huang,
Gengwei Zhang,
Zequn Jie,
Siyu Jiao,
Yinlong Qian,
Ling Chen,
Yunchao Wei,
Lin Ma
Abstract:
Text-to-video generation has significantly enriched content creation and holds the potential to evolve into powerful world simulators. However, modeling the vast spatiotemporal space remains computationally demanding, particularly when employing Transformers, which incur quadratic complexity in sequence processing and thus limit practical applications. Recent advancements in linear-time sequence m…
▽ More
Text-to-video generation has significantly enriched content creation and holds the potential to evolve into powerful world simulators. However, modeling the vast spatiotemporal space remains computationally demanding, particularly when employing Transformers, which incur quadratic complexity in sequence processing and thus limit practical applications. Recent advancements in linear-time sequence modeling, particularly the Mamba architecture, offer a more efficient alternative. Nevertheless, its plain design limits its direct applicability to multi-modal and spatiotemporal video generation tasks. To address these challenges, we introduce M4V, a Multi-Modal Mamba framework for text-to-video generation. Specifically, we propose a multi-modal diffusion Mamba (MM-DiM) block that enables seamless integration of multi-modal information and spatiotemporal modeling through a multi-modal token re-composition design. As a result, the Mamba blocks in M4V reduce FLOPs by 45% compared to the attention-based alternative when generating videos at 768$\times$1280 resolution. Additionally, to mitigate the visual quality degradation in long-context autoregressive generation processes, we introduce a reward learning strategy that further enhances per-frame visual realism. Extensive experiments on text-to-video benchmarks demonstrate M4V's ability to produce high-quality videos while significantly lowering computational costs. Code and models will be publicly available at https://huangjch526.github.io/M4V_project.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers
Authors:
Yixiao Huang,
Hanlin Zhu,
Tianyu Guo,
Jiantao Jiao,
Somayeh Sojoudi,
Michael I. Jordan,
Stuart Russell,
Song Mei
Abstract:
Large language models (LLMs) can acquire new knowledge through fine-tuning, but this process exhibits a puzzling duality: models can generalize remarkably from new facts, yet are also prone to hallucinating incorrect information. However, the reasons for this phenomenon remain poorly understood. In this work, we argue that both behaviors stem from a single mechanism known as out-of-context reasoni…
▽ More
Large language models (LLMs) can acquire new knowledge through fine-tuning, but this process exhibits a puzzling duality: models can generalize remarkably from new facts, yet are also prone to hallucinating incorrect information. However, the reasons for this phenomenon remain poorly understood. In this work, we argue that both behaviors stem from a single mechanism known as out-of-context reasoning (OCR): the ability to deduce implications by associating concepts, even those without a causal link. Our experiments across five prominent LLMs confirm that OCR indeed drives both generalization and hallucination, depending on whether the associated concepts are causally related. To build a rigorous theoretical understanding of this phenomenon, we then formalize OCR as a synthetic factual recall task. We empirically show that a one-layer single-head attention-only transformer with factorized output and value matrices can learn to solve this task, while a model with combined weights cannot, highlighting the crucial role of matrix factorization. Our theoretical analysis shows that the OCR capability can be attributed to the implicit bias of gradient descent, which favors solutions that minimize the nuclear norm of the combined output-value matrix. This mathematical structure explains why the model learns to associate facts and implications with high sample efficiency, regardless of whether the correlation is causal or merely spurious. Ultimately, our work provides a theoretical foundation for understanding the OCR phenomenon, offering a new lens for analyzing and mitigating undesirable behaviors from knowledge injection.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
Authors:
Jiashuo Yu,
Yue Wu,
Meng Chu,
Zhifei Ren,
Zizheng Huang,
Pei Chu,
Ruijie Zhang,
Yinan He,
Qirui Li,
Songze Li,
Zhenxiang Li,
Zhongying Tu,
Conghui He,
Yu Qiao,
Yali Wang,
Yi Wang,
Limin Wang
Abstract:
We present VRBench, the first long narrative video benchmark crafted for evaluating large models' multi-step reasoning capabilities, addressing limitations in existing evaluations that overlook temporal reasoning and procedural validity. It comprises 1,010 long videos (with an average duration of 1.6 hours), along with 9,468 human-labeled multi-step question-answering pairs and 30,292 reasoning st…
▽ More
We present VRBench, the first long narrative video benchmark crafted for evaluating large models' multi-step reasoning capabilities, addressing limitations in existing evaluations that overlook temporal reasoning and procedural validity. It comprises 1,010 long videos (with an average duration of 1.6 hours), along with 9,468 human-labeled multi-step question-answering pairs and 30,292 reasoning steps with timestamps. These videos are curated via a multi-stage filtering process including expert inter-rater reviewing to prioritize plot coherence. We develop a human-AI collaborative framework that generates coherent reasoning chains, each requiring multiple temporally grounded steps, spanning seven types (e.g., event attribution, implicit inference). VRBench designs a multi-phase evaluation pipeline that assesses models at both the outcome and process levels. Apart from the MCQs for the final results, we propose a progress-level LLM-guided scoring metric to evaluate the quality of the reasoning chain from multiple dimensions comprehensively. Through extensive evaluations of 12 LLMs and 16 VLMs on VRBench, we undertake a thorough analysis and provide valuable insights that advance the field of multi-step reasoning.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
BNMusic: Blending Environmental Noises into Personalized Music
Authors:
Chi Zuo,
Martin B. Møller,
Pablo MartÃnez-Nuevo,
Huayang Huang,
Yu Wu,
Ye Zhu
Abstract:
While being disturbed by environmental noises, the acoustic masking technique is a conventional way to reduce the annoyance in audio engineering that seeks to cover up the noises with other dominant yet less intrusive sounds. However, misalignment between the dominant sound and the noise-such as mismatched downbeats-often requires an excessive volume increase to achieve effective masking. Motivate…
▽ More
While being disturbed by environmental noises, the acoustic masking technique is a conventional way to reduce the annoyance in audio engineering that seeks to cover up the noises with other dominant yet less intrusive sounds. However, misalignment between the dominant sound and the noise-such as mismatched downbeats-often requires an excessive volume increase to achieve effective masking. Motivated by recent advances in cross-modal generation, in this work, we introduce an alternative method to acoustic masking, aiming to reduce the noticeability of environmental noises by blending them into personalized music generated based on user-provided text prompts. Following the paradigm of music generation using mel-spectrogram representations, we propose a Blending Noises into Personalized Music (BNMusic) framework with two key stages. The first stage synthesizes a complete piece of music in a mel-spectrogram representation that encapsulates the musical essence of the noise. In the second stage, we adaptively amplify the generated music segment to further reduce noise perception and enhance the blending effectiveness, while preserving auditory quality. Our experiments with comprehensive evaluations on MusicBench, EPIC-SOUNDS, and ESC-50 demonstrate the effectiveness of our framework, highlighting the ability to blend environmental noise with rhythmically aligned, adaptively amplified, and enjoyable music segments, minimizing the noticeability of the noise, thereby improving overall acoustic experiences.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
IQE-CLIP: Instance-aware Query Embedding for Zero-/Few-shot Anomaly Detection in Medical Domain
Authors:
Hong Huang,
Weixiang Sun,
Zhijian Wu,
Jingwen Niu,
Donghuan Lu,
Xian Wu,
Yefeng Zheng
Abstract:
Recently, the rapid advancements of vision-language models, such as CLIP, leads to significant progress in zero-/few-shot anomaly detection (ZFSAD) tasks. However, most existing CLIP-based ZFSAD methods commonly assume prior knowledge of categories and rely on carefully crafted prompts tailored to specific scenarios. While such meticulously designed text prompts effectively capture semantic inform…
▽ More
Recently, the rapid advancements of vision-language models, such as CLIP, leads to significant progress in zero-/few-shot anomaly detection (ZFSAD) tasks. However, most existing CLIP-based ZFSAD methods commonly assume prior knowledge of categories and rely on carefully crafted prompts tailored to specific scenarios. While such meticulously designed text prompts effectively capture semantic information in the textual space, they fall short of distinguishing normal and anomalous instances within the joint embedding space. Moreover, these ZFSAD methods are predominantly explored in industrial scenarios, with few efforts conducted to medical tasks. To this end, we propose an innovative framework for ZFSAD tasks in medical domain, denoted as IQE-CLIP. We reveal that query embeddings, which incorporate both textual and instance-aware visual information, are better indicators for abnormalities. Specifically, we first introduce class-based prompting tokens and learnable prompting tokens for better adaptation of CLIP to the medical domain. Then, we design an instance-aware query module (IQM) to extract region-level contextual information from both text prompts and visual features, enabling the generation of query embeddings that are more sensitive to anomalies. Extensive experiments conducted on six medical datasets demonstrate that IQE-CLIP achieves state-of-the-art performance on both zero-shot and few-shot tasks. We release our code and data at https://github.com/hongh0/IQE-CLIP/.
△ Less
Submitted 20 June, 2025; v1 submitted 12 June, 2025;
originally announced June 2025.
-
Observation of High-Order Quantum Pancharatnam-Berry Phase with Structured Photons
Authors:
Shuang-Yin Huang,
He Jiang,
Zhi-Cheng Ren,
Zi-Mo Cheng,
Wen-Zheng Zhu,
Jing Gao,
Chang Liu,
Xi-Lin Wang,
Hui-Tian Wang
Abstract:
When a quantum system evolves so that it returns to its initial state, it will acquire a geometric phase acting as a memory of the transformation of a physical system, which has been experimentally measured in a variety of physical systems. In optics, the most prominent example is the Pancharatnam-Berry (PB) phase. Recent technological advances in phase and polarization structure have led to the d…
▽ More
When a quantum system evolves so that it returns to its initial state, it will acquire a geometric phase acting as a memory of the transformation of a physical system, which has been experimentally measured in a variety of physical systems. In optics, the most prominent example is the Pancharatnam-Berry (PB) phase. Recent technological advances in phase and polarization structure have led to the discovery of high-order PB phases with structured light fields. The study on the high-order PB phase is limited in the context of elementary quantum states of light, especially in the case of photon number states. Here, we experimentally investigate the differences of high-order PB phases between single-photon and N00N states. Our results show that the PB phase, like the dynamic phase, can also be doubled under two-photon states, which can greatly improve the phase sensitivity for greater $N$ in N00N states and high-order structured photons. This may show some implications for quantum precision measurement and quantum state engineering based on geometric phase.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning
Authors:
Xiaoyi Bao,
Jindi Lv,
Xiaofeng Wang,
Zheng Zhu,
Xinze Chen,
YuKun Zhou,
Jiancheng Lv,
Xingang Wang,
Guan Huang
Abstract:
Recent progress in diffusion models has greatly enhanced video generation quality, yet these models still require fine-tuning to improve specific dimensions like instance preservation, motion rationality, composition, and physical plausibility. Existing fine-tuning approaches often rely on human annotations and large-scale computational resources, limiting their practicality. In this work, we prop…
▽ More
Recent progress in diffusion models has greatly enhanced video generation quality, yet these models still require fine-tuning to improve specific dimensions like instance preservation, motion rationality, composition, and physical plausibility. Existing fine-tuning approaches often rely on human annotations and large-scale computational resources, limiting their practicality. In this work, we propose GigaVideo-1, an efficient fine-tuning framework that advances video generation without additional human supervision. Rather than injecting large volumes of high-quality data from external sources, GigaVideo-1 unlocks the latent potential of pre-trained video diffusion models through automatic feedback. Specifically, we focus on two key aspects of the fine-tuning process: data and optimization. To improve fine-tuning data, we design a prompt-driven data engine that constructs diverse, weakness-oriented training samples. On the optimization side, we introduce a reward-guided training strategy, which adaptively weights samples using feedback from pre-trained vision-language models with a realism constraint. We evaluate GigaVideo-1 on the VBench-2.0 benchmark using Wan2.1 as the baseline across 17 evaluation dimensions. Experiments show that GigaVideo-1 consistently improves performance on almost all the dimensions with an average gain of about 4% using only 4 GPU-hours. Requiring no manual annotations and minimal real data, GigaVideo-1 demonstrates both effectiveness and efficiency. Code, model, and data will be publicly available.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
Hierarchical Error Assessment of CAD Models for Aircraft Manufacturing-and-Measurement
Authors:
Jin Huang,
Honghua Chen,
Mingqiang Wei
Abstract:
The most essential feature of aviation equipment is high quality, including high performance, high stability and high reliability. In this paper, we propose a novel hierarchical error assessment framework for aircraft CAD models within a manufacturing-and-measurement platform, termed HEA-MM. HEA-MM employs structured light scanners to obtain comprehensive 3D measurements of manufactured workpieces…
▽ More
The most essential feature of aviation equipment is high quality, including high performance, high stability and high reliability. In this paper, we propose a novel hierarchical error assessment framework for aircraft CAD models within a manufacturing-and-measurement platform, termed HEA-MM. HEA-MM employs structured light scanners to obtain comprehensive 3D measurements of manufactured workpieces. The measured point cloud is registered with the reference CAD model, followed by an error analysis conducted at three hierarchical levels: global, part, and feature. At the global level, the error analysis evaluates the overall deviation of the scanned point cloud from the reference CAD model. At the part level, error analysis is performed on these patches underlying the point clouds. We propose a novel optimization-based primitive refinement method to obtain a set of meaningful patches of point clouds. Two basic operations, splitting and merging, are introduced to refine the coarse primitives. At the feature level, error analysis is performed on circular holes, which are commonly found in CAD models. To facilitate it, a two-stage algorithm is introduced for the detection of circular holes. First, edge points are identified using a tensor-voting algorithm. Then, multiple circles are fitted through a hypothesize-and-clusterize framework, ensuring accurate detection and analysis of the circular features. Experimental results on various aircraft CAD models demonstrate the effectiveness of our proposed method.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
A Taylor Series Approximation Model for Characterizing the Output Resistance of a GFET
Authors:
Xiomara Ribero-Figueroa,
Anibal Pacheco-Sanchez,
Tzu-Jung Huang,
David Jiménez,
Ivan Puchades,
Reydezel Torres-Torres
Abstract:
The mobility-degradation-based model for the drain-to-source or output resistance of a graphene field-effect-transistor is linearized here using a Taylor series approximation. This simplification is shown to be valid from magnitudes of the gate voltage not significantly higher than the Dirac voltage, and it enables the analytical determination of the transconductance parameter, the voltage related…
▽ More
The mobility-degradation-based model for the drain-to-source or output resistance of a graphene field-effect-transistor is linearized here using a Taylor series approximation. This simplification is shown to be valid from magnitudes of the gate voltage not significantly higher than the Dirac voltage, and it enables the analytical determination of the transconductance parameter, the voltage related to residual charges, and a bias-independent series resistance of the GFET. Furthermore, a continuous representation of the device's static response is achieved when substituting the extracted parameters into the model, regardless the transfer characteristic symmetry with respect to the Dirac voltage.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
Transformer IMU Calibrator: Dynamic On-body IMU Calibration for Inertial Motion Capture
Authors:
Chengxu Zuo,
Jiawei Huang,
Xiao Jiang,
Yuan Yao,
Xiangren Shi,
Rui Cao,
Xinyu Yi,
Feng Xu,
Shihui Guo,
Yipeng Qin
Abstract:
In this paper, we propose a novel dynamic calibration method for sparse inertial motion capture systems, which is the first to break the restrictive absolute static assumption in IMU calibration, i.e., the coordinate drift RG'G and measurement offset RBS remain constant during the entire motion, thereby significantly expanding their application scenarios. Specifically, we achieve real-time estimat…
▽ More
In this paper, we propose a novel dynamic calibration method for sparse inertial motion capture systems, which is the first to break the restrictive absolute static assumption in IMU calibration, i.e., the coordinate drift RG'G and measurement offset RBS remain constant during the entire motion, thereby significantly expanding their application scenarios. Specifically, we achieve real-time estimation of RG'G and RBS under two relaxed assumptions: i) the matrices change negligibly in a short time window; ii) the human movements/IMU readings are diverse in such a time window. Intuitively, the first assumption reduces the number of candidate matrices, and the second assumption provides diverse constraints, which greatly reduces the solution space and allows for accurate estimation of RG'G and RBS from a short history of IMU readings in real time. To achieve this, we created synthetic datasets of paired RG'G, RBS matrices and IMU readings, and learned their mappings using a Transformer-based model. We also designed a calibration trigger based on the diversity of IMU readings to ensure that assumption ii) is met before applying our method. To our knowledge, we are the first to achieve implicit IMU calibration (i.e., seamlessly putting IMUs into use without the need for an explicit calibration process), as well as the first to enable long-term and accurate motion capture using sparse IMUs. The code and dataset are available at https://github.com/ZuoCX1996/TIC.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning
Authors:
Yuhao Zhou,
Yiheng Wang,
Xuming He,
Ruoyao Xiao,
Zhiwei Li,
Qiantai Feng,
Zijie Guo,
Yuejin Yang,
Hao Wu,
Wenxuan Huang,
Jiaqi Wei,
Dan Si,
Xiuqi Yao,
Jia Bu,
Haiwen Huang,
Tianfan Fu,
Shixiang Tang,
Ben Fei,
Dongzhan Zhou,
Fenghua Ling,
Yan Lu,
Siqi Sun,
Chenhui Li,
Guanjie Zheng,
Jiancheng Lv
, et al. (2 additional authors not shown)
Abstract:
Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on ev…
▽ More
Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on evaluating the knowledge understanding capabilities of MLLMs, leading to an inadequate assessment of their perception and reasoning abilities. To address this gap, we present the Scientists' First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of MLLMs through three interconnected levels: scientific signal perception, scientific attribute understanding, scientific comparative reasoning. Specifically, SFE comprises 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks across five high-value disciplines. Extensive experiments reveal that current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, highlighting significant room for MLLMs to improve in scientific realms. We hope the insights obtained in SFE will facilitate further developments in AI-enhanced scientific discoveries.
△ Less
Submitted 12 June, 2025; v1 submitted 12 June, 2025;
originally announced June 2025.
-
Macro Graph of Experts for Billion-Scale Multi-Task Recommendation
Authors:
Hongyu Yao,
Zijin Hong,
Hao Chen,
Yuanchen Bei,
Zhiqing Li,
Qijie Shen,
Zuobin Ying,
Huan Gong,
Feiran Huang
Abstract:
Graph-based multi-task learning at billion-scale presents a significant challenge, as different tasks correspond to distinct billion-scale graphs. Traditional multi-task learning methods often neglect these graph structures, relying solely on individual user and item embeddings. However, disregarding graph structures overlooks substantial potential for improving performance. In this paper, we intr…
▽ More
Graph-based multi-task learning at billion-scale presents a significant challenge, as different tasks correspond to distinct billion-scale graphs. Traditional multi-task learning methods often neglect these graph structures, relying solely on individual user and item embeddings. However, disregarding graph structures overlooks substantial potential for improving performance. In this paper, we introduce the Macro Graph of Expert (MGOE) framework, the first approach capable of leveraging macro graph embeddings to capture task-specific macro features while modeling the correlations between task-specific experts. Specifically, we propose the concept of a Macro Graph Bottom, which, for the first time, enables multi-task learning models to incorporate graph information effectively. We design the Macro Prediction Tower to dynamically integrate macro knowledge across tasks. MGOE has been deployed at scale, powering multi-task learning for the homepage of a leading billion-scale recommender system. Extensive offline experiments conducted on three public benchmark datasets demonstrate its superiority over state-of-the-art multi-task learning methods, establishing MGOE as a breakthrough in multi-task graph-based recommendation. Furthermore, online A/B tests confirm the superiority of MGOE in billion-scale recommender systems.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
Uniqueness and dimension for the geodesic of the critical long-range percolation metric
Authors:
Jian Ding,
Zherui Fan,
Lu-Jing Huang
Abstract:
By recent works of Bäumler [2] and of the authors of this paper [5], the (limiting) random metric for the critical long-range percolation was constructed. In this paper, we prove the uniqueness of the geodesic between two fixed points, for which an important ingredient of independent interest is the continuity of the metric distribution. In addition, we establish the Hausdorff dimension of the geo…
▽ More
By recent works of Bäumler [2] and of the authors of this paper [5], the (limiting) random metric for the critical long-range percolation was constructed. In this paper, we prove the uniqueness of the geodesic between two fixed points, for which an important ingredient of independent interest is the continuity of the metric distribution. In addition, we establish the Hausdorff dimension of the geodesics.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
Reliable Reasoning Path: Distilling Effective Guidance for LLM Reasoning with Knowledge Graphs
Authors:
Yilin Xiao,
Chuang Zhou,
Qinggang Zhang,
Bo Li,
Qing Li,
Xiao Huang
Abstract:
Large language models (LLMs) often struggle with knowledge-intensive tasks due to a lack of background knowledge and a tendency to hallucinate. To address these limitations, integrating knowledge graphs (KGs) with LLMs has been intensively studied. Existing KG-enhanced LLMs focus on supplementary factual knowledge, but still struggle with solving complex questions. We argue that refining the relat…
▽ More
Large language models (LLMs) often struggle with knowledge-intensive tasks due to a lack of background knowledge and a tendency to hallucinate. To address these limitations, integrating knowledge graphs (KGs) with LLMs has been intensively studied. Existing KG-enhanced LLMs focus on supplementary factual knowledge, but still struggle with solving complex questions. We argue that refining the relationships among facts and organizing them into a logically consistent reasoning path is equally important as factual knowledge itself. Despite their potential, extracting reliable reasoning paths from KGs poses the following challenges: the complexity of graph structures and the existence of multiple generated paths, making it difficult to distinguish between useful and redundant ones. To tackle these challenges, we propose the RRP framework to mine the knowledge graph, which combines the semantic strengths of LLMs with structural information obtained through relation embedding and bidirectional distribution learning. Additionally, we introduce a rethinking module that evaluates and refines reasoning paths according to their significance. Experimental results on two public datasets show that RRP achieves state-of-the-art performance compared to existing baseline methods. Moreover, RRP can be easily integrated into various LLMs to enhance their reasoning abilities in a plug-and-play manner. By generating high-quality reasoning paths tailored to specific questions, RRP distills effective guidance for LLM reasoning.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
Edit360: 2D Image Edits to 3D Assets from Any Angle
Authors:
Junchao Huang,
Xinting Hu,
Zhuotao Tian,
Shaoshuai Shi,
Li Jiang
Abstract:
Recent advances in diffusion models have significantly improved image generation and editing, but extending these capabilities to 3D assets remains challenging, especially for fine-grained edits that require multi-view consistency. Existing methods typically restrict editing to predetermined viewing angles, severely limiting their flexibility and practical applications. We introduce Edit360, a tun…
▽ More
Recent advances in diffusion models have significantly improved image generation and editing, but extending these capabilities to 3D assets remains challenging, especially for fine-grained edits that require multi-view consistency. Existing methods typically restrict editing to predetermined viewing angles, severely limiting their flexibility and practical applications. We introduce Edit360, a tuning-free framework that extends 2D modifications to multi-view consistent 3D editing. Built upon video diffusion models, Edit360 enables user-specific editing from arbitrary viewpoints while ensuring structural coherence across all views. The framework selects anchor views for 2D modifications and propagates edits across the entire 360-degree range. To achieve this, Edit360 introduces a novel Anchor-View Editing Propagation mechanism, which effectively aligns and merges multi-view information within the latent and attention spaces of diffusion models. The resulting edited multi-view sequences facilitate the reconstruction of high-quality 3D assets, enabling customizable 3D content creation.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
J-DDL: Surface Damage Detection and Localization System for Fighter Aircraft
Authors:
Jin Huang,
Mingqiang Wei,
Zikuan Li,
Hangyu Qu,
Wei Zhao,
Xinyu Bai
Abstract:
Ensuring the safety and extended operational life of fighter aircraft necessitates frequent and exhaustive inspections. While surface defect detection is feasible for human inspectors, manual methods face critical limitations in scalability, efficiency, and consistency due to the vast surface area, structural complexity, and operational demands of aircraft maintenance. We propose a smart surface d…
▽ More
Ensuring the safety and extended operational life of fighter aircraft necessitates frequent and exhaustive inspections. While surface defect detection is feasible for human inspectors, manual methods face critical limitations in scalability, efficiency, and consistency due to the vast surface area, structural complexity, and operational demands of aircraft maintenance. We propose a smart surface damage detection and localization system for fighter aircraft, termed J-DDL. J-DDL integrates 2D images and 3D point clouds of the entire aircraft surface, captured using a combined system of laser scanners and cameras, to achieve precise damage detection and localization. Central to our system is a novel damage detection network built on the YOLO architecture, specifically optimized for identifying surface defects in 2D aircraft images. Key innovations include lightweight Fasternet blocks for efficient feature extraction, an optimized neck architecture incorporating Efficient Multiscale Attention (EMA) modules for superior feature aggregation, and the introduction of a novel loss function, Inner-CIOU, to enhance detection accuracy. After detecting damage in 2D images, the system maps the identified anomalies onto corresponding 3D point clouds, enabling accurate 3D localization of defects across the aircraft surface. Our J-DDL not only streamlines the inspection process but also ensures more comprehensive and detailed coverage of large and complex aircraft exteriors. To facilitate further advancements in this domain, we have developed the first publicly available dataset specifically focused on aircraft damage. Experimental evaluations validate the effectiveness of our framework, underscoring its potential to significantly advance automated aircraft inspection technologies.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
MedSeg-R: Reasoning Segmentation in Medical Images with Multimodal Large Language Models
Authors:
Yu Huang,
Zelin Peng,
Yichen Zhao,
Piao Yang,
Xiaokang Yang,
Wei Shen
Abstract:
Medical image segmentation is crucial for clinical diagnosis, yet existing models are limited by their reliance on explicit human instructions and lack the active reasoning capabilities to understand complex clinical questions. While recent advancements in multimodal large language models (MLLMs) have improved medical question-answering (QA) tasks, most methods struggle to generate precise segment…
▽ More
Medical image segmentation is crucial for clinical diagnosis, yet existing models are limited by their reliance on explicit human instructions and lack the active reasoning capabilities to understand complex clinical questions. While recent advancements in multimodal large language models (MLLMs) have improved medical question-answering (QA) tasks, most methods struggle to generate precise segmentation masks, limiting their application in automatic medical diagnosis. In this paper, we introduce medical image reasoning segmentation, a novel task that aims to generate segmentation masks based on complex and implicit medical instructions. To address this, we propose MedSeg-R, an end-to-end framework that leverages the reasoning abilities of MLLMs to interpret clinical questions while also capable of producing corresponding precise segmentation masks for medical images. It is built on two core components: 1) a global context understanding module that interprets images and comprehends complex medical instructions to generate multi-modal intermediate tokens, and 2) a pixel-level grounding module that decodes these tokens to produce precise segmentation masks and textual responses. Furthermore, we introduce MedSeg-QA, a large-scale dataset tailored for the medical image reasoning segmentation task. It includes over 10,000 image-mask pairs and multi-turn conversations, automatically annotated using large language models and refined through physician reviews. Experiments show MedSeg-R's superior performance across several benchmarks, achieving high segmentation accuracy and enabling interpretable textual analysis of medical images.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
Time To Impeach LLM-as-a-Judge: Programs are the Future of Evaluation
Authors:
Tzu-Heng Huang,
Harit Vishwakarma,
Frederic Sala
Abstract:
Large language models (LLMs) are widely used to evaluate the quality of LLM generations and responses, but this leads to significant challenges: high API costs, uncertain reliability, inflexible pipelines, and inherent biases. To address these, we introduce PAJAMA (Program-As-a-Judge for Automated Model Assessment), a new alternative that uses LLMs to synthesize executable judging programs instead…
▽ More
Large language models (LLMs) are widely used to evaluate the quality of LLM generations and responses, but this leads to significant challenges: high API costs, uncertain reliability, inflexible pipelines, and inherent biases. To address these, we introduce PAJAMA (Program-As-a-Judge for Automated Model Assessment), a new alternative that uses LLMs to synthesize executable judging programs instead of directly scoring responses. These synthesized programs can be stored and run locally, costing orders of magnitude less while providing interpretable, and auditable judging logic that can be easily adapted. Program-based judges mitigate biases, improving judgment consistency by 15.83% and reducing biased responses by 23.7% on average compared to a Qwen2.5-14B-based LLM-as-a-judge. When program judgments are distilled into a model, PAJAMA outperforms LLM-as-a-judge on the challenging CHAT-HARD subset of RewardBench, outperforming metrics by 2.19% on Prometheus and 8.67% on the JudgeLM dataset, all at three orders of magnitude lower cost.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation
Authors:
Zhiyang Xu,
Jiuhai Chen,
Zhaojiang Lin,
Xichen Pan,
Lifu Huang,
Tianyi Zhou,
Madian Khabsa,
Qifan Wang,
Di Jin,
Michihiro Yasunaga,
Lili Yu,
Xi Victoria Lin,
Shaoliang Nie
Abstract:
Recent advances in large language models (LLMs) have enabled multimodal foundation models to tackle both image understanding and generation within a unified framework. Despite these gains, unified models often underperform compared to specialized models in either task. A key challenge in developing unified models lies in the inherent differences between the visual features needed for image underst…
▽ More
Recent advances in large language models (LLMs) have enabled multimodal foundation models to tackle both image understanding and generation within a unified framework. Despite these gains, unified models often underperform compared to specialized models in either task. A key challenge in developing unified models lies in the inherent differences between the visual features needed for image understanding versus generation, as well as the distinct training processes required for each modality. In this work, we introduce Pisces, an auto-regressive multimodal foundation model that addresses this challenge through a novel decoupled visual encoding architecture and tailored training techniques optimized for multimodal generation. Combined with meticulous data curation, pretraining, and finetuning, Pisces achieves competitive performance in both image understanding and image generation. We evaluate Pisces on over 20 public benchmarks for image understanding, where it demonstrates strong performance across a wide range of tasks. Additionally, on GenEval, a widely adopted benchmark for image generation, Pisces exhibits robust generative capabilities. Our extensive analysis reveals the synergistic relationship between image understanding and generation, and the benefits of using separate visual encoders, advancing the field of unified multimodal models.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
Can We Infer Confidential Properties of Training Data from LLMs?
Authors:
Pengrun Huang,
Chhavi Yadav,
Ruihan Wu,
Kamalika Chaudhuri
Abstract:
Large language models (LLMs) are increasingly fine-tuned on domain-specific datasets to support applications in fields such as healthcare, finance, and law. These fine-tuning datasets often have sensitive and confidential dataset-level properties -- such as patient demographics or disease prevalence -- that are not intended to be revealed. While prior work has studied property inference attacks on…
▽ More
Large language models (LLMs) are increasingly fine-tuned on domain-specific datasets to support applications in fields such as healthcare, finance, and law. These fine-tuning datasets often have sensitive and confidential dataset-level properties -- such as patient demographics or disease prevalence -- that are not intended to be revealed. While prior work has studied property inference attacks on discriminative models (e.g., image classification models) and generative models (e.g., GANs for image data), it remains unclear if such attacks transfer to LLMs. In this work, we introduce PropInfer, a benchmark task for evaluating property inference in LLMs under two fine-tuning paradigms: question-answering and chat-completion. Built on the ChatDoctor dataset, our benchmark includes a range of property types and task configurations. We further propose two tailored attacks: a prompt-based generation attack and a shadow-model attack leveraging word frequency signals. Empirical evaluations across multiple pretrained LLMs show the success of our attacks, revealing a previously unrecognized vulnerability in LLMs.
△ Less
Submitted 15 June, 2025; v1 submitted 12 June, 2025;
originally announced June 2025.
-
Relaxation-Free Min-k-Partition for PCI Assignment in 5G Networks
Authors:
Yeqing Qiu,
Chengpiao Huang,
Ye Xue,
Zhipeng Jiang,
Qingjiang Shi,
Dong Zhang,
Zhi-Quan Luo
Abstract:
Physical Cell Identity (PCI) is a critical parameter in 5G networks. Efficient and accurate PCI assignment is essential for mitigating mod-3 interference, mod-30 interference, collisions, and confusions among cells, which directly affect network reliability and user experience. In this paper, we propose a novel framework for PCI assignment by decomposing the problem into Min-3-Partition, Min-10-Pa…
▽ More
Physical Cell Identity (PCI) is a critical parameter in 5G networks. Efficient and accurate PCI assignment is essential for mitigating mod-3 interference, mod-30 interference, collisions, and confusions among cells, which directly affect network reliability and user experience. In this paper, we propose a novel framework for PCI assignment by decomposing the problem into Min-3-Partition, Min-10-Partition, and a graph coloring problem, leveraging the Chinese Remainder Theorem (CRT). Furthermore, we develop a relaxation-free approach to the general Min-k-Partition problem by reformulating it as a quadratic program with a norm-equality constraint and solving it using a penalized mirror descent (PMD) algorithm. The proposed method demonstrates superior computational efficiency and scalability, significantly reducing interference while eliminating collisions and confusions in large-scale 5G networks. Numerical evaluations on real-world datasets show that our approach reduces computational time by up to 20 times compared to state-of-the-art methods, making it highly practical for real-time PCI optimization in large-scale networks. These results highlight the potential of our method to improve network performance and reduce deployment costs in modern 5G systems.
△ Less
Submitted 13 June, 2025; v1 submitted 12 June, 2025;
originally announced June 2025.
-
Motion-R1: Chain-of-Thought Reasoning and Reinforcement Learning for Human Motion Generation
Authors:
Runqi Ouyang,
Haoyun Li,
Zhenyuan Zhang,
Xiaofeng Wang,
Zheng Zhu,
Guan Huang,
Xingang Wang
Abstract:
Recent advances in large language models, especially in natural language understanding and reasoning, have opened new possibilities for text-to-motion generation. Although existing approaches have made notable progress in semantic alignment and motion synthesis, they often rely on end-to-end mapping strategies that fail to capture deep linguistic structures and logical reasoning. Consequently, gen…
▽ More
Recent advances in large language models, especially in natural language understanding and reasoning, have opened new possibilities for text-to-motion generation. Although existing approaches have made notable progress in semantic alignment and motion synthesis, they often rely on end-to-end mapping strategies that fail to capture deep linguistic structures and logical reasoning. Consequently, generated motions tend to lack controllability, consistency, and diversity. To address these limitations, we propose Motion-R1, a unified motion-language modeling framework that integrates a Chain-of-Thought mechanism. By explicitly decomposing complex textual instructions into logically structured action paths, Motion-R1 provides high-level semantic guidance for motion generation, significantly enhancing the model's ability to interpret and execute multi-step, long-horizon, and compositionally rich commands. To train our model, we adopt Group Relative Policy Optimization, a reinforcement learning algorithm designed for large models, which leverages motion quality feedback to optimize reasoning chains and motion synthesis jointly. Extensive experiments across multiple benchmark datasets demonstrate that Motion-R1 achieves competitive or superior performance compared to state-of-the-art methods, particularly in scenarios requiring nuanced semantic understanding and long-term temporal coherence. The code, model and data will be publicly available.
△ Less
Submitted 16 June, 2025; v1 submitted 12 June, 2025;
originally announced June 2025.
-
PointGS: Point Attention-Aware Sparse View Synthesis with Gaussian Splatting
Authors:
Lintao Xiang,
Hongpei Zheng,
Yating Huang,
Qijun Yang,
Hujun Yin
Abstract:
3D Gaussian splatting (3DGS) is an innovative rendering technique that surpasses the neural radiance field (NeRF) in both rendering speed and visual quality by leveraging an explicit 3D scene representation. Existing 3DGS approaches require a large number of calibrated views to generate a consistent and complete scene representation. When input views are limited, 3DGS tends to overfit the training…
▽ More
3D Gaussian splatting (3DGS) is an innovative rendering technique that surpasses the neural radiance field (NeRF) in both rendering speed and visual quality by leveraging an explicit 3D scene representation. Existing 3DGS approaches require a large number of calibrated views to generate a consistent and complete scene representation. When input views are limited, 3DGS tends to overfit the training views, leading to noticeable degradation in rendering quality. To address this limitation, we propose a Point-wise Feature-Aware Gaussian Splatting framework that enables real-time, high-quality rendering from sparse training views. Specifically, we first employ the latest stereo foundation model to estimate accurate camera poses and reconstruct a dense point cloud for Gaussian initialization. We then encode the colour attributes of each 3D Gaussian by sampling and aggregating multiscale 2D appearance features from sparse inputs. To enhance point-wise appearance representation, we design a point interaction network based on a self-attention mechanism, allowing each Gaussian point to interact with its nearest neighbors. These enriched features are subsequently decoded into Gaussian parameters through two lightweight multi-layer perceptrons (MLPs) for final rendering. Extensive experiments on diverse benchmarks demonstrate that our method significantly outperforms NeRF-based approaches and achieves competitive performance under few-shot settings compared to the state-of-the-art 3DGS methods.
△ Less
Submitted 12 June, 2025;
originally announced June 2025.
-
Search for sub-GeV invisible particles in inclusive decays of $J/ψ$ to $φ$
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
X. C. Ai,
R. Aliberti,
A. Amoroso,
Q. An,
Y. Bai,
O. Bakina,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere,
A. Brueggemann,
H. Cai
, et al. (704 additional authors not shown)
Abstract:
A search for an invisible particle, $X$, with a mass between 0 and 0.96 $\textrm{GeV}/\textit{c}^{2}$, is performed in the process $J/ψ\rightarrowφ+ X$ using $(8774.0\pm39.4)\times10^{6}$ $J/ψ$ events collected with the BESIII detector from 2017 to 2019. The $φ$ meson is fully reconstructed and an efficient veto of photons, neutral and charged hadrons up to twice the $K_L^0$ mass is applied to the…
▽ More
A search for an invisible particle, $X$, with a mass between 0 and 0.96 $\textrm{GeV}/\textit{c}^{2}$, is performed in the process $J/ψ\rightarrowφ+ X$ using $(8774.0\pm39.4)\times10^{6}$ $J/ψ$ events collected with the BESIII detector from 2017 to 2019. The $φ$ meson is fully reconstructed and an efficient veto of photons, neutral and charged hadrons up to twice the $K_L^0$ mass is applied to the rest of the events, and the recoil mass against the $φ$ is obtained precisely from the kinematic constraint in the event. No significant signal is observed in the investigated region and the upper limit on the inclusive branching fraction of $J/ψ\rightarrowφ+ X$ is determined to be $7.5\times10^{-8}$ at 90% confidence level. Upper limits at a 90% confidence level are also given for this branching fraction as a function of the invisible particle mass, varying from $9\times10^{-9}$ to $4\times10^{-8}$ over the investigated mass range. Additionally, a 90% confidence level upper limit on the branching fraction of $η\rightarrow \rm{invisible}$ is determined to $2.6\times10^{-5}$, which improves the previous best results by more than four times. The analysis technique in this work offers a clean window to search for sub-GeV invisible particles, which can be adapted for other $J/ψ$ decays and direct $e^+e^-$ annihilation experiments in future studies, and improve the sensitivity by orders of magnitude.
△ Less
Submitted 11 June, 2025;
originally announced June 2025.
-
PyLO: Towards Accessible Learned Optimizers in PyTorch
Authors:
Paul Janson,
Benjamin Therien,
Quentin Anthony,
Xiaolong Huang,
Abhinav Moudgil,
Eugene Belilovsky
Abstract:
Learned optimizers have been an active research topic over the past decade, with increasing progress toward practical, general-purpose optimizers that can serve as drop-in replacements for widely used methods like Adam. However, recent advances -- such as VeLO, which was meta-trained for 4000 TPU-months -- remain largely inaccessible to the broader community, in part due to their reliance on JAX a…
▽ More
Learned optimizers have been an active research topic over the past decade, with increasing progress toward practical, general-purpose optimizers that can serve as drop-in replacements for widely used methods like Adam. However, recent advances -- such as VeLO, which was meta-trained for 4000 TPU-months -- remain largely inaccessible to the broader community, in part due to their reliance on JAX and the absence of user-friendly packages for applying the optimizers after meta-training. To address this gap, we introduce PyLO, a PyTorch-based library that brings learned optimizers to the broader machine learning community through familiar, widely adopted workflows. Unlike prior work focused on synthetic or convex tasks, our emphasis is on applying learned optimization to real-world large-scale pre-training tasks. Our release includes a CUDA-accelerated version of the small_fc_lopt learned optimizer architecture from (Metz et al., 2022a), delivering substantial speedups -- from 39.36 to 205.59 samples/sec throughput for training ViT B/16 with batch size 32. PyLO also allows us to easily combine learned optimizers with existing optimization tools such as learning rate schedules and weight decay. When doing so, we find that learned optimizers can substantially benefit. Our code is available at https://github.com/Belilovsky-Lab/pylo
△ Less
Submitted 11 June, 2025;
originally announced June 2025.
-
Coupled Lindblad pseudomode theory for simulating open quantum systems
Authors:
Zhen Huang,
Gunhee Park,
Garnet Kin-Lic Chan,
Lin Lin
Abstract:
Coupled Lindblad pseudomode theory is a promising approach for simulating non-Markovian quantum dynamics on both classical and quantum platforms, with dynamics that can be realized as a quantum channel. We provide theoretical evidence that the number of coupled pseudomodes only needs to scale as $\mathrm{polylog}(T/\varepsilon)$ in the simulation time $T$ and precision $\varepsilon$. Inspired by t…
▽ More
Coupled Lindblad pseudomode theory is a promising approach for simulating non-Markovian quantum dynamics on both classical and quantum platforms, with dynamics that can be realized as a quantum channel. We provide theoretical evidence that the number of coupled pseudomodes only needs to scale as $\mathrm{polylog}(T/\varepsilon)$ in the simulation time $T$ and precision $\varepsilon$. Inspired by the realization problem in control theory, we also develop a robust numerical algorithm for constructing the coupled modes that avoids the non-convex optimization required by existing approaches. We demonstrate the effectiveness of our method by computing population dynamics and absorption spectra for the spin-boson model. This work provides a significant theoretical and computational improvement to the coupled Lindblad framework, which impacts a broad range of applications from classical simulations of quantum impurity problems to quantum simulations on near-term quantum platforms.
△ Less
Submitted 11 June, 2025;
originally announced June 2025.
-
Learning-Based Stable Optimal Control for Infinite-Time Nonlinear Regulation Problems
Authors:
Han Wang,
Di Wu,
Lin Cheng,
Shengping Gong,
Xu Huang
Abstract:
Infinite-time nonlinear optimal regulation control is widely utilized in aerospace engineering as a systematic method for synthesizing stable controllers. However, conventional methods often rely on linearization hypothesis, while recent learning-based approaches rarely consider stability guarantees. This paper proposes a learning-based framework to learn a stable optimal controller for nonlinear…
▽ More
Infinite-time nonlinear optimal regulation control is widely utilized in aerospace engineering as a systematic method for synthesizing stable controllers. However, conventional methods often rely on linearization hypothesis, while recent learning-based approaches rarely consider stability guarantees. This paper proposes a learning-based framework to learn a stable optimal controller for nonlinear optimal regulation problems. First, leveraging the equivalence between Pontryagin Maximum Principle (PMP) and Hamilton-Jacobi-Bellman (HJB) equation, we improve the backward generation of optimal examples (BGOE) method for infinite-time optimal regulation problems. A state-transition-matrix-guided data generation method is then proposed to efficiently generate a complete dataset that covers the desired state space. Finally, we incorporate the Lyapunov stability condition into the learning framework, ensuring the stability of the learned optimal policy by jointly learning the optimal value function and control policy. Simulations on three nonlinear optimal regulation problems show that the learned optimal policy achieves near-optimal regulation control and the code is provided at https://github.com/wong-han/PaperNORC
△ Less
Submitted 11 June, 2025;
originally announced June 2025.
-
WGSR-Bench: Wargame-based Game-theoretic Strategic Reasoning Benchmark for Large Language Models
Authors:
Qiyue Yin,
Pei Xu,
Qiaozhe Li,
Shengda Liu,
Shengqi Shen,
Tong Wang,
Yihong Han,
Xiaonan Zhao,
Likun Yang,
Shiyue Cao,
Shiyu Qiu,
Yuxuan Liu,
Shizhao Yu,
Lei Cui,
Chengxin Yan,
Jie Sun,
Xiangquan Tang,
Kaiqi Huang
Abstract:
Recent breakthroughs in Large Language Models (LLMs) have led to a qualitative leap in artificial intelligence' s performance on reasoning tasks, particularly demonstrating remarkable capabilities in mathematical, symbolic, and commonsense reasoning. However, as a critical component of advanced human cognition, strategic reasoning, i.e., the ability to assess multi-agent behaviors in dynamic envir…
▽ More
Recent breakthroughs in Large Language Models (LLMs) have led to a qualitative leap in artificial intelligence' s performance on reasoning tasks, particularly demonstrating remarkable capabilities in mathematical, symbolic, and commonsense reasoning. However, as a critical component of advanced human cognition, strategic reasoning, i.e., the ability to assess multi-agent behaviors in dynamic environments, formulate action plans, and adapt strategies, has yet to be systematically evaluated or modeled. To address this gap, this paper introduces WGSR-Bench, the first strategy reasoning benchmark for LLMs using wargame as its evaluation environment. Wargame, a quintessential high-complexity strategic scenario, integrates environmental uncertainty, adversarial dynamics, and non-unique strategic choices, making it an effective testbed for assessing LLMs' capabilities in multi-agent decision-making, intent inference, and counterfactual reasoning. WGSR-Bench designs test samples around three core tasks, i.e., Environmental situation awareness, Opponent risk modeling and Policy generation, which serve as the core S-POE architecture, to systematically assess main abilities of strategic reasoning. Finally, an LLM-based wargame agent is designed to integrate these parts for a comprehensive strategy reasoning assessment. With WGSR-Bench, we hope to assess the strengths and limitations of state-of-the-art LLMs in game-theoretic strategic reasoning and to advance research in large model-driven strategic intelligence.
△ Less
Submitted 11 June, 2025;
originally announced June 2025.
-
Constructive interference at the edge of quantum ergodic dynamics
Authors:
Dmitry A. Abanin,
Rajeev Acharya,
Laleh Aghababaie-Beni,
Georg Aigeldinger,
Ashok Ajoy,
Ross Alcaraz,
Igor Aleiner,
Trond I. Andersen,
Markus Ansmann,
Frank Arute,
Kunal Arya,
Abraham Asfaw,
Nikita Astrakhantsev,
Juan Atalaya,
Ryan Babbush,
Dave Bacon,
Brian Ballard,
Joseph C. Bardin,
Christian Bengs,
Andreas Bengtsson,
Alexander Bilmes,
Sergio Boixo,
Gina Bortoli,
Alexandre Bourassa,
Jenna Bovaird
, et al. (240 additional authors not shown)
Abstract:
Quantum observables in the form of few-point correlators are the key to characterizing the dynamics of quantum many-body systems. In dynamics with fast entanglement generation, quantum observables generally become insensitive to the details of the underlying dynamics at long times due to the effects of scrambling. In experimental systems, repeated time-reversal protocols have been successfully imp…
▽ More
Quantum observables in the form of few-point correlators are the key to characterizing the dynamics of quantum many-body systems. In dynamics with fast entanglement generation, quantum observables generally become insensitive to the details of the underlying dynamics at long times due to the effects of scrambling. In experimental systems, repeated time-reversal protocols have been successfully implemented to restore sensitivities of quantum observables. Using a 103-qubit superconducting quantum processor, we characterize ergodic dynamics using the second-order out-of-time-order correlators, OTOC$^{(2)}$. In contrast to dynamics without time reversal, OTOC$^{(2)}$ are observed to remain sensitive to the underlying dynamics at long time scales. Furthermore, by inserting Pauli operators during quantum evolution and randomizing the phases of Pauli strings in the Heisenberg picture, we observe substantial changes in OTOC$^{(2)}$ values. This indicates that OTOC$^{(2)}$ is dominated by constructive interference between Pauli strings that form large loops in configuration space. The observed interference mechanism endows OTOC$^{(2)}$ with a high degree of classical simulation complexity, which culminates in a set of large-scale OTOC$^{(2)}$ measurements exceeding the simulation capacity of known classical algorithms. Further supported by an example of Hamiltonian learning through OTOC$^{(2)}$, our results indicate a viable path to practical quantum advantage.
△ Less
Submitted 11 June, 2025;
originally announced June 2025.
-
Rethinking Brain Tumor Segmentation from the Frequency Domain Perspective
Authors:
Minye Shao,
Zeyu Wang,
Haoran Duan,
Yawen Huang,
Bing Zhai,
Shizheng Wang,
Yang Long,
Yefeng Zheng
Abstract:
Precise segmentation of brain tumors, particularly contrast-enhancing regions visible in post-contrast MRI (areas highlighted by contrast agent injection), is crucial for accurate clinical diagnosis and treatment planning but remains challenging. However, current methods exhibit notable performance degradation in segmenting these enhancing brain tumor areas, largely due to insufficient considerati…
▽ More
Precise segmentation of brain tumors, particularly contrast-enhancing regions visible in post-contrast MRI (areas highlighted by contrast agent injection), is crucial for accurate clinical diagnosis and treatment planning but remains challenging. However, current methods exhibit notable performance degradation in segmenting these enhancing brain tumor areas, largely due to insufficient consideration of MRI-specific tumor features such as complex textures and directional variations. To address this, we propose the Harmonized Frequency Fusion Network (HFF-Net), which rethinks brain tumor segmentation from a frequency-domain perspective. To comprehensively characterize tumor regions, we develop a Frequency Domain Decomposition (FDD) module that separates MRI images into low-frequency components, capturing smooth tumor contours and high-frequency components, highlighting detailed textures and directional edges. To further enhance sensitivity to tumor boundaries, we introduce an Adaptive Laplacian Convolution (ALC) module that adaptively emphasizes critical high-frequency details using dynamically updated convolution kernels. To effectively fuse tumor features across multiple scales, we design a Frequency Domain Cross-Attention (FDCA) integrating semantic, positional, and slice-specific information. We further validate and interpret frequency-domain improvements through visualization, theoretical reasoning, and experimental analyses. Extensive experiments on four public datasets demonstrate that HFF-Net achieves an average relative improvement of 4.48\% (ranging from 2.39\% to 7.72\%) in the mean Dice scores across the three major subregions, and an average relative improvement of 7.33% (ranging from 5.96% to 8.64%) in the segmentation of contrast-enhancing tumor regions, while maintaining favorable computational efficiency and clinical applicability. Code: https://github.com/VinyehShaw/HFF.
△ Less
Submitted 11 June, 2025;
originally announced June 2025.
-
ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs
Authors:
Xiyao Wang,
Zhengyuan Yang,
Chao Feng,
Yongyuan Liang,
Yuhang Zhou,
Xiaoyu Liu,
Ziyi Zang,
Ming Li,
Chung-Ching Lin,
Kevin Lin,
Linjie Li,
Furong Huang,
Lijuan Wang
Abstract:
Reinforcement learning (RL) has shown great effectiveness for fine-tuning large language models (LLMs) using tasks that are challenging yet easily verifiable, such as math reasoning or code generation. However, extending this success to visual perception in vision-language models (VLMs) has been impeded by the scarcity of vision-centric tasks that are simultaneously challenging and unambiguously v…
▽ More
Reinforcement learning (RL) has shown great effectiveness for fine-tuning large language models (LLMs) using tasks that are challenging yet easily verifiable, such as math reasoning or code generation. However, extending this success to visual perception in vision-language models (VLMs) has been impeded by the scarcity of vision-centric tasks that are simultaneously challenging and unambiguously verifiable. To this end, we introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Starting from a 200-word captions, we inject a single, subtle visual description error-altering a few words on objects, attributes, counts, or spatial relations-and task the model to pinpoint the corrupted span given the image and the modified caption. This formulation preserves the full perceptual difficulty while providing a binary, exact-match reward that is easy to compute and unambiguous. Models trained with the ViCrit Task exhibit substantial gains across a variety of VL benchmarks. Crucially, the improvements transfer beyond natural-image training data to abstract image reasoning and visual math, showing promises of learning to perceive rather than barely memorizing seen objects. To facilitate evaluation, we further introduce ViCrit-Bench, a category-balanced diagnostic benchmark that systematically probes perception errors across diverse image domains and error types. Together, our results demonstrate that fine-grained hallucination criticism is an effective and generalizable objective for enhancing visual perception in VLMs.
△ Less
Submitted 11 June, 2025;
originally announced June 2025.
-
GPU Acceleration of SQL Analytics on Compressed Data
Authors:
Zezhou Huang,
Krystian Sakowski,
Hans Lehnert,
Wei Cui,
Carlo Curino,
Matteo Interlandi,
Marius Dumitru,
Rathijit Sen
Abstract:
GPUs are uniquely suited to accelerate (SQL) analytics workloads thanks to their massive compute parallelism and High Bandwidth Memory (HBM) -- when datasets fit in the GPU HBM, performance is unparalleled. Unfortunately, GPU HBMs remain typically small when compared with lower-bandwidth CPU main memory. Besides brute-force scaling across many GPUs, current solutions to accelerate queries on large…
▽ More
GPUs are uniquely suited to accelerate (SQL) analytics workloads thanks to their massive compute parallelism and High Bandwidth Memory (HBM) -- when datasets fit in the GPU HBM, performance is unparalleled. Unfortunately, GPU HBMs remain typically small when compared with lower-bandwidth CPU main memory. Besides brute-force scaling across many GPUs, current solutions to accelerate queries on large datasets include leveraging data partitioning and loading smaller data batches in GPU HBM, and hybrid execution with a connected device (e.g., CPUs). Unfortunately, these approaches are exposed to the limitations of lower main memory and host-to-device interconnect bandwidths, introduce additional I/O overheads, or incur higher costs. This is a substantial problem when trying to scale adoption of GPUs on larger datasets. Data compression can alleviate this bottleneck, but to avoid paying for costly decompression/decoding, an ideal solution must include computation primitives to operate directly on data in compressed form.
This is the focus of our paper: a set of new methods for running queries directly on light-weight compressed data using schemes such as Run-Length Encoding (RLE), index encoding, bit-width reductions, and dictionary encoding. Our novelty includes operating on multiple RLE columns without decompression, handling heterogeneous column encodings, and leveraging PyTorch tensor operations for portability across devices. Experimental evaluations show speedups of an order of magnitude compared to state-of-the-art commercial CPU-only analytics systems, for real-world queries on a production dataset that would not fit into GPU memory uncompressed. This work paves the road for GPU adoption in a much broader set of use cases, and it is complementary to most other scale-out or fallback mechanisms.
△ Less
Submitted 11 June, 2025;
originally announced June 2025.
-
LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning
Authors:
Chenjian Gao,
Lihe Ding,
Xin Cai,
Zhanpeng Huang,
Zibin Wang,
Tianfan Xue
Abstract:
Video editing using diffusion models has achieved remarkable results in generating high-quality edits for videos. However, current methods often rely on large-scale pretraining, limiting flexibility for specific edits. First-frame-guided editing provides control over the first frame, but lacks flexibility over subsequent frames. To address this, we propose a mask-based LoRA (Low-Rank Adaptation) t…
▽ More
Video editing using diffusion models has achieved remarkable results in generating high-quality edits for videos. However, current methods often rely on large-scale pretraining, limiting flexibility for specific edits. First-frame-guided editing provides control over the first frame, but lacks flexibility over subsequent frames. To address this, we propose a mask-based LoRA (Low-Rank Adaptation) tuning method that adapts pretrained Image-to-Video (I2V) models for flexible video editing. Our approach preserves background regions while enabling controllable edits propagation. This solution offers efficient and adaptable video editing without altering the model architecture. To better steer this process, we incorporate additional references, such as alternate viewpoints or representative scene states, which serve as visual anchors for how content should unfold. We address the control challenge using a mask-driven LoRA tuning strategy that adapts a pre-trained image-to-video model to the editing context. The model must learn from two distinct sources: the input video provides spatial structure and motion cues, while reference images offer appearance guidance. A spatial mask enables region-specific learning by dynamically modulating what the model attends to, ensuring that each area draws from the appropriate source. Experimental results show our method achieves superior video editing performance compared to state-of-the-art methods. Project Page: https://cjeen.github.io/LoraEditPaper
△ Less
Submitted 19 June, 2025; v1 submitted 11 June, 2025;
originally announced June 2025.
-
Learning-based density-equalizing map
Authors:
Yanwen Huang,
Lok Ming Lui,
Gary P. T. Choi
Abstract:
Density-equalizing map (DEM) serves as a powerful technique for creating shape deformations with the area changes reflecting an underlying density function. In recent decades, DEM has found widespread applications in fields such as data visualization, geometry processing, and medical imaging. Traditional approaches to DEM primarily rely on iterative numerical solvers for diffusion equations or opt…
▽ More
Density-equalizing map (DEM) serves as a powerful technique for creating shape deformations with the area changes reflecting an underlying density function. In recent decades, DEM has found widespread applications in fields such as data visualization, geometry processing, and medical imaging. Traditional approaches to DEM primarily rely on iterative numerical solvers for diffusion equations or optimization-based methods that minimize handcrafted energy functionals. However, these conventional techniques often face several challenges: they may suffer from limited accuracy, produce overlapping artifacts in extreme cases, and require substantial algorithmic redesign when extended from 2D to 3D, due to the derivative-dependent nature of their energy formulations. In this work, we propose a novel learning-based density-equalizing mapping framework (LDEM) using deep neural networks. Specifically, we introduce a loss function that enforces density uniformity and geometric regularity, and utilize a hierarchical approach to predict the transformations at both the coarse and dense levels. Our method demonstrates superior density-equalizing and bijectivity properties compared to prior methods for a wide range of simple and complex density distributions, and can be easily applied to surface remeshing with different effects. Also, it generalizes seamlessly from 2D to 3D domains without structural changes to the model architecture or loss formulation. Altogether, our work opens up new possibilities for scalable and robust computation of density-equalizing maps for practical applications.
△ Less
Submitted 9 June, 2025;
originally announced June 2025.
-
A Survey of Automatic Evaluation Methods on Text, Visual and Speech Generations
Authors:
Tian Lan,
Yang-Hao Zhou,
Zi-Ao Ma,
Fanshu Sun,
Rui-Qing Sun,
Junyu Luo,
Rong-Cheng Tu,
Heyan Huang,
Chen Xu,
Zhijing Wu,
Xian-Ling Mao
Abstract:
Recent advances in deep learning have significantly enhanced generative AI capabilities across text, images, and audio. However, automatically evaluating the quality of these generated outputs presents ongoing challenges. Although numerous automatic evaluation methods exist, current research lacks a systematic framework that comprehensively organizes these methods across text, visual, and audio mo…
▽ More
Recent advances in deep learning have significantly enhanced generative AI capabilities across text, images, and audio. However, automatically evaluating the quality of these generated outputs presents ongoing challenges. Although numerous automatic evaluation methods exist, current research lacks a systematic framework that comprehensively organizes these methods across text, visual, and audio modalities. To address this issue, we present a comprehensive review and a unified taxonomy of automatic evaluation methods for generated content across all three modalities; We identify five fundamental paradigms that characterize existing evaluation approaches across these domains. Our analysis begins by examining evaluation methods for text generation, where techniques are most mature. We then extend this framework to image and audio generation, demonstrating its broad applicability. Finally, we discuss promising directions for future research in cross-modal evaluation methodologies.
△ Less
Submitted 6 June, 2025;
originally announced June 2025.
-
eFlesh: Highly customizable Magnetic Touch Sensing using Cut-Cell Microstructures
Authors:
Venkatesh Pattabiraman,
Zizhou Huang,
Daniele Panozzo,
Denis Zorin,
Lerrel Pinto,
Raunaq Bhirangi
Abstract:
If human experience is any guide, operating effectively in unstructured environments -- like homes and offices -- requires robots to sense the forces during physical interaction. Yet, the lack of a versatile, accessible, and easily customizable tactile sensor has led to fragmented, sensor-specific solutions in robotic manipulation -- and in many cases, to force-unaware, sensorless approaches. With…
▽ More
If human experience is any guide, operating effectively in unstructured environments -- like homes and offices -- requires robots to sense the forces during physical interaction. Yet, the lack of a versatile, accessible, and easily customizable tactile sensor has led to fragmented, sensor-specific solutions in robotic manipulation -- and in many cases, to force-unaware, sensorless approaches. With eFlesh, we bridge this gap by introducing a magnetic tactile sensor that is low-cost, easy to fabricate, and highly customizable. Building an eFlesh sensor requires only four components: a hobbyist 3D printer, off-the-shelf magnets (<$5), a CAD model of the desired shape, and a magnetometer circuit board. The sensor is constructed from tiled, parameterized microstructures, which allow for tuning the sensor's geometry and its mechanical response. We provide an open-source design tool that converts convex OBJ/STL files into 3D-printable STLs for fabrication. This modular design framework enables users to create application-specific sensors, and to adjust sensitivity depending on the task. Our sensor characterization experiments demonstrate the capabilities of eFlesh: contact localization RMSE of 0.5 mm, and force prediction RMSE of 0.27 N for normal force and 0.12 N for shear force. We also present a learned slip detection model that generalizes to unseen objects with 95% accuracy, and visuotactile control policies that improve manipulation performance by 40% over vision-only baselines -- achieving 91% average success rate for four precise tasks that require sub-mm accuracy for successful completion. All design files, code and the CAD-to-eFlesh STL conversion tool are open-sourced and available on https://e-flesh.com.
△ Less
Submitted 11 June, 2025;
originally announced June 2025.