Search | arXiv e-print repository

Recent Advances in Diffusion Models for Hyperspectral Image Processing and Analysis: A Review

Authors: Xing Hu, Xiangcheng Liu, Danfeng Hong, Qianqian Duan, Linghua Jiang, Haima Yang, Dawei Zhan

Abstract: Hyperspectral image processing and analysis has important applications in remote sensing, agriculture, and environmental monitoring, but high dimensionality, data redundancy, and noise interference pose great challenges to the analysis. Traditional models have limitations in dealing with such complex data and struggle to meet the increasing demand for analysis. In recent years, diffusion models, as a class of emerging generative approaches, have demonstrated promising capabilities in hyperspectral image (HSI) processing tasks. By simulating the diffusion process of data over time, diffusion models are capable of modeling high-dimensional spectral structures, generating high-quality samples, and achieving competitive performance in spectral-spatial denoising and data enhancement. In this paper, we review recent research advances in diffusion modeling for hyperspectral image processing and analysis, and discuss its applications in tasks such as high-dimensional data processing, noise removal, classification, and anomaly detection. The performance of diffusion-based models on image processing is compared and the challenges are summarized. It is shown that diffusion models can significantly improve the accuracy and efficiency of hyperspectral image analysis, providing a new direction for future research.

Submitted 27 May, 2025; v1 submitted 16 May, 2025; originally announced May 2025.

arXiv:2505.11095 [pdf, ps, other]

Towards Better Evaluation for Generated Patent Claims

Authors: Lekang Jiang, Pascal A Scherz, Stephan Goetz

Abstract: Patent claims define the scope of protection and establish the legal boundaries of an invention. Drafting these claims is a complex and time-consuming process that usually requires the expertise of skilled patent attorneys, which can form a large access barrier for many small enterprises. To solve these challenges, researchers have investigated the use of large language models (LLMs) for automating patent claim generation. However, existing studies highlight inconsistencies between automated evaluation metrics and human expert assessments. To bridge this gap, we introduce Patent-CE, the first comprehensive benchmark for evaluating patent claims. Patent-CE includes comparative claim evaluations annotated by patent experts, focusing on five key criteria: feature completeness, conceptual clarity, terminology consistency, logical linkage, and overall quality. Additionally, we propose PatClaimEval, a novel multi-dimensional evaluation method specifically designed for patent claims. Our experiments demonstrate that PatClaimEval achieves the highest correlation with human expert evaluations across all assessment criteria among all tested metrics. This research provides the groundwork for more accurate evaluations of automated patent claim generation systems.

Submitted 16 May, 2025; originally announced May 2025.

Comments: Accepted to ACL 2025. 14 pages, 8 tables

arXiv:2505.10499 [pdf, ps, other]

Achievable rates for concatenated square Gottesman-Kitaev-Preskill codes

Authors: Mahadevan Subramanian, Guo Zheng, Liang Jiang

Abstract: The Gottesman-Kitaev-Preskill (GKP) codes are known to achieve optimal rates under displacement noise and pure loss channels, which establishes theoretical foundations for their optimality. However, such optimal rates are only known to be achieved at a discrete set of noise strengths with the current self-dual symplectic lattice construction. In this work, we develop a new coding strategy using concatenated continuous variable - discrete variable encodings to go beyond past results and establish GKP's optimal rate over all noise strengths. In particular, for displacement noise, the rate is obtained through a constructive approach by concatenating GKP codes with a quantum polar code and analog decoding. For the pure loss channel, we prove the existence of capacity-achieving GKP codes through a random coding approach. These results highlight the capability of concatenation-based GKP codes and provide new methods for constructing good GKP lattices.

Submitted 15 May, 2025; originally announced May 2025.

Comments: 14+15 pages, 6+4 figures

arXiv:2505.09687 [pdf, ps, other]

Efficient benchmarking of logical magic state

Authors: Su-un Lee, Ming Yuan, Senrui Chen, Kento Tsubouchi, Liang Jiang

Abstract: High-fidelity logical magic states are a critical resource for fault-tolerant quantum computation, enabling non-Clifford logical operations through state injection. However, benchmarking these states presents significant challenges: one must estimate the infidelity $ε$ with multiplicative precision, while many quantum error-correcting codes only permit Clifford operations to be implemented fault-tolerantly. Consequently, conventional state tomography requires $\sim1/ε^2$ samples, making benchmarking impractical for high-fidelity states. In this work, we show that any benchmarking scheme measuring one copy of the magic state per round necessarily requires $Ω(1/ε^2)$ samples for single-qubit magic states. We then propose two approaches to overcome this limitation: (i) Bell measurements on two copies of the twirled state and (ii) single-copy schemes leveraging twirled multi-qubit magic states. Both benchmarking schemes utilize measurements with stabilizer states orthogonal to the ideal magic state and we show that $O(1/ε)$ sample complexity is achieved, which we prove to be optimal. Finally, we demonstrate the robustness of our protocols through numerical simulations under realistic noise models, confirming that their advantage persists even at moderate error rates currently achievable in state-of-the-art experiments.
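To make the gap between the two scalings quoted in this abstract concrete, the following is a minimal arithmetic sketch, not the authors' protocol. It assumes a hypothetical constant prefactor `c` (the abstract gives only the asymptotic forms $1/ε^2$ and $1/ε$), and it takes the inverse infidelity as an integer so the arithmetic stays exact. The function names are illustrative only.

```python
def samples_quadratic(inv_eps: int, c: int = 1) -> int:
    """Sample count under a ~c/eps^2 scaling (c is an assumed prefactor)."""
    return c * inv_eps ** 2


def samples_linear(inv_eps: int, c: int = 1) -> int:
    """Sample count under a ~c/eps scaling (c is an assumed prefactor)."""
    return c * inv_eps


if __name__ == "__main__":
    # inv_eps = 1/eps, e.g. 1000 corresponds to an infidelity of 1e-3.
    for inv_eps in (100, 1_000, 10_000):
        quad = samples_quadratic(inv_eps)
        lin = samples_linear(inv_eps)
        print(f"1/eps={inv_eps}: ~1/eps^2 -> {quad}, ~1/eps -> {lin}, ratio {quad // lin}")
```

With equal prefactors, the ratio between the two counts is itself 1/ε, so the saving grows as the target infidelity shrinks.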

Submitted 14 May, 2025; originally announced May 2025.

arXiv:2505.06810 [pdf, ps, other]

QSeer: A Quantum-Inspired Graph Neural Network for Parameter Initialization in Quantum Approximate Optimization Algorithm Circuits

Authors: Lei Jiang, Chi Zhang, Fan Chen

Abstract: To mitigate the barren plateau problem, effective parameter initialization is crucial for optimizing the Quantum Approximate Optimization Algorithm (QAOA) in the near-term Noisy Intermediate-Scale Quantum (NISQ) era. Prior physics-driven approaches leveraged the optimal parameter concentration phenomenon, utilizing medium values of previously optimized QAOA parameters stored in databases as initialization for new graphs. However, this medium-value-based strategy lacks generalization capability. Conversely, prior computer-science-based approaches employed graph neural networks (GNNs) trained on previously optimized QAOA parameters to predict initialization values for new graphs. However, these approaches neglect key physics-informed QAOA principles, such as parameter concentration, symmetry, and adiabatic evolution, resulting in suboptimal parameter predictions and limited performance improvements. Furthermore, no existing GNN-based methods support parameter initialization for QAOA circuits with variable depths or for solving weighted Max-Cut problems. This paper introduces QSeer, a quantum-inspired GNN designed for accurate QAOA parameter prediction. Compared to prior physics- and computer-science-driven methods, QSeer improves the initial approximation ratio and convergence speed of QAOA circuits across diverse graphs by 6%-68% and 5x-10x, respectively.

Submitted 10 May, 2025; originally announced May 2025.

arXiv:2505.06690 [pdf]

E2E-FANet: A Highly Generalizable Framework for Waves prediction Behind Floating Breakwaters via Exogenous-to-Endogenous Variable Attention

Authors: Jianxin Zhang, Lianzi Jiang, Xinyu Han, Xiangrong Wang, Weinan Huang

Abstract: Accurate prediction of waves behind floating breakwaters (FB) is crucial for optimizing coastal engineering structures, enhancing safety, and improving design efficiency. Existing methods demonstrate limitations in capturing nonlinear interactions between waves and structures, while exhibiting insufficient capability in modeling the complex frequency-domain relationships among elevations of different wave gauges. To address these challenges, this study introduces the Exogenous-to-Endogenous Frequency-Aware Network (E2E-FANet), a novel end-to-end neural network designed to model relationships between waves and structures. The E2E-FANet architecture incorporates a Dual-Basis Frequency Mapping (DBFM) module that leverages orthogonal cosine and sine bases to extract wave features from the frequency domain while preserving temporal information. Additionally, we introduce the Exogenous-to-Endogenous Cross-Attention (E2ECA) module, which employs cross attention to model the interactions between endogenous and exogenous variables. We incorporate a Temporal-wise Attention (TA) mechanism that adaptively captures complex dependencies in endogenous variables. These integrated modules function synergistically, enabling E2E-FANet to achieve both comprehensive feature perception in the time-frequency domain and precise modeling of wave-structure interactions.
To comprehensively evaluate the performance of E2E-FANet, we constructed a multi-level validation framework comprising three distinct testing scenarios: internal validation under identical wave conditions, generalization testing across different wave conditions, and adaptability testing with varying relative water density (RW) conditions. These comprehensive tests demonstrate that E2E-FANet provides accurate predictions of waves behind FB while successfully generalizing across diverse wave conditions.

Submitted 10 May, 2025; originally announced May 2025.

arXiv:2505.06688 [pdf]

A Novel Framework for Significant Wave Height Prediction based on Adaptive Feature Extraction Time-Frequency Network

Authors: Jianxin Zhang, Lianzi Jiang, Xinyu Han, Xiangrong Wang

Abstract: Precise forecasting of significant wave height (Hs) is essential for the development and utilization of wave energy. The challenges in predicting Hs arise from its non-linear and non-stationary characteristics. The combination of decomposition preprocessing and machine learning models has demonstrated significant effectiveness in Hs prediction by extracting data features. However, decomposing the unknown data in the test set can lead to data leakage issues. To simultaneously achieve data feature extraction and prevent data leakage, a novel Adaptive Feature Extraction Time-Frequency Network (AFE-TFNet) is proposed to improve prediction accuracy and stability. It is an encoder-decoder rolling framework. The encoder consists of two stages: feature extraction and feature fusion. In the feature extraction stage, global and local frequency domain features are extracted by combining Wavelet Transform (WT) and Fourier Transform (FT), and multi-scale frequency analysis is performed using Inception blocks. In the feature fusion stage, time-domain and frequency-domain features are integrated through dominant harmonic sequence energy weighting (DHSEW). The decoder employs an advanced long short-term memory (LSTM) model. Hourly measured wind speed (Ws), dominant wave period (DPD), average wave period (APD) and Hs from three stations are used as the dataset, and four metrics are employed to evaluate the forecasting performance. Results show that AFE-TFNet significantly outperforms benchmark methods in terms of prediction accuracy.
Feature extraction can significantly improve the prediction accuracy. DHSEW substantially increases the accuracy of medium-term to long-term forecasting. The prediction accuracy of AFE-TFNet does not vary significantly with changes in the rolling time window size. Overall, AFE-TFNet shows strong potential for handling complex signal forecasting.

Submitted 10 May, 2025; originally announced May 2025.

arXiv:2505.06635 [pdf, ps, other]

Reducing Unimodal Bias in Multi-Modal Semantic Segmentation with Multi-Scale Functional Entropy Regularization

Authors: Xu Zheng, Yuanhuiyi Lyu, Lutao Jiang, Danda Pani Paudel, Luc Van Gool, Xuming Hu

Abstract: Fusing and balancing multi-modal inputs from novel sensors for dense prediction tasks, particularly semantic segmentation, is critically important yet remains a significant challenge. One major limitation is the tendency of multi-modal frameworks to over-rely on easily learnable modalities, a phenomenon referred to as unimodal dominance or bias. This issue becomes especially problematic in real-world scenarios where the dominant modality may be unavailable, resulting in severe performance degradation. To this end, we apply a simple but effective plug-and-play regularization term based on functional entropy, which introduces no additional parameters or modules. This term is designed to intuitively balance the contribution of each visual modality to the segmentation results. Specifically, we leverage the log-Sobolev inequality to bound functional entropy using functional-Fisher-information. By maximizing the information contributed by each visual modality, our approach mitigates unimodal dominance and establishes a more balanced and robust segmentation framework. A multi-scale regularization module is proposed to apply our proposed plug-and-play term on high-level features and also segmentation predictions for more balanced multi-modal learning. Extensive experiments on three datasets demonstrate that our proposed method achieves superior performance, i.e., +13.94%, +3.25%, and +3.64%, without introducing any additional parameters.

Submitted 10 May, 2025; originally announced May 2025.

arXiv:2505.06523 [pdf, ps, other]

doi 10.1145/3721238.3730602

Virtualized 3D Gaussians: Flexible Cluster-based Level-of-Detail System for Real-Time Rendering of Composed Scenes

Authors: Xijie Yang, Linning Xu, Lihan Jiang, Dahua Lin, Bo Dai

Abstract: 3D Gaussian Splatting (3DGS) enables the reconstruction of intricate digital 3D assets from multi-view images by leveraging a set of 3D Gaussian primitives for rendering. Its explicit and discrete representation facilitates the seamless composition of complex digital worlds, offering significant advantages over previous neural implicit methods. However, when applied to large-scale compositions, such as crowd-level scenes, it can encompass numerous 3D Gaussians, posing substantial challenges for real-time rendering. To address this, inspired by Unreal Engine 5's Nanite system, we propose Virtualized 3D Gaussians (V3DG), a cluster-based LOD solution that constructs hierarchical 3D Gaussian clusters and dynamically selects only the necessary ones to accelerate rendering speed. Our approach consists of two stages: (1) Offline Build, where hierarchical clusters are generated using a local splatting method to minimize visual differences across granularities, and (2) Online Selection, where footprint evaluation determines perceptible clusters for efficient rasterization during rendering. We curate a dataset of synthetic and real-world scenes, including objects, trees, people, and buildings, each requiring 0.1 billion 3D Gaussians to capture fine details. Experiments show that our solution balances rendering efficiency and visual quality across user-defined tolerances, facilitating downstream interactive applications that compose extensive 3DGS assets for consistent rendering performance.

Submitted 10 May, 2025; originally announced May 2025.

Comments: project page: https://xijie-yang.github.io/V3DG/

arXiv:2505.05766 [pdf, ps, other]

Measurement of separate electron and positron spectra from 10 GeV to 20GeV with the geomagnetic field on DAMPE

Authors: DAMPE Collaboration, F. Alemanno, Q. An, P. Azzarello, F. C. T. Barbato, P. Bernardini, X. J. Bi, H. Boutin, I. Cagnoli, M. S. Cai, E. Casilli, E. Catanzani, J. Chang, D. Y. Chen, J. L. Chen, Z. F. Chen, Z. X. Chen, P. Coppin, M. Y. Cui, T. S. Cui, Y. X. Cui, I. DeMitri, F. dePalma, A. DiGiovanni, T. K. Dong, et al. (127 additional authors not shown)

Abstract: The cosmic-ray (CR) electrons and positrons in space are of great significance for studying the origin and propagation of cosmic rays. The satellite-borne experiment DArk Matter Particle Explorer (DAMPE) has been used to measure the separate electron and positron spectra, as well as the positron fraction. In this work, the Earth's magnetic field is used to distinguish CR electrons and positrons, as the DAMPE detector does not carry an onboard magnet. The energy range for the measurements is from 10 to 20 GeV, being currently limited at high energy by the zenith pointing orientation of DAMPE. The results are consistent with previous measurements based on the magnetic spectrometer by AMS-02 and PAMELA, while the results of Fermi-LAT seem to be systematically shifted to larger values.

Submitted 9 May, 2025; originally announced May 2025.

Comments: 18 pages, 5 figures

arXiv:2505.05732 [pdf, other]

doi 10.1137/1.9781611978520.1

Automated Learning of Semantic Embedding Representations for Diffusion Models

Authors: Limai Jiang, Yunpeng Cai

Abstract: Generative models capture the true distribution of data, yielding semantically rich representations. Denoising diffusion models (DDMs) exhibit superior generative capabilities, though efficient representation learning for them is lacking. In this work, we employ a multi-level denoising autoencoder framework to expand the representation capacity of DDMs, which introduces sequentially consistent Diffusion Transformers and an additional timestep-dependent encoder to acquire embedding representations on the denoising Markov chain through self-conditional diffusion learning. Intuitively, the encoder, conditioned on the entire diffusion process, compresses high-dimensional data into directional vectors in latent space under different noise levels, facilitating the learning of image embeddings across all timesteps. To verify the semantic adequacy of embeddings generated through this approach, extensive experiments are conducted on various datasets, demonstrating that optimally learned embeddings by DDMs surpass state-of-the-art self-supervised representation learning methods in most cases, achieving remarkable discriminative semantic representation quality. Our work shows that DDMs are not only suitable for generative tasks, but also potentially advantageous for general-purpose deep learning applications.

Submitted 8 May, 2025; originally announced May 2025.

Comments: Extended version of the paper published in SDM25

arXiv:2505.05679 [pdf, other]

From Bias To Improved Prompts: A Case Study of Bias Mitigation of Clone Detection Models

Authors: QiHong Chen, Lianghao Jiang, Iftekhar Ahmed

Abstract: The issue of clone code has persisted in software engineering, primarily because developers often copy and paste code segments. This common practice has elevated the importance of clone code detection, garnering attention from both software engineering researchers and industry professionals. Their collective concern arises from the potential negative impacts that clone code can have on software quality. The emergence of powerful Generative Large Language Models (LLMs) like ChatGPT has exacerbated the clone code problem. These advanced models possess code generation capabilities that can inadvertently create code clones. As a result, the need to detect clone code has become more critical than ever before. In this study, we assess the suitability of LLMs for clone code detection. Our results demonstrate that the Palm model achieved a high F1 score of 89.30 for the avatar dataset and 86.41 for the poolC dataset. A known issue with LLMs is their susceptibility to prompt bias, where the performance of these models fluctuates based on the input prompt provided. In our research, we delve deeper into the reasons behind these fluctuations and propose a framework to mitigate prompt bias for clone detection. Our analysis identifies eight distinct categories of prompt bias, and our devised approach leveraging these biases yields a significant improvement of up to 10.81% in the F1 score. These findings underscore the substantial impact of prompt bias on the performance of LLMs and highlight the potential for leveraging model errors to alleviate this bias.

Submitted 8 May, 2025; originally announced May 2025.

arXiv:2505.05271 [pdf, other]

T-T: Table Transformer for Tagging-based Aspect Sentiment Triplet Extraction

Authors: Kun Peng, Chaodong Tong, Cong Cao, Hao Peng, Qian Li, Guanlin Wu, Lei Jiang, Yanbing Liu, Philip S. Yu

Abstract: Aspect sentiment triplet extraction (ASTE) aims to extract triplets composed of aspect terms, opinion terms, and sentiment polarities from given sentences. The table tagging method is a popular approach to addressing this task, which encodes a sentence into a 2-dimensional table, allowing for the tagging of relations between any two words. Previous efforts have focused on designing various downstream relation learning modules to better capture interactions between tokens in the table, revealing that a stronger capability to capture relations can lead to greater improvements in the model. Motivated by this, we attempt to directly utilize transformer layers as downstream relation learning modules. Due to the powerful semantic modeling capability of transformers, it is foreseeable that this will lead to excellent improvement. However, owing to the quadratic relation between the length of the table and the length of the input sentence sequence, using transformers directly faces two challenges: overly long table sequences and unfair local attention interaction. To address these challenges, we propose a novel Table-Transformer (T-T) for the tagging-based ASTE method. Specifically, we introduce a stripe attention mechanism with a loop-shift strategy to tackle these challenges. The former modifies the global attention mechanism to only attend to a 2-dimensional local attention window, while the latter facilitates interaction between different attention windows.
Extensive and comprehensive experiments demonstrate that the T-T, as a downstream relation learning module, achieves state-of-the-art performance with lower computational costs.

Submitted 8 May, 2025; originally announced May 2025.

Comments: Accepted by IJCAI2025

arXiv:2505.04591 [pdf, other]

Timescales, Squeezing and Heisenberg Scalings in Many-Body Continuous Sensing

Authors: Gideon Lee, Ron Belyansky, Liang Jiang, Aashish A. Clerk

Abstract: The continuous monitoring of driven-dissipative systems offers new avenues for quantum advantage in metrology. This approach mixes temporal and spatial correlations in a manner distinct from traditional metrology, leading to ambiguities in how one identifies Heisenberg scalings (e.g., standard asymptotic metrics like the sensitivity are not bounded by system size). Here, we propose a new metric for continuous sensing, the optimized finite-time environmental quantum Fisher information (QFI), that remedies the above issues by simultaneously treating time and system size as finite resources. In addition to having direct experimental relevance, this quantity is rigorously bounded by both system size and integration time, allowing for a precise formulation of Heisenberg scaling. We also introduce two many-body continuous sensors: the high-temperature superradiant sensor, and the dissipative spin squeezer. Both exhibit Heisenberg scaling of a collective magnetic field for multiple directions. The spin squeezed sensor has a striking advantage over previously studied many-body continuous sensors: the optimal measurement achieving the full QFI does not require the construction of a complex decoder system, but can be achieved using direct photodetection of the cavity output field.

Submitted 7 May, 2025; originally announced May 2025.

Comments: 18 pages including supplementary material, 2 figures

arXiv:2505.02351 [pdf, ps, other]

Opt-GPTQ: An Optimized GPTQ Combining Sparse Attention and Quantization Techniques

Authors: Jie Kong, Junxiang Zhang, Jiheng Xu, Yalong Li, Shouhua Zhang, Jiehan Zhou, Yuhai Liu, Peng Liang, Quan Zhang, Luohan Jiang

Abstract: In the field of deep learning, traditional attention mechanisms face significant challenges related to high computational complexity and large memory consumption when processing long sequence data. To address these limitations, we propose Opt-GPTQ, an optimized Gradient-based Post Training Quantization (GPTQ) combining the Grouped Query Attention (GQA) mechanism with paging memory management, optimizing the traditional Multi-Head Attention (MHA) mechanism by grouping query heads and sharing key-value vectors. Optimized GQA (Opt-GQA) effectively reduces computational complexity, minimizes memory fragmentation, and enhances memory utilization for large-scale models. Opt-GPTQ is optimized for Data Center Units (DCUs) and integrated into the vLLM model to maximize hardware efficiency. It customizes GPU kernels to further enhance attention computation by reducing memory access latency and boosting parallel computing capabilities. Opt-GQA integrates Attention with Linear Biases (ALiBi) to reduce overhead and enhance long-sequence processing. Experimental results show that Opt-GPTQ significantly reduces computation time and memory usage while improving model performance.
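The idea of "grouping query heads and sharing key-value vectors" mentioned in this abstract can be illustrated with a generic grouped-query attention sketch in plain Python. This is not the Opt-GPTQ implementation: it omits quantization, paged memory, ALiBi, and any DCU or GPU kernel work, handles a single query position only, and every function name here is hypothetical.

```python
import math


def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Map a query head to the key-value head shared by its group."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def grouped_query_attention(queries, keys, values):
    """Attention for one query position.

    queries: [num_q_heads][d]
    keys, values: [num_kv_heads][seq_len][d]
    Returns one output vector per query head.
    """
    num_q_heads, num_kv_heads = len(queries), len(keys)
    outputs = []
    for head, q in enumerate(queries):
        group = kv_head_for_query_head(head, num_q_heads, num_kv_heads)
        scale = 1.0 / math.sqrt(len(q))
        # Scaled dot-product scores against the shared keys of this group.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys[group]]
        weights = softmax(scores)
        dim = len(values[group][0])
        # Weighted sum of the shared value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values[group])) for j in range(dim)]
        )
    return outputs


if __name__ == "__main__":
    # 8 query heads sharing 2 key-value heads: only 2 K/V sets need caching.
    print([kv_head_for_query_head(h, 8, 2) for h in range(8)])
```

In this sketch the memory saving comes from storing `num_kv_heads` key-value sets instead of one per query head; setting `num_kv_heads` equal to `num_q_heads` recovers ordinary multi-head attention.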

Submitted 10 July, 2025; v1 submitted 5 May, 2025; originally announced May 2025.

arXiv:2505.01947 [pdf, ps, other]

Runtime Anomaly Detection for Drones: An Integrated Rule-Mining and Unsupervised-Learning Approach

Authors: Ivan Tan, Wei Minn, Christopher M. Poskitt, Lwin Khin Shar, Lingxiao Jiang

Abstract: UAVs, commonly referred to as drones, have witnessed a remarkable surge in popularity due to their versatile applications. These cyber-physical systems depend on multiple sensor inputs, such as cameras, GPS receivers, accelerometers, and gyroscopes, with faults potentially leading to physical instability and serious safety concerns. To mitigate such risks, anomaly detection has emerged as a crucial safeguarding mechanism, capable of identifying the physical manifestations of emerging issues and allowing operators to take preemptive action at runtime. Recent anomaly detection methods based on LSTM neural networks have shown promising results, but three challenges persist: the need for models that can generalise across the diverse mission profiles of drones; the need for interpretability, enabling operators to understand the nature of detected problems; and the need for capturing domain knowledge that is difficult to infer solely from log data. Motivated by these challenges, this paper introduces RADD, an integrated approach to anomaly detection in drones that combines rule mining and unsupervised learning. In particular, we leverage rules (or invariants) to capture expected relationships between sensors and actuators during missions, and utilise unsupervised learning techniques to cover more subtle relationships that the rules may have missed.
We implement this approach using the ArduPilot drone software in the Gazebo simulator, utilising 44 rules derived across the main phases of drone missions, in conjunction with an ensemble of five unsupervised learning models. We find that our integrated approach successfully detects 93.84% of anomalies over six types of faults with a low false positive rate (2.33%), and can be deployed effectively at runtime. Furthermore, RADD outperforms a state-of-the-art LSTM-based method in detecting the different types of faults evaluated in our study.

Submitted 3 May, 2025; originally announced May 2025.

Comments: Accepted by the 29th International Conference on Engineering of Complex Computer Systems (ICECCS 2025)

arXiv:2505.01657 [pdf, other]

RAGAR: Retrieval Augment Personalized Image Generation Guided by Recommendation

Authors: Run Ling, Wenji Wang, Yuting Liu, Guibing Guo, Linying Jiang, Xingwei Wang

Abstract: Personalized image generation is crucial for improving the user experience, as it renders reference images into preferred ones according to user visual preferences. Although effective, existing methods face two main issues. First, existing methods treat all items in the user historical sequence equally when extracting user preferences, overlooking the varying semantic similarities between historical items and the reference item. Disproportionately high weights for low-similarity items distort users' visual preferences for the reference item. Second, existing methods heavily rely on consistency between generated and reference images to optimize the generation, which leads to underfitting user preferences and hinders personalization. To address these issues, we propose Retrieval Augment Personalized Image GenerAtion guided by Recommendation (RAGAR). Our approach uses a retrieval mechanism to assign different weights to historical items according to their similarities to the reference item, thereby extracting more refined users' visual preferences for the reference item. Then we introduce a novel rank task based on the multi-modal ranking model to optimize the personalization of the generated images instead of forcing dependence on consistency. Extensive experiments and human evaluations on three real-world datasets demonstrate that RAGAR achieves significant improvements in both personalization and semantic metrics compared to five baselines.

Submitted 2 May, 2025; originally announced May 2025.

arXiv:2505.01236 [pdf, other]

Qracle: A Graph-Neural-Network-based Parameter Initializer for Variational Quantum Eigensolvers

Authors: Chi Zhang, Lei Jiang, Fan Chen

Abstract: Variational Quantum Eigensolvers (VQEs) are a leading class of noisy intermediate-scale quantum (NISQ) algorithms with broad applications in quantum physics and quantum chemistry. However, as system size increases, VQE optimization is increasingly hindered by the barren plateau phenomenon, where gradients vanish and the loss function becomes trapped in local minima. While machine learning-based parameter initialization methods have been proposed to address this challenge, they often show limited effectiveness in complex VQE problems. This is primarily due to their inadequate ability to model the intricate correlations embedded in the Hamiltonian structure and the associated ansatz circuits. In this paper, we propose Qracle, a graph neural network (GNN)-based parameter initializer for VQEs. Qracle systematically encodes both the Hamiltonian and the associated ansatz circuit into a unified graph representation and leverages a GNN to learn a mapping from VQE problem graphs to optimized ansatz parameters. Compared to state-of-the-art initialization techniques, Qracle achieves a reduction in initial loss of up to $10.86$, accelerates convergence by decreasing optimization steps by up to $64.42\%$, and improves final performance with up to a $26.43\%$ reduction in Symmetric Mean Absolute Percentage Error (SMAPE).

Submitted 2 May, 2025; originally announced May 2025.

arXiv:2505.01073 [pdf, other]

Retrieval Augmented Learning: A Retrial-based Large Language Model Self-Supervised Learning and Autonomous Knowledge Generation

Authors: Zongyuan Li, Pengfei Li, Runnan Qi, Yanan Ni, Lumin Jiang, Hui Wu, Xuebo Zhang, Kuihua Huang, Xian Guo

Abstract: The lack of domain-specific data in the pre-training of Large Language Models (LLMs) severely limits LLM-based decision systems in specialized applications, while post-training a model in the scenarios requires significant computational resources. In this paper, we present Retrial-Augmented Learning (RAL), a reward-free self-supervised learning framework for LLMs that operates without model training. By developing Retrieval-Augmented Generation (RAG) into a module for organizing intermediate data, we realized a three-stage autonomous knowledge generation of proposing a hypothesis, validating the hypothesis, and generating the knowledge. The method is evaluated in the LLM-PySC2 environment, a representative decision-making platform that combines sufficient complexity with domain-specific knowledge requirements. Experiments demonstrate that the proposed method effectively reduces hallucination by generating and utilizing validated knowledge, and increases decision-making performance at an extremely low cost. Meanwhile, the approach exhibits potential in out-of-distribution (OOD) tasks, robustness, and transferability, making it a cost-friendly but effective solution for decision-making problems and autonomous knowledge generation.

Submitted 2 May, 2025; originally announced May 2025.

arXiv:2505.00946 [pdf, other]

Addressing Noise and Stochasticity in Fraud Detection for Service Networks

Authors: Wenxin Zhang, Ding Xu, Xi Xuan, Lei Jiang, Guangzhen Yao, Renda Han, Xiangxiang Lang, Cuicui Luo

Abstract: Fraud detection is crucial in social service networks to maintain user trust and improve service network security. Existing spectral graph-based methods address this challenge by leveraging different graph filters to capture signals with different frequencies in service networks. However, most graph filter-based methods struggle with deriving clean and discriminative graph signals. On the one hand, they overlook the noise in the information propagation process, resulting in degradation of filtering ability. On the other hand, they fail to discriminate the frequency-specific characteristics of graph signals, leading to distortion in signal fusion. To address these issues, we develop a novel spectral graph network based on information bottleneck theory (SGNN-IB) for fraud detection in service networks. SGNN-IB splits the original graph into homophilic and heterophilic subgraphs to better capture the signals at different frequencies. For the first limitation, SGNN-IB applies information bottleneck theory to extract key characteristics of encoded representations. For the second limitation, SGNN-IB introduces prototype learning to implement signal fusion, preserving the frequency-specific characteristics of signals. Extensive experiments on three real-world datasets demonstrate that SGNN-IB outperforms state-of-the-art fraud detection methods.

Submitted 1 May, 2025; originally announced May 2025.

arXiv:2505.00824 [pdf, ps, other]

Enhancing Microwave-Optical Bell Pairs Generation for Quantum Transduction Using Kerr Nonlinearity

Authors: Fangxin Li, Ming Yuan, Zhaoyou Wang, Changchun Zhong, Liang Jiang

Abstract: Microwave-optical quantum transduction can be achieved via quantum teleportation using microwave-optical photon Bell pairs. The standard spontaneous parametric down-conversion (SPDC) has to trade off between generation fidelity and probability due to unwanted higher-excitation pairs in the output. In this work, we propose a pulsed SPDC scheme that employs strong Kerr nonlinearity in the microwave mode. This nonlinearity causes significant detuning of higher excitations due to the anharmonicity of energy levels, and the system can be pulse-driven to produce single-photon pairs in the output. Our pulsed nonlinear approach can generate high-fidelity Bell pairs with high probability, alleviating the trade-off between fidelity and probability inherent in traditional SPDC schemes. We optimize both the pulse width and driving strength, demonstrating that our protocol outperforms the SPDC scheme in a realistic setting of finite nonlinearity and intrinsic photon loss.

Submitted 1 May, 2025; originally announced May 2025.

arXiv:2504.17670 [pdf, other]

DiMeR: Disentangled Mesh Reconstruction Model

Authors: Lutao Jiang, Jiantao Lin, Kanghao Chen, Wenhang Ge, Xin Yang, Yifan Jiang, Yuanhuiyi Lyu, Xu Zheng, Yinchuan Li, Yingcong Chen

Abstract: We propose DiMeR, a novel geometry-texture disentangled feed-forward model with 3D supervision for sparse-view mesh reconstruction. Existing methods confront two persistent obstacles: (i) textures can conceal geometric errors, i.e., visually plausible images can be rendered even with wrong geometry, producing multiple ambiguous optimization objectives in geometry-texture mixed solution space for similar objects; and (ii) prevailing mesh extraction methods are redundant, unstable, and lack 3D supervision. To solve these challenges, we rethink the inductive bias for mesh reconstruction. First, we disentangle the unified geometry-texture solution space, where a single input admits multiple feasible solutions, into geometry and texture spaces individually. Specifically, given that normal maps are strictly consistent with geometry and accurately capture surface variations, the normal maps serve as the sole input for geometry prediction in DiMeR, while the texture is estimated from RGB images. Second, we streamline the algorithm of mesh extraction by eliminating modules with low performance/cost ratios and redesigning regularization losses with 3D supervision. Notably, DiMeR still accepts raw RGB images as input by leveraging foundation models for normal prediction. Extensive experiments demonstrate that DiMeR generalises across sparse-view-, single-image-, and text-to-3D tasks, consistently outperforming baselines. On the GSO and OmniObject3D datasets, DiMeR significantly reduces Chamfer Distance by more than 30%.

Submitted 26 May, 2025; v1 submitted 24 April, 2025; originally announced April 2025.

Comments: Project Page: https://lutao2021.github.io/DiMeR_page/

arXiv:2504.17542 [pdf, other]

Large Language Model-Driven Concolic Execution for Highly Structured Test Input Generation

Authors: Haoxin Tu, Seongmin Lee, Yuxian Li, Peng Chen, Lingxiao Jiang, Marcel Böhme

Abstract: How can we perform concolic execution to generate highly structured test inputs for systematically testing parsing programs? Existing concolic execution engines are significantly restricted by (1) input structure-agnostic path constraint selection, leading to the waste of testing effort or missing coverage; (2) limited constraint-solving capability, yielding many syntactically invalid test inputs; (3) reliance on manual acquisition of highly structured seed inputs, resulting in non-continuous testing. This paper proposes Cottontail, a new Large Language Model (LLM)-driven concolic execution engine, to mitigate the above limitations. A more complete program path representation, named Expressive Structural Coverage Tree (ESCT), is first constructed to select structure-aware path constraints. Later, an LLM-driven constraint solver based on a Solve-Complete paradigm is designed to solve the path constraints smartly to get test inputs that are not only satisfiable to the constraints but also valid to the input syntax. Finally, a history-guided seed acquisition is employed to obtain new highly structured test inputs either before testing starts or after testing is saturated. We implemented Cottontail on top of SymCC and evaluated eight extensively tested open-source libraries across four different formats (XML, SQL, JavaScript, and JSON). The experimental result is promising: it shows that Cottontail outperforms state-of-the-art approaches (SymCC and Marco) by 14.15% and 14.31% in terms of line coverage.
Besides, Cottontail found 6 previously unknown vulnerabilities (six new CVEs have been assigned). We have reported these issues to developers, and 4 of them have been fixed so far.

Submitted 24 April, 2025; originally announced April 2025.

Comments: 18 pages (including Appendix)

arXiv:2504.15678 [pdf, ps, other]

doi 10.1145/3735452.3735526

Zoozve: A Strip-Mining-Free RISC-V Vector Extension with Arbitrary Register Grouping Compilation Support (WIP)

Authors: Siyi Xu, Limin Jiang, Yintao Liu, Yihao Shen, Yi Shi, Shan Cao, Zhiyuan Jiang

Abstract: Vector processing is crucial for boosting processor performance and efficiency, particularly with data-parallel tasks. The RISC-V "V" Vector Extension (RVV) enhances algorithm efficiency by supporting vector registers of dynamic sizes and their grouping. Nevertheless, for very long vectors, the static number of RVV vector registers and its power-of-two grouping can lead to performance restrictions. To counteract this limitation, this work introduces Zoozve, a RISC-V vector instruction extension that eliminates the need for strip-mining. Zoozve allows for flexible vector register length and count configurations to boost data computation parallelism. With a data-adaptive register allocation approach, Zoozve permits any register groupings and accurately aligns vector lengths, cutting down register overhead and alleviating performance declines from strip-mining. Additionally, the paper details Zoozve's compiler and hardware implementations using LLVM and SystemVerilog. Initial results indicate Zoozve yields a minimum 10.10$\times$ reduction in dynamic instruction count for fast Fourier transform (FFT), with a mere 5.2% increase in overall silicon area.

Submitted 19 June, 2025; v1 submitted 22 April, 2025; originally announced April 2025.

Comments: 6 pages, 4 figures, LCTES'25

Journal ref: Proceedings of the 26th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems. (2025), 51-56

arXiv:2504.15054 [pdf, other]

Structure-guided Diffusion Transformer for Low-Light Image Enhancement

Authors: Xiangchen Yin, Zhenda Yu, Longtao Jiang, Xin Gao, Xiao Sun, Zhi Liu, Xun Yang

Abstract: While the diffusion transformer (DiT) has become a focal point of interest in recent years, its application in low-light image enhancement remains a blank area for exploration. Current methods recover the details from low-light images while inevitably amplifying the noise in images, resulting in poor visual quality. In this paper, we are the first to introduce DiT into the low-light enhancement task, and we design a novel Structure-guided Diffusion Transformer based Low-light image enhancement (SDTL) framework. We compress the feature through wavelet transform to improve the inference efficiency of the model and capture the multi-directional frequency band. Then we propose a Structure Enhancement Module (SEM) that uses a structural prior to enhance the texture and leverages an adaptive fusion strategy to achieve a more accurate enhancement effect. In addition, we propose a Structure-guided Attention Block (SAB) to pay more attention to texture-rich tokens and avoid interference from noisy areas in noise prediction. Extensive qualitative and quantitative experiments demonstrate that our method achieves SOTA performance on several popular datasets, validating the effectiveness of SDTL in improving image quality and the potential of DiT in low-light enhancement tasks.

Submitted 21 April, 2025; originally announced April 2025.

Comments: Accepted by IEEE Transactions on Multimedia (TMM)

arXiv:2504.14202 [pdf, other]

Learning Joint ID-Textual Representation for ID-Preserving Image Synthesis

Authors: Zichuan Liu, Liming Jiang, Qing Yan, Yumin Jia, Hao Kang, Xin Lu

Abstract: We propose a novel framework for ID-preserving generation using a multi-modal encoding strategy rather than injecting identity features via adapters into pre-trained models. Our method treats identity and text as a unified conditioning input. To achieve this, we introduce FaceCLIP, a multi-modal encoder that learns a joint embedding space for both identity and textual semantics. Given a reference face and a text prompt, FaceCLIP produces a unified representation that encodes both identity and text, which conditions a base diffusion model to generate images that are identity-consistent and text-aligned. We also present a multi-modal alignment algorithm to train FaceCLIP, using a loss that aligns its joint representation with face, text, and image embedding spaces. We then build FaceCLIP-SDXL, an ID-preserving image synthesis pipeline by integrating FaceCLIP with Stable Diffusion XL (SDXL). Compared to prior methods, FaceCLIP-SDXL enables photorealistic portrait generation with better identity preservation and textual relevance. Extensive experiments demonstrate its quantitative and qualitative superiority.

Submitted 21 May, 2025; v1 submitted 19 April, 2025; originally announced April 2025.

arXiv:2504.13203 [pdf, other]

X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents

Authors: Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, Hamid Palangi, Kai-Wei Chang, Yejin Choi, Saadia Gabriel

Abstract: Multi-turn interactions with language models (LMs) pose critical safety risks, as harmful intent can be strategically spread across exchanges. Yet, the vast majority of prior work has focused on single-turn safety, while adaptability and diversity remain among the key challenges of multi-turn red-teaming. To address these challenges, we present X-Teaming, a scalable framework that systematically explores how seemingly harmless interactions escalate into harmful outcomes and generates corresponding attack scenarios. X-Teaming employs collaborative agents for planning, attack optimization, and verification, achieving state-of-the-art multi-turn jailbreak effectiveness and diversity with success rates up to 98.1% across representative leading open-weight and closed-source models. In particular, X-Teaming achieves a 96.2% attack success rate against the latest Claude 3.7 Sonnet model, which has been considered nearly immune to single-turn attacks. Building on X-Teaming, we introduce XGuard-Train, an open-source multi-turn safety training dataset that is 20x larger than the previous best resource, comprising 30K interactive jailbreaks, designed to enable robust multi-turn safety alignment for LMs. Our work offers essential tools and insights for mitigating sophisticated conversational attacks, advancing the multi-turn safety of LMs.

Submitted 15 April, 2025; originally announced April 2025.

arXiv:2504.13168 [pdf, other]

Restoring Heisenberg scaling in time via autonomous quantum error correction

Authors: Hyukgun Kwon, Uwe R. Fischer, Seung-Woo Lee, Liang Jiang

Abstract: We establish a sufficient condition under which autonomous quantum error correction (AutoQEC) can effectively restore Heisenberg scaling (HS) in quantum metrology. Specifically, we show that if all Lindblad operators associated with the noise commute with the signal Hamiltonian and a particular constrained linear equation admits a solution, then an ancilla-free AutoQEC scheme with finite $R$ (where $R$ represents the ratio between the engineered dissipation rate for AutoQEC and the noise rate) can approximately preserve HS with desired small additive error $\epsilon > 0$ over any time interval $0 \leq t \leq T$. We emphasize that the error scales as $\epsilon = O(\kappa T / R^c)$ with $c \geq 1$, indicating that the required $R$ decreases significantly with increasing $c$ to achieve a desired error. Furthermore, we discuss that if the sufficient condition is not satisfied, logical errors may be induced that cannot be efficiently corrected by the canonical AutoQEC framework. Finally, we numerically verify our analytical results by employing the concrete example of phase estimation under dephasing noise.

Submitted 17 April, 2025; originally announced April 2025.

Comments: 5 pages, 3 figures, 10 pages of supplemental material

arXiv:2504.11478 [pdf, other]

Flux Already Knows -- Activating Subject-Driven Image Generation without Training

Authors: Hao Kang, Stathi Fotiadis, Liming Jiang, Qing Yan, Yumin Jia, Zichuan Liu, Min Jin Chong, Xin Lu

Abstract: We propose a simple yet effective zero-shot framework for subject-driven image generation using a vanilla Flux model. By framing the task as grid-based image completion and simply replicating the subject image(s) in a mosaic layout, we activate strong identity-preserving capabilities without any additional data, training, or inference-time fine-tuning. This "free lunch" approach is further strengthened by a novel cascade attention design and meta prompting technique, boosting fidelity and versatility. Experimental results show that our method outperforms baselines across multiple key metrics in benchmarks and human preference studies, with trade-offs in certain aspects. Additionally, it supports diverse edits, including logo insertion, virtual try-on, and subject replacement or insertion. These results demonstrate that a pre-trained foundational text-to-image model can enable high-quality, resource-efficient subject-driven generation, opening new possibilities for lightweight customization in downstream applications.

Submitted 19 April, 2025; v1 submitted 12 April, 2025; originally announced April 2025.

arXiv:2504.10832 [pdf, other]

Unlimited Vector Processing for Wireless Baseband Based on RISC-V Extension

Authors: Limin Jiang, Yi Shi, Yihao Shen, Shan Cao, Zhiyuan Jiang, Sheng Zhou

Abstract: Wireless baseband processing (WBP) serves as an ideal scenario for utilizing vector processing, which excels in managing data-parallel operations due to its parallel structure. However, conventional vector architectures face certain constraints such as limited vector register sizes, reliance on power-of-two vector length multipliers, and vector permutation capabilities tied to specific architectures. To address these challenges, we have introduced an instruction set extension (ISE) based on RISC-V known as unlimited vector processing (UVP). This extension enhances both the flexibility and efficiency of vector computations. UVP employs a novel programming model that supports non-power-of-two register groupings and hardware strip-mining, thus enabling smooth handling of vectors of varying lengths while reducing the software strip-mining burden. Vector instructions are categorized into symmetric and asymmetric classes, complemented by specialized load/store strategies to optimize execution. Moreover, we present a hardware implementation of UVP featuring sophisticated hazard detection mechanisms, optimized pipelines for symmetric tasks such as fixed-point multiplication and division, and a robust permutation engine for effective asymmetric operations. Comprehensive evaluations demonstrate that UVP significantly enhances performance, achieving up to 3.0$\times$ and 2.1$\times$ speedups in matrix multiplication and fast Fourier transform (FFT) tasks, respectively, when measured against lane-based vector architectures.
Our synthesized RTL for a 16-lane configuration using SMIC 40nm technology spans 0.94 mm$^2$ and achieves an area efficiency of 21.2 GOPS/mm$^2$.

Submitted 14 April, 2025; originally announced April 2025.

Comments: 13 pages, 9 figures, 3 tables, Under Review

arXiv:2504.10686 [pdf, other]

The Tenth NTIRE 2025 Efficient Super-Resolution Challenge Report

Authors: Bin Ren, Hang Guo, Lei Sun, Zongwei Wu, Radu Timofte, Yawei Li, Yao Zhang, Xinning Chai, Zhengxue Cheng, Yingsheng Qin, Yucai Yang, Li Song, Hongyuan Yu, Pufan Xu, Cheng Wan, Zhijuan Huang, Peng Guo, Shuyuan Cui, Chenjun Li, Xuehai Hu, Pan Pan, Xin Zhang, Heng Zhang, Qing Luo, Linyan Jiang , et al. (122 additional authors not shown)

Abstract: This paper presents a comprehensive review of the NTIRE 2025 Challenge on Single-Image Efficient Super-Resolution (ESR). The challenge aimed to advance the development of deep models that optimize key computational metrics, i.e., runtime, parameters, and FLOPs, while achieving a PSNR of at least 26.90 dB on the $\operatorname{DIV2K\_LSDIR\_valid}$ dataset and 26.99 dB on the $\operatorname{DIV2K\_LSDIR\_test}$ dataset. A robust participation saw 244 registered entrants, with 43 teams submitting valid entries. This report meticulously analyzes these methods and results, emphasizing groundbreaking advancements in state-of-the-art single-image ESR techniques. The analysis highlights innovative approaches and establishes benchmarks for future research in the field.

Submitted 14 April, 2025; originally announced April 2025.

Comments: Accepted by CVPR2025 NTIRE Workshop, Efficient Super-Resolution Challenge Report. 50 pages

arXiv:2504.09090 [pdf, other]

Leveraging Large Self-Supervised Time-Series Models for Transferable Diagnosis in Cross-Aircraft Type Bleed Air System

Authors: Yilin Wang, Peixuan Lei, Xuyang Wang, Liangliang Jiang, Liming Xuan, Wei Cheng, Honghua Zhao, Yuanxiang Li

Abstract: Bleed Air System (BAS) is critical for maintaining flight safety and operational efficiency, supporting functions such as cabin pressurization, air conditioning, and engine anti-icing. However, BAS malfunctions, including overpressure, low pressure, and overheating, pose significant risks such as cabin depressurization, equipment failure, or engine damage. Current diagnostic approaches face notable limitations when applied across different aircraft types, particularly for newer models that lack sufficient operational data. To address these challenges, this paper presents a self-supervised learning-based foundation model that enables the transfer of diagnostic knowledge from mature aircraft (e.g., A320, A330) to newer ones (e.g., C919). Leveraging self-supervised pretraining, the model learns universal feature representations from flight signals without requiring labeled data, making it effective in data-scarce scenarios. This model enhances both anomaly detection and baseline signal prediction, thereby improving system reliability. The paper introduces a cross-model dataset, a self-supervised learning framework for BAS diagnostics, and a novel Joint Baseline and Anomaly Detection Loss Function tailored to real-world flight data. These innovations facilitate efficient transfer of diagnostic knowledge across aircraft types, ensuring robust support for early operational stages of new models. Additionally, the paper explores the relationship between model capacity and transferability, providing a foundation for future research on large-scale flight signal models.

Submitted 12 April, 2025; originally announced April 2025.

arXiv:2504.08685 [pdf, other]

Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model

Authors: Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, Feng Cheng, Feilong Zuo, Xuejiao Zeng, Ziyan Yang, Fangyuan Kong, Meng Wei, Zhiwu Qing, Fei Xiao, Tuyen Hoang, Siyu Zhang, Peihao Zhu, Qi Zhao, Jiangqiao Yan, Liangke Gui, Sheng Bi, et al. (30 additional authors not shown)

Abstract: This technical report presents a cost-efficient strategy for training a video generation foundation model. We present a mid-sized research model with approximately 7 billion parameters (7B) called Seaweed-7B, trained from scratch using 665,000 H100 GPU hours. Despite being trained with moderate computational resources, Seaweed-7B demonstrates highly competitive performance compared to contemporary video generation models of much larger size. Design choices are especially crucial in a resource-constrained setting. This technical report highlights the key design decisions that enhance the performance of the medium-sized diffusion model. Empirically, we make two observations: (1) Seaweed-7B achieves performance comparable to, or even surpassing, that of larger models trained on substantially greater GPU resources, and (2) our model, which exhibits strong generalization ability, can be effectively adapted across a wide range of downstream applications by either lightweight fine-tuning or continued training. See the project page at https://seaweed.video/

Submitted 4 May, 2025; v1 submitted 11 April, 2025; originally announced April 2025.

Comments: Technical report (some typos fixed)

arXiv:2504.08371 [pdf, other]

Passive Underwater Acoustic Signal Separation based on Feature Decoupling Dual-path Network

Authors: Yucheng Liu, Longyu Jiang

Abstract: Signal separation in the passive underwater acoustic domain has heavily relied on deep learning techniques to isolate ship radiated noise. However, the separation networks commonly used in this domain stem from speech separation applications and may not fully consider the unique aspects of underwater acoustics beforehand, such as the influence of different propagation media, signal frequencies and modulation characteristics. This oversight highlights the need for tailored approaches that account for the specific characteristics of underwater sound propagation. This study introduces a novel temporal network designed to separate ship radiated noise by employing a dual-path model and a feature decoupling approach. The mixed signals' features are transformed into a space where they exhibit greater independence, with each dimension's significance decoupled. Subsequently, a fusion of local and global attention mechanisms is employed in the separation layer. Extensive comparisons showcase the effectiveness of this method when compared to other prevalent network models, as evidenced by its performance in the ShipsEar and DeepShip datasets.

Submitted 11 April, 2025; originally announced April 2025.

Comments: 10 pages, 4 figures

MSC Class: 68T10 ACM Class: I.5.4; I.2.6; J.2

arXiv:2504.07158

Holistic Capability Preservation: Towards Compact Yet Comprehensive Reasoning Models

Authors: Ling Team, Caizhi Tang, Chilin Fu, Chunwei Wu, Jia Guo, Jianwen Wang, Jingyu Hu, Liang Jiang, Meng Li, Peng Jiao, Pingping Liu, Shaomian Zheng, Shiwei Liang, Shuaicheng Li, Yalin Zhang, Yingting Wu, Yongkang Liu, Zhenyu Huang

Abstract: This technical report presents Ring-Lite-Distill, a lightweight reasoning model derived from our open-source Mixture-of-Experts (MoE) Large Language Model (LLM) Ling-Lite. This study demonstrates that through meticulous high-quality data curation and ingenious training paradigms, the compact MoE model Ling-Lite can be further trained to achieve exceptional reasoning capabilities, while maintaining its parameter-efficient architecture with only 2.75 billion activated parameters, establishing an efficient lightweight reasoning architecture. In particular, in constructing this model, we have not merely focused on enhancing advanced reasoning capabilities, exemplified by high-difficulty mathematical problem solving, but rather aimed to develop a reasoning model with more comprehensive competency coverage. Our approach ensures coverage across reasoning tasks of varying difficulty levels while preserving generic capabilities, such as instruction following, tool use, and knowledge retention. We show that Ring-Lite-Distill's reasoning ability reaches a level comparable to DeepSeek-R1-Distill-Qwen-7B, while its general capabilities significantly surpass those of DeepSeek-R1-Distill-Qwen-7B. The models are accessible at https://huggingface.co/inclusionAI

Submitted 10 April, 2025; v1 submitted 9 April, 2025; originally announced April 2025.

Comments: Based on the further discussion of the working group, the current version is deemed unsuitable for release. We are currently undertaking further work that is expected to involve significant revisions, but this process will require some additional time. We plan to proceed with the release once these updates have been fully implemented

arXiv:2504.06968 [pdf, other]

doi 10.1038/s41467-025-57791-w

Probable evidence for a transient mega-electron volt emission line in the GRB 221023A

Authors: Lu-Yao Jiang, Yun Wang, Yu-Jia Wei, Da-Ming Wei, Xiang Li, Hao-Ning He, Jia Ren, Zhao-Qiang Shen, Zhi-Ping Jin

Abstract: Detection of spectral lines in gamma-ray bursts (GRBs) is important for studying GRB physics, as it provides insights into the composition and physical conditions of the GRB environment. However, progress in detecting X-ray or gamma-ray emission and absorption lines in GRB spectra has been relatively slow; only the narrow emission line feature of about 10 MeV found in GRB 221009A has exhibited a significance exceeding $5 σ$. Here, we report probable evidence of a narrow emission feature at about 2.1 mega-electron volts (MeV) in the spectrum of GRB 221023A. The highest statistical significance of this feature is observed in the time interval between 8 and 30 seconds after the Fermi Gamma-Ray Burst Monitor trigger, with the chance probability value $<2.56 \times 10^{-5}$ (after accounting for the look-elsewhere effect), corresponding to a Gaussian-equivalent significance $> 4.20 σ$. We interpret this feature as being generated through the de-excitation of excited electrons in the relativistic hydrogen-like high-atomic-number ions entrained in the GRB jet.

Submitted 9 April, 2025; originally announced April 2025.

Comments: 20 pages, 5 figures, 3 tables. Published in Nature Communications

arXiv:2504.05706 [pdf, other]

SEVERE++: Evaluating Benchmark Sensitivity in Generalization of Video Representation Learning

Authors: Fida Mohammad Thoker, Letian Jiang, Chen Zhao, Piyush Bagad, Hazel Doughty, Bernard Ghanem, Cees G. M. Snoek

Abstract: Continued advances in self-supervised learning have led to significant progress in video representation learning, offering a scalable alternative to supervised approaches by removing the need for manual annotations. Despite strong performance on standard action recognition benchmarks, video self-supervised learning methods are largely evaluated under narrow protocols, typically pretraining on Kinetics-400 and fine-tuning on similar datasets, limiting our understanding of their generalization in real-world scenarios. In this work, we present a comprehensive evaluation of modern video self-supervised models, focusing on generalization across four key downstream factors: domain shift, sample efficiency, action granularity, and task diversity. Building on our prior work analyzing benchmark sensitivity in CNN-based contrastive learning, we extend the study to cover state-of-the-art transformer-based video-only and video-text models. Specifically, we benchmark 12 transformer-based methods (7 video-only, 5 video-text) and compare them to 10 CNN-based methods, totaling over 1100 experiments across 8 datasets and 7 downstream tasks. Our analysis shows that, despite architectural advances, transformer-based models remain sensitive to downstream conditions. No method generalizes consistently across all factors: video-only transformers perform better under domain shifts, CNNs outperform for fine-grained tasks, and video-text models often underperform despite large-scale pretraining. We also find that recent transformer models do not consistently outperform earlier approaches. Our findings provide a detailed view of the strengths and limitations of current video SSL methods and offer a unified benchmark for evaluating generalization in video representation learning.

Submitted 8 April, 2025; originally announced April 2025.

Comments: Under Review

arXiv:2504.04377 [pdf, other]

PolyGuard: A Multilingual Safety Moderation Tool for 17 Languages

Authors: Priyanshu Kumar, Devansh Jain, Akhila Yerukola, Liwei Jiang, Himanshu Beniwal, Thomas Hartvigsen, Maarten Sap

Abstract: Truly multilingual safety moderation efforts for Large Language Models (LLMs) have been hindered by a narrow focus on a small set of languages (e.g., English, Chinese) as well as a limited scope of safety definition, resulting in significant gaps in moderation capabilities. To bridge these gaps, we release POLYGUARD, a new state-of-the-art multilingual safety model for safeguarding LLM generations, and the corresponding training and evaluation datasets. POLYGUARD is trained on POLYGUARDMIX, the largest multilingual safety training corpus to date containing 1.91M samples across 17 languages (e.g., Chinese, Czech, English, Hindi). We also introduce POLYGUARDPROMPTS, a high quality multilingual benchmark with 29K samples for the evaluation of safety guardrails. Created by combining naturally occurring multilingual human-LLM interactions and human-verified machine translations of an English-only safety dataset (WildGuardMix; Han et al., 2024), our datasets contain prompt-output pairs with labels of prompt harmfulness, response harmfulness, and response refusal. Through extensive evaluations across multiple safety and toxicity benchmarks, we demonstrate that POLYGUARD outperforms existing state-of-the-art open-weight and commercial safety classifiers by 5.5%. Our contributions advance efforts toward safer multilingual LLMs for all global users.

Submitted 6 April, 2025; originally announced April 2025.

arXiv:2504.04099 [pdf, other]

TARAC: Mitigating Hallucination in LVLMs via Temporal Attention Real-time Accumulative Connection

Authors: Chunzhao Xie, Tongxuan Liu, Lei Jiang, Yuting Zeng, Jinrong Guo, Yunheng Shen, Weizhe Huang, Jing Li, Xiaohua Xu

Abstract: Large Vision-Language Models have demonstrated remarkable performance across various tasks; however, the challenge of hallucinations constrains their practical applications. The hallucination problem arises from multiple factors, including the inherent hallucinations in language models, the limitations of visual encoders in perception, and biases introduced by multimodal data. Extensive research has explored ways to mitigate hallucinations. For instance, OPERA prevents the model from overly focusing on "anchor tokens", thereby reducing hallucinations, whereas VCD mitigates hallucinations by employing a contrastive decoding approach. In this paper, we investigate the correlation between the decay of attention to image tokens and the occurrence of hallucinations. Based on this finding, we propose Temporal Attention Real-time Accumulative Connection (TARAC), a novel training-free method that dynamically accumulates and updates LVLMs' attention on image tokens during generation. By enhancing the model's attention to image tokens, TARAC mitigates hallucinations caused by the decay of attention on image tokens. We validate the effectiveness of TARAC across multiple models and datasets, demonstrating that our approach substantially mitigates hallucinations. In particular, TARAC reduces $C_S$ by 25.2 and $C_I$ by 8.7 compared to VCD on the CHAIR benchmark.

Submitted 5 April, 2025; originally announced April 2025.

arXiv:2504.03661 [pdf, other]

MILLION: Mastering Long-Context LLM Inference Via Outlier-Immunized KV Product Quantization

Authors: Zongwu Wang, Peng Xu, Fangxin Liu, Yiwei Hu, Qingxiao Sun, Gezi Li, Cheng Li, Xuan Wang, Li Jiang, Haibing Guan

Abstract: Large language models (LLMs) are increasingly utilized for complex tasks requiring longer context lengths, with some models supporting up to 128K or 1M tokens. This trend, however, presents significant challenges in inference speed and memory management. Quantization emerges as a promising approach to address the widening gap between LLM size and memory capacity. However, traditional quantization schemes often yield suboptimal compression results for KV caches due to two key factors: i) on-the-fly quantization and de-quantization, which cause significant performance overhead; ii) the prevalence of outliers in KV values, which challenges low-bitwidth uniform quantization. To this end, we propose MILLION, a novel quantization framework achieving low-bitwidth KV cache through product quantization. First, we conduct a thorough analysis of KV cache distribution, revealing the limitations of existing quantization schemes. Second, we introduce a non-uniform quantization algorithm based on product quantization, which efficiently compresses data while preserving accuracy. Third, we develop a high-performance GPU inference framework with efficient attention kernel and pipeline design for MILLION that leverages sparse computation and asynchronous quantization, significantly enhancing inference speed. Comprehensive evaluation results demonstrate that MILLION can achieve 4-bit quantization with trivial perplexity and accuracy loss, and achieve 2.09x end-to-end performance gains at 32K context length. Code is released at https://github.com/ZongwuWang/MILLION.

Submitted 8 April, 2025; v1 submitted 12 March, 2025; originally announced April 2025.

Comments: 7 pages, 7 figures and 4 tables

ACM Class: I.2.0

arXiv:2504.01523 [pdf, other]

Adapting Knowledge Prompt Tuning for Enhanced Automated Program Repair

Authors: Xuemeng Cai, Lingxiao Jiang

Abstract: Automated Program Repair (APR) aims to enhance software reliability by automatically generating bug-fixing patches. Recent work has improved the state-of-the-art of APR by fine-tuning pre-trained large language models (LLMs), such as CodeT5, for APR. However, the effectiveness of fine-tuning becomes weakened in data scarcity scenarios, and data scarcity can be a common issue in practice, limiting fine-tuning performance. To alleviate this limitation, this paper adapts prompt tuning for enhanced APR and conducts a comprehensive study to evaluate its effectiveness in data scarcity scenarios, using three LLMs of different sizes and six diverse datasets across four programming languages. Prompt tuning rewrites the input to a model by adding extra prompt tokens and tunes both the model and the prompts on a small dataset. These tokens provide task-specific knowledge that can improve the model for APR, which is especially critical in data scarcity scenarios. Moreover, domain knowledge has proven crucial in many code intelligence tasks, but existing studies fail to leverage domain knowledge during the prompt tuning for APR. To close this gap, we introduce knowledge prompt tuning, an approach that adapts prompt tuning with six distinct types of code- or bug-related domain knowledge for APR. Our work, to the best of our knowledge, is the first to adapt and evaluate prompt tuning and the effectiveness of code- or bug-related domain knowledge for APR, particularly under data scarcity settings. Our evaluation results demonstrate that prompt tuning with knowledge generally outperforms fine-tuning under various experimental settings, achieving an average improvement of 87.33% over fine-tuning in data scarcity scenarios.

Submitted 2 April, 2025; originally announced April 2025.

arXiv:2504.00527 [pdf, other]

SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning

Authors: Fida Mohammad Thoker, Letian Jiang, Chen Zhao, Bernard Ghanem

Abstract: Masked video modeling, such as VideoMAE, is an effective paradigm for video self-supervised learning (SSL). However, such methods are primarily based on reconstructing pixel-level details on natural videos, which have substantial temporal redundancy, limiting their capability for semantic representation and sufficient encoding of motion dynamics. To address these issues, this paper introduces a novel SSL approach for video representation learning, dubbed SMILE, by infusing both spatial and motion semantics. In SMILE, we leverage image-language pretrained models, such as CLIP, to guide the learning process with their high-level spatial semantics. We enhance the representation of motion by introducing synthetic motion patterns in the training data, allowing the model to capture more complex and dynamic content. Furthermore, using SMILE, we establish a new self-supervised video learning paradigm capable of learning strong video representations without requiring any natural video data. We have carried out extensive experiments on 7 datasets with various downstream scenarios. SMILE surpasses current state-of-the-art SSL methods, showcasing its effectiveness in learning more discriminative and generalizable video representations. Code is available at https://github.com/fmthoker/SMILE

Submitted 1 April, 2025; originally announced April 2025.

Comments: Accepted to CVPR 2025

arXiv:2504.00387 [pdf, other]

Scene4U: Hierarchical Layered 3D Scene Reconstruction from Single Panoramic Image for Your Immerse Exploration

Authors: Zilong Huang, Jun He, Junyan Ye, Lihan Jiang, Weijia Li, Yiping Chen, Ting Han

Abstract: The reconstruction of immersive and realistic 3D scenes holds significant practical importance in various fields of computer vision and computer graphics. Typically, immersive and realistic scenes should be free from obstructions by dynamic objects, maintain global texture consistency, and allow for unrestricted exploration. The current mainstream methods for image-driven scene construction involve iteratively refining the initial image using a moving virtual camera to generate the scene. However, previous methods struggle with visual discontinuities due to global texture inconsistencies under varying camera poses, and they frequently exhibit scene voids caused by foreground-background occlusions. To this end, we propose a novel layered 3D scene reconstruction framework from a panoramic image, named Scene4U. Specifically, Scene4U integrates an open-vocabulary segmentation model with a large language model to decompose a real panorama into multiple layers. Then, we employ a layered repair module based on a diffusion model to restore occluded regions using visual cues and depth information, generating a hierarchical representation of the scene. The multi-layer panorama is then initialized as a 3D Gaussian Splatting representation, followed by layered optimization, which ultimately produces an immersive 3D scene with semantic and structural consistency that supports free exploration. Scene4U outperforms state-of-the-art methods, improving by 24.24% in LPIPS and 24.40% in BRISQUE, while also achieving the fastest training speed. Additionally, to demonstrate the robustness of Scene4U and allow users to experience immersive scenes from various landmarks, we build the WorldVista3D dataset for 3D scene reconstruction, which contains panoramic images of globally renowned sites. The implementation code and dataset will be released at https://github.com/LongHZ140516/Scene4U.

Submitted 20 April, 2025; v1 submitted 31 March, 2025; originally announced April 2025.

Comments: CVPR 2025, 11 pages, 7 figures

arXiv:2504.00296 [pdf, other]

Dependence of Planet populations on Stellar Mass and Metallicity: A Pebble Accretion-based Planet Population Synthesis

Authors: Mengrui Pan, Beibei Liu, Linjie Jiang, Jiwei Xie, Wei Zhu, Ignasi Ribas

Abstract: The formation and evolution of planetary systems are linked to their host stellar environment. In this study, we employ a pebble accretion-based planet population synthesis model to explore the correlation between planetary properties and stellar mass/metallicity. Our numerical results reproduce several main aspects of exoplanetary observations. First, we find that the occurrence rate of super-Earths $η_{\rm SE}$ follows an inverted V-shape in relation to stellar mass: it increases with stellar mass among lower-mass dwarfs, peaks at early-M dwarfs, and declines toward higher-mass GK stars. Second, super-Earths grow ubiquitously around stars with various metallicities, exhibiting a flat or weak $η_{\rm SE}$ dependence on $Z_{\star}$. Third, giant planets, in contrast, form more frequently around stars with higher mass/metallicity. Lastly, we extend a subset of simulations to $1$ Gyr to investigate the long-term evolution of the systems' architecture. By converting our simulated systems into synthetic observations, we find that the eccentricities and inclinations of single-transit systems increase with stellar metallicity, while these dependencies in multi-planet systems remain relatively weak. The alignment between our results and observations provides key insights into the connection between planet populations and stellar properties.

Submitted 31 March, 2025; originally announced April 2025.

Comments: 15 pages, 7 figures, accepted by AJ

arXiv:2503.23644 [pdf, other]

Uni-Render: A Unified Accelerator for Real-Time Rendering Across Diverse Neural Renderers

Authors: Chaojian Li, Sixu Li, Linrui Jiang, Jingqun Zhang, Yingyan Celine Lin

Abstract: Recent advancements in neural rendering technologies and their supporting devices have paved the way for immersive 3D experiences, significantly transforming human interaction with intelligent devices across diverse applications. However, achieving the desired real-time rendering speeds for immersive interactions is still hindered by (1) the lack of a universal algorithmic solution for different application scenarios and (2) the dedication of existing devices or accelerators to merely specific rendering pipelines. To overcome this challenge, we have developed a unified neural rendering accelerator that caters to a wide array of typical neural rendering pipelines, enabling real-time and on-device rendering across different applications while maintaining both efficiency and compatibility. Our accelerator design is based on the insight that, although neural rendering pipelines vary and their algorithm designs are continually evolving, they typically share common operators, predominantly executing similar workloads. Building on this insight, we propose a reconfigurable hardware architecture that can dynamically adjust dataflow to align with specific rendering metric requirements for diverse applications, effectively supporting both typical and the latest hybrid rendering pipelines. Benchmarking experiments and ablation studies on both synthetic and real-world scenes demonstrate the effectiveness of the proposed accelerator. The proposed unified accelerator stands out as the first solution capable of achieving real-time neural rendering across varied representative pipelines on edge devices, potentially paving the way for the next generation of neural graphics applications.

Submitted 30 March, 2025; originally announced March 2025.

Comments: Accepted by HPCA'25

arXiv:2503.23025 [pdf, other]

Simplification of Trajectory Streams

Authors: Siu-Wing Cheng, Haoqiang Huang, Le Jiang

Abstract: While there are software systems that simplify trajectory streams on the fly, few curve simplification algorithms with quality guarantees fit the streaming requirements. We present streaming algorithms for two such problems under the Fréchet distance $d_F$ in $\mathbb{R}^d$ for some constant $d \geq 2$. Consider a polygonal curve $τ$ in $\mathbb{R}^d$ in a stream. We present a streaming algorithm that, for any $\varepsilon\in (0,1)$ and $δ> 0$, produces a curve $σ$ such that $d_F(σ,τ[v_1,v_i])\le (1+\varepsilon)δ$ and $|σ|\le 2\,\mathrm{opt}-2$, where $τ[v_1,v_i]$ is the prefix in the stream so far, and $\mathrm{opt} = \min\{|σ'|: d_F(σ',τ[v_1,v_i])\le δ\}$. Let $α= 2(d-1){\lfloor d/2 \rfloor}^2 + d$. The working storage is $O(\varepsilon^{-α})$. Each vertex is processed in $O(\varepsilon^{-α}\log\frac{1}{\varepsilon})$ time for $d \in \{2,3\}$ and $O(\varepsilon^{-α})$ time for $d \geq 4$. Thus, the whole $τ$ can be simplified in $O(\varepsilon^{-α}|τ|\log\frac{1}{\varepsilon})$ time. Ignoring polynomial factors in $1/\varepsilon$, this running time is a factor $|τ|$ faster than the best static algorithm that offers the same guarantees. We present another streaming algorithm that, for any integer $k \geq 2$ and any $\varepsilon \in (0,\frac{1}{17})$, maintains a curve $σ$ such that $|σ| \leq 2k-2$ and $d_F(σ,τ[v_1,v_i])\le (1+\varepsilon) \cdot \min\{d_F(σ',τ[v_1,v_i]): |σ'| \leq k\}$, where $τ[v_1,v_i]$ is the prefix in the stream so far. The working storage is $O((k\varepsilon^{-1}+\varepsilon^{-(α+1)})\log \frac{1}{\varepsilon})$. Each vertex is processed in $O(k\varepsilon^{-(α+1)}\log^2\frac{1}{\varepsilon})$ time for $d \in \{2,3\}$ and $O(k\varepsilon^{-(α+1)}\log\frac{1}{\varepsilon})$ time for $d \geq 4$.

Submitted 29 March, 2025; originally announced March 2025.

Comments: SoCG 2025

arXiv:2503.22436

NuGrounding: A Multi-View 3D Visual Grounding Framework in Autonomous Driving

Authors: Fuhao Li, Huan Jin, Bin Gao, Liaoyuan Fan, Lihui Jiang, Long Zeng

Abstract: Multi-view 3D visual grounding is critical for autonomous driving vehicles to interpret natural languages and localize target objects in complex environments. However, existing datasets and methods suffer from coarse-grained language instructions, and inadequate integration of 3D geometric reasoning with linguistic comprehension. To this end, we introduce NuGrounding, the first large-scale benchmark for multi-view 3D visual grounding in autonomous driving. We present a Hierarchy of Grounding (HoG) method to construct NuGrounding to generate hierarchical multi-level instructions, ensuring comprehensive coverage of human instruction patterns. To tackle this challenging dataset, we propose a novel paradigm that seamlessly combines instruction comprehension abilities of multi-modal LLMs (MLLMs) with precise localization abilities of specialist detection models. Our approach introduces two decoupled task tokens and a context query to aggregate 3D geometric information and semantic instructions, followed by a fusion decoder to refine spatial-semantic feature fusion for precise localization. Extensive experiments demonstrate that our method significantly outperforms the baselines adapted from representative 3D scene understanding methods by a significant margin and achieves 0.59 in precision and 0.64 in recall, with improvements of 50.8% and 54.7%.

Submitted 25 May, 2025; v1 submitted 28 March, 2025; originally announced March 2025.

arXiv:2503.21844

doi 10.1145/3706598.3713997

"Ignorance is Not Bliss": Designing Personalized Moderation to Address Ableist Hate on Social Media

Authors: Sharon Heung, Lucy Jiang, Shiri Azenkot, Aditya Vashistha

Abstract: Disabled people on social media often experience ableist hate and microaggressions. Prior work has shown that platform moderation often fails to remove ableist hate leaving disabled users exposed to harmful content. This paper examines how personalized moderation can safeguard users from viewing ableist comments. During interviews and focus groups with 23 disabled social media users, we presented design probes to elicit perceptions on configuring their filters of ableist speech (e.g. intensity of ableism and types of ableism) and customizing the presentation of the ableist speech to mitigate the harm (e.g. AI rephrasing the comment and content warnings). We found that participants preferred configuring their filters through types of ableist speech and favored content warnings. We surface participants distrust in AI-based moderation, skepticism in AI's accuracy, and varied tolerances in viewing ableist hate. Finally we share design recommendations to support users' agency, mitigate harm from hate, and promote safety.

Submitted 27 March, 2025; originally announced March 2025.

arXiv:2503.20822

Synthetic Video Enhances Physical Fidelity in Video Synthesis

Authors: Qi Zhao, Xingyu Ni, Ziyu Wang, Feng Cheng, Ziyan Yang, Lu Jiang, Bohan Wang

Abstract: We investigate how to enhance the physical fidelity of video generation models by leveraging synthetic videos derived from computer graphics pipelines. These rendered videos respect real-world physics, such as maintaining 3D consistency, and serve as a valuable resource that can potentially improve video generation models. To harness this potential, we propose a solution that curates and integrates synthetic data while introducing a method to transfer its physical realism to the model, significantly reducing unwanted artifacts. Through experiments on three representative tasks emphasizing physical consistency, we demonstrate its efficacy in enhancing physical fidelity. While our model still lacks a deep understanding of physics, our work offers one of the first empirical demonstrations that synthetic video enhances physical fidelity in video synthesis. Website: https://kevinz8866.github.io/simulation/

Submitted 25 March, 2025; originally announced March 2025.

arXiv:2503.18016

Retrieval Augmented Generation and Understanding in Vision: A Survey and New Outlook

Authors: Xu Zheng, Ziqiao Weng, Yuanhuiyi Lyu, Lutao Jiang, Haiwei Xue, Bin Ren, Danda Paudel, Nicu Sebe, Luc Van Gool, Xuming Hu

Abstract: Retrieval-augmented generation (RAG) has emerged as a pivotal technique in artificial intelligence (AI), particularly in enhancing the capabilities of large language models (LLMs) by enabling access to external, reliable, and up-to-date knowledge sources. In the context of AI-Generated Content (AIGC), RAG has proven invaluable by augmenting model outputs with supplementary, relevant information, thus improving their quality. Recently, the potential of RAG has extended beyond natural language processing, with emerging methods integrating retrieval-augmented strategies into the computer vision (CV) domain. These approaches aim to address the limitations of relying solely on internal model knowledge by incorporating authoritative external knowledge bases, thereby improving both the understanding and generation capabilities of vision models. This survey provides a comprehensive review of the current state of retrieval-augmented techniques in CV, focusing on two main areas: (I) visual understanding and (II) visual generation. In the realm of visual understanding, we systematically review tasks ranging from basic image recognition to complex applications such as medical report generation and multimodal question answering. For visual content generation, we examine the application of RAG in tasks related to image, video, and 3D generation. Furthermore, we explore recent advancements in RAG for embodied AI, with a particular focus on applications in planning, task execution, multimodal perception, interaction, and specialized domains. Given that the integration of retrieval-augmented techniques in CV is still in its early stages, we also highlight the key limitations of current approaches and propose future research directions to drive the development of this promising area.

Submitted 23 March, 2025; originally announced March 2025.

Comments: 19 pages, 10 figures

Showing 51–100 of 1,963 results for author: Jiang, L