-
MFogHub: Bridging Multi-Regional and Multi-Satellite Data for Global Marine Fog Detection and Forecasting
Authors:
Mengqiu Xu,
Kaixin Chen,
Heng Guo,
Yixiang Huang,
Ming Wu,
Zhenwei Shi,
Chuang Zhang,
Jun Guo
Abstract:
Deep learning approaches for marine fog detection and forecasting have outperformed traditional methods, demonstrating significant scientific and practical importance. However, the limited availability of open-source datasets remains a major challenge. Existing datasets, often focused on a single region or satellite, restrict the ability to evaluate model performance across diverse conditions and hinder the exploration of intrinsic marine fog characteristics. To address these limitations, we introduce MFogHub, the first multi-regional and multi-satellite dataset to integrate annotated marine fog observations from 15 coastal fog-prone regions and six geostationary satellites, comprising over 68,000 high-resolution samples. By encompassing diverse regions and satellite perspectives, MFogHub facilitates rigorous evaluation of both detection and forecasting methods under varying conditions. Extensive experiments with 16 baseline models demonstrate that MFogHub can reveal generalization fluctuations due to regional and satellite discrepancy, while also serving as a valuable resource for the development of targeted and scalable fog prediction techniques. Through MFogHub, we aim to advance both the practical monitoring and scientific understanding of marine fog dynamics on a global scale. The dataset and code are at https://github.com/kaka0910/MFogHub.
Submitted 15 May, 2025;
originally announced May 2025.
-
Differentiable Quantum Architecture Search in Quantum-Enhanced Neural Network Parameter Generation
Authors:
Samuel Yen-Chi Chen,
Chen-Yu Liu,
Kuan-Cheng Chen,
Wei-Jia Huang,
Yen-Jui Chang,
Wei-Hao Huang
Abstract:
The rapid advancements in quantum computing (QC) and machine learning (ML) have led to the emergence of quantum machine learning (QML), which integrates the strengths of both fields. Among QML approaches, variational quantum circuits (VQCs), also known as quantum neural networks (QNNs), have shown promise both empirically and theoretically. However, their broader adoption is hindered by reliance on quantum hardware during inference. Hardware imperfections and limited access to quantum devices pose practical challenges. To address this, the Quantum-Train (QT) framework leverages the exponential scaling of quantum amplitudes to generate classical neural network parameters, enabling inference without quantum hardware and achieving significant parameter compression. Yet, designing effective quantum circuit architectures for such quantum-enhanced neural programmers remains non-trivial and often requires expertise in quantum information science. In this paper, we propose an automated solution using differentiable optimization. Our method jointly optimizes both conventional circuit parameters and architectural parameters in an end-to-end manner via automatic differentiation. We evaluate the proposed framework on classification, time-series prediction, and reinforcement learning tasks. Simulation results show that our method matches or outperforms manually designed QNN architectures. This work offers a scalable and automated pathway for designing QNNs that can generate classical neural network parameters across diverse applications.
Submitted 13 May, 2025;
originally announced May 2025.
-
Improved Sample Upper and Lower Bounds for Trace Estimation of Quantum State Powers
Authors:
Kean Chen,
Qisheng Wang
Abstract:
As often emerges in various basic quantum properties such as entropy, the trace of quantum state powers $\operatorname{tr}(ρ^q)$ has attracted a lot of attention. The recent work of Liu and Wang (SODA 2025) showed that $\operatorname{tr}(ρ^q)$ can be estimated to within additive error $\varepsilon$ with a dimension-independent sample complexity of $\widetilde O(1/\varepsilon^{3+\frac{2}{q-1}})$ for any constant $q > 1$, where only an $Ω(1/\varepsilon)$ lower bound was given. In this paper, we significantly improve the sample complexity of estimating $\operatorname{tr}(ρ^q)$ in both the upper and lower bounds. In particular:
- For $q > 2$, we settle the sample complexity with matching upper and lower bounds $\widetilde Θ(1/\varepsilon^2)$.
- For $1 < q < 2$, we provide an upper bound $\widetilde O(1/\varepsilon^{\frac{2}{q-1}})$, with a lower bound $Ω(1/\varepsilon^{\max\{\frac{1}{q-1}, 2\}})$ for dimension-independent estimators, implying there is only room for a quadratic improvement.
Our upper bounds are obtained by (non-plug-in) quantum estimators based on weak Schur sampling, in sharp contrast to the prior approach based on quantum singular value transformation and samplizer.
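As an illustrative aside (not the estimator from the paper): when the eigenvalues of a state are already known, $\operatorname{tr}(ρ^q)$ reduces to a sum of eigenvalue powers. The sketch below assumes the spectrum is supplied as a plain list, and the function name `trace_of_power` is hypothetical. The sample-based estimators in the abstract address the harder setting in which only copies of the state are available.

```python
def trace_of_power(eigenvalues, q):
    """Return tr(rho^q) for a density matrix with the given eigenvalues.

    Assumes the eigenvalues are known exactly, which is NOT the setting
    of the paper (there, only copies of the state are available).
    """
    if any(lam < 0 for lam in eigenvalues):
        raise ValueError("eigenvalues of a density matrix must be non-negative")
    if abs(sum(eigenvalues) - 1.0) > 1e-9:
        raise ValueError("eigenvalues of a density matrix must sum to 1")
    # Zero eigenvalues contribute nothing for q > 0, so they are skipped.
    return sum(lam ** q for lam in eigenvalues if lam > 0)


# Maximally mixed qubit: tr(rho^2) is the purity, here 1/2.
purity_mixed = trace_of_power([0.5, 0.5], 2)
# Pure state: every power has trace 1.
purity_pure = trace_of_power([1.0, 0.0], 3)
```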
Submitted 14 May, 2025;
originally announced May 2025.
-
Quantum-Enhanced Parameter-Efficient Learning for Typhoon Trajectory Forecasting
Authors:
Chen-Yu Liu,
Kuan-Cheng Chen,
Yi-Chien Chen,
Samuel Yen-Chi Chen,
Wei-Hao Huang,
Wei-Jia Huang,
Yen-Jui Chang
Abstract:
Typhoon trajectory forecasting is essential for disaster preparedness but remains computationally demanding due to the complexity of atmospheric dynamics and the resource requirements of deep learning models. Quantum-Train (QT) is a hybrid quantum-classical framework that leverages quantum neural networks (QNNs) to generate trainable parameters exclusively during training, eliminating the need for quantum hardware at inference time. Building on QT's success across multiple domains, including image classification, reinforcement learning, flood prediction, and large language model (LLM) fine-tuning, we introduce Quantum Parameter Adaptation (QPA) for efficient typhoon forecasting model learning. Integrated with an Attention-based Multi-ConvGRU model, QPA enables parameter-efficient training while maintaining predictive accuracy. This work represents the first application of quantum machine learning (QML) to large-scale typhoon trajectory prediction, offering a scalable and energy-efficient approach to climate modeling. Our results demonstrate that QPA significantly reduces the number of trainable parameters while preserving performance, making high-performance forecasting more accessible and sustainable through hybrid quantum-classical learning.
Submitted 14 May, 2025;
originally announced May 2025.
-
Zero-Shot Multi-modal Large Language Model v.s. Supervised Deep Learning: A Comparative Study on CT-Based Intracranial Hemorrhage Subtyping
Authors:
Yinuo Wang,
Yue Zeng,
Kai Chen,
Cai Meng,
Chao Pan,
Zhouping Tang
Abstract:
Introduction: Timely identification of intracranial hemorrhage (ICH) subtypes on non-contrast computed tomography is critical for prognosis prediction and therapeutic decision-making, yet remains challenging due to low contrast and blurred boundaries. This study evaluates the performance of zero-shot multi-modal large language models (MLLMs) compared to traditional deep learning methods in ICH binary classification and subtyping. Methods: We utilized a dataset provided by RSNA, comprising 192 NCCT volumes. The study compares various MLLMs, including GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet V2, with conventional deep learning models, including ResNet50 and Vision Transformer. Carefully crafted prompts were used to guide MLLMs in tasks such as ICH presence, subtype classification, localization, and volume estimation. Results: The results indicate that in the ICH binary classification task, traditional deep learning models outperform MLLMs comprehensively. For subtype classification, MLLMs also exhibit inferior performance compared to traditional deep learning models, with Gemini 2.0 Flash achieving a macro-averaged precision of 0.41 and a macro-averaged F1 score of 0.31. Conclusion: While MLLMs excel in interactive capabilities, their overall accuracy in ICH subtyping is inferior to deep networks. However, MLLMs enhance interpretability through language interactions, indicating potential in medical imaging analysis. Future efforts will focus on model refinement and developing more precise MLLMs to improve performance in three-dimensional medical image processing.
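For readers unfamiliar with the macro-averaged metrics quoted above, the following is a minimal sketch of how they are commonly computed: precision and F1 are calculated per class and then averaged with equal weight per class. The function name and the toy labels are hypothetical, and the study's exact averaging convention is not stated in the abstract.

```python
def macro_precision_f1(y_true, y_pred):
    """Return (macro precision, macro F1), averaging per-class scores equally."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, f1_scores = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # Undefined ratios (zero denominators) are scored as 0.0 here.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        f1_scores.append(f1)
    return sum(precisions) / len(classes), sum(f1_scores) / len(classes)


# Toy three-class example with made-up labels.
macro_p, macro_f1 = macro_precision_f1([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
```

Because every class counts equally, a rare subtype that is classified poorly pulls the macro average down as much as a common one would.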
Submitted 14 May, 2025;
originally announced May 2025.
-
Improved Algorithms for Differentially Private Language Model Alignment
Authors:
Keyu Chen,
Hao Tang,
Qinglin Liu,
Yizhao Xu
Abstract:
Language model alignment is crucial for ensuring that large language models (LLMs) align with human preferences, yet it often involves sensitive user data, raising significant privacy concerns. While prior work has integrated differential privacy (DP) with alignment techniques, their performance remains limited. In this paper, we propose novel algorithms for privacy-preserving alignment and rigorously analyze their effectiveness across varying privacy budgets and models. Our framework can be deployed on two celebrated alignment techniques, namely direct preference optimization (DPO) and reinforcement learning from human feedback (RLHF). Through systematic experiments on large-scale language models, we demonstrate that our approach achieves state-of-the-art performance. Notably, one of our algorithms, DP-AdamW, combined with DPO, surpasses existing methods, improving alignment quality by up to 15% under moderate privacy budgets (ε=2-5). We further investigate the interplay between privacy guarantees, alignment efficacy, and computational demands, providing practical guidelines for optimizing these trade-offs.
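The abstract does not describe the internals of DP-AdamW. As background only, the sketch below shows the standard clip-and-noise gradient step from DP-SGD, on which differentially private optimizers are commonly built: each per-example gradient is clipped to a maximum L2 norm, the clipped gradients are summed, Gaussian noise scaled to the clipping norm is added, and the result is averaged. All names and parameter values are hypothetical, and whether the paper's algorithms take this exact form is an assumption.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale a gradient vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng=None):
    """Clip each per-example gradient, sum, add Gaussian noise, then average."""
    rng = rng or random.Random()
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        for i, g in enumerate(clip_gradient(grad, max_norm)):
            total[i] += g
    # Noise standard deviation is proportional to the clipping norm.
    sigma = noise_multiplier * max_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_example_grads) for v in noisy]


# With noise disabled the output is just the mean of the clipped gradients.
mean_grad = private_mean_gradient([[3.0, 4.0], [0.0, 0.0]], max_norm=1.0, noise_multiplier=0.0)
```

In a full optimizer, the noisy mean gradient would then be passed to the usual parameter update rule.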
Submitted 13 May, 2025;
originally announced May 2025.
-
Distributed Quantum Neural Networks on Distributed Photonic Quantum Computing
Authors:
Kuan-Cheng Chen,
Chen-Yu Liu,
Yu Shang,
Felix Burt,
Kin K. Leung
Abstract:
We introduce a distributed quantum-classical framework that synergizes photonic quantum neural networks (QNNs) with matrix-product-state (MPS) mapping to achieve parameter-efficient training of classical neural networks. By leveraging universal linear-optical decompositions of $M$-mode interferometers and photon-counting measurement statistics, our architecture generates neural parameters through a hybrid quantum-classical workflow: photonic QNNs with $M(M+1)/2$ trainable parameters produce high-dimensional probability distributions that are mapped to classical network weights via an MPS model with bond dimension $χ$. Empirical validation on MNIST classification demonstrates that photonic QT achieves an accuracy of $95.50\% \pm 0.84\%$ using 3,292 parameters ($χ= 10$), compared to $96.89\% \pm 0.31\%$ for classical baselines with 6,690 parameters. Moreover, a ten-fold compression ratio is achieved at $χ= 4$, with a relative accuracy loss of less than $3\%$. The framework outperforms classical compression techniques (weight sharing/pruning) by 6--12\% absolute accuracy while eliminating quantum hardware requirements during inference through classical deployment of compressed parameters. Simulations incorporating realistic photonic noise demonstrate the framework's robustness to near-term hardware imperfections. Ablation studies confirm quantum necessity: replacing photonic QNNs with random inputs collapses accuracy to chance level ($10.0\% \pm 0.5\%$). Photonic quantum computing's room-temperature operation, inherent scalability through spatial-mode multiplexing, and HPC-integrated architecture establish a practical pathway for distributed quantum machine learning, combining the expressivity of photonic Hilbert spaces with the deployability of classical neural networks.
Submitted 13 May, 2025;
originally announced May 2025.
-
FloE: On-the-Fly MoE Inference on Memory-constrained GPU
Authors:
Yuxin Zhou,
Zheng Li,
Jun Zhang,
Jue Wang,
Yiping Wang,
Zhongle Xie,
Ke Chen,
Lidan Shou
Abstract:
With the widespread adoption of Mixture-of-Experts (MoE) models, there is a growing demand for efficient inference on memory-constrained devices. While offloading expert parameters to CPU memory and loading activated experts on demand has emerged as a potential solution, the large size of activated experts overburdens the limited PCIe bandwidth, hindering the effectiveness in latency-sensitive scenarios. To mitigate this, we propose FloE, an on-the-fly MoE inference system on memory-constrained GPUs. FloE is built on the insight that there exists substantial untapped redundancy within sparsely activated experts. It employs various compression techniques on the expert's internal parameter matrices to reduce the data movement load, combined with low-cost sparse prediction, achieving perceptible inference acceleration in wall-clock time on resource-constrained devices. Empirically, FloE achieves a 9.3x compression of parameters per expert in Mixtral-8x7B; enables deployment on a GPU with only 11GB VRAM, reducing the memory footprint by up to 8.5x; and delivers a 48.7x inference speedup compared to DeepSpeed-MII on a single GeForce RTX 3090 - all with only a 4.4$\%$ - 7.6$\%$ average performance degradation.
Submitted 11 May, 2025; v1 submitted 9 May, 2025;
originally announced May 2025.
-
FLAM: Frame-Wise Language-Audio Modeling
Authors:
Yusong Wu,
Christos Tsirigotis,
Ke Chen,
Cheng-Zhi Anna Huang,
Aaron Courville,
Oriol Nieto,
Prem Seetharaman,
Justin Salamon
Abstract:
Recent multi-modal audio-language models (ALMs) excel at text-audio retrieval but struggle with frame-wise audio understanding. Prior works use temporal-aware labels or unsupervised training to improve frame-wise capabilities, but they still lack fine-grained labeling capability to pinpoint when an event occurs. While traditional sound event detection models can precisely localize events, they are limited to pre-defined categories, making them ineffective for real-world scenarios with out-of-distribution events. In this work, we introduce FLAM, an open-vocabulary contrastive audio-language model capable of localizing specific sound events. FLAM employs a memory-efficient and calibrated frame-wise objective with logit adjustment to address spurious correlations, such as event dependencies and label imbalances during training. To enable frame-wise supervision, we leverage a large-scale dataset with diverse audio events, LLM-generated captions and simulation. Experimental results and case studies demonstrate that FLAM significantly improves the open-vocabulary localization capability while maintaining strong performance in global retrieval and downstream tasks.
Submitted 8 May, 2025;
originally announced May 2025.
-
Listen to Extract: Onset-Prompted Target Speaker Extraction
Authors:
Pengjie Shen,
Kangrui Chen,
Shulin He,
Pengru Chen,
Shuqi Yuan,
He Kong,
Xueliang Zhang,
Zhong-Qiu Wang
Abstract:
We propose $\textit{listen to extract}$ (LExt), a highly effective yet extremely simple algorithm for monaural target speaker extraction (TSE). Given an enrollment utterance of a target speaker, LExt aims at extracting the target speaker from the speaker's mixed speech with other speakers. For each mixture, LExt concatenates an enrollment utterance of the target speaker to the mixture signal at the waveform level, and trains deep neural networks (DNNs) to extract the target speech based on the concatenated mixture signal. The rationale is that, this way, an artificial speech onset is created for the target speaker, and it could prompt the DNN with (a) which speaker is the target to extract and (b) spectral-temporal patterns of the target speaker that could help extraction. This simple approach produces strong TSE performance on multiple public TSE datasets including WSJ0-2mix, WHAM! and WHAMR!.
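The waveform-level concatenation can be sketched as follows. This is a minimal illustration, not the authors' implementation: placing the enrollment before the mixture is inferred from the "artificial speech onset" rationale, while the optional silence gap, the cropping step, and the function names are assumptions made here.

```python
def build_lext_input(enrollment, mixture, gap_samples=0):
    """Prepend the enrollment waveform (plus an optional silence gap) to the mixture."""
    if gap_samples < 0:
        raise ValueError("gap_samples must be non-negative")
    silence = [0.0] * gap_samples  # hypothetical gap; not stated in the abstract
    return list(enrollment) + silence + list(mixture)


def crop_to_mixture(estimate, mixture_length):
    """Keep only the trailing samples of a network output that align with the mixture."""
    return list(estimate[len(estimate) - mixture_length:])


# Toy waveforms as lists of samples.
enrollment = [0.1, 0.2]
mixture = [0.5, 0.6, 0.7]
network_input = build_lext_input(enrollment, mixture, gap_samples=1)
aligned = crop_to_mixture(network_input, len(mixture))
```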
Submitted 8 May, 2025;
originally announced May 2025.
-
With a Little Help From My Friends: Exploiting Probability Distribution Advice in Algorithm Design
Authors:
Clément L. Canonne,
Kenny Chen,
Julián Mestre
Abstract:
We study online algorithms with predictions using distributional advice, a type of prediction that arises when leveraging expert knowledge or historical data. To demonstrate the usefulness and versatility of this framework, we focus on two fundamental problems: first, the prophet inequality problem, for which we provide an algorithm achieving $\max\{\frac{1}{2}-η-o(1),\frac{1}{e}\}$-competitive ratio, where $η$ quantifies the quality of the prediction. Second, we turn to the online metric matching problem under random arrivals, for which our main positive result is an algorithm achieving the optimal cost under perfect advice, while smoothly defaulting to competitive ratios comparable to advice-free algorithms as the prediction's quality degrades.
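For context on the $\frac{1}{2}$ term in the ratio above, the classical prophet-inequality baseline is a single-threshold rule: accept the first value that reaches half of the expected maximum. The sketch below illustrates that baseline only, not the paper's algorithm. Estimating the threshold from sequences sampled under the advised distributions is an assumption made here, and the function names are hypothetical.

```python
def half_expected_max_threshold(sample_sequences):
    """Estimate E[max] / 2 from sequences drawn from the (advised) distributions."""
    maxima = [max(seq) for seq in sample_sequences]
    return 0.5 * sum(maxima) / len(maxima)


def threshold_stopping(values, threshold):
    """Accept the first arriving value that meets the threshold.

    Returns (index, value), or None if no value is ever accepted.
    """
    for index, value in enumerate(values):
        if value >= threshold:
            return index, value
    return None


# Threshold from two hypothetical sample sequences, applied to a new arrival order.
threshold = half_expected_max_threshold([[1, 4], [2, 6]])
decision = threshold_stopping([1, 2, 3, 5], threshold)
```

When the distributions are known exactly, this rule is $\frac{1}{2}$-competitive; with an estimated threshold the guarantee degrades with the estimation error, which is the kind of dependence the parameter $η$ captures in the abstract.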
Submitted 8 May, 2025;
originally announced May 2025.
-
A Unifying Bias-aware Multidisciplinary Framework for Investigating Socio-Technical Issues
Authors:
Sacha Hasan,
Mehdi Rizvi,
Yingfang Yuan,
Kefan Chen,
Lynne Baillie,
Wei Pang
Abstract:
This paper aims to bring together the disciplines of social science (SS) and computer science (CS) in the design and implementation of a novel multidisciplinary framework for systematic, transparent, ethically-informed, and bias-aware investigation of socio-technical issues. For this, various analysis approaches from social science and machine learning (ML) were applied in a structured sequence to arrive at an original methodology of identifying and quantifying objects of inquiry. A core feature of this framework is that it highlights where bias occurs and suggests possible steps to mitigate it. This is to improve the robustness, reliability, and explainability of the framework and its results. Such an approach also ensures that the investigation of socio-technical issues is transparent about its own limitations and potential sources of bias. To test our framework, we utilised it in the multidisciplinary investigation of the online harms encountered by minoritised ethnic (ME) communities when accessing and using digitalised social housing services in the UK. We draw our findings from 100 interviews with ME individuals in four cities across the UK to understand ME vulnerabilities when accessing and using digitalised social housing services. In our framework, a sub-sample of interviews focusing on ME individuals residing in social housing units were inductively coded. This resulted in the identification of the topics of discrimination, digital poverty, lack of digital literacy, and lack of English proficiency as key vulnerabilities of ME communities. Further ML techniques such as Topic Modelling and Sentiment Analysis were used within our framework where we found that Black African communities are more likely to experience these vulnerabilities in the access, use and outcome of digitalised social housing services.
Submitted 6 May, 2025;
originally announced May 2025.
-
BadLingual: A Novel Lingual-Backdoor Attack against Large Language Models
Authors:
Zihan Wang,
Hongwei Li,
Rui Zhang,
Wenbo Jiang,
Kangjie Chen,
Tianwei Zhang,
Qingchuan Zhao,
Guowen Xu
Abstract:
In this paper, we present a new form of backdoor attack against Large Language Models (LLMs): lingual-backdoor attacks. The key novelty of lingual-backdoor attacks is that the language itself serves as the trigger to hijack the infected LLMs to generate inflammatory speech. They enable the precise targeting of a specific language-speaking group, exacerbating racial discrimination by malicious entities. We first implement a baseline lingual-backdoor attack, which is carried out by poisoning a set of training data for specific downstream tasks through translation into the trigger language. However, this baseline attack suffers from poor task generalization and is impractical in real-world settings. To address this challenge, we design BadLingual, a novel task-agnostic lingual-backdoor, capable of triggering any downstream tasks within the chat LLMs, regardless of the specific questions of these tasks. We design a new approach using PPL-constrained Greedy Coordinate Gradient-based Search (PGCG) based adversarial training to expand the decision boundary of lingual-backdoor, thereby enhancing the generalization ability of lingual-backdoor across various tasks. We perform extensive experiments to validate the effectiveness of our proposed attacks. Specifically, the baseline attack achieves an ASR of over 90% on the specified tasks. However, its ASR reaches only 37.61% across six tasks in the task-agnostic scenario. In contrast, BadLingual brings up to 37.35% improvement over the baseline. Our study sheds light on a new perspective of vulnerabilities in LLMs with multilingual capabilities and is expected to promote future research on the potential defenses to enhance the LLMs' robustness.
Submitted 6 May, 2025;
originally announced May 2025.
-
Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning Eliciting Efficient Reasoning in Large Language Models
Authors:
Bin Yu,
Hang Yuan,
Yuliang Wei,
Bailing Wang,
Weizhen Qi,
Kai Chen
Abstract:
Recent advances in large language models have demonstrated that Supervised Fine-Tuning (SFT) with Chain-of-Thought (CoT) reasoning data distilled from large reasoning models (e.g., DeepSeek R1) can effectively transfer reasoning capabilities to non-reasoning models. However, models fine-tuned with this approach inherit the "overthinking" problem from teacher models, producing verbose and redundant reasoning chains during inference. To address this challenge, we propose Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning (LS-Mixture SFT), which combines a long CoT reasoning dataset with its short counterparts obtained through structure-preserved rewriting. Our experiments demonstrate that models trained using the LS-Mixture SFT method, compared to those trained with direct SFT, achieved an average accuracy improvement of 2.3% across various benchmarks while substantially reducing model response length by approximately 47.61%. This work offers an approach to endow non-reasoning models with reasoning capabilities through supervised fine-tuning while avoiding the overthinking problems inherited from teacher models, thereby enabling efficient reasoning in the fine-tuned models.
Submitted 6 May, 2025;
originally announced May 2025.
-
RIFT: Closed-Loop RL Fine-Tuning for Realistic and Controllable Traffic Simulation
Authors:
Keyu Chen,
Wenchao Sun,
Hao Cheng,
Sifa Zheng
Abstract:
Achieving both realism and controllability in interactive closed-loop traffic simulation remains a key challenge in autonomous driving. Data-driven simulation methods reproduce realistic trajectories but suffer from covariate shift in closed-loop deployment, compounded by simplified dynamics models that further reduce reliability. Conversely, physics-based simulation methods enhance reliable and controllable closed-loop interactions but often lack expert demonstrations, compromising realism. To address these challenges, we introduce a dual-stage AV-centered simulation framework that conducts open-loop imitation learning pre-training in a data-driven simulator to capture trajectory-level realism and multimodality, followed by closed-loop reinforcement learning fine-tuning in a physics-based simulator to enhance controllability and mitigate covariate shift. In the fine-tuning stage, we propose RIFT, a simple yet effective closed-loop RL fine-tuning strategy that preserves the trajectory-level multimodality through a GRPO-style group-relative advantage formulation, while enhancing controllability and training stability by replacing KL regularization with the dual-clip mechanism. Extensive experiments demonstrate that RIFT significantly improves the realism and controllability of generated traffic scenarios, providing a robust platform for evaluating autonomous vehicle performance in diverse and interactive scenarios.
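The abstract names a GRPO-style group-relative advantage and a dual-clip mechanism without giving formulas. The sketch below uses the commonly cited forms of both: rewards standardized within a group of rollouts, and a PPO clipped surrogate with an extra lower bound for negative advantages. Whether RIFT uses exactly these forms is an assumption, and the function names and default constants are hypothetical.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


def dual_clip_objective(ratio, advantage, clip_eps=0.2, dual_clip=3.0):
    """Per-sample PPO surrogate with dual clipping.

    For negative advantages the surrogate is additionally bounded below by
    dual_clip * advantage, so a very large probability ratio cannot produce
    an arbitrarily large penalty.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    ppo_term = min(ratio * advantage, clipped_ratio * advantage)
    if advantage < 0:
        return max(ppo_term, dual_clip * advantage)
    return ppo_term


# One group of three rollouts with made-up rewards.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
bounded = dual_clip_objective(ratio=10.0, advantage=-1.0)
```

In this formulation the group mean acts as the baseline, so no separate value network is needed to compute advantages.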
Submitted 6 May, 2025;
originally announced May 2025.
-
NTIRE 2025 Challenge on UGC Video Enhancement: Methods and Results
Authors:
Nikolay Safonov,
Alexey Bryncev,
Andrey Moskalenko,
Dmitry Kulikov,
Dmitry Vatolin,
Radu Timofte,
Haibo Lei,
Qifan Gao,
Qing Luo,
Yaqing Li,
Jie Song,
Shaozhe Hao,
Meisong Zheng,
Jingyi Xu,
Chengbin Wu,
Jiahui Liu,
Ying Chen,
Xin Deng,
Mai Xu,
Peipei Liang,
Jie Ma,
Junjie Jin,
Yingxue Pang,
Fangzhou Luo,
Kai Chen
, et al. (6 additional authors not shown)
Abstract:
This paper presents an overview of the NTIRE 2025 Challenge on UGC Video Enhancement. The challenge constructed a set of 150 user-generated content videos without reference ground truth, which suffer from real-world degradations such as noise, blur, faded colors, compression artifacts, etc. The goal of the participants was to develop an algorithm capable of improving the visual quality of such videos. Given the widespread use of UGC on short-form video platforms, this task holds substantial practical importance. The evaluation was based on subjective quality assessment in crowdsourcing, obtaining votes from over 8000 assessors. The challenge attracted more than 25 teams submitting solutions, 7 of which passed the final phase with source code verification. The outcomes may provide insights into the state-of-the-art in UGC video enhancement and highlight emerging trends and effective strategies in this evolving research area. All data, including the processed videos and subjective comparison votes and scores, is made publicly available at https://github.com/msu-video-group/NTIRE25_UGC_Video_Enhancement.
Submitted 5 May, 2025;
originally announced May 2025.
-
R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning
Authors:
Yi-Fan Zhang,
Xingyu Lu,
Xiao Hu,
Chaoyou Fu,
Bin Wen,
Tianke Zhang,
Changyi Liu,
Kaiyu Jiang,
Kaibing Chen,
Kaiyu Tang,
Haojie Ding,
Jiankang Chen,
Fan Yang,
Zhang Zhang,
Tingting Gao,
Liang Wang
Abstract:
Multimodal Reward Models (MRMs) play a crucial role in enhancing the performance of Multimodal Large Language Models (MLLMs). While recent advancements have primarily focused on improving the model structure and training data of MRMs, there has been limited exploration into the effectiveness of long-term reasoning capabilities for reward modeling and how to activate these capabilities in MRMs. In this paper, we explore how Reinforcement Learning (RL) can be used to improve reward modeling. Specifically, we reformulate the reward modeling problem as a rule-based RL task. However, we observe that directly applying existing RL algorithms, such as Reinforce++, to reward modeling often leads to training instability or even collapse due to the inherent limitations of these algorithms. To address this issue, we propose the StableReinforce algorithm, which refines the training loss, advantage estimation strategy, and reward design of existing RL methods. These refinements result in more stable training dynamics and superior performance. To facilitate MRM training, we collect 200K preference data from diverse datasets. Our reward model, R1-Reward, trained using the StableReinforce algorithm on this dataset, significantly improves performance on multimodal reward modeling benchmarks. Compared to previous SOTA models, R1-Reward achieves an 8.4% improvement on the VL Reward-Bench and a 14.3% improvement on the Multimodal Reward Bench. Moreover, with more inference compute, R1-Reward's performance is further enhanced, highlighting the potential of RL algorithms in optimizing MRMs.
Submitted 9 May, 2025; v1 submitted 5 May, 2025;
originally announced May 2025.
-
A Survey on Privacy Risks and Protection in Large Language Models
Authors:
Kang Chen,
Xiuze Zhou,
Yuanguo Lin,
Shibo Feng,
Li Shen,
Pengcheng Wu
Abstract:
Although Large Language Models (LLMs) have become increasingly integral to diverse applications, their capabilities raise significant privacy concerns. This survey offers a comprehensive overview of privacy risks associated with LLMs and examines current solutions to mitigate these challenges. First, we analyze privacy leakage and attacks in LLMs, focusing on how these models unintentionally expose sensitive information through techniques such as model inversion, training data extraction, and membership inference. We investigate the mechanisms of privacy leakage, including the unauthorized extraction of training data and the potential exploitation of these vulnerabilities by malicious actors. Next, we review existing privacy protection against such risks, such as inference detection, federated learning, backdoor mitigation, and confidential computing, and assess their effectiveness in preventing privacy leakage. Furthermore, we highlight key practical challenges and propose future research directions to develop secure and privacy-preserving LLMs, emphasizing privacy risk assessment, secure knowledge transfer between models, and interdisciplinary frameworks for privacy governance. Ultimately, this survey aims to establish a roadmap for addressing escalating privacy challenges in the LLMs domain.
Submitted 3 May, 2025;
originally announced May 2025.
-
Memory-Efficient LLM Training by Various-Grained Low-Rank Projection of Gradients
Authors:
Yezhen Wang,
Zhouhao Yang,
Brian K Chen,
Fanyi Pu,
Bo Li,
Tianyu Gao,
Kenji Kawaguchi
Abstract:
Building upon the success of low-rank adapter (LoRA), low-rank gradient projection (LoRP) has emerged as a promising solution for memory-efficient fine-tuning. However, existing LoRP methods typically treat each row of the gradient matrix as the default projection unit, leaving the role of projection granularity underexplored. In this work, we propose a novel framework, VLoRP, that extends low-rank gradient projection by introducing an additional degree of freedom for controlling the trade-off between memory efficiency and performance, beyond the rank hyper-parameter. Through this framework, we systematically explore the impact of projection granularity, demonstrating that finer-grained projections lead to enhanced stability and efficiency even under a fixed memory budget. Regarding the optimization for VLoRP, we present ProjFactor, an adaptive memory-efficient optimizer, that significantly reduces memory requirement while ensuring competitive performance, even in the presence of gradient accumulation. Additionally, we provide a theoretical analysis of VLoRP, demonstrating the descent and convergence of its optimization trajectory under both SGD and ProjFactor. Extensive experiments are conducted to validate our findings, covering tasks such as commonsense reasoning, MMLU, and GSM8K.
Submitted 3 May, 2025;
originally announced May 2025.
-
TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action
Authors:
Jen-Hao Cheng,
Vivian Wang,
Huayu Wang,
Huapeng Zhou,
Yi-Hao Peng,
Hou-I Liu,
Hsiang-Wei Huang,
Kuang-Ming Chen,
Cheng-Yen Yang,
Wenhao Chai,
Yi-Ling Chen,
Vibhav Vineet,
Qin Cai,
Jenq-Neng Hwang
Abstract:
Understanding causal event relationships and achieving fine-grained temporal grounding in videos remain challenging for vision-language models. Existing methods either compress video tokens to reduce temporal resolution, or treat videos as unsegmented streams, which obscures fine-grained event boundaries and limits the modeling of causal dependencies. We propose TEMPURA (Temporal Event Masked Prediction and Understanding for Reasoning in Action), a two-stage training framework that enhances video temporal understanding. TEMPURA first applies masked event prediction reasoning to reconstruct missing events and generate step-by-step causal explanations from dense event annotations, drawing inspiration from effective infilling techniques. TEMPURA then learns to perform video segmentation and dense captioning to decompose videos into non-overlapping events with detailed, timestamp-aligned descriptions. We train TEMPURA on VER, a large-scale dataset curated by us that comprises 1M training instances and 500K videos with temporally aligned event descriptions and structured reasoning steps. Experiments on temporal grounding and highlight detection benchmarks demonstrate that TEMPURA outperforms strong baseline models, confirming that integrating causal reasoning with fine-grained temporal segmentation leads to improved video understanding.
Submitted 2 May, 2025;
originally announced May 2025.
-
Attack and defense techniques in large language models: A survey and new perspectives
Authors:
Zhiyu Liao,
Kang Chen,
Yuanguo Lin,
Kangkang Li,
Yunxuan Liu,
Hefeng Chen,
Xingwang Huang,
Yuanhui Yu
Abstract:
Large Language Models (LLMs) have become central to numerous natural language processing tasks, but their vulnerabilities present significant security and ethical challenges. This systematic survey explores the evolving landscape of attack and defense techniques in LLMs. We classify attacks into adversarial prompt attack, optimized attacks, model theft, as well as attacks on application of LLMs, detailing their mechanisms and implications. Consequently, we analyze defense strategies, including prevention-based and detection-based defense methods. Although advances have been made, challenges remain to adapt to the dynamic threat landscape, balance usability with robustness, and address resource constraints in defense implementation. We highlight open problems, including the need for adaptive scalable defenses, explainable security techniques, and standardized evaluation frameworks. This survey provides actionable insights and directions for developing secure and resilient LLMs, emphasizing the importance of interdisciplinary collaboration and ethical considerations to mitigate risks in real-world applications.
Submitted 1 May, 2025;
originally announced May 2025.
-
LLM Ethics Benchmark: A Three-Dimensional Assessment System for Evaluating Moral Reasoning in Large Language Models
Authors:
Junfeng Jiao,
Saleh Afroogh,
Abhejay Murali,
Kevin Chen,
David Atkinson,
Amit Dhurandhar
Abstract:
This study establishes a novel framework for systematically evaluating the moral reasoning capabilities of large language models (LLMs) as they increasingly integrate into critical societal domains. Current assessment methodologies lack the precision needed to evaluate nuanced ethical decision-making in AI systems, creating significant accountability gaps. Our framework addresses this challenge by quantifying alignment with human ethical standards through three dimensions: foundational moral principles, reasoning robustness, and value consistency across diverse scenarios. This approach enables precise identification of ethical strengths and weaknesses in LLMs, facilitating targeted improvements and stronger alignment with societal values. To promote transparency and collaborative advancement in ethical AI development, we are publicly releasing both our benchmark datasets and evaluation codebase at https://github.com/The-Responsible-AI-Initiative/LLM_Ethics_Benchmark.git.
Submitted 1 May, 2025;
originally announced May 2025.
-
Learning to Learn with Quantum Optimization via Quantum Neural Networks
Authors:
Kuan-Cheng Chen,
Hiromichi Matsuyama,
Wei-Hao Huang
Abstract:
Quantum Approximate Optimization Algorithms (QAOA) promise efficient solutions to classically intractable combinatorial optimization problems by harnessing shallow-depth quantum circuits. Yet, their performance and scalability often hinge on effective parameter optimization, which remains nontrivial due to rugged energy landscapes and hardware noise. In this work, we introduce a quantum meta-learning framework that combines quantum neural networks, specifically Quantum Long Short-Term Memory (QLSTM) architectures, with QAOA. By training the QLSTM optimizer on smaller graph instances, our approach rapidly generalizes to larger, more complex problems, substantially reducing the number of iterations required for convergence. Through comprehensive benchmarks on Max-Cut and Sherrington-Kirkpatrick model instances, we demonstrate that QLSTM-based optimizers converge faster and achieve higher approximation ratios compared to classical baselines, thereby offering a robust pathway toward scalable quantum optimization in the NISQ era.
Submitted 1 May, 2025;
originally announced May 2025.
-
SOTA: Spike-Navigated Optimal TrAnsport Saliency Region Detection in Composite-bias Videos
Authors:
Wenxuan Liu,
Yao Deng,
Kang Chen,
Xian Zhong,
Zhaofei Yu,
Tiejun Huang
Abstract:
Existing saliency detection methods struggle in real-world scenarios due to motion blur and occlusions. In contrast, spike cameras, with their high temporal resolution, significantly enhance visual saliency maps. However, the composite noise inherent to spike camera imaging introduces discontinuities in saliency detection. Low-quality samples further distort model predictions, leading to saliency bias. To address these challenges, we propose Spike-navigated Optimal TrAnsport Saliency Region Detection (SOTA), a framework that leverages the strengths of spike cameras while mitigating biases in both spatial and temporal dimensions. Our method introduces Spike-based Micro-debias (SM) to capture subtle frame-to-frame variations and preserve critical details, even under minimal scene or lighting changes. Additionally, Spike-based Global-debias (SG) refines predictions by reducing inconsistencies across diverse conditions. Extensive experiments on real and synthetic datasets demonstrate that SOTA outperforms existing methods by eliminating composite noise bias. Our code and dataset will be released at https://github.com/lwxfight/sota.
Submitted 1 May, 2025;
originally announced May 2025.
-
From GNNs to Trees: Multi-Granular Interpretability for Graph Neural Networks
Authors:
Jie Yang,
Yuwen Wang,
Kaixuan Chen,
Tongya Zheng,
Yihe Zhou,
Zhenbang Xiao,
Ji Cao,
Mingli Song,
Shunyu Liu
Abstract:
Interpretable Graph Neural Networks (GNNs) aim to reveal the underlying reasoning behind model predictions, attributing their decisions to specific subgraphs that are informative. However, existing subgraph-based interpretable methods suffer from an overemphasis on local structure, potentially overlooking long-range dependencies within the entire graphs. Although recent efforts that rely on graph coarsening have proven beneficial for global interpretability, they inevitably reduce the graphs to a fixed granularity. Such an inflexible way can only capture graph connectivity at a specific level, whereas real-world graph tasks often exhibit relationships at varying granularities (e.g., relevant interactions in proteins span from functional groups, to amino acids, and up to protein domains). In this paper, we introduce a novel Tree-like Interpretable Framework (TIF) for graph classification, where plain GNNs are transformed into hierarchical trees, with each level featuring coarsened graphs of different granularity as tree nodes. Specifically, TIF iteratively adopts a graph coarsening module to compress original graphs (i.e., root nodes of trees) into increasingly coarser ones (i.e., child nodes of trees), while preserving diversity among tree nodes within different branches through a dedicated graph perturbation module. Finally, we propose an adaptive routing module to identify the most informative root-to-leaf paths, providing not only the final prediction but also the multi-granular interpretability for the decision-making process. Extensive experiments on the graph classification benchmarks with both synthetic and real-world datasets demonstrate the superiority of TIF in interpretability, while also delivering a competitive prediction performance akin to the state-of-the-art counterparts.
Submitted 1 May, 2025;
originally announced May 2025.
-
Efficient Graph-Based Approximate Nearest Neighbor Search Achieving: Low Latency Without Throughput Loss
Authors:
Jingjia Luo,
Mingxing Zhang,
Kang Chen,
Xia Liao,
Yingdi Shan,
Jinlei Jiang,
Yongwei Wu
Abstract:
The increase in the dimensionality of neural embedding models has enhanced the accuracy of semantic search capabilities but also amplified the computational demands for Approximate Nearest Neighbor Searches (ANNS). This complexity poses significant challenges in online and interactive services, where query latency is a critical performance metric. Traditional graph-based ANNS methods, while effective for managing large datasets, often experience substantial throughput reductions when scaled for intra-query parallelism to minimize latency. This reduction is largely due to inherent inefficiencies in the conventional fork-join parallelism model.
To address this problem, we introduce AverSearch, a novel parallel graph-based ANNS framework that overcomes these limitations through a fully asynchronous architecture. Unlike existing frameworks that struggle with balancing latency and throughput, AverSearch utilizes a dynamic workload balancing mechanism that supports continuous, dependency-free processing. This approach not only minimizes latency by eliminating unnecessary synchronization and redundant vertex processing but also maintains high throughput levels. Our evaluations across various datasets, including both traditional benchmarks and modern large-scale model generated datasets, show that AverSearch consistently outperforms current state-of-the-art systems. It achieves up to 2.1-8.9 times higher throughput at comparable latency levels across different datasets and reduces minimum latency by 1.5 to 1.9 times.
Submitted 30 April, 2025; v1 submitted 29 April, 2025;
originally announced April 2025.
-
LIRM: Large Inverse Rendering Model for Progressive Reconstruction of Shape, Materials and View-dependent Radiance Fields
Authors:
Zhengqin Li,
Dilin Wang,
Ka Chen,
Zhaoyang Lv,
Thu Nguyen-Phuoc,
Milim Lee,
Jia-Bin Huang,
Lei Xiao,
Cheng Zhang,
Yufeng Zhu,
Carl S. Marshall,
Yufeng Ren,
Richard Newcombe,
Zhao Dong
Abstract:
We present Large Inverse Rendering Model (LIRM), a transformer architecture that jointly reconstructs high-quality shape, materials, and radiance fields with view-dependent effects in less than a second. Our model builds upon the recent Large Reconstruction Models (LRMs) that achieve state-of-the-art sparse-view reconstruction quality. However, existing LRMs struggle to reconstruct unseen parts accurately and cannot recover glossy appearance or generate relightable 3D contents that can be consumed by standard Graphics engines. To address these limitations, we make three key technical contributions to build a more practical multi-view 3D reconstruction framework. First, we introduce an update model that allows us to progressively add more input views to improve our reconstruction. Second, we propose a hexa-plane neural SDF representation to better recover detailed textures, geometry and material parameters. Third, we develop a novel neural directional-embedding mechanism to handle view-dependent effects. Trained on a large-scale shape and material dataset with a tailored coarse-to-fine training scheme, our model achieves compelling results. It compares favorably to optimization-based dense-view inverse rendering methods in terms of geometry and relighting accuracy, while requiring only a fraction of the inference time.
Submitted 28 April, 2025;
originally announced April 2025.
-
Provably Secure Public-Key Steganography Based on Admissible Encoding
Authors:
Xin Zhang,
Kejiang Chen,
Na Zhao,
Weiming Zhang,
Nenghai Yu
Abstract:
The technique of hiding secret messages within seemingly harmless covertext to evade examination by censors with rigorous security proofs is known as provably secure steganography (PSS). PSS evolves from symmetric key steganography to public-key steganography, functioning without the requirement of a pre-shared key and enabling the extension to multi-party covert communication and identity verification mechanisms. Recently, a public-key steganography method based on elliptic curves was proposed, which uses point compression to eliminate the algebraic structure of curve points. However, this method has strict requirements on the curve parameters and is only available on half of the points. To overcome these limitations, this paper proposes a more general elliptic curve public-key steganography method based on admissible encoding. By applying the tensor square function to the known well-distributed encoding, we construct admissible encoding, which can create the pseudo-random public-key encryption function. The theoretical analysis and experimental results show that the proposed provably secure public-key steganography method can be deployed on all types of curves and utilize all points on the curve.
Submitted 27 April, 2025;
originally announced April 2025.
-
RadioFormer: A Multiple-Granularity Radio Map Estimation Transformer with 1‱ Spatial Sampling
Authors:
Zheng Fang,
Kangjun Liu,
Ke Chen,
Qingyu Liu,
Jianguo Zhang,
Lingyang Song,
Yaowei Wang
Abstract:
The task of radio map estimation aims to generate a dense representation of electromagnetic spectrum quantities, such as the received signal strength at each grid point within a geographic region, based on measurements from a subset of spatially distributed nodes (represented as pixels). Recently, deep vision models such as the U-Net have been adapted to radio map estimation, whose effectiveness can be guaranteed with sufficient spatial observations (typically 0.01% to 1% of pixels) in each map, to model local dependency of observed signal power. However, such a setting of sufficient measurements can be less practical in real-world scenarios, where extreme sparsity in spatial sampling can be widely encountered. To address this challenge, we propose RadioFormer, a novel multiple-granularity transformer designed to handle the constraints posed by spatial sparse observations. Our RadioFormer, through a dual-stream self-attention (DSA) module, can respectively discover the correlation of pixel-wise observed signal power and also learn patch-wise buildings' geometries in a style of multiple granularities, which are integrated into multi-scale representations of radio maps by a cross stream cross-attention (CCA) module. Extensive experiments on the public RadioMapSeer dataset demonstrate that RadioFormer outperforms state-of-the-art methods in radio map estimation while maintaining the lowest computational cost. Furthermore, the proposed approach exhibits exceptional generalization capabilities and robust zero-shot performance, underscoring its potential to advance radio map estimation in a more practical setting with very limited observation nodes.
Submitted 27 April, 2025;
originally announced April 2025.
-
DiMeR: Disentangled Mesh Reconstruction Model
Authors:
Lutao Jiang,
Jiantao Lin,
Kanghao Chen,
Wenhang Ge,
Xin Yang,
Yifan Jiang,
Yuanhuiyi Lyu,
Xu Zheng,
Yingcong Chen
Abstract:
With the advent of large-scale 3D datasets, feed-forward 3D generative models, such as the Large Reconstruction Model (LRM), have gained significant attention and achieved remarkable success. However, we observe that RGB images often lead to conflicting training objectives and lack the necessary clarity for geometry reconstruction. In this paper, we revisit the inductive biases associated with mesh reconstruction and introduce DiMeR, a novel disentangled dual-stream feed-forward model for sparse-view mesh reconstruction. The key idea is to disentangle both the input and framework into geometry and texture parts, thereby reducing the training difficulty for each part according to the Principle of Occam's Razor. Given that normal maps are strictly consistent with geometry and accurately capture surface variations, we utilize normal maps as exclusive input for the geometry branch to reduce the complexity between the network's input and output. Moreover, we improve the mesh extraction algorithm to introduce 3D ground truth supervision. As for the texture branch, we use RGB images as input to obtain the textured mesh. Overall, DiMeR demonstrates robust capabilities across various tasks, including sparse-view reconstruction, single-image-to-3D, and text-to-3D. Numerous experiments show that DiMeR significantly outperforms previous methods, achieving over 30% improvement in Chamfer Distance on the GSO and OmniObject3D datasets.
Submitted 24 April, 2025;
originally announced April 2025.
-
HMI: Hierarchical Knowledge Management for Efficient Multi-Tenant Inference in Pretrained Language Models
Authors:
Jun Zhang,
Jue Wang,
Huan Li,
Lidan Shou,
Ke Chen,
Gang Chen,
Qin Xie,
Guiming Xie,
Xuejian Gong
Abstract:
The significant computational demands of pretrained language models (PLMs), which often require dedicated hardware, present a substantial challenge in serving them efficiently, especially in multi-tenant environments. To address this, we introduce HMI, a Hierarchical knowledge management-based Multi-tenant Inference system, designed to manage tenants with distinct PLMs resource-efficiently. Our approach is three-fold: Firstly, we categorize PLM knowledge into general, domain-specific, and task-specific. Leveraging insights on knowledge acquisition across different model layers, we construct hierarchical PLMs (hPLMs) by extracting and storing knowledge at different levels, significantly reducing GPU memory usage per tenant. Secondly, we establish hierarchical knowledge management for hPLMs generated by various tenants in HMI. We manage domain-specific knowledge with acceptable storage increases by constructing and updating domain-specific knowledge trees based on frequency. We manage task-specific knowledge within limited GPU memory through parameter swapping. Finally, we propose system optimizations to enhance resource utilization and inference throughput. These include fine-grained pipelining via hierarchical knowledge prefetching to overlap CPU and I/O operations with GPU computations, and optimizing parallel implementations with batched matrix multiplications. Our experimental results demonstrate that the proposed HMI can efficiently serve up to 10,000 hPLMs (hBERTs and hGPTs) on a single GPU, with only a negligible compromise in accuracy.
Submitted 24 April, 2025;
originally announced April 2025.
-
CHASe: Client Heterogeneity-Aware Data Selection for Effective Federated Active Learning
Authors:
Jun Zhang,
Jue Wang,
Huan Li,
Zhongle Xie,
Ke Chen,
Lidan Shou
Abstract:
Active learning (AL) reduces human annotation costs for machine learning systems by strategically selecting the most informative unlabeled data for annotation, but performing it individually may still be insufficient due to restricted data diversity and annotation budget. Federated Active Learning (FAL) addresses this by facilitating collaborative data selection and model training, while preserving the confidentiality of raw data samples. Yet, existing FAL methods fail to account for the heterogeneity of data distribution across clients and the associated fluctuations in global and local model parameters, adversely affecting model accuracy. To overcome these challenges, we propose CHASe (Client Heterogeneity-Aware Data Selection), specifically designed for FAL. CHASe focuses on identifying those unlabeled samples with high epistemic variations (EVs), which notably oscillate around the decision boundaries during training. To achieve both effectiveness and efficiency, CHASe encompasses techniques for 1) tracking EVs by analyzing inference inconsistencies across training epochs, 2) calibrating decision boundaries of inaccurate models with a new alignment loss, and 3) enhancing data selection efficiency via a data freeze and awaken mechanism with subset sampling. Experiments show that CHASe surpasses various established baselines in terms of effectiveness and efficiency, validated across diverse datasets, model complexities, and heterogeneous federation settings.
△ Less
Submitted 24 April, 2025;
originally announced April 2025.
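The CHASe abstract above mentions tracking epistemic variations through inference inconsistencies across training epochs. The following is only a minimal sketch of that general idea, not the authors' implementation: it assumes a hypothetical per-epoch record of predicted labels and simply counts how often each unlabeled sample's prediction flips, then picks the samples that flip most. The names `flip_counts`, `select_top_k`, and `history` are illustrative assumptions.

```python
from typing import Dict, List


def flip_counts(preds_by_epoch: List[Dict[str, int]]) -> Dict[str, int]:
    """Count label changes between consecutive epochs for each sample id."""
    counts = {sample_id: 0 for sample_id in preds_by_epoch[0]}
    for prev, curr in zip(preds_by_epoch, preds_by_epoch[1:]):
        for sample_id in counts:
            if prev[sample_id] != curr[sample_id]:
                counts[sample_id] += 1
    return counts


def select_top_k(counts: Dict[str, int], k: int) -> List[str]:
    """Return the k samples with the most flips (ties broken by id)."""
    return sorted(counts, key=lambda s: (-counts[s], s))[:k]


# Hypothetical predictions for three unlabeled samples over three epochs.
history = [
    {"a": 0, "b": 1, "c": 0},
    {"a": 0, "b": 0, "c": 1},
    {"a": 0, "b": 1, "c": 1},
]
counts = flip_counts(history)
selected = select_top_k(counts, 1)
print(counts, selected)
```

Sample `b` changes its predicted label at every epoch boundary, so a flip-count heuristic of this kind would rank it first for annotation.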
-
HeRB: Heterophily-Resolved Structure Balancer for Graph Neural Networks
Authors:
Ke-Jia Chen,
Wenhui Mu,
Zheng Liu
Abstract:
Recent research has witnessed the remarkable progress of Graph Neural Networks (GNNs) in the realm of graph data representation. However, GNNs still encounter the challenge of structural imbalance. Prior solutions to this problem did not take graph heterophily into account, namely that connected nodes possess distinct labels or features, thus resulting in a deficiency in effectiveness. Upon verifying the impact of heterophily on solving the structural imbalance problem, we propose to rectify the heterophily first and then transfer homophilic knowledge. To this end, we devise a method named HeRB (Heterophily-Resolved Structure Balancer) for GNNs. HeRB consists of two innovative components: 1) A heterophily-lessening augmentation module which serves to reduce inter-class edges and increase intra-class edges; 2) A homophilic knowledge transfer mechanism to convey homophilic information from head nodes to tail nodes. Experimental results demonstrate that HeRB achieves superior performance on two homophilic and six heterophilic benchmark datasets, and the ablation studies further validate the efficacy of the two proposed components.
Submitted 24 April, 2025;
originally announced April 2025.
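The HeRB abstract above describes reducing inter-class edges and increasing intra-class edges. One common way to quantify the effect of such an edit is the edge homophily ratio, the fraction of edges whose endpoints share a label. The sketch below is an illustration under assumptions, not the paper's module: the toy graph is hypothetical and it uses ground-truth labels for every node, which a real method would not have.

```python
from typing import Dict, Hashable, List, Tuple

Edge = Tuple[int, int]


def edge_homophily(edges: List[Edge], labels: Dict[int, Hashable]) -> float:
    """Fraction of edges that connect two nodes with the same label."""
    if not edges:
        return 0.0
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


# Hypothetical 4-node graph with two classes.
labels = {0: "A", 1: "A", 2: "B", 3: "B"}
edges_before = [(0, 2), (1, 3), (0, 1), (1, 2)]

# Drop one inter-class edge (0, 2) and add one intra-class edge (2, 3).
edges_after = [e for e in edges_before if e != (0, 2)] + [(2, 3)]

h_before = edge_homophily(edges_before, labels)
h_after = edge_homophily(edges_after, labels)
print(h_before, h_after)
```

In this toy case the ratio rises from 0.25 to 0.5 after a single edge swap.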
-
NLCTables: A Dataset for Marrying Natural Language Conditions with Table Discovery
Authors:
Lingxi Cui,
Huan Li,
Ke Chen,
Lidan Shou,
Gang Chen
Abstract:
With the growing abundance of repositories containing tabular data, discovering relevant tables for in-depth analysis remains a challenging task. Existing table discovery methods primarily retrieve desired tables based on a query table or several vague keywords, leaving users to manually filter large result sets. To address this limitation, we propose a new task: NL-conditional table discovery (nlcTD), where users combine a query table with natural language (NL) requirements to refine search results. To advance research in this area, we present nlcTables, a comprehensive benchmark dataset comprising 627 diverse queries spanning NL-only, union, join, and fuzzy conditions, 22,080 candidate tables, and 21,200 relevance annotations. Our evaluation of six state-of-the-art table discovery methods on nlcTables reveals substantial performance gaps, highlighting the need for advanced techniques to tackle this challenging nlcTD scenario. The dataset, construction framework, and baseline implementations are publicly available at https://github.com/SuDIS-ZJU/nlcTables to foster future research.
Submitted 22 April, 2025;
originally announced April 2025.
-
SocialMOIF: Multi-Order Intention Fusion for Pedestrian Trajectory Prediction
Authors:
Kai Chen,
Xiaodong Zhao,
Yujie Huang,
Guoyu Fang,
Xiao Song,
Ruiping Wang,
Ziyuan Wang
Abstract:
The analysis and prediction of agent trajectories are crucial for decision-making processes in intelligent systems, with precise short-term trajectory forecasting being highly significant across a range of applications. Agents and their social interactions have been quantified and modeled by researchers from various perspectives; however, substantial limitations exist in the current work due to the inherent high uncertainty of agent intentions and the complex higher-order influences among neighboring groups. SocialMOIF is proposed to tackle these challenges, concentrating on the higher-order intention interactions among neighboring groups while reinforcing the primary role of first-order intention interactions between neighbors and the target agent. This method develops a multi-order intention fusion model to achieve a more comprehensive understanding of both direct and indirect intention information. Within SocialMOIF, a trajectory distribution approximator is designed to guide the trajectories toward values that align more closely with the actual data, thereby enhancing model interpretability. Furthermore, a global trajectory optimizer is introduced to enable more accurate and efficient parallel predictions. By incorporating a novel loss function that accounts for distance and direction during training, experimental results demonstrate that the model outperforms previous state-of-the-art baselines across multiple metrics in both dynamic and static datasets.
Submitted 22 April, 2025;
originally announced April 2025.
-
GIFDL: Generated Image Fluctuation Distortion Learning for Enhancing Steganographic Security
Authors:
Xiangkun Wang,
Kejiang Chen,
Yuang Qi,
Ruiheng Liu,
Weiming Zhang,
Nenghai Yu
Abstract:
Minimum distortion steganography is currently the mainstream method for modification-based steganography. A key issue in this method is how to define steganographic distortion. With the rapid development of deep learning technology, the definition of distortion has evolved from manual design to deep learning design. Concurrently, rapid advancements in image generation have made generated images viable as cover media. However, existing distortion design methods based on machine learning do not fully leverage the advantages of generated cover media, resulting in suboptimal security performance. To address this issue, we propose GIFDL (Generated Image Fluctuation Distortion Learning), a steganographic distortion learning method based on the fluctuations in generated images. Inspired by the idea of natural steganography, we take a series of highly similar fluctuation images as the input to the steganographic distortion generator and introduce a new GAN training strategy to disguise stego images as fluctuation images. Experimental results demonstrate that GIFDL, compared with state-of-the-art GAN-based distortion learning methods, exhibits superior resistance to steganalysis, increasing the detection error rates by an average of 3.30% across three steganalyzers.
Submitted 21 April, 2025;
originally announced April 2025.
-
Gaussian Shading++: Rethinking the Realistic Deployment Challenge of Performance-Lossless Image Watermark for Diffusion Models
Authors:
Zijin Yang,
Xin Zhang,
Kejiang Chen,
Kai Zeng,
Qiyi Yao,
Han Fang,
Weiming Zhang,
Nenghai Yu
Abstract:
Ethical concerns surrounding copyright protection and inappropriate content generation pose challenges for the practical implementation of diffusion models. One effective solution involves watermarking the generated images. Existing methods primarily focus on ensuring that watermark embedding does not degrade the model performance. However, they often overlook critical challenges in real-world deployment scenarios, such as the complexity of watermark key management, user-defined generation parameters, and the difficulty of verification by arbitrary third parties. To address this issue, we propose Gaussian Shading++, a diffusion model watermarking method tailored for real-world deployment. We propose a double-channel design that leverages pseudorandom error-correcting codes to encode the random seed required for watermark pseudorandomization, achieving performance-lossless watermarking under a fixed watermark key and overcoming key management challenges. Additionally, we model the distortions introduced during generation and inversion as an additive white Gaussian noise channel and employ a novel soft decision decoding strategy during extraction, ensuring strong robustness even when generation parameters vary. To enable third-party verification, we incorporate public key signatures, which provide a certain level of resistance against forgery attacks even when model inversion capabilities are fully disclosed. Extensive experiments demonstrate that Gaussian Shading++ not only maintains performance losslessness but also outperforms existing methods in terms of robustness, making it a more practical solution for real-world deployment.
Submitted 13 May, 2025; v1 submitted 21 April, 2025;
originally announced April 2025.
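The Gaussian Shading++ abstract above models distortions as an additive white Gaussian noise channel and decodes with a soft decision strategy. The sketch below illustrates only the generic difference between hard and soft decisions. It assumes a simple 3x repetition code with bits mapped to +1/-1 as a stand-in; the paper itself uses pseudorandom error-correcting codes, which are not reproduced here, and the received values are hypothetical.

```python
from typing import List


def hard_decode(received: List[float]) -> int:
    """Threshold each value to a sign first, then take a majority vote."""
    votes = sum(1 if r > 0 else -1 for r in received)
    return 1 if votes > 0 else 0


def soft_decode(received: List[float]) -> int:
    """Sum the raw values so that confident observations weigh more."""
    return 1 if sum(received) > 0 else 0


# Bit 1 sent as (+1, +1, +1); noise pushes two copies slightly below zero.
received = [0.9, -0.1, -0.2]

hard_bit = hard_decode(received)
soft_bit = soft_decode(received)
print(hard_bit, soft_bit)
```

Hard decoding discards the magnitudes and is outvoted by two weak negative values, while soft decoding keeps them and recovers the transmitted bit in this example.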
-
Benchmarking Differentially Private Tabular Data Synthesis
Authors:
Kai Chen,
Xiaochen Li,
Chen Gong,
Ryan McKenna,
Tianhao Wang
Abstract:
Differentially private (DP) tabular data synthesis generates artificial data that preserves the statistical properties of private data while safeguarding individual privacy. The emergence of diverse algorithms in recent years has introduced challenges in practical applications, such as inconsistent data processing methods, lack of in-depth algorithm analysis, and incomplete comparisons due to overlapping development timelines. These factors create significant obstacles to selecting appropriate algorithms.
In this paper, we address these challenges by proposing a benchmark for evaluating tabular data synthesis methods. We present a unified evaluation framework that integrates data preprocessing, feature selection, and synthesis modules, facilitating fair and comprehensive comparisons. Our evaluation reveals that a significant utility-efficiency trade-off exists among current state-of-the-art methods. Some statistical methods are superior in synthesis utility, but their efficiency is not as good as most machine learning-based methods. Furthermore, we conduct an in-depth analysis of each module with experimental validation, offering theoretical insights into the strengths and limitations of different strategies.
Submitted 18 April, 2025;
originally announced April 2025.
-
Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning
Authors:
ByteDance Seed,
Jiaze Chen,
Tiantian Fan,
Xin Liu,
Lingjun Liu,
Zhiqi Lin,
Mingxuan Wang,
Chengyi Wang,
Xiangpeng Wei,
Wenyuan Xu,
Yufeng Yuan,
Yu Yue,
Lin Yan,
Qiying Yu,
Xiaochen Zuo,
Chi Zhang,
Ruofei Zhu,
Zhecheng An,
Zhihao Bai,
Yu Bao,
Xingyan Bin,
Jiangjie Chen,
Feng Chen,
Hongmin Chen
, et al. (249 additional authors not shown)
Abstract:
We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For instance, it surpasses DeepSeek R1 by 8% in win rate on non-reasoning tasks, indicating its broader applicability. Compared to other state-of-the-art reasoning models, Seed1.5-Thinking is a Mixture-of-Experts (MoE) model with a relatively small size, featuring 20B activated and 200B total parameters. As part of our effort to assess generalized reasoning, we develop two internal benchmarks, BeyondAIME and Codeforces, both of which will be publicly released to support future research. Model trial link: https://www.volcengine.com/experience/ark.
Submitted 29 April, 2025; v1 submitted 10 April, 2025;
originally announced April 2025.
-
AI as a deliberative partner fosters intercultural empathy for Americans but fails for Latin American participants
Authors:
Isabel Villanueva,
Tara Bobinac,
Binwei Yao,
Junjie Hu,
Kaiping Chen
Abstract:
Despite the growing integration of AI chatbots as conversational agents in public discourse, empirical evidence regarding their capacity to foster intercultural empathy remains limited. Using a randomized dialogue experiment, we examined how different types of AI chatbot interaction, i.e., deliberative versus non-deliberative and culturally aligned versus non-aligned, affect intercultural empathy across cultural groups. Results show that deliberative conversations increased intercultural empathy among American participants but not Latin American participants, who perceived AI responses as culturally inaccurate and failing to represent their cultural contexts and perspectives authentically. Real-time interaction analyses reveal that these differences stem from cultural knowledge gaps inherent in Large Language Models. Despite explicit prompting and instruction to represent cultural perspectives in participants' native languages, AI systems still exhibit significant disparities in cultural representation. This highlights the importance of designing AI systems capable of culturally authentic engagement in deliberative conversations. Our study contributes to deliberation theory and AI alignment research by underscoring AI's role in intercultural dialogue and the persistent challenge of representational asymmetry in democratic discourse.
Submitted 4 April, 2025;
originally announced April 2025.
-
MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space
Authors:
Yicheng Chen,
Yining Li,
Kai Hu,
Zerun Ma,
Haochen Ye,
Kai Chen
Abstract:
Data quality and diversity are key to the construction of effective instruction-tuning datasets. With the increasing availability of open-source instruction-tuning datasets, it is advantageous to automatically select high-quality and diverse subsets from a vast amount of data. Existing methods typically prioritize instance quality and use heuristic rules to maintain diversity. However, this absence of a comprehensive view of the entire collection often leads to suboptimal results. Moreover, heuristic rules generally focus on distance or clustering within the embedding space, which fails to accurately capture the intent of complex instructions in the semantic space. To bridge this gap, we propose a unified method for quantifying the information content of datasets. This method models the semantic space by constructing a label graph and quantifies diversity based on the distribution of information within the graph. Based on such a measurement, we further introduce an efficient sampling method that selects data samples iteratively to Maximize the Information Gain (MIG) in semantic space. Experiments on various datasets and base models demonstrate that MIG consistently outperforms state-of-the-art methods. Notably, the model fine-tuned with 5% Tulu3 data sampled by MIG achieves comparable performance to the official SFT model trained on the full dataset, with improvements of +5.73% on AlpacaEval and +6.89% on Wildbench.
Submitted 18 April, 2025;
originally announced April 2025.
-
Feature Alignment and Representation Transfer in Knowledge Distillation for Large Language Models
Authors:
Junjie Yang,
Junhao Song,
Xudong Han,
Ziqian Bi,
Tianyang Wang,
Chia Xin Liang,
Xinyuan Song,
Yichao Zhang,
Qian Niu,
Benji Peng,
Keyu Chen,
Ming Liu
Abstract:
Knowledge distillation (KD) is a technique for transferring knowledge from complex teacher models to simpler student models, significantly enhancing model efficiency and accuracy. It has demonstrated substantial advancements in various applications including image classification, object detection, language modeling, text classification, and sentiment analysis. Recent innovations in KD methods, such as attention-based approaches, block-wise logit distillation, and decoupling distillation, have notably improved student model performance. These techniques focus on stimulus complexity, attention mechanisms, and global information capture to optimize knowledge transfer. In addition, KD has proven effective in compressing large language models while preserving accuracy, reducing computational overhead, and improving inference speed. This survey synthesizes the latest literature, highlighting key findings, contributions, and future directions in knowledge distillation to provide insights for researchers and practitioners on its evolving role in artificial intelligence and machine learning.
Submitted 18 April, 2025;
originally announced April 2025.
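The knowledge distillation survey above covers transferring knowledge from a teacher model to a student model. As a hedged illustration of the classic logit-based formulation (temperature-softened softmax with a KL divergence term scaled by the squared temperature), and not of any specific method from the survey, a minimal pure-Python sketch might look like this. The function names and the example logits are assumptions.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits: List[float], teacher_logits: List[float],
            temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions, times T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


teacher = [4.0, 1.0, 0.5]
matched = kd_loss(teacher, teacher)
mismatched = kd_loss([0.5, 1.0, 4.0], teacher)
print(matched, mismatched)
```

The loss is zero when the student reproduces the teacher's logits and positive otherwise; in practice it is usually combined with a standard cross-entropy term on the ground-truth labels.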
-
Robust Decentralized Quantum Kernel Learning for Noisy and Adversarial Environment
Authors:
Wenxuan Ma,
Kuan-Cheng Chen,
Shang Yu,
Mengxiang Liu,
Ruilong Deng
Abstract:
This paper proposes a general decentralized framework for quantum kernel learning (QKL). It is robust against quantum noise and can also be designed to defend against adversarial information attacks, forming a robust approach named RDQKL. We analyze the impact of noise on QKL and study the robustness of decentralized QKL to the noise. By integrating robust decentralized optimization techniques, our method is able to mitigate the impact of malicious data injections across multiple nodes. Experimental results demonstrate that our approach maintains high accuracy under noisy quantum operations and effectively counters adversarial modifications, offering a promising pathway towards practical, scalable, and secure quantum machine learning (QML).
Submitted 18 April, 2025;
originally announced April 2025.
-
New Results on a General Class of Minimum Norm Optimization Problems
Authors:
Kuowen Chen,
Jian Li,
Yuval Rabani,
Yiran Zhang
Abstract:
We study the general norm optimization for combinatorial problems, initiated by Chakrabarty and Swamy (STOC 2019). We propose a general formulation that captures a large class of combinatorial structures: we are given a set $U$ of $n$ weighted elements and a family of feasible subsets $F$. Each subset $S\in F$ is called a feasible solution/set of the problem. We denote the value vector by $v=\{v_i\}_{i\in [n]}$, where $v_i\geq 0$ is the value of element $i$. For any subset $S\subseteq U$, we use $v[S]$ to denote the $n$-dimensional vector $\{v_e\cdot \mathbf{1}[e\in S]\}_{e\in U}$. Let $f: \mathbb{R}^n\rightarrow\mathbb{R}_+$ be a symmetric monotone norm function. Our goal is to minimize the norm objective $f(v[S])$ over feasible subset $S\in F$.
We present a general equivalent reduction of the norm minimization problem to a multi-criteria optimization problem with logarithmic budget constraints, up to a constant approximation factor. Leveraging this reduction, we obtain constant factor approximation algorithms for the norm minimization versions of several covering problems, such as interval cover, multi-dimensional knapsack cover, and logarithmic factor approximation for set cover. We also study the norm minimization versions for perfect matching, $s$-$t$ path and $s$-$t$ cut. We show the natural linear programming relaxations for these problems have a large integrality gap. To complement the negative result, we show that, for perfect matching, there is a bi-criteria result: for any constant $ε,δ>0$, we can find in polynomial time a nearly perfect matching (i.e., a matching that matches at least $1-ε$ proportion of vertices) and its cost is at most $(8+δ)$ times of the optimum for perfect matching. Moreover, we establish the existence of a polynomial-time $O(\log\log n)$-approximation algorithm for the norm minimization variant of the $s$-$t$ path problem.
Submitted 29 April, 2025; v1 submitted 18 April, 2025;
originally announced April 2025.
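The minimum norm optimization abstract above defines the objective $f(v[S])$ over feasible subsets $S\in F$ for a symmetric monotone norm $f$. As a small worked example of that definition only (brute force over an explicitly listed family, not the paper's approximation algorithms), the sketch below uses the top-$\ell$ norm, i.e. the sum of the $\ell$ largest absolute entries, which is a standard symmetric monotone norm. The values and feasible family are hypothetical.

```python
from typing import FrozenSet, List


def restricted_vector(values: List[float], subset: FrozenSet[int]) -> List[float]:
    """Build v[S]: keep v_e for e in S and zero out every other entry."""
    return [v if i in subset else 0.0 for i, v in enumerate(values)]


def top_l_norm(vec: List[float], l: int) -> float:
    """Sum of the l largest absolute values of vec."""
    return sum(sorted((abs(x) for x in vec), reverse=True)[:l])


def min_norm_solution(values: List[float], feasible: List[FrozenSet[int]],
                      l: int) -> FrozenSet[int]:
    """Brute-force minimizer of the top-l norm of v[S] over the family."""
    return min(feasible, key=lambda s: top_l_norm(restricted_vector(values, s), l))


values = [2.5, 1.0, 1.0, 1.0]
feasible = [frozenset({0}), frozenset({1, 2, 3})]

best_top1 = min_norm_solution(values, feasible, 1)  # max-norm objective
best_top3 = min_norm_solution(values, feasible, 3)  # sum of 3 largest
print(sorted(best_top1), sorted(best_top3))
```

With $\ell=1$ the objective is the largest entry, so the three light elements win (norm 1 versus 2.5); with $\ell=3$ the single heavy element wins (norm 2.5 versus 3), which shows how the choice of norm changes the optimal feasible set.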
-
An Addendum to NeBula: Towards Extending TEAM CoSTAR's Solution to Larger Scale Environments
Authors:
Ali Agha,
Kyohei Otsu,
Benjamin Morrell,
David D. Fan,
Sung-Kyun Kim,
Muhammad Fadhil Ginting,
Xianmei Lei,
Jeffrey Edlund,
Seyed Fakoorian,
Amanda Bouman,
Fernando Chavez,
Taeyeon Kim,
Gustavo J. Correa,
Maira Saboia,
Angel Santamaria-Navarro,
Brett Lopez,
Boseong Kim,
Chanyoung Jung,
Mamoru Sobue,
Oriana Claudia Peltzer,
Joshua Ott,
Robert Trybula,
Thomas Touma,
Marcel Kaufmann,
Tiago Stegun Vaquero
, et al. (64 additional authors not shown)
Abstract:
This paper presents an appendix to the original NeBula autonomy solution developed by the TEAM CoSTAR (Collaborative SubTerranean Autonomous Robots), participating in the DARPA Subterranean Challenge. Specifically, this paper presents extensions to NeBula's hardware, software, and algorithmic components that focus on increasing the range and scale of the exploration environment. From the algorithmic perspective, we discuss the following extensions to the original NeBula framework: (i) large-scale geometric and semantic environment mapping; (ii) an adaptive positioning system; (iii) probabilistic traversability analysis and local planning; (iv) large-scale POMDP-based global motion planning and exploration behavior; (v) large-scale networking and decentralized reasoning; (vi) communication-aware mission planning; and (vii) multi-modal ground-aerial exploration solutions. We demonstrate the application and deployment of the presented systems and solutions in various large-scale underground environments, including limestone mine exploration scenarios as well as deployment in the DARPA Subterranean challenge.
Submitted 18 April, 2025;
originally announced April 2025.
-
Cellular-X: An LLM-empowered Cellular Agent for Efficient Base Station Operations
Authors:
Liujianfu Wang,
Xinyi Long,
Yuyang Du,
Xiaoyan Liu,
Kexin Chen,
Soung Chang Liew
Abstract:
This paper introduces Cellular-X, an LLM-powered agent designed to automate cellular base station (BS) maintenance. Leveraging multimodal LLM and retrieval-augmented generation (RAG) techniques, Cellular-X significantly enhances field engineer efficiency by quickly interpreting user intents, retrieving relevant technical information, and configuring a BS through iterative self-correction. Key features of the demo include automatic customized BS setup, document-based query answering, and voice-controlled configuration reporting and revision. We implemented Cellular-X on a USRP X310 testbed for demonstration. Demo videos and implementation details are available at https://github.com/SeaBreezing/Cellular-X.
Submitted 10 April, 2025;
originally announced April 2025.
-
NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement: Methods and Results
Authors:
Xin Li,
Kun Yuan,
Bingchen Li,
Fengbin Guan,
Yizhen Shao,
Zihao Yu,
Xijun Wang,
Yiting Lu,
Wei Luo,
Suhang Yao,
Ming Sun,
Chao Zhou,
Zhibo Chen,
Radu Timofte,
Yabin Zhang,
Ao-Xiang Zhang,
Tianwu Zhi,
Jianzhao Liu,
Yang Li,
Jingwen Xu,
Yiting Liao,
Yushen Zuo,
Mingyang Wu,
Renjie Li,
Shengyun Zhong
, et al. (88 additional authors not shown)
Abstract:
This paper presents a review of the NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement. The challenge comprises two tracks: (i) Efficient Video Quality Assessment (KVQ), and (ii) Diffusion-based Image Super-Resolution (KwaiSR). Track 1 aims to advance the development of lightweight and efficient video quality assessment (VQA) models, with an emphasis on eliminating reliance on model ensembles, redundant weights, and other computationally expensive components in the previous IQA/VQA competitions. Track 2 introduces a new short-form UGC dataset tailored for single image super-resolution, i.e., the KwaiSR dataset. It consists of 1,800 synthetically generated S-UGC image pairs and 1,900 real-world S-UGC images, which are split into training, validation, and test sets using a ratio of 8:1:1. The primary objective of the challenge is to drive research that benefits the user experience of short-form UGC platforms such as Kwai and TikTok. This challenge attracted 266 participants and received 18 valid final submissions with corresponding fact sheets, significantly contributing to the progress of short-form UGC VQA and image super-resolution. The project is publicly available at https://github.com/lixinustc/KVQE-ChallengeCVPR-NTIRE2025.
Submitted 17 April, 2025;
originally announced April 2025.
-
Efficient Medical Image Restoration via Reliability Guided Learning in Frequency Domain
Authors:
Pengcheng Zheng,
Kecheng Chen,
Jiaxin Huang,
Bohao Chen,
Ju Liu,
Yazhou Ren,
Xiaorong Pu
Abstract:
Medical image restoration tasks aim to recover high-quality images from degraded observations and are in urgent demand in many clinical scenarios, such as low-dose CT image denoising, MRI super-resolution, and MRI artifact removal. Despite the success achieved by existing deep learning-based restoration methods with sophisticated modules, they struggle to produce reconstruction results in a computationally efficient way. Moreover, they usually ignore the reliability of the restoration results, which is much more urgent in medical systems. To alleviate these issues, we present LRformer, a Lightweight Transformer-based method via Reliability-guided learning in the frequency domain. Specifically, inspired by the uncertainty quantification in Bayesian neural networks (BNNs), we develop a Reliable Lesion-Semantic Prior Producer (RLPP). RLPP leverages Monte Carlo (MC) estimators with stochastic sampling operations to generate sufficiently reliable priors by performing multiple inferences on the foundational medical image segmentation model, MedSAM. Additionally, instead of directly incorporating the priors in the spatial domain, we decompose the cross-attention (CA) mechanism into real symmetric and imaginary anti-symmetric parts via fast Fourier transform (FFT), resulting in the design of the Guided Frequency Cross-Attention (GFCA) solver. By leveraging the conjugate symmetry property of FFT, GFCA reduces the computational complexity of naive CA by nearly half. Extensive experimental results in various tasks demonstrate the superiority of the proposed LRformer in both effectiveness and efficiency.
Submitted 15 April, 2025;
originally announced April 2025.
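The LRformer abstract above relies on the conjugate symmetry of the Fourier transform of real-valued inputs to roughly halve computation. The sketch below demonstrates only that underlying property with a naive discrete Fourier transform; it is not the GFCA solver, and the example signal is hypothetical. For a real signal of length $N$, bin $N-k$ is the complex conjugate of bin $k$, so only $\lfloor N/2\rfloor+1$ bins carry independent information.

```python
import cmath
from typing import List


def dft(x: List[float]) -> List[complex]:
    """Naive O(N^2) discrete Fourier transform of a real sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


signal = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 4.0]
spectrum = dft(signal)
n = len(signal)

# Check X[N-k] == conj(X[k]) for every bin (index 0 pairs with itself).
symmetric = all(
    abs(spectrum[(n - k) % n] - spectrum[k].conjugate()) < 1e-9
    for k in range(n)
)

# Bins 0..N/2 are enough to reconstruct the remaining ones.
unique_bins = n // 2 + 1
print(symmetric, unique_bins, n)
```

For this length-8 signal, 5 of the 8 frequency bins suffice, which is the sense in which frequency-domain operations on real inputs can skip close to half of the work.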
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Authors:
Jinguo Zhu,
Weiyun Wang,
Zhe Chen,
Zhaoyang Liu,
Shenglong Ye,
Lixin Gu,
Hao Tian,
Yuchen Duan,
Weijie Su,
Jie Shao,
Zhangwei Gao,
Erfei Cui,
Xuehui Wang,
Yue Cao,
Yangzhou Liu,
Xingguang Wei,
Hongjie Zhang,
Haomin Wang,
Weiye Xu,
Hao Li,
Jiahao Wang,
Nianchen Deng,
Songze Li,
Yinan He,
Tan Jiang
, et al. (26 additional authors not shown)
Abstract:
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multimodal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state of the art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
Submitted 18 April, 2025; v1 submitted 14 April, 2025;
originally announced April 2025.
-
MigGPT: Harnessing Large Language Models for Automated Migration of Out-of-Tree Linux Kernel Patches Across Versions
Authors:
Pucheng Dang,
Di Huang,
Dong Li,
Kang Chen,
Yuanbo Wen,
Qi Guo,
Xing Hu,
Ninghui Sun
Abstract:
Out-of-tree kernel patches are essential for adapting the Linux kernel to new hardware or enabling specific functionalities. Maintaining and updating these patches across different kernel versions demands significant effort from experienced engineers. Large language models (LLMs) have shown remarkable progress across various domains, suggesting their potential for automating out-of-tree kernel patch migration. However, our findings reveal that LLMs, while promising, struggle with incomplete code context understanding and inaccurate migration point identification. In this work, we propose MigGPT, a framework that employs a novel code fingerprint structure to retain code snippet information and incorporates three meticulously designed modules to improve the migration accuracy and efficiency of out-of-tree kernel patches. Furthermore, we establish a robust benchmark using real-world out-of-tree kernel patch projects to evaluate LLM capabilities. Evaluations show that MigGPT significantly outperforms the direct application of vanilla LLMs, achieving an average completion rate of 72.59% (50.74% improvement) for migration tasks.
Submitted 13 April, 2025;
originally announced April 2025.