-
CAFE: Retrieval Head-based Coarse-to-Fine Information Seeking to Enhance Multi-Document QA Capability
Authors:
Han Peng,
Jinhao Jiang,
Zican Dong,
Wayne Xin Zhao,
Lei Fang
Abstract:
Advancements in Large Language Models (LLMs) have extended their input context length, yet they still struggle with retrieval and reasoning in long-context inputs. Existing methods propose to utilize the prompt strategy and retrieval head to alleviate this limitation. However, they still face challenges in balancing retrieval precision and recall, impacting their efficacy in answering questions. T…
▽ More
Advancements in Large Language Models (LLMs) have extended their input context length, yet they still struggle with retrieval and reasoning in long-context inputs. Existing methods propose to utilize the prompt strategy and retrieval head to alleviate this limitation. However, they still face challenges in balancing retrieval precision and recall, impacting their efficacy in answering questions. To address this, we introduce $\textbf{CAFE}$, a two-stage coarse-to-fine method to enhance multi-document question-answering capacities. By gradually eliminating the negative impacts of background and distracting documents, CAFE makes the responses more reliant on the evidence documents. Initially, a coarse-grained filtering method leverages retrieval heads to identify and rank relevant documents. Then, a fine-grained steering method guides attention to the most relevant content. Experiments across benchmarks show CAFE outperforms baselines, achieving up to 22.1% and 13.7% SubEM improvement over SFT and RAG methods on the Mistral model, respectively.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
High-Temperature Fermionic Gibbs States are Mixtures of Gaussian States
Authors:
Akshar Ramkumar,
Yiyi Cai,
Yu Tong,
Jiaqing Jiang
Abstract:
Efficient simulation of a quantum system generally relies on structural properties of the quantum state. Motivated by the recent results by Bakshi et al. on the sudden death of entanglement in high-temperature Gibbs states of quantum spin systems, we study the high-temperature Gibbs states of bounded-degree local fermionic Hamiltonians, which include the special case of geometrically local fermion…
▽ More
Efficient simulation of a quantum system generally relies on structural properties of the quantum state. Motivated by the recent results by Bakshi et al. on the sudden death of entanglement in high-temperature Gibbs states of quantum spin systems, we study the high-temperature Gibbs states of bounded-degree local fermionic Hamiltonians, which include the special case of geometrically local fermionic systems. We prove that at a sufficiently high temperature that is independent of the system size, the Gibbs state is a probabilistic mixture of fermionic Gaussian states. This forms the basis of an efficient classical algorithm to prepare the Gibbs state by sampling from a distribution of fermionic Gaussian states.
△ Less
Submitted 14 May, 2025;
originally announced May 2025.
-
How Students Use AI Feedback Matters: Experimental Evidence on Physics Achievement and Autonomy
Authors:
Xusheng Dai,
Zhaochun Wen,
Jianxiao Jiang,
Huiqin Liu,
Yu Zhang
Abstract:
Despite the precision and adaptiveness of generative AI (GAI)-powered feedback provided to students, existing practice and literature might ignore how usage patterns impact student learning. This study examines the heterogeneous effects of GAI-powered personalized feedback on high school students' physics achievement and autonomy through two randomized controlled trials, with a major focus on usag…
▽ More
Despite the precision and adaptiveness of generative AI (GAI)-powered feedback provided to students, existing practice and literature might ignore how usage patterns impact student learning. This study examines the heterogeneous effects of GAI-powered personalized feedback on high school students' physics achievement and autonomy through two randomized controlled trials, with a major focus on usage patterns. Each experiment lasted for five weeks, involving a total of 387 students. Experiment 1 (n = 121) assessed compulsory usage of the personalized recommendation system, revealing that low-achieving students significantly improved academic performance (d = 0.673, p < 0.05) when receiving AI-generated heuristic solution hints, whereas medium-achieving students' performance declined (d = -0.539, p < 0.05) with conventional answers provided by workbook. Notably, high-achieving students experienced a significant decline in self-regulated learning (d = -0.477, p < 0.05) without any significant gains in achievement. Experiment 2 (n = 266) investigated the usage pattern of autonomous on-demand help, demonstrating that fully learner-controlled AI feedback significantly enhanced academic performance for high-achieving students (d = 0.378, p < 0.05) without negatively impacting their autonomy. However, autonomy notably declined among lower achievers exposed to on-demand AI interventions (d = -0.383, p < 0.05), particularly in the technical-psychological dimension (d = -0.549, p < 0.05), which has a large overlap with self-regulation. These findings underscore the importance of usage patterns when applying GAI-powered personalized feedback to students.
△ Less
Submitted 15 May, 2025; v1 submitted 13 May, 2025;
originally announced May 2025.
-
Guiding LLM-based Smart Contract Generation with Finite State Machine
Authors:
Hao Luo,
Yuhao Lin,
Xiao Yan,
Xintong Hu,
Yuxiang Wang,
Qiming Zeng,
Hao Wang,
Jiawei Jiang
Abstract:
Smart contract is a kind of self-executing code based on blockchain technology with a wide range of application scenarios, but the traditional generation method relies on manual coding and expert auditing, which has a high threshold and low efficiency. Although Large Language Models (LLMs) show great potential in programming tasks, they still face challenges in smart contract generation w.r.t. eff…
▽ More
Smart contract is a kind of self-executing code based on blockchain technology with a wide range of application scenarios, but the traditional generation method relies on manual coding and expert auditing, which has a high threshold and low efficiency. Although Large Language Models (LLMs) show great potential in programming tasks, they still face challenges in smart contract generation w.r.t. effectiveness and security. To solve these problems, we propose FSM-SCG, a smart contract generation framework based on finite state machine (FSM) and LLMs, which significantly improves the quality of the generated code by abstracting user requirements to generate FSM, guiding LLMs to generate smart contracts, and iteratively optimizing the code with the feedback of compilation and security checks. The experimental results show that FSM-SCG significantly improves the quality of smart contract generation. Compared to the best baseline, FSM-SCG improves the compilation success rate of generated smart contract code by at most 48%, and reduces the average vulnerability risk score by approximately 68%.
△ Less
Submitted 13 May, 2025;
originally announced May 2025.
-
Exploiting Text Semantics for Few and Zero Shot Node Classification on Text-attributed Graph
Authors:
Yuxiang Wang,
Xiao Yan,
Shiyu Jin,
Quanqing Xu,
Chuang Hu,
Yuanyuan Zhu,
Bo Du,
Jia Wu,
Jiawei Jiang
Abstract:
Text-attributed graph (TAG) provides a text description for each graph node, and few- and zero-shot node classification on TAGs have many applications in fields such as academia and social networks. Existing work utilizes various graph-based augmentation techniques to train the node and text embeddings, while text-based augmentations are largely unexplored. In this paper, we propose Text Semantics…
▽ More
Text-attributed graph (TAG) provides a text description for each graph node, and few- and zero-shot node classification on TAGs have many applications in fields such as academia and social networks. Existing work utilizes various graph-based augmentation techniques to train the node and text embeddings, while text-based augmentations are largely unexplored. In this paper, we propose Text Semantics Augmentation (TSA) to improve accuracy by introducing more text semantic supervision signals. Specifically, we design two augmentation techniques, i.e., positive semantics matching and negative semantics contrast, to provide more reference texts for each graph node or text description. Positive semantic matching retrieves texts with similar embeddings to match with a graph node. Negative semantic contrast adds a negative prompt to construct a text description with the opposite semantics, which is contrasted with the original node and text. We evaluate TSA on 5 datasets and compare with 13 state-of-the-art baselines. The results show that TSA consistently outperforms all baselines, and its accuracy improvements over the best-performing baseline are usually over 5%.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
ABS-Mamba: SAM2-Driven Bidirectional Spiral Mamba Network for Medical Image Translation
Authors:
Feng Yuan,
Yifan Gao,
Wenbin Wu,
Keqing Wu,
Xiaotong Guo,
Jie Jiang,
Xin Gao
Abstract:
Accurate multi-modal medical image translation requires ha-rmonizing global anatomical semantics and local structural fidelity, a challenge complicated by intermodality information loss and structural distortion. We propose ABS-Mamba, a novel architecture integrating the Segment Anything Model 2 (SAM2) for organ-aware semantic representation, specialized convolutional neural networks (CNNs) for pr…
▽ More
Accurate multi-modal medical image translation requires ha-rmonizing global anatomical semantics and local structural fidelity, a challenge complicated by intermodality information loss and structural distortion. We propose ABS-Mamba, a novel architecture integrating the Segment Anything Model 2 (SAM2) for organ-aware semantic representation, specialized convolutional neural networks (CNNs) for preserving modality-specific edge and texture details, and Mamba's selective state-space modeling for efficient long- and short-range feature dependencies. Structurally, our dual-resolution framework leverages SAM2's image encoder to capture organ-scale semantics from high-resolution inputs, while a parallel CNNs branch extracts fine-grained local features. The Robust Feature Fusion Network (RFFN) integrates these epresentations, and the Bidirectional Mamba Residual Network (BMRN) models spatial dependencies using spiral scanning and bidirectional state-space dynamics. A three-stage skip fusion decoder enhances edge and texture fidelity. We employ Efficient Low-Rank Adaptation (LoRA+) fine-tuning to enable precise domain specialization while maintaining the foundational capabilities of the pre-trained components. Extensive experimental validation on the SynthRAD2023 and BraTS2019 datasets demonstrates that ABS-Mamba outperforms state-of-the-art methods, delivering high-fidelity cross-modal synthesis that preserves anatomical semantics and structural details to enhance diagnostic accuracy in clinical applications. The code is available at https://github.com/gatina-yone/ABS-Mamba
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
Generative Pre-trained Autoregressive Diffusion Transformer
Authors:
Yuan Zhang,
Jiacheng Jiang,
Guoqing Ma,
Zhiying Lu,
Haoyang Huang,
Jianlong Yuan,
Nan Duan
Abstract:
In this work, we present GPDiT, a Generative Pre-trained Autoregressive Diffusion Transformer that unifies the strengths of diffusion and autoregressive modeling for long-range video synthesis, within a continuous latent space. Instead of predicting discrete tokens, GPDiT autoregressively predicts future latent frames using a diffusion loss, enabling natural modeling of motion dynamics and semanti…
▽ More
In this work, we present GPDiT, a Generative Pre-trained Autoregressive Diffusion Transformer that unifies the strengths of diffusion and autoregressive modeling for long-range video synthesis, within a continuous latent space. Instead of predicting discrete tokens, GPDiT autoregressively predicts future latent frames using a diffusion loss, enabling natural modeling of motion dynamics and semantic consistency across frames. This continuous autoregressive framework not only enhances generation quality but also endows the model with representation capabilities. Additionally, we introduce a lightweight causal attention variant and a parameter-free rotation-based time-conditioning mechanism, improving both the training and inference efficiency. Extensive experiments demonstrate that GPDiT achieves strong performance in video generation quality, video representation ability, and few-shot learning tasks, highlighting its potential as an effective framework for video modeling in continuous space.
△ Less
Submitted 15 May, 2025; v1 submitted 12 May, 2025;
originally announced May 2025.
-
PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications
Authors:
Kuntai Du,
Bowen Wang,
Chen Zhang,
Yiming Cheng,
Qing Lan,
Hejian Sang,
Yihua Cheng,
Jiayi Yao,
Xiaoxuan Liu,
Yifan Qiao,
Ion Stoica,
Junchen Jiang
Abstract:
Besides typical generative applications, like ChatGPT, GitHub Copilot, and Cursor, we observe an emerging trend that LLMs are increasingly used in traditional discriminative tasks, such as recommendation, credit verification, and data labeling. The key characteristic of these emerging use cases is that the LLM generates only a single output token, rather than an arbitrarily long sequence of tokens…
▽ More
Besides typical generative applications, like ChatGPT, GitHub Copilot, and Cursor, we observe an emerging trend that LLMs are increasingly used in traditional discriminative tasks, such as recommendation, credit verification, and data labeling. The key characteristic of these emerging use cases is that the LLM generates only a single output token, rather than an arbitrarily long sequence of tokens. We call this prefill-only workload. However, since existing LLM engines assume arbitrary output lengths, they fail to leverage the unique properties of prefill-only workloads. In this paper, we present PrefillOnly, the first LLM inference engine that improves the inference throughput and latency by fully embracing the properties of prefill-only workloads. First, since it generates only one token, PrefillOnly only needs to store the KV cache of only the last computed layer, rather than of all layers. This drastically reduces the GPU memory footprint of LLM inference and allows handling long inputs without using solutions that reduces throughput, such as cross-GPU KV cache parallelization. Second, because the output length is fixed, rather than arbitrary, PrefillOnly can precisely determine the job completion time (JCT) of each prefill-only request before it starts. This enables efficient JCT-aware scheduling policies such as shortest remaining job first. PrefillOnly can process upto 4x larger queries per second without inflating average and P99 latency.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
Seed1.5-VL Technical Report
Authors:
Dong Guo,
Faming Wu,
Feida Zhu,
Fuxing Leng,
Guang Shi,
Haobin Chen,
Haoqi Fan,
Jian Wang,
Jianyu Jiang,
Jiawei Wang,
Jingji Chen,
Jingjia Huang,
Kang Lei,
Liping Yuan,
Lishu Luo,
Pengfei Liu,
Qinghao Ye,
Rui Qian,
Shen Yan,
Shixiong Zhao,
Shuai Peng,
Shuangye Li,
Sihang Yuan,
Sijin Wu,
Tianheng Cheng
, et al. (172 additional authors not shown)
Abstract:
We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluati…
▽ More
We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving the state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428)
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
Technical Report for ICRA 2025 GOOSE 2D Semantic Segmentation Challenge: Leveraging Color Shift Correction, RoPE-Swin Backbone, and Quantile-based Label Denoising Strategy for Robust Outdoor Scene Understanding
Authors:
Chih-Chung Hsu,
I-Hsuan Wu,
Wen-Hai Tseng,
Ching-Heng Cheng,
Ming-Hsuan Wu,
Jin-Hui Jiang,
Yu-Jou Hsiao
Abstract:
This report presents our semantic segmentation framework developed by team ACVLAB for the ICRA 2025 GOOSE 2D Semantic Segmentation Challenge, which focuses on parsing outdoor scenes into nine semantic categories under real-world conditions. Our method integrates a Swin Transformer backbone enhanced with Rotary Position Embedding (RoPE) for improved spatial generalization, alongside a Color Shift E…
▽ More
This report presents our semantic segmentation framework developed by team ACVLAB for the ICRA 2025 GOOSE 2D Semantic Segmentation Challenge, which focuses on parsing outdoor scenes into nine semantic categories under real-world conditions. Our method integrates a Swin Transformer backbone enhanced with Rotary Position Embedding (RoPE) for improved spatial generalization, alongside a Color Shift Estimation-and-Correction module designed to compensate for illumination inconsistencies in natural environments. To further improve training stability, we adopt a quantile-based denoising strategy that downweights the top 2.5\% of highest-error pixels, treating them as noise and suppressing their influence during optimization. Evaluated on the official GOOSE test set, our approach achieved a mean Intersection over Union (mIoU) of 0.848, demonstrating the effectiveness of combining color correction, positional encoding, and error-aware denoising in robust semantic segmentation.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
Embodied Intelligence: The Key to Unblocking Generalized Artificial Intelligence
Authors:
Jinhao Jiang,
Changlin Chen,
Shile Feng,
Wanru Geng,
Zesheng Zhou,
Ni Wang,
Shuai Li,
Feng-Qi Cui,
Erbao Dong
Abstract:
The ultimate goal of artificial intelligence (AI) is to achieve Artificial General Intelligence (AGI). Embodied Artificial Intelligence (EAI), which involves intelligent systems with physical presence and real-time interaction with the environment, has emerged as a key research direction in pursuit of AGI. While advancements in deep learning, reinforcement learning, large-scale language models, an…
▽ More
The ultimate goal of artificial intelligence (AI) is to achieve Artificial General Intelligence (AGI). Embodied Artificial Intelligence (EAI), which involves intelligent systems with physical presence and real-time interaction with the environment, has emerged as a key research direction in pursuit of AGI. While advancements in deep learning, reinforcement learning, large-scale language models, and multimodal technologies have significantly contributed to the progress of EAI, most existing reviews focus on specific technologies or applications. A systematic overview, particularly one that explores the direct connection between EAI and AGI, remains scarce. This paper examines EAI as a foundational approach to AGI, systematically analyzing its four core modules: perception, intelligent decision-making, action, and feedback. We provide a detailed discussion of how each module contributes to the six core principles of AGI. Additionally, we discuss future trends, challenges, and research directions in EAI, emphasizing its potential as a cornerstone for AGI development. Our findings suggest that EAI's integration of dynamic learning and real-world interaction is essential for bridging the gap between narrow AI and AGI.
△ Less
Submitted 11 May, 2025;
originally announced May 2025.
-
BrainSegDMlF: A Dynamic Fusion-enhanced SAM for Brain Lesion Segmentation
Authors:
Hongming Wang,
Yifeng Wu,
Huimin Huang,
Hongtao Wu,
Jia-Xuan Jiang,
Xiaodong Zhang,
Hao Zheng,
Xian Wu,
Yefeng Zheng,
Jinping Xu,
Jing Cheng
Abstract:
The segmentation of substantial brain lesions is a significant and challenging task in the field of medical image segmentation. Substantial brain lesions in brain imaging exhibit high heterogeneity, with indistinct boundaries between lesion regions and normal brain tissue. Small lesions in single slices are difficult to identify, making the accurate and reproducible segmentation of abnormal region…
▽ More
The segmentation of substantial brain lesions is a significant and challenging task in the field of medical image segmentation. Substantial brain lesions in brain imaging exhibit high heterogeneity, with indistinct boundaries between lesion regions and normal brain tissue. Small lesions in single slices are difficult to identify, making the accurate and reproducible segmentation of abnormal regions, as well as their feature description, highly complex. Existing methods have the following limitations: 1) They rely solely on single-modal information for learning, neglecting the multi-modal information commonly used in diagnosis. This hampers the ability to comprehensively acquire brain lesion information from multiple perspectives and prevents the effective integration and utilization of multi-modal data inputs, thereby limiting a holistic understanding of lesions. 2) They are constrained by the amount of data available, leading to low sensitivity to small lesions and difficulty in detecting subtle pathological changes. 3) Current SAM-based models rely on external prompts, which cannot achieve automatic segmentation and, to some extent, affect diagnostic efficiency.To address these issues, we have developed a large-scale fully automated segmentation model specifically designed for brain lesion segmentation, named BrainSegDMLF. This model has the following features: 1) Dynamic Modal Interactive Fusion (DMIF) module that processes and integrates multi-modal data during the encoding process, providing the SAM encoder with more comprehensive modal information. 2) Layer-by-Layer Upsampling Decoder, enabling the model to extract rich low-level and high-level features even with limited data, thereby detecting the presence of small lesions. 3) Automatic segmentation masks, allowing the model to generate lesion masks automatically without requiring manual prompts.
△ Less
Submitted 9 May, 2025;
originally announced May 2025.
-
Nonlinear Motion-Guided and Spatio-Temporal Aware Network for Unsupervised Event-Based Optical Flow
Authors:
Zuntao Liu,
Hao Zhuang,
Junjie Jiang,
Yuhang Song,
Zheng Fang
Abstract:
Event cameras have the potential to capture continuous motion information over time and space, making them well-suited for optical flow estimation. However, most existing learning-based methods for event-based optical flow adopt frame-based techniques, ignoring the spatio-temporal characteristics of events. Additionally, these methods assume linear motion between consecutive events within the loss…
▽ More
Event cameras have the potential to capture continuous motion information over time and space, making them well-suited for optical flow estimation. However, most existing learning-based methods for event-based optical flow adopt frame-based techniques, ignoring the spatio-temporal characteristics of events. Additionally, these methods assume linear motion between consecutive events within the loss time window, which increases optical flow errors in long-time sequences. In this work, we observe that rich spatio-temporal information and accurate nonlinear motion between events are crucial for event-based optical flow estimation. Therefore, we propose E-NMSTFlow, a novel unsupervised event-based optical flow network focusing on long-time sequences. We propose a Spatio-Temporal Motion Feature Aware (STMFA) module and an Adaptive Motion Feature Enhancement (AMFE) module, both of which utilize rich spatio-temporal information to learn spatio-temporal data associations. Meanwhile, we propose a nonlinear motion compensation loss that utilizes the accurate nonlinear motion between events to improve the unsupervised learning of our network. Extensive experiments demonstrate the effectiveness and superiority of our method. Remarkably, our method ranks first among unsupervised learning methods on the MVSEC and DSEC-Flow datasets. Our project page is available at https://wynelio.github.io/E-NMSTFlow.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
Boosting Statistic Learning with Synthetic Data from Pretrained Large Models
Authors:
Jialong Jiang,
Wenkang Hu,
Jian Huang,
Yuling Jiao,
Xu Liu
Abstract:
The rapid advancement of generative models, such as Stable Diffusion, raises a key question: how can synthetic data from these models enhance predictive modeling? While they can generate vast amounts of datasets, only a subset meaningfully improves performance. We propose a novel end-to-end framework that generates and systematically filters synthetic data through domain-specific statistical metho…
▽ More
The rapid advancement of generative models, such as Stable Diffusion, raises a key question: how can synthetic data from these models enhance predictive modeling? While they can generate vast amounts of datasets, only a subset meaningfully improves performance. We propose a novel end-to-end framework that generates and systematically filters synthetic data through domain-specific statistical methods, selectively integrating high-quality samples for effective augmentation. Our experiments demonstrate consistent improvements in predictive performance across various settings, highlighting the potential of our framework while underscoring the inherent limitations of generative models for data augmentation. Despite the ability to produce large volumes of synthetic data, the proportion that effectively improves model performance is limited.
△ Less
Submitted 8 May, 2025;
originally announced May 2025.
-
RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference
Authors:
Yaoqi Chen,
Jinkai Zhang,
Baotong Lu,
Qianxi Zhang,
Chengruidong Zhang,
Jingjia Luo,
Di Liu,
Huiqiang Jiang,
Qi Chen,
Jing Liu,
Bailu Ding,
Xiao Yan,
Jiawei Jiang,
Chen Chen,
Mingxing Zhang,
Yuqing Yang,
Fan Yang,
Mao Yang
Abstract:
The growing context lengths of large language models (LLMs) pose significant challenges for efficient inference, primarily due to GPU memory and bandwidth constraints. We present RetroInfer, a novel system that reconceptualizes the key-value (KV) cache as a vector storage system which exploits the inherent attention sparsity to accelerate long-context LLM inference. At its core is the wave index,…
▽ More
The growing context lengths of large language models (LLMs) pose significant challenges for efficient inference, primarily due to GPU memory and bandwidth constraints. We present RetroInfer, a novel system that reconceptualizes the key-value (KV) cache as a vector storage system which exploits the inherent attention sparsity to accelerate long-context LLM inference. At its core is the wave index, an Attention-aWare VEctor index that enables efficient and accurate retrieval of critical tokens through techniques such as tripartite attention approximation, accuracy-bounded attention estimation, and segmented clustering. Complementing this is the wave buffer, which coordinates KV cache placement and overlaps computation and data transfer across GPU and CPU to sustain high throughput. Unlike prior sparsity-based methods that struggle with token selection and hardware coordination, RetroInfer delivers robust performance without compromising model accuracy. Experiments on long-context benchmarks show up to 4.5X speedup over full attention within GPU memory limits and up to 10.5X over sparse attention baselines when KV cache is extended to CPU memory, all while preserving full-attention-level accuracy.
△ Less
Submitted 5 May, 2025;
originally announced May 2025.
-
Adaptive Bidding Policies for First-Price Auctions with Budget Constraints under Non-stationarity
Authors:
Yige Wang,
Jiashuo Jiang
Abstract:
We study how a budget-constrained bidder should learn to adaptively bid in repeated first-price auctions to maximize her cumulative payoff. This problem arose due to an industry-wide shift from second-price auctions to first-price auctions in display advertising recently, which renders truthful bidding (i.e., always bidding one's private value) no longer optimal. We propose a simple dual-gradient-…
▽ More
We study how a budget-constrained bidder should learn to adaptively bid in repeated first-price auctions to maximize her cumulative payoff. This problem arose due to an industry-wide shift from second-price auctions to first-price auctions in display advertising recently, which renders truthful bidding (i.e., always bidding one's private value) no longer optimal. We propose a simple dual-gradient-descent-based bidding policy that maintains a dual variable for budget constraint as the bidder consumes her budget. In analysis, we consider two settings regarding the bidder's knowledge of her private values in the future: (i) an uninformative setting where all the distributional knowledge (can be non-stationary) is entirely unknown to the bidder, and (ii) an informative setting where a prediction of the budget allocation in advance. We characterize the performance loss (or regret) relative to an optimal policy with complete information on the stochasticity. For uninformative setting, We show that the regret is \tilde{O}(\sqrt{T}) plus a variation term that reflects the non-stationarity of the value distributions, and this is of optimal order. We then show that we can get rid of the variation term with the help of the prediction; specifically, the regret is \tilde{O}(\sqrt{T}) plus the prediction error term in the informative setting.
△ Less
Submitted 5 May, 2025;
originally announced May 2025.
-
Advances in Automated Fetal Brain MRI Segmentation and Biometry: Insights from the FeTA 2024 Challenge
Authors:
Vladyslav Zalevskyi,
Thomas Sanchez,
Misha Kaandorp,
Margaux Roulet,
Diego Fajardo-Rojas,
Liu Li,
Jana Hutter,
Hongwei Bran Li,
Matthew Barkovich,
Hui Ji,
Luca Wilhelmi,
Aline Dändliker,
Céline Steger,
Mériam Koob,
Yvan Gomez,
Anton Jakovčić,
Melita Klaić,
Ana Adžić,
Pavel Marković,
Gracia Grabarić,
Milan Rados,
Jordina Aviles Verdera,
Gregor Kasprian,
Gregor Dovjak,
Raphael Gaubert-Rachmühl
, et al. (45 additional authors not shown)
Abstract:
Accurate fetal brain tissue segmentation and biometric analysis are essential for studying brain development in utero. The FeTA Challenge 2024 advanced automated fetal brain MRI analysis by introducing biometry prediction as a new task alongside tissue segmentation. For the first time, our diverse multi-centric test set included data from a new low-field (0.55T) MRI dataset. Evaluation metrics wer…
▽ More
Accurate fetal brain tissue segmentation and biometric analysis are essential for studying brain development in utero. The FeTA Challenge 2024 advanced automated fetal brain MRI analysis by introducing biometry prediction as a new task alongside tissue segmentation. For the first time, our diverse multi-centric test set included data from a new low-field (0.55T) MRI dataset. Evaluation metrics were also expanded to include the topology-specific Euler characteristic difference (ED). Sixteen teams submitted segmentation methods, most of which performed consistently across both high- and low-field scans. However, longitudinal trends indicate that segmentation accuracy may be reaching a plateau, with results now approaching inter-rater variability. The ED metric uncovered topological differences that were missed by conventional metrics, while the low-field dataset achieved the highest segmentation scores, highlighting the potential of affordable imaging systems when paired with high-quality reconstruction. Seven teams participated in the biometry task, but most methods failed to outperform a simple baseline that predicted measurements based solely on gestational age, underscoring the challenge of extracting reliable biometric estimates from image data alone. Domain shift analysis identified image quality as the most significant factor affecting model generalization, with super-resolution pipelines also playing a substantial role. Other factors, such as gestational age, pathology, and acquisition site, had smaller, though still measurable, effects. Overall, FeTA 2024 offers a comprehensive benchmark for multi-class segmentation and biometry estimation in fetal brain MRI, underscoring the need for data-centric approaches, improved topological evaluation, and greater dataset diversity to enable clinically robust and generalizable AI tools.
△ Less
Submitted 8 May, 2025; v1 submitted 5 May, 2025;
originally announced May 2025.
-
ProDisc-VAD: An Efficient System for Weakly-Supervised Anomaly Detection in Video Surveillance Applications
Authors:
Tao Zhu,
Qi Yu,
Xinru Dong,
Shiyu Li,
Yue Liu,
Jinlong Jiang,
Lei Shu
Abstract:
Weakly-supervised video anomaly detection (WS-VAD) using Multiple Instance Learning (MIL) suffers from label ambiguity, hindering discriminative feature learning. We propose ProDisc-VAD, an efficient framework tackling this via two synergistic components. The Prototype Interaction Layer (PIL) provides controlled normality modeling using a small set of learnable prototypes, establishing a robust ba…
▽ More
Weakly-supervised video anomaly detection (WS-VAD) using Multiple Instance Learning (MIL) suffers from label ambiguity, hindering discriminative feature learning. We propose ProDisc-VAD, an efficient framework tackling this via two synergistic components. The Prototype Interaction Layer (PIL) provides controlled normality modeling using a small set of learnable prototypes, establishing a robust baseline without being overwhelmed by dominant normal data. The Pseudo-Instance Discriminative Enhancement (PIDE) loss boosts separability by applying targeted contrastive learning exclusively to the most reliable extreme-scoring instances (highest/lowest scores). ProDisc-VAD achieves strong AUCs (97.98% ShanghaiTech, 87.12% UCF-Crime) using only 0.4M parameters, over 800x fewer than recent ViT-based methods like VadCLIP, demonstrating exceptional efficiency alongside state-of-the-art performance. Code is available at https://github.com/modadundun/ProDisc-VAD.
△ Less
Submitted 4 May, 2025;
originally announced May 2025.
-
PosePilot: Steering Camera Pose for Generative World Models with Self-supervised Depth
Authors:
Bu Jin,
Weize Li,
Baihan Yang,
Zhenxin Zhu,
Junpeng Jiang,
Huan-ang Gao,
Haiyang Sun,
Kun Zhan,
Hengtong Hu,
Xueyang Zhang,
Peng Jia,
Hao Zhao
Abstract:
Recent advancements in autonomous driving (AD) systems have highlighted the potential of world models in achieving robust and generalizable performance across both ordinary and challenging driving conditions. However, a key challenge remains: precise and flexible camera pose control, which is crucial for accurate viewpoint transformation and realistic simulation of scene dynamics. In this paper, w…
▽ More
Recent advancements in autonomous driving (AD) systems have highlighted the potential of world models in achieving robust and generalizable performance across both ordinary and challenging driving conditions. However, a key challenge remains: precise and flexible camera pose control, which is crucial for accurate viewpoint transformation and realistic simulation of scene dynamics. In this paper, we introduce PosePilot, a lightweight yet powerful framework that significantly enhances camera pose controllability in generative world models. Drawing inspiration from self-supervised depth estimation, PosePilot leverages structure-from-motion principles to establish a tight coupling between camera pose and video generation. Specifically, we incorporate self-supervised depth and pose readouts, allowing the model to infer depth and relative camera motion directly from video sequences. These outputs drive pose-aware frame warping, guided by a photometric warping loss that enforces geometric consistency across synthesized frames. To further refine camera pose estimation, we introduce a reverse warping step and a pose regression loss, improving viewpoint precision and adaptability. Extensive experiments on autonomous driving and general-domain video datasets demonstrate that PosePilot significantly enhances structural understanding and motion reasoning in both diffusion-based and auto-regressive world models. By steering camera pose with self-supervised depth, PosePilot sets a new benchmark for pose controllability, enabling physically consistent, reliable viewpoint synthesis in generative world models.
△ Less
Submitted 3 May, 2025;
originally announced May 2025.
-
2DXformer: Dual Transformers for Wind Power Forecasting with Dual Exogenous Variables
Authors:
Yajuan Zhang,
Jiahai Jiang,
Yule Yan,
Liang Yang,
Ping Zhang
Abstract:
Accurate wind power forecasting can help formulate scientific dispatch plans, which is of great significance for maintaining the safety, stability, and efficient operation of the power system. In recent years, wind power forecasting methods based on deep learning have focused on extracting the spatiotemporal correlations among data, achieving significant improvements in forecasting accuracy. Howev…
▽ More
Accurate wind power forecasting can help formulate scientific dispatch plans, which is of great significance for maintaining the safety, stability, and efficient operation of the power system. In recent years, wind power forecasting methods based on deep learning have focused on extracting the spatiotemporal correlations among data, achieving significant improvements in forecasting accuracy. However, they exhibit two limitations. First, there is a lack of modeling for the inter-variable relationships, which limits the accuracy of the forecasts. Second, by treating endogenous and exogenous variables equally, it leads to unnecessary interactions between the endogenous and exogenous variables, increasing the complexity of the model. In this paper, we propose the 2DXformer, which, building upon the previous work's focus on spatiotemporal correlations, addresses the aforementioned two limitations. Specifically, we classify the inputs of the model into three types: exogenous static variables, exogenous dynamic variables, and endogenous variables. First, we embed these variables as variable tokens in a channel-independent manner. Then, we use the attention mechanism to capture the correlations among exogenous variables. Finally, we employ a multi-layer perceptron with residual connections to model the impact of exogenous variables on endogenous variables. Experimental results on two real-world large-scale datasets indicate that our proposed 2DXformer can further improve the performance of wind power forecasting. The code is available in this repository: \href{https://github.com/jseaj/2DXformer}{https://github.com/jseaj/2DXformer}.
△ Less
Submitted 2 May, 2025;
originally announced May 2025.
-
RD-UIE: Relation-Driven State Space Modeling for Underwater Image Enhancement
Authors:
Kui Jiang,
Yan Luo,
Junjun Jiang,
Xin Xu,
Fei Ma,
Fei Yu
Abstract:
Underwater image enhancement (UIE) is a critical preprocessing step for marine vision applications, where wavelength-dependent attenuation causes severe content degradation and color distortion. While recent state space models like Mamba show potential for long-range dependency modeling, their unfolding operations and fixed scan paths on 1D sequences fail to adapt to local object semantics and glo…
▽ More
Underwater image enhancement (UIE) is a critical preprocessing step for marine vision applications, where wavelength-dependent attenuation causes severe content degradation and color distortion. While recent state space models like Mamba show potential for long-range dependency modeling, their unfolding operations and fixed scan paths on 1D sequences fail to adapt to local object semantics and global relation modeling, limiting their efficacy in complex underwater environments. To address this, we enhance conventional Mamba with the sorting-based scanning mechanism that dynamically reorders scanning sequences based on statistical distribution of spatial correlation of all pixels. In this way, it encourages the network to prioritize the most informative components--structural and semantic features. Upon building this mechanism, we devise a Visually Self-adaptive State Block (VSSB) that harmonizes dynamic sorting of Mamba with input-dependent dynamic convolution, enabling coherent integration of global context and local relational cues. This exquisite design helps eliminate global focus bias, especially for widely distributed contents, which greatly weakens the statistical frequency. For robust feature extraction and refinement, we design a cross-feature bridge (CFB) to adaptively fuse multi-scale representations. These efforts compose the novel relation-driven Mamba framework for effective UIE (RD-UIE). Extensive experiments on underwater enhancement benchmarks demonstrate RD-UIE outperforms the state-of-the-art approach WMamba in both quantitative metrics and visual fidelity, averagely achieving 0.55 dB performance gain on the three benchmarks. Our code is available at https://github.com/kkoucy/RD-UIE/tree/main
△ Less
Submitted 2 May, 2025;
originally announced May 2025.
-
Federated Adapter on Foundation Models: An Out-Of-Distribution Approach
Authors:
Yiyuan Yang,
Guodong Long,
Tianyi Zhou,
Qinghua Lu,
Shanshan Ye,
Jing Jiang
Abstract:
As foundation models gain prominence, Federated Foundation Models (FedFM) have emerged as a privacy-preserving approach to collaboratively fine-tune models in federated learning (FL) frameworks using distributed datasets across clients. A key challenge for FedFM, given the versatile nature of foundation models, is addressing out-of-distribution (OOD) generalization, where unseen tasks or clients m…
▽ More
As foundation models gain prominence, Federated Foundation Models (FedFM) have emerged as a privacy-preserving approach to collaboratively fine-tune models in federated learning (FL) frameworks using distributed datasets across clients. A key challenge for FedFM, given the versatile nature of foundation models, is addressing out-of-distribution (OOD) generalization, where unseen tasks or clients may exhibit distribution shifts leading to suboptimal performance. Although numerous studies have explored OOD generalization in conventional FL, these methods are inadequate for FedFM due to the challenges posed by large parameter scales and increased data heterogeneity. To address these, we propose FedOA, which employs adapter-based parameter-efficient fine-tuning methods for efficacy and introduces personalized adapters with feature distance-based regularization to align distributions and guarantee OOD generalization for each client. Theoretically, we demonstrate that the conventional aggregated global model in FedFM inherently retains OOD generalization capabilities, and our proposed method enhances the personalized model's OOD generalization through regularization informed by the global model, with proven convergence under general non-convex settings. Empirically, the effectiveness of the proposed method is validated on benchmark datasets across various NLP tasks.
△ Less
Submitted 2 May, 2025;
originally announced May 2025.
-
OSVBench: Benchmarking LLMs on Specification Generation Tasks for Operating System Verification
Authors:
Shangyu Li,
Juyong Jiang,
Tiancheng Zhao,
Jiasi Shen
Abstract:
We introduce OSVBench, a new benchmark for evaluating Large Language Models (LLMs) in generating complete specification code pertaining to operating system kernel verification tasks. The benchmark first defines the specification generation problem into a program synthesis problem within a confined scope of syntax and semantics by providing LLMs with the programming model. The LLMs are required to…
▽ More
We introduce OSVBench, a new benchmark for evaluating Large Language Models (LLMs) in generating complete specification code pertaining to operating system kernel verification tasks. The benchmark first defines the specification generation problem into a program synthesis problem within a confined scope of syntax and semantics by providing LLMs with the programming model. The LLMs are required to understand the provided verification assumption and the potential syntax and semantics space to search for, then generate the complete specification for the potentially buggy operating system code implementation under the guidance of the high-level functional description of the operating system. This benchmark is built upon a real-world operating system kernel, Hyperkernel, and consists of 245 complex specification generation tasks in total, each is a long context task of about 20k-30k tokens. Our comprehensive evaluation of 12 LLMs exhibits the limited performance of the current LLMs on the specification generation tasks for operating system verification. Significant disparities in their performance on the benchmark highlight differences in their ability to handle long-context code generation tasks. The evaluation toolkit and benchmark are available at https://github.com/lishangyu-hkust/OSVBench.
△ Less
Submitted 29 April, 2025;
originally announced April 2025.
-
Efficient Graph-Based Approximate Nearest Neighbor Search Achieving: Low Latency Without Throughput Loss
Authors:
Jingjia Luo,
Mingxing Zhang,
Kang Chen,
Xia Liao,
Yingdi Shan,
Jinlei Jiang,
Yongwei Wu
Abstract:
The increase in the dimensionality of neural embedding models has enhanced the accuracy of semantic search capabilities but also amplified the computational demands for Approximate Nearest Neighbor Searches (ANNS). This complexity poses significant challenges in online and interactive services, where query latency is a critical performance metric. Traditional graph-based ANNS methods, while effect…
▽ More
The increase in the dimensionality of neural embedding models has enhanced the accuracy of semantic search capabilities but also amplified the computational demands for Approximate Nearest Neighbor Searches (ANNS). This complexity poses significant challenges in online and interactive services, where query latency is a critical performance metric. Traditional graph-based ANNS methods, while effective for managing large datasets, often experience substantial throughput reductions when scaled for intra-query parallelism to minimize latency. This reduction is largely due to inherent inefficiencies in the conventional fork-join parallelism model.
To address this problem, we introduce AverSearch, a novel parallel graph-based ANNS framework that overcomes these limitations through a fully asynchronous architecture. Unlike existing frameworks that struggle with balancing latency and throughput, AverSearch utilizes a dynamic workload balancing mechanism that supports continuous, dependency-free processing. This approach not only minimizes latency by eliminating unnecessary synchronization and redundant vertex processing but also maintains high throughput levels. Our evaluations across various datasets, including both traditional benchmarks and modern large-scale model generated datasets, show that AverSearch consistently outperforms current state-of-the-art systems. It achieves up to 2.1-8.9 times higher throughput at comparable latency levels across different datasets and reduces minimum latency by 1.5 to 1.9 times.
△ Less
Submitted 30 April, 2025; v1 submitted 29 April, 2025;
originally announced April 2025.
-
RV-Syn: Rational and Verifiable Mathematical Reasoning Data Synthesis based on Structured Function Library
Authors:
Jiapeng Wang,
Jinhao Jiang,
Zhiqiang Zhang,
Jun Zhou,
Wayne Xin Zhao
Abstract:
The advancement of reasoning capabilities in Large Language Models (LLMs) requires substantial amounts of high-quality reasoning data, particularly in mathematics. Existing data synthesis methods, such as data augmentation from annotated training sets or direct question generation based on relevant knowledge points and documents, have expanded datasets but face challenges in mastering the inner lo…
▽ More
The advancement of reasoning capabilities in Large Language Models (LLMs) requires substantial amounts of high-quality reasoning data, particularly in mathematics. Existing data synthesis methods, such as data augmentation from annotated training sets or direct question generation based on relevant knowledge points and documents, have expanded datasets but face challenges in mastering the inner logic of the problem during generation and ensuring the verifiability of the solutions. To address these issues, we propose RV-Syn, a novel Rational and Verifiable mathematical Synthesis approach. RV-Syn constructs a structured mathematical operation function library based on initial seed problems and generates computational graphs as solutions by combining Python-formatted functions from this library. These graphs are then back-translated into complex problems. Based on the constructed computation graph, we achieve solution-guided logic-aware problem generation. Furthermore, the executability of the computational graph ensures the verifiability of the solving process. Experimental results show that RV-Syn surpasses existing synthesis methods, including those involving human-generated problems, achieving greater efficient data scaling. This approach provides a scalable framework for generating high-quality reasoning datasets.
△ Less
Submitted 29 April, 2025;
originally announced April 2025.
-
Terahertz Wireless Data Center: Gaussian Beam or Airy Beam?
Authors:
Wenqi Zhao,
Sergi Abadal,
Guochao Song,
Jiamo Jiang,
Chong Han
Abstract:
Terahertz (THz) communication is emerging as a pivotal enabler for 6G and beyond wireless systems owing to its multi-GHz bandwidth. One of its novel applications is in wireless data centers, where it enables ultra-high data rates while enhancing network reconfigurability and scalability. However, due to numerous racks, supporting walls, and densely deployed antennas, the line-of-sight (LoS) path i…
▽ More
Terahertz (THz) communication is emerging as a pivotal enabler for 6G and beyond wireless systems owing to its multi-GHz bandwidth. One of its novel applications is in wireless data centers, where it enables ultra-high data rates while enhancing network reconfigurability and scalability. However, due to numerous racks, supporting walls, and densely deployed antennas, the line-of-sight (LoS) path in data centers is often instead of fully obstructed, resulting in quasi-LoS propagation and degradation of spectral efficiency. To address this issue, Airy beam-based hybrid beamforming is investigated in this paper as a promising technique to mitigate quasi-LoS propagation and enhance spectral efficiency in THz wireless data centers. Specifically, a cascaded geometrical and wave-based channel model (CGWCM) is proposed for quasi-LoS scenarios, which accounts for diffraction effects while being more simplified than conventional wave-based model. Then, the characteristics and generation of the Airy beam are analyzed, and beam search methods for quasi-LoS scenarios are proposed, including hierarchical focusing-Airy beam search, and low-complexity beam search. Simulation results validate the effectiveness of the CGWCM and demonstrate the superiority of the Airy beam over Gaussian beams in mitigating blockages, verifying its potential for practical THz wireless communication in data centers.
△ Less
Submitted 29 April, 2025;
originally announced April 2025.
-
DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer
Authors:
Junpeng Jiang,
Gangyi Hong,
Miao Zhang,
Hengtong Hu,
Kun Zhan,
Rui Shao,
Liqiang Nie
Abstract:
Collecting multi-view driving scenario videos to enhance the performance of 3D visual perception tasks presents significant challenges and incurs substantial costs, making generative models for realistic data an appealing alternative. Yet, the videos generated by recent works suffer from poor quality and spatiotemporal consistency, undermining their utility in advancing perception tasks under driv…
▽ More
Collecting multi-view driving scenario videos to enhance the performance of 3D visual perception tasks presents significant challenges and incurs substantial costs, making generative models for realistic data an appealing alternative. Yet, the videos generated by recent works suffer from poor quality and spatiotemporal consistency, undermining their utility in advancing perception tasks under driving scenarios. To address this gap, we propose DiVE, a diffusion transformer-based generative framework meticulously engineered to produce high-fidelity, temporally coherent, and cross-view consistent multi-view videos, aligning seamlessly with bird's-eye view layouts and textual descriptions. DiVE leverages a unified cross-attention and a SketchFormer to exert precise control over multimodal data, while incorporating a view-inflated attention mechanism that adds no extra parameters, thereby guaranteeing consistency across views. Despite these advancements, synthesizing high-resolution videos under multimodal constraints introduces dual challenges: investigating the optimal classifier-free guidance coniguration under intricate multi-condition inputs and mitigating excessive computational latency in high-resolution rendering--both of which remain underexplored in prior researches. To resolve these limitations, we introduce two innovations: Multi-Control Auxiliary Branch Distillation, which streamlines multi-condition CFG selection while circumventing high computational overhead, and Resolution Progressive Sampling, a training-free acceleration strategy that staggers resolution scaling to reduce high latency due to high resolution. These innovations collectively achieve a 2.62x speedup with minimal quality degradation. Evaluated on the nuScenes dataset, DiVE achieves SOTA performance in multi-view video generation, yielding photorealistic outputs with exceptional temporal and cross-view coherence.
△ Less
Submitted 28 April, 2025;
originally announced April 2025.
-
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Authors:
Yixin Cao,
Shibo Hong,
Xinze Li,
Jiahao Ying,
Yubo Ma,
Haiyuan Liang,
Yantao Liu,
Zijun Yao,
Xiaozhi Wang,
Dan Huang,
Wenxuan Zhang,
Lifu Huang,
Muhao Chen,
Lei Hou,
Qianru Sun,
Xingjun Ma,
Zuxuan Wu,
Min-Yen Kan,
David Lo,
Qi Zhang,
Heng Ji,
Jing Jiang,
Juanzi Li,
Aixin Sun,
Xuanjing Huang
, et al. (2 additional authors not shown)
Abstract:
Large Language Models (LLMs) are advancing at an amazing speed and have become indispensable across academia, industry, and daily applications. To keep pace with the status quo, this survey probes the core challenges that the rise of LLMs poses for evaluation. We identify and analyze two pivotal transitions: (i) from task-specific to capability-based evaluation, which reorganizes benchmarks around…
▽ More
Large Language Models (LLMs) are advancing at an amazing speed and have become indispensable across academia, industry, and daily applications. To keep pace with the status quo, this survey probes the core challenges that the rise of LLMs poses for evaluation. We identify and analyze two pivotal transitions: (i) from task-specific to capability-based evaluation, which reorganizes benchmarks around core competencies such as knowledge, reasoning, instruction following, multi-modal understanding, and safety; and (ii) from manual to automated evaluation, encompassing dynamic dataset curation and "LLM-as-a-judge" scoring.
Yet, even with these transitions, a crucial obstacle persists: the evaluation generalization issue. Bounded test sets cannot scale alongside models whose abilities grow seemingly without limit. We will dissect this issue, along with the core challenges of the above two transitions, from the perspectives of methods, datasets, evaluators, and metrics. Due to the fast evolving of this field, we will maintain a living GitHub repository (links are in each section) to crowd-source updates and corrections, and warmly invite contributors and collaborators.
△ Less
Submitted 26 April, 2025;
originally announced April 2025.
-
Probabilistic Shaping in MIMO: Going Beyond 1.53dB AWGN Gain With the Non-Linear Demapper
Authors:
Kirill Ivanov,
Wei Yang,
Jing Jiang
Abstract:
Constellation shaping is a well-established method to improve upon a regular quadrature amplitude modulation (QAM). It is known that the gain achieved by any shaping method for an additive white Gaussian noise (AWGN) channel is upper-bounded by 1.53dB. However, the situation becomes less clear in the multiple-input and multiple-output (MIMO) setting.
In this paper, we study the application of pr…
▽ More
Constellation shaping is a well-established method to improve upon a regular quadrature amplitude modulation (QAM). It is known that the gain achieved by any shaping method for an additive white Gaussian noise (AWGN) channel is upper-bounded by 1.53dB. However, the situation becomes less clear in the multiple-input and multiple-output (MIMO) setting.
In this paper, we study the application of probabilistic shaping for MIMO channels. We utilize an efficient near-optimal demapper based on sphere decoding (SD) and demonstrate that it is possible to achieve more than 2dB gains, breaking the AWGN limit. It becomes possible because both signal and interference are shaped and the non-linear methods can capture this property and leverage on it to improve the demodulation performance.
△ Less
Submitted 25 April, 2025;
originally announced April 2025.
-
Evolution of Optimization Algorithms for Global Placement via Large Language Models
Authors:
Xufeng Yao,
Jiaxi Jiang,
Yuxuan Zhao,
Peiyu Liao,
Yibo Lin,
Bei Yu
Abstract:
Optimization algorithms are widely employed to tackle complex problems, but designing them manually is often labor-intensive and requires significant expertise. Global placement is a fundamental step in electronic design automation (EDA). While analytical approaches represent the state-of-the-art (SOTA) in global placement, their core optimization algorithms remain heavily dependent on heuristics…
▽ More
Optimization algorithms are widely employed to tackle complex problems, but designing them manually is often labor-intensive and requires significant expertise. Global placement is a fundamental step in electronic design automation (EDA). While analytical approaches represent the state-of-the-art (SOTA) in global placement, their core optimization algorithms remain heavily dependent on heuristics and customized components, such as initialization strategies, preconditioning methods, and line search techniques. This paper presents an automated framework that leverages large language models (LLM) to evolve optimization algorithms for global placement. We first generate diverse candidate algorithms using LLM through carefully crafted prompts. Then we introduce an LLM-based genetic flow to evolve selected candidate algorithms. The discovered optimization algorithms exhibit substantial performance improvements across many benchmarks. Specifically, Our design-case-specific discovered algorithms achieve average HPWL improvements of \textbf{5.05\%}, \text{5.29\%} and \textbf{8.30\%} on MMS, ISPD2005 and ISPD2019 benchmarks, and up to \textbf{17\%} improvements on individual cases. Additionally, the discovered algorithms demonstrate good generalization ability and are complementary to existing parameter-tuning methods.
△ Less
Submitted 18 April, 2025;
originally announced April 2025.
-
The Fourth Monocular Depth Estimation Challenge
Authors:
Anton Obukhov,
Matteo Poggi,
Fabio Tosi,
Ripudaman Singh Arora,
Jaime Spencer,
Chris Russell,
Simon Hadfield,
Richard Bowden,
Shuaihang Wang,
Zhenxin Ma,
Weijie Chen,
Baobei Xu,
Fengyu Sun,
Di Xie,
Jiang Zhu,
Mykola Lavreniuk,
Haining Guan,
Qun Wu,
Yupei Zeng,
Chao Lu,
Huanran Wang,
Guangyuan Zhou,
Haotian Zhang,
Jianxiong Wang,
Qiang Rao
, et al. (32 additional authors not shown)
Abstract:
This paper presents the results of the fourth edition of the Monocular Depth Estimation Challenge (MDEC), which focuses on zero-shot generalization to the SYNS-Patches benchmark, a dataset featuring challenging environments in both natural and indoor settings. In this edition, we revised the evaluation protocol to use least-squares alignment with two degrees of freedom to support disparity and aff…
▽ More
This paper presents the results of the fourth edition of the Monocular Depth Estimation Challenge (MDEC), which focuses on zero-shot generalization to the SYNS-Patches benchmark, a dataset featuring challenging environments in both natural and indoor settings. In this edition, we revised the evaluation protocol to use least-squares alignment with two degrees of freedom to support disparity and affine-invariant predictions. We also revised the baselines and included popular off-the-shelf methods: Depth Anything v2 and Marigold. The challenge received a total of 24 submissions that outperformed the baselines on the test set; 10 of these included a report describing their approach, with most leading methods relying on affine-invariant predictions. The challenge winners improved the 3D F-Score over the previous edition's best result, raising it from 22.58% to 23.05%.
△ Less
Submitted 24 April, 2025;
originally announced April 2025.
-
Representation Learning for Tabular Data: A Comprehensive Survey
Authors:
Jun-Peng Jiang,
Si-Yang Liu,
Hao-Run Cai,
Qile Zhou,
Han-Jia Ye
Abstract:
Tabular data, structured as rows and columns, is among the most prevalent data types in machine learning classification and regression applications. Models for learning from tabular data have continuously evolved, with Deep Neural Networks (DNNs) recently demonstrating promising results through their capability of representation learning. In this survey, we systematically introduce the field of ta…
▽ More
Tabular data, structured as rows and columns, is among the most prevalent data types in machine learning classification and regression applications. Models for learning from tabular data have continuously evolved, with Deep Neural Networks (DNNs) recently demonstrating promising results through their capability of representation learning. In this survey, we systematically introduce the field of tabular representation learning, covering the background, challenges, and benchmarks, along with the pros and cons of using DNNs. We organize existing methods into three main categories according to their generalization capabilities: specialized, transferable, and general models. Specialized models focus on tasks where training and evaluation occur within the same data distribution. We introduce a hierarchical taxonomy for specialized models based on the key aspects of tabular data -- features, samples, and objectives -- and delve into detailed strategies for obtaining high-quality feature- and sample-level representations. Transferable models are pre-trained on one or more datasets and subsequently fine-tuned on downstream tasks, leveraging knowledge acquired from homogeneous or heterogeneous sources, or even cross-modalities such as vision and language. General models, also known as tabular foundation models, extend this concept further, allowing direct application to downstream tasks without fine-tuning. We group these general models based on the strategies used to adapt across heterogeneous datasets. Additionally, we explore ensemble methods, which integrate the strengths of multiple tabular models. Finally, we discuss representative extensions of tabular learning, including open-environment tabular machine learning, multimodal learning with tabular data, and tabular understanding. More information can be found in the following repository: https://github.com/LAMDA-Tabular/Tabular-Survey.
△ Less
Submitted 17 April, 2025;
originally announced April 2025.
-
WALL-E 2.0: World Alignment by NeuroSymbolic Learning improves World Model-based LLM Agents
Authors:
Siyu Zhou,
Tianyi Zhou,
Yijun Yang,
Guodong Long,
Deheng Ye,
Jing Jiang,
Chengqi Zhang
Abstract:
Can we build accurate world models out of large language models (LLMs)? How can world models benefit LLM agents? The gap between the prior knowledge of LLMs and the specified environment's dynamics usually bottlenecks LLMs' performance as world models. To bridge the gap, we propose a training-free "world alignment" that learns an environment's symbolic knowledge complementary to LLMs. The symbolic…
▽ More
Can we build accurate world models out of large language models (LLMs)? How can world models benefit LLM agents? The gap between the prior knowledge of LLMs and the specified environment's dynamics usually bottlenecks LLMs' performance as world models. To bridge the gap, we propose a training-free "world alignment" that learns an environment's symbolic knowledge complementary to LLMs. The symbolic knowledge covers action rules, knowledge graphs, and scene graphs, which are extracted by LLMs from exploration trajectories and encoded into executable codes to regulate LLM agents' policies. We further propose an RL-free, model-based agent "WALL-E 2.0" through the model-predictive control (MPC) framework. Unlike classical MPC requiring costly optimization on the fly, we adopt an LLM agent as an efficient look-ahead optimizer of future steps' actions by interacting with the neurosymbolic world model. While the LLM agent's strong heuristics make it an efficient planner in MPC, the quality of its planned actions is also secured by the accurate predictions of the aligned world model. They together considerably improve learning efficiency in a new environment. On open-world challenges in Mars (Minecraft like) and ALFWorld (embodied indoor environments), WALL-E 2.0 significantly outperforms existing methods, e.g., surpassing baselines in Mars by 16.1%-51.6% of success rate and by at least 61.7% in score. In ALFWorld, it achieves a new record 98% success rate after only 4 iterations.
△ Less
Submitted 22 April, 2025;
originally announced April 2025.
-
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models
Authors:
Guo Chen,
Zhiqi Li,
Shihao Wang,
Jindong Jiang,
Yicheng Liu,
Lidong Lu,
De-An Huang,
Wonmin Byeon,
Matthieu Le,
Tuomas Rintamaki,
Tyler Poon,
Max Ehrlich,
Tuomas Rintamaki,
Tyler Poon,
Tong Lu,
Limin Wang,
Bryan Catanzaro,
Jan Kautz,
Andrew Tao,
Zhiding Yu,
Guilin Liu
Abstract:
We introduce Eagle 2.5, a family of frontier vision-language models (VLMs) for long-context multimodal learning. Our work addresses the challenges in long video comprehension and high-resolution image understanding, introducing a generalist framework for both tasks. The proposed training framework incorporates Automatic Degrade Sampling and Image Area Preservation, two techniques that preserve con…
▽ More
We introduce Eagle 2.5, a family of frontier vision-language models (VLMs) for long-context multimodal learning. Our work addresses the challenges in long video comprehension and high-resolution image understanding, introducing a generalist framework for both tasks. The proposed training framework incorporates Automatic Degrade Sampling and Image Area Preservation, two techniques that preserve contextual integrity and visual details. The framework also includes numerous efficiency optimizations in the pipeline for long-context data training. Finally, we propose Eagle-Video-110K, a novel dataset that integrates both story-level and clip-level annotations, facilitating long-video understanding. Eagle 2.5 demonstrates substantial improvements on long-context multimodal benchmarks, providing a robust solution to the limitations of existing VLMs. Notably, our best model Eagle 2.5-8B achieves 72.4% on Video-MME with 512 input frames, matching the results of top-tier commercial model such as GPT-4o and large-scale open-source models like Qwen2.5-VL-72B and InternVL2.5-78B.
△ Less
Submitted 21 April, 2025;
originally announced April 2025.
-
Integrating Response Time and Attention Duration in Bayesian Preference Learning for Multiple Criteria Decision Aiding
Authors:
Jiaxuan Jiang,
Jiapeng Liu,
Miłosz Kadziński,
Xiuwu Liao,
Jingyu Dong
Abstract:
We introduce a multiple criteria Bayesian preference learning framework incorporating behavioral cues for decision aiding. The framework integrates pairwise comparisons, response time, and attention duration to deepen insights into decision-making processes. The approach employs an additive value function model and utilizes a Bayesian framework to derive the posterior distribution of potential ran…
▽ More
We introduce a multiple criteria Bayesian preference learning framework incorporating behavioral cues for decision aiding. The framework integrates pairwise comparisons, response time, and attention duration to deepen insights into decision-making processes. The approach employs an additive value function model and utilizes a Bayesian framework to derive the posterior distribution of potential ranking models by defining the likelihood of observed preference data and specifying a prior on the preference structure. This distribution highlights each model's ability to reconstruct Decision-Makers' holistic pairwise comparisons. By leveraging both response time as a proxy for cognitive effort and alternative discriminability as well as attention duration as an indicator of criterion importance, the proposed model surpasses traditional methods by uncovering richer behavioral patterns. We report the results of a laboratory experiment on mobile phone contract selection involving 30 real subjects using a dedicated application with time-, eye-, and mouse-tracking components. We validate the novel method's ability to reconstruct complete preferences. The detailed ablation studies reveal time- and attention-related behavioral patterns, confirming that integrating comprehensive data leads to developing models that better align with the DM's actual preferences.
△ Less
Submitted 21 April, 2025;
originally announced April 2025.
-
NTIRE 2025 Challenge on Real-World Face Restoration: Methods and Results
Authors:
Zheng Chen,
Jingkai Wang,
Kai Liu,
Jue Gong,
Lei Sun,
Zongwei Wu,
Radu Timofte,
Yulun Zhang,
Jianxing Zhang,
Jinlong Wu,
Jun Wang,
Zheng Xie,
Hakjae Jeon,
Suejin Han,
Hyung-Ju Chun,
Hyunhee Park,
Zhicun Yin,
Junjie Chen,
Ming Liu,
Xiaoming Li,
Chao Zhou,
Wangmeng Zuo,
Weixia Zhang,
Dingquan Li,
Kede Ma
, et al. (29 additional authors not shown)
Abstract:
This paper provides a review of the NTIRE 2025 challenge on real-world face restoration, highlighting the proposed solutions and the resulting outcomes. The challenge focuses on generating natural, realistic outputs while maintaining identity consistency. Its goal is to advance state-of-the-art solutions for perceptual quality and realism, without imposing constraints on computational resources or…
▽ More
This paper provides a review of the NTIRE 2025 challenge on real-world face restoration, highlighting the proposed solutions and the resulting outcomes. The challenge focuses on generating natural, realistic outputs while maintaining identity consistency. Its goal is to advance state-of-the-art solutions for perceptual quality and realism, without imposing constraints on computational resources or training data. The track of the challenge evaluates performance using a weighted image quality assessment (IQA) score and employs the AdaFace model as an identity checker. The competition attracted 141 registrants, with 13 teams submitting valid models, and ultimately, 10 teams achieved a valid score in the final ranking. This collaborative effort advances the performance of real-world face restoration while offering an in-depth overview of the latest trends in the field.
△ Less
Submitted 20 April, 2025;
originally announced April 2025.
-
Modality Selection and Skill Segmentation via Cross-Modality Attention
Authors:
Jiawei Jiang,
Kei Ota,
Devesh K. Jha,
Asako Kanezaki
Abstract:
Incorporating additional sensory modalities such as tactile and audio into foundational robotic models poses significant challenges due to the curse of dimensionality. This work addresses this issue through modality selection. We propose a cross-modality attention (CMA) mechanism to identify and selectively utilize the modalities that are most informative for action generation at each timestep. Fu…
▽ More
Incorporating additional sensory modalities such as tactile and audio into foundational robotic models poses significant challenges due to the curse of dimensionality. This work addresses this issue through modality selection. We propose a cross-modality attention (CMA) mechanism to identify and selectively utilize the modalities that are most informative for action generation at each timestep. Furthermore, we extend the application of CMA to segment primitive skills from expert demonstrations and leverage this segmentation to train a hierarchical policy capable of solving long-horizon, contact-rich manipulation tasks.
△ Less
Submitted 20 April, 2025;
originally announced April 2025.
-
Prejudge-Before-Think: Enhancing Large Language Models at Test-Time by Process Prejudge Reasoning
Authors:
Jianing Wang,
Jin Jiang,
Yang Liu,
Mengdi Zhang,
Xunliang Cai
Abstract:
In this paper, we introduce a new \emph{process prejudge} strategy in LLM reasoning to demonstrate that bootstrapping with process prejudge allows the LLM to adaptively anticipate the errors encountered when advancing the subsequent reasoning steps, similar to people sometimes pausing to think about what mistakes may occur and how to avoid them, rather than relying solely on trial and error. Speci…
▽ More
In this paper, we introduce a new \emph{process prejudge} strategy in LLM reasoning to demonstrate that bootstrapping with process prejudge allows the LLM to adaptively anticipate the errors encountered when advancing the subsequent reasoning steps, similar to people sometimes pausing to think about what mistakes may occur and how to avoid them, rather than relying solely on trial and error. Specifically, we define a prejudge node in the rationale, which represents a reasoning step, with at least one step that follows the prejudge node that has no paths toward the correct answer. To synthesize the prejudge reasoning process, we present an automated reasoning framework with a dynamic tree-searching strategy. This framework requires only one LLM to perform answer judging, response critiquing, prejudge generation, and thought completion. Furthermore, we develop a two-phase training mechanism with supervised fine-tuning (SFT) and reinforcement learning (RL) to further enhance the reasoning capabilities of LLMs. Experimental results from competition-level complex reasoning demonstrate that our method can teach the model to prejudge before thinking and significantly enhance the reasoning ability of LLMs. Code and data is released at https://github.com/wjn1996/Prejudge-Before-Think.
△ Less
Submitted 18 April, 2025;
originally announced April 2025.
-
Nondeterministic Polynomial-time Problem Challenge: An Ever-Scaling Reasoning Benchmark for LLMs
Authors:
Chang Yang,
Ruiyu Wang,
Junzhe Jiang,
Qi Jiang,
Qinggang Zhang,
Yanchen Deng,
Shuxin Li,
Shuyue Hu,
Bo Li,
Florian T. Pokorny,
Xiao Huang,
Xinrun Wang
Abstract:
Reasoning is the fundamental capability of large language models (LLMs). Due to the rapid progress of LLMs, there are two main issues of current benchmarks: i) these benchmarks can be crushed in a short time (less than 1 year), and ii) these benchmarks may be easily hacked. To handle these issues, we propose the ever-scalingness for building the benchmarks which are uncrushable, unhackable, auto-v…
▽ More
Reasoning is the fundamental capability of large language models (LLMs). Due to the rapid progress of LLMs, there are two main issues of current benchmarks: i) these benchmarks can be crushed in a short time (less than 1 year), and ii) these benchmarks may be easily hacked. To handle these issues, we propose the ever-scalingness for building the benchmarks which are uncrushable, unhackable, auto-verifiable and general. This paper presents Nondeterministic Polynomial-time Problem Challenge (NPPC), an ever-scaling reasoning benchmark for LLMs. Specifically, the NPPC has three main modules: i) npgym, which provides a unified interface of 25 well-known NP-complete problems and can generate any number of instances with any levels of complexities, ii) npsolver: which provides a unified interface to evaluate the problem instances with both online and offline models via APIs and local deployments, respectively, and iii) npeval: which provides the comprehensive and ready-to-use tools to analyze the performances of LLMs over different problems, the number of tokens, the aha moments, the reasoning errors and the solution errors. Extensive experiments over widely-used LLMs demonstrate: i) NPPC can successfully decrease the performances of advanced LLMs' performances to below 10%, demonstrating that NPPC is uncrushable, ii) DeepSeek-R1, Claude-3.7-Sonnet, and o1/o3-mini are the most powerful LLMs, where DeepSeek-R1 outperforms Claude-3.7-Sonnet and o1/o3-mini in most NP-complete problems considered, and iii) the numbers of tokens, aha moments in the advanced LLMs, e.g., Claude-3.7-Sonnet and DeepSeek-R1, are observed first to increase and then decrease when the problem instances become more and more difficult. We believe that NPPC is the first ever-scaling reasoning benchmark, serving as the uncrushable and unhackable testbed for LLMs toward artificial general intelligence (AGI).
△ Less
Submitted 15 April, 2025;
originally announced April 2025.
-
Beyond Degradation Redundancy: Contrastive Prompt Learning for All-in-One Image Restoration
Authors:
Gang Wu,
Junjun Jiang,
Kui Jiang,
Xianming Liu,
Liqiang Nie
Abstract:
All-in-one image restoration, addressing diverse degradation types with a unified model, presents significant challenges in designing task-specific prompts that effectively guide restoration across multiple degradation scenarios. While adaptive prompt learning enables end-to-end optimization, it often yields overlapping or redundant task representations. Conversely, explicit prompts derived from p…
▽ More
All-in-one image restoration, addressing diverse degradation types with a unified model, presents significant challenges in designing task-specific prompts that effectively guide restoration across multiple degradation scenarios. While adaptive prompt learning enables end-to-end optimization, it often yields overlapping or redundant task representations. Conversely, explicit prompts derived from pretrained classifiers enhance discriminability but may discard critical visual information for reconstruction. To address these limitations, we introduce Contrastive Prompt Learning (CPL), a novel framework that fundamentally enhances prompt-task alignment through two complementary innovations: a \emph{Sparse Prompt Module (SPM)} that efficiently captures degradation-specific features while minimizing redundancy, and a \emph{Contrastive Prompt Regularization (CPR)} that explicitly strengthens task boundaries by incorporating negative prompt samples across different degradation types. Unlike previous approaches that focus primarily on degradation classification, CPL optimizes the critical interaction between prompts and the restoration model itself. Extensive experiments across five comprehensive benchmarks demonstrate that CPL consistently enhances state-of-the-art all-in-one restoration models, achieving significant improvements in both standard multi-task scenarios and challenging composite degradation settings. Our framework establishes new state-of-the-art performance while maintaining parameter efficiency, offering a principled solution for unified image restoration.
△ Less
Submitted 14 April, 2025;
originally announced April 2025.
-
OVERLORD: Ultimate Scaling of DataLoader for Multi-Source Large Foundation Model Training
Authors:
Juntao Zhao,
Qi Lu,
Wei Jia,
Borui Wan,
Lei Zuo,
Junda Feng,
Jianyu Jiang,
Yangrui Chen,
Shuaishuai Cao,
Jialing He,
Kaihua Jiang,
Yuanzhe Hu,
Yanghua Peng,
Haibin Lin,
Xin Liu,
Chuan Wu
Abstract:
Modern frameworks for training large foundation models (LFMs) employ data loaders in a data parallel paradigm. While this design offers implementation simplicity, it introduces two fundamental challenges. First, due to the quadratic computational complexity of the attention operator, the non-uniform sample distribution over data-parallel ranks leads to a significant workload imbalance among loader…
▽ More
Modern frameworks for training large foundation models (LFMs) employ data loaders in a data parallel paradigm. While this design offers implementation simplicity, it introduces two fundamental challenges. First, due to the quadratic computational complexity of the attention operator, the non-uniform sample distribution over data-parallel ranks leads to a significant workload imbalance among loaders, which degrades the training efficiency. This paradigm also impedes the implementation of data mixing algorithms (e.g., curriculum learning) over different datasets. Second, to acquire a broad range of capability, LFMs training ingests data from diverse sources, each with distinct file access states. Colocating massive datasets within loader instances can easily exceed local pod memory capacity. Additionally, heavy sources with higher transformation latency require larger worker pools, further exacerbating memory consumption.
We present OVERLORD, an industrial-grade distributed data loading architecture with three innovations: (1) A centralized and declarative data plane, which facilitates elastic data orchestration strategy, such as long-short context, multimodal, and curriculum learning; (2) Disaggregated multisource preprocessing through role-specific actors, i.e., Source Loaders and Data Constructors, leveraging autoscaling for Source Loaders towards heterogeneous and evolving source preprocessing cost; (3) Shadow Loaders with differential checkpointing for uninterrupted fault recovery. Deployed on production clusters scaling to multi-thousand GPU, OVERLORD achieves: (1) 4.5x end-to-end training throughput improvement, (2) a minimum 3.6x reduction in CPU memory usage, with further improvements to be added in later experiments.
△ Less
Submitted 13 April, 2025;
originally announced April 2025.
-
DiffuMural: Restoring Dunhuang Murals with Multi-scale Diffusion
Authors:
Puyu Han,
Jiaju Kang,
Yuhang Pan,
Erting Pan,
Zeyu Zhang,
Qunchao Jin,
Juntao Jiang,
Zhichen Liu,
Luqi Gong
Abstract:
Large-scale pre-trained diffusion models have produced excellent results in the field of conditional image generation. However, restoration of ancient murals, as an important downstream task in this field, poses significant challenges to diffusion model-based restoration methods due to its large defective area and scarce training samples. Conditional restoration tasks are more concerned with wheth…
▽ More
Large-scale pre-trained diffusion models have produced excellent results in the field of conditional image generation. However, restoration of ancient murals, as an important downstream task in this field, poses significant challenges to diffusion model-based restoration methods due to its large defective area and scarce training samples. Conditional restoration tasks are more concerned with whether the restored part meets the aesthetic standards of mural restoration in terms of overall style and seam detail, and such metrics for evaluating heuristic image complements are lacking in current research. We therefore propose DiffuMural, a combined Multi-scale convergence and Collaborative Diffusion mechanism with ControlNet and cyclic consistency loss to optimise the matching between the generated images and the conditional control. DiffuMural demonstrates outstanding capabilities in mural restoration, leveraging training data from 23 large-scale Dunhuang murals that exhibit consistent visual aesthetics. The model excels in restoring intricate details, achieving a coherent overall appearance, and addressing the unique challenges posed by incomplete murals lacking factual grounding. Our evaluation framework incorporates four key metrics to quantitatively assess incomplete murals: factual accuracy, textural detail, contextual semantics, and holistic visual coherence. Furthermore, we integrate humanistic value assessments to ensure the restored murals retain their cultural and artistic significance. Extensive experiments validate that our method outperforms state-of-the-art (SOTA) approaches in both qualitative and quantitative metrics.
△ Less
Submitted 13 April, 2025;
originally announced April 2025.
-
Dupin: A Parallel Framework for Densest Subgraph Discovery in Fraud Detection on Massive Graphs (Technical Report)
Authors:
Jiaxin Jiang,
Siyuan Yao,
Yuchen Li,
Qiange Wang,
Bingsheng He,
Min Chen
Abstract:
Detecting fraudulent activities in financial and e-commerce transaction networks is crucial. One effective method for this is Densest Subgraph Discovery (DSD). However, deploying DSD methods in production systems faces substantial scalability challenges due to the predominantly sequential nature of existing methods, which impedes their ability to handle large-scale transaction networks and results…
▽ More
Detecting fraudulent activities in financial and e-commerce transaction networks is crucial. One effective method for this is Densest Subgraph Discovery (DSD). However, deploying DSD methods in production systems faces substantial scalability challenges due to the predominantly sequential nature of existing methods, which impedes their ability to handle large-scale transaction networks and results in significant detection delays. To address these challenges, we introduce Dupin, a novel parallel processing framework designed for efficient DSD processing in billion-scale graphs. Dupin is powered by a processing engine that exploits the unique properties of the peeling process, with theoretical guarantees on detection quality and efficiency. Dupin provides userfriendly APIs for flexible customization of DSD objectives and ensures robust adaptability to diverse fraud detection scenarios. Empirical evaluations demonstrate that Dupin consistently outperforms several existing DSD methods, achieving performance improvements of up to 100 times compared to traditional approaches. On billion-scale graphs, Dupin demonstrates the potential to enhance the prevention of fraudulent transactions from 45% to 94.5% and reduces density error from 30% to below 5%, as supported by our experimental results. These findings highlight the effectiveness of Dupin in real-world applications, ensuring both speed and accuracy in fraud detection.
△ Less
Submitted 12 April, 2025;
originally announced April 2025.
-
FedMerge: Federated Personalization via Model Merging
Authors:
Shutong Chen,
Tianyi Zhou,
Guodong Long,
Jing Jiang,
Chengqi Zhang
Abstract:
One global model in federated learning (FL) might not be sufficient to serve many clients with non-IID tasks and distributions. While there has been advances in FL to train multiple global models for better personalization, they only provide limited choices to clients so local finetuning is still indispensable. In this paper, we propose a novel ``FedMerge'' approach that can create a personalized…
▽ More
One global model in federated learning (FL) might not be sufficient to serve many clients with non-IID tasks and distributions. While there has been advances in FL to train multiple global models for better personalization, they only provide limited choices to clients so local finetuning is still indispensable. In this paper, we propose a novel ``FedMerge'' approach that can create a personalized model per client by simply merging multiple global models with automatically optimized and customized weights. In FedMerge, a few global models can serve many non-IID clients, even without further local finetuning. We formulate this problem as a joint optimization of global models and the merging weights for each client. Unlike existing FL approaches where the server broadcasts one or multiple global models to all clients, the server only needs to send a customized, merged model to each client. Moreover, instead of periodically interrupting the local training and re-initializing it to a global model, the merged model aligns better with each client's task and data distribution, smoothening the local-global gap between consecutive rounds caused by client drift. We evaluate FedMerge on three different non-IID settings applied to different domains with diverse tasks and data types, in which FedMerge consistently outperforms existing FL approaches, including clustering-based and mixture-of-experts (MoE) based methods.
△ Less
Submitted 24 April, 2025; v1 submitted 9 April, 2025;
originally announced April 2025.
-
Balancing Task-invariant Interaction and Task-specific Adaptation for Unified Image Fusion
Authors:
Xingyu Hu,
Junjun Jiang,
Chenyang Wang,
Kui Jiang,
Xianming Liu,
Jiayi Ma
Abstract:
Unified image fusion aims to integrate complementary information from multi-source images, enhancing image quality through a unified framework applicable to diverse fusion tasks. While treating all fusion tasks as a unified problem facilitates task-invariant knowledge sharing, it often overlooks task-specific characteristics, thereby limiting the overall performance. Existing general image fusion…
▽ More
Unified image fusion aims to integrate complementary information from multi-source images, enhancing image quality through a unified framework applicable to diverse fusion tasks. While treating all fusion tasks as a unified problem facilitates task-invariant knowledge sharing, it often overlooks task-specific characteristics, thereby limiting the overall performance. Existing general image fusion methods incorporate explicit task identification to enable adaptation to different fusion tasks. However, this dependence during inference restricts the model's generalization to unseen fusion tasks. To address these issues, we propose a novel unified image fusion framework named "TITA", which dynamically balances both Task-invariant Interaction and Task-specific Adaptation. For task-invariant interaction, we introduce the Interaction-enhanced Pixel Attention (IPA) module to enhance pixel-wise interactions for better multi-source complementary information extraction. For task-specific adaptation, the Operation-based Adaptive Fusion (OAF) module dynamically adjusts operation weights based on task properties. Additionally, we incorporate the Fast Adaptive Multitask Optimization (FAMO) strategy to mitigate the impact of gradient conflicts across tasks during joint training. Extensive experiments demonstrate that TITA not only achieves competitive performance compared to specialized methods across three image fusion scenarios but also exhibits strong generalization to unseen fusion tasks.
△ Less
Submitted 7 April, 2025;
originally announced April 2025.
-
Content-Aware Transformer for All-in-one Image Restoration
Authors:
Gang Wu,
Junjun Jiang,
Kui Jiang,
Xianming Liu
Abstract:
Image restoration has witnessed significant advancements with the development of deep learning models. Although Transformer architectures have progressed considerably in recent years, challenges remain, particularly the limited receptive field in window-based self-attention. In this work, we propose DSwinIR, a Deformable Sliding window Transformer for Image Restoration. DSwinIR introduces a novel…
▽ More
Image restoration has witnessed significant advancements with the development of deep learning models. Although Transformer architectures have progressed considerably in recent years, challenges remain, particularly the limited receptive field in window-based self-attention. In this work, we propose DSwinIR, a Deformable Sliding window Transformer for Image Restoration. DSwinIR introduces a novel deformable sliding window self-attention that adaptively adjusts receptive fields based on image content, enabling the attention mechanism to focus on important regions and enhance feature extraction aligned with salient features. Additionally, we introduce a central ensemble pattern to reduce the inclusion of irrelevant content within attention windows. In this way, the proposed DSwinIR model integrates the deformable sliding window Transformer and central ensemble pattern to amplify the strengths of both CNNs and Transformers while mitigating their limitations. Extensive experiments on various image restoration tasks demonstrate that DSwinIR achieves state-of-the-art performance. For example, in image deraining, compared to DRSformer on the SPA dataset, DSwinIR achieves a 0.66 dB PSNR improvement. In all-in-one image restoration, compared to PromptIR, DSwinIR achieves over a 0.66 dB and 1.04 dB improvement on three-task and five-task settings, respectively. Pretrained models and code are available at our project https://github.com/Aitical/DSwinIR.
△ Less
Submitted 7 April, 2025;
originally announced April 2025.
-
SAM2MOT: A Novel Paradigm of Multi-Object Tracking by Segmentation
Authors:
Junjie Jiang,
Zelin Wang,
Manqi Zhao,
Yin Li,
DongSheng Jiang
Abstract:
Segment Anything 2 (SAM2) enables robust single-object tracking using segmentation. To extend this to multi-object tracking (MOT), we propose SAM2MOT, introducing a novel Tracking by Segmentation paradigm. Unlike Tracking by Detection or Tracking by Query, SAM2MOT directly generates tracking boxes from segmentation masks, reducing reliance on detection accuracy. SAM2MOT has two key advantages: zer…
▽ More
Segment Anything 2 (SAM2) enables robust single-object tracking using segmentation. To extend this to multi-object tracking (MOT), we propose SAM2MOT, introducing a novel Tracking by Segmentation paradigm. Unlike Tracking by Detection or Tracking by Query, SAM2MOT directly generates tracking boxes from segmentation masks, reducing reliance on detection accuracy. SAM2MOT has two key advantages: zero-shot generalization, allowing it to work across datasets without fine-tuning, and strong object association, inherited from SAM2. To further improve performance, we integrate a trajectory manager system for precise object addition and removal, and a cross-object interaction module to handle occlusions. Experiments on DanceTrack, UAVDT, and BDD100K show state-of-the-art results. Notably, SAM2MOT outperforms existing methods on DanceTrack by +2.1 HOTA and +4.5 IDF1, highlighting its effectiveness in MOT. Code is available at https://github.com/TripleJoy/SAM2MOT.
△ Less
Submitted 5 May, 2025; v1 submitted 6 April, 2025;
originally announced April 2025.
-
DRAMA: A Dynamic Packet Routing Algorithm using Multi-Agent Reinforcement Learning with Emergent Communication
Authors:
Wang Zhang,
Chenguang Liu,
Yue Pi,
Yong Zhang,
Hairong Huang,
Baoquan Rao,
Yulong Ding,
Shuanghua Yang,
Jie Jiang
Abstract:
The continuous expansion of network data presents a pressing challenge for conventional routing algorithms. As the demand escalates, these algorithms are struggling to cope. In this context, reinforcement learning (RL) and multi-agent reinforcement learning (MARL) algorithms emerge as promising solutions. However, the urgency and importance of the problem are clear, as existing RL/MARL-based routi…
▽ More
The continuous expansion of network data presents a pressing challenge for conventional routing algorithms. As the demand escalates, these algorithms are struggling to cope. In this context, reinforcement learning (RL) and multi-agent reinforcement learning (MARL) algorithms emerge as promising solutions. However, the urgency and importance of the problem are clear, as existing RL/MARL-based routing approaches lack effective communication in run time among routers, making it challenging for individual routers to adapt to complex and dynamic changing networks. More importantly, they lack the ability to deal with dynamically changing network topology, especially the addition of the router, due to the non-scalability of their neural networks. This paper proposes a novel dynamic routing algorithm, DRAMA, incorporating emergent communication in multi-agent reinforcement learning. Through emergent communication, routers could learn how to communicate effectively to maximize the optimization objectives. Meanwhile, a new Q-network and graph-based emergent communication are introduced to dynamically adapt to the changing network topology without retraining while ensuring robust performance. Experimental results showcase DRAMA's superior performance over the traditional routing algorithm and other RL/MARL-based algorithms, achieving a higher delivery rate and lower latency in diverse network scenarios, including dynamic network load and topology. Moreover, an ablation experiment validates the prospect of emergent communication in facilitating packet routing.
△ Less
Submitted 6 April, 2025;
originally announced April 2025.
-
NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving
Authors:
Kexin Tian,
Jingrui Mao,
Yunlong Zhang,
Jiwan Jiang,
Yang Zhou,
Zhengzhong Tu
Abstract:
Recent advancements in Vision-Language Models (VLMs) have demonstrated strong potential for autonomous driving tasks. However, their spatial understanding and reasoning-key capabilities for autonomous driving-still exhibit significant limitations. Notably, none of the existing benchmarks systematically evaluate VLMs' spatial reasoning capabilities in driving scenarios. To fill this gap, we propose…
▽ More
Recent advancements in Vision-Language Models (VLMs) have demonstrated strong potential for autonomous driving tasks. However, their spatial understanding and reasoning-key capabilities for autonomous driving-still exhibit significant limitations. Notably, none of the existing benchmarks systematically evaluate VLMs' spatial reasoning capabilities in driving scenarios. To fill this gap, we propose NuScenes-SpatialQA, the first large-scale ground-truth-based Question-Answer (QA) benchmark specifically designed to evaluate the spatial understanding and reasoning capabilities of VLMs in autonomous driving. Built upon the NuScenes dataset, the benchmark is constructed through an automated 3D scene graph generation pipeline and a QA generation pipeline. The benchmark systematically evaluates VLMs' performance in both spatial understanding and reasoning across multiple dimensions. Using this benchmark, we conduct extensive experiments on diverse VLMs, including both general and spatial-enhanced models, providing the first comprehensive evaluation of their spatial capabilities in autonomous driving. Surprisingly, the experimental results show that the spatial-enhanced VLM outperforms in qualitative QA but does not demonstrate competitiveness in quantitative QA. In general, VLMs still face considerable challenges in spatial understanding and reasoning.
△ Less
Submitted 6 April, 2025; v1 submitted 4 April, 2025;
originally announced April 2025.
-
HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse
Authors:
Yuwei An,
Yihua Cheng,
Seo Jin Park,
Junchen Jiang
Abstract:
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing the performance of large language models (LLMs) by integrating external knowledge into the generation process. A key component of RAG pipelines is the reranker, which selects the most relevant documents from a pool of retrieved candidates and significantly improves the quality of the generated responses. While re…
▽ More
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing the performance of large language models (LLMs) by integrating external knowledge into the generation process. A key component of RAG pipelines is the reranker, which selects the most relevant documents from a pool of retrieved candidates and significantly improves the quality of the generated responses. While rerankers refine the selection of retrieved documents in RAG pipelines, they introduce computational challenges that hinder high throughput and low latency. To address this problem, we propose HyperRAG, a system that optimizes the trade-off between quality and efficiency in RAG pipelines by leveraging KV-cache reuse for efficient reranker inference. By reusing document-side KV-cache, HyperRAG achieves both high-quality generation and system-level efficiency. To fully realize the benefits of KV-cache reuse, HyperRAG incorporates a range of system-level optimizations designed to enhance efficiency and scalability. Experiments show that HyperRAG achieves a 2 - 3 throughput improvement with decoder-only rerankers while also delivering higher downstream performance compared with traditional RAG service.
△ Less
Submitted 3 April, 2025;
originally announced April 2025.